ComPDF

OCR

Overview

OCR (Optical Character Recognition) is the process of converting images of typed, handwritten, or printed text into machine-encoded text.

OCR is commonly used for text recognition and extraction from scanned PDF files, photographs of documents, scene photos, invoices, receipts, and other image-based documents.

The following features support OCR:

  • PDF to Word
  • PDF to Excel
  • PDF to PowerPoint
  • PDF to HTML
  • PDF to RTF
  • PDF to TXT
  • PDF to Searchable PDF
  • PDF to OFD
  • Extract PDF to JSON
  • Extract PDF to Markdown

Set OCR Language

Use the languages option to specify which languages OCR should recognize. The value is an array of numeric OCR language constants.

js
const OCRLanguage = {
  CHINESE: 1,
  ENGLISH: 3,
  AUTO: 16
};

const options = {
  enableOcr: true,
  languages: [OCRLanguage.ENGLISH, OCRLanguage.CHINESE]
};
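Because languages takes raw numbers, a small lookup layer can catch typos before a conversion starts. The following is a minimal sketch, not part of the SDK: toOcrLanguages is a hypothetical helper, and it assumes only the three constants listed above.

```javascript
// Numeric constants copied from the snippet above; only these three values
// appear in this document.
const OCRLanguage = Object.freeze({ CHINESE: 1, ENGLISH: 3, AUTO: 16 });

// Hypothetical helper (not an SDK function): maps names such as "english"
// to their numeric constants and rejects anything unknown.
function toOcrLanguages(names) {
  if (!Array.isArray(names) || names.length === 0) {
    throw new TypeError("names must be a non-empty array");
  }
  const codes = names.map((name) => {
    const key = String(name).toUpperCase();
    if (!Object.prototype.hasOwnProperty.call(OCRLanguage, key)) {
      throw new RangeError(`Unknown OCR language: ${name}`);
    }
    return OCRLanguage[key];
  });
  // Drop duplicates while keeping the caller's order.
  return [...new Set(codes)];
}

const ocrOptions = {
  enableOcr: true,
  languages: toOcrLanguages(["english", "chinese"])
};
console.log(ocrOptions.languages);
```

Throwing on an unknown name is a design choice for this sketch; silently skipping it would hide configuration mistakes.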

OCR Options

Use ocrOption to control the scope of OCR processing. Set it to one of the numeric OCROption constants.

js
const OCROption = {
  INVALID_CHARACTER: 0,
  SCAN_PAGE: 1,
  INVALID_CHARACTER_AND_SCAN_PAGE: 2,
  ALL: 3
};

options.ocrOption = OCROption.ALL;
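Since ocrOption is a plain number, it can help to check a value against the known constants and to turn it back into a readable name for logs. This is a sketch under the assumption that the four constants above are the complete set; ocrOptionName is a hypothetical helper, not an SDK function.

```javascript
// Constants copied from the snippet above.
const OCROption = Object.freeze({
  INVALID_CHARACTER: 0,
  SCAN_PAGE: 1,
  INVALID_CHARACTER_AND_SCAN_PAGE: 2,
  ALL: 3
});

// Hypothetical helper: returns the constant name for a numeric value and
// throws if the value is not one of the four listed constants.
function ocrOptionName(value) {
  const entry = Object.entries(OCROption).find(([, v]) => v === value);
  if (!entry) {
    throw new RangeError(`Unsupported ocrOption: ${value}`);
  }
  return entry[0];
}

const chosen = OCROption.ALL;
console.log(`ocrOption=${chosen} (${ocrOptionName(chosen)})`);
```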

Preserve Page Background

When OCR is enabled, use containPageBackgroundImage to control whether page background images are preserved.

js
options.containPageBackgroundImage = true;

Sample

js
sdk.setDocumentAIModel("/path/to/documentai.model", -1);

const options = {
  enableOcr: true,
  languages: [3], // OCRLanguage.ENGLISH
  ocrOption: 3    // OCROption.ALL
};

const result = sdk.startPDFToWord(inputFilePath, "", outputFilePath, options);
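
The sample can be wrapped so that options are validated before the SDK is called. The sketch below is illustrative only: buildOcrOptions and convertPdfToWordWithOcr are hypothetical names, stubSdk is a stand-in object rather than the real SDK, and the empty string in second position is passed through exactly as in the sample without assuming what it means.

```javascript
// Hypothetical helper, not part of the SDK: merges caller overrides onto the
// defaults used in the sample above and checks the basic shape.
function buildOcrOptions(overrides = {}) {
  const options = {
    enableOcr: true,
    languages: [3], // OCRLanguage.ENGLISH
    ocrOption: 3,   // OCROption.ALL
    ...overrides
  };
  const languagesOk =
    Array.isArray(options.languages) &&
    options.languages.length > 0 &&
    options.languages.every(Number.isInteger);
  if (!languagesOk) {
    throw new TypeError("languages must be a non-empty array of integers");
  }
  if (![0, 1, 2, 3].includes(options.ocrOption)) {
    throw new RangeError(`Unsupported ocrOption: ${options.ocrOption}`);
  }
  return options;
}

// Passes the same four arguments as the sample, including the empty string
// in second position, and returns whatever the SDK returns.
function convertPdfToWordWithOcr(sdk, inputFilePath, outputFilePath, overrides) {
  const options = buildOcrOptions(overrides);
  return sdk.startPDFToWord(inputFilePath, "", outputFilePath, options);
}

// Stand-in object so the sketch runs without the real SDK.
const stubSdk = {
  calls: [],
  startPDFToWord(...args) {
    this.calls.push(args);
    return "stub-result";
  }
};

convertPdfToWordWithOcr(stubSdk, "in.pdf", "out.docx", {
  containPageBackgroundImage: true
});
console.log(stubSdk.calls[0][3]);
```

With the real SDK, the sdk object from the sample would take the place of stubSdk. The shape of the returned result is not described here, so the wrapper returns it untouched.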