OCR
Overview
OCR (Optical Character Recognition) is the process of converting images of typed, handwritten, or printed text into machine-encoded text.
OCR is commonly used for text recognition and extraction from scanned PDF files, photographs of documents, scene photos, invoices, receipts, and other image-based documents.
The following features support OCR:
- PDF to Word
- PDF to Excel
- PDF to PowerPoint
- PDF to HTML
- PDF to RTF
- PDF to TXT
- PDF to Searchable PDF
- PDF to OFD
- Extract PDF to JSON
- Extract PDF to Markdown
Set OCR Language
Use the languages option to specify which languages OCR should recognize. The value is an array of numeric OCR language constants.
js
const OCRLanguage = {
  CHINESE: 1,
  ENGLISH: 3,
  AUTO: 16
};
const options = {
  enableOcr: true,
  languages: [OCRLanguage.ENGLISH, OCRLanguage.CHINESE]
};
OCR Options
Use ocrOption to control OCR processing scope.
js
const OCROption = {
  INVALID_CHARACTER: 0,
  SCAN_PAGE: 1,
  INVALID_CHARACTER_AND_SCAN_PAGE: 2,
  ALL: 3
};
options.ocrOption = OCROption.ALL;
Preserve Page Background
When OCR is enabled, use containPageBackgroundImage to control whether page background images are preserved.
js
options.containPageBackgroundImage = true;
Sample
js
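// Assumes sdk, inputFilePath, and outputFilePath are initialized elsewhere.
sdk.setDocumentAIModel("/path/to/documentai.model", -1);
const options = {
  enableOcr: true,
  languages: [3], // OCRLanguage.ENGLISH
  ocrOption: 3 // OCROption.ALL
};
const result = sdk.startPDFToWord(inputFilePath, "", outputFilePath, options);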
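The options described in the sections above can also be assembled from constant names rather than raw numbers, which keeps typos from silently turning into wrong values. The following is a minimal sketch, not part of the SDK: buildOcrOptions and its parameter names are hypothetical, and the constant tables contain only the values listed in this document.

```javascript
// Constants copied from the sections above; only the values shown in this
// document are included.
const OCRLanguage = { CHINESE: 1, ENGLISH: 3, AUTO: 16 };
const OCROption = {
  INVALID_CHARACTER: 0,
  SCAN_PAGE: 1,
  INVALID_CHARACTER_AND_SCAN_PAGE: 2,
  ALL: 3
};

// Look up a constant by name and fail loudly if the name is not defined.
function lookup(table, name, label) {
  if (!Object.prototype.hasOwnProperty.call(table, name)) {
    throw new RangeError(`Unknown ${label}: ${name}`);
  }
  return table[name];
}

// Hypothetical helper (an assumption, not an SDK function): builds an
// options object with the four OCR-related fields described above.
function buildOcrOptions(languageNames, scopeName = "ALL", keepBackground = false) {
  return {
    enableOcr: true,
    languages: languageNames.map((name) => lookup(OCRLanguage, name, "OCR language")),
    ocrOption: lookup(OCROption, scopeName, "OCR option"),
    containPageBackgroundImage: keepBackground
  };
}

// Example: English and Chinese, scanned pages only, keep page backgrounds.
const ocrOptions = buildOcrOptions(["ENGLISH", "CHINESE"], "SCAN_PAGE", true);
console.log(ocrOptions);
```

The resulting object has the same shape as the options object in the sample and could be passed to a conversion call in its place. An unknown name throws a RangeError before any conversion starts.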