
Overview

OCR (Optical Character Recognition) converts images of typed, handwritten, or printed text into machine-encoded text.

OCR is commonly used for text recognition and extraction from the following types of documents:

Non-editable scanned PDF files.
Photographs of documents.
Scene photos such as advertising layouts and signboards.
Identification cards, passports, vehicle license plates, invoices, bills, and receipts.

The following features support OCR:

PDF to Word
PDF to Excel
PDF to PowerPoint
PDF to HTML
PDF to RTF
PDF to TXT
PDF to CSV
PDF to Searchable PDF
PDF to OFD
Extract PDF to JSON
Extract PDF to Markdown

OCR Language

For each conversion task, pass the OCR languages as an array through CConvertOption.languages and set CConvertOption.language_count to the number of entries in that array.

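// Languages to recognize; this example lists only English.
COCRLanguage languages[] = {e_CENGLISH};

// Start from the default options, then turn OCR on for this task.
CConvertOption option = CPDF_DefaultConvertOption();
option.enable_ocr = true;
option.languages = languages;
option.language_count = 1;  // One entry in the languages array above.

// Input PDF, its password, output path, and the options; the last argument is left NULL here.
CPDF_StartPDFToWord(CPDF_TEXT("word.pdf"), CPDF_TEXT("password"), CPDF_TEXT("path/output.docx"), option, NULL);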
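
When more than one language is listed, the count has to track the array length. The sketch below derives the count with sizeof so the two values cannot drift apart. It is self-contained and does not include the SDK headers: DemoOcrLanguage, DemoConvertOption, and demo_make_option are hypothetical stand-ins that only mirror the languages / language_count pair shown above, and it assumes that several languages may be passed in one task, which the array-plus-count shape suggests but this page does not state outright.

```c
#include <stddef.h>

/* Hypothetical stand-in for the SDK's COCRLanguage enum; values are illustrative only. */
typedef enum {
    DEMO_CHINESE,
    DEMO_ENGLISH,
    DEMO_JAPANESE
} DemoOcrLanguage;

/* Hypothetical stand-in for the two language fields of CConvertOption.
   The real field types may differ; size_t is an assumption. */
typedef struct {
    const DemoOcrLanguage *languages;
    size_t language_count;
} DemoConvertOption;

/* Number of elements in a true array (must not be used on a pointer). */
#define DEMO_ARRAY_COUNT(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Languages to recognize in one conversion task. */
static const DemoOcrLanguage demo_languages[] = { DEMO_ENGLISH, DEMO_JAPANESE };

/* Fill both fields together so the count always matches the array. */
static DemoConvertOption demo_make_option(void)
{
    DemoConvertOption option;
    option.languages = demo_languages;
    option.language_count = DEMO_ARRAY_COUNT(demo_languages);
    return option;
}
```

With the real SDK types, the same idea would reduce to assigning the sizeof-based count to option.language_count instead of a hard-coded number.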

Supported C enum values include:

Enum	Description
`e_CCHINESE`	Chinese Simplified
`e_CCHINESE_TRA`	Chinese Traditional
`e_CENGLISH`	English
`e_CKOREAN`	Korean
`e_CJAPANESE`	Japanese
`e_CLATIN`	Latin script languages
`e_CDEVANAGARI`	Devanagari
`e_CCYRILLIC`	Cyrillic
`e_CARABIC`	Arabic
`e_CTAMIL`	Tamil
`e_CTELUGU`	Telugu
`e_CKANNADA`	Kannada
`e_CTHAI`	Thai
`e_CGREEK`	Greek
`e_CEslav`	Eslav
`e_CAUTO`	Automatically select language

OCR Options

Choose the OCR option according to which parts of the PDF document need recognition:

e_CInvalidCharacter: Runs OCR only on invalid or garbled characters in the PDF document. Normal characters are not processed by OCR.
e_CScanPage: Runs OCR only on scanned pages in the PDF document. Editable pages are not processed by OCR.
e_CInvalidCharacterAndScanPage: Runs OCR on both invalid characters and scanned pages in the PDF document.
e_CAll: Runs OCR on all pages and characters in the PDF document.
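
The four options can be read as a small decision rule over two facts: whether the page is a scan and whether the character is invalid. The following sketch encodes the descriptions above as a predicate. It models the documented behavior only, not the SDK's internal logic, and DemoOcrOption and demo_should_ocr are hypothetical names that do not exist in the SDK.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four OCR options described above. */
typedef enum {
    DEMO_INVALID_CHARACTER,
    DEMO_SCAN_PAGE,
    DEMO_INVALID_CHARACTER_AND_SCAN_PAGE,
    DEMO_ALL
} DemoOcrOption;

/* Returns true when a character would go through OCR under the given option,
   according to the option descriptions in this section. */
static bool demo_should_ocr(DemoOcrOption option, bool page_is_scanned, bool char_is_invalid)
{
    switch (option) {
    case DEMO_INVALID_CHARACTER:
        return char_is_invalid;                     /* only garbled characters */
    case DEMO_SCAN_PAGE:
        return page_is_scanned;                     /* only scanned pages */
    case DEMO_INVALID_CHARACTER_AND_SCAN_PAGE:
        return char_is_invalid || page_is_scanned;  /* either condition */
    case DEMO_ALL:
        return true;                                /* everything */
    }
    return false; /* Unknown option value: treat as no OCR. */
}
```

For instance, a normal character on a scanned page is skipped under the invalid-character option but recognized under the scan-page option.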

Preserve Page Background

When OCR is enabled, you can enable contain_page_background_image to preserve the original page background image of the PDF in the output. If it is disabled, the images detected during page layout analysis are retained instead.

Notice

The quality of the OCR result depends on the quality of the input image. A good rule of thumb is that the more pixels in the character shapes, the better. The ideal image is a grayscale image with a resolution around 300 DPI.
When performing OCR, make sure the OCR language setting matches the language in the PDF document to achieve the best OCR conversion quality.
OCR is currently not supported on Windows versions earlier than Windows 10.
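
To check a scan against the 300 DPI guideline above, the effective resolution can be estimated from the pixel dimensions and the physical page size: DPI is pixels along one side divided by the length of that side in inches. The helpers below are a minimal sketch of that arithmetic. The names demo_effective_dpi and demo_is_below_target are hypothetical and not part of the SDK.

```c
#include <stdbool.h>

/* Effective resolution of a scan along one side:
   pixel count divided by the physical length of that side in inches.
   Returns 0.0 for non-positive input. */
static double demo_effective_dpi(int pixels, double inches)
{
    if (pixels <= 0 || inches <= 0.0) {
        return 0.0;
    }
    return (double)pixels / inches;
}

/* True when the resolution falls short of the target (for example 300.0). */
static bool demo_is_below_target(double dpi, double target_dpi)
{
    return dpi < target_dpi;
}
```

For example, a US Letter page (8.5 inches wide) scanned at 2550 pixels across works out to 300 DPI, while the same page at 1275 pixels across is 150 DPI and would be flagged as below the target.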

Converting Images to Other Document Formats

The OCR function also supports converting input images into Word, Excel, PowerPoint, HTML, CSV, RTF, TXT, JSON, and other formats.

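// Set the document AI model before starting the conversion; the path is a placeholder.
CPDF_SetDocumentAIModel(CPDF_TEXT("path/documentai.model"), -1);

COCRLanguage languages[] = {e_CENGLISH};

CConvertOption option = CPDF_DefaultConvertOption();
option.enable_ocr = true;
option.languages = languages;
option.language_count = 1;

// The input is an image rather than a PDF, and the password argument is an empty string.
CPDF_StartPDFToWord(CPDF_TEXT("input.png"), CPDF_TEXT(""), CPDF_TEXT("path/output.docx"), option, NULL);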
