ComPDF

Overview

OCR (Optical Character Recognition) converts images of typed, handwritten, or printed text into machine-encoded text.

OCR is commonly used for text recognition and extraction from the following types of documents:

  • Non-editable scanned PDF files.
  • Photographs of documents.
  • Scene photos such as advertising layouts and signboards.
  • Identification cards, passports, vehicle license plates, invoices, bills, and receipts.

The following features support OCR:

  • PDF to Word
  • PDF to Excel
  • PDF to PowerPoint
  • PDF to HTML
  • PDF to RTF
  • PDF to TXT
  • PDF to CSV
  • PDF to Searchable PDF
  • PDF to OFD
  • Extract PDF to JSON
  • Extract PDF to Markdown

OCR Language

For each conversion task, turn OCR on with CConvertOption.enable_ocr, pass the OCR languages through CConvertOption.languages, and set CConvertOption.language_count to the number of entries in that array.

c
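/* Languages to recognize; language_count below is the number of entries. */
COCRLanguage languages[] = {e_CENGLISH};

/* Start from the default options, then enable OCR. */
CConvertOption option = CPDF_DefaultConvertOption();
option.enable_ocr = true;
option.languages = languages;
option.language_count = 1;

/* Convert the PDF to Word using the options above. */
CPDF_StartPDFToWord(CPDF_TEXT("word.pdf"), CPDF_TEXT("password"), CPDF_TEXT("path/output.docx"), option, NULL);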
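
When a document mixes languages, the count can be derived from the array rather than hard-coded, which keeps the two fields in step. The sketch below is only an illustration and assumes the SDK accepts more than one entry, which the array-plus-count shape suggests but this section does not state. The type definitions are local stand-ins so the sketch compiles without the SDK header (the real field types and enum values may differ), and `set_ocr_languages` and `ARRAY_COUNT` are hypothetical helpers, not SDK functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the SDK types, for illustration only.
   In real code, include the SDK header instead; the actual
   field types and enum values are assumptions here. */
typedef enum { e_CCHINESE, e_CENGLISH } COCRLanguage;

typedef struct {
    bool enable_ocr;
    const COCRLanguage *languages;
    int language_count;
} CConvertOption;

/* Number of elements in a true array (not a pointer). */
#define ARRAY_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical helper: enables OCR and stores the language list
   together with its length so the two cannot drift apart. */
void set_ocr_languages(CConvertOption *option,
                       const COCRLanguage *languages,
                       size_t count)
{
    option->enable_ocr = true;
    option->languages = languages;
    option->language_count = (int)count;
}
```

With an array such as `COCRLanguage langs[] = {e_CENGLISH, e_CCHINESE};`, calling `set_ocr_languages(&option, langs, ARRAY_COUNT(langs))` sets `language_count` to 2 without a literal that could go stale when the list changes.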

Supported C enum values include:

Enum              Description
e_CCHINESE        Chinese Simplified
e_CCHINESE_TRA    Chinese Traditional
e_CENGLISH        English
e_CKOREAN         Korean
e_CJAPANESE       Japanese
e_CLATIN          Latin script languages
e_CDEVANAGARI     Devanagari
e_CCYRILLIC       Cyrillic
e_CARABIC         Arabic
e_CTAMIL          Tamil
e_CTELUGU         Telugu
e_CKANNADA        Kannada
e_CTHAI           Thai
e_CGREEK          Greek
e_CEslav          Eslav
e_CAUTO           Automatically select language

OCR Options

Choose the OCR option that matches the content that needs recognition:

  • e_CInvalidCharacter: Runs OCR only on invalid or garbled characters in the PDF document; normal characters are left unprocessed.
  • e_CScanPage: Runs OCR only on scanned pages in the PDF document; editable pages are left unprocessed.
  • e_CInvalidCharacterAndScanPage: Runs OCR on both invalid characters and scanned pages in the PDF document.
  • e_CAll: Runs OCR on all pages and characters in the PDF document.
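
As a rough illustration of how the four options relate, the sketch below picks one from two facts about a document: whether it contains scanned pages and whether it contains garbled characters. The enum is a local stand-in that reuses the option names from the list above. The real type name, the numeric values, and the CConvertOption field that receives the value are not shown in this section and should be taken from the SDK header. The helper `choose_ocr_option` is hypothetical, and falling back to e_CAll when neither condition is known to apply is an assumption made here, not documented behavior.

```c
#include <stdbool.h>

/* Local stand-in for the SDK enum. The names follow the list above;
   the type name and numeric values are assumptions. */
typedef enum {
    e_CInvalidCharacter,
    e_CScanPage,
    e_CInvalidCharacterAndScanPage,
    e_CAll
} OcrOptionSketch;

/* Hypothetical helper: map what is known about the document
   to the narrowest option that still covers it. */
OcrOptionSketch choose_ocr_option(bool has_scanned_pages,
                                  bool has_garbled_text)
{
    if (has_scanned_pages && has_garbled_text) {
        return e_CInvalidCharacterAndScanPage;
    }
    if (has_scanned_pages) {
        return e_CScanPage;
    }
    if (has_garbled_text) {
        return e_CInvalidCharacter;
    }
    /* Nothing specific is known: recognize everything (assumed fallback). */
    return e_CAll;
}
```

Choosing the narrowest option keeps text that is already editable out of the OCR pass, so only the content that actually needs recognition is re-recognized.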

Preserve Page Background

When OCR is enabled, you can set contain_page_background_image to preserve the original page background image of the PDF. If it is disabled, only the images detected during page layout analysis are retained.

Notice

  • The quality of the OCR result depends on the quality of the input image. A good rule of thumb is that the more pixels in the character shapes, the better. The ideal image is a grayscale image with a resolution around 300 DPI.
  • When performing OCR, make sure the OCR language setting matches the language in the PDF document to achieve the best OCR conversion quality.
  • OCR is currently not supported on Windows versions earlier than Windows 10.
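
The 300 DPI guideline above can be checked with simple arithmetic: the effective resolution of a scan is its pixel width divided by the physical page width in inches. The sketch below is a minimal illustration and not part of the SDK. The helper names `effective_dpi` and `meets_ocr_dpi`, and the decision to treat 300 as a hard threshold, are assumptions.

```c
#include <stdbool.h>

/* Hypothetical helper: effective resolution of a scan in dots per inch.
   Returns 0 when the physical width is not positive. */
double effective_dpi(int pixel_width, double page_width_inches)
{
    if (page_width_inches <= 0.0) {
        return 0.0;
    }
    return (double)pixel_width / page_width_inches;
}

/* Hypothetical check against the roughly-300-DPI guideline.
   Using 300 as a strict cutoff is an assumption. */
bool meets_ocr_dpi(int pixel_width, double page_width_inches)
{
    return effective_dpi(pixel_width, page_width_inches) >= 300.0;
}
```

For example, an 8.5-inch-wide page scanned at 2550 pixels across works out to 300 DPI, while the same page at 1275 pixels is 150 DPI and may be worth rescanning before OCR.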

Converting Images to Other Document Formats

The OCR function also supports converting input images into Word, Excel, PowerPoint, HTML, CSV, RTF, TXT, JSON, and other formats.

c
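/* Set the document AI model before starting the conversion. */
CPDF_SetDocumentAIModel(CPDF_TEXT("path/documentai.model"), -1);

/* Languages to recognize; language_count below is the number of entries. */
COCRLanguage languages[] = {e_CENGLISH};

/* Start from the default options, then enable OCR. */
CConvertOption option = CPDF_DefaultConvertOption();
option.enable_ocr = true;
option.languages = languages;
option.language_count = 1;

/* Pass the image path as the input file and an empty password. */
CPDF_StartPDFToWord(CPDF_TEXT("input.png"), CPDF_TEXT(""), CPDF_TEXT("path/output.docx"), option, NULL);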