Overview
OCR (Optical Character Recognition) converts images of typed, handwritten, or printed text into machine-encoded text.
OCR is commonly used for text recognition and extraction from the following types of documents:
- Non-editable scanned PDF files.
- Photographs of documents.
- Scene photos such as advertising layouts and signboards.
- Identification cards, passports, vehicle license plates, invoices, bills, and receipts.
The following features support OCR:
- PDF to Word
- PDF to Excel
- PDF to PowerPoint
- PDF to HTML
- PDF to RTF
- PDF to TXT
- PDF to CSV
- PDF to Searchable PDF
- PDF to OFD
- Extract PDF to JSON
- Extract PDF to Markdown
OCR Language
Pass OCR languages through CConvertOption.languages and set CConvertOption.language_count for each conversion task.
COCRLanguage languages[] = {e_CENGLISH};
CConvertOption option = CPDF_DefaultConvertOption();
option.enable_ocr = true;
option.languages = languages;
option.language_count = 1;
CPDF_StartPDFToWord(CPDF_TEXT("word.pdf"), CPDF_TEXT("password"), CPDF_TEXT("path/output.docx"), option, NULL);
Supported C enum values include:
| Enum | Description |
|---|---|
| e_CCHINESE | Chinese Simplified |
| e_CCHINESE_TRA | Chinese Traditional |
| e_CENGLISH | English |
| e_CKOREAN | Korean |
| e_CJAPANESE | Japanese |
| e_CLATIN | Latin script languages |
| e_CDEVANAGARI | Devanagari |
| e_CCYRILLIC | Cyrillic |
| e_CARABIC | Arabic |
| e_CTAMIL | Tamil |
| e_CTELUGU | Telugu |
| e_CKANNADA | Kannada |
| e_CTHAI | Thai |
| e_CGREEK | Greek |
| e_CEslav | Eslav |
| e_CAUTO | Automatically select language |
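Because the languages are passed as an array plus a count, the count is easy to get out of sync with the array. The following is a minimal sketch of deriving the count from the array itself. It assumes that more than one language may be listed per task, which the array-and-count shape suggests but this page does not state. `SketchLanguage`, `SketchConvertOption`, `SKETCH_ARRAY_LEN`, and `sketch_make_option` are hypothetical stand-ins so the example compiles without the SDK header. Only the field names `enable_ocr`, `languages`, and `language_count` come from the text above, and their real types may differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SDK types; the real COCRLanguage and
   CConvertOption come from the SDK header and may be laid out differently. */
typedef enum { SKETCH_ENGLISH, SKETCH_JAPANESE, SKETCH_KOREAN } SketchLanguage;

typedef struct {
    bool enable_ocr;
    const SketchLanguage *languages;
    int language_count; /* assumed to be an int; check the SDK header */
} SketchConvertOption;

/* Element count of a true array (not a pointer), so the count always
   matches the initializer list. */
#define SKETCH_ARRAY_LEN(a) ((int)(sizeof(a) / sizeof((a)[0])))

/* Builds an option with OCR enabled only when at least one language is given. */
SketchConvertOption sketch_make_option(const SketchLanguage *langs, int count) {
    SketchConvertOption option = { false, NULL, 0 };
    if (langs != NULL && count > 0) {
        option.enable_ocr = true;
        option.languages = langs;
        option.language_count = count;
    }
    return option;
}
```

In this sketch the option stores a pointer rather than a copy, so the language array has to stay alive for as long as the option is in use. `SKETCH_ARRAY_LEN` only works on an array that is in scope; applied to a pointer parameter it would give the wrong count.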
OCR Options
Different OCR options can be selected according to actual needs:
- e_CInvalidCharacter: Recognizes invalid or garbled characters in the PDF document through OCR, while normal characters are not processed by OCR.
- e_CScanPage: Recognizes scanned pages in the PDF document through OCR, while editable pages are not processed by OCR.
- e_CInvalidCharacterAndScanPage: Recognizes both invalid characters and scanned pages in the PDF document through OCR.
- e_CAll: Recognizes all pages and characters in the PDF document through OCR.
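The four options differ only in which content is routed through OCR. As a rough illustration, the selection rule described above can be written as a small predicate. This is a sketch of the documented behavior, not the SDK's implementation: `SketchOCROption`, its constants, and `sketch_needs_ocr` are hypothetical names that mirror the options listed above.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four documented options. */
typedef enum {
    SKETCH_INVALID_CHARACTER,               /* mirrors e_CInvalidCharacter */
    SKETCH_SCAN_PAGE,                       /* mirrors e_CScanPage */
    SKETCH_INVALID_CHARACTER_AND_SCAN_PAGE, /* mirrors e_CInvalidCharacterAndScanPage */
    SKETCH_ALL                              /* mirrors e_CAll */
} SketchOCROption;

/* Returns true when content would be sent through OCR under the given option.
   on_scanned_page: the content sits on a scanned (non-editable) page.
   is_invalid_char: the character is invalid or garbled. */
bool sketch_needs_ocr(SketchOCROption opt, bool on_scanned_page, bool is_invalid_char) {
    switch (opt) {
    case SKETCH_INVALID_CHARACTER:
        return is_invalid_char;
    case SKETCH_SCAN_PAGE:
        return on_scanned_page;
    case SKETCH_INVALID_CHARACTER_AND_SCAN_PAGE:
        return is_invalid_char || on_scanned_page;
    case SKETCH_ALL:
        return true;
    }
    return false; /* unknown option: do not run OCR */
}
```

Read this way, the combined option is the union of the first two, and the last option skips the checks entirely, which is typically the slowest choice but does not depend on page or character detection.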
Preserve Page Background
When OCR is enabled, you can enable contain_page_background_image to preserve the original page background image of the PDF. If it is disabled, the image result detected during page layout analysis will be retained.
Notice
- The quality of the OCR result depends on the quality of the input image. A good rule of thumb is that the more pixels in the character shapes, the better. The ideal image is a grayscale image with a resolution around 300 DPI.
- When performing OCR, make sure the OCR language setting matches the language in the PDF document to achieve the best OCR conversion quality.
- OCR is not currently supported on Windows versions earlier than Windows 10.
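To check an input image against the 300 DPI guideline above, the effective resolution can be estimated from the pixel size and the physical size of the original page. The helper below is a minimal sketch. `sketch_effective_dpi` is a hypothetical name, not an SDK function, and the physical page size is something the caller has to know or assume.

```c
/* Effective resolution along one edge: pixel count divided by the physical
   length of that edge in inches. Returns 0.0 for a non-positive length. */
double sketch_effective_dpi(int pixels, double inches) {
    if (inches <= 0.0) {
        return 0.0;
    }
    return (double)pixels / inches;
}
```

For example, a US Letter page (8.5 inches wide) captured at 2550 pixels wide works out to 300 DPI, while the same page captured at 1275 pixels wide is only 150 DPI and may be recognized less reliably.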
Converting Images to Other Document Formats
The OCR function also supports converting input images into Word, Excel, PowerPoint, HTML, CSV, RTF, TXT, JSON, and other formats.
CPDF_SetDocumentAIModel(CPDF_TEXT("path/documentai.model"), -1);
COCRLanguage languages[] = {e_CENGLISH};
CConvertOption option = CPDF_DefaultConvertOption();
option.enable_ocr = true;
option.languages = languages;
option.language_count = 1;
CPDF_StartPDFToWord(CPDF_TEXT("input.png"), CPDF_TEXT(""), CPDF_TEXT("path/output.docx"), option, NULL);