OCR

Overview

OCR (Optical Character Recognition) is the process of converting images of typed, handwritten, or printed text into machine-encoded text.

OCR is commonly used for text recognition and extraction from the following types of documents:

Non-editable scanned PDF files.
Photographs of documents.
Scene photos such as advertising layouts, signboards, etc.
Identification cards, passports, vehicle license plates, and other official plates.
Invoices, bills, receipts, and other financial documents.

The following features support OCR:

PDF to Word
PDF to Excel
PDF to PowerPoint (PPT)
PDF to HTML
PDF to Rich Text Format (RTF)
PDF to Text (TXT)
PDF to CSV
PDF to Searchable PDF
PDF to OFD
Extract PDF to JSON
Extract PDF to Markdown

OCR languages supported by the ComPDF Conversion SDK:

Script / Notes	Language (Native)	Language (In English)
Latn; American	English	English
Latn; Canadian	Français canadien	French
Hans	中文简体	Chinese (Simplified)
Hant	中文繁体	Chinese (Traditional)
Jpan	日本語	Japanese
Kore	한국어	Korean
Latn	Deutsch	German
Latn	Српски (латиница)	Serbian (Latin)
Latn	Occitan, lenga d'òc, provençal	Occitan
Latn	Dansk	Danish
Latn	Italiano	Italian
Latn; European	Español	Spanish
Latn; European	Português (Portugal)	Portuguese
Latn	Te reo Māori	Maori
Latn	Bahasa Melayu	Malay
Latn	Malti	Maltese
Latn	Nederlands	Dutch
Latn; Bokmål	Norsk	Norwegian
Latn	Polski	Polish
Latn	Română	Romanian
Latn	Slovenčina	Slovak
Latn	Slovenščina	Slovenian
Latn	shqip	Albanian
Latn	Svenska	Swedish
Latn	Swahili	Swahili
Latn	Wikang Tagalog	Tagalog
Latn	Türkçe	Turkish
Latn	oʻzbekcha	Uzbek
Latn	Tiếng Việt	Vietnamese
Latn	Afrikaans	Afrikaans
Latn	Azərbaycan	Azerbaijani
Latn	Bosanski	Bosnian
Latn	Čeština	Czech
Latn	Cymraeg	Welsh
Latn	Eesti keel	Estonian
Latn	Gaeilge	Irish
Latn	Hrvatski	Croatian
Latn	Magyar	Hungarian
Latn	Bahasa Indonesia	Indonesian
Latn	Íslenska	Icelandic
Latn	Kurdî	Kurdish
Latn	Lietuvių	Lithuanian
Latn	Latviešu	Latvian

Set OCR Language

In the current mainline version, set the OCR languages for each conversion task through the Languages property of that task's options object, rather than through a separate global interface.

string inputFilePath = "***";
string password = "***";
string outputFileName = "***";
WordOptions wordOptions = new WordOptions();
wordOptions.ContainImage = true;
wordOptions.ContainAnnotation = true;
// Enable OCR option.
wordOptions.EnableOCR = true;
wordOptions.Languages = new List<OCRLanguage> { OCRLanguage.e_ENGLISH };
ErrorCode error = CPDFConversion.StartPDFToWord(inputFilePath, password, outputFileName, wordOptions);

OCR Options

Different OCR options can be selected according to actual needs. Below are the currently supported OCR options.

OCROption.e_InvalidCharacter: Recognizes invalid or garbled characters in the PDF document through OCR, while normal characters are not processed by OCR.
OCROption.e_ScanPage: Recognizes scanned pages in the PDF document through OCR, while editable pages are not processed by OCR.
OCROption.e_InvalidCharacterAndScanPage: Recognizes both invalid characters and scanned pages in the PDF document through OCR.
OCROption.e_All: Recognizes all pages and characters in the PDF document through OCR.
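The four options differ only in which content is sent through OCR. The sketch below models that selection logic as a small self-contained C# helper. OcrScope and OcrScopeModel are hypothetical stand-ins written for illustration; they are not SDK types, and the real engine may decide at a different granularity than this per-item model assumes.

```csharp
using System;

// Hypothetical mirror of the four OCROption values described above (not the SDK enum).
public enum OcrScope
{
    InvalidCharacter,
    ScanPage,
    InvalidCharacterAndScanPage,
    All
}

public static class OcrScopeModel
{
    // Returns true if, according to the option descriptions, a piece of content
    // with the given properties would be processed by OCR.
    public static bool WouldOcr(OcrScope scope, bool onScannedPage, bool isInvalidCharacter)
    {
        switch (scope)
        {
            case OcrScope.InvalidCharacter:
                return isInvalidCharacter;                  // normal characters are skipped
            case OcrScope.ScanPage:
                return onScannedPage;                       // editable pages are skipped
            case OcrScope.InvalidCharacterAndScanPage:
                return isInvalidCharacter || onScannedPage; // union of the two cases above
            case OcrScope.All:
                return true;                                // everything is recognized
            default:
                throw new ArgumentOutOfRangeException(nameof(scope));
        }
    }
}
```

Read this way, e_InvalidCharacterAndScanPage targets only content that needs OCR, while e_All also re-recognizes text that is already valid.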

Preserve Page Background

When OCR is enabled, you can choose whether to enable the ContainPageBackgroundImage option. If this option is enabled, the original page background image of the PDF is preserved. If it is disabled, the images detected during page layout analysis are retained instead.

Notice

The quality of the OCR result depends on the quality of the input image, and a low-resolution input lowers it. As a rule of thumb, the more pixels each character shape covers, the better. If a character's bounding box is smaller than 20×20 pixels, OCR quality degrades sharply. The ideal input is a grayscale image with a resolution of around 300 DPI.
When performing OCR, make sure the OCR language setting matches the language in the PDF document to achieve the best OCR conversion quality.
OCR is currently not supported on Windows versions earlier than Windows 10.
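To relate the 20-pixel guideline to scan resolution, the following sketch estimates how tall text of a given point size is at a given DPI, using the standard relation 1 point = 1/72 inch. OcrInputCheck and its members are hypothetical helpers, not part of the SDK. The font size is the nominal em height, so real glyphs are somewhat smaller and the result should be read as an optimistic estimate.

```csharp
using System;

public static class OcrInputCheck
{
    // Minimum character size in pixels suggested by the note above.
    public const double MinCharPixels = 20.0;

    // Converts a font size in points to pixels at the given resolution (1 pt = 1/72 inch).
    public static double PointsToPixels(double fontSizePoints, double dpi)
    {
        if (fontSizePoints <= 0) throw new ArgumentOutOfRangeException(nameof(fontSizePoints));
        if (dpi <= 0) throw new ArgumentOutOfRangeException(nameof(dpi));
        return fontSizePoints / 72.0 * dpi;
    }

    // True if text of this size is estimated to reach the 20-pixel guideline.
    public static bool IsLikelyLargeEnough(double fontSizePoints, double dpi)
    {
        return PointsToPixels(fontSizePoints, dpi) >= MinCharPixels;
    }
}
```

For example, 10 pt text scanned at 300 DPI comes out at roughly 41.7 pixels, while the same text at 96 DPI is about 13.3 pixels and falls below the guideline.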

Convert Images to Other Document Formats

The OCR function also supports converting input images into Word, Excel, PPT, HTML, CSV, RTF, TXT, JSON, and other formats. This sample demonstrates how to use the ComPDF OCR function to convert an image file to a DOCX file.

// Supports jpg, jpeg, png, bmp, tiff, webp formats
string inputFilePath = "***";
string password = "***";
string outputFileName = "***";
WordOptions wordOptions = new WordOptions();
wordOptions.ContainImage = true;
wordOptions.ContainAnnotation = true;
// Enable OCR option.
wordOptions.EnableOCR = true;
wordOptions.Languages = new List<OCRLanguage> { OCRLanguage.e_ENGLISH };
ErrorCode error = CPDFConversion.StartPDFToWord(inputFilePath, password, outputFileName, wordOptions);
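Before passing an image to the conversion call, it can help to confirm that the file has one of the extensions listed in the comment above. The sketch below is a minimal pre-check. ImageInputCheck is a hypothetical helper, not an SDK API, and it inspects only the file name, not the file contents. Only the extensions named in this document are accepted, so variants such as .tif are treated as unsupported here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ImageInputCheck
{
    // Extensions taken from the list in the sample above; comparison ignores case.
    private static readonly HashSet<string> SupportedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp"
        };

    // True if the path ends in one of the listed image extensions.
    public static bool HasSupportedExtension(string path)
    {
        if (string.IsNullOrEmpty(path)) return false;
        return SupportedExtensions.Contains(Path.GetExtension(path));
    }
}
```

A caller could skip the conversion and report an error early when HasSupportedExtension returns false, instead of waiting for the conversion to fail.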

Sample

This sample demonstrates how to use the ComPDF OCR function to convert a PDF to a DOCX file.

string inputFilePath = "***";
string password = "***";
string outputFileName = "***";
WordOptions wordOptions = new WordOptions();
wordOptions.ContainImage = true;
wordOptions.ContainAnnotation = true;
// Enable OCR option.
wordOptions.EnableOCR = true;
wordOptions.Languages = new List<OCRLanguage> { OCRLanguage.e_ENGLISH };
ErrorCode error = CPDFConversion.StartPDFToWord(inputFilePath, password, outputFileName, wordOptions);
