OCR
Overview
OCR (Optical Character Recognition) is the process of converting images of typed, handwritten, or printed text into machine-encoded text.
OCR is commonly used for text recognition and extraction from the following types of documents:
- Non-editable scanned PDF files.
- Photographs of documents.
- Scene photos such as advertising layouts, signboards, etc.
- Identification cards, passports, vehicle license plates, and other official plates.
- Invoices, bills, receipts, and other financial documents.
The following features support OCR:
- PDF to Word
- PDF to Excel
- PDF to PowerPoint (PPT)
- PDF to HTML
- PDF to Rich Text Format (RTF)
- PDF to Text (TXT)
- PDF to CSV
- PDF to Searchable PDF
- PDF to OFD
- Extract PDF to JSON
- Extract PDF to Markdown
OCR languages supported by the ComPDF Conversion SDK:
| Script / Notes | Language (Native) | Language (In English) |
|---|---|---|
| Latn; American | English | English |
| Latn; Canadian | Français canadien | French |
| Hans | 中文简体 | Chinese (Simplified) |
| Hant | 中文繁体 | Chinese (Traditional) |
| Jpan | 日本語 | Japanese |
| Kore | 한국어 | Korean |
| Latn | Deutsch | German |
| Latn | Српски (латиница) | Serbian (Latin) |
| Latn | Occitan, lenga d'òc, provençal | Occitan |
| Latn | Dansk | Danish |
| Latn | Italiano | Italian |
| Latn; European | Español | Spanish |
| Latn; European | Português (Portugal) | Portuguese |
| Latn | Te reo Māori | Maori |
| Latn | Bahasa Melayu | Malay |
| Latn | Malti | Maltese |
| Latn | Nederlands | Dutch |
| Latn; Bokmål | Norsk | Norwegian |
| Latn | Polski | Polish |
| Latn | Română | Romanian |
| Latn | Slovenčina | Slovak |
| Latn | Slovenščina | Slovenian |
| Latn | shqip | Albanian |
| Latn | Svenska | Swedish |
| Latn | Swahili | Swahili |
| Latn | Wikang Tagalog | Tagalog |
| Latn | Türkçe | Turkish |
| Latn | oʻzbekcha | Uzbek |
| Latn | Tiếng Việt | Vietnamese |
| Latn | Afrikaans | Afrikaans |
| Latn | Azərbaycan | Azerbaijani |
| Latn | Bosanski | Bosnian |
| Latn | Čeština | Czech |
| Latn | Cymraeg | Welsh |
| Latn | Eesti keel | Estonian |
| Latn | Gaeilge | Irish |
| Latn | Hrvatski | Croatian |
| Latn | Magyar | Hungarian |
| Latn | Bahasa Indonesia | Indonesian |
| Latn | Íslenska | Icelandic |
| Latn | Kurdî | Kurdish |
| Latn | Lietuvių | Lithuanian |
| Latn | Latviešu | Latvian |
Set OCR Language
In the current mainline version, set the OCR languages through the Languages field of the options for each conversion task, rather than through a separate global interface.
inputFilePath := "***"
password := "***"
outputFileName := "***"
wordOptions := compdf.NewWordOptions()
wordOptions.EnableOCR = true
wordOptions.Languages = []compdf.OCRLanguage{compdf.OCRLangEnglish}
err := compdf.StartPDFToWord(inputFilePath, password, outputFileName, wordOptions, nil)
OCR Options
Choose the OCR option that fits your document. The following options are currently supported:
- OCRInvalidCharacter: Recognizes invalid or garbled characters in the PDF document through OCR, while normal characters are not processed by OCR.
- OCRScanPage: Recognizes scanned pages in the PDF document through OCR, while editable pages are not processed by OCR.
- OCRInvalidCharacterAndScanned: Recognizes both invalid characters and scanned pages in the PDF document through OCR.
- OCRAll: Recognizes all pages and characters in the PDF document through OCR.
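The four options above differ only in which content is sent to OCR. The sketch below models that selection in plain Go so the differences are easy to compare. The `OCRMode` type, its constants, `pageInfo`, and `needsOCR` are hypothetical local stand-ins that mirror the option names; they are not the SDK's types or its actual implementation.

```go
package main

import "fmt"

// OCRMode is a hypothetical local type mirroring the four option names
// listed above. It is NOT the SDK's own type.
type OCRMode int

const (
	OCRInvalidCharacter OCRMode = iota
	OCRScanPage
	OCRInvalidCharacterAndScanned
	OCRAll
)

// pageInfo is a hypothetical summary of a single PDF page.
type pageInfo struct {
	scanned         bool // the page is a scanned image without editable text
	hasInvalidChars bool // the page contains invalid or garbled characters
}

// needsOCR reports whether any content on the page would be sent to OCR
// under the given mode, following the descriptions in the list above.
func needsOCR(mode OCRMode, p pageInfo) bool {
	switch mode {
	case OCRInvalidCharacter:
		return p.hasInvalidChars
	case OCRScanPage:
		return p.scanned
	case OCRInvalidCharacterAndScanned:
		return p.hasInvalidChars || p.scanned
	case OCRAll:
		return true
	default:
		return false
	}
}

func main() {
	pages := []pageInfo{
		{scanned: false, hasInvalidChars: false}, // normal editable page
		{scanned: true, hasInvalidChars: false},  // scanned page
		{scanned: false, hasInvalidChars: true},  // editable page with garbled text
	}
	modes := []struct {
		name string
		mode OCRMode
	}{
		{"OCRInvalidCharacter", OCRInvalidCharacter},
		{"OCRScanPage", OCRScanPage},
		{"OCRInvalidCharacterAndScanned", OCRInvalidCharacterAndScanned},
		{"OCRAll", OCRAll},
	}
	for _, m := range modes {
		for i, p := range pages {
			fmt.Printf("%-30s page %d -> OCR: %t\n", m.name, i+1, needsOCR(m.mode, p))
		}
	}
}
```

In this model, OCRAll is the most thorough choice and touches every page, while the narrower options leave pages that already contain usable text untouched.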
Preserve Page Background
When OCR is enabled, the ContainPageBackgroundImage option controls how page images are handled. If this option is enabled, the original page background image of the PDF is preserved. If it is disabled, the images detected during page layout analysis are retained instead.
Notice
- The quality of the OCR result depends on the quality of the input image, and low-resolution input degrades it. A good rule of thumb is that the more pixels in the character shapes, the better. If the character bounding box is smaller than 20x20 pixels, OCR quality drops sharply. The ideal image is a grayscale image with a resolution around 300 DPI.
- When performing OCR, make sure the OCR language setting matches the language in the PDF document to achieve the best OCR conversion quality.
- OCR is currently not supported on Windows versions earlier than Windows 10.
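To relate the 20x20 pixel rule of thumb to scan resolution, the sketch below estimates how tall printed text appears in pixels at a given DPI, using the fact that one typographic point is 1/72 inch. The helper names `glyphHeightPx` and `minDPI` are hypothetical and not part of the SDK, and the estimate uses the full font size, so real character bounding boxes are usually somewhat smaller.

```go
package main

import "fmt"

const (
	pointsPerInch = 72.0 // one typographic point is 1/72 inch
	minCharPx     = 20.0 // rule-of-thumb threshold from the note above
)

// glyphHeightPx approximates the height in pixels of text set at pointSize
// when the page is scanned or rendered at dpi. It uses the full font size,
// so it is an upper bound on the actual character bounding box.
func glyphHeightPx(pointSize, dpi float64) float64 {
	return pointSize * dpi / pointsPerInch
}

// minDPI returns the lowest resolution at which text of pointSize reaches
// minPx pixels in height.
func minDPI(pointSize, minPx float64) float64 {
	return minPx * pointsPerInch / pointSize
}

func main() {
	for _, dpi := range []float64{72, 150, 300} {
		h := glyphHeightPx(10, dpi)
		fmt.Printf("10 pt text at %3.0f DPI: about %.1f px tall (meets 20 px: %t)\n",
			dpi, h, h >= minCharPx)
	}
	fmt.Printf("minimum DPI for 10 pt text: %.0f\n", minDPI(10, minCharPx))
}
```

Under these assumptions, small body text scanned at screen resolution falls below the threshold, which is one reason a resolution around 300 DPI is a safer default.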
Convert Images to Other Document Formats
The OCR function also supports converting input images into Word, Excel, PPT, HTML, CSV, RTF, TXT, JSON, and other formats. This sample demonstrates how to use the ComPDF OCR function to convert an image file to a DOCX file.
inputFilePath := "***"
password := "***"
outputFileName := "***"
wordOptions := compdf.NewWordOptions()
wordOptions.EnableOCR = true
wordOptions.Languages = []compdf.OCRLanguage{compdf.OCRLangEnglish}
err := compdf.StartPDFToWord(inputFilePath, password, outputFileName, wordOptions, nil)
Sample
This sample demonstrates how to use the ComPDF OCR function to convert a PDF to a DOCX file.
inputFilePath := "***"
password := "***"
outputFileName := "***"
wordOptions := compdf.NewWordOptions()
wordOptions.EnableOCR = true
wordOptions.Languages = []compdf.OCRLanguage{compdf.OCRLangEnglish}
err := compdf.StartPDFToWord(inputFilePath, password, outputFileName, wordOptions, nil)