OCR

Overview

OCR (Optical Character Recognition) is the process of converting images of typed, handwritten, or printed text into machine-encoded text.

OCR is commonly used for text recognition and extraction from the following types of documents:

  • Non-editable scanned PDF files.
  • Photographs of documents.
  • Scene photos such as advertising layouts, signboards, etc.
  • Identification cards, passports, vehicle license plates, and other official plates.
  • Invoices, bills, receipts, and other financial documents.

The following features support OCR:

  • PDF to Word
  • PDF to Excel
  • PDF to PowerPoint (PPT)
  • PDF to HTML
  • PDF to Rich Text Format (RTF)
  • PDF to Text (TXT)
  • Text extraction from PDF
  • Table extraction from PDF

OCR Language Support of ComPDFKit Conversion SDK:

| Script / Notes | Language (Native) | Language (In English) |
| --- | --- | --- |
| Latn; American | English | English |
| Latn; Canadian | Français canadien | French |
| Hans/Hant | 中文简体 | Chinese (Simplified) |
| Hans/Hant | 中文繁体 | Chinese (Traditional) |
| Jpan | 日本語 | Japanese |
| Kore | 한국어 | Korean |
| Latn | Deutsch | German |
| Latn | Српски (латиница) | Serbian (Latin) |
| Latn | Occitan, lenga d'òc, provençal | Occitan |
| Latn | Dansk | Danish |
| Latn | Italiano | Italian |
| Latn; European | Español | Spanish |
| Latn; European | Português (Portugal) | Portuguese |
| Latn | Te reo Māori | Maori |
| Latn | Bahasa Melayu | Malay |
| Latn | Malti | Maltese |
| Latn | Nederlands | Dutch |
| Latn; Bokmål | Norsk | Norwegian |
| Latn | Polski | Polish |
| Latn | Română | Romanian |
| Latn | Slovenčina | Slovak |
| Latn | Slovenščina | Slovenian |
| Latn | shqip | Albanian |
| Latn | Svenska | Swedish |
| Latn | Swahili | Swahili |
| Latn | Wikang Tagalog | Tagalog |
| Latn | Türkçe | Turkish |
| Latn | oʻzbekcha | Uzbek |
| Latn | Tiếng Việt | Vietnamese |
| Latn | Afrikaans | Afrikaans |
| Latn | Azərbaycan | Azerbaijani |
| Latn | Bosanski | Bosnian |
| Latn | Čeština | Czech |
| Latn | Cymraeg | Welsh |
| Latn | Eesti keel | Estonian |
| Latn | Gaeilge | Irish |
| Latn | Hrvatski | Croatian |
| Latn | Magyar | Hungarian |
| Latn | Bahasa Indonesia | Indonesian |
| Latn | Íslenska | Icelandic |
| Latn | Kurdî | Kurdish |
| Latn | Lietuvių | Lithuanian |
| Latn | Latviešu | Latvian |

Whether to include OCR background image

When OCR is enabled and the target conversion format is Word, PPT, RTF, or HTML, decide whether to set the IsContainOCRBgImage option. If IsContainOCRBgImage is set, a large image is written to the target document as a background image, and text and tables are displayed on top of it. If IsContainOCRBgImage is not set, the images on the PDF page are extracted and written into the target document.
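
The two modes can be sketched as the fragment below. It uses only the CPDFConvertWordOptions setters that appear in the samples in this guide. The SDK header import is omitted because this guide does not show it, so treat this as an illustrative fragment rather than a complete program.

```objective-c
CPDFConvertWordOptions *options = [[CPDFConvertWordOptions alloc] init];
// OCR must be enabled for IsContainOCRBgImage to take effect.
[options setIsAllowOCR:YES];

// Mode 1: write a large image into the target document as a background image;
// text and tables are displayed on top of it.
[options setIsContainOCRBgImage:YES];

// Mode 2 (alternative): extract the images on the PDF page and write them
// into the target document instead.
// [options setIsContainOCRBgImage:NO];
```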

Convert images to other document formats

The OCR function also supports converting input images into Word, Excel, PPT, HTML, CSV, RTF, TXT, and other formats. This sample demonstrates how to use the ComPDFKit OCR function to convert an image file to a DOCX file.

objective-c
// Support jpg, jpeg, png, bmp formats
NSString *inputFilePath = @"...";
// Get the path to the Word file.
NSString *outputPath = @"...";
CPDFConvertWordOptions *options = [[CPDFConvertWordOptions alloc] init];
// Enable OCR. The OCR language setting takes effect only when IsAllowOCR is true.
[options setIsAllowOCR:YES];
// Whether to contain images when converting, which takes effect only when IsAllowOCR is false.
[options setIsContainImages:YES];
// Set whether to contain background images, which takes effect only when IsAllowOCR is true. 
[options setIsContainOCRBgImage:YES];
// Whether to contain annotations when converting.
[options setIsContainAnnotations:YES];
// Layout option for the PDF to Word conversion parameter object (derived class of CPDFConvertOptions).
// CPDFConvertRetainPageLayout: retain the same layout as the original file by splitting the text into multiple text boxes according to its layout.
[options setLayoutOptions:CPDFConvertRetainPageLayout];
CPDFConverterWord *converter = [[CPDFConverterWord alloc] initWithURL:[NSURL fileURLWithPath:inputFilePath] password:nil];
[converter convertToFilePath:outputPath pageIndexs:nil options:options];

Notice

  • OCR quality depends on the quality of the input image: the lower the input resolution, the worse the OCR results. As a rule of thumb, the more pixels per glyph, the better. If a glyph's bounding box is smaller than 20x20 pixels, OCR quality begins to decline exponentially. The ideal input is a grayscale image with a resolution of around 300 DPI.
  • Set the OCR language so that it matches the language of the PDF document to obtain the best OCR conversion quality.
  • When the OCR option is enabled, the IsContainImages option no longer takes effect. Images in the PDF are then controlled by the IsContainOCRBgImage option.
  • When converting images to other document formats, note the supported input image formats: JPG, JPEG, PNG, BMP.
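
The input checks implied by these notes can be sketched as follows. Both functions are hypothetical helpers, not part of ComPDFKit, and rely only on Foundation. The pixel estimate assumes 1 point equals 1/72 inch and treats the font size as a rough upper bound on glyph height, so real glyphs may be somewhat smaller.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not a ComPDFKit API): returns YES when the file
// extension is one of the input image formats listed in this guide.
static BOOL IsSupportedOCRImagePath(NSString *path) {
    NSSet<NSString *> *supported =
        [NSSet setWithObjects:@"jpg", @"jpeg", @"png", @"bmp", nil];
    // Compare case-insensitively so that "scan.JPG" is accepted.
    return [supported containsObject:path.pathExtension.lowercaseString];
}

// Hypothetical helper: approximate height in pixels of text set at
// `pointSize` points in an image captured at `dpi` dots per inch.
static double EstimatedGlyphHeightInPixels(double pointSize, double dpi) {
    return pointSize * dpi / 72.0;
}
```

For example, 10-point text captured at 300 DPI comes to about 41.7 pixels, while the same text at 96 DPI comes to about 13.3 pixels, which is below the 20-pixel threshold mentioned above.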

Sample

This sample demonstrates how to use the ComPDFKit OCR function to convert a PDF to a DOCX file.

objective-c
// Get the path of the PDF file.
NSString *pdfPath = @"...";
// Get the path to the Word file.
NSString *outputPath = @"...";
CPDFConvertWordOptions *options = [[CPDFConvertWordOptions alloc] init];
// Enable OCR.
[options setIsAllowOCR:YES];
// Layout option for the PDF to Word conversion parameter object (derived class of CPDFConvertOptions).
// CPDFConvertRetainPageLayout: retain the same layout as the original file by splitting the text into multiple text boxes according to its layout.
[options setLayoutOptions:CPDFConvertRetainPageLayout];
// Set the OCR language to English, which takes effect only when IsAllowOCR is true.
[options setLanguage:COCRLanguageEnglish];
CPDFConverterWord *converter = [[CPDFConverterWord alloc] initWithURL:[NSURL fileURLWithPath:pdfPath] password:nil];
[converter convertToFilePath:outputPath pageIndexs:nil options:options];