Extraction Guides

Use the data extraction features of the ComPDFKit Conversion SDK to detect, recognize, analyze, and extract text, images, tables, and other content from PDF documents.

Extract Text from PDFs

Overview

This guide shows how to extract text content from a PDF document with the CPDFConverterTextToJson class.

Note

  • Text read from PDF content streams with the CPDFConverterTextToJson class is often fragmented. For example, when extracting the sentence "This is a sample sentence." from a PDF document, you may receive it as separate pieces such as "This" and "is a sample sentence.". This happens because text objects in PDFs are not always cleanly organized into words, sentences, or paragraphs. When OCR is disabled, the CPDFConverterTextToJson class returns text objects exactly as they are defined in the PDF page content streams.
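The fragmentation described in the note can be handled after extraction. The following is a minimal sketch of one possible approach, not part of the ComPDFKit API: it assumes you have already parsed the JSON output into a list of fragments with a position and width. The TextFragment class, its field names, the mergeFragments function, and the tolerance values are all hypothetical, and the sketch assumes a coordinate system in which y increases upward, so larger y values are higher on the page.

```kotlin
import kotlin.math.abs

// Hypothetical model of one extracted text piece. This is NOT a ComPDFKit type;
// the field names are assumptions made for illustration only.
data class TextFragment(val text: String, val x: Float, val y: Float, val width: Float)

// Groups fragments into lines by similar y position, orders each line left to
// right, and joins the pieces. A single space is inserted when the horizontal
// gap between two neighbours exceeds gapThreshold.
// Assumes y grows upward, so lines are returned top to bottom by descending y.
fun mergeFragments(
    fragments: List<TextFragment>,
    yTolerance: Float = 2f,   // assumed tolerance for "same line"
    gapThreshold: Float = 1f  // assumed minimum gap that counts as a word break
): List<String> {
    val lines = mutableListOf<MutableList<TextFragment>>()
    for (fragment in fragments.sortedByDescending { it.y }) {
        val current = lines.lastOrNull()
        if (current != null && abs(current.first().y - fragment.y) <= yTolerance) {
            current.add(fragment)
        } else {
            lines.add(mutableListOf(fragment))
        }
    }
    return lines.map { line ->
        val sorted = line.sortedBy { it.x }
        val sb = StringBuilder()
        for ((i, fragment) in sorted.withIndex()) {
            if (i > 0) {
                val prev = sorted[i - 1]
                val gap = fragment.x - (prev.x + prev.width)
                val needsSpace = gap > gapThreshold &&
                    !sb.endsWith(" ") && !fragment.text.startsWith(" ")
                if (needsSpace) sb.append(' ')
            }
            sb.append(fragment.text)
        }
        sb.toString()
    }
}
```

With this sketch, the two pieces "This" and "is a sample sentence." that sit on the same line with a small horizontal gap between them are joined into "This is a sample sentence.". Real documents may need different tolerances, or a different strategy altogether for rotated or multi-column text.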

Sample

The sample code below illustrates the text extraction capabilities.

kotlin
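// Create the converter for the source PDF identified by uri.
val cPDFConvert = CPDFConverterTextToJson(context, uri, "")

// Use the default text-to-JSON conversion options.
val params = CPDFConvertTextToJsonOptions()

// Run the conversion. outputDir, outputFilenameNoSuffix, pageArrays, and the
// three callbacks are expected to be defined by the caller.
val result: ConvertError = cPDFConvert.convert(
    outputDir, outputFilenameNoSuffix, params, pageArrays,
    onHandle = onHandleCal,
    onProgress = onProgressCal,
    onPost = onPostCal
)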