Data Extraction Guides

ComPDFKit Conversion SDK's Data Extraction detects, recognizes, analyzes, and extracts text, images, tables, and other content from PDF documents.

Extract PDF to JSON

Overview

Extract text, tables, and images from PDF documents to a JSON file.

Table Extraction Option

ComPDFKit Conversion SDK supports the json_contain_table option. When it is enabled, table content is extracted from the PDF and output with its table structure; otherwise, table content is treated as regular text.

Standard and non-standard tables

Commonly, tables can be divided into two categories: standard tables and non-standard tables. The specific definitions are as follows:

Standard table: The table border and inner lines are complete and clear, so no table lines need to be added manually to divide the table content.
Non-standard table: The table lacks a border or clear inner lines, so table lines must be added manually to separate its contents.
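
The two categories above can be expressed as a small predicate. This is a minimal sketch for illustration only: TableLines, its fields, TableKind, and Classify are hypothetical names and are not part of the ComPDFKit API.

```cpp
// Hypothetical description of the ruling lines found around a table.
// These names are illustrative and are not part of the ComPDFKit API.
struct TableLines {
    bool has_complete_border;    // outer border is complete and clear
    bool has_clear_inner_lines;  // inner row/column lines are complete and clear
};

enum class TableKind { Standard, NonStandard };

// A table counts as standard only when both the border and the inner lines
// are complete; otherwise lines would have to be added by hand.
constexpr TableKind Classify(const TableLines& t) {
    return (t.has_complete_border && t.has_clear_inner_lines)
               ? TableKind::Standard
               : TableKind::NonStandard;
}
```

For example, a table with a full border but no inner rules would be classified as non-standard under this sketch.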

Notice

Tables in the original PDF cannot be extracted unless AI layout analysis or OCR is enabled. Enabling one of these options is recommended for high-precision table recognition.
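
The interaction between the table option and this notice can be summarized as a decision function. This is a sketch under stated assumptions, not the SDK's implementation: ExtractionSettings, ai_layout_analysis, ocr, and TablesExtractedAsStructure are hypothetical names that model the behavior described in this guide. Only json_contain_table is named here, and the SDK's actual option names for AI layout analysis and OCR may differ.

```cpp
// Hypothetical model of the settings discussed in this guide.
struct ExtractionSettings {
    bool json_contain_table;  // option named in this guide
    bool ai_layout_analysis;  // assumed flag name, not taken from the SDK
    bool ocr;                 // assumed flag name, not taken from the SDK
};

// Returns true when table content would be output as table structure,
// and false when it would be treated as regular text.
constexpr bool TablesExtractedAsStructure(const ExtractionSettings& s) {
    return s.json_contain_table && (s.ai_layout_analysis || s.ocr);
}
```

Under this reading, enabling json_contain_table alone is not sufficient: at least one of AI layout analysis or OCR must also be enabled.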

Sample

Full sample code that illustrates extracting a PDF to a JSON file.

c++

ConvertOptions opt;
// Extract PDF to JSON.
CPDFConversion::StartPDFToJson("json.pdf", "password", "path/output.json", opt);
