
C# Extract Data from PDF Files (Beginner Tutorial)

By ComPDFKit | Fri. 29 Mar. 2024
Conversion SDK | C# | Data Extraction

Extracting data from PDF files programmatically saves time on manual input and improves work efficiency. However, because PDFs have an unstructured format and mix diverse elements, extracting data accurately can be challenging. This article introduces developers to the ComPDFKit library and shows how to use it to extract text, tables, and images from PDF files in C#.

 

How to Extract Data from PDF in C#

1. Download C# PDF Data Extraction Library

2. Create a New Project in Visual Studio

3. Install Library to Your Project

4. Apply Your License Key

5. Extract Data from PDFs in C#

    • Extract Text from PDF in C#

    • Extract Table from PDF in C#

    • Extract All Content from PDF in C#

 

By following these steps, developers can effectively leverage the ComPDFKit library to extract data from PDF files using C#. This not only streamlines data extraction processes but also enhances overall productivity by automating manual tasks.

 


 

C# PDF Data Extraction Library

ComPDFKit, a powerful and comprehensive C# PDF library, empowers developers to seamlessly handle a variety of PDF tasks, including viewing, editing, annotating, signing, and converting documents. Whether your project involves invoices, bank statements, construction blueprints, research papers, or business reports, ComPDFKit offers a versatile solution tailored to your specific needs.

 

One of the standout features of the ComPDFKit library is its exceptional capability for extracting data from PDF files. Here's how it excels:

    • Extract All Page Elements: ComPDFKit can fully recognize and extract characters, words, fonts, form fields, images, positions, and data from PDF documents, providing structured output in formats such as JSON or XML for further processing.

    • Analyze Document Structure: The library analyzes the structure of PDF files by categorizing headings, tables, headers, footers, and paragraphs in natural reading order, ensuring structural coherence with the original document.

    • High Accuracy Extraction: Leveraging advanced AI technology, ComPDFKit's Document AI enhances the accuracy of information extraction, layout analysis, image classification, and Visual Question Answering (VQA) effectiveness.

    • Cross-platform Compatibility: ComPDFKit seamlessly integrates with various platforms, supporting deployment on PCs, mobile devices, and cross-platform frameworks. Developers can easily deploy the library locally or utilize online APIs for data extraction tasks.

 

In this article, we'll delve into extracting data using ComPDFKit in C#. To begin, you can conveniently access the SDK by contacting our sales team. Next, to create a new project in Visual Studio and integrate the ComPDFKit library for PDF data extraction in C#, follow these steps:

 

Step 1: Create a New Project in Visual Studio

1. Open Visual Studio and navigate to the "File" menu.

2. Select "New" and then "Project...".

3. Choose "Visual C#" -> "Windows Desktop" -> "Console App (.NET Framework)".

 

Step 2: Configure Your New Project

1. Specify a name and location for your project.

2. Ensure that ".NET Framework 4.6.1" is selected as the target framework.

3. Click on the "OK" button to create your console application project.

 

Step 3: Install the ComPDFKit C# PDF Library

1. Copy all files in the "lib" folder from the ComPDFKit package to the project folder.

2. Add the ComPDFKit Conversion SDK dynamic library to References:

    • Right-click the project in the Solution Explorer and select "Add" -> "Reference...".

    • In the Add Reference dialog, go to the "Browse" tab.

    • Navigate to the project folder and select "ComPDFKit_Conversion.dll".

    • Click "OK" to add the reference.

3. Add the "x64" and "x86" folders from the ComPDFKit package to the project.

    • Ensure that the "Copy to Output Directory" property of "CPDFConverterNative.dll" and "opencv_world420.dll" is set to "Copy if newer".

4. Copy the "resource" folder from the ComPDFKit package to the project folder.

    • Set the "Copy to Output Directory" property of all files in the "resource" folder to "Copy if newer".

 

Step 4: Apply the License Key

Before using the PDF extraction API, initialize the ComPDFKit library with a valid license key obtained from our sales team:

string resPath = "***"; // Path to the resource folder
string libPath = "***"; // Path to the ComPDFKit library
string license = "***"; // Your license key

CPDFConverter.InitLibrary(libPath);
CPDFConverter.InitResource(resPath);
CPDFConverter.LicenseVerify(license);
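
Hard-coding the license key and paths works for a first test, but a missing resource folder or an empty key is easier to diagnose when it is caught before the SDK is initialized. The following is a minimal sketch of that idea, not part of ComPDFKit: the SdkSetup class and the COMPDFKIT_LICENSE environment variable name are hypothetical, and only the three CPDFConverter calls shown above come from the SDK.

```csharp
using System;
using System.IO;

// Hypothetical helper class; it is not part of the ComPDFKit SDK.
public static class SdkSetup
{
    // Reads the license key from an environment variable so that it is not
    // stored in source code. Throws if the variable is missing or blank.
    public static string ReadLicenseKey(string variableName)
    {
        string key = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrWhiteSpace(key))
        {
            throw new InvalidOperationException(
                "License key variable '" + variableName + "' is not set.");
        }
        return key.Trim();
    }

    // Confirms that a folder (for example, the resource folder) exists and
    // returns its absolute path.
    public static string RequireDirectory(string path, string description)
    {
        if (!Directory.Exists(path))
        {
            throw new DirectoryNotFoundException(description + " not found: " + path);
        }
        return Path.GetFullPath(path);
    }
}

// Possible use with the initialization calls from this step
// (COMPDFKIT_LICENSE is an assumed variable name):
//   string license = SdkSetup.ReadLicenseKey("COMPDFKIT_LICENSE");
//   string resPath = SdkSetup.RequireDirectory("resource", "Resource folder");
//   CPDFConverter.InitLibrary(libPath);
//   CPDFConverter.InitResource(resPath);
//   CPDFConverter.LicenseVerify(license);
```

Keeping the key in an environment variable also keeps it out of version control.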

 

Now that your project is set up and the license key is applied, you're ready to extract data from PDFs using ComPDFKit in your C# application.

 

Extract Text from PDF in C#

When dealing with PDF files, extracting important data elements such as order numbers, dates, totals, and other fields from invoices is crucial for gaining insight into the information contained within the files. Manual processing of large volumes of PDF documents to locate this information can be time-consuming and resource-intensive. However, this process can be streamlined and automated using text extraction methods.

 

Extracting All Text from the Entire PDF Document:

string inputFilePath = "***"; // Path to the input PDF file
string outputFolderPath = "***"; // Path to the output folder
string outputFileName = "***"; // Name of the output file

// Create a converter instance for JSON text extraction
CPDFConverterJsonText converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonText, inputFilePath) as CPDFConverterJsonText;

// Specify JSON conversion options
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false; // Disable OCR during conversion

// Initialize error variable
ConvertError error = ConvertError.ERR_UNKNOWN;

// Convert the entire PDF document to JSON text
converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);

 

Extracting Text from Specific Pages:

string inputFilePath = "***"; // Path to the input PDF file
string outputFolderPath = "***"; // Path to the output folder
string outputFileName = "***"; // Name of the output file

// Create a converter instance for JSON text extraction
CPDFConverterJsonText converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonText, inputFilePath) as CPDFConverterJsonText;

// Specify JSON conversion options
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false; // Disable OCR during conversion

// Get the total number of pages in the PDF document
int pageCount = converter.GetPagesCount();

// List the 1-based page numbers to extract. This loop selects every page;
// to extract specific pages, fill the array with only the numbers you need.
int[] pageArray = new int[pageCount];
for (int i = 0; i < pageCount; i++)
{
  pageArray[i] = i + 1; // Pages are indexed starting from 1
}

// Initialize error variable
ConvertError error = ConvertError.ERR_UNKNOWN;

// Convert text from specific pages of the PDF document to JSON text
converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, pageArray, ref error);

 

These two examples cover text extraction from an entire document and from selected pages. The second differs from the first only in passing an array of 1-based page numbers to Convert, so the same setup can be reused for both cases.
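
When only some pages are needed, the page array can be built from a range expression instead of a loop over every page. The sketch below is one way to do that under stated assumptions: the PageSelection class and the "1-3,5" syntax are hypothetical conveniences, not part of ComPDFKit. The only convention taken from the article is that page numbers start at 1.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class; it is not part of the ComPDFKit SDK.
public static class PageSelection
{
    // Parses a specification such as "1-3,5" into sorted, unique,
    // 1-based page numbers and rejects pages outside 1..pageCount.
    public static int[] Parse(string spec, int pageCount)
    {
        var pages = new SortedSet<int>();
        foreach (string part in spec.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] bounds = part.Split('-');
            if (bounds.Length > 2)
            {
                throw new FormatException("Invalid page range: " + part.Trim());
            }

            int first = int.Parse(bounds[0].Trim(), CultureInfo.InvariantCulture);
            int last = bounds.Length == 2
                ? int.Parse(bounds[1].Trim(), CultureInfo.InvariantCulture)
                : first;

            if (first < 1 || last > pageCount || first > last)
            {
                throw new ArgumentOutOfRangeException(
                    nameof(spec),
                    "Invalid page range '" + part.Trim() + "' for a " + pageCount + "-page document.");
            }

            for (int page = first; page <= last; page++)
            {
                pages.Add(page);
            }
        }

        var result = new int[pages.Count];
        pages.CopyTo(result);
        return result;
    }
}

// Possible use with the converter from the example above:
//   int[] pageArray = PageSelection.Parse("1-3,5", converter.GetPagesCount());
```

Validating against the page count up front reports an out-of-range page with a clear message before the conversion call is made.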

 

ComPDFKit saves the extracted text as JSON files in the output folder. The IsAllowOCR option controls whether OCR runs during conversion; both examples set it to false, which disables OCR.
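
The examples above do not check whether the conversion produced any output. One SDK-independent way to confirm it is to compare the output folder before and after the call. The sketch below assumes nothing about ComPDFKit beyond the Convert call shown earlier; the OutputCheck class is hypothetical, and the approach only detects newly created files, not files that were overwritten.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper class; it is not part of the ComPDFKit SDK.
public static class OutputCheck
{
    // Runs the supplied action and returns the files that appeared in
    // outputFolderPath (including subfolders) while it ran.
    public static List<string> RunAndListNewFiles(string outputFolderPath, Action convert)
    {
        Directory.CreateDirectory(outputFolderPath);

        var before = new HashSet<string>(
            Directory.GetFiles(outputFolderPath, "*", SearchOption.AllDirectories),
            StringComparer.OrdinalIgnoreCase);

        convert();

        return Directory.GetFiles(outputFolderPath, "*", SearchOption.AllDirectories)
            .Where(path => !before.Contains(path))
            .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}

// Possible use with the converter from the examples above:
//   List<string> created = OutputCheck.RunAndListNewFiles(outputFolderPath, () =>
//       converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error));
//   if (created.Count == 0)
//   {
//       Console.WriteLine("No output was written. Reported status: " + error);
//   }
```

Printing the error variable alongside the file list gives a quick first indication of why nothing was written.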

 

Extract Table from PDF in C#

Table extraction from PDF files enhances data readability and programmability, facilitating streamlined business processes and workflow automation. Converting PDFs into structured JSON data enables seamless integration into various enterprise applications and systems.

 

Below is a code example illustrating the process of extracting tables from PDF files:

string inputFilePath = "***"; // Path to the input PDF file
string outputFolderPath = "***"; // Path to the output folder
string outputFileName = "***"; // Name of the output file

// Create a converter instance for table extraction
CPDFConverterJsonTable converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonTable, inputFilePath) as CPDFConverterJsonTable;

// Specify JSON conversion options
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false; // Disable OCR during conversion
jsonOptions.IsAILayoutAnalysis = false; // Disable AI layout analysis

// Initialize error variable
ConvertError error = ConvertError.ERR_UNKNOWN;

// Convert tables from the PDF document to JSON format
converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);

 

This code extracts the tables from a PDF file and writes them as structured JSON that other applications can consume. Both IsAllowOCR and IsAILayoutAnalysis are set to false here, which disables OCR and AI layout analysis; adjust these options to suit your documents.
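
This article does not describe the layout of the JSON that the converter writes, so a useful first step before integrating it is to print an outline of the property paths it contains. The sketch below does that for any JSON text. It assumes the System.Text.Json package is available to the project (on .NET Framework it is installed from NuGet), and the JsonOutline class is hypothetical rather than part of ComPDFKit.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper class; it is not part of the ComPDFKit SDK.
public static class JsonOutline
{
    // Returns one line per distinct property path with its JSON value kind,
    // for example "$.items[].name: String". Array items share the "[]" path.
    public static List<string> Describe(string json, int maxDepth = 4)
    {
        var lines = new List<string>();
        var seen = new HashSet<string>();
        using (JsonDocument document = JsonDocument.Parse(json))
        {
            Walk(document.RootElement, "$", 0, maxDepth, seen, lines);
        }
        return lines;
    }

    private static void Walk(JsonElement element, string path, int depth,
                             int maxDepth, HashSet<string> seen, List<string> lines)
    {
        string entry = path + ": " + element.ValueKind;
        if (seen.Add(entry))
        {
            lines.Add(entry);
        }
        if (depth >= maxDepth)
        {
            return; // Do not descend further than requested
        }

        if (element.ValueKind == JsonValueKind.Object)
        {
            foreach (JsonProperty property in element.EnumerateObject())
            {
                Walk(property.Value, path + "." + property.Name, depth + 1, maxDepth, seen, lines);
            }
        }
        else if (element.ValueKind == JsonValueKind.Array)
        {
            foreach (JsonElement item in element.EnumerateArray())
            {
                Walk(item, path + "[]", depth + 1, maxDepth, seen, lines);
            }
        }
    }
}

// Possible use on a file written by the converter (the path is a placeholder):
//   foreach (string line in JsonOutline.Describe(System.IO.File.ReadAllText("***")))
//   {
//       Console.WriteLine(line);
//   }
```

Once the outline shows where the table data sits, the same JsonDocument API can read those specific paths directly.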

 

Extract All Content from PDF in C#

Below is the full sample code demonstrating how to extract all content, including text, tables, and images, from a PDF document simultaneously using ComPDFKit:

string inputFilePath = "***"; // Path to the input PDF file
string outputFolderPath = "***"; // Path to the output folder
string outputFileName = "***"; // Name of the output file

// Create a converter instance for extracting text, tables, and images
CPDFConverterJsonPDF converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonPDF, inputFilePath) as CPDFConverterJsonPDF;

// Specify JSON conversion options
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false; // Disable OCR during conversion
jsonOptions.IsAILayoutAnalysis = false; // Disable AI layout analysis

// Initialize error variable
ConvertError error = ConvertError.ERR_UNKNOWN;

// Convert text, tables, and images from the PDF document to JSON format
converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);

 

This code utilizes ComPDFKit to extract text, table, and image content from a PDF document simultaneously. By specifying the appropriate options in the CPDFConvertJsonOptions object, you can customize the conversion process according to your requirements. In this example, OCR and AI layout analysis are disabled to streamline the extraction process.

 

Conclusion

In summary, ComPDFKit supports extracting text, tables, and images from PDF documents into JSON, which enables streamlined data extraction and workflow automation. With the OCR and AI layout analysis options, developers can tailor the extraction process to specific requirements.

 

ComPDFKit provides a free development version, and a free trial for production is available without upfront payment, making it an accessible choice for businesses and developers. Download ComPDFKit to start extracting data from PDF today!