How to Extract PDF Table to JSON by ComPDFKit API

By ComPDFKit | Tue. 12 Nov. 2024

This article walks you step by step through extracting table data from a PDF and converting it into structured JSON using the ComPDFKit API.



Register and Start Your Requests

 

Before starting the extraction process, you need to register for the ComPDFKit API. Once registered, you can obtain your license from the ComPDFKit Dashboard; it is used to authenticate API calls and access the necessary resources. The ComPDFKit API can be called from any language that can make HTTP requests.

 

 

Make HTTP Requests to Extract PDF Tables to JSON 

 

The ComPDFKit API workflow is simple. It consists of four basic requests: create a task, upload a file, execute the task, and download the result file. With these requests you select the PDF tool that extracts the tables in your PDF file and obtain the download link for the resulting JSON file. The steps below also cover authentication, which provides the accessToken that the other requests send.

 

1. Authentication

 

Replace {{public_key}} and {{secret_key}} with the publicKey and secretKey you get from the console. The authentication response returns the accessToken that every later request sends.

 

   import java.io.*;
   import okhttp3.*;

   public class Main {
     public static void main(String[] args) throws IOException {
       OkHttpClient client = new OkHttpClient().newBuilder()
         .build();
       // The body is JSON, so send it as application/json.
       MediaType mediaType = MediaType.parse("application/json");
       RequestBody body = RequestBody.create(mediaType, "{\n    \"publicKey\": \"{{public_key}}\",\n    \"secretKey\": \"{{secret_key}}\"\n}");
       Request request = new Request.Builder()
         .url("https://api-server.compdf.com/server/v1/oauth/token")
         .method("POST", body)
         .build();
       // Print the response body, which carries the accessToken, and close the response.
       try (Response response = client.newCall(request).execute()) {
         System.out.println(response.body().string());
       }
     }
   }
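
The token has to be pulled out of the authentication response before it can be used in later requests. The following is a minimal sketch of that step, assuming the response body is JSON that contains a string field named accessToken somewhere in it; the exact response shape shown in the sample is an assumption for illustration, so check it against a real response. JsonFieldReader is a hypothetical helper, not part of ComPDFKit or OkHttp, and it uses a regular expression only to stay dependency-free.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of ComPDFKit or OkHttp.
final class JsonFieldReader {

    // Returns the value of the first string field with the given name,
    // at any nesting depth. Values containing escaped characters are not matched.
    static Optional<String> readString(String json, String field) {
        Pattern pattern = Pattern.compile(
            "\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"\\\\]*)\"");
        Matcher matcher = pattern.matcher(json);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample body: the field nesting here is an assumption, not the documented format.
        String sample = "{\"code\":\"200\",\"data\":{\"accessToken\":\"abc123\"}}";
        System.out.println(readString(sample, "accessToken").orElse("<missing>"));
    }
}
```

In a real project a JSON library such as Jackson or Gson is the more robust choice. Under the same assumption, the helper could also read the taskId from the Create Task response.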

 

2. Create Task

 

Replace {{accessToken}} with the token obtained in the previous step, and {{language}} with the language in which error messages should be returned. The response data contains the taskId.

 

   import java.io.*;
   import okhttp3.*;

   public class Main {
     public static void main(String[] args) throws IOException {
       OkHttpClient client = new OkHttpClient().newBuilder()
         .build();
       // OkHttp rejects a GET request that carries a body, so none is attached.
       Request request = new Request.Builder()
         .url("https://api-server.compdf.com/server/v1/task/pdf/json?language={{language}}")
         .get()
         .addHeader("Authorization", "Bearer {{accessToken}}")
         .build();
       // Print the response body, which carries the taskId, and close the response.
       try (Response response = client.newCall(request).execute()) {
         System.out.println(response.body().string());
       }
     }
   }

 

3. Upload Files

 

Supply the file you want to convert, then replace {{taskId}} with the taskId obtained in the previous step, {{language}} with the language for error messages, and {{accessToken}} with the token obtained in the first step.

 

To extract tables from the PDF content, set "type":1 in the parameter field. If the tables are contained in images, also pass "isAllowOcr":1.

 

   import java.io.*;
   import okhttp3.*;

   public class Main {
     public static void main(String[] args) throws IOException {
       OkHttpClient client = new OkHttpClient().newBuilder()
         .build();
       RequestBody body = new MultipartBody.Builder().setType(MultipartBody.FORM)
         // "{{file}}" is the file name sent to the server;
         // pass the local path of the PDF to new File(...).
         .addFormDataPart("file","{{file}}",
                          RequestBody.create(MediaType.parse("application/octet-stream"),
                                             new File("")))
         .addFormDataPart("taskId","{{taskId}}")
         .addFormDataPart("language","{{language}}")
         .addFormDataPart("password","")
         // type 1 selects table extraction; isAllowOcr 0 leaves OCR off.
         .addFormDataPart("parameter","{  \"type\":1, \"isAllowOcr\":0, \"isContainOcrBg\":0}")
         .build();
       Request request = new Request.Builder()
         .url("https://api-server.compdf.com/server/v1/file/upload")
         .method("POST", body)
         .addHeader("Authorization", "Bearer {{accessToken}}")
         .build();
       // Print the upload response and close it.
       try (Response response = client.newCall(request).execute()) {
         System.out.println(response.body().string());
       }
     }
   }
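
The parameter form field in the upload request is a hand-escaped JSON string, which is easy to get wrong when toggling options. As a sketch, it could be built from typed arguments instead. ExtractParameter is a hypothetical helper, and it assumes the three keys used in the upload example (type, isAllowOcr, isContainOcrBg) take the integer values 0 and 1 as they do there.

```java
// Hypothetical helper for illustration; not part of the ComPDFKit API.
final class ExtractParameter {

    // Builds the JSON string for the "parameter" form field.
    // Flags are written as 1 or 0, matching the upload example.
    static String build(int type, boolean allowOcr, boolean containOcrBg) {
        return String.format(
            "{\"type\":%d,\"isAllowOcr\":%d,\"isContainOcrBg\":%d}",
            type, allowOcr ? 1 : 0, containOcrBg ? 1 : 0);
    }

    public static void main(String[] args) {
        // Table extraction with OCR enabled for tables inside images.
        System.out.println(build(1, true, false));
        // prints {"type":1,"isAllowOcr":1,"isContainOcrBg":0}
    }
}
```

The result can be passed as the value of addFormDataPart("parameter", ...) in place of the literal string.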

 

4. Process Files

 

Replace {{taskId}} with the taskId obtained in the Create Task step, {{accessToken}} with the token obtained in the first step, and {{language}} with the language for error messages.

 

   import java.io.*;
   import okhttp3.*;

   public class Main {
     public static void main(String[] args) throws IOException {
       OkHttpClient client = new OkHttpClient().newBuilder()
         .build();
       // OkHttp rejects a GET request that carries a body, so none is attached.
       Request request = new Request.Builder()
         .url("https://api-server.compdf.com/server/v1/execute/start?taskId={{taskId}}&language={{language}}")
         .get()
         .addHeader("Authorization", "Bearer {{accessToken}}")
         .build();
       // Print the response to the start request and close it.
       try (Response response = client.newCall(request).execute()) {
         System.out.println(response.body().string());
       }
     }
   }

 

5. Get Task Information

 

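Replace {{taskId}} with the taskId obtained in the Create Task step, and {{accessToken}} with the token obtained in the first step. Once the task has finished, the task information is where you obtain the download link for the resulting JSON file.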

 

   import java.io.*;
   import okhttp3.*;

   public class Main {
     public static void main(String[] args) throws IOException {
       OkHttpClient client = new OkHttpClient().newBuilder()
         .build();
       // OkHttp rejects a GET request that carries a body, so none is attached.
       Request request = new Request.Builder()
         .url("https://api-server.compdf.com/server/v1/task/taskInfo?taskId={{taskId}}")
         .get()
         .addHeader("Authorization", "Bearer {{accessToken}}")
         .build();
       // Print the task information and close the response.
       try (Response response = client.newCall(request).execute()) {
         System.out.println(response.body().string());
       }
     }
   }


Conclusion

 

You have learned how to make HTTP requests to the ComPDFKit API to extract PDF tables to JSON. ComPDFKit also provides API libraries for various programming languages, including Java, PHP, Python, C#.NET, Swift, and C++, so developers can use the API directly from their preferred language.

 

If you're interested in learning about the technologies and challenges related to PDF form extraction, you can refer to this article.