Deploying Cost-Effective PDF Text Extraction Pipeline using AWS Textract and PyMuPDF

Vachan Anand
8 min read · May 2, 2022


In a world where data is one of the most valuable assets, organisations are looking for ways to tap into information that was previously locked away.

As a consultant, I have had the opportunity to see first-hand how much information sits in formats that employees cannot easily access, or that would require a tremendous amount of time and resources to put to good use.


Problem Statement

As an organisation with millions of PDF files, we would like to tap into the information locked inside them. We want to extract the text from each PDF so that it can be used for:

  • Cataloguing of PDF files, i.e. file classification
  • Searching PDFs by keyword for easy access at the time of need

For the current problem, we are not interested in the structure of the text within each PDF. For instance, if a PDF contains tabular data, we do not need to preserve the relationship between rows and columns. As part of this blog, we are only interested in extracting the text, irrespective of the underlying structure.

The extracted text is good enough to build a document classification model that catalogues PDFs by topic. Additionally, the data could be fed into a system like Elasticsearch to pinpoint PDFs based on search keywords.

Constraints

As with any business problem, there are constraints within which the solution needs to be delivered:

  • Quick extraction of text with minimal human effort
  • Reducing the cost of PDF extraction
  • Automation of extraction process

Solution

Since we are interested in extracting text from PDFs, we will be working on the following solution.

The solution for text extraction is divided into two components.

First, shown in blue in the solution diagram, we have the Native PDF Text component. It is used to extract text from simple PDFs written in a standard, machine-readable format; for instance, we use this component to extract text from research papers downloaded from Springer.

Second, we have the Handwritten PDF Text component, which uses the AWS Textract service to extract text from the PDF. We use this component for more complicated PDFs, such as scanned or photographed documents.

Such a PDF is typically created by first taking a photo and then converting the image into PDF format. The sample PDFs used to test the text extraction can be found here.

Since Textract is a paid service that charges per page processed, splitting the extraction process into the two components mentioned above acts as a big cost-saver, especially when we need to extract text from millions of PDFs (a realistic number for large organisations).

Before we go into the nitty-gritty of the solution, let's start by briefly discussing the two text extraction tools we are using.

PyMuPDF

PyMuPDF is a lightweight library for viewing, rendering and editing PDF, XPS and e-book files. Although it supports many use cases, we use it only for extracting text from PDFs.

The library serves our purpose because it can extract metadata from a PDF file, which we use to determine whether the text can be successfully extracted by the library or not.

The file metadata gives us information such as the document format, creator and producer, as shown in the sketch below.

This is only a very small use of PyMuPDF, which has a vast number of use cases and functionalities, but since it serves our purpose we stick with it.
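
To make this concrete, here is a minimal PyMuPDF sketch; the file name and the word threshold are illustrative assumptions, not values from the pipeline.

```python
import fitz  # PyMuPDF

# Open a sample PDF (hypothetical file name).
doc = fitz.open("sample.pdf")

# Document metadata: format, title, author, creator, producer, etc.
print(doc.metadata)

# Count the words PyMuPDF can extract across all pages.
word_count = sum(len(page.get_text("words")) for page in doc)

# Hypothetical threshold: if native extraction yields too few words,
# the document is probably scanned/handwritten and should go to Textract.
WORDS_THRESHOLD = 50
extractable = word_count >= WORDS_THRESHOLD
print(f"{word_count} words extracted, extractable natively: {extractable}")
```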

Textract

Amazon Textract is a machine learning service that automatically extracts text, handwriting and data from scanned documents using optical character recognition. It offers a range of operations to analyse or extract text from various kinds of documents; however, for our use case we only need Textract text detection, as we are not interested in the structure of the text within the PDF or in any further analysis.

Moreover, Textract supports two kinds of operations:

  • Synchronous: the caller waits until the text extraction/analysis is completed before moving on to the next task
  • Asynchronous: the caller starts the text extraction/analysis job and moves on to the next task; once the job is completed, we call Textract again to fetch the results (see the sketch below)
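
As a rough illustration of the two modes with boto3 (the bucket and object names are placeholders):

```python
import boto3

textract = boto3.client("textract")

# Synchronous: the call blocks and returns the detected text blocks directly.
# Suitable for small, single-page documents.
sync_response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-landing-bucket", "Name": "scan.png"}}
)

# Asynchronous: start a job and fetch the results later. In practice you wait
# until the job status is SUCCEEDED (by polling, or via SNS as in this pipeline).
job = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-landing-bucket", "Name": "scan.pdf"}}
)
result = textract.get_document_text_detection(JobId=job["JobId"])
print(result["JobStatus"])  # IN_PROGRESS / SUCCEEDED / FAILED
```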

NOTE: Textract is a paid service that charges per page processed.

The pricing for using Textract can be found here.

Dataflow:

The flow of data for our current application is as follows

  1. The pipeline is triggered when a file lands in the landing bucket in S3
  2. A trigger notification on the landing bucket sends a message to SQS, indicating a new file needs to be processed
  3. The first lambda function reads the message from SQS and performs the following tasks
  • Checks if the PDF file that landed can be extracted by PyMuPDF (a free Python library). This is based on metrics that we discuss later in this blog.
  • If PyMuPDF can successfully extract the text, the function extracts it and saves it in JSON format to the result S3 bucket.
  • If PyMuPDF cannot successfully extract the text, it triggers a Textract API call to start document text detection.

  4. Textract, on completion of the text extraction process, publishes a message to an SNS topic
  5. The SNS topic is subscribed to by an SQS queue, which receives a message when Textract has successfully completed the extraction process
  6. The SQS queue triggers the second lambda function, which is responsible for fetching the text extracted by Textract and storing it in the result S3 bucket

Application Code:

Infrastructure

Firstly, we start by building the infrastructure required for the solution. We make use of Terraform, an infrastructure-as-code tool that creates resources via code, which makes it easy to manage, maintain and replicate the resources as and when required. Moreover, it lets us version-control our infrastructure, a very desirable feature for large projects.

Although every resource or service used in the solution can be created via the AWS console, it is still recommended to use an infrastructure-as-code tool for the reasons mentioned above.

We build the infrastructure described below using Terraform.

The code for the infrastructure can be found here (github).


We make use of the following services to build the proposed solution

  • Simple Storage Service (S3)
  • Lambda Functions
  • Textract
  • Simple Notification Service (SNS)
  • Simple Queue Service (SQS)
  • Elastic Container Registry (ECR)
  • Identity and Access Management (IAM)

Additionally, we make use of

  • Docker
  • Terraform

Simple Storage Service (S3)

We start by building two S3 buckets.

  • Landing Bucket: This bucket is where our raw PDFs land. It has a trigger notification that starts the PDF extraction process anytime a file lands in (or is created in) this bucket; a sketch of this configuration follows the list.
  • Result Bucket: This bucket stores the processed results in JSON format.
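
In the pipeline this notification is provisioned through Terraform (code linked above); purely as an illustration of what the configuration amounts to, a boto3 equivalent might look like this (bucket name and queue ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Send an event to the SQS queue every time an object is created in the landing bucket.
s3.put_bucket_notification_configuration(
    Bucket="pdf-landing-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:ap-southeast-2:123456789012:pdf-landing-queue",  # placeholder ARN
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```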

Simple Notification Service (SNS)

The Boto3 API for Textract uses an SNS topic to publish a message whenever a text extraction job is completed.

The Boto3 call to start text detection requires a notification channel; a sketch is shown below the note.

Note: The SNS topic name should start with AmazonTextract, as suggested by the documentation here.
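
A minimal sketch of that call is shown below; the bucket, document key, topic ARN and role ARN are placeholders.

```python
import boto3

textract = boto3.client("textract")

response = textract.start_document_text_detection(
    DocumentLocation={
        "S3Object": {"Bucket": "pdf-landing-bucket", "Name": "scanned-document.pdf"}
    },
    NotificationChannel={
        # The topic name must start with "AmazonTextract" (see the note above).
        "SNSTopicArn": "arn:aws:sns:ap-southeast-2:123456789012:AmazonTextractCompleted",
        "RoleArn": "arn:aws:iam::123456789012:role/textract-sns-publish-role",
    },
)
job_id = response["JobId"]  # used later to fetch the results
```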

Simple Queue Service (SQS)

We make use of SQS for the following:

  • Firstly, to deliver a message to a lambda function when a PDF lands in the landing bucket.
  • Secondly, to trigger a lambda function to fetch results from Textract when a text extraction job is completed.

Elastic Container Registry (ECR)

We make use of two repositories for the following purposes:

  • PyMuPDF Extraction: This repository hosts the container image with the code used by the first lambda function to extract text using PyMuPDF.
  • Textract Extraction: This repository hosts the container image with the code used by the second lambda function to fetch the text once the extraction job is completed.

Identity and Access Management (IAM)

IAM, or Identity and Access Management, is a service used to grant permissions so that users and services are authorised to perform certain tasks.

We create two roles to make the current pipeline work:

  1. Lambda Role: This role is used by the lambda functions to perform tasks such as
  • Getting data from S3 or writing data to S3
  • Triggering a Textract job
  • Publishing messages to SNS and SQS

  2. Textract Role: This role authorises Textract to publish messages to SNS.

Lambda Function

Lambda is a serverless computing service in AWS. We make use of lambda functions for the following operations:

PyMuPDF Lambda: The first lambda function performs a number of tasks before using PyMuPDF to extract text:

  • Firstly, since cost saving is one of the primary objectives of the pipeline, the lambda function checks whether the PDF file is extractable by PyMuPDF.

We can customise this condition based on criteria that fit our solution. For demonstration purposes, we use the file metadata extracted by PyMuPDF to identify PDFs that the library can handle, and we use the number of extracted words ('words_threshold') as a second factor to judge the quality of the extraction.

  • Next, if the PDF file is extractable by the library, we extract the text and save it to S3 as a JSON object.
  • In case the file is not extractable by PyMuPDF, we make a call to Textract to start asynchronous text extraction; a simplified sketch of this logic follows the list.
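
Putting the pieces together, a simplified sketch of the first lambda handler is shown below. The bucket names, the word threshold and the SNS/role ARNs are assumptions, the metadata check is reduced to a simple word count for brevity, and error handling is omitted.

```python
import json
import boto3
import fitz  # PyMuPDF

s3 = boto3.client("s3")
textract = boto3.client("textract")

RESULT_BUCKET = "pdf-result-bucket"  # placeholder
WORDS_THRESHOLD = 50                 # placeholder quality threshold


def handler(event, context):
    # The SQS message body carries the S3 event for the newly landed PDF.
    s3_event = json.loads(event["Records"][0]["body"])
    record = s3_event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    # Download the PDF and try native extraction with PyMuPDF.
    pdf_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    text = " ".join(page.get_text() for page in doc)

    if len(text.split()) >= WORDS_THRESHOLD:
        # Native extraction is good enough: save the text as JSON.
        s3.put_object(
            Bucket=RESULT_BUCKET,
            Key=key.replace(".pdf", ".json"),
            Body=json.dumps({"source": key, "text": text}),
        )
    else:
        # Fall back to Textract's asynchronous text detection.
        textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            NotificationChannel={
                "SNSTopicArn": "arn:aws:sns:ap-southeast-2:123456789012:AmazonTextractCompleted",
                "RoleArn": "arn:aws:iam::123456789012:role/textract-sns-publish-role",
            },
        )
```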

Textract Lambda: This lambda function is triggered when the Textract job is completed, and it processes the job result before saving it to the result S3 bucket; a sketch follows.
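
A simplified sketch of the second lambda handler might look like the following; the result bucket name is a placeholder and error handling is omitted.

```python
import json
import boto3

s3 = boto3.client("s3")
textract = boto3.client("textract")

RESULT_BUCKET = "pdf-result-bucket"  # placeholder


def handler(event, context):
    # The SQS body wraps the SNS notification published by Textract.
    sns_envelope = json.loads(event["Records"][0]["body"])
    message = json.loads(sns_envelope["Message"])
    job_id = message["JobId"]
    document = message["DocumentLocation"]["S3ObjectName"]

    # Page through the Textract result and collect all LINE blocks.
    lines, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = textract.get_document_text_detection(**kwargs)
        lines += [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
        next_token = response.get("NextToken")
        if not next_token:
            break

    # Store the extracted text in the result bucket as JSON.
    s3.put_object(
        Bucket=RESULT_BUCKET,
        Key=document.replace(".pdf", ".json"),
        Body=json.dumps({"source": document, "text": " ".join(lines)}),
    )
```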

Application Code:



Vachan Anand

A consultant with an interest in Data Science, Data Engineering and Cloud Technology.