Digital and Non-Digital PDF segregator — Python


There is one problem with the usual PDF text extraction approach. When we use common libraries such as PyPDF2, pdfplumber, or pdftotext to extract text from PDF documents that contain scanned images, we get an error.

(PyPDF2 result)

So the question is: how do we solve this error?

Step 1: Check whether the PDF document contains scanned images (a non-digital PDF) or real text (a digital PDF).

Step 2: If it contains scanned images, use pytesseract (OCR) to extract the text from those images.

Step 3: Otherwise, use a normal PDF extraction library such as PyPDF2, pdftotext, or pdfplumber.

Step 4: Enjoy
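The steps above can be sketched as a small dispatcher. This is only an illustration under a few assumptions: the names extract_text, ocr_extract and plain_extract are hypothetical, PyPDF2 is assumed to be version 2.0 or later (for PdfReader), and the pdf2image package is an assumption added here because pytesseract works on images, so each PDF page has to be rendered to an image first. The Step 1 check is passed in as a function.

```python
def ocr_extract(path):
    """Step 2: OCR every page of a scanned PDF.

    Assumes pdf2image (with Poppler) and pytesseract (with the
    Tesseract binary) are installed; imported lazily on purpose.
    """
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(img) for img in images)


def plain_extract(path):
    """Step 3: read the embedded text layer (assumes PyPDF2 >= 2.0)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may return an empty value for pages without text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text(path, is_scanned, ocr=ocr_extract, plain=plain_extract):
    """Route a PDF to OCR or to plain extraction.

    is_scanned is the Step 1 check: a function that takes the path
    and returns True when the document consists of scanned images.
    """
    return ocr(path) if is_scanned(path) else plain(path)
```

Passing the check and the two extractors in as arguments keeps the routing logic independent of any one library, so pdfplumber or pdftotext could be swapped in for Step 3.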

Digital and Non-Digital PDF segregator in Python.

segregator.py

Every page of a PDF document has a property called ‘Resources’. If the resources contain a ‘Font’ entry, the page contains text data; otherwise, the page most likely contains a scanned image.

In segregator.py, we use the PyPDF2 library to extract the ‘Resources’ of every page and check whether a ‘Font’ resource is present. Based on this result, we can call the appropriate function to extract text from the PDF.
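A minimal sketch of this check is shown here. It is not necessarily the same as segregator.py: the function names are hypothetical, and PyPDF2 version 2.0 or later is assumed (for PdfReader and reader.pages). PyPDF2 page objects behave like dictionaries keyed by names such as "/Resources" and "/Font", so the check itself works on any dictionary-like page.

```python
def page_has_font(page):
    """Return True if the page's /Resources dictionary has a /Font entry."""
    if "/Resources" not in page:
        # Note: resources inherited from a parent node are not handled here.
        return False
    return "/Font" in page["/Resources"]


def classify_pages(pages):
    """Label each page 'text' (has fonts) or 'image' (no fonts)."""
    return ["text" if page_has_font(page) else "image" for page in pages]


def classify_pdf(path):
    """Classify every page of the PDF at path (assumes PyPDF2 >= 2.0)."""
    from PyPDF2 import PdfReader  # imported lazily; only needed for real files

    return classify_pages(PdfReader(path).pages)
```

Keep in mind that this is a heuristic: a scanned PDF that already carries an OCR text layer will also list fonts, so it would be labelled as text.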

We can also find out whether a PDF document is of mixed type (for example, the first page is an image and the second page is text).
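One way to detect the mixed case, sketched under the assumption that we already have one boolean per page saying whether that page has a ‘Font’ resource. The names summarize_pdf and pages_needing_ocr are hypothetical helpers for illustration.

```python
def summarize_pdf(has_font_flags):
    """Summarize per-page font flags as 'digital', 'scanned', 'mixed' or 'empty'."""
    flags = list(has_font_flags)
    if not flags:
        return "empty"      # no pages at all
    if all(flags):
        return "digital"    # every page has a font resource
    if not any(flags):
        return "scanned"    # no page has a font resource
    return "mixed"          # some pages are text, some are images


def pages_needing_ocr(has_font_flags):
    """Return the 0-based indices of pages without a font resource."""
    return [i for i, has_font in enumerate(has_font_flags) if not has_font]
```

For a mixed document, the list from pages_needing_ocr tells us which pages to send to pytesseract, while the remaining pages can go through the normal extraction library.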

That's it.

Thank you.

Data Science Engineer