Digital and Non-Digital PDF segregator — Python

Kishan Tongrao
2 min read · Jun 20, 2020

Photo by Austin Distel on Unsplash

There is one problem with the usual PDF extraction methods. When we use ordinary libraries such as PyPDF2, pdfplumber, or pdftotext to extract text from PDF documents that contain scanned images, extraction fails: we get an error or empty output.

(PyPDF2 result)

So the question is: how do we solve this problem?

Step 1: Check whether the PDF document is made of scanned images (non-digital PDF) or contains real text (digital PDF).

Step 2: If it is made of scanned images, use pytesseract (OCR) to extract text from them.

Step 3: Otherwise, use a normal PDF extraction library such as PyPDF2, pdftotext, or pdfplumber.

Step 4: Enjoy
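
The steps above can be sketched as a small routing function. This is only an illustration under some assumptions: the function names are made up for this example, the OCR path assumes the pdf2image package (with Poppler) and the Tesseract binary are installed, and the text path assumes PyPDF2 2.x or later.

```python
def ocr_scanned_pdf(path):
    """Step 2: render each page to an image and run OCR on it."""
    # Imported lazily so the routing logic works without these packages.
    from pdf2image import convert_from_path  # assumption: Poppler is installed
    import pytesseract                       # assumption: Tesseract is installed

    images = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(img) for img in images)


def read_digital_pdf(path):
    """Step 3: read the embedded text layer."""
    from PyPDF2 import PdfReader  # assumption: PyPDF2 2.x or later

    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_text(path, is_scanned, ocr=ocr_scanned_pdf, digital=read_digital_pdf):
    """Route the file to OCR or to a normal text extractor."""
    return ocr(path) if is_scanned else digital(path)
```

The `is_scanned` flag is the result of Step 1. Passing the two extractors as arguments makes it easy to swap in pdfplumber or pdftotext for the digital branch.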

Digital and Non-Digital PDF segregator in Python.

segregator.py

Every page of a PDF document has a ‘Resources’ property. If it contains a ‘Font’ entry, the page contains text data; otherwise the page is most likely a scanned image.

In the above code, we use the PyPDF2 library to read the ‘Resources’ of every page and check whether a ‘Font’ resource is present. Based on the result, we can call the appropriate function to extract text from the PDF.
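
A minimal sketch of such a check is shown here. It is not the original segregator.py. It assumes PyPDF2 2.x or later (`PdfReader`, `reader.pages`); the function names `page_has_font` and `pages_with_text` are only illustrative.

```python
def page_has_font(page):
    """Return True if the page's /Resources dictionary has a /Font entry."""
    if "/Resources" not in page:
        return False
    resources = page["/Resources"]
    # PyPDF2 dictionary objects subclass dict, so a plain membership test works.
    return isinstance(resources, dict) and "/Font" in resources


def pages_with_text(path):
    """Return one boolean per page: True when the page has a /Font resource."""
    from PyPDF2 import PdfReader  # lazy import; assumes PyPDF2 2.x or later

    reader = PdfReader(path)
    return [page_has_font(page) for page in reader.pages]
```

Keep in mind this is a heuristic: a scanned PDF that already carries an OCR text layer also lists fonts, so it would be treated as digital.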

We can also find out whether a PDF document is of mixed type (for example, one page is a scanned image and the next page is text).
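
One way to detect the mixed case, assuming we already have one boolean per page (True when the page has a ‘Font’ resource), is sketched here. The function name and the labels it returns are hypothetical.

```python
def classify_pdf(font_flags):
    """Label a document from its per-page 'has font' flags."""
    flags = list(font_flags)
    if not flags:
        return "empty"    # no pages at all
    if all(flags):
        return "digital"  # every page has a font resource
    if not any(flags):
        return "scanned"  # no page has a font resource
    return "mixed"        # some pages are text, some are images


print(classify_pdf([True, False]))  # mixed
```

For a mixed document, each page can then be sent to OCR or to normal extraction individually instead of treating the whole file one way.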

That's it.

Thank you.
