Digital and Non-Digital PDF segregator — Python

2 min read · Jun 20, 2020

There is one problem with the usual PDF extraction approach. When we use common libraries such as PyPDF2, pdfplumber, or pdftotext to extract text from PDF documents that contain scanned images, we get an error or no usable text.

(PyPDF2 result)

So the question is: how do we solve this?

Step 1: Check whether the PDF document contains scanned images (that is, whether it is a non-digital PDF).

Step 2: If it does, use pytesseract to extract text from the scanned images.

Step 3: Otherwise, use a normal PDF extraction library such as PyPDF2, pdftotext, or pdfplumber.

Step 4: Enjoy
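The steps above can be sketched as a small dispatcher. This is only a rough illustration, not the original script: the function names are hypothetical, the OCR path assumes the pdf2image package (which needs Poppler) alongside pytesseract, and the plain path assumes a PyPDF2 version that provides `PdfReader` and `extract_text()`.

```python
def extract_with_ocr(pdf_path):
    """OCR path (assumption: pdf2image and pytesseract are installed)."""
    from pdf2image import convert_from_path
    import pytesseract

    # Render every page to an image, then run Tesseract on each one.
    images = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(img) for img in images)


def extract_with_pypdf2(pdf_path):
    """Plain path (assumption: a PyPDF2 release with PdfReader)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty value for pages without text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text(pdf_path, is_scanned,
                 ocr=extract_with_ocr, plain=extract_with_pypdf2):
    """Step 2 or Step 3: pick the extractor based on the Step 1 result."""
    return ocr(pdf_path) if is_scanned else plain(pdf_path)
```

The extractors are passed in as arguments so that either one can be swapped for pdfplumber, pdftotext, or another OCR engine without touching the dispatch logic.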

Digital and Non-Digital PDF segregator in Python.

segregator.py

Every page in a PDF document has a property called ‘Resources’. If it lists ‘Font’ as a resource, the page contains text data; otherwise the page is most likely a scanned image.

The segregator.py script uses the PyPDF2 library to read the ‘Resources’ of every page and check whether a ‘Font’ resource is present. Based on the result, we can call the appropriate function to extract text from the PDF.
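A minimal sketch of this check might look like the following. It is a reconstruction under assumptions rather than the original segregator.py: the function names are hypothetical, and it assumes a PyPDF2 release that exposes `PdfReader` and `reader.pages`. Note that the check is a heuristic; a scanned PDF that already carries an OCR text layer will also list fonts.

```python
def page_has_font(page):
    """Return True if the page's /Resources dictionary has a /Font entry."""
    try:
        resources = page["/Resources"]
    except KeyError:
        # No /Resources on the page object itself: treat as "no font".
        return False
    return "/Font" in resources


def font_flags(pdf_path):
    """Return one boolean per page: True for text pages, False otherwise."""
    # Imported lazily so page_has_font() can be used without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page_has_font(page) for page in reader.pages]
```

Calling `font_flags("sample.pdf")` (a hypothetical file name) gives a list such as one `True` or `False` per page, which can then drive the choice between OCR and plain extraction.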

We can also detect whether a PDF document is of mixed type (for example, one page is a scanned image and the next page is text).
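One way to label a whole document, assuming you already have one boolean per page (True when the page lists a ‘Font’ resource), is sketched below. The function name and the label strings are illustrative choices, not part of any library.

```python
def classify_document(page_flags):
    """Label a PDF from per-page flags (True = page has a /Font resource)."""
    flags = list(page_flags)
    if not flags:
        return "empty"      # no pages to inspect
    if all(flags):
        return "digital"    # every page carries text
    if not any(flags):
        return "scanned"    # no page carries text
    return "mixed"          # some pages are text, some are images


# Example: first page is a scan, second page is text.
print(classify_document([False, True]))  # prints "mixed"
```

For a mixed document, the per-page flags can be used to send only the image pages through OCR.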

That's it.

Thank you.

Written by Kishan Tongrao

Data Science Engineer
