So how do we solve this error?
Step 1: Check whether the PDF document is made up of scanned images, i.e. whether it is a non-digital PDF rather than a digital one.
Step 2: If it is, use pytesseract (OCR) to extract the text from the scanned images.
Step 3: Otherwise, use a standard PDF extraction library such as PyPDF2, pdftotext, or pdfplumber.
Step 4: Enjoy
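The steps above can be sketched as a small dispatcher. This is only a minimal illustration, not a definitive implementation: the function names `extract_with_ocr`, `extract_digital`, and `extract_text` are hypothetical, `is_scanned` is whatever check you use for Step 1, and the use of pdf2image to turn pages into images for pytesseract is an assumption (it is not one of the libraries named above).

```python
def extract_with_ocr(pdf_path):
    """Step 2: OCR every page of a scanned PDF (assumes pdf2image + pytesseract)."""
    # Third-party imports are done lazily so the dispatcher works without them.
    from pdf2image import convert_from_path  # assumption: used to render pages
    import pytesseract

    images = convert_from_path(pdf_path)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(image) for image in images)


def extract_digital(pdf_path):
    """Step 3: read the embedded text of a digital PDF with pdfplumber."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_text(pdf_path, is_scanned, ocr=extract_with_ocr, digital=extract_digital):
    """Pick the extraction method based on the result of the Step 1 check.

    `is_scanned` is a hypothetical callable: path -> bool.
    """
    if is_scanned(pdf_path):
        return ocr(pdf_path)
    return digital(pdf_path)
```

Passing the check and the extractors in as arguments keeps the decision logic separate from the libraries, so either extractor can be swapped for PyPDF2 or pdftotext without touching the dispatcher.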
Digital and Non-Digital PDF segregator in Python.
Every page in a PDF document has a ‘Resources’ property. If ‘Font’ is listed among those resources, the page contains text data; otherwise the page is most likely a scanned image.
The idea is to use the PyPDF2 library to read the ‘Resources’ of every page and check whether a ‘Font’ resource is present. Based on the result, we can call the appropriate function to extract text from the PDF.
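A minimal sketch of that check could look like the following. It is an illustration rather than the original code: it assumes a PyPDF2 version that provides `PdfReader` and `reader.pages`, and the helper names `page_has_font` and `font_flags` are made up for this example. It also only looks at the page's own ‘Resources’ entry, so resources inherited from a parent node are not considered.

```python
def page_has_font(page):
    """Return True if the page's /Resources dictionary lists a /Font entry."""
    if "/Resources" not in page:
        return False
    resources = page["/Resources"]
    # PyPDF2 may hand back an indirect reference; resolve it if possible.
    if hasattr(resources, "get_object"):
        resources = resources.get_object()
    return "/Font" in resources


def font_flags(pdf_path):
    """Return one boolean per page: True where a /Font resource is present."""
    from PyPDF2 import PdfReader  # third-party; imported lazily

    reader = PdfReader(pdf_path)
    return [page_has_font(page) for page in reader.pages]
```

For example, `font_flags("invoice.pdf")` (a hypothetical file name) would return a list such as `[True, True]` for a two-page digital PDF.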
We can also detect whether a PDF document is of mixed type (for example, one page is a scanned image and the next is text).
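One way to sketch the mixed-type detection, assuming you already have one boolean per page saying whether a ‘Font’ resource was found (the function names and the labels "digital", "scanned", "mixed", and "empty" are illustrative choices, not part of any library):

```python
def classify_document(page_has_text):
    """Summarise per-page flags (True = page has a Font resource) into one label."""
    if not page_has_text:
        return "empty"      # no pages at all
    if all(page_has_text):
        return "digital"    # every page carries text
    if not any(page_has_text):
        return "scanned"    # no page carries text
    return "mixed"          # some pages are text, some are images


def scanned_pages(page_has_text):
    """Return the 1-based numbers of the pages that would need OCR."""
    return [number for number, has_text in enumerate(page_has_text, start=1)
            if not has_text]
```

For a mixed document you can then send only the pages returned by `scanned_pages` to pytesseract and read the rest with a normal extraction library.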