Document’s language identifier — Adv NLP and Python

Kishan Tongrao
Jun 21, 2020
Photo by Dmitry Ratushny on Unsplash

NLP-based applications have to work with many different kinds of input files. So how do we identify the language a document is written in? Here we worked with .png, .jpg, .jpeg, .tif, .htm, .html, .doc, .docx, .pdf, .txt and .msg input documents.

How to identify the language of the input document?

Approach:

Every language has a set of very common words that show up in almost any text written in it. These words are called stopwords. We take a stopword set for each candidate language, extract the words from the input source, and check which language's stopword set has the most words in common with it.
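To make the approach concrete, here is a minimal sketch of the stopword-overlap idea. The tiny stopword lists and the function names `stopword_scores` and `detect_language` are assumptions made up for illustration, not the code from the gist; a real application would likely use much fuller lists, for example from an NLP library.

```python
import re

# Tiny illustrative stopword lists (an assumption for this sketch).
# Real stopword lists contain a hundred or more words per language.
STOPWORDS = {
    "english": {"the", "and", "is", "in", "of", "to", "that", "it", "with", "for"},
    "french": {"le", "la", "les", "et", "est", "dans", "de", "que", "un", "pour"},
    "german": {"der", "die", "das", "und", "ist", "in", "zu", "nicht", "mit", "für"},
    "spanish": {"el", "la", "los", "y", "es", "en", "de", "que", "un", "para"},
}


def stopword_scores(text, stopwords=STOPWORDS):
    """Count how many distinct words of the text appear in each language's set."""
    words = set(re.findall(r"\w+", text.lower()))
    return {lang: len(words & common) for lang, common in stopwords.items()}


def detect_language(text, stopwords=STOPWORDS):
    """Return the language with the largest overlap, or None if nothing matches."""
    scores = stopword_scores(text, stopwords)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(detect_language("The cat is in the house and it is happy"))
    print(detect_language("Le chat est dans la maison et il est content"))
```

Some words belong to more than one language (for example "la" or "in"), which is why the sketch compares the counts across all sets instead of stopping at the first match. Very short texts can still be ambiguous.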

Steps:

Step 1: Read input files from the shared folder.

Step 2: Find the extension of each file.

Step 3: Depending on the file extension, call the respective function or method to extract the text from the file.

Step 4: Find the common words (stopwords) in the extracted text.

Step 5: Compare those words with the predefined stopword sets of the different languages.

Step 6: The language whose set gives the maximum number of matches is the language of that file.
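Steps 1 to 3 can be sketched roughly as below, using only the standard library. This sketch only handles .txt, .htm and .html; images, .pdf, .doc, .docx and .msg files would need extra tools (such as OCR or document parsers) that are not shown here. The names `EXTRACTORS`, `extract_text` and `identify_folder` are hypothetical, and `detect` stands for any function that takes text and returns a language, so this is an illustration and not the code from the gist.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document (script/style text included)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def read_txt(path):
    # Assumes UTF-8; undecodable bytes are skipped instead of raising.
    return Path(path).read_text(encoding="utf-8", errors="ignore")


def read_html(path):
    parser = _TextCollector()
    parser.feed(read_txt(path))
    parser.close()
    return " ".join(parser.parts)


# Step 3: map each supported extension to its text extractor.
EXTRACTORS = {".txt": read_txt, ".htm": read_html, ".html": read_html}


def extract_text(path):
    ext = Path(path).suffix.lower()  # Step 2: find the extension
    if ext not in EXTRACTORS:
        raise ValueError(f"No extractor registered for {ext!r}")
    return EXTRACTORS[ext](path)


def identify_folder(folder, detect):
    """Step 1: read every supported file in the folder and apply `detect` to its text."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in EXTRACTORS:
            results[path.name] = detect(extract_text(path))
    return results
```

Keeping the extractors in a dictionary means that support for another file type can be added by registering one more function, without touching the rest of the loop.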

Python code:

(Python code)

Visit: https://gist.github.com/thetongs/a67ac78954c308a0906ff2673b351152

That’s it, guys.

Thank you.
