Text Preprocessing Complete Pipeline

3 min read · Apr 4, 2022

Whenever we work on Natural Language Processing (NLP) tasks, we need to know the common preprocessing steps. In this article we walk through a complete preprocessing pipeline that can be applied to most NLP tasks.

Let’s start with a quote.
“Words may be false and full of art; Sighs are the natural language of the heart” — Thomas Shadwell.

Pipes in Text Preprocessing Pipeline

Convert to Lowercase
Remove Punctuations
Remove Stopwords
Perform Stemming
Perform Lemmatization
Remove Emoji
Remove Emoticon
Convert Emoji
Convert Emoticon
Remove URLs
Remove HTML Tags
Convert Chat Words
Spell Correction

1. Convert to Lowercase

Your model might treat a word with a capital letter differently from the same word in all lowercase characters. Lowercasing is generally the first step of NLP preprocessing.
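A minimal sketch of this step using Python's built-in string method. The function name to_lowercase is our own choice for illustration.

```python
def to_lowercase(text: str) -> str:
    """Return the text with every cased character converted to lowercase."""
    return text.lower()


print(to_lowercase("Natural Language Processing"))
```

After this step, "Apple" and "apple" become the same token. Keep in mind that this also erases distinctions that some tasks need, such as named entities.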

2. Remove Punctuations

Punctuation can confuse models because each mark is encoded as just another character or token, which usually is not helpful for the task.
Punctuation also creates noise that we don't need in most NLP tasks.
Even if you keep it in your data, punctuation is not tied to particular words, so it rarely carries a stable meaning.
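One possible way to strip punctuation with the standard library is shown below. It is only a sketch: string.punctuation covers ASCII punctuation, so typographic quotes and other Unicode marks would need extra handling. The function name remove_punctuation is an assumption made for this example.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters from the text."""
    return text.translate(_PUNCT_TABLE)


print(remove_punctuation("Hello, world! How's it going?"))
```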

3. Remove Stopwords

Stopwords are high-frequency words in a language, such as "the", "is" and "and".
Stopwords carry little lexical content and do not hold much meaning on their own.
Whether to remove them depends on the specific NLP task. For example, in a classification problem we usually want the model to focus more on keywords than on stopwords.
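The idea can be sketched with a small hand-picked stopword set. The set below is an illustrative assumption and far from complete; NLP libraries ship much longer lists per language.

```python
# Tiny, hand-picked stopword set used only for illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on"}


def remove_stopwords(text: str) -> str:
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)


print(remove_stopwords("The cat is on the mat"))
```

Because the comparison uses the lowercase form, "The" and "the" are both removed even if the text has not been lowercased yet.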

4. Stemming

Stemming and Lemmatization are word normalization techniques.
Stemming removes affixes (mostly suffixes) to reduce a word to a simpler base form, the stem, which is not always a dictionary word.
It allows us to recognize that jumping, jumps and jumped are all rooted in the same verb (jump) and thus refer to the same concept.
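To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy, not the Porter algorithm or any other established stemmer, and the suffix list and minimum stem length are assumptions chosen so that the jump example works.

```python
# Suffixes are checked in this order; longer ones come first.
SUFFIXES = ("ing", "ed", "s")


def simple_stem(word: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([simple_stem(w) for w in ["jumping", "jumps", "jumped", "jump"]])
```

A rule this crude makes mistakes on many words (for example, "running" becomes "runn"), which is why real projects normally rely on an established stemmer from an NLP library.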

5. Lemmatization

Lemmatization reduces a word to its dictionary form (the lemma) using vocabulary and part-of-speech information, so the result is always a real word.
So, as verbs, jumps, jumped and jumping are all mapped to the lemma jump, and an irregular form such as better is mapped to good when it is treated as an adjective. Unlike stemming, the result depends on how the word is used in the sentence, which makes lemmatization slower but more accurate.
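Lemmatization looks a word up together with its part of speech and returns the dictionary form. The sketch below imitates that with a tiny hand-written table; the table, the part-of-speech codes ("v", "a", "n") and the function name are all assumptions for illustration. A real lemmatizer relies on a full lexical database instead.

```python
# (word, part of speech) -> lemma. Hand-written entries for illustration only.
LEMMAS = {
    ("jumps", "v"): "jump",
    ("jumped", "v"): "jump",
    ("jumping", "v"): "jump",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the lowercase word if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(lemmatize("jumped", "v"), lemmatize("better", "a"))
```

Note that the part of speech is part of the lookup key: the same surface form can have different lemmas depending on its role in the sentence.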

6. Remove Emoji

An emoji is a small pictographic character that is used to express emotions or ideas in text messages.
For example, 😄 is an emoji.
If you don't want emojis in your data you can remove them, but doing so also removes the information they carry.
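A rough sketch of emoji removal with a regular expression over a few Unicode blocks. The chosen ranges are an assumption and do not cover every emoji (flags and multi-character sequences, for instance, are not handled).

```python
import re

# Character class built from a handful of emoji-heavy Unicode blocks.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]+"
)


def remove_emoji(text: str) -> str:
    """Delete characters that fall inside the ranges listed above."""
    return EMOJI_PATTERN.sub("", text)


print(remove_emoji("Great job 😄!"))
```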

7. Remove Emoticon

An emoticon is a representation of a facial expression built from keyboard characters and punctuation.
‘:)’ is an emoticon that represents a happy face.
If you don't want emoticons in your data you can remove them, but doing so also removes the information they carry.
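Emoticon removal can be sketched by matching a known list of emoticons. The list below is a small assumed sample, not an exhaustive one.

```python
import re

# Small sample of emoticons; a real list would be much longer.
EMOTICONS = [":)", ":-)", ":(", ":-(", ":D", ";)", ":P"]

# Longest first, so ":-)" is tried before shorter alternatives.
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def remove_emoticons(text: str) -> str:
    """Replace known emoticons with a space, then collapse extra whitespace."""
    return " ".join(EMOTICON_PATTERN.sub(" ", text).split())


print(remove_emoticons("Nice work :) see you :-("))
```

This step should run before punctuation removal; otherwise the emoticons are already broken apart and can no longer be recognized. Simple patterns like these can also match inside ordinary text (":D" in "3:Done", for example), so the list should be chosen with care.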

8. Convert Emoji

If you want to preserve the information instead of removing it, you can convert emojis into descriptive words.
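A minimal sketch of the conversion, assuming a hand-written mapping from emoji to a descriptive label. The three entries and their labels are illustrative only.

```python
# Emoji -> descriptive label. Hypothetical sample mapping.
EMOJI_TO_TEXT = {
    "😄": "grinning_face_with_smiling_eyes",
    "😢": "crying_face",
    "👍": "thumbs_up",
}


def convert_emojis(text: str) -> str:
    """Replace each known emoji with its label and normalize whitespace."""
    for emoji_char, label in EMOJI_TO_TEXT.items():
        text = text.replace(emoji_char, f" {label} ")
    return " ".join(text.split())


print(convert_emojis("Great job 😄"))
```

Joining the label with underscores keeps it as a single token, so later steps such as stopword removal do not split it apart.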

9. Convert Emoticon

If you want to preserve the information instead of removing it, you can convert emoticons into descriptive words.
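The same idea works for emoticons. The sketch below assumes a small hand-written mapping and uses a regular expression so that longer emoticons are matched before shorter ones.

```python
import re

# Emoticon -> descriptive label. Hypothetical sample mapping.
EMOTICON_TO_TEXT = {
    ":-)": "happy_face",
    ":)": "happy_face",
    ":-(": "sad_face",
    ":(": "sad_face",
    ";)": "wink",
}

_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_TO_TEXT, key=len, reverse=True))
)


def convert_emoticons(text: str) -> str:
    """Replace each known emoticon with its label and normalize whitespace."""
    replaced = _PATTERN.sub(lambda m: f" {EMOTICON_TO_TEXT[m.group(0)]} ", text)
    return " ".join(replaced.split())


print(convert_emoticons("Nice work :)"))
```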

10. Remove URLs

A URL is usually considered noise because it is just a mix of meaningful and meaningless fragments joined by punctuation, and it rarely adds anything to the text itself.
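A simple regular expression is often enough for this step. The pattern below is an assumption that only covers links starting with http://, https:// or www., and it removes everything up to the next whitespace.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def remove_urls(text: str) -> str:
    """Replace URL-like substrings with a space, then collapse whitespace."""
    return " ".join(URL_PATTERN.sub(" ", text).split())


print(remove_urls("See https://example.com/page?id=1 for details"))
```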

11. Remove HTML Tags

HTML tags are considered noise because they are markup rather than content: tag names and attributes combined with punctuation.
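One way to drop the tags while keeping the visible text is Python's built-in HTML parser. This is a sketch: the class and function names are our own, and the text inside script or style elements would be kept as well.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes and ignore all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def remove_html_tags(text: str) -> str:
    """Return the text content of an HTML fragment without its tags."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


print(remove_html_tags("<p>Hello <b>world</b>!</p>"))
```

Compared with a regular expression such as <.*?>, a parser copes better with attributes that contain a > character and also decodes entities such as &amp;.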

12. Convert Chat Words

In this fast-moving world, we use fewer and fewer words for frequent phrases.
In chat applications, we use abbreviations for such phrases, for example "brb" for "be right back".
If we want to keep that information, we simply convert each abbreviation back to its full form.
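The conversion is a dictionary lookup per token. The abbreviation table below is a short assumed sample; a real table would be considerably larger.

```python
# Abbreviation -> full phrase. Hypothetical sample entries.
CHAT_WORDS = {
    "brb": "be right back",
    "imo": "in my opinion",
    "asap": "as soon as possible",
    "ttyl": "talk to you later",
}


def convert_chat_words(text: str) -> str:
    """Expand whitespace-separated tokens that are known chat abbreviations."""
    expanded = [CHAT_WORDS.get(token.lower(), token) for token in text.split()]
    return " ".join(expanded)


print(convert_chat_words("BRB call me ASAP"))
```

Tokens with punctuation attached (for example "asap!") are not matched by this sketch, so it works best after punctuation has been removed.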

13. Spell Correction

Spell correction is very important because a misspelled word is treated as a different token from the intended word, which makes it close to meaningless for the model.
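A very small sketch of the idea: replace a word with its closest match from a known vocabulary, using difflib from the standard library. The vocabulary, the cutoff value and the function name are assumptions; dedicated spell checkers use much larger dictionaries and word frequencies.

```python
import difflib

# Tiny vocabulary used only for illustration.
VOCABULARY = ["spelling", "correction", "language", "natural", "processing"]


def correct_word(word: str, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the input if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("speling"))
```

The cutoff controls how aggressive the correction is: a lower value fixes more typos but also increases the risk of changing words that were already correct.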

All of the preprocessing steps above are combined in one function, with the option to switch each one on or off.
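To give a feel for what such a function can look like, here is a reduced sketch with only four of the steps. It is not the function from the repository; the name preprocess_text and its parameters are assumptions made for this example.

```python
import re
import string


def preprocess_text(
    text: str,
    lowercase: bool = True,
    remove_urls: bool = True,
    remove_punctuation: bool = True,
    stopwords=None,
) -> str:
    """Apply the enabled steps in a fixed order and return the cleaned text."""
    if lowercase:
        text = text.lower()
    if remove_urls:
        # URLs go first, while their punctuation is still intact.
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if stopwords:
        tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)


print(preprocess_text("Visit https://example.com NOW, it is great!", stopwords={"it", "is"}))
```

The order of the steps matters: URL, emoticon and HTML handling should happen before punctuation removal, because they rely on the punctuation to be recognized.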

Please use the GitHub repository below to get the function and all required files.

GitHub - thetongs/Text-Preprocesssing-Complete-Pipeline

github.com

Thank you!