Text Preprocessing Complete Pipeline

Kishan Tongrao
3 min read · Apr 4, 2022

Whenever we work on Natural Language Processing (NLP) tasks, we need to know the common text preprocessing steps. In this article we walk through a complete preprocessing pipeline that can be applied to most NLP tasks.

Let’s start with a quote.
“Words may be false and full of art; Sighs are the natural language of the heart” — Thomas Shadwell.

Photo by Nathan Dumlao on Unsplash

Pipes in Text Preprocessing Pipeline

  1. Convert to Lowercase
  2. Remove Punctuations
  3. Remove Stopwords
  4. Perform Stemming
  5. Perform Lemmatization
  6. Remove Emoji
  7. Remove Emoticon
  8. Convert Emoji
  9. Convert Emoticon
  10. Remove URLs
  11. Remove HTML Tags
  12. Convert Chat Words
  13. Spell Correction

1. Convert to Lowercase

  • Your model might treat a capitalized word differently from the same word written in all lowercase characters. Lowercasing is generally the first step of NLP preprocessing.
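As a minimal sketch of this step, the function below lowercases a string with Python's built-in `str.lower`. The function name `to_lowercase` is only an illustrative choice, not something taken from the author's repository.

```python
def to_lowercase(text: str) -> str:
    """Return the text with every cased character converted to lowercase."""
    return text.lower()


print(to_lowercase("Natural Language Processing"))  # natural language processing
```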

2. Remove Punctuations

  • Punctuation can confuse models because each mark is just another encoded (Unicode) character, which by itself is rarely helpful.
  • Punctuation also adds noise that most NLP tasks do not need.
  • Even if you keep punctuation in your data, it is not tied to any fixed set of words, so it ends up carrying little meaning.
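One possible way to strip punctuation, sketched with the standard library only, is a translation table built from `string.punctuation`. Note the assumption: this covers ASCII punctuation only, so typographic quotes or other Unicode marks would need to be added separately.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Remove ASCII punctuation; other characters are left untouched."""
    return text.translate(PUNCT_TABLE)


print(remove_punctuation("Hello, world! How's it going?"))  # Hello world Hows it going
```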

3. Remove Stopwords

  • Stopwords are high-frequency words present in a language.
  • Stopwords carry little lexical content and contribute little meaning on their own.
  • Whether to remove them depends on the specific NLP task. In a classification problem, for example, we usually want the model to focus on keywords rather than stopwords.
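The idea can be sketched with a small hand-written stopword set. The `STOPWORDS` set below is a made-up sample for illustration; a real pipeline would more likely load a fuller list from an NLP library.

```python
# Tiny illustrative stopword set (an assumption, not a complete list).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "this"}


def remove_stopwords(text: str) -> str:
    """Drop whitespace-separated tokens that appear in STOPWORDS."""
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)


print(remove_stopwords("this is an example of the pipeline"))  # example pipeline
```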

4. Stemming

  • Stemming and Lemmatization are word normalization techniques.
  • Stemming strips prefixes and suffixes to reduce a word to a simpler base form (the stem).
  • It allows us to recognize that jumping, jumps, and jumped are all rooted in the same verb (jump) and thus refer to similar concepts.
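To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy, not the Porter algorithm or any library's stemmer: the `SUFFIXES` tuple and the minimum stem length of three characters are assumptions chosen just to handle the jump example.

```python
# Suffixes to strip, checked in this order (illustrative assumption).
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["jumping", "jumps", "jumped", "jump"]])
# ['jump', 'jump', 'jump', 'jump']
```

A rule set this small will mis-stem many words, which is why established stemmers use longer, ordered rule lists.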

5. Lemmatization

  • Lemmatization maps each word to its dictionary form (lemma) using vocabulary and morphological analysis, rather than just stripping affixes.
  • So jumps, jumped, and jumping all map to the lemma jump, while the analysis can still tell that jumps is present tense, jumped is past tense, and jumping is the continuous form.
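The sketch below is a toy dictionary lookup, not a real lemmatizer: the `LEMMA_TABLE` entries and the form labels are invented for illustration. It shows how a lookup can return the dictionary form of a word while still recording which grammatical form was seen.

```python
# Hypothetical lookup table: inflected form -> (lemma, grammatical form).
LEMMA_TABLE = {
    "jump": ("jump", "present"),
    "jumps": ("jump", "present"),
    "jumped": ("jump", "past"),
    "jumping": ("jump", "continuous"),
    "better": ("good", "comparative"),
}


def analyze(word: str) -> tuple:
    """Return (lemma, form); unknown words map to themselves."""
    key = word.lower()
    return LEMMA_TABLE.get(key, (key, "unknown"))


def lemmatize(word: str) -> str:
    """Return only the lemma of the word."""
    return analyze(word)[0]


print([lemmatize(w) for w in ["jumps", "jumped", "jumping", "better"]])
# ['jump', 'jump', 'jump', 'good']
print(analyze("jumped"))  # ('jump', 'past')
```

Unlike a suffix-stripping stemmer, a lookup like this can handle irregular forms such as better, at the cost of needing a vocabulary.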

6. Remove Emoji

  • An emoji is a small pictograph used to express emotions or ideas in text messages.
  • For example, 😄 is an emoji.
  • If you don't want emojis in your data you can remove them, but doing so also removes the information they carry.
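A common way to drop emojis is a regular expression over Unicode code point ranges. The ranges below are an assumption covering several well-known emoji blocks; they are not exhaustive, and emoji sequences built with joiners or modifiers may leave fragments behind.

```python
import re

# Assumed ranges for common emoji blocks; not a complete list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]+"
)


def remove_emoji(text: str) -> str:
    """Delete every character that falls in the ranges above."""
    return EMOJI_PATTERN.sub("", text)


print(remove_emoji("Great job \U0001F604!"))  # Great job !
```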

7. Remove Emoticon

  • An emoticon represents a facial expression using keyboard characters and punctuation.
  • ‘:)’ is an emoticon that represents a happy face.
  • If you don't want emoticons in your data you can remove them, but doing so also removes the information they carry.
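Emoticon removal can be sketched as a regular expression built from a known list. The `EMOTICONS` list here is a short made-up sample; a real list would be much longer, and short patterns such as `:D` can also match inside unrelated text.

```python
import re

# Small illustrative list of emoticons (an assumption, not exhaustive).
EMOTICONS = [":)", ":-)", ":(", ":-(", ":D", ";)", ":P"]

# Escape each emoticon and try longer ones first.
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def remove_emoticons(text: str) -> str:
    """Delete listed emoticons, then collapse leftover whitespace."""
    cleaned = EMOTICON_PATTERN.sub("", text)
    return " ".join(cleaned.split())


print(remove_emoticons("Nice work :) see you :-("))  # Nice work see you
```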

8. Convert Emoji

  • If you want to preserve the information instead of removing it, you can convert emojis into descriptive words.
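A minimal sketch of this conversion uses a lookup table from emoji to a text label. The `EMOJI_TO_TEXT` mapping and its labels are hypothetical; dedicated emoji libraries ship far larger mappings.

```python
# Hypothetical mapping from emoji to a descriptive label.
EMOJI_TO_TEXT = {
    "\U0001F604": "smiling_face",
    "\U0001F622": "crying_face",
    "\U0001F44D": "thumbs_up",
}


def convert_emoji(text: str) -> str:
    """Replace each known emoji with its label, padded by spaces."""
    for symbol, label in EMOJI_TO_TEXT.items():
        text = text.replace(symbol, f" {label} ")
    # Collapse the extra spaces introduced by the padding.
    return " ".join(text.split())


print(convert_emoji("Great job\U0001F604\U0001F44D"))
# Great job smiling_face thumbs_up
```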

9. Convert Emoticon

  • If you want to preserve the information instead of removing it, you can convert emoticons into descriptive words.
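The same lookup idea works for emoticons. In this sketch the `EMOTICON_TO_TEXT` dictionary and its labels are illustrative assumptions, and a single regular expression substitutes every match with its label.

```python
import re

# Hypothetical mapping from emoticon to a descriptive label.
EMOTICON_TO_TEXT = {
    ":)": "happy_face",
    ":-)": "happy_face",
    ":(": "sad_face",
    ":D": "laughing_face",
    ";)": "winking_face",
}

EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_TO_TEXT, key=len, reverse=True))
)


def convert_emoticons(text: str) -> str:
    """Replace each listed emoticon with its label."""
    return EMOTICON_PATTERN.sub(lambda m: EMOTICON_TO_TEXT[m.group(0)], text)


print(convert_emoticons("Nice work :)"))  # Nice work happy_face
```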

10. Remove URLs

  • A URL is usually considered noise because it is a mix of meaningful and meaningless words joined by punctuation.
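URL removal is often done with a simple regular expression. The pattern below is an assumption that only catches tokens starting with `http://`, `https://`, or `www.`, and it will also swallow punctuation attached to the end of a URL.

```python
import re

# Matches tokens beginning with http://, https://, or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def remove_urls(text: str) -> str:
    """Delete URL-like tokens, then collapse leftover whitespace."""
    cleaned = URL_PATTERN.sub("", text)
    return " ".join(cleaned.split())


print(remove_urls("Read more at https://example.com/docs today"))
# Read more at today
```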

11. Remove HTML Tags

  • HTML tags are considered noise because they are markup rather than content: tag names and attributes combined with punctuation, which carry no meaning for the text itself.
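One way to strip tags without third-party packages is the standard library's `html.parser`. The sketch below keeps only the text between tags; the helper names are illustrative, and note that it would also keep the contents of `<script>` or `<style>` elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text data found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_html_tags(text: str) -> str:
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(remove_html_tags("<p>Hello <b>world</b></p>"))  # Hello world
```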

12. Convert Chat Words

  • In this fast-moving world, we use fewer and fewer words for frequent phrases.
  • In chat applications, we use abbreviations for common phrases.
  • To make use of the information in these abbreviations, convert them back to their full forms.
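Expanding chat abbreviations can be sketched as a dictionary lookup per token. The `CHAT_WORDS` table is a small made-up sample, and this version does not handle abbreviations with punctuation attached.

```python
# Hypothetical abbreviation table (a real one would be much larger).
CHAT_WORDS = {
    "brb": "be right back",
    "imo": "in my opinion",
    "asap": "as soon as possible",
    "btw": "by the way",
}


def convert_chat_words(text: str) -> str:
    """Replace each known abbreviation with its full phrase."""
    return " ".join(CHAT_WORDS.get(word.lower(), word) for word in text.split())


print(convert_chat_words("BRB need this ASAP"))
# be right back need this as soon as possible
```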

13. Spell Correction

  • Spell correction is important because a misspelled word is effectively meaningless to the model: it will not match the correctly spelled word in the vocabulary.
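As a rough illustration, misspelled words can be mapped to the closest entry of a known vocabulary with `difflib.get_close_matches` from the standard library. The `VOCABULARY` list and the `cutoff` of 0.8 are assumptions; dedicated spell checkers use much larger dictionaries and word frequencies.

```python
from difflib import get_close_matches

# Tiny illustrative vocabulary (an assumption, not a real dictionary).
VOCABULARY = ["spelling", "correction", "language", "processing", "natural"]


def correct_word(word: str, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in VOCABULARY:
        return word
    matches = get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_spelling(text: str) -> str:
    """Apply correct_word to every whitespace-separated token."""
    return " ".join(correct_word(word) for word in text.split())


print(correct_spelling("natural langauge procesing"))
# natural language processing
```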

All of the preprocessing steps above are combined into one function, with an option to enable or disable each step.
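To illustrate what such a function might look like, here is a reduced sketch with only four of the steps behind boolean flags. It is not the function from the author's repository: the name `preprocess`, its parameters, and the sample stopword set are all assumptions.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)
# Tiny illustrative stopword set (an assumption).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "at", "and", "to"}


def preprocess(text, lowercase=True, remove_urls=True,
               remove_punctuation=True, remove_stopwords=False):
    """Apply the enabled steps in a fixed order and return the cleaned text."""
    if remove_urls:
        # URLs go first, before punctuation removal breaks them apart.
        text = URL_PATTERN.sub(" ", text)
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(PUNCT_TABLE)
    words = text.split()
    if remove_stopwords:
        words = [w for w in words if w.lower() not in STOPWORDS]
    return " ".join(words)


print(preprocess("The docs are at https://example.com, OK?", remove_stopwords=True))
# docs ok
```

The order of the steps matters: removing punctuation before URLs, for example, would leave URL fragments that the URL pattern no longer matches.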

Please use the GitHub repository below to get the function and all required files.

Thank you!
