Pre-processing of Text Data in Machine Learning

Natural language processing (NLP) refers to the branch of Artificial Intelligence concerned with interactions between computers and human languages such as English and Hindi. NLP enables computers to understand natural language as humans do. It has many applications in the business sector, such as language translation, document summarization, sentiment analysis, virtual assistants (like Siri and Cortana), and many more.

Text is also a kind of data, but pre-processing it is one of the trickiest and most tedious parts of working on an NLP project. However, one cannot work on raw data: text pre-processing, executed properly, ensures optimal results. Fortunately, Python has excellent support for NLP through libraries such as NLTK, spaCy, and Gensim that ease our text analysis.

By the end of this article, you will know:

  • Hands-on work with a real sentiment analysis dataset.
  • Pre-processing techniques for cleaning the text data.
  • Extraction of useful information from the pre-processed text.
  • Exploratory analysis of text data.

Let’s start with text pre-processing first!

Text Pre-processing

Why do we need to clean the text? Unlike humans, machines cannot make sense of unstructured text, and therefore it becomes necessary to clean the text data before feeding it to any machine learning algorithm. To understand the concept better, let’s follow the “learning by doing” strategy. In this blog, we will be using the Coronavirus Tweets NLP Text Classification dataset for demonstration.

Let’s start by loading the data!

import pandas as pd

tweets = pd.read_csv('Corona_NLP_train.csv')
tweets.head()  # inspect the first few rows of the DataFrame

Data visualization for given data

For this blog, we are only concerned with the columns containing the raw tweet text and the sentiment labels. We can drop the remaining columns and rename the ones we keep for clarity!

tweets = tweets[['OriginalTweet', 'Sentiment']] #extraction
tweets.columns = ['Text', 'Sentiment'] #renaming

We need to design a pre-processing pipeline: a sequence of steps, each of which gradually cleans our unstructured text.
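As a rough sketch of what such a pipeline looks like (the function names and the sample string here are my own; the following sections apply these same steps directly on the DataFrame):

```python
import re

def lowercase(text):
    # Step 1: normalize case
    return text.lower()

def remove_links(text):
    # Step 2: drop hyperlinks
    return re.sub(r"http\S+", "", text)

def remove_punct(text):
    # Step 3: replace any run of non-alphanumeric characters with a space
    return re.sub(r"[^A-Za-z0-9]+", " ", text)

def preprocess(text, steps=(lowercase, remove_links, remove_punct)):
    # Apply each cleaning step in sequence
    for step in steps:
        text = step(text)
    return text.strip()

print(preprocess("Check THIS out: https://example.com !!"))  # check this out
```

Each step is a small pure function, so steps can be added, removed, or reordered without touching the rest of the pipeline.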

Lowercase all the tweets

The first step is to transform the tweets into lowercase to maintain consistency during the NLP tasks and text mining. For example, ‘Virus’ and ‘virus’ would be treated as two different words in any sentence, so we lowercase all words in the tweets to prevent this duplication.

tweets['Text'] = tweets['Text'].str.lower()

Lowercased data

Remove Hyper-Links

Hyperlinks are very common in tweets and don’t add any additional information. Whether to preserve them depends on the problem statement, but for sentiment analysis, let’s remove them!

tweets['Text'] = tweets['Text'].str.replace(r"http\S+", "", regex=True)

Removed hyperlinks from data

Remove Punctuations

For most NLP problems, punctuation does not provide additional information about the language, so we generally drop it. Punctuation symbols are not crucial for sentiment analysis either; removing them before text modeling is highly recommended.

tweets['Text'] = tweets['Text'].str.replace('[^A-Za-z0-9]+',' ', regex=True)

Removed punctuations from data

Remove Stopwords

What are the stopwords?

Stopwords are common English words, like the, he, and have, that do not add much meaning to a sentence and can be safely removed without sacrificing its meaning. Stopwords are among the most frequently occurring words in any paragraph, yet they contribute little to the meaning of sentences.

Let’s remove the stopwords from the text.

import nltk
from nltk.corpus import stopwords

# NLTK provides a standard list of stopwords for English
nltk.download('stopwords')
stopwords = stopwords.words('english')
tweets['Text'] = tweets['Text'].apply(lambda text: ' '.join(word for word in text.split() if word not in stopwords))

Removed stopwords from data

Spelling Corrections

These days, text editors are smart enough to correct your documents, yet spelling mistakes remain widespread in text data, and tweets are no exception. Fortunately, misspelled words can be handled with the help of the textblob library.

from textblob import TextBlob

# Note: correct() is slow; on a large corpus, consider applying it to a sample
tweets['Text'] = tweets['Text'].apply(lambda x: str(TextBlob(x).correct()))


Tokenization

Tokenization breaks paragraphs down into sentences and sentences into words. These pieces are called tokens (either word tokens or sentence tokens), and they help in understanding the context and building a vocabulary. Tokenization works by splitting text on spaces or punctuation.

Tokenization explanation

import nltk
nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models

word_data = "Enjoyalgorithms is a nice platform for computer science education."
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)
# Output
['Enjoyalgorithms', 'is', 'a', 'nice', 'platform', 'for', 'computer', 'science', 'education', '.']

Stemming and Lemmatization

Stemming and Lemmatization are commonly used in NLP when developing search engines, extracting keywords, and grouping similar words together.

Stemming vs Lemmatization

Both processes aim to reduce a word to its base or root form. However, the two methods follow very different approaches.

  • Stemming works by slicing off the end of a word, using a list of common suffixes like -ing, -ed, and -es. This slicing succeeds on most occasions, but not always. Pros: fast, even on large datasets. Cons: may produce meaningless words.
  • Lemmatization relies on linguistic analysis of the word. It requires detailed dictionaries that the algorithm can look through to link a word form to its lemma. Because it draws on linguistic knowledge of each particular word, lemmatization is usually preferred over stemming. Pros: preserves the meaning of the extracted root word. Cons: computationally expensive.

Note: Lemmatization is almost always preferred over stemming unless we need super-fast execution on a massive corpus of text data.

Applying tokenization and Lemmatization to tweets:

import nltk
nltk.download('wordnet')
nltk.download('punkt')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

tweets['lemmatized_tokens'] = tweets['Text'].apply(lemmatize_text)

Lemmatization of data

We covered the text pre-processing section with this, and now we’re ready to draw insights from our data.

Text Exploratory Analysis

Let’s start by analyzing the text length for different sentiments. Create a new column holding the number of words in each tweet.

tweets['word_length'] = tweets['Text'].str.split().str.len()

Text length analysis

Our objective is to explore the distribution of the tweet length for different sentiments.

import seaborn as sns
import matplotlib.pyplot as plt

cmap = ["red", "green", "blue"]
labels = ["Neutral", "Positive", "Negative"]
for label, clr in zip(labels, cmap):
    sns.kdeplot(tweets.loc[tweets['Sentiment'] == label, 'word_length'], color=clr, fill=True, label=label)
plt.xlabel('Text Length')
plt.legend()
plt.show()

Text distribution visualization

From the above distribution plot, one can conclude that Neutral tweets have a shorter average text length than Positive and Negative tweets.
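To back the visual impression with numbers, a simple groupby gives the mean word count per sentiment. The sketch below runs on a tiny hypothetical sample instead of the real dataset so that it is self-contained; on the actual tweets DataFrame, the same groupby call applies unchanged:

```python
import pandas as pd

# Hypothetical mini-sample standing in for the real tweets DataFrame
sample = pd.DataFrame({
    'Text': ['stay home stay safe everyone',
             'panic buying has emptied every store shelf in town',
             'ok'],
    'Sentiment': ['Positive', 'Negative', 'Neutral'],
})
sample['word_length'] = sample['Text'].str.split().str.len()

# Mean number of words per sentiment class
print(sample.groupby('Sentiment')['word_length'].mean())
```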

Visualizing the most frequent words

We are also interested in the words (other than stopwords) that appear most frequently across the tweets. Let’s find those words!

import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Flatten the per-tweet token lists into one list of tokens
lemmatized_tokens = list(tweets["lemmatized_tokens"])
token_list = list(itertools.chain(*lemmatized_tokens))
counts_no = collections.Counter(token_list)
clean_tweets = pd.DataFrame(counts_no.most_common(30),
                            columns=['words', 'count'])

fig, ax = plt.subplots(figsize=(8, 8))
clean_tweets.sort_values(by='count').plot.barh(x='words', y='count', ax=ax)
ax.set_title("Most Frequently Used Words in Tweets")
plt.show()

Most frequently used words in data

Since our tweets belong to the pandemic timeline, the resultant word frequency plot is pretty intuitive.


Word Cloud

A word cloud is a cluster of words displayed in different sizes: the bigger and bolder a word appears, the more often it occurs within the given text data, and the more important it is.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all the cleaned tweets into a single string
all_words = ' '.join(tweets['Text'])
wordcloud = WordCloud(width=1200, height=800,
                      background_color='white',
                      stopwords=set(stopwords),
                      min_font_size=10).generate(all_words)

plt.figure(figsize=(15, 8))
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

WordCloud visualization

Visualization of Sentimental words

We highly recommend the scattertext library for visualizing sentimental words. In the attached plot, one can look for the words that describe the sentiment of a sentence. It arranges words based on their frequency in a document and, at the same time, clusters them with their corresponding sentiment. In our case, red dots represent the cluster of positive words and blue dots the cluster of negative words; words in the yellow cluster are close to neutral sentiment.

import scattertext as st
from IPython.display import IFrame

# Keep only the strongly polarized tweets for a clearer contrast
tweets = tweets.loc[tweets['Sentiment'].isin(['Extremely Negative', 'Extremely Positive'])]

# scattertext expects a parsed column; whitespace_nlp_convenience is a lightweight parser
tweets = tweets.assign(parsed=tweets['Text'].apply(st.whitespace_nlp_convenience))

corpus = st.CorpusFromParsedDocuments(tweets.iloc[:10000, :],
                                      category_col='Sentiment',
                                      parsed_col='parsed').build()
html = st.produce_scattertext_explorer(corpus,
                                       category='Extremely Negative',
                                       category_name='Extremely Negative',
                                       not_category_name='Extremely Positive',
                                       width_in_pixels=1000)
file_name = 'Sentimental Words Visualization.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width=1000, height=700)

visualization of sentiment words

That covers the first part of text pre-processing. We have cleaned and visualized our data, but don’t forget: computers don’t understand English; they only understand numbers. So there must be a way to convert this cleaned text into a machine-readable format, and that’s where word embeddings come into the picture. We will learn about word embeddings in the next part.

Possible Interview Questions on this topic

Text pre-processing is one of the most important topics in NLP. If we list it on our resume and interview for NLP engineer or data scientist positions, we must know the pre-processing steps well. Possible questions include:

  1. What are the initial steps one should take to pre-process the text?
  2. What is the difference between lemmatization and stemming? When do we use lemmatization and when stemming?
  3. Why is the removal of stopwords essential?
  4. What are the different ways using which we can do the text exploration?
  5. What are the text visualization steps?


In this article, we covered one of the most crucial steps in Natural Language Processing: text pre-processing. We implemented the text-cleaning steps one by one in Python on the Covid-19 tweet sentiment analysis data for hands-on practice. Moving on, we visualized the hidden trends and significant words in the corpus! We hope you enjoyed the article.

Enjoy Learning! Enjoy Texting! Enjoy Algorithms!


© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.