Pre-processing of Text Data in Machine Learning

Natural language processing (NLP) is a branch of artificial intelligence that deals with the interactions between computers and human languages, such as English, Hindi, and others. NLP allows computers to understand and process human language in a way that is similar to how humans do. It has many practical applications in the business world, including language translation, document summarization, sentiment analysis, and the development of virtual assistants like Siri and Cortana.

Text is a form of data, but preprocessing it can be a challenging and time-consuming task when working on an NLP project. Preprocessing turns raw data into a form you can actually work with and can greatly improve the results of your analysis. Fortunately, Python has several NLP libraries, such as NLTK, spaCy, and Gensim, that can assist with text analysis and make preprocessing easier. It is important to properly preprocess your text data in order to achieve optimal results.

Concepts covered in this article

  • Hands-on work with a real sentiment analysis dataset.
  • Pre-processing techniques for cleaning the text data.
  • Extraction of useful information from the pre-processed text.
  • Exploratory analysis of text data.

Let's start with text pre-processing first!

Text Preprocessing

Why do we need to clean the text? Unlike humans, machines lack an understanding of the unstructured text, so cleaning the text data is necessary before feeding it to any machine learning algorithm. To understand the concept better, let's follow the "learning by doing" strategy. In this blog, we will be using the Coronavirus Tweets NLP Text Classification dataset for demonstration.

Let's start by loading the data!

import pandas as pd
tweets = pd.read_csv('Corona_NLP_train.csv')
print(tweets.head())
# This will print the head of dataframe
      UserName ScreenName Location  TweetAt        OriginalTweet                                     Sentiment
0      3799      48751     London   16-03-2020   @MeNyrbie @Phil Gahan @Chrisitv https://t.co/i.      Neutral
1      3800      48752       UK     16-03-2020   advice Talk to your neighbours family to excha.      Positive
2      3801      48753   Vagabonds  16-03-2020   Coronavirus Australia: Woolworths to give elde.      Positive
3      3802      48754      NaN     16-03-2020   My food stock is not the only one which is emp.      Positive
4      3803      48755      NaN     16-03-2020   Me, ready to go at supermarket during the #COV.  Extremely Negative

For this blog, we are only concerned with the column of unstructured textual tweets and the Sentiment column. We can drop the remaining columns and rename the ones we keep for clarity.

tweets = tweets[['OriginalTweet', 'Sentiment']] #extraction
tweets.columns = ['Text', 'Sentiment'] #renaming
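
Before cleaning, it is worth a quick look at how the sentiment labels are distributed. A simple value_counts call shows the classes present (the exact counts depend on the dataset version, so treat the output as a quick check).

print(tweets['Sentiment'].value_counts())
# Shows five classes: Positive, Negative, Neutral, Extremely Positive, Extremely Negative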

We will design a pre-processing pipeline (a sequence of processing steps), where we gradually clean our unstructured text at each step.

Lowercase all the tweets

The first step is transforming the tweets into lowercase to keep the text consistent across the NLP tasks and text mining. For example, 'Virus' and 'virus' would otherwise be treated as two different words in a sentence, so we lowercase all the words in the tweets to prevent this duplication.

tweets['Text'] = tweets['Text'].str.lower()
tweets.head()
                                 Text                       Sentiment
 ------------------------------------------------------------------------
0     @menyrbie @phil_gahan @chrisitv https://t.co/i...      Neutral
1     advice talk to your neighbours family to excha...      Positive
2     coronavirus australia: woolworths to give elde...      Positive
3     my food stock is not the only one which is emp...      Positive
4     me, ready to go at supermarket during the #cov...   Extremely Negative

Remove Hyper-Links

Hyperlinks are very common in tweets but add little information for sentiment analysis. For other problem statements we might need to preserve them, so whether to remove hyperlinks depends on the task at hand. For sentiment analysis, let's remove them!

tweets['Text'] = tweets['Text'].str.replace(r"http\S+", "", regex=True)
tweets.head()
                         Text                                Sentiment
 ------------------------------------------------------------------------
0     @menyrbie @phil gahan @chrisitv and and                 Neutral
1    advice talk to your neighbours family to excha...        Positive
2    coronavirus australia: woolworths to give elde...        Positive
3    my food stock is not the only one which is emp...        Positive
4    me, ready to go at supermarket during the #cov...    Extremely Negative

Remove Punctuations

For most NLP problems, punctuation does not carry additional linguistic information, so we generally drop it. Punctuation symbols are not crucial for sentiment analysis either; they are redundant, and removing them before text modeling is highly recommended.

tweets['Text'] = tweets['Text'].str.replace('[^A-Za-z0-9]+',' ', regex=True)
tweets.head()
                         Text                                 Sentiment
 ------------------------------------------------------------------------
0      menyrbie phil gahan chrisitv and and                    Neutral
1      advice talk to your neighbours family to excha...       Positive
2      coronavirus australia woolworths to give elder..        Positive
3      my food stock is not the only one which is emp..        Positive
4      me ready to go at supermarket during the covid...    Extremely Negative

Remove Stopwords

Stopwords are common English words, such as 'the', 'he', and 'have', that do not add much meaning to a sentence and can be safely removed without sacrificing its meaning. They are also among the most frequently occurring words in any paragraph.

Let's remove the stopwords from the text.

import nltk
from nltk.corpus import stopwords
## NLTK library provides the set of stop words for English
nltk.download('stopwords')
stopwords = stopwords.words('english')
tweets['Text'] = tweets['Text'].apply(lambda text: ' '.join(word for word in text.split() if word not in stopwords))  # text is already lowercased
print(tweets.head())
                         Text                                 Sentiment
 ------------------------------------------------------------------------
0      menyrbie phil gahan chrisitv                            Neutral
1      advice talk neighbours family exchange phone n...       Positive
2      coronavirus australia woolworths give elderly           Positive
3      food stock one empty please panic enough food           Positive
4      ready go supermarket covid19 outbreak paranoid...    Extremely Negative

Spelling Corrections

These days, text editors are smart enough to correct your documents, yet spelling mistakes remain widespread in text data, and they are especially common in tweets. Fortunately, misspelled words can be handled efficiently with the help of the textblob library.

from textblob import TextBlob
tweets['Text'] = tweets['Text'].apply(lambda x: str(TextBlob(x).correct()))
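
Before applying the correction to the whole dataset, it helps to see what correct() actually does on a single string; note that correction is relatively slow, so running it over tens of thousands of tweets can take a long time. The misspelled sentence below is the example from the TextBlob documentation and is used here purely for illustration.

from textblob import TextBlob

# correct() returns a new TextBlob with the most likely spelling fixes applied
print(TextBlob("I havv goood speling!").correct())
# Output: I have good spelling!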

Tokenization

Tokenization is breaking sentences down into words and paragraphs into sentences. These broken pieces are called tokens (either word tokens or sentence tokens), and they help in understanding the context and building a vocabulary. Tokenization works by splitting the text on spaces or punctuation.

    Text:    "The cat sat on the mat."
                 |
                 v
    Tokens:  "The", "cat", "sat", "on", "the", "mat", "."

import nltk
nltk.download('punkt')  # tokenizer models required by word_tokenize

word_data = "Enjoyalgorithms is a nice platform for computer science education."
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)
# Output
['Enjoyalgorithms', 'is', 'a', 'nice', 'platform', 'for', 'computer', 'science', 'education', '.']
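
The definition above also mentions sentence tokens. NLTK's sent_tokenize (which uses the same punkt models downloaded above) splits a paragraph into sentences; the example string below is made up just for illustration.

# Split a small paragraph into sentence tokens
sentence_data = "Enjoyalgorithms is a nice platform. It covers computer science education."
print(nltk.sent_tokenize(sentence_data))
# Output
# ['Enjoyalgorithms is a nice platform.', 'It covers computer science education.']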

Stemming and Lemmatization

Stemming and lemmatization are commonly used in search engines, keyword extraction, grouping of similar words, and other NLP tasks.

What is the difference between stemming and lemmatization?

Both processes aim to reduce a word to a common base or root word. However, the two methods follow very different approaches.

  • Stemming works by slicing off the end of the word, using a list of common suffixes like (-ing, -ed, -es). This slicing is successful on most occasions, but not always. Pros: faster to execute on large datasets. Cons: it may produce meaningless words.
  • Lemmatization relies on linguistic analysis of the word. It requires detailed dictionaries that the algorithm can look through to link the word form to its lemma. Because it uses these linguistic insights, lemmatization is generally preferred over stemming. Pros: it preserves the meaning of the word after extracting the root. Cons: it is computationally more expensive.

Note: Lemmatization is almost always preferred over stemming algorithms unless we need super-fast execution on a massive corpus of text data.
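
To make the difference concrete, here is a minimal sketch comparing NLTK's PorterStemmer and WordNetLemmatizer on a few sample words (the word list is chosen only for illustration, and the lemmatizer treats words as nouns by default).

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["studies", "feet", "caring"]
print([stemmer.stem(w) for w in words])          # typically ['studi', 'feet', 'care'] -- stems can be meaningless
print([lemmatizer.lemmatize(w) for w in words])  # typically ['study', 'foot', 'caring'] -- lemmas stay real words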

Applying tokenization and lemmatization to the tweets:

import nltk
nltk.download('wordnet')
nltk.download('punkt')
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
tweets['lemmatized_tokens'] = tweets['Text'].apply(lemmatize_text)
tweets.head()
                     Text                                     Sentiment            lemmatized_tokens
-----------------------------------------------------------------------------------------------------------------------------------------
0        menyrbie phil gahan chrisitv                          Neutral             [menyrbie, phil, gahan, chrisitv]
1        advice talk neighbours family exchange phone n...     Positive            [advice, talk, neighbour, family, exchange, ph...
2        coronavirus australia woolworths give elderly ...     Positive            [coronavirus, australia, woolworth, give, elde...
3        food stock one empty please panic enough food         Positive            [food, stock, one, empty, please, panic, enoug...
4        ready go supermarket covid19 outbreak paranoid...     Extremely Negative  [ready, go, supermarket, covid19, outbreak, pa...

With this, we covered the text pre-processing section, and now we're ready to draw insights from our data.


Text Exploratory Analysis

Let's start by analyzing the text length for different sentiments. We'll create a new column containing the number of words in each tweet.

tweets['word_length'] = tweets['Text'].str.split().str.len()
tweets.head()
                     Text                                     Sentiment           word length        lemmatized_tokens
-----------------------------------------------------------------------------------------------------------------------------------------
0        menyrbie phil gahan chrisitv                          Neutral                 4         [menyrbie, phil, gahan, chrisitv]
1        advice talk neighbours family exchange phone n...     Positive                27        [advice, talk, neighbour, family, exchange, ph...
2        coronavirus australia woolworths give elderly ...     Positive                13        [coronavirus, australia, woolworth, give, elde...
3        food stock one empty please panic enough food         Positive                23        [food, stock, one, empty, please, panic, enoug...
4        ready go supermarket covid19 outbreak paranoid...     Extremely Negative      21        [ready, go, supermarket, covid19, outbreak, pa...

Our objective is to explore the distribution of the tweet length for different sentiments.

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
plt.figure(figsize=(15,7))
colors = ["red", "green", "blue"]
labels = ["Neutral", "Positive", "Negative"]
for label, color in zip(labels, colors):
    sns.kdeplot(tweets.loc[tweets['Sentiment'] == label, 'word_length'],
                color=color, fill=True, label=label)  # fill=True replaces the deprecated shade=True
plt.xlabel('Text Length')
plt.ylabel('Density')
plt.legend()
plt.show()

[Plot: distribution of tweet word lengths for Neutral, Positive, and Negative sentiments]

From the above distribution plot, one can conclude that Neutral tweets have a shorter average text length than Positive and Negative tweets.
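
To back up the visual impression with numbers, a quick groupby on the word_length column gives the average tweet length per sentiment (the exact values depend on the dataset version, so treat this as a sanity check rather than a definitive result).

print(tweets.groupby('Sentiment')['word_length'].mean().sort_values())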

Visualizing the most frequent words

We are also interested in the most frequent words in the tweets (other than the stopwords, which we have already removed). Let's find those words!

import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt
lemmatized_tokens = list(tweets["lemmatized_tokens"])
token_list = list(itertools.chain(*lemmatized_tokens))
counts_no = collections.Counter(token_list)
clean_tweets = pd.DataFrame(counts_no.most_common(30),
                             columns=['words', 'count'])
fig, ax = plt.subplots(figsize=(8, 8))
clean_tweets.sort_values(by='count').plot.barh(x='words',
                      y='count',
                      ax=ax,
                      color="blue")
ax.set_title("Most Frequently used words in Tweets")
plt.show()

[Plot: bar chart of the 30 most frequently used words in the tweets]

Since our tweets belong to the pandemic timeline, the resultant word frequency plot is pretty intuitive.

WordCloud

A word cloud is a cluster of words represented in different sizes. The bigger and bolder the word appears, the more often it is mentioned within a given text data, and the more important it is.

from wordcloud import WordCloud

# Join all lemmatized tokens into one long string for the word cloud
all_words = ' '.join(token_list)
wordcloud = WordCloud(width=1200, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      min_font_size=10).generate(all_words)
                      
plt.figure(figsize = (15, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 
plt.show()

[Figure: word cloud of the most frequent words in the tweets]

Visualization of Sentimental words

We highly recommend the scattertext library for visualizing sentimental words. In the attached plot, one can look for the words that describe a sentence's sentiment. It arranges words based on their frequency in the documents and, at the same time, clusters them by their corresponding sentiment. In our case, red dots represent the cluster of positive words, blue dots the cluster of negative words, and words in the yellow cluster are close to neutral sentiment.

import scattertext as st
from IPython.display import IFrame

# Keep only the extreme classes for a clearer positive-vs-negative contrast
tweets = tweets.loc[(tweets['Sentiment'] == 'Extremely Negative') | (tweets['Sentiment'] == 'Extremely Positive')]

# scattertext needs a parsed column; a lightweight whitespace parser is sufficient here
tweets['parsed'] = tweets['Text'].apply(st.whitespace_nlp_with_sentences)
corpus = st.CorpusFromParsedDocuments(tweets.iloc[:10000, :],
                                      category_col='Sentiment',
                                      parsed_col='parsed').build()
html = st.produce_scattertext_explorer(corpus,
                                       category='Extremely Negative',
                                       category_name='Negative',
                                       not_category_name='Positive',
                                       minimum_term_frequency=5,
                                       width_in_pixels=1000,
                                       transform=st.Scalers.log_scale_standardize)
file_name = 'Sentimental Words Visualization.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width=1000, height=700)

[Figure: scattertext plot of words clustered by positive, negative, and neutral sentiment]

That's enough for the first part of text pre-processing. We have cleaned and visualized our data, but computers cannot understand English; they only understand numbers. So there must be a way to convert this cleaned text into a machine-readable format, and that's where word embeddings come into the picture. We will learn about word embeddings in the next part.

Possible Interview Questions

Text data preprocessing is a crucial topic in the field of NLP. If you are applying for positions as an NLP engineer or data scientist, it is important to have a good understanding of preprocessing steps. Some possible questions that may be asked about text data preprocessing include:

  1. What are the initial steps that should be taken to preprocess text data?
  2. What is the difference between lemmatization and stemming, and when should each be used?
  3. Why is the removal of stopwords important in text data preprocessing?
  4. What are some different ways to explore text data?
  5. What are the steps involved in text data visualization?

Conclusion

In this article, we focused on one of the most essential steps in natural language processing: text preprocessing. To give you practical experience, we applied the text cleaning steps, one after another, in Python to a dataset of tweets about Covid-19 sentiment analysis. By visualizing the hidden trends and significant words in the dataset, we were able to demonstrate the importance of text preprocessing. We hope you found this article enjoyable and informative.

Next Blog: Word vector encoding

Enjoy Learning, Enjoy Algorithms!
