Natural language processing (NLP) is the branch of artificial intelligence concerned with the interaction between computers and human languages such as English and Hindi. NLP aims to let computers interpret natural language much as humans do. It has many business applications, such as language translation, document summarization, sentiment analysis, and virtual assistants (like Siri and Cortana).
Text is also a kind of data, but pre-processing is one of the trickiest and most tedious parts of an NLP project. Raw text is rarely usable as-is, and the quality of the final results depends heavily on how carefully the text is cleaned. Fortunately, Python has excellent NLP libraries such as NLTK, spaCy, and Gensim to ease our text analysis.
Let’s start with text pre-processing first!
Why do we need to clean the text? Unlike humans, machines cannot make sense of unstructured text directly, so the text data must be cleaned before it is fed to any machine learning algorithm. To understand the concept better, let’s follow the “learning by doing” strategy. In this blog, we will be using the Coronavirus Tweets NLP Text Classification dataset for demonstration.
Let’s start by loading the data!
import pandas as pd

# If this raises a UnicodeDecodeError, passing encoding='latin1' may help
tweets = pd.read_csv('Corona_NLP_train.csv')
print(tweets.head())  # This will print the head of dataframe
For this blog, we are only concerned with two columns: the unstructured tweet text and its sentiment label. We can drop the remaining columns and rename the two we keep for clarity!
tweets = tweets[['OriginalTweet', 'Sentiment']]  # extraction
tweets.columns = ['Text', 'Sentiment']  # renaming
We need to design a pre-processing pipeline (sequence-wise processing), where at each step, we will gradually clean our unstructured text.
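One way to picture such a pipeline is as an ordered list of small cleaning functions applied one after another. The sketch below uses only Python's re module on a single made-up tweet; the names `to_lowercase`, `remove_urls`, `remove_non_alphanumeric`, and `run_pipeline` are hypothetical and only illustrate the idea, not the exact code used in the rest of this blog.

```python
import re

def to_lowercase(text):
    return text.lower()

def remove_urls(text):
    # Drop anything that starts with "http" up to the next whitespace
    return re.sub(r"http\S+", "", text)

def remove_non_alphanumeric(text):
    # Replace runs of characters other than letters and digits with one space
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()

# Hypothetical pipeline: each step receives the output of the previous one
PIPELINE = [to_lowercase, remove_urls, remove_non_alphanumeric]

def run_pipeline(text, steps=PIPELINE):
    for step in steps:
        text = step(text)
    return text

sample = "Stay SAFE everyone!! Details: https://example.com/covid #StayHome"
cleaned = run_pipeline(sample)
print(cleaned)  # stay safe everyone details stayhome
```

The order of the steps matters: if the non-alphanumeric step ran before `remove_urls`, the link would be split into ordinary words and the URL pattern would no longer match it.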
The first step is to convert the tweets to lowercase so that the same word is always represented the same way during NLP tasks and text mining. For example, ‘Virus’ and ‘virus’ would otherwise be treated as two different words, so we lowercase all the words in the tweets to prevent this duplication.
tweets['Text'] = tweets['Text'].str.lower()
tweets.head()
Hyperlinks are very common in tweets and rarely carry information about sentiment. Other problem statements may need to preserve them, so the choice depends on the task. For sentiment analysis, let’s remove them!
tweets['Text'] = tweets['Text'].str.replace(r"http\S+", "", regex=True)
tweets.head()
For many NLP problems, punctuation does not provide additional information about the language, so we generally drop it. Punctuation symbols are likewise not crucial for this sentiment analysis, and removing them before text modeling is common practice. Note that the pattern used here replaces every run of characters other than English letters and digits with a space, so symbols such as # and @ are removed as well.
tweets['Text'] = tweets['Text'].str.replace('[^A-Za-z0-9]+', ' ', regex=True)
tweets.head()
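It can help to see what this pattern does to a single string. The sketch below applies it to a made-up sample and compares it with an alternative that deletes only ASCII punctuation via `string.punctuation`. The alternative is a different technique from the one used in this blog and is shown only for contrast.

```python
import re
import string

sample = "café prices up 20%!!! #covid19 @who"

# Pattern from the step above: keeps only English letters and digits
ascii_only = re.sub(r"[^A-Za-z0-9]+", " ", sample).strip()

# Alternative (assumption, not the blog's method): delete ASCII punctuation only,
# which leaves accented and other non-English letters untouched
table = str.maketrans("", "", string.punctuation)
no_punct = sample.translate(table)

print(ascii_only)  # caf prices up 20 covid19 who
print(no_punct)    # café prices up 20 covid19 who
```

For English-only tweets the two give similar results, but the first pattern also strips accented letters, which may matter for multilingual text.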
What are the stopwords?
Stopwords are very common words that do not add much meaning to a sentence, such as the, he, and have. They are among the most frequently appearing words in any paragraph, and they can usually be removed without changing the meaning of the sentence much.
Let’s remove the stopwords from the text.
import nltk
from nltk.corpus import stopwords as nltk_stopwords

nltk.download('stopwords')
# NLTK provides a list of stop words for English; a set makes lookups fast
stopwords = set(nltk_stopwords.words('english'))
# The text is already lowercase, so words can be compared directly
tweets['Text'] = tweets['Text'].apply(
    lambda text: ' '.join(word for word in text.split() if word not in stopwords)
)
print(tweets.head())
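The filtering itself is just a membership test per word. Here is a minimal sketch on one made-up sentence, using a tiny hand-written stop list (an assumption for illustration; NLTK's English list is much longer) and a hypothetical helper named `remove_stopwords`.

```python
# Hypothetical, tiny stop list for illustration only
STOP_WORDS = {"the", "is", "a", "he", "have", "not", "this"}
# Negations can flip sentiment, so one option is to keep them
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(text, stop_words):
    # Keep only the words that are not in the stop list
    return " ".join(word for word in text.split() if word not in stop_words)

sentence = "this is not a good vaccine update"
without_stopwords = remove_stopwords(sentence, STOP_WORDS)
keeping_negations = remove_stopwords(sentence, STOP_WORDS - NEGATIONS)

print(without_stopwords)   # good vaccine update
print(keeping_negations)   # not good vaccine update
```

NLTK's English list includes negations such as not and no, so for sentiment analysis it may be worth checking whether removing them changes the meaning of your tweets.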
These days, text editors are smart enough to correct your documents. Still, spelling mistakes are widespread in text data, and tweets are no exception. Fortunately, many misspelled words can be corrected with the help of the textblob library.
from textblob import TextBlob

# correct() checks every word, so this step can be slow on a large dataset
tweets['Text'] = tweets['Text'].apply(lambda x: str(TextBlob(x).correct()))
Tokenization is the process of breaking paragraphs into sentences and sentences into words. The resulting pieces are called tokens (either word tokens or sentence tokens), and they help us understand the context and build a vocabulary. Word tokenization works by separating words at spaces and punctuation.
import nltk

# Tokenizer models used by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
word_data = "Enjoyalgorithms is a nice platform for computer science education."
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)
# Output
# ['Enjoyalgorithms', 'is', 'a', 'nice', 'platform', 'for', 'computer', 'science', 'education', '.']
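To make "separating words at spaces and punctuation" concrete, here is a minimal regex-based sketch. It is not how nltk.word_tokenize works internally (NLTK handles contractions, abbreviations, and more); `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helper names.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Naive rule: split at whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Enjoyalgorithms is a nice platform. It teaches computer science!"
sentences = simple_sent_tokenize(text)
words = simple_word_tokenize(text)

print(sentences)
print(words)
```

A rule this naive breaks on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a library tokenizer for real data.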
Stemming and lemmatization are commonly used when building search engines, extracting keywords, grouping similar words together, and in NLP in general.
Both processes aim to reduce a word to a common base or root form. However, the two methods follow very different approaches. Stemming strips suffixes using fixed rules and may produce strings that are not real words, while lemmatization uses a vocabulary and morphological analysis to return the dictionary form (lemma) of a word.
Note: Lemmatization is usually preferred over stemming unless we need very fast execution on a massive corpus of text data.
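The difference in approach can be illustrated with a deliberately simplified sketch: a stemmer chops suffixes by rule, while a lemmatizer looks words up in a vocabulary. The suffix rules and the tiny lemma dictionary below are hypothetical stand-ins and are not what NLTK's PorterStemmer or WordNetLemmatizer actually implement.

```python
# Toy suffix rules, checked in order (assumption for illustration)
SUFFIXES = ("ies", "ing", "ed", "es", "s")

def toy_stem(word):
    # Rule-based: strip the first matching suffix if at least 3 letters remain
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy vocabulary mapping inflected forms to dictionary forms (assumption)
LEMMAS = {"studies": "study", "studying": "study", "viruses": "virus", "was": "be"}

def toy_lemmatize(word):
    # Vocabulary-based: return the known lemma, or the word itself if unknown
    return LEMMAS.get(word, word)

for word in ["studies", "studying", "viruses", "was"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

In this toy run, "studies" is stemmed to "stud" (not a real word) but lemmatized to "study", and "was" is left alone by the stemmer while the lemmatizer maps it to "be". The stemmer is cheap because it needs no vocabulary, which is the speed trade-off mentioned in the note above.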
Applying tokenization and Lemmatization to tweets:
import nltk

nltk.download('wordnet')
nltk.download('punkt')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace and lemmatize each token (treated as a noun by default)
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

# The tweet column is named 'Text' after the renaming step
tweets['lemmatized_tokens'] = tweets['Text'].apply(lemmatize_text)
tweets.head()
With this, we have covered the text pre-processing section, and now we’re ready to draw insights from our data.
Let’s start by analyzing the text length for different sentiments. We create a new column holding the number of words in each tweet.
tweets['word_length'] = tweets['Text'].str.split().str.len()
tweets.head()
Our objective is to explore the distribution of the tweet length for different sentiments.
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)
plt.figure(figsize=(15, 7))
cmap = ["red", "green", "blue"]
labels = ["Neutral", "Positive", "Negative"]
for label, clr in zip(labels, cmap):
    # fill=True replaces the deprecated shade=True in recent seaborn versions
    sns.kdeplot(tweets.loc[tweets['Sentiment'] == label, 'word_length'], color=clr, fill=True, label=label)
plt.xlabel('Text Length')
plt.ylabel('Density')
plt.legend()
plt.show()
From the above distribution plot, one can conclude that Neutral tweets have a shorter average text length than Positive and Negative tweets.
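The visual impression can be cross-checked numerically by averaging word_length per sentiment. The small DataFrame below is made-up sample data so that the snippet runs on its own; on the real data, the same groupby call would be applied to tweets instead of `sample`.

```python
import pandas as pd

# Made-up rows standing in for the cleaned tweets DataFrame
sample = pd.DataFrame({
    "Text": [
        "stay home",
        "great news vaccine works well",
        "prices rising panic buying everywhere",
        "store open",
    ],
    "Sentiment": ["Neutral", "Positive", "Negative", "Neutral"],
})
sample["word_length"] = sample["Text"].str.split().str.len()

# Mean and median number of words per sentiment class
length_stats = sample.groupby("Sentiment")["word_length"].agg(["mean", "median"])
print(length_stats)
```

Comparing the mean with the median is a quick way to see whether a few very long tweets are driving the average.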
We are also interested in the words (other than stopwords) that appear most frequently in the tweets. Let’s find those words!
import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt

lemmatized_tokens = list(tweets["lemmatized_tokens"])
token_list = list(itertools.chain(*lemmatized_tokens))
counts_no = collections.Counter(token_list)
clean_tweets = pd.DataFrame(counts_no.most_common(30), columns=['words', 'count'])
fig, ax = plt.subplots(figsize=(8, 8))
clean_tweets.sort_values(by='count').plot.barh(x='words', y='count', ax=ax, color="blue")
ax.set_title("Most Frequently used words in Tweets")
plt.show()
Since our tweets belong to the pandemic timeline, the resultant word frequency plot is pretty intuitive.
A word cloud is a cluster of words drawn in different sizes. The bigger and bolder a word appears, the more often it occurs in the given text, which is often taken as a rough signal of its importance.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Assumption: all_words is every cleaned tweet joined into one string
all_words = ' '.join(tweets['Text'])
wordcloud = WordCloud(width=1200, height=800, background_color='white', stopwords=stopwords, min_font_size=10).generate(all_words)
plt.figure(figsize=(15, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Visualization of Sentimental words
We highly recommend scattertext for visualizing sentiment-bearing words. In the resulting plot, one can look for the words that characterize the sentiment of a sentence. It arranges the words based on their frequency in the documents and, at the same time, groups them by their corresponding sentiment. In our case, red dots represent the cluster of positive words and blue dots represent the cluster of negative words. Words in the yellow cluster are close to neutral sentiment.
import scattertext as st
from IPython.display import IFrame

# Keep only the two extreme sentiment classes (first 10,000 rows)
extreme_tweets = tweets.loc[tweets['Sentiment'].isin(['Extremely Negative', 'Extremely Positive'])].iloc[:10000].copy()
# Assumption: the 'parsed' column is built with scattertext's simple whitespace parser
extreme_tweets['parsed'] = extreme_tweets['Text'].apply(st.whitespace_nlp_with_sentences)
corpus = st.CorpusFromParsedDocuments(extreme_tweets, category_col='Sentiment', parsed_col='parsed').build()
html = st.produce_scattertext_explorer(corpus, category='Extremely Negative', category_name='Negative', not_category_name='Positive', minimum_term_frequency=5, width_in_pixels=1000, transform=st.Scalers.log_scale_standardize)
file_name = 'Sentimental Words Visualization.html'
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width=1000, height=700)
That is enough for this first part on text pre-processing. We have cleaned and visualized our data, but don’t forget that machines don’t understand English; they only work with numbers. So there must be a way to convert this cleaned text into a machine-readable format, and that’s where word embeddings come into the picture. We will learn about word embeddings in the next part.
Text data pre-processing is one of the most important topics in NLP. If we list it on our resume and apply for NLP engineer or data scientist positions, we must know the pre-processing steps and be ready to answer questions about them.
In this article, we covered one of the most crucial steps in natural language processing: text pre-processing. We implemented the text cleaning steps one by one in Python on the Covid-19 tweet sentiment dataset for hands-on practice. We then visualized the trends and the most significant words in the corpus. We hope you enjoyed the article.