Sentiment Analysis is the use of Natural Language Processing (NLP), Text Mining, and Computational Linguistics to identify, extract, and understand the emotional inclination present in text. With the widespread propagation of reviews, blogs, ratings, recommendations, and feedback, online opinion has turned into a gold mine for businesses looking to capture the market with their products, identify new opportunities, and manage their reputation and brand name. Sentiment Analysis is used by almost every sector and is widely applicable to Market Research, Customer Feedback, Brand Monitoring, Voice of the Employee, and Social Media Monitoring.
By the end of this blog, you will understand the conventional approach to sentiment classification, from cleaning and exploring raw text to building and evaluating a sentiment classifier.
The conventional approach to sentiment classification involves several steps, from structuring the text data to understanding customer sentiment. Over the years, Deep Learning has taken Sentiment Analysis to a whole new level. With the introduction of Transformers and Transfer Learning, building a model for sentiment classification is a matter of minutes. However, knowing the basics of sentiment classification always comes in handy.
Let’s build a model for classifying the sentiments using the conventional approach!
In this tutorial, we will be using Kaggle's IMDB movie review dataset for demonstration. It contains more than 40,000 reviews with sentiment labels, and most of the reviews are longer than 200 words.
Let’s load the dataset!
```python
import pandas as pd

imdb_reviews = pd.read_csv('train.csv')
imdb_reviews.head()
```
Why do we need to clean the text? Unlike humans, machines cannot make sense of unstructured text on their own, so it is necessary to clean the text data before fitting any machine learning model to it.
Let's build a text preprocessing pipeline that applies the following operations to our movie review corpus:

- Lowercasing all the text
- Removing URLs
- Removing non-alphanumeric characters
- Removing stopwords
- Correcting spelling mistakes
```python
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from textblob import TextBlob

stopwords = set(stopwords.words('english'))

def text_preprocessing_pipeline(corpus):
    # Lowercase the text, then strip URLs and non-alphanumeric characters
    corpus['text'] = corpus['text'].str.lower()
    corpus['text'] = corpus['text'].str.replace(r"http\S+", "", regex=True)
    corpus['text'] = corpus['text'].str.replace(r'[^A-Za-z0-9]+', ' ', regex=True)
    # Remove stopwords
    corpus['text'] = corpus['text'].apply(
        lambda words: ' '.join(word for word in words.split() if word not in stopwords))
    # Correct spelling mistakes (note: TextBlob's correct() is slow on large corpora)
    corpus['text'] = corpus['text'].apply(lambda x: str(TextBlob(x).correct()))
    return corpus

reviews = text_preprocessing_pipeline(imdb_reviews)
reviews.head()
```
Tokenization is the process of breaking a sentence down into words called tokens. These tokens help in understanding the context and in building the vocabulary. Tokenization works by separating words at spaces or punctuation.
Lemmatization, on the other hand, reduces a word to its common base (root) form with the help of linguistic analysis. It requires detailed dictionaries that the algorithm can look through to link an inflected form back to its root word. For example, "running" and "runs" are both reduced to the lemma "run".
Applying tokenization and lemmatization to our cleaned movie reviews:
```python
import nltk
nltk.download('wordnet')
nltk.download('punkt')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Tokenize on whitespace, then lemmatize each token
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

reviews['lemmatized_tokens'] = reviews['text'].apply(lemmatize_text)
reviews.head()
```
Now, we have a clean dataset ready for Exploratory data analysis.
Now that the stopwords are gone, we are also interested in the words that appear most frequently in the reviews. Let's find those words!
```python
import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Flatten the per-review token lists into a single list of tokens
lemmatized_tokens = list(reviews["lemmatized_tokens"])
token_list = list(itertools.chain(*lemmatized_tokens))
counts_no = collections.Counter(token_list)

clean_reviews = pd.DataFrame(counts_no.most_common(30), columns=['words', 'count'])

fig, ax = plt.subplots(figsize=(12, 8))
clean_reviews.sort_values(by='count').plot.barh(x='words', y='count', ax=ax, color="purple")
ax.set_title("Most Frequently used words in Reviews")
plt.show()
```
Since our dataset contains movie reviews, the resulting word frequency plot is pretty intuitive.
A bigram is a sequence of two adjacent elements from a string of tokens, typically letters, syllables, or words. Let's also check the most frequent bigrams in our data.
```python
# Pair each token with its successor to form bigrams
# (note: this flat list also pairs tokens across review boundaries)
bigrams = zip(token_list, token_list[1:])
counts_no = collections.Counter(bigrams)
counts_no.most_common(20)
```
Almost all the above bigrams make sense in our data. We could go further with trigrams, but that would not be as informative as these bigrams and unigrams.
Let’s visualize the most practical words that represent positive or negative sentiment in reviews.
```python
import scattertext as st
from IPython.display import IFrame

# Keep only the labelled reviews and build a scattertext corpus from the raw text
labelled = reviews.loc[reviews['label'].isin(["Positive", "Negative"])]
corpus = st.CorpusFromPandas(labelled.iloc[:2000, :],
                             category_col='label',
                             text_col='text',
                             nlp=st.whitespace_nlp).build()

html = st.produce_scattertext_explorer(corpus,
                                       category="Positive",
                                       category_name='Positive',
                                       not_category_name='Negative',
                                       minimum_term_frequency=5,
                                       width_in_pixels=1000,
                                       transform=st.Scalers.log_scale_standardize)

file_name = 'Sentimental Words Visualization.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width=1000, height=700)
```
Let's quickly summarise our findings:

- The most frequent words in the reviews are intuitive for a movie review corpus.
- The frequent bigrams are meaningful, while trigrams would add little extra information.
- The scattertext plot separates the words that characterise positive and negative reviews.
Word embedding is a technique for representing words as numeric vectors. Words are encoded as real-valued vectors such that words sharing a similar meaning and context are clustered closely in the vector space. In simple words, word embeddings are a form of word representation that connects the human understanding of language to that of a machine. Word embeddings are crucial for solving NLP problems.
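To make this clustering idea concrete, here is a minimal sketch (not part of the original pipeline) that trains a small Word2Vec model with gensim on our lemmatized tokens; the hyperparameters are illustrative, not tuned:

```python
from gensim.models import Word2Vec

# Train a small Word2Vec model on the lemmatized review tokens
model = Word2Vec(sentences=list(reviews['lemmatized_tokens']),
                 vector_size=100, window=5, min_count=5, workers=4)

# Words used in similar contexts land close together in vector space
print(model.wv.most_similar('good', topn=5))
```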
There are several methods available for producing word embeddings. However, the idea is primarily the same: capture as much of the contextual and semantic information as possible. Selecting the optimal word embedding often requires empirical effort and, in general, is not an easy task.
Following are some popular and simple word embedding methods available for the vector representation of words:

- Bag of Words (Count Vectorizer)
- TF-IDF Vectorizer
- Word2Vec
- GloVe
In this tutorial, we will confine ourselves to the TF-IDF Vectorizer.
TF-IDF is short for Term Frequency-Inverse Document Frequency. It is commonly used to transform text into a meaningful representation as numeric vectors. It originated as an information retrieval method that relies on Term Frequency (TF) and Inverse Document Frequency (IDF) to measure the importance of a word in a document.
Term Frequency (TF) tracks how often a word occurs in a document, while Inverse Document Frequency (IDF) assigns a weight to each word based on the whole corpus. The IDF weight is high for words that appear in few documents and low for words that appear in many, which lets us measure how important a word is to a particular document.
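As a quick illustration (a minimal sketch on a made-up toy corpus, not our review data), notice how a word that appears in every document receives a lower IDF weight than a rare one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "movie" appears in every document, "terrible" in only one
toy_corpus = ["great movie", "boring movie", "terrible movie"]

vectorizer = TfidfVectorizer()
vectorizer.fit(toy_corpus)

# The common word "movie" gets the lowest IDF weight
for word, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{word}: {idf:.3f}")
```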
Let’s implement TF-IDF on our movie reviews:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 2,000 most frequent terms as features
tfidf_converter = TfidfVectorizer(max_features=2000)
features = tfidf_converter.fit_transform(reviews['text']).toarray()
```
Now, we are ready to build our Sentiment Classification model, but first, we need to select a supervised classification model that satisfies our requirements.
We have a bunch of algorithms for classification tasks, and each has its pros and cons. One algorithm might fetch superior results compared to others but lack explainability. Even when explainability is not compromised, deploying such complex algorithms is a tedious task. In other words, there is a trade-off between performance, model complexity, and model explainability. An ideal algorithm would be explainable, reliable, and easy to deploy, but there is no such thing as a perfect algorithm.
For instance, XGBoost is a high-performance and explainable algorithm, but it is quite complex and computationally demanding. On the other hand, Logistic Regression is relatively fast, easy to implement, and explainable, but its performance on non-linear datasets is often disappointing. As the number of features in the dataset grows, Logistic Regression also tends to slow down, and its performance eventually plateaus.
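For reference, here is a minimal sketch (not part of the original tutorial) of how such a Logistic Regression baseline could be scored on the TF-IDF features we built above, for comparison with the LightGBM model below:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of a Logistic Regression baseline
# on the TF-IDF features and labels from the earlier steps
baseline = LogisticRegression(max_iter=1000)
scores = cross_val_score(baseline, features, reviews['label'], cv=5, scoring='accuracy')
print(f"Logistic Regression CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```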
For this tutorial, we will be using the LightGBM classifier!
LightGBM is a gradient boosting framework, similar to XGBoost, that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support for parallel, distributed, and GPU learning
- Capable of handling large-scale data
LightGBM is an excellent alternative to XGBoost, reported to be roughly six times faster without compromising performance. It handles large datasets well and requires little memory to run.
Let's implement LightGBM for sentiment classification:
```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

target = reviews['label']
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.3)

clf = lgb.LGBMClassifier(max_depth=20, n_estimators=25,
                         min_child_weight=0.0016, n_jobs=-1)
clf.fit(x_train, y_train)

pred = clf.predict(x_test)
print("Test data Accuracy is :", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
```
Accuracy on the Testing dataset
```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)
# Label rows and columns with the class names for readability
cm_matrix = pd.DataFrame(cm, index=clf.classes_, columns=clf.classes_)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()
```
Twitter allows businesses to engage personally with consumers. With so much data available, Twitter has developed real-time sentiment classification models to support and manage the marketing strategies of several brands. Twitter's sentiment analysis allows companies to understand their customers, keep track of what is being said about their brand and their competitors, and discover what is trending in the market.
IBM is among the few companies using sentiment analysis to understand employee concerns, and it is also developing programs to improve the likelihood that employees will stay on the job. This helps human resource managers figure out how workers feel about the company and where management can make changes to improve the employee experience.
Nielsen relies on sentiment analysis to discover market trends and gauge the popularity of its customers' products. Based on sentiment trends, it also provides consultation for building marketing strategies and campaigns.
Sentiment analysis projects are commonly found on beginners' resumes, so it is important to be prepared for the interview questions that typically follow from them, for example:

- Why did you choose TF-IDF over other word embedding methods?
- How is lemmatization different from stemming?
- Why use LightGBM instead of a simpler model such as Logistic Regression?
We started with a brief introduction to sentiment analysis and why industries need it. We then applied a text preprocessing pipeline to our movie review dataset to remove redundant expressions from the text, and implemented tokenization and lemmatization to understand the context of the words used in the reviews and collapse recurring words appearing in diverse forms. Next, we performed an exploratory text analysis to understand the frequent unigrams and bigrams used in the reviews and visualized the clusters of positive and negative words. Finally, we applied the TF-IDF vectorizer to the processed reviews, built a LightGBM model to classify the reviews, and evaluated its performance on the testing dataset. We also looked at some industrial use cases of sentiment analysis.