Sentiment Analysis for Classifying Sentiment of Movie Reviews

Sentiment Analysis is the use of Natural Language Processing (NLP), Text Mining, and Computational Linguistics to identify, extract, and understand the emotional inclination present in text. With the widespread propagation of reviews, blogs, ratings, recommendations, and feedback, online opinion has turned into a gold mine for businesses looking to capture the market with their products, identify new opportunities, and manage their reputation and brand name. Sentiment Analysis is used by almost every sector and is widely applicable to Market Research, Customer Feedback, Brand Monitoring, Voice of Employee, and Social Media Monitoring.

Key takeaways from this blog

After completing this blog, you will understand:

  1. How sentiment analysis is used by different industries.
  2. Data analysis for the IMDB movie review dataset.
  3. Different steps of text preprocessing, including tokenization, lemmatization, word embeddings, and TF-IDF.
  4. Building a Light GBM model for predicting positive and negative reviews.
  5. Industrial use cases: how Twitter, IBM, and Nielsen use sentiment analysis.
  6. Possible interview questions on this project.

The conventional approach to sentiment classification involves several steps, from structuring the text data to understanding the customer sentiments. Over the years, Deep Learning has taken Sentiment Analysis to a whole new level. With the introduction of Transformers and Transfer Learning, building a model for sentiment classification is a matter of minutes. However, knowing the basics of sentiment classification always comes in handy.

Let’s build a model for classifying the sentiments using the conventional approach! 


Data Analysis

In this tutorial, we will be using Kaggle’s IMDB movie review dataset for demonstration. This dataset contains more than 40,000 reviews along with their sentiment labels, and most of the reviews are longer than 200 words.

Let’s load the dataset!

import pandas as pd
imdb_reviews = pd.read_csv('train.csv')
imdb_reviews.head()

IMDB dataset snippet
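
Before preprocessing, it is worth confirming the size and class balance of the dataset with a quick check:

# Quick sanity check on dataset size and class balance
# (the sentiment column is named 'label', as used later in this tutorial)
print(imdb_reviews.shape)
print(imdb_reviews['label'].value_counts())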

Text Preprocessing

Why do we need to clean the text? Unlike humans, machines lack an understanding of unstructured text, so it becomes necessary to clean the text data before fitting any machine learning model to it.

Let’s build a text preprocessing pipeline where we will be applying the following operations to our movie review corpus:

  • Lowering the text
  • Removing URLs from text
  • Removing Punctuations from text
  • Removing Stopwords from text
  • Correcting the misspelled words

import nltk
from textblob import TextBlob
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def text_preprocessing_pipeline(corpus):
    corpus['text'] = corpus['text'].str.lower()
    corpus['text'] = corpus['text'].str.replace(r"http\S+", "", regex=True)          # remove URLs
    corpus['text'] = corpus['text'].str.replace(r'[^A-Za-z0-9]+', ' ', regex=True)   # remove punctuation
    corpus['text'] = corpus['text'].apply(lambda words: ' '.join(word for word in words.split() if word not in stop_words))
    corpus['text'] = corpus['text'].apply(lambda x: str(TextBlob(x).correct()))      # note: slow on large corpora
    return corpus

reviews = text_preprocessing_pipeline(imdb_reviews)
reviews.head()

Positive and negative label on data

Tokenization & Lemmatization

Tokenization is the process of breaking a sentence down into words called tokens. These tokens help in understanding the context and in building the vocabulary. Tokenization works by separating words at spaces or punctuation.

Tokenization
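
As a quick illustration, NLTK's word tokenizer (a minimal sketch; it relies on the 'punkt' resource, which we download in the next code block anyway) splits on spaces and punctuation:

from nltk.tokenize import word_tokenize

print(word_tokenize("The movie wasn't bad!"))
# ['The', 'movie', 'was', "n't", 'bad', '!']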

Lemmatization, on the other hand, reduces a word to its base root form (the lemma) with the help of linguistic analysis. It requires detailed dictionaries that the algorithm can look through to link an inflected form (for example, "running" or "ran") back to its root word ("run").

Lemmatization

Applying tokenization and Lemmatization to our Clean Movie Reviews:

import nltk
nltk.download('wordnet')   # WordNet dictionary used by the lemmatizer
nltk.download('punkt')     # tokenizer models

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Tokenize on whitespace, then reduce each token to its lemma
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

reviews['lemmatized_tokens'] = reviews['text'].apply(lemmatize_text)
reviews.head()

Lemmatized tokens snippet on IMDB movie review data

Now, we have a clean dataset ready for Exploratory data analysis. 

Text Exploratory Analysis

With stopwords already removed, we are now interested in the words that appear most frequently in the reviews. Let’s find those words!

import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Flatten the per-review token lists into a single list and count word frequencies
lemmatized_tokens = list(reviews["lemmatized_tokens"])
token_list = list(itertools.chain(*lemmatized_tokens))
word_counts = collections.Counter(token_list)
clean_reviews = pd.DataFrame(word_counts.most_common(30),
                             columns=['words', 'count'])

fig, ax = plt.subplots(figsize=(12, 8))
clean_reviews.sort_values(by='count').plot.barh(x='words',
                                                y='count',
                                                ax=ax,
                                                color="purple")
ax.set_title("Most Frequently Used Words in Reviews")
plt.show()

Frequency of different words on IMDB movie review data

Since our dataset contains movie reviews, the resulting word-frequency plot is pretty intuitive.

Bigrams

A bigram is a sequence of two adjacent elements from a string of tokens, typically letters, syllables, or words. Let’s also check the highly frequent bigrams in our data.

bigrams = zip(token_list, token_list[1:])     # pair each token with its successor
bigram_counts = collections.Counter(bigrams)
bigram_counts.most_common(20)
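
The bigram chart below can be produced with the same bar-plot recipe we used for single words; a minimal sketch reusing bigram_counts from above:

bigram_df = pd.DataFrame(bigram_counts.most_common(20), columns=['bigram', 'count'])
bigram_df['bigram'] = bigram_df['bigram'].apply(' '.join)   # ('special', 'effect') -> 'special effect'
fig, ax = plt.subplots(figsize=(12, 8))
bigram_df.sort_values(by='count').plot.barh(x='bigram', y='count', ax=ax, color="purple")
ax.set_title("Most Frequently Used Bigrams in Reviews")
plt.show()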

Bigram of IMDB movie review data

Almost all the above bigrams make sense in our data. We could go further with trigrams, but those would not be as informative as the bigrams and unigrams; for completeness, a trigram count is sketched below.
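
A minimal trigram count, using NLTK's ngrams helper on the same token_list:

from nltk.util import ngrams

trigram_counts = collections.Counter(ngrams(token_list, 3))
trigram_counts.most_common(10)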

Visualization of Sentimental Words

Let’s visualize the most characteristic words that represent positive or negative sentiment in reviews.

import scattertext as st
from IPython.display import IFrame

corpus_df = reviews.loc[(reviews['label'] == "Positive") | (reviews['label'] == "Negative")]
corpus = st.CorpusFromPandas(corpus_df.iloc[:2000, :],
                             category_col='label',
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences).build()
html = st.produce_scattertext_explorer(corpus,
                                       category="Positive",
                                       category_name='Positive',
                                       not_category_name='Negative',
                                       minimum_term_frequency=5,
                                       width_in_pixels=1000,
                                       transform=st.Scalers.log_scale_standardize)
file_name = 'Sentimental Words Visualization.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width=1000, height=700)

visualization of sentiment words in IMDB review data

Let’s quickly summarise our findings:

  • The red cluster represents words that appear mostly in positive reviews. The farther a word is from the yellow band, the stronger its positive sentimental context.
  • On the contrary, the blue cluster represents words that appear mostly in negative reviews. The farther they are from the yellow band, the stronger their negative sentimental context.
  • The thin yellow-shaded cluster represents neutral words.
  • Words on the extreme right appear more frequently in the reviews than those on the extreme left.

Word Embeddings

Word Embedding is a term used for representing words as numeric vectors. Words are encoded as real-valued vectors such that words sharing a similar meaning and context are clustered closely together in the vector space. In simple words, word embeddings are a form of word representation that connects the human understanding of language to that of a machine. Word embeddings are crucial for solving NLP problems.

word-embedding

Source: ResearchGate

There are several methods available for producing word embeddings, but the idea is primarily the same: to capture as much of the contextual and semantic information as possible. Selecting an optimal word embedding often requires empirical effort, and generally, it is not an easy task.

Following are some popular and simple word embedding methods available for the vector representation of words (a short Word2Vec sketch follows the list):

  • Word2Vec
  • GloVe
  • Bag-of-words
  • TF-IDF
  • ELMo (Embeddings from Language Models)
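
To make the idea concrete, here is a minimal Word2Vec sketch using gensim (an assumption on our side: gensim is not used elsewhere in this tutorial, and the hyperparameters shown are arbitrary defaults):

from gensim.models import Word2Vec

# Train on our per-review token lists; words used in similar contexts
# end up close together in the vector space
w2v = Word2Vec(sentences=list(reviews['lemmatized_tokens']),
               vector_size=100, window=5, min_count=5, workers=4)
print(w2v.wv.most_similar('movie', topn=5))   # nearest neighbours of 'movie'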

In this tutorial, we will confine ourselves to the TF-IDF Vectorizer.

TF-IDF Vectorizer

TF-IDF is short for Term Frequency–Inverse Document Frequency. It is commonly used to transform text into a meaningful representation of numeric vectors. Originally an information retrieval method, it relies on Term Frequency (TF) and Inverse Document Frequency (IDF) to measure how important a word is to a document.

tf-idf vectorization formulae

Term Frequency (TF) tracks the occurrence of a word within a document, while Inverse Document Frequency (IDF) assigns a weight to each word based on the whole corpus. The IDF weight is high for infrequently appearing words and low for frequent ones. Together, they allow us to detect how important a word is to a document.
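
As a quick sanity check, here is a toy, hand-rolled version using the common formulation TF(t, d) = (count of t in d) / (terms in d) and IDF(t) = log(N / df(t)); note that scikit-learn's TfidfVectorizer uses a smoothed variant, so its numbers will differ slightly:

import math

# Toy corpus of three tiny "documents" (hypothetical example)
docs = [["good", "movie"], ["bad", "movie"], ["good", "plot", "good", "acting"]]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(1 for doc in docs if term in doc)   # number of documents containing the term
    return math.log(N / df)

# "good" appears in 2 of 3 documents, so its IDF is low;
# "plot" appears in only 1, so it is weighted higher despite a lower TF
print(tf("good", docs[2]) * idf("good"))   # 0.5  * log(3/2) ≈ 0.203
print(tf("plot", docs[2]) * idf("plot"))   # 0.25 * log(3/1) ≈ 0.275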

Let’s implement TF-IDF on our movie reviews:

from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 2000 most frequent terms as features
tfidf_converter = TfidfVectorizer(max_features=2000)
features = tfidf_converter.fit_transform(reviews['text']).toarray()

Model Building

Now, we are ready to build our Sentiment Classification model, but first, we need to select a supervised classification model that satisfies our requirements.  

We have a bunch of algorithms for classification tasks, and each algorithm has its pros and cons. One algorithm might fetch superior results compared to others but lack explainability. Even when explainability is not compromised, deploying such complex algorithms can be a tedious task. In other words, there is a trade-off between performance, model complexity, and model explainability. An ideal algorithm must be explainable, reliable, and easy to deploy, but again, there is nothing like a perfect algorithm.

For instance, XGBoost is a high-performance and explainable algorithm, but it is fairly complex and demands high computational power. On the other hand, Logistic Regression is relatively fast, easy to implement, and explainable, but its performance on non-linear datasets is often disappointing. As the number of features grows, Logistic Regression tends to become slower, and its performance eventually plateaus.
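
Before committing to a more complex model, it can be worth getting a quick reference point; a minimal Logistic Regression baseline sketch, assuming the TF-IDF features from the previous section:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical quick baseline on the TF-IDF features built earlier
baseline = LogisticRegression(max_iter=1000)
print(cross_val_score(baseline, features, reviews['label'], cv=3).mean())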

For this tutorial, we will be using the Light GBM Classifier!

Light Gradient Boosting Machine (Light GBM)

Light GBM is a gradient boosting framework similar to XGBoost that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:

  • Faster training speed and higher efficiency.
  • Lower memory usage.
  • Better accuracy.
  • Support of parallel and GPU learning.
  • Capable of handling large-scale data.

Light GBM is an excellent alternative to XGBoost: it is often several times faster without compromising performance, can handle large datasets, and requires little memory to run.

Let’s implement Light-GBM for Sentiment Classification:

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

target = reviews['label']
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.3)

clf = lgb.LGBMClassifier(max_depth=20,
                         n_estimators=25,
                         min_child_weight=0.0016,
                         n_jobs=-1)
clf.fit(x_train, y_train)
pred = clf.predict(x_test)

print("Test data Accuracy is :", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))

Accuracy on the Testing dataset

Classification Report

import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)
# Label the matrix for readability (assuming the alphabetical Negative/Positive class order)
cm_matrix = pd.DataFrame(data=cm,
                         index=['Actual Negative', 'Actual Positive'],
                         columns=['Predicted Negative', 'Predicted Positive'])
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()

Confusion Matrix

Industrial Use Cases

Twitter

Twitter allows businesses to engage personally with consumers. With so much data available, Twitter has developed real-time sentiment classification models to support and manage the marketing strategies of several brands. Twitter’s sentiment analysis allows companies to understand their customers, keep track of what’s being said about their brand and competitors, and discover what is trending in the market.

IBM

IBM is among the few companies using sentiment analysis to understand employee concerns, and it is also developing programs to improve the likelihood that employees will stay on the job. This helps human-resource managers figure out how workers feel about their company and where management can make changes to improve the employee experience.

Nielsen Holdings Inc.

Nielsen relies on Sentiment Analysis to discover market trends and gauge the popularity of its clients’ products. Based on sentiment trends, it also provides consultation for building marketing strategies and campaigns.

Possible Interview Questions

Sentiment analysis is one of the most common projects on beginners’ resumes, so it is important to be prepared for the questions that may follow. Some of them are:

  1. What are the steps you took to pre-process the data?
  2. Why did you perform lemmatization instead of stemming?
  3. How did you convert the text into the machine-readable or trainable format?
  4. How does Light GBM work? What are the hyperparameters involved with the Light GBM model?
  5. How can you say that your model is better, and what things can be done to improve accuracy further?

Conclusion 

We started with a brief introduction to Sentiment Analysis and why industries need it. Moving on, we applied a text preprocessing pipeline to our movie review dataset to remove redundant expressions from the text. We implemented tokenization and lemmatization to understand the context of the words used in the reviews and to merge recurring words appearing in diverse forms. Further, we performed a text exploratory analysis to understand the frequent unigrams and bigrams used in the reviews and visualized the clusters of positive, negative, and neutral words. Finally, we applied the TF-IDF vectorizer to the processed reviews, built a Light GBM model to classify the reviews, and evaluated its performance on the testing dataset. We also looked at some industrial use cases of Sentiment Analysis.

Enjoy Learning! Enjoy Thinking! Enjoy Algorithms!
