Movie Review Sentiment Analysis Using Machine Learning

Sentiment Analysis is a technique that uses Natural Language Processing (NLP), Text Mining, and Computational Linguistics to identify and extract the emotions present in the text. It has become increasingly valuable in today's digital age, as the proliferation of reviews, blogs, ratings, and feedback on the internet has created a wealth of information for businesses looking to understand their customers, identify new opportunities, and manage their reputation.

This technique has a wide range of applications and is used by many different industries, such as Market Research, Customer Feedback, Brand Monitoring, Employee Engagement, and Social Media Monitoring. By analyzing the emotions expressed in customer feedback, for example, businesses can gain insight into how their products or services are perceived and make improvements accordingly.

Key takeaways from this blog

In this blog, we will explore the following topics:

  • How are different industries using sentiment analysis?
  • Data analysis of the IMDB movie review dataset.
  • The various steps involved in text processing or data processing, such as tokenization, lemmatization, word embedding, and tf-idf.
  • Building a Light GBM model to predict positive and negative reviews.
  • Real-world examples of sentiment analysis in use by companies such as Twitter and IBM.
  • Potential interview questions related to this project.

Traditionally, sentiment classification involves a multi-step process that includes organizing text data and understanding customer emotions. However, with the arrival of deep learning, sentiment analysis has been revolutionized. The introduction of advanced techniques such as Transformers and Transfer Learning has made it possible to quickly build models for sentiment classification.

While the new deep-learning approaches have greatly simplified the process, it is still beneficial to have a basic understanding of sentiment classification. This understanding can help to fine-tune and improve the model, as well as provide a deeper understanding of customer sentiment.

Let’s build a model for classifying the sentiments using the conventional approach!

Data Analysis

In this tutorial, we will be using Kaggle’s IMDB movie review dataset for demonstration. It contains more than 40,000 reviews with sentiment labels, and most of the reviews run to more than 200 words.

Let’s load the dataset!

import pandas as pd
imdb_reviews = pd.read_csv('train.csv')
imdb_reviews.head()
                    TEXT                                 |  LABEL
---------------------------------------------------------------------
0   grew up (b. 1965) watching and loving the Th...      |     0
1   When I put this movie in my DVD player, and sa...    |     0
2   Why do people who do not know what a particula...    |     0
3   Even though I have great interest in Biblical...     |     0
4   Im a die hard Dads Army fan and nothing will e...    |     1

Text Preprocessing

It is important to clean text data before applying machine learning models because models cannot work directly with raw, unstructured text. To prepare the text data, we will create a text preprocessing pipeline that performs the following operations on our movie review corpus:

  1. Converting the text to lowercase
  2. Removing any URLs from the text
  3. Removing punctuation marks from the text
  4. Removing common words (stopwords) from the text
  5. Correcting any misspelt words

from textblob import TextBlob
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
stopwords = set(stopwords.words('english'))

def text_preprocessing_pipeline(corpus):
  corpus['text'] = corpus['text'].str.lower()
  corpus['text'] = corpus['text'].str.replace(r"http\S+", "", regex=True)
  corpus['text'] = corpus['text'].str.replace('[^A-Za-z0-9]+', ' ', regex=True)
  corpus['text'] = corpus['text'].apply(lambda words: ' '.join(word for word in words.split() if word not in stopwords))
  # Note: TextBlob's correct() is accurate but slow on large corpora
  corpus['text'] = corpus['text'].apply(lambda x: str(TextBlob(x).correct()))
  return corpus

reviews = text_preprocessing_pipeline(imdb_reviews)
reviews.head()
                    TEXT                                 |  LABEL
---------------------------------------------------------------------
0   grew b 1965 watching loving thunderbirds mates...    |     0
1   put movie dvd player sat coke chips expectatio...    |     0
2   people know particular time past like feel nee...    |     0
3   even though great interest biblical movies bor...    |     0
4   im die hard dads army fan nothing ever change ...    |     1

Tokenization and Lemmatization

Tokenization

Tokenization is the process of breaking down a sentence into individual words, known as tokens. These tokens are used to understand the context of the sentence and to create a vocabulary. Tokenization is achieved by separating the words in a sentence using spaces or punctuation marks. This process helps to make the text more structured, which makes it easier for machine learning models to understand and analyze the data.

           Text
"The cat sat on the mat."
            |
           \|/
          Tokens
"the", "cat", "sat", "on", "the", "mat", "."

Lemmatization

Lemmatization is a process that helps to reduce a word to its most basic root form. It uses linguistic analysis to determine the root form of a word, and it is necessary to have a comprehensive dictionary for the algorithm to reference in order to link the word form to its root. This process can help to improve the accuracy and performance of machine learning models by reducing the number of variations of a word and making the text more structured.

Studying        Lemmatization           Study
Studies     ---------------------->     Study
Study                                   Study
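
As a quick sanity check, here is a minimal sketch of NLTK's WordNetLemmatizer reproducing the mapping above (note that verb forms need an explicit part-of-speech hint):

from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')  # lexical database used by the lemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('studies'))            # study
print(lemmatizer.lemmatize('studying', pos='v'))  # study (verbs need pos='v')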

Applying tokenization and lemmatization to our clean movie reviews:

import nltk
nltk.download('wordnet')
nltk.download('punkt')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace, then reduce each token to its lemma
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

reviews['lemmatized_tokens'] = reviews['text'].apply(lemmatize_text)
reviews.head()

Movie reviews with their lemmatized tokens

Now, we have a clean dataset ready for exploratory data analysis.

Text Exploratory Analysis

Having removed the stopwords during preprocessing, we are now interested in the words that appear most frequently in the reviews. Let’s find those words!

import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Flatten the per-review token lists into one list of tokens
lemmatized_tokens = list(reviews["lemmatized_tokens"])
token_list = list(itertools.chain(*lemmatized_tokens))

# Count tokens and keep the 30 most common words
counts_no = collections.Counter(token_list)
clean_reviews = pd.DataFrame(counts_no.most_common(30),
                             columns=['words', 'count'])

fig, ax = plt.subplots(figsize=(12, 8))
clean_reviews.sort_values(by='count').plot.barh(x='words',
                         y='count',
                         ax=ax,
                         color="purple")
ax.set_title("Most Frequently used words in Reviews")
plt.show()

Frequency of different words present in the IMDB movie review dataset used for sentiment analysis

Since our dataset contains movie reviews, the resultant word frequency plot is pretty intuitive.

Bigrams

A bigram is a sequence of two adjacent elements from a string of tokens, typically letters, syllables, or words. Let’s also check the highly frequent bigrams in our data.

# Pair each token with its successor to form bigrams
bigrams = zip(token_list, token_list[1:])
counts_no = collections.Counter(bigrams)
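
Following the same pattern as the unigram chart above, a minimal sketch to chart the 30 most common bigrams could look like this (the variable names here are mine, not from the original code):

# Sketch: bar chart of the 30 most common bigrams
bigram_counts = pd.DataFrame(counts_no.most_common(30),
                             columns=['bigram', 'count'])
bigram_counts['bigram'] = bigram_counts['bigram'].apply(' '.join)  # tuples -> readable labels
fig, ax = plt.subplots(figsize=(12, 8))
bigram_counts.sort_values(by='count').plot.barh(x='bigram',
                         y='count',
                         ax=ax,
                         color="purple")
ax.set_title("Most Frequently used Bigrams in Reviews")
plt.show()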

Bigram plot of the IMDB movie review dataset used for sentiment analysis

Almost all the above bigrams make sense in our data. We could go further with trigrams, but that would not be as informative as these bigrams and unigrams.

Visualization of Sentimental Words

Let’s visualize the most practical words representing positive or negative sentiment in reviews.

import scattertext as st
from IPython.display import IFrame

# Labels are 0/1 in this dataset; map them to readable category names first
sample = reviews.iloc[:2000, :].copy()
sample['label'] = sample['label'].map({0: "Negative", 1: "Positive"})

corpus = st.CorpusFromPandas(sample,
                             category_col='label',
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences).build()

html = st.produce_scattertext_explorer(corpus,
                         category="Positive",
                         category_name='Positive',
                         not_category_name='Negative',
                         minimum_term_frequency=5,
                         width_in_pixels=1000,
                         transform=st.Scalers.log_scale_standardize)
file_name = 'Sentimental Words Visualization.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width=1000, height=700)

Scatter plot for words corresponding to various sentiments present in the IMDB movie review dataset

Let’s quickly summarise our findings:

  • The red cluster represents words used mostly in positive reviews; the farther a word sits from the yellow band, the stronger its positive sentimental context.
  • On the contrary, the blue cluster represents words that appear mostly in negative reviews; the farther a word sits from the yellow band, the stronger its negative sentimental context.
  • The thin yellow-shaded band represents neutral words.
  • Words on the extreme right appear more frequently in the reviews than those on the extreme left.

Word Embeddings

Word embedding is a technique used to represent words as numerical vectors. This method encodes words in real-valued vectors, such that words with similar meaning and context are located close to each other in the vector space. In other words, word embeddings connect the way humans understand language to the way machines understand it. They are critical for solving natural language processing (NLP) tasks, as they provide a way for machines to understand the meaning and context of words in a text.

Man ---------------> Woman
 |                     |
 |                     |
 |                     |
 |                     |
King ---------------> Queen
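
As an illustration of this famous king-queen analogy, here is a minimal sketch using gensim's pretrained GloVe vectors (the model name 'glove-wiki-gigaword-50' and the expected similarity score are illustrative assumptions, not part of the original pipeline):

import gensim.downloader as api

# Download and load small pretrained GloVe vectors on first run
glove = api.load('glove-wiki-gigaword-50')

# Vector arithmetic: king - man + woman should land near "queen"
print(glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# Expected output: something like [('queen', 0.86)]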

There are several methods available for producing word embeddings, but their main idea is the same: to capture as much contextual and semantic information as possible. Choosing the best word embedding method often requires experimentation and can be a difficult task.

Some popular and straightforward methods for creating vector representations of words include:

  • Word2Vec
  • GloVe
  • Bag-of-words
  • TF-IDF
  • ELMo (Embeddings from Language Models)

In this blog, we will keep ourselves confined to the TF-IDF Vectorizer.

TF-IDF Vectorizer

TF-IDF is short for "Term Frequency and Inverse Document Frequency". It is commonly used to transform text into a meaningful representation of numeric vectors. Originally an information retrieval method, it relies on Term Frequency (TF) and Inverse Document Frequency (IDF) to measure how important a word is to a document.

Term Frequency (TF) tracks the occurrence of words in a document; Inverse Document Frequency (IDF) assigns a weightage to each word in the corpus. The IDF weightage is high for infrequently appearing words and low for frequent words. This allows us to detect how important a word is to a document.
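
To make this concrete, here is a minimal sketch applying scikit-learn's TfidfVectorizer to a toy two-document corpus (the toy sentences are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the movie was great", "the movie was terrible"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(toy_corpus)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(tfidf_matrix.toarray().round(2))     # per-document TF-IDF weights

Words appearing in both documents ("the", "movie", "was") receive a lower IDF weight than "great" and "terrible", which each appear in only one document.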

Let’s implement TF-IDF on our movie reviews:

from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 2000 most frequent terms as features
tfidf_converter = TfidfVectorizer(max_features=2000)
features = tfidf_converter.fit_transform(reviews['text']).toarray()

Model Building

We are ready to build our Sentiment Classification model, but first, we must select a supervised classification model that satisfies our requirements.

We have several algorithms for classification tasks, each with its own pros and cons. One algorithm may produce superior results compared to others but be much harder to explain. Even when explainability is not compromised, deploying such complex algorithms can be tedious. In other words, there is a trade-off between performance, model complexity, and model explainability. The ideal algorithm would be explainable, reliable, and easy to deploy, but there is no such thing as a perfect algorithm.

For example, XGBoost offers high performance and reasonable explainability, but it is quite complex and requires high computational power. Logistic Regression, on the other hand, is relatively fast, simple to implement, and explainable, but its performance on non-linear datasets is considerably weaker. As the number of features in the dataset increases, Logistic Regression also tends to become slower and less accurate.

For this blog, we will be using the Light GBM Classifier!

Light Gradient Boosting Machine (Light GBM)

Light GBM is a gradient-boosting framework that is similar to XGBoost and utilizes tree-based learning algorithms. It is designed to be distributed and efficient, with the following benefits:

  • Faster training speed and increased efficiency
  • Lower memory usage
  • Improved accuracy
  • Support for parallel and GPU learning
  • Capable of handling large-scale data

Light GBM is an excellent alternative to XGBoost as it is roughly six times faster than XGBoost without compromising performance. It can handle large datasets and requires low memory to operate.

Let’s implement Light-GBM for Sentiment Classification:

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

target = reviews['label']
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.3)

clf = lgb.LGBMClassifier(max_depth=20,
                         n_estimators=25,
                         min_child_weight=0.0016,
                         n_jobs=-1)
clf.fit(x_train, y_train)
pred = clf.predict(x_test)

print("Test data Accuracy is : ", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
#############
Test data Accuracy is : 0.816916666666

Accuracy and classification report of the sentiment analysis model on the IMDB movie review test set

Confusion Matrix

import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)
# Label the matrix for readability
cm_matrix = pd.DataFrame(data=cm,
                         index=['Actual Negative', 'Actual Positive'],
                         columns=['Predicted Negative', 'Predicted Positive'])
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()


Confusion matrix of the Light GBM sentiment classifier

Industrial Use Cases of Sentiment Analysis

Twitter

Twitter allows businesses to engage personally with consumers by using real-time sentiment classification models to support and manage the marketing strategies of several brands. With so much data available, sentiment analysis on Twitter enables companies to understand their customers, keep track of what is being said about their brand and competitors, and discover new market trends.

IBM

IBM is one of the few companies that uses sentiment analysis to understand employee concerns. They are also developing programs to improve employees' likelihood of staying on the job. This helps human-resource managers figure out how workers feel about their company and where management can make changes to improve the experience of their employees.

Nielsen Holdings Inc.

Nielsen relies on sentiment analysis to discover market trends and gauge the popularity of its clients' products. Based on these sentiment trends, it also provides consultation for building marketing strategies and campaigns.

Possible Interview Questions

Sentiment analysis projects commonly appear on beginners' resumes, so it is important to be prepared for potential questions on this topic, such as:

  • What steps did you take to preprocess the data?
  • Why did you choose to perform lemmatization instead of stemming?
  • How did you convert the text into a machine-readable or trainable format?
  • Can you explain how Light GBM works, and what hyperparameters are involved in the Light GBM model?
  • How do you know that your model is better, and what can be done to improve its accuracy further?

Conclusion

We started with a brief introduction to sentiment analysis and why industries require it. Moving on, we applied a text preprocessing pipeline to our movie review dataset to remove redundant expressions from the text. We implemented tokenization and lemmatization to understand the context of the words used in the reviews and to limit recurring words appearing in diverse forms. Further, we performed a text exploratory analysis to understand the frequent unigrams and bigrams used in the reviews and visualized the clusters of positive, negative, and neutral words present in the reviews.

Finally, we applied the TF-IDF vectorizer to the processed reviews, built a Light GBM model to classify the reviews, and evaluated the performance on the testing dataset. We also looked at some industrial use cases of Sentiment analysis.

Enjoy Learning, Enjoy Algorithms!
