Electronic mail (e-mail) has gained a huge number of users for professional, commercial, and personal communication. In 2019, the average person received 130 emails each day, and overall, 296 billion emails were sent that year.
This high demand and huge user base have led to an upsurge in unwanted emails, also known as spam emails. At times, more than 50% of all emails were spam. Even today, people lose millions of dollars to fraud every day.
However, the figure below shows that the share of such emails has decreased significantly since 2016, thanks to the evolution of software that can detect spam emails and filter them out.
Percentage of emails marked as Spam (Source: Statista)
Key takeaways from this article
- What are the different methods used to segregate the incoming emails into the spam or non-spam categories?
- Steps to implement a Spam-classifier using the k-NN algorithm.
- How to evaluate the performance of the model formed?
- Use-case of Gmail, Outlook, and Yahoo. How do these companies classify emails?
- Possible interview questions on this machine learning application.
Several techniques exist for detecting spam emails. Broadly, there are five different techniques that algorithms use to decide whether an email is spam or not.
Content-Based Filtering Technique
Algorithms analyze words, the occurrence of words, and the distribution of words and phrases inside the content of e-mails and segregate them into spam and non-spam categories.
Case-Based Spam Filtering Method
Algorithms trained on well-annotated spam/non-spam marked emails try to classify the incoming mails into two categories.
Heuristic or Rule-Based Spam Filtering Technique
Algorithms use pre-defined rules, written as regular expressions, to give a score to the messages in the emails. Based on the scores generated, they segregate emails into spam and non-spam categories.
The Previous Likeness Based Spam Filtering Technique
Algorithms extract features from the incoming emails and represent every new instance as a point in a multi-dimensional vector space. Using the k-NN algorithm, each new point is assigned to the class, spam or non-spam, of its closest known instances.
Adaptive Spam Filtering Technique
Algorithms classify the incoming emails into various groups and, based on the comparison scores of every group with the defined set of groups, segregate them into spam and non-spam emails.
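To make the heuristic or rule-based technique described above more concrete, here is a minimal sketch of regex rules that each add a score to a message. The patterns, the scores, and the threshold are all hypothetical values chosen for illustration; real filters rely on far larger, carefully tuned rule sets.

```python
import re

# Hypothetical rules: (compiled pattern, score). These are illustrative only.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.5),
    (re.compile(r"\bwin(ner)?\b", re.IGNORECASE), 2.0),
    (re.compile(r"\$\d+"), 1.0),
    (re.compile(r"click here", re.IGNORECASE), 2.5),
]
SPAM_THRESHOLD = 3.0  # assumed cut-off, not taken from any real filter


def rule_score(message: str) -> float:
    """Sum the scores of every rule whose pattern appears in the message."""
    return sum(score for pattern, score in RULES if pattern.search(message))


def is_spam(message: str) -> bool:
    """Flag the message as spam when its total score reaches the threshold."""
    return rule_score(message) >= SPAM_THRESHOLD
```

With these assumed rules, a message such as "Click here to claim your FREE prize" matches two patterns and crosses the threshold, while an ordinary meeting reminder matches none.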
This article will give an idea for implementing content-based filtering using one of the most famous algorithms for spam detection, which is K-Nearest Neighbour (KNN).
k-NN based algorithms are widely used for classification tasks. Let’s first look at the overall architecture of this implementation and then explore every step. Executing these five steps one after the other will help us implement our spam classifier smoothly.
Figure: architecture of the spam classifier, showing the training/testing phase and new email classification.
Step 1: E-mail Data Collection
The dataset used plays a crucial role in assessing the performance of any spam filter. Many open-source datasets are freely available in the public domain. The two datasets below are widely used because they contain a large number of emails.
- Enron corpus (created in 2006, with 55% spam emails)
- TREC 2007 dataset (created in 2007, with 67% spam emails)
Train/Test Split: Split the dataset into train and test sets, making sure that both sets keep a balanced number of ham and spam emails (ham is a common name for non-spam emails).
Enron Corpus Dataset on Kaggle
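One possible way to keep the ham/spam balance during the train/test split is to split each class separately. The sketch below uses only the standard library; the function name `stratified_split`, the 20% test fraction, and the toy messages are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(emails, labels, test_fraction=0.2, seed=42):
    """Split so that each label keeps roughly the same share in train and test."""
    by_label = defaultdict(list)
    for email, label in zip(emails, labels):
        by_label[label].append(email)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = int(round(len(items) * test_fraction))
        test += [(email, label) for email in items[:n_test]]
        train += [(email, label) for email in items[n_test:]]
    return train, test


# Toy data standing in for a real corpus.
emails = [f"ham message {i}" for i in range(10)] + [f"spam message {i}" for i in range(10)]
labels = ["ham"] * 10 + ["spam"] * 10
train, test = stratified_split(emails, labels)
```

If scikit-learn is available, its `train_test_split` function accepts a `stratify` argument that serves the same purpose.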
Step 2: Pre-processing of E-mail content
In this step, we mainly tokenize the emails. Tokenization is the process of breaking the content of an email into words, transforming long messages into a sequence of representative symbols called tokens. These tokens are extracted from the email body, header, subject, and images.
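As a rough illustration of tokenization, the sketch below lower-cases the text and pulls out word-like tokens with a regular expression. The pattern and the helper names `tokenize` and `tokenize_email` are assumptions; a production tokenizer would also handle headers, HTML, and other languages.

```python
import re

# Assumed token rule: runs of letters/digits, optionally followed by an
# apostrophe part (so "don't" stays one token).
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text: str) -> list[str]:
    """Lower-case the text and return its word-like tokens in order."""
    return TOKEN_PATTERN.findall(text.lower())


def tokenize_email(subject: str, body: str) -> list[str]:
    """Tokenize the subject and the body and concatenate the results."""
    return tokenize(subject) + tokenize(body)
```

For example, `tokenize_email("Hello!", "Buy now")` yields the tokens `hello`, `buy`, and `now`.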
Extracting words from images (For a simple implementation, this can be ignored)
These days, senders can attach inline images to an email, so an email may be spam because of the content of its images rather than its text. Extracting that text used to be hard, but OCR tools such as Tesseract have made it practical. Tesseract is an open-source library, originally developed at Hewlett-Packard and later sponsored by Google, that extracts words from images automatically with a certain degree of accuracy. Even so, Times New Roman and CAPTCHA words are still difficult to read automatically.
Step 3: Feature Extraction and Selection
After pre-processing, we may be left with a large number of words. We can maintain a table in which each column holds the frequency of a different word. These attributes can be grouped by importance, for example:
- Important attributes: Frequency of repeated words, number of semantic discrepancies, a bag of adult-content words, etc.
- Additional attributes: Sender account features such as sender country, IP address, email, age of sender, number of replies, number of recipients, and website address.
Note: Web addresses are converted into word format. For example, https://www.google.com/ can be converted to “HTTP google”.
This kind of processing is sometimes called normalization.
- Less important attributes: Geographical distance between sender and receiver, Sender’s date of birth, Account lifespan, Sex of sender, and Age of the recipient.
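The note above about converting web addresses into words can be sketched as follows. The exact conversion rule is not specified in this article, so this version assumes that "https" is collapsed to "http", a leading "www" is dropped, and the top-level domain is removed; the function name `url_to_words` is hypothetical.

```python
from urllib.parse import urlparse


def url_to_words(url: str) -> str:
    """Turn a URL into a short word sequence (assumed normalization rule)."""
    parsed = urlparse(url)
    # Treat http and https as the same word.
    scheme = "http" if parsed.scheme in ("http", "https") else parsed.scheme
    host = parsed.hostname or ""
    parts = host.split(".")
    if parts and parts[0] == "www":
        parts = parts[1:]  # drop the "www" prefix
    if len(parts) > 1:
        parts = parts[:-1]  # drop the top-level domain
    return " ".join([scheme] + parts).strip()
```

Under these assumptions, `url_to_words("https://www.google.com/")` gives `http google`, which matches the article's example apart from letter case.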
Keep in mind that the more attributes we use, the higher the time complexity of the model.
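A minimal sketch of the word-frequency table might look like the following, where every distinct word becomes one column. The helper names and the two toy emails are made up for illustration.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Sorted list of every distinct token seen in the corpus."""
    return sorted({token for tokens in token_lists for token in tokens})


def to_frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one email."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Two toy, already-tokenized emails.
corpus = [
    ["win", "free", "prize", "free"],
    ["meeting", "at", "noon"],
]
vocabulary = build_vocabulary(corpus)
vectors = [to_frequency_vector(tokens, vocabulary) for tokens in corpus]
```

Each new distinct word adds a column to every vector, which is why a larger attribute set makes distance computations more expensive.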
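The number of attributes can become very large, so techniques like stemming, noise removal, and stop-word removal are used to reduce it. One well-known stemming algorithm is the Porter Stemmer algorithm. In general, stemming reduces the inflected forms of a word to a common stem, for example by stripping suffixes such as "-ing", "-ed", and the plural "-s".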
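To give a feel for stop-word removal and stemming, here is a deliberately crude sketch. It is not the Porter Stemmer: it strips a single common suffix and uses a tiny, hypothetical stop-word list, so it only illustrates the idea.

```python
# Tiny illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "you"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def crude_stem(word: str) -> str:
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tokens(tokens):
    """Drop stop words, then stem whatever remains."""
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]
```

For real use, the full Porter algorithm is available in NLTK as `nltk.stem.PorterStemmer`, which handles many cases this sketch gets wrong.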
Example dataset format
Step 4: KNN (K-Nearest Neighbour) Implementation
Like the Nearest Neighbour algorithm, the K-Nearest Neighbour algorithm is used for classification. However, instead of looking at just the single nearest instance, it looks at the K instances closest to the new incoming instance and assigns the new instance to the class that occurs most often among them. The value of K is a hyperparameter that needs tuning. A common way to tune it is trial and error: try several values of K and check the model's performance for each.
To find the nearest instance, one can use the Euclidean distance. One can use the Scikit-learn library to implement the K-NN algorithm for this task.
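The idea can be sketched from scratch in a few lines: compute the Euclidean distance from the new email's vector to every training vector, keep the K closest, and take a majority vote. The two-column toy vectors, the labels, and the held-out examples below are made-up data, and the candidate K values are arbitrary.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k closest training vectors."""
    distances = sorted(
        (euclidean_distance(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy frequency vectors: [count of "free", count of "meeting"] (made-up data).
train_vectors = [[3, 0], [2, 0], [4, 1], [0, 2], [0, 3], [1, 2]]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Trial and error over K: try a few values on a small held-out set.
held_out = [([3, 1], "spam"), ([0, 1], "ham")]
for k in (1, 3, 5):
    correct = sum(
        knn_predict(train_vectors, train_labels, vector, k) == label
        for vector, label in held_out
    )
    print(f"k={k}: {correct}/{len(held_out)} correct")
```

In practice, scikit-learn's `KNeighborsClassifier` (with its `n_neighbors` parameter) provides an optimized version of the same idea.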
Step 5: Performance Analysis
Now that our algorithm is ready, we must check the performance of the model. Even a single missed important message may cause a user to reconsider the value of spam filtering, so the algorithm should be as close to 100% accurate as possible. However, some researchers feel that accuracy alone is not a sufficient evaluation parameter for spam classification.
Based on the table below (also known as the confusion matrix), we evaluate our spam-classification model on four different parameters, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives:
- Accuracy : (TP + TN)/(TP + FP + FN + TN)
- Precision : TP / (TP + FP)
- Sensitivity : TP / (TP + FN)
- Specificity : TN / (TN + FP)
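The four formulas above translate directly into code. In this sketch the confusion-matrix counts are made up, spam is assumed to be the positive class, and every denominator is assumed to be non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up counts, treating spam as the positive class.
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
```

Looking at precision and specificity alongside accuracy helps reveal how often legitimate emails are wrongly sent to the spam folder.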
More advanced algorithms are available for this classification task, but you can achieve more than 90% accuracy with a k-NN based implementation.
Gmail, Yahoo, and Outlook Case Study
Google's data centers use thousands of rules to filter spam emails. They assign weights to different parameters and filter emails based on those weights. Google’s spam classifier is said to be a state-of-the-art technique that combines optical character recognition, linear regression, and various neural networks.
Yahoo Mail is one of the earliest free webmail services and still has more than 320 million active users. It has its own filtering techniques to categorize emails. Yahoo's basic methods are URL filtering, email content analysis, and spam complaints from users. Unlike Gmail, Yahoo filters email messages by domain and not by IP address. Yahoo also gives users custom filtering options to send mail directly to the junk folder.
Outlook is a Microsoft-owned mailing platform widely used among professionals. In 2013, Microsoft renamed Hotmail and Windows Live Mail to Outlook. At present, Outlook has more than 400 million active users. Outlook has its own distinctive way of filtering every incoming email. According to its official website, it uses the following filter lists to send mail to the junk folder:
- Safe Senders list
- Safe Recipients list
- Blocked Senders list
- Blocked Top-Level Domains list
- Blocked Encodings list
Possible Interview Questions
Question 1: What is the Porter Stemmer algorithm?
Question 2: Why use the k-NN algorithm for this problem?
Question 3: Is this supervised learning or unsupervised learning?
Question 4: What are the different algorithms that can replace k-NN here?
Question 5: What steps can be taken to improve accuracy further?
Given the number of spam emails sent daily and the amount of money people lose every day to spam scams, spam filtering has become a primary need for all email-providing companies. This article discussed the complete process of spam email filtering using machine learning. We also covered one possible way of implementing our own spam classifier using one of the most famous algorithms, k-NN. Finally, we discussed the case studies of well-known services like Gmail, Outlook, and Yahoo to review how they use ML and AI techniques to filter spam.
- Emmanuel Gbenga Dada, Joseph Stephen Bassi, Machine learning for email spam filtering: review, approaches, and open research problems
- Loredana Firte, Camelia Lemnaru, Spam Detection Filter using KNN Algorithm and Resampling
- Anirudh Harisinghaney, Aman Dixit, Text and Image-Based Spam Email Classification using KNN, Naive Bayes, and Reverse DBSCAN Algorithm
We hope you enjoyed the article and will be able to implement your own spam classifier easily. Enjoy Thinking, Enjoy Machine Learning, Enjoy Algorithms!