Wine Quality Prediction Using k-NN Regressor

Machine learning models can tackle tasks that are hard to solve by hand. One such task is predicting the quality of wine from quantitative measurements. Judging the quality of wine manually is difficult; even professional wine tasters have an accuracy of only 71%.

Earning the title of wine taster is quite an involved process. The Master Sommelier’s Diploma exam is the world’s most challenging wine-tasting examination, and only 200 people have passed since the exam’s inception 40 years ago. With the advancements in machine learning and artificial intelligence, predicting wine quality is a matter of minutes if we have all the required parameters.

Key takeaways from this blog

After reading this blog, we will be able to answer or discuss the following:

  • Why do we need machine learning models to solve the problem of wine quality assessment?
  • What are the factors that affect wine quality?
  • Which machine learning models can be used to predict wine quality?
  • Some of the industry-based applications of the k-NN algorithm.
  • Possible interview questions on this project.

Let’s take a look at the available wine parameters.

  • Fixed acidity: Nonvolatile acids in the wine (they do not evaporate readily)
  • Volatile acidity: The amount of acetic acid in the wine
  • Citric acid: Adds flavour to wine and is found in small quantities
  • Residual sugar: Sugar content after fermentation stops
  • Chlorides: Residual salt in the wine
  • Free sulfur dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  • Total sulfur dioxide: Amount of free and bound forms of SO2
  • Density: The mass per unit volume of the wine
  • pH: Describes how acidic or basic a substance is on a scale from 0 to 14
  • Sulphates: A wine additive that can contribute to sulfur dioxide gas (SO2) levels
  • Alcohol: Percentage of alcohol content in the wine
  • Quality: Score between 0 and 10

Now that we have a basic understanding of the parameters, let’s dive into the data analysis.

Exploratory Data Analysis

In this blog, we will use the Kaggle Red Wine Quality Dataset. It contains 1,599 rows, each describing one red wine sample. This dataset is interesting because the problem can be interpreted in two ways:

  • Regression: If we consider the target variable as a continuous variable.
  • Classification: If we consider the target variable as a discrete variable.

We will confine ourselves to treating this problem as a regression task.

Let’s load the data and take a look!

import pandas as pd
wine_quality = pd.read_csv('winequality-red.csv')
wine_quality.head(5)

Red wine quality dataset features representation

As we can see, almost all the parameters have float data types except for Quality, which is also our target variable. Since all the independent features are continuous, we could learn something from the distributions. A distribution plot depicts the variation in the data distribution. Let’s plot the distribution for each variable using the Seaborn library.

import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=[20, 10])
cols = wine_quality.columns
cnt = 1
for col in cols:
    plt.subplot(4, 3, cnt)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(wine_quality[col], kde=True, stat='density', color='red', edgecolor='k', linewidth=1)
    cnt += 1
plt.tight_layout()
plt.show()

Density distribution with respect to features in red wine dataset

Most distributions are approximately normal, while others have some skewness. The wine quality scores 5 and 6 are more frequent than others.
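
To put a number on the skewness seen in the plots, pandas offers `DataFrame.skew()`. The sketch below uses a small synthetic frame; the column names and values are made up for illustration and are not taken from the wine data. On the real data, one would call `wine_quality.skew()` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for this sketch, not the wine dataset):
# one roughly symmetric column and one right-skewed column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric_feature": rng.normal(loc=10.0, scale=1.0, size=2000),
    "skewed_feature": rng.exponential(scale=1.0, size=2000),
})

# Sample skewness per column: values near 0 suggest a symmetric distribution,
# positive values suggest a long right tail.
skewness = demo.skew().sort_values(ascending=False)
print(skewness)
```

Columns with a large absolute skewness are the ones that deviate most from the bell shape.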

Interdependency of parameters 

We also need to understand how the parameters depend on each other. A correlation plot helps visualize such dependencies. Let’s plot the correlation heat map to understand them!

# h_neg and h_pos are hues, documented to lie in the range [0, 359]
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(wine_quality.corr(), cmap=cmap, center=0, square=True)
plt.show()

Correlation of features represented via heatmap

From the above correlation heat map, we can infer that wine quality is positively correlated with alcohol content and sulphates. On the contrary, wine quality has a considerable negative correlation with volatile acidity. It is reasonable that a lower level of volatile acidity is favored in quality tests.
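
The same reading can be done numerically by ranking features by their correlation with the target. The following is a minimal sketch on synthetic data; `correlation_with_target` is a hypothetical helper and the column names are made up. On the wine data, the call would presumably be `correlation_with_target(wine_quality, 'quality')`.

```python
import numpy as np
import pandas as pd

def correlation_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every other column with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    # Order by absolute strength but keep the sign of each correlation
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic stand-in data (not the wine dataset): the target rises with
# feature_up, falls with feature_down, and ignores feature_noise.
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "feature_up": rng.normal(size=n),
    "feature_down": rng.normal(size=n),
    "feature_noise": rng.normal(size=n),
})
demo["target"] = (2.0 * demo["feature_up"]
                  - 1.0 * demo["feature_down"]
                  + rng.normal(scale=0.5, size=n))

ranked = correlation_with_target(demo, "target")
print(ranked)
```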

Let’s confirm the relationship mentioned above using the reg-plots.

fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20,5))
cols = ['volatile acidity', 'alcohol', 'sulphates']
for col, ax in zip(cols, axs.flat):
    sns.regplot(x = wine_quality[col],y = wine_quality["quality"], color = 'purple', ax=ax)

Quality of red wine represented with respect to features

Wine quality has a clear negative relationship with volatile acidity and a positive relationship with alcohol and sulphates. Let’s dive deeper by visualizing the dependency between wine quality and each numeric independent feature.

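fig = plt.figure(figsize=[20, 15])
# Plot only the independent features; quality goes on the x-axis
cols = wine_quality.columns.drop('quality')
cnt = 1
for col in cols:
    plt.subplot(4, 3, cnt)
    sns.boxplot(x="quality", y=col, data=wine_quality, palette="coolwarm")
    cnt = cnt + 1
plt.tight_layout()
plt.show()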

How to plot the boxplot of features in a dataset?

Let’s summarise our findings from the above boxplots

  • Highly rated wines have comparatively higher alcohol, citric acid, and sulphates.
  • On the contrary, wines with high volatile acidity, density, and pH are low in quality.
  • Wine quality has no significant relationship with total sulfur dioxide, free sulfur dioxide, chlorides, residual sugar, and fixed acidity.

Model Building

Assessing a wine manually is tedious and requires an experienced practitioner to evaluate the quality. We will address this problem by building a regression model that takes the wine parameters as input and returns a predicted quality score. Through this approach, we aim to eliminate the manual tasting and scoring process. To accomplish this task, we must select a regression algorithm that satisfies our requirements.

Following are some regression algorithms that can be used for predicting red wine quality.

  • Linear Regression
  • Decision Tree Regressor
  • Support Vector Regressor
  • k-NN Regressor
  • Random Forest Regressor

Linear models are relatively simple and explainable, but they perform poorly on data containing outliers. They also struggle on nonlinear datasets. In such cases, nonlinear regression algorithms such as Random Forest Regressor and XGBoost Regressor fit the data better.

Which algorithm is best suited for our use case?

We don’t have significant outliers in this data, so we can use either linear or more complex models. However, the model should have the following qualities:

  • Simple and explainable
  • Provides accurate predictions
  • Robust to concept drift and outliers

Keeping all the points mentioned above in mind, we need to select a regression model. For this tutorial, we will be using the k-NN Regressor.

K Nearest Neighbors Regressor

The k-NN (K Nearest Neighbors) algorithm is a supervised machine learning algorithm that can solve both classification and regression tasks. It was extensively used in statistical estimation and pattern recognition during the early 1970s. The k-NN algorithm uses feature similarity to predict the value of a new data point: it estimates the value as the average of the target values of the K closest training points. Closeness is most commonly measured with the Euclidean distance.
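
To make the idea concrete, here is a minimal from-scratch sketch of k-NN regression using NumPy. The function name `knn_regress` and the toy data are made up for illustration; the scikit-learn implementation used later in this blog is more general and more efficient.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a value for x_query as the mean target of its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy data, made up for illustration
X_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y_toy = [5.0, 6.0, 7.0, 3.0]

# The three nearest points to (0.2, 0.2) are the first three rows,
# so the prediction is the mean of 5.0, 6.0 and 7.0.
prediction = knn_regress(X_toy, y_toy, x_query=[0.2, 0.2], k=3)
print(prediction)
```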

What is KNN algorithm in Machine Learning?

Before implementing the k-NN Regressor, we need to scale the features, as this algorithm works best when all features are on a comparable scale. The distance between a pair of samples is influenced by the measurement units of each feature, so a feature with a large numeric range can dominate the result. To avoid this, we should normalize the data before applying k-NN.
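
The effect of units on distance can be seen with a small sketch. The numbers below are made up: the first feature spans 0 to 100 and the second spans 0 to 1. Before scaling, a change in the large-range feature dominates the distance; after min-max scaling (fitted on the training samples only), the ordering flips.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training samples with two features on very different scales
X_train_demo = np.array([[0.0, 0.0], [100.0, 1.0], [50.0, 0.5]])

a = np.array([[10.0, 0.0]])
b = np.array([[10.0, 1.0]])  # differs from a across the full range of feature 2
c = np.array([[20.0, 0.0]])  # differs from a by a tenth of the range of feature 1

# Raw distances: c looks far from a only because of the units of feature 1
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Learn min and max from the training data, then apply the same mapping everywhere
scaler = MinMaxScaler().fit(X_train_demo)
a_s, b_s, c_s = scaler.transform(a), scaler.transform(b), scaler.transform(c)
scaled_ab = np.linalg.norm(a_s - b_s)
scaled_ac = np.linalg.norm(a_s - c_s)

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```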

In its basic form, this algorithm has only one hyperparameter, K, which is the number of samples treated as the nearest neighbors. One way to find the optimum value of K is to plot the error obtained on the test set against the candidate values of K and then choose the K with the minimum error.

Let’s implement the k-NN Regressor

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
target = wine_quality['quality']
features = wine_quality.drop('quality', axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.3)
# Fit the scaler on the training set only, then reuse it on the test set
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = pd.DataFrame(scaler.fit_transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))
# Record the test RMSE for every K from 1 to 74
k_values = list(range(1, 75))
rms_error = []
for K in k_values:
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    error = np.sqrt(mean_squared_error(Y_test, pred))  # RMSE
    rms_error.append(error)
# Plot RMSE against K and label the K with the lowest error
best_idx = int(np.argmin(rms_error))
fig, ax = plt.subplots(figsize=[8, 5])
ax.plot(k_values, rms_error)
ax.annotate(f'K = {k_values[best_idx]}', xy=(k_values[best_idx], rms_error[best_idx]))
plt.xlabel('K - Values')
plt.ylabel('RMSE Error')
plt.show()

RMSE error for KNN model on red wine dataset

We found the optimum model at K = 27, where the RMSE is minimum (the exact value can vary with the random train/test split). Our dataset was relatively small, which made it feasible to search for the optimum K. As the dataset grows, the k-NN algorithm slows down quickly, which is a limitation of this algorithm.
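
Picking K against the test set, as in the loop above, can make the reported error slightly optimistic. A common alternative, shown here as a sketch rather than as the method used in this blog, is to choose K by cross-validation on the training data. The example uses scikit-learn's `GridSearchCV` with a scaling pipeline on synthetic data; the data, the range of K, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the wine dataset)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsRegressor()),
])

# Try K = 1..30 with 5-fold cross-validation, scored by (negative) RMSE
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(1, 31))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_k, best_rmse)
```

With this setup, the test set is touched only once, to report the error of the final model.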

Pros and Cons of KNN Regressor

Pros:

  • Intuitive and Simple
  • Only one hyper-parameter to tune
  • Efficient method for small datasets.
  • Constantly Improves with new training data
  • It can be used for both Classification & Regression

Cons:

  • Requires feature scaling
  • Sensitive to Outliers
  • No Capability to handle missing values
  • Requires higher space complexity as well as higher time complexity
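
The time cost comes from comparing each query with every stored sample. scikit-learn's `KNeighborsRegressor` exposes an `algorithm` parameter (`'brute'`, `'kd_tree'`, `'ball_tree'`, or `'auto'`) that can swap the exhaustive search for a tree-based index, which often helps on low-dimensional data. The sketch below, on made-up data, only checks that both search strategies return the same predictions; it does not measure speed.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: 2000 samples with 4 features
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(2000, 4))
y_demo = X_demo.sum(axis=1)
X_query = rng.uniform(size=(10, 4))

# Exhaustive search: every query is compared with all 2000 samples
brute = KNeighborsRegressor(n_neighbors=5, algorithm="brute").fit(X_demo, y_demo)
# KD-tree index: built once at fit time, then used to prune the search
tree = KNeighborsRegressor(n_neighbors=5, algorithm="kd_tree").fit(X_demo, y_demo)

pred_brute = brute.predict(X_query)
pred_tree = tree.predict(X_query)
print(np.allclose(pred_brute, pred_tree))
```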

Application of K Nearest Neighbors Algorithm

HEINEKEN N.V

Heineken is the second-largest producer of beer in the world. However, they also own Zoetermeer Winery for wine production. Heineken relies on regression analysis to keep track of the quality of the wine. As discussed above, even professional wine tasters are just 71% accurate in determining the quality of wine.

GRAMMARLY, INC. 

Grammarly is a cross-platform, cloud-based writing assistant that reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes. Grammarly relies on the KNN classification algorithm for categorising similar sentences and textual documents.

NETFLIX

Netflix uses the KNN algorithm to group shows with similar content. Netflix also uses KNN as a recommendation engine to recommend similar items: it compares the sets of users who like each item, and when largely the same set of users likes two different items, the items are considered similar.

Possible Interview Questions

Based on this project, the following questions can be asked in any machine learning interview:

  1. Is this problem statement a classification or regression problem?
  2. What modifications do we need to change regression to classification and vice-versa?
  3. Why k-NN? How did you decide the value of k?
  4. Why is k-NN very slow? Can we improve the performance of k-NN?
  5. What is the evaluation metric for your model? Did you compare your results with existing techniques?

Conclusion

We started with a brief introduction to wine quality tasting and the problems with the manual wine-tasting approach. Moving on, we discussed the impact of each parameter and started the data analysis. Based on the correlation heat map, we found the most significant parameters. We further confirmed their impact on wine quality using boxplots and regplots. Finally, we built a K-Nearest Neighbors regression model to predict the quality of wine and looked at the pros and cons of using the k-NN Regressor model. We could also have approached this problem as a multiclass classification task.

Next Blog: Introduction to Naive Bayes Algorithm

Enjoy Learning, Enjoy Algorithms!
