Data and its interpretation are helping many industries across the world. Machine learning models built on collected data are solving some challenging, previously unsolved tasks. One such task is predicting the quality of wine along with a quantitative measure of it. Judging the quality of wine manually is a tough task; even professional wine tasters have an accuracy of only 71%. Earning the title of wine taster is quite an involved process. The Master Sommelier’s Diploma exam is the world’s most challenging wine tasting examination, and only 200 people have passed it since the exam’s inception 40 years ago. With the current advancements in machine learning and artificial intelligence, predicting the quality of wine takes only minutes if we have all the required parameters.
After reading this blog, we will have gained insight into:
Now that we have a basic understanding of the parameters, let’s dive into the data analysis.
In this blog, we are going to use the Kaggle Red Wine Quality Dataset. It contains 1,599 rows, each describing one red wine sample. This dataset is interesting because the problem can be interpreted in two ways: as a regression task or as a multiclass classification task.
We will confine ourselves to approaching this problem as a regression task.
Let’s load the data and take a look!
import pandas as pd

# Load the dataset and preview the first five rows
wine_quality = pd.read_csv('winequality-red.csv')
wine_quality.head(5)
As we can see, all the parameters have the float data type except quality, which is an integer and is also our target variable. Since all the independent features are continuous, we can learn something from their distributions. A distribution plot shows how the values of a variable are spread. Let’s plot the distribution of each variable using the Seaborn library.
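import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=[20, 10])
cols = wine_quality.columns
# One subplot per column (12 columns fit a 4 x 3 grid)
for cnt, col in enumerate(cols, start=1):
    plt.subplot(4, 3, cnt)
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(wine_quality[col], kde=True, color='blue', edgecolor='k', linewidth=1)
plt.tight_layout()
plt.show()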
The majority of the distributions are approximately normal, while others have some skewness. The wine quality scores 5 and 6 are more frequent than others.
We also need to understand how the parameters depend on one another. A correlation plot helps visualize such dependencies. Let’s plot the correlation heat map!
# Seaborn documents hues in [0, 359], so 500 is written as 140 (500 - 360)
cmap = sns.diverging_palette(140, 10, as_cmap=True)
sns.heatmap(wine_quality.corr(), cmap=cmap, center=0, square=True)
plt.show()
From the above correlation heat map, we can infer that wine quality is positively correlated with alcohol content and sulphates. In contrast, volatile acidity has a considerable negative correlation with wine quality. It is reasonable that a lower level of volatile acidity is favored in quality testing.
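To make the numbers behind a correlation heat map concrete, here is a minimal sketch of the Pearson correlation coefficient, which is the default method of pandas `DataFrame.corr()`. The `pearson` helper and the sample values are illustrative assumptions, not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT rows from the wine dataset
alcohol = [9.0, 9.5, 10.0, 11.0, 12.0]
volatile_acidity = [0.80, 0.70, 0.60, 0.45, 0.30]
quality = [4, 5, 5, 6, 7]

print("alcohol vs quality:", round(pearson(alcohol, quality), 3))
print("volatile acidity vs quality:", round(pearson(volatile_acidity, quality), 3))
```

A value close to +1 means the two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.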
Let’s confirm the above-mentioned relationship using the reg-plots.
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))
cols = ['volatile acidity', 'alcohol', 'sulphates']
# One regression plot of quality against each selected feature
for col, ax in zip(cols, axs.flat):
    sns.regplot(x=wine_quality[col], y=wine_quality["quality"], color='purple', ax=ax)
plt.show()
Wine quality has a confirmed negative relationship with volatile acidity and a positive relationship with alcohol and sulphates. Let’s dive deeper by visualizing the dependency between wine quality and each numeric variable of interest (the independent features).
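fig = plt.figure(figsize=[20, 10])
# Box plot of each feature per quality score
cols = wine_quality.columns.drop('quality')
for cnt, col in enumerate(cols, start=1):
    plt.subplot(4, 3, cnt)
    sns.boxplot(x="quality", y=col, data=wine_quality, palette="coolwarm")
plt.tight_layout()
plt.show()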
Let’s summarise our findings from the above boxplots
Assessing a wine manually is a tedious task and requires an experienced practitioner to evaluate the quality. We will address this problem by building a regression model that takes the wine parameters as input and returns a predicted quality score. Through this approach, we aim to eliminate the manual tasting and scoring process. To accomplish this, we need to select a regression algorithm that meets our requirements.
The following are some regression algorithms that can be used for predicting red wine quality.
Linear models are relatively simple and explainable, but they perform poorly on data containing outliers and on non-linear datasets. In such cases, non-linear regression algorithms such as Random Forest Regressor and XGBoost Regressor fit the data better.
Which algorithm is best suited for our use case?
We don’t have significant outliers in this data, indicating that we can use linear as well as complex models. However, the model should have the following qualities:
Keeping all the above-mentioned points in mind, we need to select a regression model. For this tutorial, we will be using the k-NN Regressor.
The k-NN (k-nearest neighbors) algorithm is a supervised machine learning algorithm that can solve both classification and regression tasks. It was already in use for statistical estimation and pattern recognition in the early 1970s. The k-NN algorithm uses feature similarity to predict the value of a new data point: for regression, it estimates the value by averaging the target values of the K closest training points in feature space. The most commonly used measure of distance between two data points is the Euclidean distance.
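The idea can be sketched in a few lines of plain Python. This is a minimal from-scratch illustration, not the scikit-learn implementation: the `knn_predict` helper and the toy data are assumptions made for the example. The Euclidean distance between points p and q is sqrt(sum_i (p_i - q_i)^2), which `math.dist` computes.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Predict a value for `query` as the mean target of its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Sort by distance and keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(y for _, y in nearest) / k

# Made-up toy data, NOT from the wine dataset: (alcohol, volatile acidity) -> quality
train_X = [(9.0, 0.80), (9.5, 0.70), (10.0, 0.60), (11.0, 0.45), (12.0, 0.30)]
train_y = [4, 5, 5, 6, 7]

prediction = knn_predict(train_X, train_y, query=(10.9, 0.45), k=3)
print(prediction)  # mean quality of the 3 nearest samples: (6 + 5 + 7) / 3 = 6.0
```

Note that there is no training step: the model simply stores the training data and does all the work at prediction time.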
Before implementing the k-NN Regressor, we need to scale the features, because the algorithm works best when all features are on a comparable scale. The distance between a pair of samples is influenced by the measurement units of each feature, so a feature with a large numeric range would dominate the distance. To avoid this, we should normalize the data before implementing k-NN.
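The effect of scaling on distances can be seen in a small sketch. The helper names and the four samples below are hypothetical; the rescaling formula, (value - min) / (max - min) per column, is the standard min-max normalization.

```python
import math

def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean distance) to rows[0], excluding rows[0] itself."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Made-up samples: (total sulfur dioxide, pH). The first feature has a far larger range.
samples = [(34.0, 3.00), (40.0, 3.90), (60.0, 3.05), (150.0, 3.50)]

raw_neighbor = nearest_to_first(samples)
scaled_neighbor = nearest_to_first(min_max_scale(samples))
print("nearest neighbor on raw data:   sample", raw_neighbor)
print("nearest neighbor on scaled data: sample", scaled_neighbor)
```

On the raw values, the large-range feature decides which sample counts as nearest; after scaling, the big difference in pH is no longer ignored and a different sample becomes the nearest neighbor.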
The main hyperparameter of this algorithm is K, the number of samples treated as the nearest neighbors. One way to find a good value of K is to plot the error obtained on held-out data against the candidate values of K and then choose the K with the minimum error.
Let’s implement the k-NN Regressor
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

target = wine_quality['quality']
features = wine_quality.drop('quality', axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.3)

# Fit the scaler on the training set only, then reuse it on the test set
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = pd.DataFrame(scaler.fit_transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))

k_values = list(range(1, 75))
rms_error = []
for K in k_values:
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    # RMSE is the square root of the mean squared error
    rms_error.append(np.sqrt(mean_squared_error(Y_test, pred)))

fig, ax = plt.subplots(figsize=[8, 5])
ax.plot(k_values, rms_error)
# Mark the K with the lowest RMSE (stands in for the annot_optimum helper, which is not defined here)
best_k = k_values[int(np.argmin(rms_error))]
ax.axvline(best_k, color='red', linestyle='--', label=f'K = {best_k}')
ax.legend()
plt.xlabel('K - Values')
plt.ylabel('RMSE Error')
plt.show()
In our run, we found the optimum model at K = 27, where the RMS error is minimum. Because the train/test split is random, the exact value can differ between runs. Our dataset was quite small, which made this search over K cheap; as the dataset grows, the k-NN algorithm slows down considerably, which is a limitation of this algorithm.
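Picking K by the error on the test set can make the reported error look better than it would be on truly unseen data. A common alternative is k-fold cross-validation on the training data. The sketch below shows the idea in plain Python on synthetic data; the helper names, the fold scheme, and the data are all assumptions made for illustration, and in practice scikit-learn's cross-validation utilities would usually be used instead.

```python
import math
import random

def knn_predict(train_X, train_y, query, k):
    """Mean target of the k training points closest to `query`."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def cross_validated_rmse(X, y, k, n_folds=5):
    """Average validation RMSE of k-NN over n_folds folds.

    Fold i holds out every n_folds-th sample starting at index i.
    """
    scores = []
    for fold in range(n_folds):
        val_idx = set(range(fold, len(X), n_folds))
        train_X = [x for i, x in enumerate(X) if i not in val_idx]
        train_y = [t for i, t in enumerate(y) if i not in val_idx]
        val_X = [X[i] for i in sorted(val_idx)]
        val_y = [y[i] for i in sorted(val_idx)]
        preds = [knn_predict(train_X, train_y, q, k) for q in val_X]
        scores.append(rmse(val_y, preds))
    return sum(scores) / n_folds

# Synthetic data, NOT the wine dataset: two features in [0, 1], noisy linear target
rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(200)]
y = [3 * a - 2 * b + rng.gauss(0, 0.3) for a, b in X]

cv_errors = {k: cross_validated_rmse(X, y, k) for k in range(1, 31)}
best_k = min(cv_errors, key=cv_errors.get)
print("best K:", best_k, "cross-validated RMSE:", round(cv_errors[best_k], 3))
```

With this approach the test set is touched only once, at the very end, to report the error of the chosen K.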
Heineken is the second-largest producer of beer in the world. However, it also owns the Zoetermeer Winery for wine production. Heineken relies on regression analysis to keep track of the quality of its wine. As discussed above, even professional wine tasters are only 71% accurate in determining the quality of wine.
Grammarly is a cross-platform cloud-based writing assistant that reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes. For categorizing similar sentences and textual documents, Grammarly relies on the KNN classification algorithm.
Netflix uses the KNN algorithm to group similar shows based on their content. Netflix also uses KNN in its recommendation engine to recommend similar items: it compares the sets of users who like each item, and when similar sets of users like two different items, the items themselves are considered similar.
Based on this project, the following questions can be asked in any machine learning interview:
We started with a brief introduction to wine quality tasting and the problems with the manual approach. Moving on, we discussed the impact of each parameter and started the data analysis. Based on the correlation heat map, we found the most significant parameters. We further confirmed their impact on wine quality using box plots and regression plots. Finally, we built a k-nearest neighbors regression model to predict the quality of wine and looked at the pros and cons of using the k-NN Regressor. We could also have approached this problem as a multiclass classification task.
Enjoy Learning, Enjoy Algorithms!