Introduction to Scikit-Learn in Machine Learning

Machine Learning and Deep Learning are rapidly gaining traction in industrial applications. Many companies are interested in incorporating these technologies into their business, and developers have created frameworks to make this process easier. Instead of starting from scratch, it is often more efficient to use existing algorithms and modify them to fit specific needs. This is where Machine Learning frameworks come in, providing a foundation for building solutions without the need to rewrite all of the underlying logic. Scikit-Learn is a popular framework in the field of Machine Learning, and we will be discussing it in this blog.

What is Scikit-Learn?

Scikit-learn is a free, open-source machine learning library for Python that provides an interface for both supervised and unsupervised learning. It is built on top of the SciPy library and offers a range of features for all machine learning needs. In this blog, we will use Scikit-learn to explore and evaluate various machine learning models as part of our curriculum. The purpose of this blog is to help you become familiar with the key concepts you need to know about Scikit-learn.

Note: Scikit-learn has many tools and features, but we will prioritize specific features per our requirements. For installation-related instructions, refer to our blog on making your system AI-enabled.

What are the different supports provided by Scikit-learn framework in Machine Learning?

Key concepts covered in this blog

  • Dummy Data Availability
  • Pre-processing Modules
  • Feature Engineering Modules
  • Model Building Modules
  • Model Evaluation Modules
  • Complete Pipeline Creation Modules

So, let's start with Data availability.

Data Availability in Scikit-learn

Data is the heart of machine learning, and both the data and the purpose of using ML are user-specific. Still, learning Machine Learning has a lot to do with understanding how models behave on different datasets. Data collection, cleaning, and labeling are expensive processes, and if every beginner had to go through them first, simply getting data would become a hurdle.

To begin the journey in Machine Learning, the Scikit-learn library provides a large set of freely available datasets that can be directly imported. These datasets fall into different categories, one of which is the toy datasets. Some of the standard ones in this category are:

  • Fisher-Iris dataset: This is the most common dataset, where sepal/petal length and width are recorded for three flower species (setosa, versicolor, and virginica).
  • Hand-digit recognition: An image dataset of handwritten digits (similar to the MNIST samples used for handwritten digit recognition tasks).
  • The breast cancer dataset: A dataset where cells are marked as malignant or benign based on different features.
  • The wine recognition dataset: A dataset where the quality of wine is marked according to different attributes.

Refer here to check out all the datasets in this category. Apart from the toy datasets, this library also incorporates some real-world datasets, which include:

  • Olivetti faces dataset from AT&T: Dataset containing face images from 1992–1994.
  • Newsgroups text: A textual dataset for text-related tasks.

Similar datasets prepared from real-world scenarios can be found here. Scikit-learn even allows users to generate random data based on their requirements for testing the developed model.

As we know about the datasets, let's quickly see how we can load them.

from sklearn.datasets import load_iris
iris = load_iris()
X,y = iris.data,iris.target
features = iris.feature_names
labels = iris.target_names
print('Available Features :',features)
print('Categories :',labels)
print(len(X))
print(len(y[y==0]))
print(len(y[y==1]))
print(len(y[y==2]))

'''
Available Features : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Categories : ['setosa' 'versicolor' 'virginica']
150
50
50
50
'''

Load and View Sample data using Scikit-learn

The above code shows how to load and view the attributes of a sample dataset (iris flower) from sklearn.datasets. The total number of samples in each category is also printed. Use the command dir(sklearn.datasets) to check all the datasets provided by this package. It also offers the option to generate entirely new data.

import sklearn
print(dir(sklearn.datasets))
  • The functions make_moons or make_circles from sklearn.datasets can be used to generate 2-dimensional datasets that can be used for either clustering or classification models;
  • make_classification can be used to generate datasets for classification models with any number of features and output classes;
  • make_regression can generate datasets for fitting regression models with any number of input features and informative features used to generate the output through a linear model (a short sketch of these generators follows below).
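
As a minimal sketch of these generators (the parameter values below are only for illustration and are not from the original example):

from sklearn.datasets import make_moons, make_classification, make_regression

# two interleaving half-circles: a simple 2-D dataset for clustering/classification
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=0)

# classification data with 5 features (3 informative) and 2 classes
X_cls, y_cls = make_classification(n_samples=500, n_features=5, n_informative=3,
                                   n_classes=2, random_state=0)

# regression data generated by a linear model on 4 features (3 informative)
X_reg, y_reg = make_regression(n_samples=500, n_features=4, n_informative=3,
                               noise=5.0, random_state=0)
print(X_moons.shape, X_cls.shape, X_reg.shape)   # (200, 2) (500, 5) (500, 4)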

Apart from these, several other datasets are provided through functions such as load_svmlight_file, fetch_openml, etc. We know that datasets are never perfect, and we need to extract meaning from them. This requires data preprocessing, and Scikit-learn provides many in-built modules to analyze and preprocess data. So let's have a look.

Pre-processing Support By Scikit-learn

The preprocessing stage of ML deals with obtaining data in a trainable format for the machine learning model. This requires:

  • Selecting appropriate values for missing data,
  • Obtaining numerical values for categorical data,
  • Scaling attributes to improve training speed or accuracy.

Let's see in detail,

Imputing missing values using Scikit-learn

The sklearn library provides options to fill in a dataset's missing values or outliers. A missing value can be replaced with the particular attribute's mean, median, or mode, and there are also more sophisticated, model-based imputation procedures. However, in this introductory section, we will only see the use of a simple imputer to replace missing values.

import numpy as np
from sklearn.impute import SimpleImputer
data = np.array([[19, 18, np.NaN, 26],
                 [85, 53, 76, 45],
                 [83, 97,  1, np.NaN],
                 [73, 28, 38, 37],
                 [87, np.NaN, 86, 66],
                 [23, 28, 11, 10]])

print('Original Data :')
print(data)                                                  #Check Data before imputing          
print(np.isnan(data).any())                                  #Check presence of missing value
imp = SimpleImputer(strategy = 'median')                     #Define Imputer with strategy (mean/median/most_frequent)
data_new = imp.fit_transform(data)                           #Transform data as per the strategy
print('New Data :')
print(data_new)                                              #Check Data after imputing  

'''
Original Data :
[[19. 18. nan 26.]
 [85. 53. 76. 45.]
 [83. 97.  1. nan]
 [73. 28. 38. 37.]
 [87. nan 86. 66.]
 [23. 28. 11. 10.]]
True
New Data :
[[19. 18. 38. 26.]
 [85. 53. 76. 45.]
 [83. 97.  1. 37.]
 [73. 28. 38. 37.]
 [87. 28. 86. 66.]
 [23. 28. 11. 10.]]
'''

Change the strategy to 'mean' or 'most_frequent' to replace the missing values with the attribute's (column's) mean or mode.
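
For instance, here is a minimal sketch of the same idea, reusing the data array defined above (the variable names come from the example above; the strategies shown are standard SimpleImputer options):

imp_mean = SimpleImputer(strategy = 'mean')             # replace NaNs with the column mean
print(imp_mean.fit_transform(data))

imp_mode = SimpleImputer(strategy = 'most_frequent')    # replace NaNs with the column mode
print(imp_mode.fit_transform(data))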

Label Encoder using Scikit-learn

At times, the available dataset has attributes with non-numerical yet informative values. But computers do not understand anything except numbers, so such values cannot be processed directly by ML models. This is when a label encoder comes in handy: it replaces non-numerical categories with numerical labels.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
path = 'D:/EnjoyAlgorithm/PlayTennis.csv'
PlayTennis = pd.read_csv(path, header = 0, skiprows = 0)        #Loading the Text Dataset
print ("Dataset Length: ", len(PlayTennis)) 
print ("Dataset Shape: ", PlayTennis.shape) 
print(PlayTennis)                                               #Before processing

Le = LabelEncoder()
for label in PlayTennis.columns:
    PlayTennis[label] = Le.fit_transform(PlayTennis[label])
print(PlayTennis)                                               #After processing

How to do the label encoding for the categorical variables using scikit-learn?

For example, in the above PlayTennis dataset, the LabelEncoder assigned a numerical value to each non-numerical data entry (say 'overcast' = 0, 'rainy' = 1, 'sunny' = 2). The processed data is now suitable for a machine learning model.
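
To check which numerical label was assigned to each category, the encoder's classes_ attribute and inverse_transform can be inspected. Here is a minimal sketch on a single hypothetical column (the values mirror an 'outlook'-style attribute, not the exact file used above):

outlook = ['sunny', 'overcast', 'rainy', 'sunny', 'rainy']
le = LabelEncoder()
encoded = le.fit_transform(outlook)
print(encoded)                        # [2 0 1 2 1]
print(le.classes_)                    # ['overcast' 'rainy' 'sunny'] --> labels 0, 1, 2
print(le.inverse_transform([0, 2]))   # ['overcast' 'sunny']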

Scaling Dataset using Scikit-learn

In many real-world datasets, the attributes are in different ranges. This can be a problem for an ML model, as attributes with larger numerical values can be given more (or less, depending on the algorithm) weight. For example, suppose a hiring manager has to develop a plan to propose the salary for an individual, and the only inputs are samples with previous salaries and the number of years of work experience. Using an algorithm such as KNN, the salary attribute, being in a higher range, will outweigh the work experience: work experience varies in the range of 0–70, but salary numbers range from thousands to crores. Thus, a scaler is required to assign equal importance to both.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
# define data ---> [salary ($), work ex(yrs)]
data = np.array([[3000, 1],
				[3300, 2],
				[4500, 2],
				[3800, 1],
				[4800, 3],
				[5000, 5]])
print(data)

scaler = MinMaxScaler()                  # define an object of the MinMaxScaler class

new_data = scaler.fit_transform(data)    # fit and transform the data
print(new_data)

'''
Original
[[3000    1]
 [3300    2]
 [4500    2]
 [3800    1]
 [4800    3]
 [5000    5]]

Scaled
[[0.   0.  ]
 [0.15 0.25]
 [0.75 0.25]
 [0.4  0.  ]
 [0.9  0.5 ]
 [1.   1.  ]]
'''

There can be different scalers, such as MaxAbsScaler and StandardScaler, which can serve problem-specific purposes.
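
As a quick sketch, StandardScaler transforms each column to zero mean and unit variance; it is assumed here that the salary/work-experience data array from the example above is still in scope:

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()              # zero mean, unit variance per column
print(std_scaler.fit_transform(data))      # standardized salary and work-experience columns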

Feature Engineering using Scikit-learn

Feature engineering is the process of turning the available raw inputs into proper, processable inputs. In short, we take whatever information we have for our problem and turn it into numbers that can be used to build our feature matrix. This provides input well-suited to machine learning algorithms.

Vectorization using Scikit-learn

Vectorization can expand any particular feature of the input that has a finite set of discrete possibilities into separate columns. This step helps an ML model learn the individual importance of each category in the attribute.

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
data = [
    {'price': 1125000, 'rooms': 4, 'State': 'New York'},
    {'price': 1000000, 'rooms': 3, 'State': 'California'},
    {'price': 750000, 'rooms': 3, 'State': 'Washington'},
    {'price': 800000, 'rooms': 2, 'State': 'California'},
    {'price': 850000, 'rooms': 2, 'State': 'New York'},
]
new_data = vec.fit_transform(data)
print(new_data)



'''
Output = 
[[      0       1       0 1125000       4]
 [      1       0       0 1000000       3]
 [      0       0       1  750000       3]
 [      1       0       0  800000       2]
 [      0       1       0  850000       2]]
'''

Vectorization of the above data expanded the feature 'State' into the discrete categories present in the dataset. This new data now adds more meaning for our machine learning algorithms.

Dimensionality Reduction using Scikit-learn

Dimensionality reduction is the projection of high-dimensional data to a low-dimensional space while retaining maximum variance. Datasets may have a significantly high number of attributes, and some of them may be redundant for the objective. Dimensionality reduction techniques can help remove such attributes and generate new attributes from them. PCA achieves dimensionality reduction by observing the correlation among features. Let's see how to use Scikit-learn to reduce the dimensionality from 3 to 2.

from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
#____CREATE RANDOM CLASSIFICATION DATASET____
X, y = datasets.make_classification(n_samples=300, n_features=3, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3,0.2], random_state=42)
pca = PCA(n_components = 2, svd_solver = 'randomized')
X_fitted = pca.fit_transform(X)             #Fit PCA and project the data to 2 dimensions

print("Explained Variance: %s" % (pca.explained_variance_ratio_))
#_____PLOT ORIGINAL DATA_____#
fig = plt.figure()

ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(xs = X[:,0], ys = X[:,1], zs = X[:,2], c=y)

ax.set_title("Original 3-featured data")
ax.set_xlabel("X0")
ax.set_ylabel("X1")
ax.set_zlabel("X2")
plt.show()
#_____PLOT REDUCED DIMENSION DATA_____#
fig, ax = plt.subplots(figsize=(9, 6))
plt.title("Reduced 2-featured data")
plt.xlabel("X_fitted_0", fontsize=20)
plt.ylabel("X_fitted_1", fontsize=20)

plt.scatter(X_fitted[:,0], X_fitted[:,1], s=50, c=y)
plt.show()

How to reduce the dimensionality in dataset using Scikit-learn?

As we can see, there is a reduction in the dimension from 3 → 2. Yes, there will be some information loss, but we try to retain maximum information in PCA.

We are ready to apply machine learning algorithms to the prepared dataset and build our machine learning model. Let's see what Scikit-learn provides here.

Building Machine Learning Model using Scikit-learn

The sklearn library provides several machine learning models classified based on their type (linear models, tree-based, SVM-based, ensemble-based, etc.). Some standard algorithms and how they are imported are shown below. Check out the complete list here.

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

The general paradigm for Scikit-learn is:

  1. Import the ML model and create an instance of it.
  2. Fit training data into the model.
  3. Use the fitted model to predict.
#EX1: Creating and deploying a Supervised Learning Model
model = DecisionTreeClassifier()                 # Create an instance of the Decision Tree Classifier
model = model.fit(X_train,y_train)               # Fit the training data into the model
model.predict(X_test)                            # Use model to make prediction

#EX2: Creating and deploying a Dimensionality Reduction Model
pca = PCA(n_components = 2)                      # Create an instance of the PCA
X_transformed_data = pca.fit_transform(X_data)   # Fit and transform the data to new dimensions

Once the model is built, we need to evaluate its performance. Scikit-learn provides a wide range of modules to evaluate our models.

Model Evaluation using Scikit-learn

At the beginner level, we are expected to understand different ML models and their performance on similar data. Based on the numbers, we decide which model should be finalized.

Evaluation of the trained model can be done in a few simple steps:

  • Import the desired metric
  • Use a trained model to predict the output of test data
  • Compute performance using the test data
#EX1: Evaluating the Model performance using R2-score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_pred = model.predict(X_test)    
r2_score(y_test, y_pred)
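
The same pattern applies to classification metrics. A minimal, hypothetical sketch (it assumes a fitted classifier named model and test arrays X_test and y_test, as in the example above):

#EX2: Evaluating a classifier using accuracy and a confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))       # fraction of correctly predicted samples
print(confusion_matrix(y_test, y_pred))     # rows: true classes, columns: predicted classes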

Building a complete Machine Learning Pipeline using Scikit-learn

So far, we have seen ways of extracting trainable data from raw data. This includes imputing missing values, transforming or scaling data, and then using a model to train (fit) and predict outcomes. The same can be done in an organized, sequential way. A pipeline is a sequential application of transformations that generates a workflow, allowing the processing and evaluation of a model from end to end. And guess what? Scikit-learn helps us there as well.

We will build an end-to-end pipeline using sklearn on the 'iris' flower dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import numpy as np
iris=load_iris()

iris_data = iris.data.copy()
iris_target = iris.target
#print('Iris Data before replacing samples with NaN',iris_data)
c = 10
mask = np.ones(iris_data.shape)
mask.ravel()[np.random.choice(mask.size, c, replace=False)] = 0

#print(np.where(mask==0))                   #Checking the (c=10) locations where number is replaced by NaN
iris_data[mask==0] = np.NaN
#print('Iris Data before replacing samples with NaN',iris_data)
X_train,X_test,y_train,y_test=train_test_split(iris_data,iris_target,test_size=0.3,random_state=42)
pipeline=Pipeline([('Imputer',SimpleImputer(strategy='mean')),('Scalar',StandardScaler()),
                     ('PCA',PCA(n_components=2)),('SVC',SVC(kernel = 'linear'))])


model = pipeline.fit(X_train, y_train)

print('SVM performance on Iris Classification',model.score(X_test,y_test))



#To view the data in any intermediate stage of the pipeline 
imputer_output = model.named_steps["Imputer"].transform(X_train)
scalar_output = model.named_steps["Scalar"].transform(imputer_output)
pca_output = model.named_steps["PCA"].transform(scalar_output)
model_output = model.named_steps["SVC"].predict(pca_output)




#SVM performance on Iris Classification 0.9333333333333333

To build the pipeline, we have to import Pipeline from sklearn.pipeline. The pipeline takes as input the different transformations we choose to apply to our dataset. The iris dataset has no missing values, so we will randomly replace a sample of values from the dataset with NaN (not a number). We replaced (c=10) values with NaN (use np.where() to check the locations of the matrix where this replacement is done).

Now that we have the data ready, we will split it into training and testing datasets. Sklearn provides a function, train_test_split, that can split the data into the desired fractions.

The next step is building the pipeline. The pipeline takes its input as a list of tuples. The element indexed '0' in each tuple is the desired name for the transformation, and the element indexed '1' is the transformation to be applied. The pipeline consists of the following transformations:

Imputer → StandardScaler → PCA → SVM

  • The imputer handles the missing values as per the strategy. Feel free to change the strategy from 'mean' to 'median' or 'most_frequent' and check the results.
  • The StandardScaler transforms the data to zero mean and unit standard deviation. You can try other variations of scalers, such as MinMaxScaler, MaxAbsScaler, etc.
  • PCA reduces the dimensionality (4 → 2) of this dataset.
  • Finally, the output from the PCA is fed into the Support Vector Classifier model.

We can then fit the pipeline on the training dataset and compute the accuracy on the test dataset. To view the output of any intermediate step, use named_steps[“transformation_name”] as shown in the code above. This allows us to inspect the pipeline results at intermediate steps and understand how the pipeline is working.

Conclusion

In this article, we have given an overview of how Scikit-learn plays an essential role at every stage of Machine Learning. We discussed the datasets available, preprocessing support, feature engineering modules, and fitting the desired model, and then learned about pipeline formation. Scikit-learn is a huge package, and what we have covered here is a tiny but valuable part of getting started. The content aligns with the stages of machine learning that we will encounter and dive deeper into as we progress.

References

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.

Next Blog: Linear Regression in Machine Learning

Enjoy Learning! Enjoy Algorithms!
