Introduction to Scikit-Learn in Machine Learning

Everyone wants a working solution, and reinventing the wheel is rarely the best choice. Instead, we adapt existing algorithms and build solutions on top of them. That’s where ML frameworks come into the picture, and Scikit-learn is one of them.

What is Scikit-Learn?

Scikit-learn is a free, open-source machine learning library for Python that provides a consistent interface for supervised and unsupervised learning. Being free makes it popular and accessible. It is built on top of NumPy and SciPy and covers most of the steps of a typical ML workflow. We will be using Scikit-learn to demonstrate and analyze different models in our curriculum, so this blog will help us get familiar with the essentials.

Note: Scikit-learn has many tools and features, but we will prioritize specific features per our requirements. For installation-related instructions, we can refer to the make system AI-enabled blog.

Scikit-learn capabilities

Key concepts covered in this blog

  • Dummy Data Availability
  • Pre-processing Modules
  • Feature Engineering Modules
  • Model Building Modules
  • Model Evaluation Modules
  • Complete Pipeline Creation Modules.

So, let’s start with Data availability.

Data Availability

Data is the heart of machine learning, and both the data and the purpose of using ML are specific to each user. Still, learning machine learning has a lot to do with understanding how models behave on different datasets. Data collection, cleaning, and labeling are expensive processes, and if every learner had to do them first, it would be a hurdle for beginners who might not have any data.

To help us begin the journey, Scikit-learn ships with a set of freely available datasets that can be imported directly. They fall into different categories, and one of them is the toy datasets. Some of the standard ones in this category are:

  • Fisher-Iris dataset: This is the most common dataset where Sepal/Petal length/width is recorded for three flower species.

Scikit-learn's Iris dataset visualization

  • Hand-digit recognition: Image dataset of the handwritten digits.

Scikit-learn's Handwritten digit recognition dataset

  • The breast cancer dataset: A dataset where the presence of malignant or benign types of cells is marked as per different features. 

Scikit-learn's breast cancer dataset

  • The wine recognition dataset: A dataset where each wine is labeled with one of three classes (cultivars) based on the results of a chemical analysis.

Refer here to check out all the datasets in this category. Apart from the toy dataset, there are some real-world datasets incorporated in this library which comprise:

  • Olivetti faces dataset from AT&T: Dataset containing face images taken in 1992–1994. 
  • Newsgroups text: A textual dataset for text-related tasks.

Similar datasets prepared from real-world scenarios can be found here. Scikit-learn even lets users generate random data to suit the requirements of testing a developed model.

As we know about the datasets, let’s quickly see how we can load them for our use.

from sklearn.datasets import load_iris
iris = load_iris()
X,y = iris.data,iris.target
features = iris.feature_names
labels = iris.target_names
print('Available Features :',features)
print('Categories :',labels)
print(len(X))
print(len(y[y==0]))
print(len(y[y==1]))
print(len(y[y==2]))

Print results of above code snippet

Load and View Sample data

The above code shows how to load and view the attributes of a sample dataset (iris flower) from sklearn.datasets. It prints the total number of samples and the number of samples in each category. Use the command dir(sklearn.datasets) to check all the loaders and generators provided by this package. It also offers the option to generate entirely new data.

import sklearn.datasets              # import the submodule explicitly so it is available on any version
print(dir(sklearn.datasets))

  • The functions make_moons or make_circles from sklearn.datasets can be used to generate 2-dimensional datasets that can be used for either clustering or classification models;
  • make_classification can be used to generate datasets for classification models with any number of features and output classes;
  • make_regression can generate datasets for fitting regression models with any number of input features and informative features for generating output by a linear model.
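The three generators listed above can be tried out as follows. This is a minimal sketch; the sample counts, noise levels, and feature counts are values we picked only for illustration.

```python
from sklearn.datasets import make_moons, make_classification, make_regression

# Two interleaving half-circles in 2-D with binary labels (0 and 1)
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=0)

# Classification data: 5 features and 3 classes (illustrative values)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, n_informative=3,
                                   n_redundant=0, n_classes=3, random_state=0)

# Regression data: 4 input features, of which 2 actually drive the target
X_reg, y_reg = make_regression(n_samples=200, n_features=4, n_informative=2,
                               noise=5.0, random_state=0)

print(X_moons.shape, sorted(set(y_moons.tolist())))
print(X_clf.shape, sorted(set(y_clf.tolist())))
print(X_reg.shape, y_reg.shape)
```

Fixing random_state makes the generated data reproducible, which is handy when comparing models on the same synthetic dataset.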

Apart from these, other datasets can be loaded with functions such as load_svmlight_file, fetch_openml, etc. Datasets are never perfect, and we need to extract meaning from them. This calls for data preprocessing, and Scikit-learn provides many built-in modules with which we can analyze and preprocess data. So let’s have a look.

Pre-processing

The preprocessing stage of ML deals with obtaining data in a trainable format for the machine learning model. This requires,

  • Selecting appropriate values for missing data
  • Obtaining numerical values for categorical data,
  • Scaling attributes to improve training speed or accuracy.

Let’s see in detail,

Pre-processing: Imputing missing values

The sklearn library provides options to fill the missing values in a dataset. A common approach is to replace a missing value with the mean, median, or mode of that attribute. More advanced imputers estimate each missing value from the other features (for example, KNNImputer). However, in this introductory section, we will only see the use of a simple imputer to replace missing values.

import numpy as np
from sklearn.impute import SimpleImputer
data = np.array([[19, 18, np.nan, 26],                       #np.nan works on all NumPy versions (np.NaN was removed in NumPy 2.0)
                 [85, 53, 76, 45],
                 [83, 97,  1, np.nan],
                 [73, 28, 38, 37],
                 [87, np.nan, 86, 66],
                 [23, 28, 11, 10]])

print('Original Data :')
print(data)                                                  #Check Data before imputing          
print(np.isnan(data).any())                                  #Check presence of missing value
imp = SimpleImputer(strategy = 'median')                     #Define Imputer with strategy (mean/median/most_frequent)
data_new = imp.fit_transform(data)                           #Transform data as per the strategy
print('New Data :')
print(data_new)                                              #Check Data after imputing  

Data imputation using scikit-learn

Change the strategy to ‘mean’ or ‘most_frequent’ to replace each missing value with the mean or mode of its attribute (column).
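To see how the strategies differ, here is a small sketch on a single hypothetical column whose known values are 1, 2, 2, and 7. The column values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a missing last entry; the known values are 1, 2, 2, 7
col = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(strategy=strategy)
    filled[strategy] = float(imp.fit_transform(col)[-1, 0])   # value written into the gap
    print(strategy, "->", filled[strategy])

# mean -> 3.0, median -> 2.0, most_frequent -> 2.0
```

The mean is pulled up by the large value 7, while the median and mode are not, which is one reason to prefer the median when a column contains extreme values.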

Pre-processing: Label Encoder

At times, the available dataset has attributes whose values are non-numerical yet informative. Most ML models, however, work only with numbers, so these values cannot be processed directly. This is when a label encoder comes in handy: it replaces each non-numerical category with an integer code.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
path = 'D:/EnjoyAlgorithm/PlayTennis.csv'
PlayTennis = pd.read_csv(path, header = 0, skiprows = 0)        #Loading the Text Dataset
print ("Dataset Length: ", len(PlayTennis)) 
print ("Dataset Shape: ", PlayTennis.shape) 
print(PlayTennis)                                               #Before processing

Le = LabelEncoder()
for label in PlayTennis.columns:
    PlayTennis[label] = Le.fit_transform(PlayTennis[label])
print(PlayTennis)                                               #After processing

Label encoder of scikit-learn

For example, in the above PlayTennis dataset, the LabelEncoder assigned a numerical value to each non-numerical data entry (say ‘overcast’=0, ‘rainy’=1, ‘sunny’=2). The processed data is now suitable for a machine learning model.
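Since the snippet above depends on a local CSV file, here is a self-contained sketch of the same idea. The rows are hypothetical, written in the spirit of the PlayTennis data, and keeping one encoder per column is our own choice so that the codes can be decoded later.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical columns (not the contents of the original CSV file)
columns = {
    "outlook": ["sunny", "overcast", "rainy", "sunny"],
    "play": ["no", "yes", "yes", "no"],
}

encoders, encoded = {}, {}
for name, values in columns.items():
    le = LabelEncoder()                      # one encoder per column
    encoded[name] = le.fit_transform(values)
    encoders[name] = le                      # keep it so the codes can be decoded

print(encoders["outlook"].classes_.tolist())  # position in this list = assigned code
print(encoded["outlook"].tolist())

# Map the integer codes back to the original strings
decoded = encoders["outlook"].inverse_transform(encoded["outlook"])
print(decoded.tolist())
```

The categories are sorted alphabetically before codes are assigned, which is why ‘overcast’ gets 0 and ‘sunny’ gets 2. These integers imply an order that may not exist in the data, so for input features a one-hot style encoding is often a safer choice.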

Pre-processing: Scaling Dataset

In many real-world datasets, the attributes lie in different ranges. This can be a problem for an ML model, as attributes with larger magnitudes can be weighted more (or less, depending on the algorithm). For example, suppose a hiring manager has to propose a salary for an individual, and the only inputs are samples of previous wages and years of work experience. With a distance-based algorithm such as KNN, salary will outweigh work experience, because work experience varies in the range of 0–70 while salary ranges from thousands to crores. Thus, a scaler is required to give both attributes equal weight.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
# define data ---> [salary ($), work ex(yrs)]
data = np.array([[3000, 1],
                 [3300, 2],
                 [4500, 2],
                 [3800, 1],
                 [4800, 3],
                 [5000, 5]])
print(data)

scaler = MinMaxScaler()                  # define an object of the MinMaxScaler class

new_data = scaler.fit_transform(data)    # fit and transform the data
print(new_data)

Scaling the features

There are other scalers, such as MaxAbsScaler and StandardScaler, which can serve problem-specific purposes.
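As a quick sketch, the same salary and work-experience data can be passed through StandardScaler and MaxAbsScaler to compare their effect. The checks printed at the end are our own addition.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

# Same [salary ($), work ex (yrs)] samples as in the MinMaxScaler example
data = np.array([[3000, 1], [3300, 2], [4500, 2],
                 [3800, 1], [4800, 3], [5000, 5]], dtype=float)

# StandardScaler: each column ends up with zero mean and unit standard deviation
standardized = StandardScaler().fit_transform(data)

# MaxAbsScaler: each column is divided by its largest absolute value
max_abs_scaled = MaxAbsScaler().fit_transform(data)

print(standardized.mean(axis=0).round(6), standardized.std(axis=0).round(6))
print(max_abs_scaled.max(axis=0))
```

MaxAbsScaler does not shift the data, so zeros stay zeros, which makes it a common choice for sparse inputs.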

Feature Engineering

Feature engineering is the process of preparing suitable inputs from the information available. In short, we take whatever information we have for our problem and turn it into numbers that can be used to build our feature matrix. This provides input well suited to machine learning algorithms.

Feature Engineering: Vectorization

Vectorization can expand any feature with a finite set of discrete values into one column per value. This step helps an ML model learn the individual importance of each category in the attribute.

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
data = [
    {'price': 1125000, 'rooms': 4, 'State': 'New York'},
    {'price': 1000000, 'rooms': 3, 'State': 'California'},
    {'price': 750000, 'rooms': 3, 'State': 'Washington'},
    {'price': 800000, 'rooms': 2, 'State': 'California'},
    {'price': 850000, 'rooms': 2, 'State': 'New York'},
]
new_data = vec.fit_transform(data)
print(new_data)



'''
Output = 
[[      0       1       0 1125000       4]
 [      1       0       0 1000000       3]
 [      0       0       1  750000       3]
 [      1       0       0  800000       2]
 [      0       1       0  850000       2]]
'''

Vectorization of the above data expanded the feature ‘State’ into one column for each discrete category present in the dataset (California, New York, and Washington, in that order). This new representation is more meaningful to our machine learning algorithms.

Feature Engineering: Dimensionality Reduction

Dimensionality reduction is the projection of high-dimensional data to a low-dimensional space while retaining as much variance as possible. Datasets may have a large number of attributes, and some may be redundant for the objective. Dimensionality reduction techniques can replace such attributes with a smaller set of new attributes derived from them. PCA achieves this by exploiting the correlation among features. Let’s see how to use Scikit-learn to reduce the dimensionality from 3 to 2.

from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
#____CREATE RANDOM CLASSIFICATION DATASET____
X, y = datasets.make_classification(n_samples=300, n_features=3, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3,0.2], random_state=42)
pca = PCA(n_components = 2,svd_solver = 'randomized')
X_fitted = pca.fit_transform(X)          # fit PCA once and project X onto the 2 components

print(("Explained Variance: %s") % (pca.explained_variance_ratio_))
#_____PLOT ORIGINAL DATA_____#
fig = plt.figure()

#ax = fig.add_subplot(111,projection = '3d')
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(xs = X[:,0], ys = X[:,1], zs = X[:,2], c=y)

ax.set_title("Original 3-featured data")
ax.set_xlabel("X0")
ax.set_ylabel("X1")
ax.set_zlabel("X2")
plt.show()
#_____PLOT REDUCED DIMENSION DATA_____#
fig, ax = plt.subplots(figsize=(9, 6))
plt.title("Reduced 2-featured data")
plt.xlabel("X_fitted_0", fontsize=20)
plt.ylabel("X_fitted_1", fontsize=20)

plt.scatter(X_fitted[:,0], X_fitted[:,1], s=50, c=y)
plt.show()

Dimensionality reduction using PCA of scikit-learn

As we can see, there is a reduction in the dimension from 3 → 2. Yes, there will be some information loss, but we try to retain maximum information in PCA.
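To put a number on the information loss, one option is to look at the share of variance kept by the retained components and at the error made when mapping the reduced data back to the original space. The sketch below uses the iris dataset (4 → 2) rather than the synthetic data above, purely to keep it short and reproducible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                         # 150 samples, 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance kept by the 2 retained components
retained = float(pca.explained_variance_ratio_.sum())

# Map the reduced data back to 4-D and measure what was lost
X_restored = pca.inverse_transform(X_reduced)
reconstruction_mse = float(np.mean((X - X_restored) ** 2))

print(X_reduced.shape, round(retained, 4), reconstruction_mse)
```

A retained share close to 1 together with a small reconstruction error suggests that the dropped dimensions carried little information.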

Now we are ready to apply machine learning algorithms to the prepared dataset and build our machine learning model. Let’s see what Scikit-learn provides here.

Building Machine Learning Model

The sklearn library provides several machine learning models, grouped by type (linear models, tree-based, SVM-based, ensemble-based, etc.). The snippet below shows how some standard algorithms are imported. Check out the complete list here.

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

The general paradigm for Scikit-learn is

  1. Import ML model and create an instance of it.
  2. Fit training data into the model.
  3. Use the fitted model to predict.

#EX1: Creating and deploying a Supervised Learning Model
model = DecisionTreeClassifier()                 # Create an instance of the Decision Tree Classifier
model = model.fit(X_train,y_train)               # Fit the training data into the model
model.predict(X_test)                            # Use model to make prediction

#EX2: Creating and deploying a Dimensionality Reduction Model
pca = PCA(n_components = 2)                      # Create an instance of the PCA
X_transformed_data = pca.fit_transform(X_data)   # Fit and transform the data to new dimensions
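
The EX1 fragment above assumes that X_train, y_train, and X_test already exist. As a minimal sketch, here is one way to make it self-contained, assuming the iris dataset and a 70/30 split (both are our choices for illustration).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the data and hold out 30% of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)  # 1. create an instance
model.fit(X_train, y_train)                      # 2. fit it on the training data
y_pred = model.predict(X_test)                   # 3. predict on unseen data

print(y_pred[:5], y_test[:5])
```

The same three steps apply to almost every estimator in Scikit-learn, so swapping DecisionTreeClassifier for another classifier usually changes only the line that creates the instance.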

Scikit-learn all models (Source: Scikit-learn.org)

Once the model is built, we need to evaluate its performance. Scikit-learn provides a wide range of modules to evaluate our models.

Model Evaluation

At the beginner level, we are expected to understand different ML models and their performance on similar data. Based on the numbers, we decide which model should be finalized.

Evaluation of the trained model can be done in three simple steps,

  • Import the desired metric
  • Use a trained model to predict the output of test data
  • Compute performance using the test data

#EX1: Evaluating the Model performance using R2-score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_pred = model.predict(X_test)    
r2_score(y_test, y_pred)
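
To see what these metrics return, here is a small sketch on hand-written values, so no trained model is needed. The numbers are made up for illustration, and accuracy_score is included as an assumed counterpart for classification tasks.

```python
from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_error, accuracy_score)

# Regression: true targets and hypothetical predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|     -> 0.5
mse = mean_squared_error(y_true, y_pred)    # mean of error ** 2  -> 0.375
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot, about 0.949
print(mae, mse, r2)

# Classification: fraction of labels predicted correctly (3 of 4 here)
acc = accuracy_score([0, 1, 2, 2], [0, 1, 1, 2])
print(acc)
```

All of these functions take the true values first and the predictions second, which matters for asymmetric metrics such as r2_score.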

Building a complete Machine Learning Pipeline

So far, we have seen ways of extracting trainable data from raw data. This includes imputing missing values, transforming or scaling data, then using a model to train (fit) and predict outcomes. The same can be done in an organized, sequential way. A pipeline is a sequential application of transformations followed by a final model, allowing the processing and evaluation of a model from end to end. And guess what, Scikit-learn helps us there as well.

We will build an end-to-end pipeline using sklearn on the ‘iris’ flower dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import numpy as np
iris=load_iris()

iris_data = iris.data.copy()
iris_target = iris.target
#print('Iris Data before replacing samples with NaN',iris_data)
c = 10
mask = np.ones(iris_data.shape)
mask.ravel()[np.random.choice(mask.size, c, replace=False)] = 0

#print(np.where(mask==0))                   #Checking the (c=10) locations where number is replaced by NaN
iris_data[mask==0] = np.nan
#print('Iris Data after replacing samples with NaN',iris_data)
X_train,X_test,y_train,y_test=train_test_split(iris_data,iris_target,test_size=0.3,random_state=42)
pipeline=Pipeline([('Imputer',SimpleImputer(strategy='mean')),('Scalar',StandardScaler()),
                     ('PCA',PCA(n_components=2)),('SVC',SVC(kernel = 'linear'))])


model = pipeline.fit(X_train, y_train)

print('SVM performance on Iris Classification',model.score(X_test,y_test))



#To view the data in any intermediate stage of the pipeline 
imputer_output = model.named_steps["Imputer"].transform(X_train)
scalar_output = model.named_steps["Scalar"].transform(imputer_output)
pca_output = model.named_steps["PCA"].transform(scalar_output)
model_output = model.named_steps["SVC"].predict(pca_output)




#Example output (the exact score varies with the random NaN positions):
#SVM performance on Iris Classification 0.9333333333333333

To build the pipeline, we have to import the Pipeline class from sklearn.pipeline. The pipeline takes the different transformations that we choose to apply to our dataset. The iris dataset doesn’t have any missing values, so we randomly replace some values in the dataset with NaN (not a number). We replaced c=10 values with NaN (use np.where() to check the locations in the matrix where this replacement was done).

Now that we have the data ready, we split it into training and testing datasets. Sklearn provides the function train_test_split, which can split the data into the desired fractions.
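As a small sketch of what the split produces, the snippet below checks the resulting shapes and class counts. The stratify=y argument is our addition (the pipeline code above does not use it); it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)            # 150 samples, 50 per class

# 70% for training and 30% for testing, with class proportions preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)           # (105, 4) (45, 4)
print(np.bincount(y_train).tolist(), np.bincount(y_test).tolist())
```

Without stratify, the split is purely random, so the class counts in the test set can be slightly uneven.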

The next step is building the pipeline. The pipeline takes a list of tuples as input. In each tuple, the element at index 0 is the desired name for the step, and the element at index 1 is the transformation to be applied. The pipeline consists of the following steps:

Imputer → StandardScaler → PCA → SVM

  • The imputer handles the missing values as per the strategy. Feel free to change the strategy from ‘mean’ to ‘median’ or ‘most_frequent’ and check the results.
  • The StandardScaler transforms each feature to zero mean and unit standard deviation. You can try other scalers such as MinMaxScaler, MaxAbsScaler, etc.
  • PCA reduces the dimensionality (4 → 2) in this dataset.
  • Finally, the output from the PCA is fed into the Support Vector Classifier model.

We can then fit the pipeline on the training dataset and compute the accuracy on the test dataset. To view the output of any intermediate step, use named_steps["transformation_name"] as shown in the code above. This lets us see the intermediate results and understand how the pipeline is working.
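Two further conveniences are worth a short sketch: a pipeline can be cross-validated as a single estimator, and it can be sliced to obtain the data just before the final model. The example below is a simplified variant of the pipeline above (no imputer, since the unmodified iris data has no missing values), and the step names are our own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA(n_components=2)),
                 ("svc", SVC(kernel="linear"))])

# Each fold refits the scaler and PCA on its own training part only
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# Slicing returns a sub-pipeline with every step except the last one
pipe.fit(X, y)
features_before_svc = pipe[:-1].transform(X)
print(features_before_svc.shape)
```

Slicing with pipe[:-1] avoids calling each step by name when all we want is the input that reaches the final model.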

Conclusion

In this article, we have given an overview of how Scikit-learn plays an essential role at every stage of Machine Learning. We discussed the datasets available, preprocessing data support, feature engineering modules, fitting the desired model, and then learned about pipeline formation. Scikit-learn is a huge package, and what we have covered here is a tiny but valuable part of getting started. The content follows the stages of machine learning in the order we encounter them, so we can explore each one in more depth later.

© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.