Introduction to Scikit-Learn in Machine Learning

Machine Learning is gaining traction, and companies are looking to integrate ML solutions to enhance their business. Much of this has become possible because of the community and developer support that make the field more accessible and easy to use. One outcome of that effort is Scikit-learn, an open-source framework that serves as a foundation for ML practitioners, so they do not need to write everything from scratch.

In this article, we will discuss the basic support Scikit-learn provides for all the stages of Machine Learning model development. Like other software frameworks, it contains numerous tools and features, and it is impossible to cover everything in one blog. Here, our focus is on the specific features that a beginner should know.

Key concepts covered in this blog

We will walk through each stage of Machine Learning model development and highlight the support Scikit-learn provides at that stage:

  • Data Availability: Dummy Data Availability
  • Preprocessing: Preprocessing and Feature Engineering Modules
  • Model Development: Model Building Modules
  • Model Evaluation: Model Evaluation Modules
  • Model Deployment: Complete Pipeline Creation Modules

But before discussing all this, let's learn a bit more about Scikit-learn and how to install it.

What is Scikit-Learn?

Scikit-learn, also known as sklearn, is a Python-based, free, open-source Machine Learning library that supports tasks in data mining, data analytics, data science, and Machine Learning. It is built on top of the well-known Python packages SciPy, NumPy, and Matplotlib. As it is open-source, we can easily access its codebase and dive deeper into the code behind each feature it provides.

The official GitHub repository of the Scikit-learn library can be found here. From this repository, we can see that the library has more than 2700 contributors and 55.5k stars, which reflects its popularity in the ML community.

Installation of Scikit-learn using Pip

There is a direct command for installing Scikit-learn using Pip:
pip install -U scikit-learn
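
Once installed, we can quickly verify the setup by importing the library and printing its version. This is a minimal check; the exact version string will depend on your environment:

import sklearn
print(sklearn.__version__)     # e.g. '1.3.0' -- the number will vary with your installation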

What are the different supports provided by Scikit-learn framework in Machine Learning?

Stages in Machine Learning Model Development 

Let's think from a beginner's perspective: What would be required if someone starts their journey in the ML field?

  • Problem Statement: The first need is to define the problem statement clearly. Beginners, however, usually prefer to get hands-on with some common problem statements in the ML field. Scikit-learn covers a wide range of such problems, like flower species classification, cancer cell classification, etc.
  • Data Availability: After finalizing the problem statement, we need data corresponding to it. Data collection and labeling are costly, and we cannot expect every beginner to go through this process. Hence, the framework provides a wide range of openly available datasets so practitioners can practice smoothly. Some popular ones are the Iris dataset for flower classification, the breast cancer dataset, etc.
  • Preprocessing and Feature Engineering: Datasets available in the open domain may require some preprocessing effort. Scikit-learn supports a wide variety of tasks across supervised and unsupervised learning, data visualization, data analytics, and data science, and data preprocessing and feature engineering are part of that. Beginners can use these modules to learn how data cleaning affects ML models.
  • Model Development: After preparing the data, we want ML algorithms to find complex patterns in the dataset. Scikit-learn provides ready-made implementations of various Machine Learning algorithms, for example, Linear Regression, SVM, etc.
  • Model Evaluation: After developing ML models, we need to evaluate their performance to identify whether ML models are suffering from problems like Underfitting or Overfitting. Scikit-learn provides support for evaluating our ML models as well.
  • Model Deployment: After finalizing the model, we need to deploy it on web servers or inside software. This stage of the ML lifecycle is tricky, as many models fail to make it to production servers. Scikit-learn helps here by letting us wrap the complete ML process into a pipeline, and that code can be deployed on servers or hardware with relative ease.

Let's now learn about each of these areas in greater detail, along with code examples.

Data Availability in Scikit-learn

Data lies at the core of the Machine Learning process. In practice, data is very specific to the problem for which the ML model is being developed. But from a learning perspective, we need some readily available datasets to experiment with multiple algorithms and understand their behavior.

To begin the journey in Machine Learning, the Scikit-learn library provides a large set of freely available datasets that can be directly imported into our programs. The toy datasets are the most famous of the various dataset categories the library provides. Popular datasets available in this set are:

  • Fisher's Iris dataset: This is the most common dataset, where sepal/petal length and width are recorded for three flower species: Setosa, Versicolor, and Virginica. We will use this dataset frequently throughout our Machine Learning course on enjoyalgorithms.

What are the different classes in IRIS dataset?

  • Handwritten digit recognition: This is an image dataset of handwritten digits, consisting of 8x8 pixel images.

Samples of MNIST dataset used for handwritten digit recognition tasks

  • The breast cancer dataset: A dataset where each sample is labeled malignant or benign based on different cell features. This is a good dataset for beginners who want to apply Machine Learning to medical applications.
  • The wine recognition dataset: A dataset where wines are classified into different classes based on the results of chemical analysis.

Refer here to check out all the datasets in the toy category. Apart from the toy datasets, some real-world datasets are incorporated in the Scikit-learn library, including:

  • Olivetti faces Dataset from AT&T: Dataset containing face images from 1992–1994.
  • Newsgroups text: A textual dataset for text-related tasks.

Some other similar datasets prepared from real-world scenarios can be found here. Scikit-learn even allows users to generate random data tailored to their requirements for testing a developed model.

As we know about the datasets, let's quickly see how to load them into our programs.

from sklearn.datasets import load_iris
iris = load_iris()
X,y = iris.data,iris.target
features = iris.feature_names
labels = iris.target_names
print('Available Features :',features)
print('Categories :',labels)
print(len(X))
print(len(y[y==0]))
print(len(y[y==1]))
print(len(y[y==2]))

'''
Available Features : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Categories : ['setosa' 'versicolor' 'virginica']
150
50
50
50
'''

Load and View Sample data using Scikit-learn

The above code shows how to load a sample dataset (the iris flowers) from sklearn.datasets, view its attributes, and count the samples in each category. We can use the command dir(sklearn.datasets) to check all the datasets this package provides.

import sklearn
print(dir(sklearn.datasets))

Scikit-learn also provides the option to generate an entirely new dataset as per the requirements.

  • The functions make_moons and make_circles from sklearn.datasets generate two-dimensional datasets of two interleaving half circles or concentric circles. These datasets can later be used for classification or clustering tasks.
  • make_classification can be used to generate datasets for classification models with any number of features and output classes.
  • make_regression can generate datasets for fitting regression models, with any number of input and output features, where the outputs are produced by a linear model (see the sketch after this list).
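
As a minimal sketch (the parameter values are only illustrative), here is how these generators can be called:

from sklearn.datasets import make_moons, make_classification

# Two interleaving half circles: 200 samples with a little noise added
X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)
print(X_moons.shape, y_moons.shape)      # (200, 2) (200,)

# A synthetic classification dataset: 5 features, 3 of them informative, 3 output classes
X_clf, y_clf = make_classification(n_samples=500, n_features=5, n_informative=3,
                                   n_classes=3, random_state=42)
print(X_clf.shape)                       # (500, 5)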

Apart from these, several other dataset loaders are provided by the sklearn.datasets package, such as load_svmlight_file, fetch_openml, etc.

Real-world datasets are rarely perfect and usually demand preprocessing before we can extract meaning from them. Scikit-learn provides many built-in modules with which we can analyze and preprocess the data, so let's have a look.

Preprocessing Modules in Scikit-learn

The objective of the data preprocessing stage is to get the data into a trainable format. This requires:

  • Selecting appropriate values for missing samples.
  • Converting categorical data to machine-readable numerical format.
  • Scaling attributes to improve training speed or accuracy.

Let's see each of these in detail.

Imputing missing values using Scikit-learn

We can use the Scikit-learn library to fill in the missing values in a dataset. This process is called imputation. There are many ways to do this, but here we will focus on SimpleImputer to replace missing values.

import numpy as np
from sklearn.impute import SimpleImputer
data = np.array([[19, 18, np.nan, 26],
                 [85, 53, 76, 45],
                 [83, 97,  1, np.nan],
                 [73, 28, 38, 37],
                 [87, np.nan, 86, 66],
                 [23, 28, 11, 10]])

print('Original Data :')
print(data)                                                  #Check Data before imputing          
print(np.isnan(data).any())                                  #Check presence of missing value
imp = SimpleImputer(strategy = 'median')                     #Define Imputer with strategy (mean/median/most_frequent)
data_new = imp.fit_transform(data)                           #Transform data as per the strategy
print('New Data :')
print(data_new)                                              #Check Data after imputing  

'''
Original Data :
[[19. 18. nan 26.]
 [85. 53. 76. 45.]
 [83. 97.  1. nan]
 [73. 28. 38. 37.]
 [87. nan 86. 66.]
 [23. 28. 11. 10.]]
True
New Data :
[[19. 18. 38. 26.]
 [85. 53. 76. 45.]
 [83. 97.  1. 37.]
 [73. 28. 38. 37.]
 [87. 28. 86. 66.]
 [23. 28. 11. 10.]]
'''
We can also use strategies like 'mean' or 'most_frequent' to replace missing values with the mean or mode of the corresponding feature (column).
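
For instance, a minimal variation of the code above (reusing the same data array) fills the missing values with each column's most frequent value:

imp_mode = SimpleImputer(strategy = 'most_frequent')    # replace NaN with the column's mode
print(imp_mode.fit_transform(data))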

Label Encoders in Scikit-learn

At times, attributes of a dataset are in non-numeric form yet still informative, and computers only understand numbers, so these values cannot be processed directly by ML models. That's when a label encoder comes into the picture: it replaces non-numerical values with numerical ones and makes these attributes understandable to machines.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
path = 'D:/EnjoyAlgorithm/PlayTennis.csv'
PlayTennis = pd.read_csv(path, header = 0, skiprows = 0)        #Loading the Text Dataset
print ("Dataset Length: ", len(PlayTennis)) 
print ("Dataset Shape: ", PlayTennis.shape) 
print(PlayTennis)                                               #Before processing

Le = LabelEncoder()
for label in PlayTennis.columns:
    PlayTennis[label] = Le.fit_transform(PlayTennis[label])
print(PlayTennis)                                               #After processing

How to do the label encoding for the categorical variables using scikit-learn?

For example, in the above PlayTennis dataset, the LabelEncoder assigns a numerical value to each non-numerical data entry (say 'overcast' = 0, 'rainy' = 1, 'sunny' = 2). The processed data is now suitable for developing a Machine Learning model.
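
If the CSV file is not at hand, the same idea can be tried with a small self-contained sketch (the column values below are made up purely for illustration):

from sklearn.preprocessing import LabelEncoder

outlook = ['sunny', 'overcast', 'rainy', 'sunny', 'rainy']   # hypothetical 'Outlook' column
le = LabelEncoder()
encoded = le.fit_transform(outlook)
print(list(le.classes_))    # ['overcast', 'rainy', 'sunny'] -> encoded as 0, 1, 2
print(list(encoded))        # [2, 0, 1, 2, 1]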

Scaling Dataset using Scikit-learn

In many real-world datasets, different attributes lie in different numerical ranges, i.e., their minimum and maximum values do not match. This creates problems, as attributes with higher magnitudes will be given more (or less, depending on the algorithm) importance.

For example, a hiring manager has to develop a plan to propose the salary for an individual. Their inputs include previous wages and the number of years of work experience. If we use an ML algorithm, e.g., KNN, the previous salary feature (being in the higher magnitude range) will outweigh the work experience as the numerical quantity of work experience will vary in the range of 0–70, but salary numbers will range from thousands to millions. Hence, we need to scale these features to assign them equal importance.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
# define data ---> [salary ($), work ex(yrs)]
data = np.array([[3000, 1],
				[3300, 2],
				[4500, 2],
				[3800, 1],
				[4800, 3],
				[5000, 5]])
print(data)

scaler = MinMaxScaler()                  # define an object of the MinMaxScaler class

new_data = scaler.fit_transform(data)    # fit and transform the data
print(new_data)

'''
Original
[[3000    1]
 [3300    2]
 [4500    2]
 [3800    1]
 [4800    3]
 [5000    5]]

Scaled
[[0.    0.  ]
 [0.15  0.25]
 [0.75  0.25]
 [0.4   0.  ]
 [0.9   0.5 ]
 [1.    1.  ]]
'''

We can use different scalers, such as MaxAbsScaler and StandardScaler, and the choice depends upon the problem statement and the nature of the dataset.
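
For comparison, here is a minimal sketch that applies StandardScaler to the same salary/experience data, so that each column ends up with zero mean and unit standard deviation:

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
print(std_scaler.fit_transform(data))    # each column is centered and scaled independently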

Feature Engineering using Scikit-learn

In feature engineering, we prepare the proper input features for ML models. Some popular techniques to perform feature engineering are:

Vectorization using Scikit-learn: Vectorization techniques are mainly used when data is not present in a tabular format, e.g., text data, JSON files, dictionaries, etc. Vectorization converts the data into a vector format, making it more understandable to machines.

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
data = [
    {'price': 1125000, 'rooms': 4, 'State': 'New York'},
    {'price': 1000000, 'rooms': 3, 'State': 'California'},
    {'price': 750000, 'rooms': 3, 'State': 'Washington'},
    {'price': 800000, 'rooms': 2, 'State': 'California'},
    {'price': 850000, 'rooms': 2, 'State': 'New York'},
]
new_data = vec.fit_transform(data)
print(new_data)



'''
Output = 
[[      0       1       0 1125000       4]
 [      1       0       0 1000000       3]
 [      0       0       1  750000       3]
 [      1       0       0  800000       2]
 [      0       1       0  850000       2]]
'''
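
To interpret the columns of the vectorized output, we can ask the vectorizer for the feature names it generated (assuming a recent Scikit-learn version that provides get_feature_names_out):

print(vec.get_feature_names_out())
# ['State=California' 'State=New York' 'State=Washington' 'price' 'rooms']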

Dimensionality Reduction using Scikit-learn

Dimensionality reduction maps data samples from a high-dimensional space to a low-dimensional one while retaining as much information as possible. We cannot visualize datasets that live in more than three dimensions, but with dimensionality reduction techniques we can bring them down to lower dimensions and then visualize them. PCA is one such technique, and the Scikit-learn library supports it.

Let's see how to use Scikit-learn to reduce the dimensionality from 3 to 2.

from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
#____CREATE RANDOM CLASSIFICATION DATASET____
X, y = datasets.make_classification(n_samples=300, n_features=3, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3,0.2], random_state=42)
pca = PCA(n_components=2, svd_solver='randomized')
X_fitted = pca.fit_transform(X)                     # Fit PCA and project the data onto 2 components

print("Explained Variance: %s" % (pca.explained_variance_ratio_))
#_____PLOT ORIGINAL DATA_____#
fig = plt.figure()

ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(xs = X[:,0], ys = X[:,1], zs = X[:,2], c=y)

ax.set_title("Original 3-featured data")
ax.set_xlabel("X0")
ax.set_ylabel("X1")
ax.set_zlabel("X2")
plt.show()
#_____PLOT REDUCED DIMENSION DATA_____#
fig, ax = plt.subplots(figsize=(9, 6))
plt.title("Reduced 2-featured data")
plt.xlabel("X_fitted_0", fontsize=20)
plt.ylabel("X_fitted_1", fontsize=20)

plt.scatter(X_fitted[:,0], X_fitted[:,1], s=50, c=y)
plt.show()

How to reduce the dimensionality in dataset using Scikit-learn?

As we can see, there is a reduction in the dimension from 3 to 2.

We are now ready to apply ML algorithms to the prepared Dataset and build our model. Let's see what Scikit-learn provides here.

Building Machine Learning Model using Scikit-learn

The sklearn library provides support for various Machine Learning models, organized by type (linear models, tree-based models, SVM-based models, ensembles, etc.). The import statements below show how some standard algorithms can be brought into our Python programs. Check out the complete list here.

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA

The general paradigm for Scikit-learn is

  1. Import the ML model and create an instance of it.
  2. Fit training data into the model.
  3. Use the fitted model to predict.

#EX1: Creating and deploying a Supervised Learning Model
model = DecisionTreeClassifier()                 # Create an instance of the Decision Tree Classifier
model = model.fit(X_train,y_train)               # Fit the training data into the model
model.predict(X_test)                            # Use model to make prediction

#EX2: Creating and deploying a Dimensionality Reduction Model
pca = PCA(n_components = 2)                      # Create an instance of the PCA
X_transformed_data = pca.fit_transform(X_data)   # Fit and transform the data to new dimensions
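
To make the schematic above concrete, here is a minimal end-to-end sketch on the iris dataset (the names X_train, X_test, etc. mirror the placeholders used above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)   # create an instance of the classifier
model.fit(X_train, y_train)                       # fit the training data
print(model.predict(X_test[:5]))                  # predictions for the first five test samples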

Once the model is trained, we can evaluate its performance. Scikit-learn provides a wide range of modules to evaluate our models.

Model Evaluation using Scikit-learn

We evaluate our trained ML models on the train and test sets using the functions provided by Scikit-learn. Based on the performance, we decide whether the model suffers from problems like underfitting or overfitting.

Evaluation of the trained model can be done in a few simple steps:

  • Import the desired metric.
  • Use the trained model to predict the output on the test/train data.
  • Compute the performance and report the numbers.

#EX1: Evaluating the model performance using R2-score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))
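
For classification models, metrics such as accuracy or a confusion matrix are more natural choices. A minimal sketch, reusing the fitted classifier and the test split from the previous section:

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = model.predict(X_test)
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Confusion matrix :')
print(confusion_matrix(y_test, y_pred))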

Building a complete Machine Learning Pipeline using Scikit-learn

So far, we have seen ways of extracting trainable data from raw data and then using it to train our ML algorithms. This complete process can be organized sequentially into what is known as a pipeline. A pipeline allows the data processing, training, and evaluation of a model to run from end to end. Scikit-learn can wrap this entire process into a single pipeline object, which makes it readily employable and deployable.

For example, let's see an end-to-end pipeline building using sklearn on the 'iris' flower dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import numpy as np
iris=load_iris()

iris_data = iris.data.copy()
iris_target = iris.target
#print('Iris Data before replacing samples with NaN',iris_data)
c = 10
mask = np.ones(iris_data.shape)
mask.ravel()[np.random.choice(mask.size, c, replace=False)] = 0

#print(np.where(mask==0))                   #Checking the (c=10) locations where number is replaced by NaN
iris_data[mask==0] = np.nan

#print('Iris Data after replacing samples with NaN',iris_data)
X_train,X_test,y_train,y_test=train_test_split(iris_data,iris_target,test_size=0.3,random_state=42)
pipeline=Pipeline([('Imputer',SimpleImputer(strategy='mean')),('Scalar',StandardScaler()),
                     ('PCA',PCA(n_components=2)),('SVC',SVC(kernel = 'linear'))])


model = pipeline.fit(X_train, y_train)

print('SVM performance on Iris Classification',model.score(X_test,y_test))



#To view the data in any intermediate stage of the pipeline 
imputer_output = model.named_steps["Imputer"].transform(X_train)
scalar_output = model.named_steps["Scalar"].transform(imputer_output)
pca_output = model.named_steps["PCA"].transform(scalar_output)
model_output = model.named_steps["SVC"].predict(pca_output)




#SVM performance on Iris Classification 0.9333333333333333

To build this pipeline, we import Pipeline from the sklearn.pipeline module. The pipeline takes as input the different transformations we want to apply to our dataset. Let's suppose we want to start with imputation.

The iris dataset has no missing values, so we randomly replace a few values in the dataset with NaN (not a number) to see the pipeline at work.

Now that the data is ready, we split it into training and testing datasets. Sklearn provides the function train_test_split, which splits the data into the desired fractions.

The next step is building the pipeline. The pipeline takes its input as a list of tuples: the first element of each tuple is the desired name for the step, and the second is the transformation to be applied. Our pipeline consists of the following transformations:

Imputer → StandardScaler → PCA → SVC

  • The imputer handles the missing values as per the chosen strategy. One can change the strategy from 'mean' to 'median' or 'most_frequent' and check the results (see the sketch after this list).
  • The StandardScaler transforms the data to zero mean and unit standard deviation. One can try other scalers, such as MinMaxScaler, MaxAbsScaler, etc.
  • PCA reduces the dimensionality (4 → 2) of this dataset.
  • Finally, the output from PCA is fed into the Support Vector Classifier model.
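
For example, assuming the pipeline object defined above, the imputation strategy can be swapped without rebuilding the pipeline by using Scikit-learn's step-name__parameter syntax:

pipeline.set_params(Imputer__strategy = 'median')     # 'Imputer' is the step name defined in the pipeline above
model = pipeline.fit(X_train, y_train)
print('SVM performance with median imputation', model.score(X_test, y_test))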

We can then fit the pipeline on the training dataset and compute the accuracy on the test dataset. To view the output of any intermediate step, use named_steps["step_name"] as shown in the code above. This lets us inspect the pipeline's intermediate results and understand how it is working.

Conclusion

Scikit-learn is an open-source Machine Learning library built on top of well-known Python packages, and it provides support for every stage of Machine Learning model development. This article covered an essential introduction to the Scikit-learn library from a beginner's perspective and discussed its support for every ML model development and deployment stage. We hope you found the article enjoyable and learned something new.

Enjoy Learning! Enjoy Algorithms!

