Feature Selection Techniques in Machine Learning

While working on a machine learning problem, it is common to have many features in the dataset, but it is rare for all of them to help build the best model. Keeping irrelevant features in the analysis hurts the model's ability to generalize and degrades its performance. Further, adding more features increases the complexity of the model, which in turn can increase the generalization error.

The best model uses the smallest number of features while keeping performance high. Therefore, identifying the relevant features before the model-building phase is necessary. In this article, we will walk through some feature selection methods and discuss the pros and cons of each.

Following are the feature selection methods discussed in this blog:

  • Wrapper Method
  • Embedded Method
  • Filter Method

Let's discuss them one by one!

Wrapper Method

The wrapper method for feature selection uses a learning algorithm to evaluate the model's performance over different subsets of features. It assesses the quality of learning for each candidate subset against an evaluation criterion, and the output is the model's performance for each subset tried. Finally, the user selects the subset of features for which the model's performance is optimum.

The wrapper method is known as a greedy approach: feature subsets are evaluated one after another until a specific criterion is fulfilled. This gets expensive quickly. Imagine a dataset with 50 features: a single greedy forward-selection pass already requires 50 + 49 + ... + 1 = 1275 model fits, and evaluating every possible subset would require far more. This computational cost is a significant shortcoming of the wrapper method. However, the wrapper method generally produces better results than the filter-method-based feature selection techniques, which we will discuss in a later section.
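As a rough sanity check of these numbers, here is a tiny sketch (assuming 50 candidate features) comparing the fits needed by a greedy forward pass with the fits needed to test every possible non-empty subset:

n_features = 50

# Greedy forward selection: 50 fits in round 1, 49 in round 2, ..., 1 in the last round
greedy_fits = sum(range(1, n_features + 1))   # 1275

# Exhaustive search: one fit for every non-empty subset of features
exhaustive_fits = 2 ** n_features - 1         # roughly 1.1e15

print(greedy_fits, exhaustive_fits)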

Wrapper Method representation for feature selection

Let's look at some wrapper feature selection techniques:

Forward Feature Selection

This method works iteratively: it first selects the single best-performing feature and then, at every step, adds the feature that improves the model the most when combined with the already selected ones. This process continues until a specific stopping criterion is fulfilled. Let's implement Forward Feature Selection on the Boston house price dataset:

import numpy as np
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# boston_house_price is assumed to be a DataFrame with the 13 predictor
# columns followed by the target column MEDV (loaded beforehand, e.g., from a CSV)
features = boston_house_price.iloc[:, :13]
target = boston_house_price.iloc[:, -1]

SFS = SequentialFeatureSelector(LinearRegression(),  #Regressor
                                k_features=12,       #When to stop
                                forward=True,        #Ensures FFS
                                scoring = 'r2')      #Scoring metric

SFS.fit(features, target)
SFS_results = pd.DataFrame(SFS.subsets_).transpose()

SFS_results

Output for Forward Feature Selection method implementation

In the above illustration, we are using the Boston house price dataset. It's a regression problem, so we fit a linear regression model and use R-squared as the performance metric. The model's performance rises rapidly up to the top seven features and then saturates around an avg_score of 0.74. These results indicate that seven features are sufficient for building the model, and the remaining features can be dropped to keep the model explainable and fast.
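To visualize this saturation, mlxtend also ships a plotting helper (plot_sequential_feature_selection) for fitted sequential selectors; a minimal sketch using it, assuming the fitted SFS object from the snippet above, could look like this:

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

# Plot average R-squared (with standard deviation bands) against the subset size
plot_sfs(SFS.get_metric_dict(), kind='std_dev')
plt.title('Forward Feature Selection: R-squared vs. number of features')
plt.grid(True)
plt.show()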

Backward Feature Elimination

Backward Feature Elimination is just the opposite of the above method. We start with all the features and fit the model. Then, at each step, we eliminate the feature whose removal hurts performance the least (i.e., leaves the best-performing model). This process is repeated until a specific criterion is met. In the case below, the stopping criterion is to halt once only four features are left; we can adjust this criterion as needed.

from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

import pandas as pd
import numpy as np

features = boston_house_price.iloc[:,:13]
target = boston_house_price.iloc[:,-1]

SFS = SequentialFeatureSelector(LinearRegression(),  #Regressor
                                k_features=4,        #When to stop
                                forward=False,       #Ensures BFE
                                scoring = 'r2')      #Scoring metric

SFS.fit(features, target)
SFS_results = pd.DataFrame(SFS.subsets_).transpose()

SFS_results

Output for Backward Feature Elimination method implementation

Exhaustive Feature Selection (EFS)

This method searches over all possible combinations of features and evaluates the model on each subset. The output of EFS is the combination of features that secures the best score. It is a brute-force approach with a high computational cost. Let's implement it on the Boston house price dataset.

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

features = boston_house_price.iloc[:, :13]
target = boston_house_price.iloc[:, -1]

# Create an EFS object
efs = EFS(lr,                    # Regressor
          min_features=6,        # Min features to consider
          max_features=13,       # Max features to consider
          scoring='r2')          # R-squared as the evaluation criterion

# Train EFS with our dataset
efs = efs.fit(features, target)

# Print the results
print('Best subset (indices):', efs.best_idx_)  
print('Best subset (corresponding names):', efs.best_feature_names_)


# Best subset (indices): (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)
# Best subset (corresponding names): ('CRIM', 'ZN', 'CHAS', 'NOX', 'AGE', 'DIS',
#                   'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT')
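Once the exhaustive search finishes, the fitted selector also exposes a best_score_ attribute, so a short follow-up like the sketch below (building on the attributes printed above) can materialize the reduced feature matrix:

# Best R-squared achieved and the corresponding reduced feature matrix
print('Best score:', efs.best_score_)

reduced_features = features.iloc[:, list(efs.best_idx_)]
print(reduced_features.columns.tolist())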

Now that we have understood how the wrapper method works, let's look at the next approach: the embedded method.

Embedded Method

The embedded method combines a fair computational cost with reliable performance, which often makes it preferable to the filter and wrapper methods. Embedded methods are algorithm-based: the learning algorithm itself identifies the relevant features. It keeps track of feature relevance using certain criteria and retains the features that contribute the most during the training phase.

Embedded Method representation for feature selection

The computational cost of the embedded method is lower than that of wrapper methods, and its performance is typically better than that of filter methods. Let's look at some embedded feature selection techniques:

LASSO (L1) Regularization

LASSO regularization is commonly used as a feature selection criterion. It penalizes irrelevant parameters by shrinking their weights (coefficients) to exactly zero, so the corresponding features are effectively removed from the model. This not only removes the extraneous features but also helps prevent the model from overfitting. If you are unfamiliar with regularization, it is worth reviewing that concept first.

Let's implement LASSO Regularization:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

# diabetes: a DataFrame with a binary "Outcome" target and 8 numeric features
# (loaded beforehand, e.g., from a CSV)
target = diabetes["Outcome"]
features = diabetes.drop("Outcome", axis=1)

scaler = StandardScaler()
scaler.fit(features)
scaled_features = scaler.transform(features)

logistic = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
logistic.fit(scaled_features, target)

selected_features = features.columns[(logistic.get_support())]

print('Total number of features: {}'.format((features.shape[1])))
print('Features selected: {}'.format(len(selected_features)))
print('Number of discarded features: {}'.format(np.sum(logistic.estimator_.coef_ == 0)))

# Total number of features: 8
# Features selected: 7
# Number of discarded features: 1
features.columns[(logistic.estimator_.coef_ == 0).ravel()]

# Index(['SkinThickness'], dtype='object')
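As a small follow-up sketch, the retained columns can be used directly to build the reduced feature matrix:

# Keep only the columns LASSO retained (7 of the original 8 features)
reduced_features = features[selected_features]
print(reduced_features.shape)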

Random Forest Feature Importance

Random Forest is an ensemble learning algorithm that aggregates several weak learners (decision trees) for prediction. This tree-based approach naturally ranks the features of a dataset by measuring how much each feature improves the purity of the splits. In a decision tree, impurity drops rapidly near the top of the tree, and the rate of improvement decreases as we go down. The features used near the root therefore carry more critical information and are relevant from a feature selection perspective, while those used in the lower portions of the tree are less relevant. This mechanism allows us to create a hierarchy of features sorted by their importance. Let's implement Random Forest feature importance:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# df: the same diabetes DataFrame used above, with 'Outcome' as the target
features = df.drop('Outcome', axis=1)
target = df['Outcome']

classifier = RandomForestClassifier(random_state=90, oob_score=True)
classifier.fit(features, target)

# Scale importances relative to the most important feature
feature_importance = classifier.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())

# Sort features by importance for plotting
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

plt.figure(figsize=(10,12))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.xticks(size =14)
plt.yticks(pos, features.columns[sorted_idx], size =14)
plt.xlabel('Relative Importance', fontsize = 15)
plt.title('Variable Importance', fontsize = 15)
plt.show()

Random Forest Feature Importance plot
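The plot only ranks the features; to actually drop the weaker ones, one simple option (a sketch that uses the mean importance as a cut-off, which is our own illustrative choice rather than something prescribed by the method) is:

# Keep only features whose importance is at least the average importance
# (the threshold choice is illustrative; tune it for your data)
threshold = classifier.feature_importances_.mean()
selected_columns = features.columns[classifier.feature_importances_ >= threshold]
print(list(selected_columns))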

Filter Method

Filter methods are a collection of statistical techniques commonly used to measure the importance of features in a dataset. These methods are faster and computationally cheaper than the wrapper method, so they are a more reasonable choice for large datasets. Let's look at some filter feature selection methods:

Correlation Coefficients

Correlation measures the linear relationship between two features and is primarily meaningful when the features are numeric. It extends naturally to feature selection: the correlation matrix, usually visualized as a heatmap, shows the relationship between each feature and the target variable. It also reveals collinear features, which are redundant because they contribute no new information; removing such features is recommended. Secondly, we can decide on a threshold value of correlation: if the absolute correlation between a feature and the target variable is lower than that threshold, we can discard that feature from the analysis. Let's plot a correlation heatmap:

import seaborn as sns
import matplotlib.pyplot as plt

correlation = boston_house_price.corr()
plt.figure(figsize= (15,12))
sns.heatmap(correlation, annot=True)

Correlation heatmap for the Boston house price data

The TAX and RAD parameters share a high correlation, so keeping either one of them would suffice. Which of the two should we remove? To decide, we check the absolute correlation of each with the target variable. The target variable is MEDV, and TAX has the higher absolute correlation with MEDV, so we discard RAD from the analysis.

We also need to decide on a threshold of absolute correlation with the target; features below this threshold are dropped from the analysis. Let's set it to 0.4 as the selection criterion. We are then left with 'INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', and 'LSTAT' as the final features. Selecting an optimal threshold is an empirical process and requires trial and error to arrive at an optimum value.
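A minimal sketch of this correlation-based filtering on the Boston data (assuming the boston_house_price DataFrame from the heatmap snippet, with MEDV as the target column) could look like this:

# Absolute correlation of every feature with the target MEDV
correlation_with_target = boston_house_price.corr()['MEDV'].drop('MEDV').abs()

# Keep features whose absolute correlation with the target exceeds 0.4
threshold = 0.4
selected = correlation_with_target[correlation_with_target > threshold].index.tolist()

# Drop RAD explicitly since it is collinear with TAX (TAX correlates more strongly with MEDV)
selected = [col for col in selected if col != 'RAD']
print(selected)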

Mutual Information

Mutual information is a feature selection technique commonly used when the independent features are numeric. It measures the dependency of an independent variable on the target variable: the mutual information is zero when the two variables are independent, and a higher value indicates stronger dependence. The estimate relies on entropy estimation, which in turn uses k-nearest-neighbor distances. Mutual information applies to both regression and classification problems. Let's apply it to the wine quality dataset.

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

wine_quality = pd.read_csv('WineQT.csv')
target = wine_quality["quality"]

features = wine_quality.drop("quality", axis=1)

mutual_information = mutual_info_classif(features, target)
mutual_information_series = pd.Series(mutual_information, index=features.columns)

# Rank the features by their mutual information with the target and plot them
mutual_information_series = mutual_information_series.sort_values(ascending=False)
mutual_information_series.plot.bar()

Mutual Information method implementation on wine quality data

The first nine parameters (in descending order of mutual information) have a significant relationship with the dependent variable; the remaining parameters can be dropped from the analysis.
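A small sketch of this selection, reusing the mutual_information_series computed above, would be:

# Keep the nine features with the highest mutual information scores
top_features = mutual_information_series.sort_values(ascending=False).head(9).index
reduced_features = features[top_features]
print(list(top_features))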

Variance Threshold

This approach assumes that features with low variance do not contribute much information to the analysis. Variables with zero variance are the first to be removed; such variables contain a single value throughout the dataset and provide no useful information. Features with high variance are often assumed to be helpful under this method, although this isn't true in all cases. A variance threshold is required as the feature selection criterion: any feature whose variance is lower than the chosen threshold is discarded.

from sklearn.feature_selection import VarianceThreshold

target = wine_quality["quality"]
features = wine_quality.drop("quality", axis=1)

selector = VarianceThreshold(threshold=0.03)
selector.fit(features)

best_features = features.columns[selector.get_support()]
print(best_features)

## Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual
## sugar', 'free sulfur dioxide', 'total sulfur dioxide', 'alcohol'], dtype='object')

Chi-Square Test

The Chi-Square test is used when the parameters in the dataset are categorical. It computes the chi-square statistic between each feature and the target, and features are selected based on their chi-square scores. Before applying the chi-square test, certain conditions have to be met. Following are those conditions:

  • Features should be categorical 
  • Observations should be independent
  • The sample should be large (the expected frequency of each category should be greater than 5)

Let's implement the Chi-Square Test:

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# play_tennis.csv: a small toy dataset in which every column is categorical
tennis_data = pd.read_csv('play_tennis.csv')

# Label-encode every column so that chi2 can work with non-negative numeric codes
le = LabelEncoder()
cat_columns = tennis_data.columns
tennis_data[cat_columns] = tennis_data[cat_columns].apply(lambda x: le.fit_transform(x))

target = tennis_data["play"]
features = tennis_data.drop(["play"], axis=1)

chi2_features = SelectKBest(chi2, k = 4)
Best_k_features = chi2_features.fit_transform(features, target)

print("Total Number of Features" ,features.shape[1])
print("Reduced to {} features".format(Best_k_features.shape[1]))

## Total Number of Features 5
## Reduced to 4 features
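To see which features were actually favored, the fitted selector's per-feature chi-square scores can be inspected; a brief sketch:

# Chi-square score of each feature (higher = stronger dependence on the target)
chi2_scores = pd.Series(chi2_features.scores_, index=features.columns)
print(chi2_scores.sort_values(ascending=False))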

Possible Interview Questions On Feature Selection

Feature selection is one of the topics interviewers love to ask about because this process is involved in almost all the projects we mention in our resumes. Some of the questions that can be asked are:

  • How many features were present in the raw data you received?
  • What was the final number of features used to train the machine learning model?
  • On what basis did you remove specific features from the final set of features?
  • Which method would you prefer if the raw data has 1000 features and a very large number of samples?
  • Why is the wrapper method considered a greedy approach?

Conclusion

In this article, we explored three primary families of feature selection methods: wrapper, embedded, and filter methods. Each has its pros and cons, and we learned that there is no single ideal feature selection method. The best features returned by each method might differ, and selecting the best features is an empirical process that requires experimentation with the data and domain knowledge. We implemented all the strategies in Python and learned the basic intuition behind each technique. Filter methods are fast and more reasonable for large datasets; wrapper methods are robust but slow and better suited to small datasets; and the embedded method sits between the two in execution time while providing reliable results.

Enjoy Learning, Enjoy Algorithms!
