While working on a machine learning problem, it is common to have many features in the dataset, but it is rare for all of them to help build the best model. Keeping irrelevant features in the analysis hurts the model's ability to generalize and degrades its performance. Further, adding more features increases the complexity of the model, which in turn increases the generalization error.
The best model would use the smallest number of features while keeping the performance high. Therefore, determining the relevant features before the model-building phase is necessary. In this session, we will look at some feature selection methods and discuss the pros and cons of each.
Let's discuss them one by one!
The wrapper method for feature selection uses a learning algorithm to evaluate the model's performance over many possible subsets of features. It assesses the quality of learning with different subsets of features against an evaluation criterion, and the output is the model's performance for the different sets of features. Finally, the user can select the set of features for which the model's performance is optimum.
The wrapper method is known for its greedy approach: the model's performance is evaluated over combinations of features until a specific criterion is fulfilled. Imagine a dataset with 50 features; a full forward selection pass alone would require 50 + 49 + ... + 1 = 1275 model fits. This computational cost is a significant shortcoming of the wrapper method. However, the wrapper method generally produces better results than the filter-method-based feature selection techniques, which we will discuss later in this article.
Let's look at some wrapper feature selection techniques:
Forward Feature Selection works iteratively: it first selects the single best feature, then repeatedly adds the feature that gives the largest improvement when combined with the already selected ones. This process continues until a specific stopping criterion is fulfilled. Let's implement Forward Feature Selection on the Boston house price dataset:
import numpy as np
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
# boston_house_price is assumed to be a pandas DataFrame with 13 feature columns
# followed by the target column (MEDV)
features = boston_house_price.iloc[:, :13]
target = boston_house_price.iloc[:, -1]
SFS = SequentialFeatureSelector(LinearRegression(),  # Regressor
                                k_features=12,       # When to stop
                                forward=True,        # Ensures FFS
                                scoring='r2')        # Scoring metric
SFS.fit(features, target)
SFS_results = pd.DataFrame(SFS.subsets_).transpose()
SFS_results
In the above illustration, we are using the Boston house price dataset. It is a regression problem, and hence we fit a linear regression model, with R-squared as the performance metric. The model's performance increases rapidly up to the top seven features and then saturates around an avg_score of 0.74. These results indicate that seven features are sufficient for building the model, and the remaining features can be dropped to keep the model explainable and fast.
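To see this saturation visually, mlxtend ships a plotting helper for sequential selectors. Below is a minimal sketch (assuming the SFS object fitted above) that plots the average R-squared score against the number of selected features:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
# Plot avg_score (with a std-dev band) versus the number of features selected so far
plot_sfs(SFS.get_metric_dict(), kind='std_dev')
plt.title('Forward Feature Selection (R-squared)')
plt.grid()
plt.show()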
Backward Feature Elimination is just the opposite of the above method. We start by fitting the model with all the features. Then, we eliminate the feature whose removal gives the best-performing model, and we repeat this process until a specific criterion is met. In the case below, the stopping criterion is to halt once we are left with only four features; we can adjust this criterion as needed.
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
features = boston_house_price.iloc[:,:13]
target = boston_house_price.iloc[:,-1]
SFS = SequentialFeatureSelector(LinearRegression(),  # Regressor
                                k_features=4,        # When to stop
                                forward=False,       # Ensures BFE
                                scoring='r2')        # Scoring metric
SFS.fit(features, target)
SFS_results = pd.DataFrame(SFS.subsets_).transpose()
SFS_results
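As a quick usage note (a sketch assuming the selector above has been fitted), the four surviving features and their score can be read directly from the selector's attributes:
print('Selected features:', SFS.k_feature_names_)   # names of the 4 retained features
print('R-squared with these features:', SFS.k_score_)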
Exhaustive Feature Selection (EFS) searches over all possible combinations of features and evaluates the model on each subset. The output of EFS is the combination of features that secures the best score. It is a brute-force approach with a high computational cost. Let's implement it on the Boston house price dataset.
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
features = boston_house_price.iloc[:, :13]
target = boston_house_price.iloc[:, -1]
# Create an EFS object
efs = EFS(lr,               # Regressor
          min_features=6,   # Min features to consider
          max_features=13,  # Max features to consider
          scoring='r2')     # R-squared as the evaluation criterion
# Train EFS with our dataset
efs = efs.fit(features, target)
# Print the results
print('Best subset (indices):', efs.best_idx_)
print('Best subset (corresponding names):', efs.best_feature_names_)
# Best subset (indices): (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)
# Best subset (corresponding names): ('CRIM', 'ZN', 'CHAS', 'NOX', 'AGE', 'DIS',
# 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT')
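As a follow-up sketch (assuming the efs object fitted above), we can also read the best score and inspect every evaluated subset:
import pandas as pd
print('Best R-squared score:', efs.best_score_)
# Scores of all evaluated subsets, sorted from best to worst
efs_results = pd.DataFrame.from_dict(efs.get_metric_dict()).transpose()
efs_results.sort_values('avg_score', ascending=False).head()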
Now that we have understood how the wrapper method works, let's look at the next approach: the embedded method.
The embedded method improves on the filter and wrapper methods by combining a fair computational cost with reliable performance. Embedded methods are algorithm-based: the learning algorithm itself helps extract the relevant features. It keeps track of feature relevance using certain criteria during the training phase and retains the features that contribute the most.
The computational cost of embedded methods is lower than that of wrapper methods, and their performance is generally better than that of the other two approaches. Let's look at some embedded feature selection techniques:
LASSO regularization is commonly used as a feature selection criterion. It works by penalizing irrelevant parameters, shrinking their weights (coefficients) all the way to zero, so those features are effectively removed from the model. It not only removes the extraneous features but also prevents the model from overfitting. If you are unaware of regularization, please visit this link.
Let's implement LASSO Regularization:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
# diabetes is assumed to be a pandas DataFrame of the Pima Indians Diabetes dataset
target = diabetes["Outcome"]
features = diabetes.drop("Outcome", axis=1)
scaler = StandardScaler()
scaler.fit(features)
scaled_features = scaler.transform(features)
logistic = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
logistic.fit(scaled_features, target)
selected_features = features.columns[(logistic.get_support())]
print('Total number of features: {}'.format((features.shape[1])))
print('Features selected: {}'.format(len(selected_features)))
print('Number of discarded features: {}'.format(np.sum(logistic.estimator_.coef_ == 0)))
# Total number of features: 8
# Features selected: 7
# Number of discarded features: 1
features.columns[(logistic.estimator_.coef_ == 0).ravel()]
# Index(['SkinThickness'], dtype='object')
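As a usage sketch built on the fitted selector above, we can reduce the dataset to the surviving features before training a final model:
# Keep only the selected columns of the original DataFrame
reduced_features = features[selected_features]
# Equivalently, SelectFromModel can transform the scaled array directly
reduced_scaled_features = logistic.transform(scaled_features)
print(reduced_features.columns.tolist())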
Random Forest is an ensemble learning algorithm that aggregates several weak learners (decision trees) for prediction. This tree-based approach naturally ranks the features of a dataset by measuring how much each feature improves the purity of the splits. In a decision tree, impurity drops most rapidly at the nodes near the top of the tree, and this rate decreases as we move down. Hence, the features used near the top of the trees carry the most critical information and are the most relevant from a feature selection perspective, while those contributing to the lower portions of the trees are less relevant. This mechanism allows us to create a hierarchy of features sorted by their importance. Let's implement Random Forest feature importance:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# Reusing the diabetes DataFrame from the previous example
features = diabetes.drop('Outcome', axis=1)
target = diabetes['Outcome']
classifier = RandomForestClassifier(random_state=90, oob_score=True)
classifier.fit(features, target)
feature_importance = classifier.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(10,12))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.xticks(size =14)
plt.yticks(pos, features.columns[sorted_idx], size =14)
plt.xlabel('Relative Importance', fontsize = 15)
plt.title('Variable Importance', fontsize = 15)
plt.show()
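If a plain table is preferred over the plot, here is a minimal sketch (using the classifier fitted above) that prints the same ranking as a sorted series:
import pandas as pd
# Tabular view of the relative importances, most important first
importance_series = pd.Series(classifier.feature_importances_, index=features.columns)
print(importance_series.sort_values(ascending=False))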
Filter methods are a collection of statistical techniques commonly used for measuring the importance of features in a dataset. These methods are faster and computationally cheaper than the wrapper method, so while dealing with large datasets, it is more reasonable to use filter methods. Let's look at some filter feature selection methods:
Correlation measures the linear relationship between two or more features and is primarily valid when the features are numeric. It extends naturally to feature selection: the correlation matrix, usually visualized as a heatmap, helps us see the relationships between the features and the target variable. It can also reveal collinear features, which are redundant in the analysis since they don't contribute any new information; removing such features is recommended. Secondly, we can decide on a threshold value of correlation: if the absolute correlation between a feature and the target variable is lower than that threshold, we can discard that feature from the analysis. Let's implement a correlation heatmap:
import seaborn as sns
import matplotlib.pyplot as plt
correlation = boston_house_price.corr()
plt.figure(figsize= (15,12))
sns.heatmap(correlation, annot=True)
The TAX and RAD parameters share a high correlation, so keeping any one of them would suffice. Which of the two should we remove? For this, we check their absolute correlation with the target variable. The target variable is MEDV, and TAX has a higher absolute correlation with MEDV than RAD does. Hence, we will discard RAD from the analysis.
We also need to decide on a threshold of absolute correlation with the target; any feature below this threshold is dropped from the analysis. Let's keep 0.4 as the feature selection criterion. We are then left with INDUS, NOX, RM, TAX, PTRATIO, and LSTAT as the final features. Selecting an optimal threshold is an empirical process and requires trial and error to arrive at a good value.
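The thresholding described above can be scripted as a small sketch (assuming the correlation matrix computed earlier and a target column named MEDV); it should reproduce the feature list mentioned above:
# Drop the target itself and the collinear RAD column, then apply the 0.4 threshold
target_corr = correlation['MEDV'].drop(['MEDV', 'RAD']).abs()
selected_features = target_corr[target_corr >= 0.4].index.tolist()
print(selected_features)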
Mutual Information is a feature selection technique that measures the dependency between an independent variable and the target variable; it is commonly used when the independent features are numeric. The mutual information is zero when two variables are independent, and a higher value indicates a stronger dependence. Its estimation relies on entropy estimates computed from k-nearest-neighbor distances. Mutual Information applies to both regression and classification problems. Let's implement it on the wine quality dataset.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
wine_quality = pd.read_csv('WineQT.csv')
target = wine_quality["quality"]
features = wine_quality.drop("quality", axis=1)
mutual_information = mutual_info_classif(features, target)
mutual_information_series = pd.Series(mutual_information)
mutual_information_series.index = features.columns
mutual_information_series.sort_values(ascending=False)
mutual_information_series.sort_values(ascending=False).plot.bar()
The first nine parameters, in descending order of score, have a significant relationship with the dependent variable; the rest can be dropped from the analysis.
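To actually keep only those top features, a minimal sketch using scikit-learn's SelectKBest with the same mutual information score (and the features/target defined above) would look like this:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# Keep the nine features with the highest mutual information scores
mi_selector = SelectKBest(mutual_info_classif, k=9)
mi_selector.fit(features, target)
print(features.columns[mi_selector.get_support()])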
This approach assumes that features with low variance do not contribute much information to the analysis. Variables with zero variance are the first to be removed: such variables contain a single value throughout the dataset and provide no useful information. Features with high variance are often considered helpful by this method, although this is not true in every case. A variance threshold is required as the feature selection criterion, and any feature whose variance is lower than the chosen threshold is dropped.
from sklearn.feature_selection import VarianceThreshold
target = wine_quality["quality"]
features = wine_quality.drop("quality", axis=1)
selector = VarianceThreshold(threshold=0.03)
selector.fit(features)
best_features = features.columns[selector.get_support()]
print(best_features)
## Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
##        'free sulfur dioxide', 'total sulfur dioxide', 'alcohol'], dtype='object')
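Since the threshold itself is a design choice, a quick sketch like the one below (reusing the features DataFrame from above) helps inspect the per-feature variances before picking a value such as 0.03:
# Variances of each feature, smallest first, to guide the threshold choice
print(features.var().sort_values())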
The Chi-Square test is used when all the parameters in the dataset are categorical. It computes the chi-square statistic between each feature and the target, and features are selected based on their chi-square scores. Before applying the chi-square test, certain conditions have to be met: the variables should be categorical, the observations should be sampled independently, and the expected frequency in each cell of the contingency table should be at least 5.
Let's implement the Chi-Square Test:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder
tennis_data = pd.read_csv('play_tennis.csv')
le = LabelEncoder()
cat_columns = tennis_data.columns
tennis_data[cat_columns] = tennis_data[cat_columns].apply(lambda x: le.fit_transform(x))
target = tennis_data["play"]
features = tennis_data.drop(["play"], axis=1)
chi2_features = SelectKBest(chi2, k = 4)
Best_k_features = chi2_features.fit_transform(features, target)
print("Total Number of Features" ,features.shape[1])
print("Reduced to {} features".format(Best_k_features.shape[1]))
## Total Number of Features 5
## Reduced to 4 features
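fit_transform returns a plain array, so as a small usage sketch we can recover the names of the four selected features from the selector's support mask:
print('Selected features:', features.columns[chi2_features.get_support()].tolist())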
Feature selection is one of the topics interviewers love to ask about because this process is involved in almost every project we mention in our resumes. Some of the questions that can be asked are:
What is the difference between the filter, wrapper, and embedded methods of feature selection?
Why do wrapper methods become impractical as the number of features grows?
How does LASSO regularization perform feature selection?
Which feature selection method would you prefer for a very large dataset, and why?
In this article, we explored three primary feature selection approaches: the wrapper, embedded, and filter methods. Each method has its pros and cons, and we learned that there is no single ideal feature selection method. The best features returned by each method might differ, and selecting the best features is an empirical process that requires experimenting with the data and applying domain knowledge. We implemented all the strategies in Python and learned the basic intuition behind each technique. Filter methods are fast and more reasonable for large datasets, while wrapper methods are robust but slow and better suited to small datasets. The embedded method sits between the wrapper and filter methods in execution time while providing reliable results.
Enjoy Learning, Enjoy Algorithms!