While training Machine Learning models, data availability is one of the biggest bottlenecks. Even if the data is present, it misses essential attributes that can help models learn better. Also, attributes in their native state are not scaled, making models biased towards particular features.
To resolve such hurdles, feature engineering is the technique that allows us to generate or rectify existing features to enhance our model's performance. This blog will teach us more about feature engineering and its various techniques.
Let's start with our first question.
Features are like food to machine learning models. Like we process our food before eating them, we must process features before feeding them into the model. Feature engineering is one such pre-processing step performed while developing machine learning models. Here we extract meaningful features from available raw data, which helps increase model performance drastically.
Feature engineering has become more critical with the recent "Data-based" machine learning trend. Let's take one example: Suppose we have raw features like 'Current' and 'Voltage' values for any lithium-ion batteries, but the model expects 'Power' to predict the battery's charge percentage. But we do not have a 'Power' feature. Here comes the need for feature engineering, where we transform the existing features so that it becomes helpful for the ML model. We can form a new feature by multiplying the 'Current' and 'Voltage' values, which will result in 'Power'.
Initially, feature engineering was a manual task, consuming significant time. But since 2016, the entire feature engineering process can be automated, which helps us extract meaningful features from raw data. There are four main processes involved in feature engineering :
Let's discuss each one of them one by one.
A dataset contains hundreds and thousands of attributes, but not all are suitable for building a Machine Learning model. We sometimes need to create features from the existing features and then use that particular attribute for training the model. There can be multiple ways to do so, but some common methods include the following:
Developers sometimes need to design their strategy to develop a new feature to enhance the ML model's performance.
Raw data has multiple attributes, and every feature has its defined range. For example, the engine temperature recorded for a car will range from 20°C to 100°C, while the car's speed will range from 0 KMPH to 200 KMPH, and the distance covered can be in the range of 5000 Km to 7000 Km. If we directly use these features while building the model, our model will become biased towards the distance features as the magnitude of distance is very high.
So we adjust the feature variable to improve the machine learning model's performance by making it unbiased towards any particular feature. A prevalent example would be scaling all features within a specific range (e.g., 0 to 1). Standardization and Normalization are some perfect techniques for the same, and we will study the details of these methods in our "Why we need scaling?" blog.
It involves extracting new features from the raw data. This is different from feature creation as here, we mainly want to reduce the total number of features used to build the ML model but, at the same time, retain the maximum information from the earlier data.
As discussed in feature creation, we took the average temperature and lost data for every second. But in feature extraction, we ensure that the intermediate information is least hampered. This helps a lot while modelling data and visualizing the features. Some popular feature extraction methods are PCA, t-SNE, cluster analysis, and edge detection, and details of these algorithms can be found on our Machine Learning course page.
It is a way of selecting the subset of the most relevant features from the original features by removing the redundant, noisy, or irrelevant ones. Features can suffer from missing values/null values, garbage values, constant values, noisy values, too many outliers and many more. These anomalies can drastically reduce our ML model's performance, which is shown in the image below.
If a feature is essential but is of deficient quality, we must remove it from our final feature list as it will negatively affect the performance. Hence we do feature selection techniques to create a subset of features from the existing attributes and then use it for training ML models. This process needs knowledge about the domain we are applying Machine Learning.
For example, to predict the remaining useful life of the battery, we have recorded the atmospheric temperature values as a feature which does not affect battery life. So we can drop this feature which is influenced by our domain knowledge that atmospheric temperature does not affect battery health.
Machine learning workflow involves several steps, each with its own time set. Below pie chart shows the time distribution in each step involved and is based on the survey conducted by Forbes.
Data scientists invest around 80% time in the pre-processing step. This is where feature engineering helps and provides the most relevant features for model training. We can learn more about pre-processing of structured data in this blog.
But why can't we train our model on raw features? Why is data pre-processing required? Why don't we have the features needed in a table and use them directly?
We can not train a Machine Learning model on raw features because:
In summary, data pre-processing is an essential step in machine learning to ensure high performance for ML models.
The most common benefits of feature engineering while building a model are as follows:
Steps involved in feature engineering vary from person to person and model to model. But some common steps are required, and they are:
Cleaning up raw, unstructured, and dirty data bulks may seem daunting, but various feature engineering techniques help us. Let's learn more about it in the next section.
Feature engineering techniques that are used very frequently are:
Human errors and data flow interruptions can lead to missing values in data. These need to be handled to prevent having an impact on the performance of the model. There are two types of imputation:
Numerical Imputation: Numerical features can have missing values for some records. For example, missing some values in a list that contains the count of the number of people eating a particular product in a region. These missing values can be filled by the median, average, or just the number 0. An example is shown below:
data_imputed = data.fillna(0) ## We fill all missing values with 0. ## Do note that it assumes that all missing data here is numerical
Categorical Imputation: When the feature is categorical, and it contains missing values, we can replace them with the majority categorical value for that column. A new field called 'Other' can also be created, which is helpful if there is no most frequent categorical value or if replacing missing values with the most frequent ones leads to an increase in outliers.
data_imputated['categorical_column_name'] = data['categorical_column_name'].fillna(data['categorical_column_name'].value_counts().idxmax()) # #here missing values of categrical column are filled by their max value
Please note that we can use 'inplace=True' to update in the same data frame. This is a feature of Pandas. For more details, you can refer to the blog.
Outlier handling techniques help us to remove outliers and help to increase the performance of models. The improvement is drastic for models like linear regression, which are susceptible to outliers. The various methods are:
We can use the Standard deviation for identifying outliers because identification is the most difficult part of working with datasets having outliers. For example, a point larger than a particular distance within a space can be considered an outlier.
This is the most basic type of encoding used to give a numerical value to categorical features. We all know models play with numbers, and this encoding helps assign a unique vector for each feature. It works when the number of features is finite, and the vector length corresponds to the number of features. An example is shown below.
# Total features are 4 Men = [1,0,0,0] Women = [0,1,0,0] Child = [0,0,1,0] Girl = [0,0,0,1]
It is also known as logarithm transformation. When we have skewed distributed data, we want to turn it into less skewed or normally distributed data. We can transform this feature by taking the log of that and using the resulting values. An example is shown below:
data_log_transformed['log' + '_column_name'] = np.log(data['column_name'])
Feature scaling is an essential step in data pre-processing. The features can be scaled up or down as per needs and can improve the performance of many algorithms, especially distance-based algorithms like K-Means. The commonly used methods are :
More details about the feature scaling can be found in this blog.
One can refer to the data pre-processing hands on the blog for more detailed techniques on feature engineering.
There are also some tools to make applying feature engineering a lot easier. Let's discuss some of these tools to conclude our discussion.
These tools help us to do the whole feature engineering and produce relevant features efficiently for our models. Some of the standard tools are:
It is an open-source library used to perform automated feature engineering. Data Scientists use it to fast-forward the feature generation process and develop a more accurate model.
It is available as a Python package and helpful when features are related to time series. We can also use it along with the feature tools. It helps us extract features like the number of peaks, maximum value, etc., to train our model.
Feature engineering is an essential method that extracts critical features from the available features in the dataset. It can drastically improve our model's performance, making it necessary to understand. This blog discussed the basics of feature engineering and the various steps involved. We hope you enjoyed reading the article.
If you have any queries/doubts/feedback, please write us at firstname.lastname@example.org. Enjoy machine learning!
Subscribe to get well designed content on data structure and algorithms, machine learning, system design, object orientd programming and math.