Data Pre-processing Techniques for Machine Learning

Nowadays, data collection is one of the most common trends among organizations. Every company collects data for a variety of uses. 

  • Trading companies collect data to analyze stock performance.
  • Automobile companies collect data for vehicle prognostics or automated driving.
  • Marketing companies collect data to analyze the performance of their marketing strategies.
  • Sports companies collect data on athletes to track their form.

If we observe closely, almost every firm is collecting data in some form. Interestingly, most of these firms are simply piling up millions of gigabytes of data without ever utilizing it. Let’s think over an open-ended question (feel free to post your answer in the comment section):

If everyone is collecting this gigantic amount of data, in the near future, will it be considered Digital Junk?

Key takeaways from this blog

  1. What are data and data pre-processing?
  2. What are the common things we do in data pre-processing?
  3. What is the need for data pre-processing?
  4. What are categorical and numerical features?
  5. What are feature selection, feature quality assessment, feature aggregation, feature sampling, and feature reduction?
  6. Possible interview questions on this topic.

Moving ahead with the data pre-processing discussion, let’s first understand the meaning of data. If we define the term “data” properly,

“Data is raw qualitative or quantitative information stored in numerical, textual, audible or visual form.”

From the above definition, we can see that data mainly comes in four different forms: numerical, textual, visual, or audio. But this raw form of data cannot be used directly because it may contain outliers, unwanted features, missing values, and many other problems. Hence, we need to transform it into some meaningful form. This transformation process is termed data pre-processing. If we define data pre-processing properly,

“Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data.”

Page 27, Applied Predictive Modeling, 2013.

Data Funnel view

Now that we know what data pre-processing is and why we need it, let’s quickly look at some standard methods included in this process.

While collecting data, we usually observe and measure the values of many individual properties of a phenomenon. This practice is common because one particular property may help solve one task but be useless for another. Such a measurable individual property is termed a feature in machine learning.

Processes of data pre-processing

Features can be of two types:

Categorization of features

Categorical Features

In this category, features can take values from a fixed, defined set of values. For example, weekday can take values from the set {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday}, and hence we can say that weekday is a categorical feature. Similarly, if a feature can take only boolean values, i.e., values from the set {True, False} or {0, 1}, then it can also be considered a categorical feature.

As we saw in the weekday example above, the defined set of a categorical variable can contain non-numeric values. Machines cannot work with these names directly, so instead of feeding such features as-is, we represent them using one-hot encoding. The seven days can be converted into seven binary features, like (1, 0, 0, 0, 0, 0, 0) for Monday, (0, 1, 0, 0, 0, 0, 0) for Tuesday, and so on.
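
As a small illustration, here is a minimal sketch of one-hot encoding in Python using pandas; the DataFrame and the weekday column are hypothetical examples made up for this sketch, not part of any specific dataset.

```python
import pandas as pd

# Hypothetical data with a categorical "weekday" feature.
df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Sunday", "Monday"]})

# One-hot encode: every distinct day becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["weekday"])
print(encoded)
```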

Numerical Features 

In this category, the values of the features are continuous or real-valued. For example, a vehicle's speed is a numerical feature; a battery's current is a numerical feature. It is advised to scale these features before feeding them to a machine learning model (see the minimal sketch after the list below). Several methods exist, such as

  • Min-max normalization
  • Z-score normalization (standardization)
  • Exponential Normalization, etc.
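
Here is a minimal sketch of the first two techniques using NumPy; the speed values are made-up numbers chosen purely for illustration.

```python
import numpy as np

# Made-up numerical feature, e.g. vehicle speed readings in km/h.
speed = np.array([12.0, 48.0, 90.0, 150.0, 30.0])

# Min-max normalization: rescale values into the [0, 1] range.
min_max = (speed - speed.min()) / (speed.max() - speed.min())

# Z-score normalization: shift to zero mean and unit standard deviation.
z_score = (speed - speed.mean()) / speed.std()

print(min_max)
print(z_score)
```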

These are widely used techniques for the normalization of features; the detailed methods and the need for scaling are discussed in a separate blog. Intuitively, our first step should be selecting the required features from the bulk of available features to solve the given problem.

Feature Selection

This step requires some domain knowledge of the problem statement. Domain knowledge means knowledge of the area in which we are trying to solve the problem. For example, suppose we are training a machine learning model to predict a battery's remaining life. In that case, we must know features like the Current (I), Voltage (V), and Temperature (T) patterns during the charging and discharging of the battery, because these features drive the battery's aging.

But how did we know that these features contribute to the battery's aging? This comes from domain knowledge. As another example, if the problem is related to estimating the energy produced by a wind turbine, we must know the critical factors that control wind energy production.

Generally, firms try to collect as many attributes as possible, because having extra attributes does not hurt, whereas missing an important feature can force the entire collection process to be repeated.

Feature selection
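
As a small, hypothetical illustration of domain-driven feature selection, the sketch below keeps only the columns a battery expert might point to and runs a quick correlation check against the target. All column names and values are invented for this example.

```python
import pandas as pd

# Hypothetical battery dataset with a few logged attributes (all values invented).
df = pd.DataFrame({
    "current":              [1.2, 1.1, 1.3, 1.0, 1.4],
    "voltage":              [3.9, 3.8, 3.7, 3.6, 3.5],
    "temperature":          [25.0, 26.1, 27.3, 28.0, 29.2],
    "cabin_music_volume":   [3, 7, 5, 2, 6],          # unrelated to battery aging
    "remaining_life_hours": [100, 96, 91, 85, 80],
})

# Domain knowledge: current, voltage and temperature drive battery aging,
# so keep only those features (plus the target column).
selected = df[["current", "voltage", "temperature", "remaining_life_hours"]]

# Quick sanity check: correlation of each kept feature with the target.
print(selected.corr()["remaining_life_hours"])
```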

Feature Value Quality Assessment

In this step, we evaluate the quality of the data. During the data collection process, some sensors may stop working or get affected by noise, or, in the case of human intervention, there can be inconsistencies in scales or measurement units. It is simply unrealistic to expect the data to be perfect.

Missing and Wrong data

Some common anomalies that can be present in the data are listed below (a small sketch of handling them follows this list):

  • Missing Values: If we consider data in the form of rows and columns, then columns represent the features and rows represent the recorded values of those features. Some rows may have either no values or NaN (Not a Number) entries. To tackle this problem, there are two common solutions:

1. Eliminate Rows: We can delete the rows having missing or NaN values. But suppose the first column is a time feature, recording when each sample was collected. If we need data at a fixed sampling rate (a consistent difference between two consecutive samples), this method will create gaps.

2. Estimate Rows: We can estimate the missing values using various techniques like linear interpolation, forecast modeling, forward filling, backward filling, average filling, etc. These estimates can be wrong, but the deviation is usually small.

  • Wrong/Inconsistent Values: These inconsistencies are better resolved with the help of human cross-checking. We can write programs to cross-check whether the data is consistent, but covering the hundreds or thousands of corner cases is usually infeasible. For example, a motorbike's speed should realistically lie in the range of 0 to 300 km/h; what if the sensor reports values in the range of 1000–5000 km/h? We usually represent such features graphically and resolve these errors manually.
  • Duplicate Values: Because of various factors, duplicate values may be present in the data. They can make our model biased towards the duplicated samples. These anomalies can be removed with the help of graphical or statistical observations.
  • Presence of Noise: Sensors collecting data can suffer from multiple forms of disturbance or noise. This noise is random and hence brings inconsistency into the data. Several filtering techniques can be used to denoise the signals, like the Savitzky–Golay filter shown in the GIF below. These filters remove the jitter and smooth out the observations.

Savitzky–Golay filter

  • Presence of Outliers: Outliers are observations that lie at an abnormal distance from the other values in the random sample of the dataset. What counts as abnormal is a decision for the data scientist. Several techniques exist, such as replacing such values with the mean of the entire data or with the nearest good samples.

Outlier removal benefits
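
The sketch below ties the anomalies above together on a made-up sensor signal: it interpolates missing values, drops duplicate rows, clips outliers using the interquartile range, and smooths noise with a Savitzky–Golay filter. The data and thresholds are purely illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# Made-up hourly speed readings containing the anomalies discussed above.
df = pd.DataFrame({
    "time":  pd.date_range("2023-01-01", periods=8, freq="h"),
    "speed": [40.0, 42.0, np.nan, 41.0, 41.0, 4000.0, 43.0, 44.0],
})

# 1. Missing values: estimate them with linear interpolation.
df["speed"] = df["speed"].interpolate(method="linear")

# 2. Duplicate rows: drop exact repeats.
df = df.drop_duplicates()

# 3. Outliers: clip values that fall outside the interquartile-range fences.
q1, q3 = df["speed"].quantile([0.25, 0.75])
iqr = q3 - q1
df["speed"] = df["speed"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 4. Noise: smooth the cleaned signal with a Savitzky-Golay filter.
df["speed_smooth"] = savgol_filter(df["speed"], window_length=5, polyorder=2)

print(df)
```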

Feature Aggregation

At the start of this blog, we discussed that data collection has become a common trend. Following this trend, people have started collecting data every minute. But suppose some of this data is not required at such a high frequency, yet we still collect it, and it ends up in the dataset we are going to use for training our machine learning model. Using all of this data is computationally very expensive, so we need to aggregate it in some fashion.

For example, suppose a company measures the precipitation of waste materials from a water source every hour and has collected these readings for years. Instead of using every hourly reading, we can use the daily or monthly precipitation for a more straightforward visualization. This reduces the quantity of data and gives a quick overview of the entire dataset.
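
Here is a minimal sketch of such aggregation with pandas, assuming a hypothetical hourly precipitation series: resampling to daily or monthly averages shrinks the data while preserving the overall trend.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly precipitation readings covering roughly two months.
hours = pd.date_range("2023-01-01", periods=24 * 60, freq="h")
hourly = pd.DataFrame({"precipitation": np.random.rand(len(hours))}, index=hours)

# Aggregate to daily and monthly averages for a quicker, coarser overview.
daily = hourly.resample("D").mean()
monthly = hourly.resample("MS").mean()

print(len(hourly), "hourly rows ->", len(daily), "daily rows ->", len(monthly), "monthly rows")
```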

Feature Sampling

Sampling is one of the crucial concepts in data processing: it means creating subsets of the dataset to perform specific actions on them or to visualize them better. In real-life scenarios, because of sensor failures or human mistakes, different features may have different sampling rates. Also, using the entire dataset may be too computationally expensive.

Let’s consider one scenario where there are two machine learning models:

  1. A light model that requires a considerable amount of data to achieve a certain accuracy.
  2. A complex model that cannot be fed massive data because of limited computational power, but that can quickly match the first model's accuracy if it is given a smaller amount of data carrying meaningful information.

Which model would you prefer?

Feature sampling

(Pic credit: google.site)

The second one, right? So feature sampling gives us the ability to choose the second model. The major challenge in the feature sampling step is that we must not lose valuable information. Varying sampling rates cause an imbalance among features, meaning one feature has more samples than the others. This imbalance can bias the training data, and eventually the model will become biased towards that feature. Sampling techniques help reduce this imbalance in the data.

Some famous sampling techniques are listed below (a small sketch of stratified sampling and upsampling follows the list):

  1. Upsampling/Downsampling
  2. Stratified sampling
  3. Simple random sampling
  4. Gaussian sampling
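
Here is a small sketch of two of these techniques on a hypothetical, imbalanced dataset: a stratified split that preserves class proportions, and upsampling of the minority class with replacement. The labels and sizes are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical, imbalanced dataset: many "healthy" samples, few "faulty" ones.
df = pd.DataFrame({
    "reading": range(100),
    "label":   ["healthy"] * 90 + ["faulty"] * 10,
})

# Stratified sampling: the 50% subset keeps the original class proportions.
subset, _ = train_test_split(df, train_size=0.5, stratify=df["label"], random_state=0)
print(subset["label"].value_counts())

# Upsampling: replicate the minority class (with replacement) until balanced.
healthy = df[df["label"] == "healthy"]
faulty = df[df["label"] == "faulty"]
faulty_up = resample(faulty, replace=True, n_samples=len(healthy), random_state=0)
balanced = pd.concat([healthy, faulty_up])
print(balanced["label"].value_counts())
```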

Dimensionality Reduction

After gaining experience or getting advice from a domain expert, we can extract useful features from the sea of available features. But, as we mentioned, firms try to capture as many attributes as possible so that they do not lose the important ones. Hence, real-world datasets tend to have too many features, and using all of them is infeasible. For example, some companies collect data on a vehicle's speed and its acceleration. From domain knowledge, we know that both features are important, but the question is,

Do we really need speed and acceleration, both to be present in the final set?

Increasing the number of features brings complexity with it. Every feature can be considered a separate dimension, and visualization becomes more difficult. If we observe carefully, speed and acceleration are highly dependent, or correlated, so we can choose just one and drop the other.

To reduce the dimensionality, some famous techniques are listed below (a minimal PCA sketch appears below):

  1. PCA
  2. High Cross-correlation
  3. t-SNE

Dimensionality reduction
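
A minimal PCA sketch on synthetic data, where "acceleration" is generated to be highly correlated with "speed": projecting the three features onto two principal components keeps most of the variance. The numbers are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: "accel" is constructed to be highly correlated with "speed"
# purely to mimic the correlated-feature situation described above.
rng = np.random.default_rng(0)
speed = rng.uniform(0, 120, size=200)
accel = 0.5 * speed + rng.normal(0, 1.0, size=200)
temperature = rng.uniform(10, 40, size=200)
X = np.column_stack([speed, accel, temperature])

# Project the three correlated features onto two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept by each component
```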

Reducing dimensionality brings:

  1. Explainability in machine learning models
  2. Easier and faster data pre-processing
  3. Better visualization

Splitting the dataset

After the steps mentioned above are done, the data is ready for a machine learning model. However, it is advised to split the data into three sets before feeding it to the model:

  1. Training Set
  2. Validation Set
  3. Testing Set

Training Set: This set of data is used to train the model. The model learns the relationships between the inputs and outputs, and even among the input features, present in the training dataset.

Validation Set: This set is used to validate what the model has learned from the training dataset and to tune the hyperparameters involved in the learning.

Test Set: This set is used to evaluate how the model will perform on real-world data. It remains entirely unseen by the model until the final evaluation.

After learning about these three sets, one question naturally comes to mind:

What is the ratio of the split in which we should divide the whole dataset into train/validate/test sets?

The answer to this question is not fixed; one can choose the ratio as per the need. However, it is always advisable to use more training data so that our machine learning model gets exposure to as many corner cases as possible. Based on several well-known machine learning and deep learning courses, a commonly proposed split ratio is 3:1:1, i.e., 60% of the total dataset for training, 20% for validation, and the remaining 20% for testing. This is not a hard and fast rule and can vary as per the need (a minimal sketch of such a split is shown below).
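
A minimal sketch of a 60/20/20 split using scikit-learn's train_test_split applied twice; the random data here simply stands in for any real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 1000 samples, 5 features, binary labels.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First hold out 20% as the test set, then split the rest 75/25,
# giving an overall 60/20/20 train/validation/test ratio.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```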

Splitting of data into three sets

Possible Interview Questions

  1. What is data pre-processing, and why do we need it?
  2. How do we check the quality of the data?
  3. How does feature sampling help?
  4. Is it necessary to remove outliers?
  5. Suppose you don't have the domain knowledge. How will you approach the feature selection step?
  6. Does dimensionality reduction only help in data visualization, or does it transform the data as well?

Conclusion

In this article, we covered the breadth of data pre-processing techniques in detail: feature types, feature selection, feature quality assessment, feature aggregation, feature sampling, and dimensionality reduction. Lastly, we covered splitting the data into training, validation, and testing sets. We have tried to cover all the broad aspects of this part of data science and hope you have enjoyed it. There are some open-ended questions that you can think about and post your answers to in the comment section.

Enjoy Learning! Enjoy pre-processing! Enjoy Algorithms!
