ML Foundations: Preprocessing Data for Machine Learning
Introduction
You’ve learned about powerful machine learning models like Linear and Logistic Regression. But in the real world, data is rarely clean and ready to be fed into a model. It’s often messy, incomplete, and in a format that algorithms can’t understand.
This is where Data Preprocessing comes in. It’s the crucial first step of cleaning and preparing your data, and it is arguably the most important phase of any machine learning project. The principle is simple: “Garbage in, garbage out.” No matter how sophisticated your model is, it will produce poor results if the data it’s trained on is flawed.
This guide will walk you through the three most essential preprocessing tasks you’ll encounter in almost every project:
- Handling Missing Values: What to do when data is incomplete.
- Encoding Categorical Data: How to convert text labels into numbers models can understand.
- Feature Scaling: Why and how to bring all your features to a similar scale.
We’ll use Python’s pandas and scikit-learn libraries for practical, real-world examples.
1. Handling Missing Values
Most machine learning algorithms cannot work with missing data, often represented as NaN (Not a Number). Let’s first create a sample dataset with some missing values.
import pandas as pd
import numpy as np

# Sample dataset containing missing values (NaN) in every column
data = {
    'Age': [25, 32, np.nan, 45, 38],
    'Salary': [50000, 80000, 65000, np.nan, 72000],
    'Country': ['USA', 'Canada', 'USA', np.nan, 'Mexico']
}
df = pd.DataFrame(data)
print(df)
Strategy 1: Remove the Data
The simplest approach is to remove the rows or columns containing NaN values.
- Remove Rows: df.dropna(axis=0)
  - Pros: Quick and easy.
  - Cons: You lose data, which can be a problem if your dataset is small.
- Remove Columns: df.dropna(axis=1)
  - Pros: Useful if a column is mostly empty or irrelevant.
  - Cons: You lose a potentially useful feature.
This is generally not recommended unless the missing data is a very small fraction of your dataset.
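A minimal sketch of both options applied to the df created above (neither call modifies df in place):
# Drop any row that contains at least one NaN value
df_rows_dropped = df.dropna(axis=0)
print(df_rows_dropped)

# Drop any column that contains at least one NaN value
# (in this df every column has a NaN, so all three columns are dropped)
df_cols_dropped = df.dropna(axis=1)
print(df_cols_dropped)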
Strategy 2: Impute the Data (Fill It In)
A better approach is to fill in the missing values. This is called imputation. The SimpleImputer from Scikit-Learn is perfect for this.
For Numerical Data (Age, Salary)
You can replace NaN values with the mean, median, or a constant value. The median is often preferred because it is robust to outliers.
from sklearn.impute import SimpleImputer
# Impute 'Age' with the median
num_imputer = SimpleImputer(strategy='median')
df['Age'] = num_imputer.fit_transform(df[['Age']])
# Impute 'Salary' with the mean
num_imputer_mean = SimpleImputer(strategy='mean')
df['Salary'] = num_imputer_mean.fit_transform(df[['Salary']])
For Categorical Data (Country)
You can replace NaN values with the most frequent category (the mode) or a constant string like “Missing”.
cat_imputer = SimpleImputer(strategy='most_frequent')
df['Country'] = cat_imputer.fit_transform(df[['Country']])
print(df)
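If you would rather flag missing categories explicitly than fill them with the mode, SimpleImputer also supports a constant fill. A minimal sketch on a fresh example frame (the 'Missing' label is an arbitrary choice for illustration):
# Alternative: replace NaN with an explicit 'Missing' label instead of the mode
df_alt = pd.DataFrame({'Country': ['USA', 'Canada', np.nan, 'Mexico']})
cat_imputer_const = SimpleImputer(strategy='constant', fill_value='Missing')
df_alt['Country'] = cat_imputer_const.fit_transform(df_alt[['Country']])
print(df_alt)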
2. Encoding Categorical Data
Machine learning models are mathematical, so they need numbers, not text. We must convert categorical features like “Country” into a numerical format.
Strategy 1: Ordinal Encoding
This method is used when the categories have a natural, ordered relationship. For example, ['Small', 'Medium', 'Large']. We can map these to [0, 1, 2].
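A minimal sketch using Scikit-Learn's OrdinalEncoder, passing the category order explicitly (the 'Size' column here is purely illustrative):
from sklearn.preprocessing import OrdinalEncoder

df_size = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})
# Supply the categories in their natural order so Small=0, Medium=1, Large=2
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df_size['Size_encoded'] = ordinal_encoder.fit_transform(df_size[['Size']])
print(df_size)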
Strategy 2: One-Hot Encoding
This is the most common method, used for nominal data where there is no intrinsic order, like our “Country” feature. Simply mapping USA=0, Canada=1, Mexico=2 would incorrectly imply that Mexico > Canada.
One-Hot Encoding solves this by creating a new binary column for each category. For a given row, the column corresponding to its category will be 1, and all other new columns will be 0.
# Let's use a fresh dataframe for this example
data_cat = {'Country': ['USA', 'Canada', 'USA', 'Mexico']}
df_cat = pd.DataFrame(data_cat)
# Using pandas get_dummies is a very easy way to do this
df_one_hot_pd = pd.get_dummies(df_cat, prefix='Country')
print(df_one_hot_pd)
# Using Scikit-Learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse_output=False)
country_encoded = one_hot_encoder.fit_transform(df_cat[['Country']])
print(country_encoded)
Output (from pandas):
   Country_Canada  Country_Mexico  Country_USA
0           False           False         True
1            True           False        False
2           False           False         True
3           False            True        False
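Scikit-Learn's encoder returns a plain NumPy array rather than a labelled DataFrame; if you want readable column names back, one option is to combine it with get_feature_names_out, as in this minimal sketch:
# Wrap the encoded array in a DataFrame with readable column names
df_one_hot_sk = pd.DataFrame(
    country_encoded,
    columns=one_hot_encoder.get_feature_names_out()
)
print(df_one_hot_sk)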
3. Feature Scaling
Consider a dataset with Age (e.g., 20-70) and Salary (e.g., 50,000-200,000). Many ML algorithms, especially those based on distance (like SVMs) or gradient descent, will be dominated by the Salary feature because its scale is so much larger. Feature scaling solves this by bringing all features onto a comparable scale.
Strategy 1: Normalization (Min-Max Scaling)
This technique scales all data to a fixed range, usually [0, 1].
Formula: X_scaled = (X - X_min) / (X_max - X_min)
- When to use it: Good for algorithms that don’t assume any specific data distribution, like k-Nearest Neighbors (k-NN).
- Downside: It’s sensitive to outliers. A single very large or small value can skew the scaling of all other data points.
from sklearn.preprocessing import MinMaxScaler
data_scale = {'Age': [25, 32, 45, 38], 'Salary': [50000, 80000, 120000, 72000]}
df_scale = pd.DataFrame(data_scale)
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df_scale), columns=df_scale.columns)
print("Normalized Data:\n", df_normalized)
Strategy 2: Standardization (Z-score Scaling)
This is the most common scaling technique. It transforms the data to have a mean of 0 and a standard deviation of 1.
Formula: X_scaled = (X - mean(X)) / std_dev(X)
- When to use it: The default choice for most algorithms, including Linear/Logistic Regression and Support Vector Machines.
- Benefit: It is less distorted by outliers than min-max normalization, although extreme values still influence the mean and standard deviation.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df_scale), columns=df_scale.columns)
print("Standardized Data:\n", df_standardized)
4. Putting It All Together: The Pipeline
In a real project, you must be careful not to “leak” data from your test set into your training process. For example, you should calculate the mean for imputation or the min/max for scaling using only the training data. Then, you use those same learned parameters to transform the test data.
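A minimal sketch of that fit-on-train, transform-on-test discipline using a scaler (the split itself is just for illustration):
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df_scale, test_size=0.25, random_state=42)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...then apply those same learned parameters to the test data
X_test_scaled = scaler.transform(X_test)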
Scikit-Learn’s Pipeline and ColumnTransformer make this process safe, easy, and repeatable.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Let's use a more complete dataset
X = pd.DataFrame({
    'num_feature_1': [1, 2, np.nan, 4, 5],
    'num_feature_2': [10, np.nan, 30, 40, 50],
    'cat_feature': ['A', 'B', 'A', 'C', np.nan]
})
y = np.array([0, 1, 0, 1, 0])

# 1. Create preprocessing pipelines for numerical and categorical features
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# 2. Use ColumnTransformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, ['num_feature_1', 'num_feature_2']),
        ('cat', categorical_pipeline, ['cat_feature'])
    ])

# 3. Create the full pipeline, including the model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# 4. Now, you can train your model on the raw data!
full_pipeline.fit(X, y)

# And make predictions
new_data = pd.DataFrame({
    'num_feature_1': [3], 'num_feature_2': [25], 'cat_feature': ['B']
})
print("Prediction:", full_pipeline.predict(new_data))
Conclusion
Data preprocessing is a universe in itself, but these three techniques—handling missing values, encoding categorical data, and feature scaling—form the essential foundation. By mastering them, you ensure that your model receives clean, well-formatted data, allowing it to learn effectively and produce reliable results.
Using Scikit-Learn’s Pipeline and ColumnTransformer is a professional best practice that not only saves you from common errors like data leakage but also makes your code cleaner and more reproducible.
Suggested Reading
- Scikit-Learn Documentation: The user guide on “Preprocessing data” is an invaluable and comprehensive resource.
- “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” by Aurélien Géron: Chapter 2 is dedicated entirely to a real-world data preprocessing workflow.
- “Python for Data Analysis” by Wes McKinney: A must-read for mastering the pandas skills that underpin all preprocessing tasks.