04 — Data Cleaning

Clean Data, Better Models

Three interactive playgrounds cover missing values, feature scaling and encoding, and outlier detection with class imbalance (SMOTE). Drag sliders, click strategies, and press ▶ for full audio explanations.

01
Missing Value Strategies
The missingness mechanism, whether missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), determines the right imputation strategy; using the wrong one can introduce systematic bias into your model.
MCAR → mean/median/mode · MAR → KNN / MICE imputer · MNAR → binary flag feature · Fit on train only! · Missingness ≠ random
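To make the three mechanisms concrete, here is a minimal simulation sketch. The data, the 20% target rate, and the specific rules (older respondents skipping the question for MAR, high earners hiding income for MNAR) are illustrative assumptions, not taken from the widget. It measures how far the mean of the observed incomes drifts from the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 60, n)
# Income depends on age, so missingness driven by age also shifts income.
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, n)

# MCAR: every value has the same 20% chance of being missing.
mcar = rng.random(n) < 0.20
# MAR: missingness depends on an observed column (age), not on income itself.
mar = rng.random(n) < np.where(age > 45, 0.45, 0.05)
# MNAR: missingness depends on the unobserved value (top-quartile incomes).
mnar = rng.random(n) < np.where(income > np.quantile(income, 0.75), 0.60, 0.05)

true_mean = income.mean()
bias = {
    name: income[~mask].mean() - true_mean
    for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]
}
for name, b in bias.items():
    print(f"{name}: bias of observed mean = {b:+.0f}")
```

Under these assumptions the MCAR bias is close to zero, while MAR and MNAR both pull the observed mean downward. MAR can be corrected by conditioning on age; MNAR cannot be corrected from the observed data alone, which is why a missingness flag is the usual fallback.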
[Interactive widget: Missingness Explorer & Imputation Simulator, with audio narration. Controls: missing rate, missingness type, imputation strategy. Panels: data heatmap (cyan = observed · pink = missing · green = imputed), Original vs Imputed Distribution, Imputation Error by Method & Type. Readouts: missing %, imputed rows, MAE (imputed), bias.]
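import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

np.random.seed(42)
df = pd.DataFrame({
    'age':    np.random.randint(20, 60, 100).astype(float),
    'income': np.random.normal(50000, 15000, 100),
})
# replace=False so exactly 20 and 15 distinct rows are blanked out
df.loc[np.random.choice(100, 20, replace=False), 'age']    = np.nan
df.loc[np.random.choice(100, 15, replace=False), 'income'] = np.nan

print("Missing:", df.isnull().sum().to_dict())
# Median imputation for age
df[['age']] = SimpleImputer(strategy='median').fit_transform(df[['age']])
# KNN needs other features to find neighbours; on a single column it
# effectively falls back to the column mean, so pass age alongside income.
# (Fitted on the full frame for brevity; in practice fit on the training split.)
df[['age', 'income']] = KNNImputer(n_neighbors=5).fit_transform(df[['age', 'income']])
print("After  :", df.isnull().sum().to_dict())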
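The "fit on train only" and "binary flag feature" tags can be sketched together as follows. The split ratio and the toy `age` column are assumptions for illustration. `SimpleImputer(add_indicator=True)` appends a 0/1 column marking which values were originally missing, and the median is learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
df = pd.DataFrame({"age": rng.integers(20, 60, 100).astype(float)})
missing_rows = rng.choice(100, size=20, replace=False)
df.loc[missing_rows, "age"] = np.nan

train, test = train_test_split(df, test_size=0.25, random_state=0)

# Learn the median from the training rows only, then reuse it on the test rows.
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_out = imputer.fit_transform(train)
test_out = imputer.transform(test)

print("Median learned from train:", imputer.statistics_[0])
print("Train shape:", train_out.shape, "Test shape:", test_out.shape)
```

Column 0 of each output holds the imputed ages and column 1 holds the missingness flag. Fitting on the full frame instead would let test-set values influence the median, which is a mild form of leakage.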
02
Feature Scaling & Encoding
Distance-based models such as k-nearest neighbours need scaled inputs because features with larger numeric ranges otherwise dominate the distance computation. Categorical variables need encoding because most models accept only numeric input.
StandardScaler: μ=0, σ=1 · MinMaxScaler: [0, 1] · RobustScaler: IQR-based · OHE for nominal categories · Ordinal for ordered
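A small sketch of how the three scalers react to a single extreme value. The ten-point array is made up for illustration. It compares how much spread the nine ordinary values keep after each scaler is fitted on data that includes the outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Nine well-behaved values plus one extreme outlier (illustrative data).
x = np.array([[10.], [11.], [12.], [13.], [14.],
              [15.], [16.], [17.], [18.], [500.]])

scaled = {
    "standard": StandardScaler().fit_transform(x),
    "minmax": MinMaxScaler().fit_transform(x),
    "robust": RobustScaler().fit_transform(x),
}

# Range of the nine inliers after scaling: larger means they stay distinguishable.
inlier_range = {name: float(z[:9].max() - z[:9].min()) for name, z in scaled.items()}
for name, r in inlier_range.items():
    print(f"{name:8s} inlier range = {r:.3f}")
```

With this data, StandardScaler and MinMaxScaler squeeze the nine inliers into a range below 0.1, because the outlier inflates the standard deviation and the maximum. RobustScaler centres on the median and divides by the IQR, so the inliers keep a range above 1.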
[Interactive widget: Scaler Comparison & Encoding Demo, with audio narration. Controls: outlier contamination %, feature skew. Panels: feature values before and after scaling (StandardScaler, MinMaxScaler, RobustScaler, and raw normalized values side by side), encoding strategies for categorical features. Readouts: StandardScaler mean and std, MinMaxScaler range, RobustScaler IQR.]
One-Hot: creates one binary column per category and avoids imposing an order on the categories.
Adding outliers shows why RobustScaler holds up better than StandardScaler on dirty data.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd

df = pd.DataFrame({
    'age':    [25, 45, 35, 50, 28],
    'income': [35000, 80000, 55000, 120000, 42000],
    'city':   ['NY', 'LA', 'NY', 'SF', 'LA'],
    'edu':    ['BS', 'MS', 'PhD', 'BS', 'MS'],
})
# Scale the numeric columns and one-hot encode the nominal 'city' column.
# Columns not listed here ('edu') are dropped by default.
ct = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(drop='first'), ['city']),
])
X = ct.fit_transform(df)
print(f"Before: {df.shape} -> After: {X.shape}")

# 'edu' has a natural order, so map it to integers by hand
edu_map = {'BS': 1, 'MS': 2, 'PhD': 3}
df['edu_ord'] = df['edu'].map(edu_map)
print(df[['edu', 'edu_ord']])
03
Outlier Detection & Class Imbalance
Outliers can distort training by pulling decision boundaries and fitted parameters toward a few extreme points. Class imbalance can lead a model to ignore the minority class while still scoring high accuracy. Both require explicit treatment.
IQR: Q1−1.5·IQR / Q3+1.5·IQR · Z-score: |z|>3 · Isolation Forest: unsupervised · SMOTE: synthetic minority oversampling · class_weight='balanced'
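The three detection rules listed above can disagree on the same data. The sketch below runs them on a ten-point sample with two obvious outliers; the `contamination=0.2` setting for Isolation Forest is an assumption chosen to match the two planted outliers, not a recommended default.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

data = np.array([10, 12, 11, 13, 200, 9, 11, 12, 14, -100], dtype=float)

# IQR rule: flag points more than 1.5*IQR outside the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_flags = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
z_flags = np.abs(z) > 3

# Isolation Forest: unsupervised; contamination is the assumed outlier share.
forest = IsolationForest(contamination=0.2, random_state=0)
forest_flags = forest.fit_predict(data.reshape(-1, 1)) == -1

print("IQR     :", data[iqr_flags])
print("Z-score :", data[z_flags])
print("Forest  :", data[forest_flags])
```

On this sample the IQR rule flags both 200 and -100, while the z-score rule flags nothing: the two extreme values inflate the mean and standard deviation enough that the largest |z| stays below 3. This masking effect is one reason quartile-based rules are often preferred on small or heavily contaminated samples.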
[Interactive widget: Outlier Detection & SMOTE Visualizer, with audio narration. Controls: outlier count, outlier magnitude, detection method. Panels: Data Distribution + Detected Outliers, Boxplot & Detection Boundaries. Readouts: total points, detected, true outliers, precision.]
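import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# IQR outlier detection: flag points beyond 1.5*IQR from the quartiles
data = np.array([10, 12, 11, 13, 200, 9, 11, 12, 14, -100])
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)]
print("Outliers:", outliers)

# SMOTE for class imbalance: resample the training split only,
# so the test set keeps the real class distribution
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)
print("Before SMOTE:", Counter(y_train))
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("After  SMOTE:", Counter(y_res))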
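The `class_weight='balanced'` option is a lighter alternative to resampling: instead of creating synthetic rows, it reweights each class in the loss by `n_samples / (n_classes * class_count)`. A minimal sketch follows, using scikit-learn only; the choice of `LogisticRegression` and the 70/30 split are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

# The weights that class_weight='balanced' applies internally.
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
print("Class weights:", dict(zip(classes.tolist(), weights.round(2).tolist())))

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

# Minority-class recall on the untouched test split.
print("Recall, unweighted:", recall_score(y_test, plain.predict(X_test)))
print("Recall, balanced  :", recall_score(y_test, balanced.predict(X_test)))
```

Reweighting typically raises minority-class recall at the cost of more false positives, so it is worth checking precision alongside recall before choosing between it and SMOTE.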