01
Missing Value Strategies
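The missingness mechanism determines which imputation strategy is appropriate: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Choosing a strategy that does not match the mechanism introduces systematic bias into your model.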
Interactive widget: Missingness Explorer & Imputation Simulator
Controls: missing rate (default 20%), missingness type, imputation strategy.
Panels: data heatmap (cyan = observed, pink = missing, green = imputed); original vs imputed distribution; imputation error by method and type.
Readouts: missing %, imputed rows, MAE (imputed), bias.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(20, 60, 100).astype(float),
    'income': np.random.normal(50000, 15000, 100),
})

# replace=False so exactly 20 and 15 distinct rows are blanked out
df.loc[np.random.choice(100, 20, replace=False), 'age'] = np.nan
df.loc[np.random.choice(100, 15, replace=False), 'income'] = np.nan
print("Missing:", df.isnull().sum().to_dict())

# Median imputation for age
df[['age']] = SimpleImputer(strategy='median').fit_transform(df[['age']])
# KNN imputation for income; age is passed along so neighbours can be found on it
df[['age', 'income']] = KNNImputer(n_neighbors=5).fit_transform(df[['age', 'income']])
print("After :", df.isnull().sum().to_dict())
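To make the link between missingness mechanism and bias concrete, the sketch below simulates the three mechanisms on synthetic data and measures how far the observed mean (the value mean imputation would fill in) drifts from the true mean. The data-generating coefficients, the missingness probabilities, and the helper `mean_bias` are illustrative assumptions, not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 60, n)
# Assumed relationship: income rises with age, plus noise.
income = 30_000 + 800 * (age - 20) + rng.normal(0, 8_000, n)

def mean_bias(mask):
    """Observed-data mean minus true mean; mean imputation inherits this gap."""
    return income[~mask].mean() - income.mean()

# MCAR: every income value has the same 20% chance of being missing.
mcar = rng.random(n) < 0.20

# MAR: missingness depends only on the fully observed column `age`.
mar = rng.random(n) < np.where(age > 45, 0.40, 0.067)

# MNAR: missingness depends on the unobserved income value itself.
mnar = rng.random(n) < np.where(income > np.quantile(income, 0.8), 0.60, 0.10)

bias = {name: mean_bias(mask)
        for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]}
for name, value in bias.items():
    print(f"{name}: bias of mean imputation = {value:,.0f}")
```

On this synthetic setup, MCAR leaves the observed mean close to the truth, while MAR and MNAR pull it down because high-income rows go missing more often. Under MAR the gap can in principle be recovered by conditioning on `age`; under MNAR the observed data alone do not carry enough information to correct it.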
02
Feature Scaling & Encoding
Distance-based models such as k-nearest neighbours and k-means need scaled inputs; otherwise features with large numeric ranges dominate the distance computation. Categorical variables need encoding because most estimators accept only numeric input.
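A small numeric illustration of why unscaled features distort distances: with age in years and income in dollars, the income gap swamps the age gap in a Euclidean distance. The three customers and the standard deviations used for scaling are hypothetical values chosen for this sketch.

```python
import numpy as np

# Hypothetical customers: (age in years, income in dollars).
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # 35 years older, almost the same income
c = np.array([26.0, 52_000.0])   # nearly the same age, slightly higher income

# Assumed population standard deviations, used here as scaling factors.
scale = np.array([10.0, 15_000.0])

def dist(u, v, scale=None):
    """Euclidean distance, optionally after dividing each feature by its scale."""
    diff = u - v
    if scale is not None:
        diff = diff / scale
    return float(np.linalg.norm(diff))

raw = {"b": dist(a, b), "c": dist(a, c)}
scaled = {"b": dist(a, b, scale), "c": dist(a, c, scale)}

nearest_raw = min(raw, key=raw.get)
nearest_scaled = min(scaled, key=scaled.get)
print("Nearest to a (raw):   ", nearest_raw, raw)
print("Nearest to a (scaled):", nearest_scaled, scaled)
```

On raw values the nearest neighbour of `a` is `b`, because a 500-dollar income gap counts for far less than a 2,000-dollar one and the 35-year age gap barely registers. After scaling, `c` becomes the nearest neighbour.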
Interactive widget: Scaler Comparison & Encoding Demo
Controls: outlier contamination % (default 0%), feature skew (default 0.0).
Panels: feature values before and after scaling, with StandardScaler, MinMaxScaler, RobustScaler, and the raw (normalized) values side by side; encoding strategies for categorical features.
One-Hot: creates one binary column per category and avoids imposing an ordinal assumption.
Readouts: StandardScaler mean, StandardScaler std, MinMaxScaler range, RobustScaler IQR.
Add outliers to see why RobustScaler outperforms StandardScaler on dirty data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    'age': [25, 45, 35, 50, 28],
    'income': [35000, 80000, 55000, 120000, 42000],
    'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
    'edu': ['BS', 'MS', 'PhD', 'BS', 'MS'],
})

# Scale numeric columns and one-hot encode city; unlisted columns (edu) are dropped
ct = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(drop='first'), ['city']),
])
X = ct.fit_transform(df)
print(f"Before: {df.shape} -> After: {X.shape}")

# Ordinal encoding for edu, which has a natural order (BS < MS < PhD)
edu_map = {'BS': 1, 'MS': 2, 'PhD': 3}
df['edu_ord'] = df['edu'].map(edu_map)
print(df[['edu', 'edu_ord']])
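The effect of outliers on each scaler can be checked with a short experiment. The sketch below contaminates a synthetic feature with 5% extreme values and measures how much spread the ordinary points keep after scaling. The synthetic data, the contamination level, and the `spread` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

rng = np.random.default_rng(0)
values = rng.normal(50, 10, size=(200, 1))
values[:10] = 1_000.0          # overwrite the first 10 rows (5%) with extreme values

scalers = {
    "standard": StandardScaler(),   # centres on the mean, divides by the std
    "minmax": MinMaxScaler(),       # maps min..max onto 0..1
    "robust": RobustScaler(),       # centres on the median, divides by the IQR
}

spread = {}
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(values)
    inliers = scaled[10:, 0]                       # the uncontaminated rows
    spread[name] = float(inliers.max() - inliers.min())
    print(f"{name:8s} inlier range after scaling: {spread[name]:.3f}")
```

With mean/std or min/max scaling, the outliers inflate the scale statistic and squeeze the ordinary points into a narrow band. The median and IQR are barely affected by a small fraction of extremes, so RobustScaler keeps the inliers well separated.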
03
Outlier Detection & Class Imbalance
Outliers can distort training by pulling fitted parameters and decision boundaries toward extreme points. Class imbalance pushes models toward the majority class, so the minority class is often ignored. Both need explicit treatment.
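Detection rules differ in how much the outliers themselves corrupt the rule. The sketch below applies three common rules to a small hypothetical array with two planted extremes: a z-score cutoff, IQR fences, and a modified z-score based on the median absolute deviation (MAD). The cutoffs 3, 1.5, and 3.5 and the constant 0.6745 are conventional choices, assumed here rather than taken from the text.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 200, 9, 11, 12, 14, -100], dtype=float)

# Rule 1: z-score. The mean and std are themselves inflated by the extremes.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# Rule 2: IQR fences, 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Rule 3: modified z-score built on the median and the MAD.
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
mad_outliers = data[np.abs(modified_z) > 3.5]

print("z-score  :", z_outliers)
print("IQR      :", iqr_outliers)
print("MAD-based:", mad_outliers)
```

The z-score rule flags nothing here: the two extremes inflate the standard deviation enough to mask themselves, and with only ten points no value can exceed a z-score of 3. The quartile-based and median-based rules flag both 200 and -100.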
Interactive widget: Outlier Detection & SMOTE Visualizer
Controls: outlier count (default 3), outlier magnitude (default 4.0), detection method.
Panels: data distribution with detected outliers; boxplot and detection boundaries.
Readouts: total points, detected, true outliers, precision.
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# IQR outlier detection
data = np.array([10, 12, 11, 13, 200, 9, 11, 12, 14, -100])
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]
print("Outliers:", outliers)

# SMOTE for class imbalance
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Before SMOTE:", Counter(y))
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_res))
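For intuition about what SMOTE does, here is a simplified sketch of its core idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_like` and its parameters are hypothetical names for this illustration; it is not imblearn's implementation and omits its options and edge-case handling.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among minority samples only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbours

    base = rng.integers(0, len(X_min), n_new)              # random base sample per new point
    picked = neighbours[base, rng.integers(0, k, n_new)]   # one random neighbour of each base
    gap = rng.random((n_new, 1))                           # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class samples with two features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
synthetic = smote_like(X_min, n_new=10)
print(synthetic.shape)
print(synthetic.round(2))
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies rather than duplicating existing rows.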
