03 — Model Evaluation

Measuring What Matters

Interactive playgrounds for classification metrics, ROC vs PR curves, and regression scoring. Drag thresholds. Watch precision and recall trade off in real time. Press ▶ to hear each concept explained.

01
Classification Metrics
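Accuracy is misleading on imbalanced datasets: with 10% positives, a model that always predicts the negative class scores 90% accuracy while catching nothing. The precision/recall trade-off should be set according to the business cost of false positives versus false negatives.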
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2·P·R / (P + R)
Accuracy misleads on imbalance
Threshold controls trade-off
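The formulas above can be checked by hand on a small confusion matrix. The sketch below is a minimal illustration in plain Python. The function name `classification_metrics` and the counts are made up for this example (1,000 samples with 10% positives), and reporting 0 when a denominator is zero is a convention assumed here, not part of the formulas.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Convention (an assumption here): report 0.0 when a denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 1,000 samples, 100 actual positives
model = classification_metrics(tp=60, fn=40, fp=30, tn=870)
# A "model" that always predicts the negative class
always_negative = classification_metrics(tp=0, fn=100, fp=0, tn=900)

for name, m in [("model", model), ("always negative", always_negative)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The always-negative baseline reaches 90% accuracy with zero recall and zero F1, which is the imbalance problem in miniature.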
[Interactive widget: Confusion Matrix & Threshold Explorer. Sliders for class imbalance (% positive), model skill (separability), and decision threshold. Displays the confusion matrix (TP, FN, FP, TN), accuracy, precision, recall, and F1 score, plus a plot of precision, recall, and F1 against the threshold.]
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)
# Stratify so the rare class keeps its share in both splits
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      stratify=y, random_state=42)
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(Xtr, ytr)
y_pred = clf.predict(Xte)             # hard labels at the default 0.5 threshold
yprob = clf.predict_proba(Xte)[:, 1]  # predicted probability of the positive class
print(classification_report(yte, y_pred))
# Average precision summarizes the PR curve (a common AUC-PR estimate)
print(f"AUC-PR: {average_precision_score(yte, yprob):.3f}")
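To make the threshold trade-off concrete, here is a small plain-Python sketch that sweeps a few thresholds over hand-made scores and then picks the cheapest one. The scores, labels, helper names (`confusion_at`, `cheapest_threshold`), and cost values are all hypothetical. In practice the scores would come from something like `predict_proba`, and the costs from the business.

```python
# Hypothetical predicted probabilities and true labels (4 positives, 6 negatives)
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

def confusion_at(threshold):
    """Return (tp, fp, fn) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn

def cheapest_threshold(thresholds, cost_fp, cost_fn):
    """Pick the threshold with the lowest total cost of FP and FN errors."""
    def cost(t):
        _, fp, fn = confusion_at(t)
        return cost_fp * fp + cost_fn * fn
    return min(thresholds, key=cost)

thresholds = [0.25, 0.50, 0.75]
for t in thresholds:
    tp, fp, fn = confusion_at(t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    print(f"threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")

# Missing a positive is 5x worse than a false alarm: favors a low threshold
print(cheapest_threshold(thresholds, cost_fp=1, cost_fn=5))
# A false alarm is 5x worse than a miss: favors a high threshold
print(cheapest_threshold(thresholds, cost_fp=5, cost_fn=1))
```

Lowering the threshold raises recall and, on this toy data, lowers precision. Which end is preferable depends entirely on the relative cost assigned to each error type.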
02
AUC-ROC vs AUC-PR
AUC-ROC measures overall ranking quality; AUC-PR is more informative when positives are rare because neither precision nor recall counts true negatives, so a large pool of easy negatives cannot inflate it.
ROC: TPR vs FPR at all thresholds
AUC = 0.5 random · 1.0 perfect
AUC-PR: Precision vs Recall
Use AUC-PR when positives < 5%
Both are threshold-independent
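The effect of class imbalance on the two scores can be reproduced without any library. The sketch below computes AUC-ROC as a pairwise ranking probability and average precision (a common estimate of AUC-PR) from scratch. The helper names and the toy scores are hypothetical, and the "rare positives" case is built by simply repeating each negative ten times, which is an artificial way to raise imbalance without changing how well the model ranks.

```python
def roc_auc(scores, labels):
    """AUC-ROC as the chance a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive.

    Simplification: assumes no positive shares a score with a negative.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Hypothetical scores: 4 positives, 6 negatives (40% prevalence)
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Same positives, but every negative repeated 10 times (about 6% prevalence)
rare_scores, rare_labels = [], []
for s, y in zip(scores, labels):
    copies = 1 if y == 1 else 10
    rare_scores += [s] * copies
    rare_labels += [y] * copies

for name, sc, lb in [("balanced", scores, labels),
                     ("rare positives", rare_scores, rare_labels)]:
    prevalence = sum(lb) / len(lb)
    print(f"{name}: AUC-ROC={roc_auc(sc, lb):.3f}  "
          f"AP={average_precision(sc, lb):.3f}  "
          f"random PR baseline={prevalence:.3f}")
```

AUC-ROC is identical in both cases because the ranking of positives against negatives has not changed, while average precision drops as the extra negatives crowd the top of the ranking. The random PR baseline equals the prevalence, so it falls too.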
[Interactive widget: ROC & PR Curve Explorer. Sliders for positive class prevalence, model discrimination (AUC), and highlighted threshold. Displays the ROC curve (TPR vs FPR) and the precision-recall curve, with AUC-ROC, AUC-PR, prevalence, and the random PR baseline, to compare how the two curves change with class imbalance.]
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic problem with roughly 3% positives
X, y = make_classification(n_samples=10000, weights=[0.97, 0.03],
                           n_features=15, random_state=42)
# Stratify so both splits keep the same share of the rare class
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      stratify=y, random_state=42)
model = LogisticRegression(class_weight='balanced', max_iter=500)
model.fit(Xtr, ytr)
yprob = model.predict_proba(Xte)[:, 1]  # probability of the positive class

print(f"AUC-ROC: {roc_auc_score(yte, yprob):.4f}")
print(f"AUC-PR:  {average_precision_score(yte, yprob):.4f}")
# AUC-PR is usually much lower here, which reveals the true difficulty
03
Regression Metrics
MAE is less sensitive to outliers than RMSE, which squares each error and so penalizes large ones heavily; R² shows how much variance your model explains relative to a naive mean baseline.
MAE = mean|y − ŷ|
RMSE = √MSE (outlier-sensitive)
R² = 1 − SS_res / SS_tot
MAPE = mean(|y − ŷ| / |y|)
Negative R² = worse than mean
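These definitions are short enough to implement directly, which makes the outlier behavior easy to see. The following is a minimal plain-Python sketch. The function name `regression_metrics` and the five target/prediction values are invented for illustration, and MAPE is returned as a fraction, not a percentage.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, R^2 and MAPE for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # spread around the mean
    r2 = 1 - ss_res / ss_tot
    # MAPE is undefined when any true value is 0; these targets avoid that
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}

y_true = [10, 20, 30, 40, 50]  # hypothetical targets
clean = regression_metrics(y_true, [12, 18, 33, 37, 52])
one_outlier = regression_metrics(y_true, [12, 18, 33, 37, 80])
constant_100 = regression_metrics(y_true, [100] * 5)

for name, m in [("clean", clean), ("one outlier", one_outlier),
                ("always predicts 100", constant_100)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

A single bad prediction moves MAE from 2.4 to 8.0 but RMSE from about 2.45 to about 13.6, and a constant prediction far from the targets produces a negative R², meaning it does worse than predicting the mean.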
[Interactive widget: Regression Error Explorer. Sliders for model noise/error, outlier intensity, and number of outliers. Displays a predicted vs actual plot with residuals, the residual distribution, and MAE, RMSE, and MAPE, plus a plot showing that RMSE grows faster than MAE as outlier intensity increases because it penalizes large errors disproportionately.]
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import (mean_absolute_error,
    mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split

# Synthetic regression problem with Gaussian noise on the target
X, y = make_regression(n_samples=1000, n_features=10,
                       noise=25, random_state=42)
# Fix the split seed so the printed metrics are reproducible
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      random_state=42)
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(Xtr, ytr)
yp = model.predict(Xte)
print(f"MAE:  {mean_absolute_error(yte, yp):.2f}")
# RMSE is the square root of MSE, in the same units as the target
print(f"RMSE: {np.sqrt(mean_squared_error(yte, yp)):.2f}")
print(f"R^2:  {r2_score(yte, yp):.4f}")