Model Evaluation Metrics — Interactive Guide

01

Classification Metrics

Accuracy is misleading on imbalanced datasets: a model that always predicts the majority class can score high while finding no positives. Choose the precision/recall trade-off according to the business cost of false positives versus false negatives.

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F1 = 2·P·R / (P+R)
Accuracy misleads on imbalance
Threshold controls trade-off
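
To make the formulas concrete, the sketch below computes all four metrics from raw confusion-matrix counts. The counts and the helper name `classification_metrics` are invented for illustration: 1,000 samples with 10% positives, scored once by a model that predicts negative for everything and once by a model that actually finds positives.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (one common convention).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: 100 positives, 900 negatives.
# A model that always predicts "negative" finds no positives at all.
always_negative = classification_metrics(tp=0, fp=0, fn=100, tn=900)

# A model that catches 70 of the 100 positives at the cost of 60 false alarms.
useful_model = classification_metrics(tp=70, fp=60, fn=30, tn=840)

for name, m in [("always negative", always_negative),
                ("useful model", useful_model)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these made-up counts the two accuracies are almost identical (0.90 vs 0.91), while recall moves from 0 to 0.70 and F1 from 0 to about 0.61. That gap is what accuracy hides on imbalanced data.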

Interactive Widget: Confusion Matrix & Threshold Explorer. Sliders set class imbalance (% positive), model skill (separability), and the decision threshold. The widget shows the confusion matrix (TP, FN, FP, TN), accuracy, precision, recall, and F1, plus a plot of precision, recall, and F1 against the threshold. Adjust the threshold to explore the precision-recall trade-off.

from sklearn.metrics import classification_report, average_precision_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)
# Stratify so train and test keep the same class ratio; seed for reproducibility
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      stratify=y, random_state=42)
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(Xtr, ytr)
y_pred = clf.predict(Xte)                # hard class labels
y_prob = clf.predict_proba(Xte)[:, 1]    # probability of the positive class
print(classification_report(yte, y_pred))
# Average precision is scikit-learn's summary of the precision-recall curve
print(f"AUC-PR: {average_precision_score(yte, y_prob):.3f}")
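
The link between the threshold and business cost can be sketched without any library. The labels, scores, and the two cost constants below are hypothetical, chosen only to show the mechanics: sweep a few thresholds, count the resulting errors, and price them.

```python
# Hypothetical labels (1 = positive) and scores; not produced by the model above.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

# Assumed business costs: a missed positive costs ten times a false alarm.
COST_FP = 1.0
COST_FN = 10.0


def confusion_at(threshold):
    """Count TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_prob):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


results = []
for threshold in [0.25, 0.50, 0.75]:
    tp, fp, fn, tn = confusion_at(threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    cost = COST_FP * fp + COST_FN * fn
    results.append((threshold, precision, recall, cost))
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  "
          f"recall={recall:.2f}  cost={cost:.0f}")

# Pick the threshold with the lowest total cost under the assumed prices.
best_threshold = min(results, key=lambda row: row[3])[0]
print("lowest-cost threshold:", best_threshold)
```

On this toy data, raising the threshold lifts precision and lowers recall at every step. Because false negatives are priced ten times higher, the lowest threshold wins despite its weaker precision; swapping the two costs would favor a higher threshold.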

02

AUC-ROC vs AUC-PR

AUC-ROC measures overall ranking quality across all thresholds. AUC-PR is more informative when positives are rare because neither precision nor recall uses true negatives, so a large negative class cannot inflate the score.

ROC: TPR vs FPR at all thresholds
AUC = 0.5 random · 1.0 perfect
AUC-PR: Precision vs Recall
Use AUC-PR when positives < 5%
Both are threshold-independent
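
One way to see why the PR view reacts to imbalance while the ROC view does not: a point on the ROC curve is a (TPR, FPR) pair, and neither rate depends on prevalence, but the precision implied by that same point does. The sketch below applies Bayes' rule to an assumed operating point; the function name `precision_at` and all numbers are illustrative.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by an ROC operating point at a given prevalence."""
    true_pos = tpr * prevalence            # share of all samples that are TP
    false_pos = fpr * (1.0 - prevalence)   # share of all samples that are FP
    return true_pos / (true_pos + false_pos)


# Assumed operating point: the classifier catches 80% of positives
# while raising false alarms on 5% of negatives.
TPR, FPR = 0.80, 0.05

for prevalence in [0.50, 0.10, 0.01]:
    p = precision_at(TPR, FPR, prevalence)
    print(f"prevalence={prevalence:.2f}  TPR={TPR}  FPR={FPR}  precision={p:.3f}")

# A random classifier has TPR == FPR, so its precision equals the prevalence.
# That is the "random PR baseline" for a dataset with 5% positives.
random_baseline = precision_at(0.30, 0.30, 0.05)
print(f"random baseline at 5% prevalence: {random_baseline:.3f}")
```

The same ROC point corresponds to a precision of roughly 0.94 at 50% prevalence, 0.64 at 10%, and 0.14 at 1%. The last lines show why the random baseline on a PR plot sits at the positive rate rather than at a fixed value like 0.5.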

Interactive Widget: ROC & PR Curve Explorer. Sliders set positive class prevalence, model discrimination (AUC), and a highlighted threshold. The widget plots the ROC curve (TPR vs FPR) and the precision-recall curve, and reports AUC-ROC, AUC-PR, prevalence, and the random PR baseline. Compare how ROC and PR curves change with class imbalance.

from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic problem with roughly 3% positives
X, y = make_classification(n_samples=10000, weights=[0.97, 0.03],
                           n_features=15, random_state=42)
# Stratify to keep the rare class represented in the test split
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      stratify=y, random_state=42)
model = LogisticRegression(class_weight='balanced', max_iter=500)
model.fit(Xtr, ytr)
yprob = model.predict_proba(Xte)[:, 1]   # probability of the positive class

print(f"AUC-ROC: {roc_auc_score(yte, yprob):.4f}")
print(f"AUC-PR:  {average_precision_score(yte, yprob):.4f}")
# On data this imbalanced, AUC-PR is typically much lower than AUC-ROC.
# Judge it against the random baseline, which equals the positive rate (yte.mean()).
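
To show what the two printed numbers summarize, here is a hand-rolled sketch of both on a tiny made-up ranking. It assumes no tied scores inside the average-precision helper and is meant to illustrate the definitions, not to replace the scikit-learn functions; the names `auc_roc` and `average_precision` are chosen for this example.

```python
def auc_roc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision measured at the rank of each positive (no ties)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Hypothetical ranking: 2 positives among 10 samples, listed from the
# highest score to the lowest. The positives land at ranks 2 and 5.
y_true = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

roc = auc_roc(y_true, scores)
ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)
print(f"AUC-ROC={roc:.2f}  AP={ap:.2f}  random PR baseline={prevalence:.2f}")
```

Here 12 of the 16 positive-negative pairs are ordered correctly, giving an AUC-ROC of 0.75, while the precision at the two positive ranks (1/2 and 2/5) averages to 0.45. The ROC number rewards every negative ranked below a positive; the PR number is dragged down by each negative ranked above one.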

03

Regression Metrics

MAE is more robust to outliers than RMSE, which squares errors and therefore penalizes large ones heavily. R² shows the share of variance your model explains relative to a naive baseline that always predicts the mean.

MAE = mean|y − ŷ|
RMSE = √MSE (outlier-sensitive)
R² = 1 − SS_res/SS_tot
MAPE = mean(|y − ŷ| / |y|)
Negative R² = worse than mean
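
The four formulas can be checked by hand on a few numbers. The sketch below implements them directly in plain Python; the helper name `regression_metrics` and the toy values are assumptions for illustration, and the second set of predictions is deliberately bad to produce a negative R².

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R^2 and MAPE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # spread around the mean
    r2 = 1.0 - ss_res / ss_tot
    # MAPE is undefined when any true value is zero; these toy targets avoid that.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


y_true = [10.0, 20.0, 30.0, 40.0]

# Predictions that track the targets closely.
good = regression_metrics(y_true, [12.0, 18.0, 33.0, 37.0])

# Predictions in reverse order: worse than always guessing the mean (25).
bad = regression_metrics(y_true, [40.0, 30.0, 20.0, 10.0])

for name, m in [("good", good), ("bad", bad)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

For the close predictions, SS_res is 26 against an SS_tot of 500, so R² is 0.948. For the reversed predictions, SS_res is 2000, four times SS_tot, so R² is −3: the model does worse than the mean baseline.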

Interactive Widget: Regression Error Explorer. Sliders set model noise / error, outlier intensity, and the number of outliers. The widget plots predicted vs actual values with residuals, the residual distribution, and MAE vs RMSE as outlier intensity grows, and reports MAE, RMSE, R², and MAPE. Add outliers and watch RMSE diverge from MAE: RMSE penalizes large errors disproportionately.

from sklearn.metrics import (mean_absolute_error,
    mean_squared_error, r2_score)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic regression problem with Gaussian noise (std 25) on the target
X, y = make_regression(n_samples=1000, n_features=10,
                       noise=25, random_state=42)
# Seed the split so the reported metrics are reproducible
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(Xtr, ytr)
yp = model.predict(Xte)
print(f"MAE:  {mean_absolute_error(yte, yp):.2f}")
# RMSE is the square root of MSE, so it is in the same units as y
print(f"RMSE: {np.sqrt(mean_squared_error(yte, yp)):.2f}")
print(f"R^2:  {r2_score(yte, yp):.4f}")
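
The outlier sensitivity of RMSE can be reproduced with a handful of residuals. In this sketch, nine errors of magnitude 1 stay fixed while a tenth error grows; the residual values are made up for illustration and the helper names are local to the example.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical residuals: nine small errors of magnitude 1.
base_errors = [1.0, -1.0] * 4 + [1.0]

rows = []
for outlier in [1.0, 10.0, 100.0]:
    errors = base_errors + [outlier]   # the tenth residual is the outlier
    rows.append((outlier, mae(errors), rmse(errors)))

for outlier, m, r in rows:
    print(f"outlier={outlier:6.1f}  MAE={m:6.2f}  RMSE={r:6.2f}  "
          f"RMSE/MAE={r / m:.2f}")
```

With no real outlier the two metrics agree at 1.0. As the single large error grows to 10 and then 100, MAE rises to 1.9 and 10.9, while RMSE rises to about 3.3 and 31.6, so the RMSE/MAE ratio keeps widening.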

Measuring What Matters