Business Cases & ML System Design

A/B Testing — Sample Size & Analysis

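Rigorous A/B testing requires computing the sample size before the experiment starts and analyzing the results only once that sample has been collected. Repeatedly checking the p-value and stopping at the first significant result (peeking) inflates the false positive rate and is a common cause of spurious experiment wins in industry.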

Sample size from baseline, MDE, α, power
Run full weeks; avoid early stopping
One primary metric + guardrails
Network effects → cluster randomization
Multiple testing → Bonferroni / FDR
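The multiple-testing item above can be sketched with plain NumPy. The p-values are made-up illustrative numbers, and `bonferroni` and `benjamini_hochberg` are hypothetical helper names written for this example, not library functions.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 where p < alpha / m (controls the family-wise error rate)."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals < alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    # Find the largest rank k with p_(k) <= (k / m) * alpha
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True   # reject everything up to and including rank k
    return reject

# Illustrative p-values for five metrics (made-up numbers)
pvals = [0.001, 0.012, 0.025, 0.045, 0.200]
print("Bonferroni:", bonferroni(pvals))
print("BH (FDR):  ", benjamini_hochberg(pvals))
```

Bonferroni is the more conservative of the two: it guards against any false positive across the family of tests, whereas the FDR procedure tolerates a controlled fraction of false discoveries and therefore rejects more hypotheses on the same inputs.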

Interactive widget: Sample Size Calculator & Experiment Simulator

Inputs (defaults): baseline CTR 12%, minimum detectable effect (MDE) 2%, statistical power 80%, significance α 0.05, daily traffic per variant 5,000, simulated true lift 2%.

Charts: simulated p-value trajectory over time, shown against the α significance threshold and the required-n boundary, with the peeking region marked as the danger zone; power curve of sample size vs detectable effect; null vs alternative sampling distributions.

Outputs: required n per variant, days to run, simulated p-value, result, peeking risk.
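The "days to run" output can be reproduced by hand. As a rough sketch: divide the required sample size per variant by the daily traffic per variant, then round up to whole weeks so that every day of the week is equally represented. The helper name `days_to_run` is hypothetical, and the inputs mirror the widget defaults (12% baseline, 2-point absolute MDE, 5,000 users per variant per day).

```python
import math
from scipy import stats

def sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion z-test (mde is absolute)."""
    p1, p2 = baseline, baseline + mde
    pooled = (p1 + p2) / 2
    za = stats.norm.ppf(1 - alpha / 2)
    zb = stats.norm.ppf(power)
    n = (za * math.sqrt(2 * pooled * (1 - pooled))
         + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)

def days_to_run(n_per_variant, daily_traffic_per_variant, full_weeks=True):
    """Days needed to reach n, optionally rounded up to a multiple of 7."""
    days = math.ceil(n_per_variant / daily_traffic_per_variant)
    if full_weeks:
        days = 7 * math.ceil(days / 7)
    return days

n = sample_size(baseline=0.12, mde=0.02)
print(f"n per variant:     {n:,}")
print(f"days (raw):        {days_to_run(n, 5000, full_weeks=False)}")
print(f"days (full weeks): {days_to_run(n, 5000)}")
```

With high traffic the raw number of days can be very small, which is exactly when the full-week rule matters: a one-day test would see only one weekday's behavior.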

import numpy as np
from scipy import stats

def sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided test of two proportions (mde is an absolute lift)."""
    p1, p2 = baseline, baseline + mde
    pooled = (p1 + p2) / 2
    za = stats.norm.ppf(1 - alpha / 2)  # two-tailed critical value
    zb = stats.norm.ppf(power)          # quantile for the target power
    n  = (za * np.sqrt(2 * pooled * (1 - pooled)) +
          zb * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2 / (p2 - p1)**2
    return int(np.ceil(n))

n = sample_size(baseline=0.12, mde=0.02)
print(f"Required n per variant: {n:,}")

# Simulate one experiment at exactly the planned sample size
np.random.seed(42)
ctrl = np.random.binomial(1, 0.12, n)
trt  = np.random.binomial(1, 0.14, n)  # +2 percentage points (absolute lift)
# t-test on 0/1 outcomes; at this n it closely approximates a two-proportion z-test
t, p = stats.ttest_ind(ctrl, trt)
print(f"CTR ctrl={ctrl.mean():.3f}, trt={trt.mean():.3f}")
print(f"p-value: {p:.4f} -> {'Significant' if p < 0.05 else 'Not significant'}")
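To see why peeking is dangerous, the following sketch simulates A/A tests (both arms share the same true rate) and compares two analysis strategies: testing once at the end, and checking the p-value every day and stopping at the first p < α. The daily batch size, number of days, and number of simulations are arbitrary illustrative choices, and `simulate_aa` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def simulate_aa(rate=0.12, daily_n=500, days=14, alpha=0.05, n_sims=2000, seed=0):
    """Return (false positive rate with one final test, with daily peeking)."""
    rng = np.random.default_rng(seed)
    # Cumulative conversions per day for two arms with the SAME true rate
    a = np.cumsum(rng.binomial(daily_n, rate, (n_sims, days)), axis=1)
    b = np.cumsum(rng.binomial(daily_n, rate, (n_sims, days)), axis=1)
    n_seen = daily_n * np.arange(1, days + 1)      # users per arm after each day
    # Pooled two-proportion z-test at every daily look
    pooled = (a + b) / (2 * n_seen)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_seen)
    z = (b - a) / n_seen / se
    pvals = 2 * stats.norm.sf(np.abs(z))
    fixed = np.mean(pvals[:, -1] < alpha)          # analyze once, at the end
    peeking = np.mean(pvals.min(axis=1) < alpha)   # stop at the first p < alpha
    return fixed, peeking

fixed_rate, peek_rate = simulate_aa()
print(f"False positive rate, single final test: {fixed_rate:.3f}")
print(f"False positive rate, daily peeking:     {peek_rate:.3f}")
```

Because there is no true difference between the arms, every significant result here is a false positive. The single final test should land near the nominal 5%, while the peeking rate is typically several times higher.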

Fraud Detection — ML System Design

An 8-step ML design framework applied to fraud detection, summarized here as formulate → features → model → eval → serve → monitor. Each stage involves distinct tradeoffs in a production system.

Binary classification, 0.1% fraud rate
AUC-PR, not AUC-ROC, for imbalance
<100 ms online inference SLA
Feature store for real-time features
Monitor for concept drift
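The preference for AUC-PR over AUC-ROC under heavy class imbalance can be illustrated with synthetic scores. In this sketch only the 0.1% prevalence comes from the text; the score model (unit-variance Gaussians two standard deviations apart) and the sample size are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n, fraud_rate = 200_000, 0.001            # 0.1% positives, as in the text
y = rng.binomial(1, fraud_rate, n)
# Assumed score model: fraud scores are shifted up by 2 standard deviations
scores = rng.normal(loc=2.0 * y, scale=1.0)

roc = roc_auc_score(y, scores)
ap = average_precision_score(y, scores)
print(f"Positives: {y.sum()} of {n:,}")
print(f"AUC-ROC: {roc:.3f}")
print(f"AUC-PR (average precision): {ap:.3f}")
print(f"AUC-PR of a random ranker is roughly the prevalence: {y.mean():.4f}")
```

The same scores can look strong on AUC-ROC and weak on AUC-PR, because with so few positives even a small false positive rate among the negatives swamps the flagged set. AUC-PR reflects that directly, which is why it is usually the more informative number when positives are rare.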

Interactive widget: Fraud ML System Design & Live Scoring Simulator

Panels: 8-step ML design framework (one step explored at a time); decision threshold and business cost tradeoff; feature importance (GBM model); live transaction scoring; system monitoring signals.

Inputs (defaults): fraud score threshold 0.50, FP cost to FN cost ratio 1:10, transaction amount $250, velocity 1 transaction in the last hour, merchant risk score 0.20, international no, hour of day 14:00.

Outputs: fraud score, decision, latency SLA, daily FP cost. Moving the threshold trades precision against recall.
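The threshold and cost-ratio controls correspond to a simple expected-cost calculation. The sketch below assumes a false negative costs 10 times as much as a false positive (the 1:10 ratio shown above) and uses synthetic, perfectly calibrated scores; `total_cost` and `best_threshold` are hypothetical helper names. For calibrated probabilities, flagging is worthwhile when p · c_FN > (1 − p) · c_FP, so the cost-minimizing threshold is c_FP / (c_FP + c_FN), about 0.09 here and well below the default 0.50.

```python
import numpy as np

def total_cost(y_true, scores, threshold, c_fp=1.0, c_fn=10.0):
    """Total cost of blocking every transaction with score >= threshold."""
    flagged = scores >= threshold
    fp = np.sum(flagged & (y_true == 0))    # legitimate transactions blocked
    fn = np.sum(~flagged & (y_true == 1))   # fraud let through
    return c_fp * fp + c_fn * fn

def best_threshold(y_true, scores, c_fp=1.0, c_fn=10.0):
    """Grid-search the threshold with the lowest total cost."""
    grid = np.linspace(0.01, 0.99, 99)
    costs = np.array([total_cost(y_true, scores, t, c_fp, c_fn) for t in grid])
    return grid[np.argmin(costs)], costs.min()

rng = np.random.default_rng(0)
scores = rng.beta(1, 20, 50_000)   # assumed score distribution, mostly low risk
y_true = rng.binomial(1, scores)   # labels drawn so that the scores are calibrated

thr, cost = best_threshold(y_true, scores)
print(f"Theoretical optimum: {1 / (1 + 10):.3f}")
print(f"Empirical optimum:   {thr:.2f} (cost {cost:,.0f})")
print(f"Cost at 0.50:        {total_cost(y_true, scores, 0.50):,.0f}")
```

In practice the scores of a real model are rarely perfectly calibrated, so the empirical search on held-out data is the safer guide; the closed-form threshold is a useful sanity check.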

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic transactions
np.random.seed(42)
n = 10000
df = pd.DataFrame({
    'amount':           np.random.exponential(100, n),
    'velocity_1h':      np.random.poisson(2, n),
    'is_international': np.random.binomial(1, 0.1, n),
    'merchant_risk':    np.random.uniform(0, 1, n),
    'hour_of_day':      np.random.randint(0, 24, n),
})
# Fraud probability is a logistic function of the features.
# hour_of_day is not part of it, so it acts as a pure noise feature.
fraud_p = 1/(1+np.exp(-(-5 + 0.02*df['amount'] + 0.5*df['velocity_1h']
    + 2*df['is_international'] + df['merchant_risk'])))
df['fraud'] = np.random.binomial(1, fraud_p)
# Note: these coefficients give a fraud rate far above the 0.1% quoted earlier
print(f"Fraud rate: {df['fraud'].mean():.2%}")

X, y = df.drop('fraud', axis=1), df['fraud']
# Stratify to keep the class ratio in both splits; fix the seed for reproducibility
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
clf.fit(Xtr, ytr)
yp = clf.predict_proba(Xte)[:, 1]  # predicted probability of the fraud class
print(f"AUC-PR: {average_precision_score(yte, yp):.4f}")
feat_imp = pd.Series(clf.feature_importances_, index=X.columns)
print("Top features:\n", feat_imp.sort_values(ascending=False).head(3))
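The "monitor for concept drift" item is often approximated in practice by tracking shifts in feature or score distributions. One common heuristic is the population stability index (PSI), sketched below. PSI detects changes in input distributions (data drift), which is a proxy for, not the same thing as, a change in the relationship between features and fraud. The function name `psi`, the 10 quantile bins, and the simulated shift are illustrative assumptions.

```python
import numpy as np

def psi(reference, current, bins=10, eps=1e-6):
    """Population stability index between a reference and a current sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from reference quantiles; the outer bins are open-ended
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1)[1:-1])
    ref_idx = np.searchsorted(edges, reference, side="right")
    cur_idx = np.searchsorted(edges, current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=bins) / len(current)
    ref_frac = np.clip(ref_frac, eps, None)   # avoid log(0) and division by zero
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train_amount = rng.exponential(100, 10_000)    # reference: training-time amounts
stable_amount = rng.exponential(100, 10_000)   # new traffic, same distribution
shifted_amount = rng.exponential(150, 10_000)  # new traffic, larger amounts

print(f"PSI (stable):  {psi(train_amount, stable_amount):.4f}")
print(f"PSI (shifted): {psi(train_amount, shifted_amount):.4f}")
```

A PSI near zero means the new traffic falls into the reference bins in roughly the same proportions. Detecting true concept drift additionally requires labeled outcomes, for example tracking AUC-PR on recently confirmed fraud cases.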
