01
Neural Networks & Backpropagation
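Neural networks learn in a loop: a forward pass computes the loss, backpropagation applies the chain rule to get the gradient of that loss with respect to every weight, and an optimizer uses those gradients to update the weights. Training is this loop repeated thousands of times.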
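To make the forward/backward loop concrete, here is a minimal sketch for a single sigmoid neuron in plain Python. The neuron, the binary cross-entropy loss, and the starting values (`w`, `b`, `x`, `y`, `lr`) are illustrative assumptions rather than anything taken from the widget; the point is only to show each chain-rule factor explicitly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x, y):
    """Forward pass: return the prediction and the binary cross-entropy loss."""
    z = w * x + b
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return p, loss

def backward(w, b, x, y):
    """Backward pass via the chain rule: dL/dw = dL/dp * dp/dz * dz/dw."""
    p, _ = forward(w, b, x, y)
    dL_dp = -(y / p) + (1 - y) / (1 - p)   # derivative of the loss w.r.t. the prediction
    dp_dz = p * (1 - p)                    # derivative of the sigmoid
    dL_dz = dL_dp * dp_dz                  # algebraically equal to p - y
    return dL_dz * x, dL_dz                # (dL/dw, dL/db)

# Hypothetical starting point: one weight, one bias, one training example.
w, b, x, y, lr = 0.5, 0.0, 2.0, 1.0, 0.1
for step in range(3):
    p, loss = forward(w, b, x, y)
    dw, db = backward(w, b, x, y)
    w, b = w - lr * dw, b - lr * db        # gradient-descent update
    print(f"step {step}: loss={loss:.4f}")
```

Frameworks such as PyTorch automate the `backward` step for arbitrary networks, but the quantity they compute is the same product of local derivatives.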
[Interactive widget: Network Architecture & Gradient Flow. Controls: hidden layers (default 2), neurons per layer (default 8), activation function. Panels: network architecture (node brightness = activation strength, edge color = gradient magnitude); activation function curve; gradient magnitude by layer (vanishing gradient demo). Readouts: layers, parameters, min gradient, vanishing? Adjust layers and activation to see gradient flow change.]
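The vanishing-gradient effect the widget demonstrates can be approximated with a toy calculation. The sketch below is not the widget's own computation: it assumes a chain of one-unit layers with a fixed weight `w` (a hypothetical simplification) and multiplies the local derivatives together, as the chain rule prescribes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_chain(depth, activation, w=1.0, x=0.5):
    """Return |d(output)/d(input)| for a chain of `depth` one-unit layers.

    Each layer computes a = act(w * a_prev), so by the chain rule the
    gradient is the product of w * act'(z) over all layers.
    """
    a, grad = x, 1.0
    for _ in range(depth):
        z = w * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)          # sigmoid'(z), never larger than 0.25
        elif activation == "relu":
            a = max(0.0, z)
            local = 1.0 if z > 0 else 0.0  # ReLU passes the gradient unchanged when active
        else:
            raise ValueError(f"unknown activation: {activation}")
        grad *= w * local
    return abs(grad)

for depth in (2, 5, 10):
    s = gradient_through_chain(depth, "sigmoid")
    r = gradient_through_chain(depth, "relu")
    print(f"depth={depth:2d}  sigmoid={s:.2e}  relu={r:.2e}")
```

Because the sigmoid derivative is at most 0.25, the product shrinks at least geometrically with depth in this setup, while an active ReLU unit contributes a factor of 1.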
import torch
import torch.nn as nn
import torch.optim as optim

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.BatchNorm1d(64),
            nn.Linear(64, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

X = torch.randn(32, 10)                    # batch of 32 samples, 10 features each
y = torch.randint(0, 2, (32, 1)).float()   # random binary labels

optimizer.zero_grad()                      # clear gradients left over from any earlier step
loss = nn.BCELoss()(model(X), y)           # forward pass and loss
loss.backward()                            # compute gradients (backprop)
optimizer.step()                           # update weights
print(f"Loss: {loss.item():.4f}")
02
CNNs — Convolutional Networks
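CNNs exploit spatial locality by sliding small filters with shared weights across the input. Weight sharing needs far fewer parameters than a fully connected layer over the same input, and stacking convolutional layers lets the network learn hierarchical image features.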
[Interactive widget: CNN Layer Calculator & Feature Map Visualizer. Controls: input size W×W (default 32), filter size F×F (default 3), stride (default 1), padding (default 1), conv layers (default 2), channels out (default 32). Panels: CNN layer flow (spatial dimensions shrink, channels grow); simulated feature maps (8 filters, layer 1); parameter count, FC vs CNN. Readouts: output size, CNN params, FC params, reduction. Adjust filter size and stride to see how spatial dimensions shrink.]
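The quantities the calculator reports follow from standard formulas, sketched below in plain Python. The function names are hypothetical, and the input channel count `c_in = 3` (an RGB image) is an assumption, since the widget does not state it; the other values are the widget's defaults.

```python
def conv_output_size(w, f, stride=1, padding=0):
    """Spatial output size of a square convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * padding) // stride + 1

def conv_params(c_in, c_out, f):
    """One F x F x C_in filter plus one bias for each output channel."""
    return (f * f * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output."""
    return (n_in + 1) * n_out

# Widget defaults; c_in = 3 is an assumed RGB input.
w, f, stride, padding = 32, 3, 1, 1
c_in, c_out = 3, 32

out = conv_output_size(w, f, stride, padding)
cnn = conv_params(c_in, c_out, f)
# A dense layer mapping the same input volume to the same output volume.
fc = fc_params(w * w * c_in, out * out * c_out)

print(f"output size:  {out}x{out}")
print(f"conv params:  {cnn:,}")
print(f"dense params: {fc:,}")
print(f"reduction:    {fc / cnn:,.0f}x")
```

The convolution's parameter count does not depend on the input size at all, which is where the savings over a dense layer come from.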
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),   # 3->32 channels
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                  # halve spatial dims
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.clf = nn.Linear(64*8*8, num_classes)

    def forward(self, x):
        x = self.features(x)                  # (B, 64, 8, 8)
        return self.clf(x.flatten(1))

model = SimpleCNN()
x = torch.randn(4, 3, 32, 32)                 # batch of 4 RGB images
print("Output shape:", model(x).shape)        # (4, 10)
03
Regularization & Optimizers
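Regularization reduces overfitting by constraining model complexity, here through dropout, weight decay, and early stopping. Adaptive optimizers such as AdamW often converge faster than plain SGD because they scale the step size separately for each parameter.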
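As a rough illustration of what "per-parameter" means, the sketch below writes out one AdamW update for a single scalar weight in plain Python. It follows the update as it is commonly described (decoupled weight decay, then a bias-corrected Adam step); the function name `adamw_step` and the demo values are assumptions for illustration, not a library API.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-4):
    """One AdamW update for a single scalar parameter; returns (theta, m, v)."""
    # Decoupled weight decay: shrink the weight directly, independent of the gradient.
    theta = theta * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (t counts from 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Dividing by sqrt(v_hat) normalises the step by the gradient's own scale.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Two hypothetical parameters whose gradients differ by a factor of 1000.
for g in (0.001, 1.0):
    theta, m, v = adamw_step(1.0, g, 0.0, 0.0, t=1, lr=0.01, weight_decay=0.0)
    print(f"grad={g}: step taken={1.0 - theta:.5f}")
```

On the first step both parameters move by roughly `lr` despite the very different gradient magnitudes, which is the normalising behaviour that plain SGD lacks.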
[Interactive widget: Optimizer Comparison & Regularization Explorer. Controls: learning rate (default 0.010), dropout rate (default 0.30), weight decay, L2 (default 0.0001). Panels: training vs validation loss curves; LR schedule (cosine annealing); dropout effect on weight distribution. Readouts: best val loss, convergence epoch, overfit gap, early stop epoch. Switch optimizers to compare convergence speed and stability.]
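The cosine-annealing schedule named in the widget has a simple closed form when there are no warm restarts. The sketch below evaluates it in plain Python; the function name and the default values for `base_lr`, `t_max`, and `eta_min` are illustrative assumptions.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=50, eta_min=0.0):
    """Learning rate after `epoch` steps of cosine annealing (0 <= epoch <= t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 10, 25, 40, 50):
    print(f"epoch {epoch:2d}: lr = {cosine_annealing_lr(epoch):.6f}")
```

The rate starts at `base_lr`, passes through the midpoint between `base_lr` and `eta_min` at `t_max / 2`, and reaches `eta_min` at `t_max`, changing slowly near both ends and fastest in the middle.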
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 1)
)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_val, patience, counter = float('inf'), 5, 0
for epoch in range(30):
    # Simulated validation loss. A real loop would train for one epoch here
    # (calling optimizer.step() before scheduler.step()) and then evaluate
    # on held-out data.
    val_loss = 1.0 - epoch*0.02 + 0.05*(epoch//10)
    scheduler.step()                 # advance the cosine learning-rate schedule
    if val_loss < best_val:
        best_val = val_loss          # new best: reset the patience counter
        counter = 0
    else:
        counter += 1
        if counter >= patience:      # no improvement for `patience` epochs
            print(f"Early stop at epoch {epoch}")
            break
print(f"Best val loss: {best_val:.3f}")
