05 — Deep Learning

Networks That Learn

Three interactive playgrounds for neural networks & backpropagation, CNNs, and regularization with optimizers. Watch gradients flow, filters activate, and loss curves converge.

01
Neural Networks & Backpropagation
A neural network learns by running a forward pass to compute the loss, backpropagating gradients through the layers with the chain rule, and using those gradients to update the weights. Training is this loop repeated over thousands of batches.
z = Wx + b → a = activation(z) · ReLU: max(0, x) · Vanishing gradient → ReLU + residual connections · He init for ReLU · Xavier init for sigmoid/tanh
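As a rough illustration of the forward pass, chain rule, and weight update summarized above, the sketch below works through a single sigmoid layer by hand and checks the result against autograd. The tensor shapes, the binary cross-entropy loss, and the learning rate are assumptions chosen for the example.

```python
import torch

torch.manual_seed(0)

# Hypothetical tiny problem: 4 samples, 3 features, 1 sigmoid output unit.
x = torch.randn(4, 3)
y = torch.randint(0, 2, (4, 1)).float()
W = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Forward pass: z = xW + b, a = sigmoid(z), mean binary cross-entropy loss.
z = x @ W + b
a = torch.sigmoid(z)
loss = -(y * torch.log(a) + (1 - y) * torch.log(1 - a)).mean()

# Backward pass by hand with the chain rule.
# For sigmoid + BCE, dL/dz simplifies to (a - y) / N.
dz = (a - y).detach() / x.shape[0]
manual_dW = x.T @ dz        # dL/dW = x^T @ dL/dz
manual_db = dz.sum(dim=0)   # dL/db = sum of dL/dz over the batch

# Backward pass with autograd for comparison.
loss.backward()
grads_match = (torch.allclose(manual_dW, W.grad, atol=1e-5)
               and torch.allclose(manual_db, b.grad, atol=1e-5))
print("manual gradients match autograd:", grads_match)

# One gradient-descent update: w <- w - lr * grad.
lr = 0.1  # assumed learning rate
with torch.no_grad():
    W -= lr * W.grad
    b -= lr * b.grad
```

Deeper networks apply the same rule layer by layer, multiplying each layer's local derivative into the gradient arriving from the layer above.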
Interactive Widget: Network Architecture & Gradient Flow
Controls: hidden layers (2), neurons per layer (8), activation function.
Panels: network architecture (node brightness = activation strength, edge color = gradient magnitude); activation function curve; gradient magnitude by layer (vanishing gradient demo).
Readouts: layers, parameters, min gradient, vanishing?
Adjust layers and activation to see gradient flow change.
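The vanishing-gradient effect the widget plots can also be measured in code. The sketch below is one possible setup, not the widget's implementation: it builds a deep MLP, runs one backward pass, and reports the gradient norm at the first layer for sigmoid with Xavier init and for ReLU with He init. The depth, width, batch size, and squared-output loss are assumptions, and `first_layer_grad_norm` is a hypothetical helper.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation, init_fn, depth=20, width=64, seed=0):
    """Build a deep MLP, run one backward pass, return the first layer's gradient norm."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        init_fn(linear.weight)          # apply the chosen weight initialization
        nn.init.zeros_(linear.bias)
        layers += [linear, activation()]
    net = nn.Sequential(*layers, nn.Linear(width, 1))

    x = torch.randn(128, width)
    net(x).pow(2).mean().backward()     # arbitrary scalar loss, just to get gradients
    return net[0].weight.grad.norm().item()

# Sigmoid with Xavier init: each layer scales the gradient by at most 0.25.
sigmoid_norm = first_layer_grad_norm(nn.Sigmoid, nn.init.xavier_uniform_)

# ReLU with He (Kaiming) init: the gradient scale is roughly preserved.
relu_norm = first_layer_grad_norm(
    nn.ReLU, lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"))

print(f"first-layer grad norm, sigmoid + Xavier: {sigmoid_norm:.3e}")
print(f"first-layer grad norm, ReLU + He:        {relu_norm:.3e}")
```

Raising `depth` should shrink the sigmoid figure further, which is the pattern the gradient-by-layer panel visualizes.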
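import torch
import torch.nn as nn
import torch.optim as optim

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64),
            nn.BatchNorm1d(64),   # normalize pre-activations before the nonlinearity
            nn.ReLU(),
            nn.Dropout(0.3),      # dropout after the activation, so BatchNorm sees undropped inputs
            nn.Linear(64, 1), nn.Sigmoid()   # probability for binary classification
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.BCELoss()

X = torch.randn(32, 10)                    # batch of 32 samples, 10 features each
y = torch.randint(0, 2, (32, 1)).float()   # random binary labels

model.train()                 # enable dropout and batch statistics in BatchNorm
optimizer.zero_grad()         # clear gradients left over from any earlier step
loss = loss_fn(model(X), y)   # forward pass
loss.backward()               # compute gradients (backprop)
optimizer.step()              # update weights
print(f"Loss: {loss.item():.4f}")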
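The key points for this section pair ReLU with residual connections as a remedy for vanishing gradients. A minimal sketch of the residual idea follows. `ResidualBlock` is a hypothetical module, and zero-initializing the last layer of each block is an assumption made here so that the identity path is easy to verify.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Hypothetical two-layer block computing x + f(x)."""
    def __init__(self, width):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width),
        )
        # Assumption: zero-init the last layer so the block starts as the identity.
        nn.init.zeros_(self.f[-1].weight)
        nn.init.zeros_(self.f[-1].bias)

    def forward(self, x):
        # Skip connection: d(out)/dx = I + df/dx, so the gradient always has
        # a direct path back to earlier layers.
        return x + self.f(x)

torch.manual_seed(0)
net = nn.Sequential(*[ResidualBlock(16) for _ in range(30)])
x = torch.randn(8, 16, requires_grad=True)
out = net(x)
out.sum().backward()

starts_as_identity = torch.equal(out, x)   # f(x) is zero at initialization
input_grad_norm = x.grad.norm().item()     # gradient reaching the input through 30 blocks
print("identity at init:", starts_as_identity)
print(f"input gradient norm: {input_grad_norm:.2f}")
```

Because every block adds its input back to its output, the gradient reaches the input through 30 blocks without being scaled down, whatever the inner layers contribute.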
02
CNNs — Convolutional Networks
CNNs exploit spatial locality by sliding small filters with shared weights across the image. A convolutional layer therefore needs far fewer parameters than a fully connected layer over the same input, and stacked layers learn hierarchical image features.
Shared weights → fewer params · Output: ⌊(W−F+2P)/S⌋+1 · MaxPool: reduce spatial dims · Transfer: freeze early layers · Hierarchical feature learning
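The output-size formula above is easy to check numerically. The helper below is a small sketch (`conv_output_size` is a hypothetical name) that applies ⌊(W−F+2P)/S⌋+1 to a few settings, treating max pooling as a filter of size 2 with stride 2.

```python
def conv_output_size(w, f, p=0, s=1):
    """Spatial output size of a convolution or pooling layer.

    w: input width, f: filter size, p: padding, s: stride.
    """
    return (w - f + 2 * p) // s + 1

# 3x3 filter, padding 1, stride 1 keeps the size unchanged.
same = conv_output_size(32, f=3, p=1, s=1)

# 2x2 max pooling with stride 2 halves the size.
pooled = conv_output_size(32, f=2, p=0, s=2)

# 5x5 filter without padding trims two pixels from each side.
valid = conv_output_size(32, f=5, p=0, s=1)

# Chain of conv(3, pad 1) -> pool(2) -> conv(3, pad 1) -> pool(2) on a 32x32 input.
size = 32
for f, p, s in [(3, 1, 1), (2, 0, 2), (3, 1, 1), (2, 0, 2)]:
    size = conv_output_size(size, f, p, s)

print(same, pooled, valid, size)
```

The chained case traces a 32×32 input through two conv-plus-pool stages and ends at 8×8.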
Interactive Widget: CNN Layer Calculator & Feature Map Visualizer
Controls: input size W×W (32), filter size F×F (3), stride (1), padding (1), conv layers (2), channels out (32).
Panels: CNN layer flow (spatial dimensions shrink, channels grow); simulated feature maps (8 filters, layer 1); parameter count, FC vs CNN.
Readouts: output size, CNN params, FC params, reduction.
Adjust filter size and stride to see how spatial dimensions shrink.
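The FC-versus-CNN parameter comparison can be reproduced with two short formulas. The sketch below uses the widget's settings (32×32 input, 3×3 filter, 32 output channels) and assumes 3 input channels. How the widget defines its fully connected baseline is not stated, so the one used here, a dense layer mapping the flattened input to a flattened output of the same size, is an assumption.

```python
def conv_params(c_in, c_out, f):
    """Weights plus biases of one f x f convolution.

    Each output channel owns a single f*f*c_in filter that is reused
    at every spatial position, plus one bias.
    """
    return (f * f * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return (n_in + 1) * n_out

w, c_in, c_out, f = 32, 3, 32, 3   # c_in = 3 (RGB) is an assumption

cnn = conv_params(c_in, c_out, f)
# Assumed FC baseline: same input and output sizes as the conv layer
# with padding 1 and stride 1 (output stays 32 x 32 x 32).
fc = fc_params(w * w * c_in, w * w * c_out)

print(f"CNN params: {cnn:,}")
print(f"FC params:  {fc:,}")
print(f"Reduction:  {fc / cnn:,.0f}x")
```

The convolution's count does not depend on the image size at all, which is where the savings from weight sharing come from.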
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),   # 3 -> 32 channels, 32x32 preserved (F=3, P=1, S=1)
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                  # halve spatial dims: 32x32 -> 16x16
            nn.Conv2d(32, 64, 3, padding=1),  # 32 -> 64 channels, 16x16 preserved
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                  # 16x16 -> 8x8
        )
        # The classifier size assumes 32x32 inputs, which leave a 64 x 8 x 8 feature map
        self.clf = nn.Linear(64*8*8, num_classes)

    def forward(self, x):
        x = self.features(x)            # (B, 64, 8, 8)
        return self.clf(x.flatten(1))   # flatten to (B, 4096), then classify

model = SimpleCNN()
x = torch.randn(4, 3, 32, 32)  # batch of 4 RGB images
print("Output shape:", model(x).shape)  # torch.Size([4, 10])
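The "Transfer: freeze early layers" key point can be sketched as follows. The `backbone` here is a small randomly initialized stand-in, since no pretrained model is named in this section; in practice it would be loaded with trained weights. The 5-class head and the learning rate are likewise assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained feature extractor.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                    # (B, 32)
)
head = nn.Linear(32, 5)              # new classifier for an assumed 5-class task
model = nn.Sequential(backbone, head)

# Freeze the early layers: no gradients are computed or applied for them.
for param in backbone.parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, 5, (4,))
weights_before = backbone[0].weight.clone()

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

backbone_unchanged = torch.equal(weights_before, backbone[0].weight)
n_trainable = sum(p.numel() for p in trainable)
print("backbone unchanged:", backbone_unchanged)
print("trainable parameters:", n_trainable)
```

Only the head's 32 × 5 weights and 5 biases are updated, which is why fine-tuning this way is cheap. If the frozen layers contain BatchNorm, they are usually also put in eval mode so their running statistics stay fixed.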
03
Regularization & Optimizers
Regularization reduces overfitting by constraining model complexity; adaptive optimizers like AdamW scale the step size per parameter, which often speeds up convergence compared with plain SGD.
Dropout: zero each activation with probability p · L2: penalizes large weights · BatchNorm: normalize layer inputs · AdamW: momentum + RMSProp-style scaling + decoupled weight decay · Early stopping: monitor val loss
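A quick way to see what dropout does is to pass a tensor of ones through `nn.Dropout` in both modes. The sketch below assumes p = 0.3 and an input of 10,000 ones, chosen so the zeroed fraction is easy to read off.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

p = 0.3
drop = nn.Dropout(p)
x = torch.ones(10_000)

# Training mode: each element is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
drop.train()
train_out = drop(x)
zero_fraction = (train_out == 0).float().mean().item()
kept_value = train_out.max().item()

# Evaluation mode: dropout is a no-op.
drop.eval()
eval_out = drop(x)

print(f"fraction zeroed in train mode: {zero_fraction:.3f}")
print(f"value of kept activations:     {kept_value:.4f}")
print("eval output equals input:", torch.equal(eval_out, x))
```

Forgetting to call `model.eval()` before validation leaves dropout active and makes the validation loss noisier than it should be.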
Interactive Widget: Optimizer Comparison & Regularization Explorer
Controls: learning rate (0.010), dropout rate (0.30), weight decay, L2 (0.0001).
Panels: training vs validation loss curves; LR schedule (cosine annealing); dropout effect on weight distribution.
Readouts: best val loss, convergence epoch, overfit gap, early stop epoch.
Switch optimizers to compare convergence speed and stability.
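To make the "momentum + RMSProp + weight decay" description of AdamW concrete, the sketch below reproduces a single AdamW step by hand and compares it with `torch.optim.AdamW`. The toy parameter vector, the quadratic loss, and the hyperparameter values are assumptions for illustration.

```python
import torch

lr, beta1, beta2, eps, wd = 0.1, 0.9, 0.999, 1e-8, 0.01

w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
optimizer = torch.optim.AdamW([w], lr=lr, betas=(beta1, beta2),
                              eps=eps, weight_decay=wd)

loss = (w ** 2).sum()
loss.backward()
g = w.grad.clone()           # gradient used for the step
w0 = w.detach().clone()      # parameter value before the step
optimizer.step()

# The same step by hand (first step, so both moment buffers start at zero).
m = (1 - beta1) * g                      # momentum: running mean of gradients
v = (1 - beta2) * g * g                  # RMSProp part: running mean of squared gradients
m_hat = m / (1 - beta1 ** 1)             # bias correction for step 1
v_hat = v / (1 - beta2 ** 1)
w_manual = w0 * (1 - lr * wd)            # decoupled weight decay shrinks the weights directly
w_manual = w_manual - lr * m_hat / (v_hat.sqrt() + eps)   # per-parameter scaled update

step_matches = torch.allclose(w_manual, w.detach(), atol=1e-6)
print("manual step matches torch.optim.AdamW:", step_matches)
```

Dividing by the root of `v_hat` is what gives each parameter its own effective learning rate. The weight decay is applied to the weights directly and never passes through that scaling, which is the difference from adding an L2 penalty to the loss.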
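import torch, torch.nn as nn, torch.optim as optim

torch.manual_seed(0)
# Synthetic regression data standing in for a real train/validation split
X_train, y_train = torch.randn(256, 20), torch.randn(256, 1)
X_val, y_val = torch.randn(64, 20), torch.randn(64, 1)

model = nn.Sequential(
    nn.Linear(20,128), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(128,64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64,1)
)
epochs = 50
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
loss_fn = nn.MSELoss()

best_val, patience, counter = float('inf'), 5, 0
for epoch in range(epochs):
    model.train()                      # dropout active during training
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()
    scheduler.step()                   # advance the LR schedule after the optimizer step

    model.eval()                       # dropout disabled for validation
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    if val_loss < best_val:            # improvement: remember it and reset the counter
        best_val, counter = val_loss, 0
    else:                              # no improvement for this epoch
        counter += 1
        if counter >= patience:
            print(f"Early stop at epoch {epoch}")
            break
print(f"Best val loss: {best_val:.3f}")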
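The cosine annealing schedule used above has a simple closed form, η_t = η_min + ½(η_max − η_min)(1 + cos(πt / T_max)), which is the formula given in the `CosineAnnealingLR` documentation. The standalone sketch below evaluates it with the base learning rate 1e-3 and T_max = 50 from the code above; `cosine_lr` is a hypothetical helper, and η_min = 0 is assumed as the scheduler's default.

```python
import math

def cosine_lr(step, base_lr, t_max, eta_min=0.0):
    """Learning rate after `step` scheduler steps of cosine annealing."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

base_lr, t_max = 1e-3, 50
schedule = [cosine_lr(t, base_lr, t_max) for t in range(t_max + 1)]

# The rate starts at base_lr, passes half of it at the midpoint,
# and reaches eta_min after t_max steps.
for t in (0, 10, 25, 40, 50):
    print(f"step {t:2d}: lr = {schedule[t]:.6f}")
```

The rate changes slowly at the start and end of the schedule and fastest around the midpoint.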