Deep Learning — Interactive Guide

01

Neural Networks & Backpropagation

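Neural networks learn in a loop: a forward pass computes the loss, backpropagation applies the chain rule to get the gradient of that loss with respect to every weight, and the optimizer uses those gradients to update the weights. Repeating this loop thousands of times is training.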
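The forward/backward/update loop can be traced by hand for a single neuron. The following is a minimal sketch in plain Python, assuming one ReLU neuron, a squared-error loss, and made-up values for the input, target, initial parameters, and learning rate (none of these come from the guide). It is only meant to show the chain rule producing the weight update.

```python
# One ReLU neuron trained with hand-written backprop (illustrative values only).

def relu(z):
    return max(0.0, z)

def train_step(w, b, x, y, lr):
    # Forward pass: z = w*x + b, a = ReLU(z), loss = (a - y)^2
    z = w * x + b
    a = relu(z)
    loss = (a - y) ** 2

    # Backward pass via the chain rule: dL/dw = dL/da * da/dz * dz/dw
    dL_da = 2.0 * (a - y)
    da_dz = 1.0 if z > 0 else 0.0   # derivative of ReLU
    dL_dw = dL_da * da_dz * x       # dz/dw = x
    dL_db = dL_da * da_dz           # dz/db = 1

    # Gradient descent update
    return w - lr * dL_dw, b - lr * dL_db, loss

w, b = 0.5, 0.1   # hypothetical initial parameters
x, y = 2.0, 3.0   # hypothetical input and target
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x, y, lr=0.05)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.3e}")
```

A framework such as PyTorch automates the backward pass, but the quantities it computes are the same partial derivatives written out here.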

z = Wx + b → a = activation(z)
ReLU: max(0,x)
Vanishing gradient → ReLU + residual
He init for ReLU
Xavier init for sigmoid/tanh
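The key points on vanishing gradients and initialization can be made concrete with a little arithmetic. This is a simplified sketch, assuming the gradient reaching an early layer is just a product of identical per-layer activation derivatives (weights are ignored) and using the commonly cited normal-distribution forms of He and Xavier initialization. The function names are hypothetical.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)            # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0    # exactly 1 for active units

def chained_grad(local_grad, z, depth):
    # Product of `depth` identical local derivatives (a deliberate simplification).
    return local_grad(z) ** depth

for depth in (2, 5, 10):
    sig = chained_grad(sigmoid_grad, 0.0, depth)   # best case for sigmoid
    rel = chained_grad(relu_grad, 1.0, depth)      # an active ReLU path
    print(f"depth {depth:2d}: sigmoid {sig:.2e}  relu {rel:.1f}")

def he_std(fan_in):
    # He (Kaiming) normal init, usually paired with ReLU
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    # Xavier (Glorot) normal init, usually paired with sigmoid/tanh
    return math.sqrt(2.0 / (fan_in + fan_out))

print(f"He std (fan_in=64): {he_std(64):.4f}")
print(f"Xavier std (64 -> 64): {xavier_std(64, 64):.4f}")
```

Even in the sigmoid's best case the product shrinks by a factor of four per layer, which is why deep sigmoid stacks train slowly without ReLU or residual connections.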

[Interactive widget: Network Architecture & Gradient Flow]
Controls: hidden layers (shown at 2), neurons per layer (shown at 8), activation function.
Panels: network architecture (node brightness = activation strength, edge color = gradient magnitude); activation function curve; gradient magnitude by layer (vanishing gradient demo).
Readouts: layers, parameters, min gradient, vanishing?
Adjust layers and activation to see gradient flow change.

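import torch
import torch.nn as nn
import torch.optim as optim

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64),
            nn.BatchNorm1d(64),   # normalize before the nonlinearity
            nn.ReLU(),
            nn.Dropout(0.3),      # after BatchNorm, so batch statistics are not computed on dropped units
            nn.Linear(64, 1), nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.BCELoss()
X = torch.randn(32, 10)                    # 32 samples, 10 features
y = torch.randint(0, 2, (32, 1)).float()   # binary targets

model.train()                   # enable Dropout and BatchNorm batch statistics
optimizer.zero_grad()           # clear gradients left over from any previous step
loss = criterion(model(X), y)   # forward pass
loss.backward()                 # compute gradients (backprop)
optimizer.step()                # update weights
print(f"Loss: {loss.item():.4f}")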

02

CNNs — Convolutional Networks

CNNs exploit spatial locality by sliding small filters with shared weights across the image. This needs dramatically fewer parameters than fully connected layers, and stacking such layers learns hierarchical image features.
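The parameter savings from weight sharing can be checked with simple counting. The sketch below assumes a 32×32 RGB input mapped to 32 output channels with 3×3 filters, and compares it against a hypothetical fully connected layer that produces an output volume of the same size. That baseline is an assumption for illustration, not necessarily how the guide's widget defines "FC params".

```python
def conv_params(c_in, c_out, f):
    # Each of the c_out filters has f*f*c_in weights plus one bias,
    # and is reused at every spatial position.
    return (f * f * c_in + 1) * c_out

def fc_params(n_in, n_out):
    # Every output unit has its own weight per input, plus one bias.
    return (n_in + 1) * n_out

w, c_in, c_out, f = 32, 3, 32, 3    # illustrative layer sizes
conv = conv_params(c_in, c_out, f)
fc = fc_params(w * w * c_in, w * w * c_out)

print(f"conv params: {conv:,}")
print(f"fc params:   {fc:,}")
print(f"reduction:   {fc / conv:,.0f}x")
```

The convolution's count does not depend on the image size at all, while the dense layer's count grows with the square of the pixel count.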

Shared weights → fewer params
Output: ⌊(W−F+2P)/S⌋+1
MaxPool: reduce spatial dims
Transfer: freeze early layers
Hierarchical feature learning
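The output-size formula in the key points translates directly into integer arithmetic. As a small sketch, the helper below (a hypothetical name) applies ⌊(W−F+2P)/S⌋+1 to a 32×32 input passing through two 3×3 convolutions with padding 1, each followed by 2×2 max pooling with stride 2. Pooling is treated as a window of size F=2, S=2, P=0.

```python
def conv_output_size(w, f, p=0, s=1):
    # floor((W - F + 2P) / S) + 1, for square inputs and filters
    return (w - f + 2 * p) // s + 1

size = 32
trace = [size]
for _ in range(2):
    size = conv_output_size(size, f=3, p=1, s=1)   # 3x3 conv, padding 1: size preserved
    trace.append(size)
    size = conv_output_size(size, f=2, p=0, s=2)   # 2x2 max pool, stride 2: size halved
    trace.append(size)

print(" -> ".join(str(s) for s in trace))
```

With padding 1 and stride 1 a 3×3 filter leaves the spatial size unchanged, so all of the shrinking in this configuration comes from the pooling layers.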

[Interactive widget: CNN Layer Calculator & Feature Map Visualizer]
Controls: input size W×W (shown at 32), filter size F×F (3), stride (1), padding (1), conv layers (2), channels out (32).
Panels: CNN layer flow (spatial dimensions shrink, channels grow); simulated feature maps (8 filters, layer 1); parameter count, FC vs CNN.
Readouts: output size, CNN params, FC params, reduction.
Adjust filter size and stride to see how spatial dimensions shrink.

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),  # 3->32 channels
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                  # halve spatial dims
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 64*8*8 assumes 32x32 inputs: two 2x2 pools take 32 -> 16 -> 8
        self.clf = nn.Linear(64*8*8, num_classes)

    def forward(self, x):
        x = self.features(x)   # (B, 64, 8, 8)
        return self.clf(x.flatten(1))

model = SimpleCNN()
x = torch.randn(4, 3, 32, 32)  # batch of 4 RGB images
print("Output shape:", model(x).shape)  # (4, 10)

03

Regularization & Optimizers

Regularization reduces overfitting by constraining model complexity; adaptive optimizers like AdamW often converge faster by scaling the step size for each parameter individually.
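To show what adjusting the step per parameter means, here is a minimal scalar sketch of one AdamW update in plain Python. It follows the commonly documented form (decoupled weight decay, bias-corrected first and second moments); the function name and the example gradients are hypothetical, and a real optimizer applies this elementwise to tensors.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # Decoupled weight decay: shrink the weight directly, not via the gradient
    theta = theta - lr * weight_decay * theta
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step is normalized by the gradient's own running magnitude
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by four orders of magnitude
big, _, _ = adamw_step(1.0, 100.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.0)
small, _, _ = adamw_step(1.0, 0.01, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.0)
print(f"after one step: big-grad param {big:.6f}, small-grad param {small:.6f}")
```

Both parameters move by roughly the learning rate despite the very different gradient scales, which is the per-parameter adaptivity that plain SGD lacks.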

Dropout: zero each activation with probability p
L2: penalizes large weights
BatchNorm: normalize layer inputs
AdamW: momentum + RMSProp + decoupled weight decay
Early stopping: monitor val loss
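Dropout from the list above can be sketched in a few lines. This version assumes the common "inverted dropout" convention, where surviving activations are scaled by 1/(1−p) during training so that nothing needs rescaling at inference time. The function name and the list-based interface are illustrative, not a library API.

```python
import random

def dropout(activations, p, training=True, rng=random):
    # At inference time (or with p = 0) dropout is the identity.
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # Keep each unit with probability `keep`, scaling survivors by 1/keep
    # so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                  # fixed seed for repeatability
acts = [1.0] * 10000
dropped = dropout(acts, p=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"zeroed fraction: {zero_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

The zeroed fraction lands near p while the mean stays near its original value, which is why the same network can be used unchanged in evaluation mode.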

[Interactive widget: Optimizer Comparison & Regularization Explorer]
Controls: learning rate (shown at 0.010), dropout rate (0.30), weight decay (L2) (0.0001).
Panels: training vs validation loss curves; LR schedule (cosine annealing); dropout effect on weight distribution.
Readouts: best val loss, convergence epoch, overfit gap, early stop epoch.
Switch optimizers to compare convergence speed and stability.

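import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 1)
)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Early stopping: stop once val loss fails to improve for `patience` epochs in a row
best_val, patience, counter = float('inf'), 5, 0
for epoch in range(30):
    # A real loop would train here (forward, backward, optimizer.step())
    # before stepping the scheduler.
    # Simulated validation loss stands in for a real evaluation pass.
    val_loss = 1.0 - epoch * 0.02 + 0.05 * (epoch // 10)
    scheduler.step()          # advance the cosine LR schedule once per epoch
    if val_loss < best_val:
        best_val = val_loss
        counter = 0           # improvement: reset the patience counter
    else:
        counter += 1
        if counter >= patience:
            print(f"Early stop at epoch {epoch}")
            break
print(f"Best val loss: {best_val:.3f}")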
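The cosine annealing schedule used above has a simple closed form when there are no restarts. The sketch below evaluates it in plain Python for the same base learning rate (1e-3) and T_max (50) as the example; the function name is hypothetical, and eta_min = 0 is assumed.

```python
import math

def cosine_annealing_lr(step, t_max, base_lr, eta_min=0.0):
    # eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * step / t_max))
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

base_lr, t_max = 1e-3, 50
for epoch in (0, 10, 25, 30, 50):
    lr = cosine_annealing_lr(epoch, t_max, base_lr)
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

Because the example trains for only 30 epochs with T_max=50, the learning rate never reaches its minimum; setting T_max to the number of epochs is a common choice when the full decay is wanted.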

Networks That Learn