01
Transformer & Self-Attention
Transformers process all tokens in parallel via self-attention: each token attends to every token in the sequence (itself included) through a single batched matrix operation, which replaces the step-by-step recurrence of RNNs.
[Interactive widget: Self-Attention Visualizer. Controls: attention heads (4), temperature 1/√d_k (0.125), mode. Panels: attention matrix (row = query token, column = key token, shaded from low to high attention) and positional encoding (first 6 dimensions of each token).]
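The positional encoding panel of the visualizer does not say which scheme it plots. As a minimal sketch, assuming the fixed sinusoidal encoding from the original Transformer paper, the first 6 dimensions for 5 tokens could be computed as below. `sinusoidal_positions` is a hypothetical helper, and `d_model=64` is an assumed (even) model width.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) array of sinusoidal position encodings.

    Assumes d_model is even: even columns hold sines, odd columns cosines.
    """
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2) even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=5, d_model=64)
print(pe[:, :6].round(3))  # first 6 dims for each of the 5 tokens
```

These vectors are added to the token embeddings so that attention, which is otherwise order-agnostic, can distinguish positions.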
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    # Scale by 1/sqrt(d_k) so the softmax inputs stay in a reasonable range
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys
    return torch.matmul(weights, V), weights


# seq_len=5 tokens, d_k=64 dimensions
Q = torch.randn(1, 5, 64)  # queries
K = torch.randn(1, 5, 64)  # keys
V = torch.randn(1, 5, 64)  # values

out, w = scaled_dot_product_attention(Q, K, V)
print("Output:", out.shape)   # torch.Size([1, 5, 64])
print("Weights:", w.shape)    # torch.Size([1, 5, 5]): each token attends to all 5
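The example above never passes `mask`. Here is a minimal NumPy sketch of causal (look-ahead) masking, using the same convention that 0 marks a blocked position. `attention_weights` is a hypothetical helper that mirrors the function above for a single unbatched sequence; it is not a library API.

```python
import numpy as np

def attention_weights(Q, K, mask=None):
    """softmax(Q K^T / sqrt(d_k)) with an optional 0/1 mask (0 = blocked)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 64))
K = rng.standard_normal((5, 64))

causal_mask = np.tril(np.ones((5, 5)))  # token i may attend to tokens 0..i only
w = attention_weights(Q, K, causal_mask)
print(w.round(2))  # entries above the diagonal are 0
```

Each row still sums to 1, but the weight is spread only over the current and earlier tokens, which is the pattern decoder-style models use during generation.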
02
Fine-tuning & LoRA
Full fine-tuning updates all model parameters. LoRA freezes the pretrained weights and trains small low-rank matrices added alongside them, typically cutting trainable parameters by 10–100× or more while often reaching similar performance.
[Interactive widget: LoRA Parameter Calculator & Rank Explorer. Controls: model hidden dim d (768), LoRA rank r (8), LoRA alpha (16). Panels: LoRA architecture (W frozen, only A and B are trained), parameter breakdown (full vs. LoRA trainable), LoRA low-rank approximation (singular value spectrum), fine-tuning method comparison. Readouts: base params, LoRA params, trainable %, scale α/r.]
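For a square d × d weight, LoRA adds A (r × d) and B (d × r), so the trainable count is 2·d·r against d² frozen parameters. Below is a small sketch of that arithmetic across ranks, using d = 768 from the calculator. `lora_counts` is a hypothetical helper for illustration and covers a single layer only.

```python
def lora_counts(d, r):
    """Parameter counts for one square d x d layer with a rank-r LoRA update."""
    base = d * d        # frozen weight W
    lora = 2 * d * r    # A is (r, d), B is (d, r)
    return base, lora, lora / (base + lora)

for r in (1, 4, 8, 16, 64):
    base, lora, frac = lora_counts(768, r)
    print(f"r={r:>2}  base={base:,}  lora={lora:,}  trainable={frac:.2%}")
```

The trainable count grows linearly with r, so doubling the rank doubles the adapter size while the frozen d² term stays fixed.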
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_f, out_f, rank=4, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_f, out_f, bias=False)
        self.base.weight.requires_grad = False  # freeze base
        # A starts small and random, B starts at zero, so the initial update B @ A is zero
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus scaled low-rank path
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale


lora = LoRALinear(768, 768, rank=8)
total = sum(p.numel() for p in lora.parameters())
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(f"Total: {total:,} | Trainable: {trainable:,}")  # Total: 602,112 | Trainable: 12,288
print(f"Trainable ratio: {trainable/total:.2%}")       # 2.04%
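Because the LoRA update is linear, it can be folded into the frozen weight after training, W' = W + (α/r)·B·A, so inference needs no extra matrix multiply. The NumPy sketch below checks that equivalence on random matrices. Shapes follow the class above; the values are random stand-ins, not trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 768, 768, 8, 16
scale = alpha / rank

W = rng.standard_normal((d_out, d_in))         # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # LoRA down-projection
B = rng.standard_normal((d_out, rank))         # LoRA up-projection (nonzero, as if trained)
x = rng.standard_normal((4, d_in))             # batch of 4 inputs

adapter_out = x @ W.T + (x @ A.T @ B.T) * scale  # base path + LoRA path
W_merged = W + scale * (B @ A)                   # fold the update into W
merged_out = x @ W_merged.T

print(np.allclose(adapter_out, merged_out))
```

Keeping A and B separate is still useful when several adapters share one base model, since each adapter can be swapped in without touching W.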
03
RAG — Retrieval Augmented Generation
RAG retrieves relevant documents via vector search and injects them into the LLM prompt, which reduces hallucination and works around the knowledge cutoff without expensive fine-tuning.
[Interactive widget: RAG Pipeline Simulator. Pipeline stages: Query → Embed → Search → Retrieve → LLM. Controls: top-k chunks to retrieve (2), retrieval method, chunk size in tokens (256). Panels: knowledge base ranked by similarity to the query, vector space (2D projection of embeddings), simulated LLM response grounded in the retrieved context. Readouts: docs in KB, retrieved, top score, method.]
RAG grounds LLM answers in retrieved facts, reducing hallucination.
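Before indexing, long documents are usually split into chunks, which is what the chunk-size setting in the simulator refers to. The sketch below is a minimal version that assumes whitespace-separated words stand in for tokens. `chunk_text` and its `overlap` parameter are hypothetical; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into chunks of at most chunk_size words, overlapping by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(600))  # synthetic 600-word document
chunks = chunk_text(sample, chunk_size=256, overlap=32)
print(len(chunks), [len(c.split()) for c in chunks])
```

Smaller chunks tend to give more precise matches but less context per hit; the overlap is one common way to avoid cutting a relevant passage in half.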
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Simplified RAG (use sentence-transformers in production)
docs = [
    "Python is a high-level language known for readability.",
    "Machine learning finds patterns in data using algorithms.",
    "Transformers use attention mechanisms to process sequences.",
    "RAG combines retrieval and LLM generation for grounded answers.",
]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)


def retrieve(query, top_k=2):
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, doc_vecs)[0]
    idx = np.argsort(sims)[::-1][:top_k]
    return [(docs[i], sims[i]) for i in idx]


results = retrieve("How does attention work?")
for doc, score in results:
    print(f"[{score:.3f}] {doc[:55]}...")
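The snippet above stops at retrieval. The remaining step, injecting the retrieved text into the prompt, might look like the sketch below. `build_prompt`, its `max_chars` budget, and the template wording are hypothetical, and the call to an actual LLM is left out because the source does not name one.

```python
def build_prompt(query, retrieved, max_chars=2000):
    """Format retrieved (doc, score) pairs and the question into one prompt string."""
    context_lines = []
    used = 0
    for i, (doc, _score) in enumerate(retrieved, start=1):
        if used + len(doc) > max_chars:
            break  # keep the context within a rough size budget
        context_lines.append(f"[{i}] {doc}")
        used += len(doc)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Illustrative (doc, score) pairs in the shape retrieve() returns; scores are placeholders
retrieved = [
    ("Transformers use attention mechanisms to process sequences.", 1.0),
    ("RAG combines retrieval and LLM generation for grounded answers.", 0.0),
]
prompt = build_prompt("How does attention work?", retrieved)
print(prompt)
```

Numbering the context entries makes it possible to ask the model to cite which chunk supports each statement, which helps when checking that an answer is actually grounded.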
