06 — NLP & Generative AI

Language That Understands

Interactive playgrounds for Transformer self-attention, LoRA fine-tuning, and RAG pipelines. Click tokens, drag ranks, and type queries. Press ▶ to hear each concept explained.

01
Transformer & Self-Attention
Transformers process all tokens in parallel via self-attention — each token attends to every other token in a single matrix operation, replacing sequential RNNs.
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Multi-head: h parallel heads
Positional encoding: order info
BERT: encoder-only, bidirectional
GPT: decoder-only, causal
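Self-attention by itself ignores token order, which is why a positional encoding is added. The sketch below assumes the fixed sinusoidal scheme from the original Transformer; the page does not say which scheme its widget uses, and the function name and `base` parameter are illustrative choices.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table (list of lists) of sinusoidal encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions (2k, 2k+1) share one frequency: sin on even, cos on odd.
            freq = 1.0 / (base ** ((i - i % 2) / d_model))
            angle = pos * freq
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# 5 tokens, first 6 dimensions (assumed sizes for illustration)
pe = sinusoidal_positional_encoding(seq_len=5, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [f"{v:+.3f}" for v in row])
```

Each position gets a distinct vector, and low-index dimensions oscillate faster than high-index ones, so the model can recover both absolute and relative position from the sum of token embedding and encoding.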
Interactive widget: Self-Attention Visualizer (with audio narration)
Controls: attention heads (shown: 4), temperature 1/√d_k (shown: 0.125), mode
Panels: attention matrix (row = query token, column = key token, shaded from low to high attention); positional encoding (first 6 dimensions of each token)
Readouts: tokens, heads, attention ops, selected token
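import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return (output, attention weights) for inputs shaped (..., seq_len, d_k)."""
    d_k = Q.shape[-1]
    # Scale by sqrt(d_k) so the softmax does not saturate as d_k grows
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the key positions
    return torch.matmul(weights, V), weights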

# seq_len=5 tokens, d_k=64 dimensions
Q = torch.randn(1, 5, 64)  # queries
K = torch.randn(1, 5, 64)  # keys
V = torch.randn(1, 5, 64)  # values

out, w = scaled_dot_product_attention(Q, K, V)
print("Output:", out.shape)   # (1,5,64)
print("Weights:", w.shape)    # (1,5,5) - each token attends to all 5
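The demo above never passes a `mask`, so every token attends to all five positions, as in an encoder. For the decoder-only, causal case, each token may only attend to itself and earlier tokens. The following is a minimal pure-Python sketch of that masking step; the helper names are illustrative, and uniform scores are assumed so that the effect of the mask is easy to see.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask future positions (column j > row i) with -inf, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        weights.append(softmax(masked))
    return weights

# Uniform raw scores for 4 tokens: without a mask every weight would be 0.25
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Row `i` spreads its weight evenly over positions `0..i` and assigns exactly zero to later positions. With the PyTorch function above, a comparable effect could be obtained by passing a lower-triangular matrix of ones as `mask`.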
02
Fine-tuning & LoRA
Full fine-tuning updates all model parameters. LoRA instead freezes the pretrained weights and adds small trainable low-rank matrices alongside them, which typically cuts the trainable parameter count by 10–100× or more while often reaching similar performance.
LoRA: ΔW = B·A (scaled by α/r), rank r ≪ d
QLoRA: LoRA on a 4-bit quantized base model
PEFT: LoRA / Prefix tuning / Adapters
RLHF: SFT → reward model → PPO
Only A and B are trained; the base weights stay frozen
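To make the parameter savings concrete, here is a small calculator sketch. It assumes a single square d × d weight matrix with no bias, and it reports the trainable share relative to base plus LoRA parameters; the function name and returned keys are illustrative, not taken from the page's widget.

```python
def lora_param_stats(d, r, alpha):
    """Parameter counts for one d x d linear layer with a rank-r LoRA update."""
    base = d * d          # frozen weight W
    lora = 2 * d * r      # A is (r, d) and B is (d, r)
    return {
        "base_params": base,
        "lora_params": lora,
        "trainable_pct": 100.0 * lora / (base + lora),
        "scale": alpha / r,
    }

# Sweep the rank at a fixed hidden size to see the efficiency tradeoff
for r in (1, 4, 8, 16, 64):
    s = lora_param_stats(d=768, r=r, alpha=16)
    print(f"r={r:>2}  LoRA params={s['lora_params']:>7,}  "
          f"trainable={s['trainable_pct']:.2f}%  scale={s['scale']:.2f}")
```

The LoRA parameter count grows linearly in r, while the base count grows quadratically in d, so the trainable share shrinks as models get wider.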
Interactive widget: LoRA Parameter Calculator & Rank Explorer (with audio narration)
Controls: model hidden dim d (shown: 768), LoRA rank r (shown: 8), LoRA alpha (shown: 16)
Panels: LoRA architecture (W frozen, only A and B trained); parameter breakdown (full vs. LoRA trainable); LoRA low-rank approximation (singular value spectrum); fine-tuning method comparison
Readouts: base params, LoRA params, trainable %, scale α/r
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_f, out_f, rank=4, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_f, out_f, bias=False)
        self.base.weight.requires_grad = False  # freeze base
        # A: (rank, in_f) small random init; B: (out_f, rank) zero init,
        # so the update B @ A is zero at the start of training
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Equivalent to x @ (W + scale * B @ A).T without forming the full update
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

lora = LoRALinear(768, 768, rank=8)
total     = sum(p.numel() for p in lora.parameters())
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(f"Total: {total:,} | Trainable: {trainable:,}")
print(f"Trainable ratio: {trainable/total:.2%}")  # ~2%
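The forward pass of `LoRALinear` can be checked by hand on a tiny example. The sketch below uses plain Python lists instead of tensors, with made-up 2 × 3 weights and rank 1, to show two properties of the class above: a zero-initialized B leaves the base output unchanged, and a nonzero B adds (α/r) · B(Ax) on top of it.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x) for a single input vector x."""
    r = len(A)                       # rank = number of rows in A
    scale = alpha / r
    base_out = matvec(W, x)          # frozen path
    delta = matvec(B, matvec(A, x))  # low-rank path, never forms B @ A
    return [b + scale * d for b, d in zip(base_out, delta)]

# Hypothetical toy values: in_f=3, out_f=2, rank=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
x = [1.0, 2.0, 3.0]

B_init = [[0.0], [0.0]]      # zero init: update contributes nothing
B_trained = [[1.0], [-1.0]]  # stand-in for values after some training

print(lora_forward(W, A, B_init, x, alpha=2))     # [7.0, -1.0], same as W x
print(lora_forward(W, A, B_trained, x, alpha=2))  # [12.0, -6.0]
```

Here A x = 2.5 and the scale is 2, so the trained update shifts the base output [7, -1] by [5, -5].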
03
RAG — Retrieval Augmented Generation
RAG retrieves relevant documents via vector search and injects them into the LLM prompt, which reduces hallucination and works around the model's knowledge cutoff without expensive fine-tuning.
Query → Embed → Search → Retrieve → LLM
Chunking: 256–512 tokens
Dense: cosine similarity on embeddings
Hybrid: dense + BM25 sparse
Cross-encoder re-ranking
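Before anything can be embedded, documents have to be split into chunks. A minimal sketch of fixed-size chunking follows. Two assumptions are worth flagging: tokens are approximated by list items (a real pipeline would use the embedding model's tokenizer), and the `overlap` parameter is an addition for illustration that the page does not mention.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Stand-in document of 600 "tokens" (integers used so positions are visible)
tokens = list(range(600))
chunks = chunk_tokens(tokens, chunk_size=256, overlap=32)
for c in chunks:
    print(f"tokens {c[0]}..{c[-1]}  (length {len(c)})")
```

Smaller chunks give more precise matches but less context per hit; overlapping windows reduce the chance that a relevant sentence is cut in half at a boundary.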
Interactive widget: RAG Pipeline Simulator (with audio narration)
Pipeline stages: Query → Embed → Retrieve → LLM
Controls: top-k chunks to retrieve (shown: 2), retrieval method, chunk size in tokens (shown: 256)
Panels: knowledge base ranked by similarity to the query; vector space (2D projection of embeddings); simulated LLM response grounded in the retrieved context
Readouts: docs in KB, retrieved, top score, method
RAG grounds LLM answers in retrieved facts, reducing hallucination.
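Hybrid retrieval needs some way to merge a dense ranking with a BM25 ranking. The page does not say how its simulator combines them; one common option is reciprocal rank fusion (RRF), sketched below. The document IDs are hypothetical, and `k=60` is a conventional default rather than a value from the source.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each hit adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical rankings from two retrievers over the same knowledge base
dense_ranking = ["d3", "d1", "d2"]   # e.g. cosine similarity on embeddings
sparse_ranking = ["d3", "d4", "d1"]  # e.g. BM25

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A cross-encoder re-ranker would then typically rescore the top few fused results.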
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Simplified RAG: retrieval step only, with TF-IDF standing in for dense embeddings
# (use sentence-transformers in production)
docs = [
    "Python is a high-level language known for readability.",
    "Machine learning finds patterns in data using algorithms.",
    "Transformers use attention mechanisms to process sequences.",
    "RAG combines retrieval and LLM generation for grounded answers.",
]
vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)

def retrieve(query, top_k=2):
    q_vec = vectorizer.transform([query])
    sims  = cosine_similarity(q_vec, doc_vecs)[0]
    idx   = np.argsort(sims)[::-1][:top_k]
    return [(docs[i], sims[i]) for i in idx]

results = retrieve("How does attention work?")
for doc, score in results:
    print(f"[{score:.3f}] {doc[:55]}...")
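The snippet above stops at retrieval. The remaining step is to inject the retrieved chunks into the LLM prompt. The sketch below shows one way to do that; the template wording and the `build_prompt` helper are illustrative assumptions, and the scores in the sample input are placeholders rather than output from the TF-IDF code.

```python
def build_prompt(question, retrieved, max_chunks=2):
    """Format retrieved (text, score) pairs into a grounded prompt string."""
    context_lines = [
        f"[{i}] {text}"
        for i, (text, _score) in enumerate(retrieved[:max_chunks], start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder retrieval results in the same (doc, score) shape as retrieve() returns
retrieved = [
    ("Transformers use attention mechanisms to process sequences.", 0.5),
    ("RAG combines retrieval and LLM generation for grounded answers.", 0.0),
]
prompt = build_prompt("How does attention work?", retrieved)
print(prompt)
```

Numbering the chunks lets the model cite which passage supports its answer, and the explicit fallback instruction gives it a way out when the retrieved context does not cover the question.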