NLP & Generative AI — Interactive Guide

01

Transformer & Self-Attention

Transformers process all tokens in parallel via self-attention: each token attends to every token in the sequence (itself included) in a single matrix operation, which replaces the step-by-step recurrence of RNNs.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Multi-head: h parallel heads
Positional encoding: order info
BERT: encoder-only, bidirectional
GPT: decoder-only, causal
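Self-attention on its own ignores token order, which is why a positional encoding is added to the token embeddings. The guide does not say which variant it uses, so the sketch below assumes the fixed sinusoidal scheme from the original Transformer design; the function name `sinusoidal_positional_encoding` and the choice of NumPy are illustrative, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array of fixed sinusoidal encodings (assumes even d_model)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices: 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=5, d_model=64)
print(pe.shape)             # (5, 64)
print(pe[:, :6].round(3))   # first 6 dimensions for each of the 5 positions
```

Each position gets a distinct pattern of sines and cosines at different frequencies, so the model can recover both absolute and relative order once the encoding is added to the embeddings.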

Interactive Widget: Self-Attention Visualizer (audio narration available)

Click a token to highlight its attention pattern in the matrix.

Controls: attention heads (shown: 4), temperature 1/√d_k (shown: 0.125), mode.
Panels: attention matrix (row = query token, column = key token, shaded from low to high attention); positional encoding (first 6 dimensions of each token).
Readouts: tokens, heads, attention ops, selected token.

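import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return (output, weights) for softmax(QKᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    # Score every query against every key; dividing by √d_k keeps the softmax from saturating
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax assigns them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, V), weights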

# seq_len=5 tokens, d_k=64 dimensions
Q = torch.randn(1, 5, 64)  # queries
K = torch.randn(1, 5, 64)  # keys
V = torch.randn(1, 5, 64)  # values

out, w = scaled_dot_product_attention(Q, K, V)
print("Output:", out.shape)   # torch.Size([1, 5, 64])
print("Weights:", w.shape)    # torch.Size([1, 5, 5]): each of the 5 tokens attends to all 5
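The `mask` argument in the function above is what separates GPT-style causal attention from BERT-style bidirectional attention. As a rough illustration, the sketch below re-implements the same formula in NumPy (a deliberate swap to keep the example dependency-light; the helper names `softmax` and `attention` are illustrative) and compares the two weight matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Keep scores where mask is True; push the rest to a large negative value
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 64
Q = rng.standard_normal((1, seq_len, d_k))
K = rng.standard_normal((1, seq_len, d_k))
V = rng.standard_normal((1, seq_len, d_k))

# Lower-triangular mask: token i may attend only to tokens 0..i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

_, w_bidirectional = attention(Q, K, V)                  # encoder-style: every token sees all tokens
_, w_causal = attention(Q, K, V, mask=causal_mask)       # decoder-style: no attention to future tokens

print(w_bidirectional[0].round(2))
print(w_causal[0].round(2))   # entries above the diagonal are zero
```

In the causal case the first token can only attend to itself, so its row collapses to a single weight of 1.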

02

Fine-tuning & LoRA

Full fine-tuning updates all model parameters; LoRA freezes the pretrained weights and trains small low-rank matrices added alongside them, which typically means 10–100× (or more) fewer trainable parameters with performance often close to full fine-tuning.

LoRA: ΔW = B·A, rank r ≪ d
QLoRA: LoRA on 4-bit quantized base
PEFT: LoRA / Prefix / Adapters
RLHF: SFT → reward model → PPO
Only A & B trained, base frozen
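The parameter saving follows directly from the shapes: a full d_out × d_in weight has d_in·d_out entries, while the low-rank pair has only r·(d_in + d_out). The short sketch below works through that arithmetic for a single 768 × 768 layer, the size used in this section's code example; the helper name `lora_param_counts` is illustrative.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (frozen base params, trainable LoRA params) for one linear layer."""
    base = d_in * d_out                  # frozen weight W
    lora = rank * d_in + d_out * rank    # A is (rank, d_in), B is (d_out, rank)
    return base, lora

for r in (1, 4, 8, 16, 64):
    base, lora = lora_param_counts(768, 768, r)
    print(f"r={r:>2}: LoRA params={lora:>7,}  ({lora / base:.2%} of the frozen layer)")
```

For r = 8 this gives 12,288 trainable parameters against 589,824 frozen ones. The saving for a whole model depends on which layers receive adapters, so the per-layer ratio is only a rough guide.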

Interactive Widget: LoRA Parameter Calculator & Rank Explorer (audio narration available)

Adjust rank to explore the parameter efficiency tradeoff.

Controls: model hidden dim d (shown: 768), LoRA rank r (shown: 8), LoRA alpha (shown: 16).
Panels: LoRA architecture (W frozen, only A and B are trained); parameter breakdown (full vs LoRA trainable); LoRA low-rank approximation (singular value spectrum); fine-tuning method comparison.
Readouts: base params, LoRA params, trainable %, scale α/r.

import torch
import torch.nn as nn

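class LoRALinear(nn.Module):
    """Linear layer with a frozen base weight W and a trainable low-rank update B·A."""
    def __init__(self, in_f, out_f, rank=4, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_f, out_f, bias=False)
        self.base.weight.requires_grad = False  # freeze base
        # A: (rank, in_f) projects down; small random init
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        # B: (out_f, rank) projects back up; zero init so ΔW = B·A starts at 0
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # W·x + (α/r)·B·A·x, computed without materializing the full ΔW
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale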

lora = LoRALinear(768, 768, rank=8)
total     = sum(p.numel() for p in lora.parameters())
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(f"Total: {total:,} | Trainable: {trainable:,}")   # Total: 602,112 | Trainable: 12,288
print(f"Trainable ratio: {trainable/total:.2%}")  # 2.04%
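Because the update is just a matrix added to W, the trained adapter can be folded into the base weight after training, so inference needs no extra matrix multiplications. The sketch below checks that equivalence numerically in NumPy (a deliberate swap from PyTorch to keep it dependency-light); the random A and B stand in for trained values and are not real weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 768, 768, 8, 16
scale = alpha / rank

W = rng.standard_normal((d_out, d_in))           # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01     # stand-in for the trained down-projection
B = rng.standard_normal((d_out, rank)) * 0.01    # stand-in for the trained up-projection
x = rng.standard_normal((2, d_in))               # batch of 2 inputs

# Adapter form: base output plus the scaled low-rank path
y_adapter = x @ W.T + (x @ A.T @ B.T) * scale

# Merged form: fold the update into a single weight matrix
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))   # True
print(np.linalg.matrix_rank(B @ A))       # the update has rank at most `rank`
```

Keeping A and B separate during training is what saves memory; merging is only a convenience for deployment.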

03

RAG — Retrieval Augmented Generation

RAG retrieves relevant documents via vector search and injects them into the LLM prompt, which mitigates hallucination and knowledge-cutoff gaps without the cost of fine-tuning.

Query → Embed → Search → Retrieve → LLM
Chunking: 256–512 tokens
Dense: cosine sim on embeddings
Hybrid: dense + BM25 sparse
Cross-encoder re-ranking
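Chunking is the step that turns long documents into retrievable units. The sketch below is one simple way to do it, under two assumptions that are not from the guide: whitespace-separated words stand in for real tokenizer tokens, and consecutive chunks overlap by a fixed amount so that sentences cut at a boundary still appear whole in one chunk. The name `chunk_tokens` and the overlap of 32 are illustrative.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into windows of chunk_size that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # the last window already reached the end of the document
    return chunks

# Hypothetical 600-"token" document (whitespace words as a stand-in for tokenizer output)
tokens = " ".join(f"word{i}" for i in range(600)).split()
chunks = chunk_tokens(tokens, chunk_size=256, overlap=32)
print([len(c) for c in chunks])   # [256, 256, 152]
```

Smaller chunks make retrieval more precise but carry less context each; larger chunks do the opposite, which is the tradeoff behind the 256–512 token range.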

Interactive Widget: RAG Pipeline Simulator (audio narration available)

Select a query to run it through the pipeline: Query → Embed → Search → Retrieve → LLM.

Controls: top-k chunks to retrieve (shown: 2), retrieval method, chunk size in tokens (shown: 256).
Panels: knowledge base (ranked by similarity to query); vector space (2D projection of embeddings); simulated LLM response (grounded in retrieved context).
Readouts: docs in KB, retrieved, top score, method.

RAG grounds LLM answers in retrieved facts, reducing hallucination.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Simplified retrieval step of RAG using sparse TF-IDF vectors
# (in production, use dense embeddings, e.g. from sentence-transformers)
docs = [
    "Python is a high-level language known for readability.",
    "Machine learning finds patterns in data using algorithms.",
    "Transformers use attention mechanisms to process sequences.",
    "RAG combines retrieval and LLM generation for grounded answers.",
]
vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)

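def retrieve(query, top_k=2):
    """Return the top_k (document, cosine similarity) pairs for the query."""
    q_vec = vectorizer.transform([query])            # embed the query with the fitted vocabulary
    sims  = cosine_similarity(q_vec, doc_vecs)[0]    # similarity to every document
    idx   = np.argsort(sims)[::-1][:top_k]           # indices of the highest scores first
    return [(docs[i], sims[i]) for i in idx]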

results = retrieve("How does attention work?")
for doc, score in results:
    print(f"[{score:.3f}] {doc[:55]}...")
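The example above stops at retrieval; the remaining step is injecting the retrieved text into the LLM prompt. The sketch below shows one plausible way to assemble that prompt. The function name `build_prompt`, the prompt wording, the `min_score` cutoff, and the similarity values in the sample data are all illustrative assumptions, and no LLM is actually called.

```python
def build_prompt(query, retrieved, min_score=0.0):
    """Build a grounded prompt from (document, score) pairs, dropping low-scoring ones."""
    context_lines = [
        f"[{i}] {doc}"
        for i, (doc, score) in enumerate(retrieved, start=1)
        if score > min_score   # top-k can include documents with zero similarity
    ]
    context = "\n".join(context_lines) if context_lines else "(no relevant context found)"
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Placeholder results in the same (doc, score) shape that retrieve() returns; scores are made up
retrieved = [
    ("Transformers use attention mechanisms to process sequences.", 0.41),
    ("Python is a high-level language known for readability.", 0.0),
]
prompt = build_prompt("How does attention work?", retrieved)
print(prompt)
```

Filtering on the score matters here because a fixed top_k always returns k documents, even when only one of them shares any terms with the query.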

Language That Understands
