Attention and Transformers

Scenario: Building a Code Autocomplete Tool

You're creating an AI assistant for programmers that predicts the next code tokens based on the entire file context. Simple RNNs forget the early parts of long files; transformers with attention capture dependencies across hundreds of lines, enabling accurate, context-aware code completion.

Learning Objectives

By the end of this module (45-60 minutes), you should be able to:

  • Explain how attention mechanisms work and why they're better than RNNs for long sequences.
  • Implement self-attention and multi-head attention from scratch.
  • Understand the transformer architecture (encoder-decoder, positional encoding).
  • Fine-tune a pretrained transformer like BERT for text classification.
  • Troubleshoot common issues like overfitting or attention collapse.

Prerequisites: Basic neural networks (matrices, softmax); understanding of sequences. Difficulty: Advanced.

What This Is

Attention lets a model look at all positions in the input at once and decide which ones matter for each output. Transformers are architectures built entirely on attention — no recurrence, no convolutions — and they dominate modern NLP, vision, and multimodal AI.

The core idea: instead of processing sequences left-to-right, let every position attend to every other position in parallel.

When You Use It

  • processing text where long-range dependencies matter
  • building or fine-tuning language models (BERT, GPT)
  • applying transformers to vision (ViT) or audio
  • any task where the relationship between distant elements is important

The Attention Mechanism

Attention computes a weighted sum of values, where the weights come from comparing queries to keys:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V

In plain terms:

  1. Each position creates a query ("what am I looking for?")
  2. Each position creates a key ("what do I contain?")
  3. Each position creates a value ("what information do I carry?")
  4. Queries compare against all keys to produce attention weights
  5. Values are combined using those weights
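As a quick numeric check, the formula can be evaluated directly on small random matrices (the shapes here are arbitrary, chosen only for illustration). Each row of the weight matrix is a softmax, so it sums to 1:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 4
Q = torch.randn(3, d_k)   # 3 query positions
K = torch.randn(3, d_k)   # 3 key positions
V = torch.randn(3, d_k)   # 3 value vectors

weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (3, 3) attention weights
out = weights @ V                                   # (3, d_k) weighted values

print(weights.sum(dim=-1))  # each row sums to 1
print(out.shape)
```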

Self-Attention Step By Step

import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q    # queries: "what am I looking for?"
    K = X @ W_k    # keys: "what do I contain?"
    V = X @ W_v    # values: "what information do I carry?"

    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # scaled dot-product (works batched too)
    weights = F.softmax(scores, dim=-1)            # attention weights; each row sums to 1
    output = weights @ V                           # weighted combination of values
    return output, weights

Multi-Head Attention

Instead of one attention operation, split Q/K/V into multiple "heads" that each attend to different aspects:

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_out = torch.nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        # project, then split into heads: (B, n_heads, T, d_head)
        q, k, v = (W(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for W in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)
        return self.W_out(out)       # re-merge heads and project

Why multiple heads: one head might attend to syntax, another to semantics, another to position. The model learns to specialize.

Transformer Block

A transformer block combines:

  1. Multi-head self-attention — look at all positions
  2. Add & Norm — residual connection + layer normalization
  3. Feed-forward network — two linear layers with ReLU
  4. Add & Norm — another residual connection
Input → Self-Attention → Add & Norm → FFN → Add & Norm → Output
  ↓___________________________↑         ↓____________↑
        (residual)                      (residual)
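These four pieces can be sketched as a small PyTorch module. This is a minimal post-norm sketch using the built-in `nn.MultiheadAttention`; real implementations add dropout and often normalize before each sub-layer instead:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # 1. multi-head self-attention
        x = self.norm1(x + attn_out)       # 2. Add & Norm (residual)
        x = self.norm2(x + self.ffn(x))    # 3-4. FFN, then Add & Norm
        return x

block = TransformerBlock(d_model=16, n_heads=4, d_ff=64)
x = torch.randn(2, 10, 16)                 # (batch, seq_len, d_model)
print(block(x).shape)                      # shape is preserved: (2, 10, 16)
```

Because the block preserves the input shape, blocks can be stacked to any depth.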

Positional Encoding

Attention has no notion of position — "the cat sat" and "sat the cat" look the same. Positional encoding injects order:

# Sinusoidal positional encoding
import math
import torch

seq_len, d_model = 100, 64
position = torch.arange(seq_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine

Modern models often use learned positional embeddings instead.
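A learned positional embedding is just a second embedding table indexed by position, added to the token embeddings (vocabulary and dimension sizes below are arbitrary illustration values):

```python
import torch
import torch.nn as nn

vocab, max_len, d_model = 1000, 512, 16
tok_emb = nn.Embedding(vocab, d_model)     # one vector per token id
pos_emb = nn.Embedding(max_len, d_model)   # one learned vector per position

ids = torch.randint(0, vocab, (2, 10))     # (batch, seq_len)
positions = torch.arange(10)
x = tok_emb(ids) + pos_emb(positions)      # positions broadcast over the batch
print(x.shape)                             # (2, 10, 16)
```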

Encoder vs Decoder

Component        | Sees                                                | Used By
Encoder          | all positions (bidirectional)                       | BERT, classification, embeddings
Decoder          | only past positions (causal mask)                   | GPT, text generation
Encoder-Decoder  | encoder sees all; decoder attends to encoder output | T5, translation, summarization
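The causal mask that gives a decoder its "only past positions" view is an upper-triangular mask that sets future scores to -inf before the softmax, so their weights become exactly zero:

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)                 # raw attention scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))  # block future positions
weights = F.softmax(scores, dim=-1)

print(weights)  # upper triangle is zero; each row still sums to 1
```

Position 0 can only attend to itself, so its entire weight lands on the first entry.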

Common Transformer Models

Model | Architecture             | Main Use
BERT  | encoder-only             | classification, NER, embeddings
GPT   | decoder-only             | text generation
T5    | encoder-decoder          | text-to-text tasks
ViT   | encoder on image patches | image classification

Failure Pattern

Training a transformer from scratch on a small dataset. Transformers are data-hungry — for most practical tasks, fine-tuning a pretrained model is far more effective than training from scratch.

Another failure: ignoring the quadratic cost of attention. Self-attention scales as O(n²) with sequence length, which becomes a bottleneck for long documents.
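The quadratic growth is easy to quantify: one n×n attention matrix in float32 costs 4n² bytes, so quadrupling the sequence length multiplies the memory by 16 (and that is per head, per layer):

```python
costs = {}
for n in [512, 2048, 8192]:
    mb = n * n * 4 / 1e6   # float32 bytes for one n×n attention matrix
    costs[n] = mb
    print(f"seq_len={n}: {n*n:,} entries ≈ {mb:.0f} MB per attention matrix")
```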

Common Mistakes

  • forgetting the causal mask in decoder models, which lets the model "cheat" by looking at future tokens
  • not scaling the dot products by √d_k, which lets large dot products saturate the softmax and produce vanishing gradients
  • using too few attention heads for the model dimension
  • ignoring the positional encoding and wondering why word order does not matter

Practice

  1. Implement single-head self-attention from scratch and inspect the attention weights.
  2. Show what happens when you remove positional encoding from a sequence task.
  3. Compare a 2-head and 8-head attention and explain whether more heads help.
  4. Explain the difference between encoder attention (bidirectional) and decoder attention (causal).
  5. Fine-tune a pretrained BERT model and compare against training from scratch.

Case Study: GPT and ChatGPT

GPT models use causal attention to generate coherent text by predicting the next word based on all previous context. This enables conversational AI like ChatGPT, demonstrating how transformers revolutionized language generation and understanding.

Quick Quiz

  1. Why is attention better than RNNs for long sequences?
    a) It processes sequences faster
    b) It directly models dependencies without vanishing gradients
    c) It uses less memory
    d) It handles variable-length inputs better

  2. What does multi-head attention do?
    a) Increases model depth
    b) Runs multiple attention mechanisms in parallel
    c) Reduces sequence length
    d) Adds positional encoding

  3. How does positional encoding work in transformers?
    a) It sorts the tokens by importance
    b) It adds sinusoidal signals for sequence order
    c) It masks future tokens
    d) It normalizes attention weights

  4. In the code autocomplete scenario, why use transformers?
    a) For faster training
    b) To capture long-range dependencies in code
    c) For better memory efficiency
    d) For simpler implementation

Checkpoint

  • [ ] Implement self-attention and visualize attention weights
  • [ ] Build a simple transformer block and train on a toy task
  • [ ] Fine-tune BERT on a text classification dataset
  • [ ] Understand encoder vs decoder architectures
  • [ ] Handle positional encoding and causal masking

Runnable Example

This example stays local-only for now because the browser runner does not yet include PyTorch.

Longer Connection

Continue with Transfer and Fine-Tuning for practical fine-tuning of transformer backbones, and Text Representations and Order for the simpler text representations that serve as baselines before transformers.