Attention and Transformers¶
Scenario: Building a Code Autocomplete Tool¶
You're creating an AI assistant for programmers that predicts the next code tokens from the entire file context. Simple RNNs forget the early parts of long files; transformers with attention capture dependencies across hundreds of lines, enabling accurate, context-aware code completion.
Learning Objectives¶
By the end of this module (45-60 minutes), you should be able to:

- Explain how attention mechanisms work and why they're better than RNNs for long sequences.
- Implement self-attention and multi-head attention from scratch.
- Understand the transformer architecture (encoder-decoder, positional encoding).
- Fine-tune a pretrained transformer like BERT for text classification.
- Troubleshoot common issues like overfitting or attention collapse.
Prerequisites: Basic neural networks (matrices, softmax); understanding of sequences. Difficulty: Advanced.
What This Is¶
Attention lets a model look at all positions in the input at once and decide which ones matter for each output. Transformers are architectures built entirely on attention — no recurrence, no convolutions — and they dominate modern NLP, vision, and multimodal AI.
The core idea: instead of processing sequences left-to-right, let every position attend to every other position in parallel.
When You Use It¶
- processing text where long-range dependencies matter
- building or fine-tuning language models (BERT, GPT)
- applying transformers to vision (ViT) or audio
- any task where the relationship between distant elements is important
The Attention Mechanism¶
Attention computes a weighted sum of values, where the weights come from comparing queries to keys:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
In plain terms:

1. Each position creates a query ("what am I looking for?")
2. Each position creates a key ("what do I contain?")
3. Each position creates a value ("what information do I carry?")
4. Queries compare against all keys to produce attention weights
5. Values are combined using those weights
Self-Attention Step By Step¶
```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q                          # queries
    K = X @ W_k                          # keys
    V = X @ W_v                          # values
    d_k = K.shape[-1]
    scores = Q @ K.T / (d_k ** 0.5)      # scaled dot-product
    weights = F.softmax(scores, dim=-1)  # attention weights (rows sum to 1)
    output = weights @ V                 # weighted combination of values
    return output, weights
```
Multi-Head Attention¶
Instead of one attention operation, split Q/K/V into multiple "heads" that each attend to different aspects:
```python
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_out = torch.nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)
        return self.W_out(out)
```
Why multiple heads: one head might attend to syntax, another to semantics, another to position. The model learns to specialize.
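In practice, PyTorch ships a built-in multi-head attention module, so you rarely need to hand-roll one. A minimal sketch of using it for self-attention (the sizes here are arbitrary; note that by default it returns attention weights averaged over heads):

```python
import torch

torch.manual_seed(0)
d_model, n_heads = 64, 8
mha = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
out, weights = mha(x, x, x)      # self-attention: query = key = value = x

print(out.shape)      # torch.Size([2, 10, 64])
print(weights.shape)  # torch.Size([2, 10, 10]), averaged over heads
```

Passing the same tensor as query, key, and value is what makes this *self*-attention; cross-attention would pass the encoder output as key and value.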
Transformer Block¶
A transformer block combines:
- Multi-head self-attention — look at all positions
- Add & Norm — residual connection + layer normalization
- Feed-forward network — two linear layers with ReLU
- Add & Norm — another residual connection
```
Input → Self-Attention → Add & Norm → FFN → Add & Norm → Output
  ↓______________________↑             ↓_________↑
        (residual)                      (residual)
```
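The four pieces above can be sketched as a single PyTorch module. This is a minimal post-norm encoder block using the built-in `MultiheadAttention`; the dimensions are illustrative, not prescriptive:

```python
import torch

class TransformerBlock(torch.nn.Module):
    """One encoder block: self-attention + FFN, each followed by Add & Norm."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_ff),
            torch.nn.ReLU(),
            torch.nn.Linear(d_ff, d_model),
        )
        self.norm1 = torch.nn.LayerNorm(d_model)
        self.norm2 = torch.nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # Add & Norm (first residual)
        x = self.norm2(x + self.ffn(x))   # Add & Norm (second residual)
        return x

block = TransformerBlock(d_model=64, n_heads=8, d_ff=256)
x = torch.randn(2, 10, 64)
y = block(x)
print(y.shape)  # torch.Size([2, 10, 64])
```

Because each sublayer's output is added back to its input, the block preserves the `(batch, seq_len, d_model)` shape, which is what lets transformers stack dozens of identical blocks.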
Positional Encoding¶
Attention has no notion of position — "the cat sat" and "sat the cat" look the same. Positional encoding injects order:
```python
import math
import torch

seq_len, d_model = 128, 64  # example sizes

# Sinusoidal positional encoding
position = torch.arange(seq_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
```
Modern models often use learned positional embeddings instead.
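Either way, the positional signal is simply added to the token embeddings before the first transformer block. A short sketch of both variants (vocabulary size and dimensions are made up for illustration):

```python
import math
import torch

seq_len, d_model, vocab = 16, 32, 100
tok = torch.nn.Embedding(vocab, d_model)

# sinusoidal table, as in the snippet above
position = torch.arange(seq_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

ids = torch.randint(0, vocab, (1, seq_len))
x = tok(ids) + pe  # pe broadcasts over the batch dimension
print(x.shape)     # torch.Size([1, 16, 32])

# learned alternative: a trainable embedding over position indices
pos_emb = torch.nn.Embedding(seq_len, d_model)
x_learned = tok(ids) + pos_emb(torch.arange(seq_len))
print(x_learned.shape)  # torch.Size([1, 16, 32])
```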
Encoder vs Decoder¶
| Component | Sees | Used By |
|---|---|---|
| Encoder | all positions (bidirectional) | BERT, classification, embeddings |
| Decoder | only past positions (causal mask) | GPT, text generation |
| Encoder-Decoder | encoder sees all, decoder attends to encoder output | T5, translation, summarization |
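The causal mask that keeps a decoder from seeing the future is just an upper-triangular mask applied to the attention scores before softmax. A minimal sketch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)  # raw attention scores for a 4-token sequence

# causal mask: position i may only attend to positions <= i
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))  # -inf → zero after softmax

weights = F.softmax(scores, dim=-1)
print(weights[0])  # first token attends only to itself: tensor([1., 0., 0., 0.])
```

Masked positions get `-inf`, which softmax turns into exactly zero weight, so each row still sums to 1 over the allowed positions.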
Common Transformer Models¶
| Model | Architecture | Main Use |
|---|---|---|
| BERT | encoder-only | classification, NER, embeddings |
| GPT | decoder-only | text generation |
| T5 | encoder-decoder | text-to-text tasks |
| ViT | encoder on image patches | image classification |
Failure Pattern¶
Training a transformer from scratch on a small dataset. Transformers are data-hungry — for most practical tasks, fine-tuning a pretrained model is far more effective than training from scratch.
Another failure: ignoring the quadratic cost of attention. Self-attention scales as O(n²) with sequence length, which becomes a bottleneck for long documents.
Common Mistakes¶
- forgetting the causal mask in decoder models, which lets the model "cheat" by looking at future tokens
- not scaling the dot products by √d_k, leading to vanishingly small gradients through softmax
- using too few attention heads for the model dimension
- ignoring the positional encoding and wondering why word order does not matter
Practice¶
- Implement single-head self-attention from scratch and inspect the attention weights.
- Show what happens when you remove positional encoding from a sequence task.
- Compare a 2-head and 8-head attention and explain whether more heads help.
- Explain the difference between encoder attention (bidirectional) and decoder attention (causal).
- Fine-tune a pretrained BERT model and compare against training from scratch.
Case Study: GPT and ChatGPT¶
GPT models use causal attention to generate coherent text by predicting the next word based on all previous context. This enables conversational AI like ChatGPT, demonstrating how transformers revolutionized language generation and understanding.
Quick Quiz¶
1. Why is attention better than RNNs for long sequences?
    a) It processes sequences faster
    b) It directly models dependencies without vanishing gradients
    c) It uses less memory
    d) It handles variable-length inputs better

2. What does multi-head attention do?
    a) Increases model depth
    b) Runs multiple attention mechanisms in parallel
    c) Reduces sequence length
    d) Adds positional encoding

3. How does positional encoding work in transformers?
    a) It sorts the tokens by importance
    b) It adds sinusoidal signals for sequence order
    c) It masks future tokens
    d) It normalizes attention weights

4. In the code autocomplete scenario, why use transformers?
    a) For faster training
    b) To capture long-range dependencies in code
    c) For better memory efficiency
    d) For simpler implementation
Checkpoint¶
- [ ] Implement self-attention and visualize attention weights
- [ ] Build a simple transformer block and train on a toy task
- [ ] Fine-tune BERT on a text classification dataset
- [ ] Understand encoder vs decoder architectures
- [ ] Handle positional encoding and causal masking
Further Reading¶
- The Illustrated Transformer for visual explanations.
- Attention Is All You Need paper.
- Hugging Face tutorials for fine-tuning transformers.
Runnable Example¶
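A self-contained demo tying the pieces together: single-head self-attention on random inputs, with the attention weights inspected directly (sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / (d_k ** 0.5)       # scaled dot-product
    weights = F.softmax(scores, dim=-1)   # each row is a distribution
    return weights @ V, weights

seq_len, d = 5, 8
X = torch.randn(seq_len, d)
W_q, W_k, W_v = (torch.randn(d, d) * 0.1 for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)            # torch.Size([5, 8])
print(weights.shape)        # torch.Size([5, 5]): every position attends to every position
print(weights.sum(dim=-1))  # each row sums to 1
```

Printing (or plotting) `weights` is the quickest sanity check when implementing attention: rows must sum to 1, and with untrained random projections the weights should be close to uniform.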
Longer Connection¶
Continue with Transfer and Fine-Tuning for practical fine-tuning of transformer backbones, and Text Representations and Order for the simpler text representations that serve as baselines before transformers.