Attention and Transformers

Scenario: Building a Code Autocomplete Tool

You're creating an AI assistant for programmers that predicts the next code tokens based on the entire file context. Simple RNNs forget the early parts of long files; transformers with attention capture dependencies across hundreds of lines, enabling accurate, context-aware code completion.

Learning Objectives

By the end of this module (45-60 minutes), you should be able to:

  • Explain how attention mechanisms work and why they're better than RNNs for long sequences.
  • Implement self-attention and multi-head attention from scratch.
  • Understand the transformer architecture (encoder-decoder, positional encoding).
  • Fine-tune a pretrained transformer like BERT for text classification.
  • Troubleshoot common issues like overfitting or attention collapse.

Prerequisites: Basic neural networks (matrices, softmax); understanding of sequences. Difficulty: Advanced.

What This Is

Attention lets a model look at all positions in the input at once and decide which ones matter for each output. Transformers are architectures built entirely on attention — no recurrence, no convolutions — and they dominate modern NLP, vision, and multimodal AI.

The core idea: instead of processing sequences left-to-right, let every position attend to every other position in parallel.

When You Use It

  • processing text where long-range dependencies matter
  • building or fine-tuning language models (BERT, GPT)
  • applying transformers to vision (ViT) or audio
  • any task where the relationship between distant elements is important

The Attention Mechanism

Attention computes a weighted sum of values, where the weights come from comparing queries to keys:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V

In plain terms:

  1. Each position creates a query ("what am I looking for?")
  2. Each position creates a key ("what do I contain?")
  3. Each position creates a value ("what information do I carry?")
  4. Queries compare against all keys to produce attention weights
  5. Values are combined using those weights
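As a quick numeric check, the formula can be evaluated directly on small random matrices (the shapes here are arbitrary, chosen only for illustration). Each row of the weight matrix is a softmax, so it sums to 1:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 4
Q = torch.randn(3, d_k)   # 3 query positions
K = torch.randn(3, d_k)   # 3 key positions
V = torch.randn(3, d_k)   # 3 value vectors

weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (3, 3) attention weights
out = weights @ V                                   # (3, d_k) weighted values

print(weights.sum(dim=-1))  # each row sums to 1
print(out.shape)
```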

Self-Attention Step By Step

import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q    # queries: "what am I looking for?"
    K = X @ W_k    # keys: "what do I contain?"
    V = X @ W_v    # values: "what information do I carry?"

    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # scaled dot-product (works batched too)
    weights = F.softmax(scores, dim=-1)            # attention weights; each row sums to 1
    output = weights @ V                           # weighted combination of values
    return output, weights

Multi-Head Attention

Instead of one attention operation, split Q/K/V into multiple "heads" that each attend to different aspects:

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_out = torch.nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        # project, then split into heads: (B, n_heads, T, d_head)
        q, k, v = (W(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for W in (self.W_q, self.W_k, self.W_v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(B, T, -1)
        return self.W_out(out)       # re-merge heads and project

Why multiple heads: one head might attend to syntax, another to semantics, another to position. The model learns to specialize.

Transformer Block

A transformer block combines:

  1. Multi-head self-attention — look at all positions
  2. Add & Norm — residual connection + layer normalization
  3. Feed-forward network — two linear layers with ReLU
  4. Add & Norm — another residual connection
Input → Self-Attention → Add & Norm → FFN → Add & Norm → Output
  ↓___________________________↑         ↓____________↑
        (residual)                      (residual)
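These four pieces can be sketched as a small PyTorch module. This is a minimal post-norm sketch using the built-in `nn.MultiheadAttention`; real implementations add dropout and often normalize before each sub-layer instead:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # 1. multi-head self-attention
        x = self.norm1(x + attn_out)       # 2. Add & Norm (residual)
        x = self.norm2(x + self.ffn(x))    # 3-4. FFN, then Add & Norm
        return x

block = TransformerBlock(d_model=16, n_heads=4, d_ff=64)
x = torch.randn(2, 10, 16)                 # (batch, seq_len, d_model)
print(block(x).shape)                      # shape is preserved: (2, 10, 16)
```

Because the block preserves the input shape, blocks can be stacked to any depth.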

Positional Encoding

Attention has no notion of position — "the cat sat" and "sat the cat" look the same. Positional encoding injects order:

# Sinusoidal positional encoding
import math
import torch

seq_len, d_model = 100, 64
position = torch.arange(seq_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine

Modern models often use learned positional embeddings instead.
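A learned positional embedding is just a second embedding table indexed by position, added to the token embeddings (vocabulary and dimension sizes below are arbitrary illustration values):

```python
import torch
import torch.nn as nn

vocab, max_len, d_model = 1000, 512, 16
tok_emb = nn.Embedding(vocab, d_model)     # one vector per token id
pos_emb = nn.Embedding(max_len, d_model)   # one learned vector per position

ids = torch.randint(0, vocab, (2, 10))     # (batch, seq_len)
positions = torch.arange(10)
x = tok_emb(ids) + pos_emb(positions)      # positions broadcast over the batch
print(x.shape)                             # (2, 10, 16)
```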

Encoder vs Decoder

Component        | Sees                                                | Used By
Encoder          | all positions (bidirectional)                       | BERT, classification, embeddings
Decoder          | only past positions (causal mask)                   | GPT, text generation
Encoder-Decoder  | encoder sees all; decoder attends to encoder output | T5, translation, summarization
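The causal mask that gives a decoder its "only past positions" view is an upper-triangular mask that sets future scores to -inf before the softmax, so their weights become exactly zero:

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)                 # raw attention scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))  # block future positions
weights = F.softmax(scores, dim=-1)

print(weights)  # upper triangle is zero; each row still sums to 1
```

Position 0 can only attend to itself, so its entire weight lands on the first entry.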

Common Transformer Models

Model | Architecture             | Main Use
BERT  | encoder-only             | classification, NER, embeddings
GPT   | decoder-only             | text generation
T5    | encoder-decoder          | text-to-text tasks
ViT   | encoder on image patches | image classification

Failure Pattern

Training a transformer from scratch on a small dataset. Transformers are data-hungry — for most practical tasks, fine-tuning a pretrained model is far more effective than training from scratch.

Another failure: ignoring the quadratic cost of attention. Self-attention scales as O(n²) with sequence length, which becomes a bottleneck for long documents.
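The quadratic growth is easy to quantify: one n×n attention matrix in float32 costs 4n² bytes, so quadrupling the sequence length multiplies the memory by 16 (and that is per head, per layer):

```python
costs = {}
for n in [512, 2048, 8192]:
    mb = n * n * 4 / 1e6   # float32 bytes for one n×n attention matrix
    costs[n] = mb
    print(f"seq_len={n}: {n*n:,} entries ≈ {mb:.0f} MB per attention matrix")
```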

Common Mistakes

  • forgetting the causal mask in decoder models, which lets the model "cheat" by looking at future tokens
  • not scaling the dot products by √d_k, which lets large dot products saturate the softmax and produce vanishing gradients
  • using too few attention heads for the model dimension
  • ignoring the positional encoding and wondering why word order does not matter

Practice

  1. Implement single-head self-attention from scratch and inspect the attention weights.
  2. Show what happens when you remove positional encoding from a sequence task.
  3. Compare a 2-head and 8-head attention and explain whether more heads help.
  4. Explain the difference between encoder attention (bidirectional) and decoder attention (causal).
  5. Fine-tune a pretrained BERT model and compare against training from scratch.

Case Study: GPT and ChatGPT

GPT models use causal attention to generate coherent text by predicting the next word based on all previous context. This enables conversational AI like ChatGPT, demonstrating how transformers revolutionized language generation and understanding.

Quick Quiz

  1. Why is attention better than RNNs for long sequences?
    a) It processes sequences faster
    b) It directly models dependencies without vanishing gradients
    c) It uses less memory
    d) It handles variable-length inputs better

  2. What does multi-head attention do?
    a) Increases model depth
    b) Runs multiple attention mechanisms in parallel
    c) Reduces sequence length
    d) Adds positional encoding

  3. How does positional encoding work in transformers?
    a) It sorts the tokens by importance
    b) It adds sinusoidal signals for sequence order
    c) It masks future tokens
    d) It normalizes attention weights

  4. In the code autocomplete scenario, why use transformers?
    a) For faster training
    b) To capture long-range dependencies in code
    c) For better memory efficiency
    d) For simpler implementation

Checkpoint

  • [ ] Implement self-attention and visualize attention weights
  • [ ] Build a simple transformer block and train on a toy task
  • [ ] Fine-tune BERT on a text classification dataset
  • [ ] Understand encoder vs decoder architectures
  • [ ] Handle positional encoding and causal masking

Runnable Example

This example stays local-only for now because the browser runner does not yet include PyTorch.

Longer Connection

Continue with Transfer and Fine-Tuning for practical fine-tuning of transformer backbones, and Text Representations and Order for the simpler text representations that serve as baselines before transformers.