Recurrent Networks and Sequences¶
Scenario: Predicting Stock Prices from Historical Data¶
You're building a model to forecast stock prices using daily trading data. Tomorrow's price depends on recent trends, so you'll use RNNs to capture temporal dependencies and predict future values from sequence patterns.
What This Is¶
Recurrent neural networks process sequences one step at a time, maintaining a hidden state that carries information from previous steps. They were the standard architecture for text, audio, and time series before transformers — and they are still useful when you need to understand how sequential models work at a fundamental level.
Objectives¶
- Understand how RNNs, LSTMs, and GRUs process sequential data
- Implement recurrent networks for sequence prediction tasks
- Handle variable-length sequences with padding and packing
- Apply teacher forcing for stable training
- Compare RNNs vs transformers for different sequence tasks
When You Use It¶
- understanding the foundations of sequence modeling
- processing variable-length sequences
- working with time series where each step depends on the recent past
- building a baseline before reaching for a transformer
- tasks where the sequential nature of processing is a feature, not a limitation
The Core Idea¶
At each time step t, the network:

1. Takes the current input x_t and the previous hidden state h_{t-1}
2. Produces a new hidden state h_t
3. Optionally produces an output
```
x₁        x₂        x₃
 ↓         ↓         ↓
h₀ →[RNN]→ h₁ →[RNN]→ h₂ →[RNN]→ h₃ → output
```
The same weights are used at every step — this is weight sharing across time.
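The recurrence is simple enough to write by hand. A minimal sketch of one vanilla RNN step, with illustrative dimensions (8 input features, 16 hidden units) chosen just for this example:

```python
import torch

# Illustrative dimensions: 8 input features, 16 hidden units
W_x = torch.randn(16, 8)   # input-to-hidden weights
W_h = torch.randn(16, 16)  # hidden-to-hidden weights
b = torch.zeros(16)

def rnn_step(x_t, h_prev):
    """One vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return torch.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a 3-step sequence with the SAME weights at every step
h = torch.zeros(16)  # h_0
for x_t in torch.randn(3, 8):
    h = rnn_step(x_t, h)
print(h.shape)  # torch.Size([16])
```

Note that the loop reuses `W_x`, `W_h`, and `b` at every step — that loop is the weight sharing.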
Vanilla RNN vs LSTM vs GRU¶
| Architecture | Memory | Key Idea | Weakness |
|---|---|---|---|
| Vanilla RNN | very short | simplest recurrence | vanishing gradients on long sequences |
| LSTM | long-range | cell state + gates control what to remember/forget | more parameters, slower |
| GRU | long-range | simplified gates, no separate cell state | slightly less expressive than LSTM |
LSTM Gates¶
An LSTM has three gates that control information flow:
- Forget gate: what to discard from the cell state
- Input gate: what new information to add
- Output gate: what part of the cell state to expose
```python
import torch.nn as nn

lstm = nn.LSTM(
    input_size=32,     # feature dimension per step
    hidden_size=64,    # hidden state dimension
    num_layers=2,      # stacked LSTM layers
    batch_first=True,  # input shape: (batch, seq_len, features)
    dropout=0.2,       # dropout between layers
    bidirectional=False,
)
```
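Calling the module returns the per-step outputs plus the final hidden and cell states. A quick shape check, with a batch of 4 and sequence length 10 assumed purely for illustration:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
               batch_first=True, dropout=0.2)

x = torch.randn(4, 10, 32)  # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (4, 10, 64): top-layer hidden state at every step
print(h_n.shape)     # (2, 4, 64): final hidden state for each layer
print(c_n.shape)     # (2, 4, 64): final cell state for each layer
```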
GRU — The Simpler Alternative¶
```python
gru = nn.GRU(
    input_size=32,
    hidden_size=64,
    num_layers=2,
    batch_first=True,
    dropout=0.2,
)
```
GRU has fewer parameters than LSTM and often performs comparably. Start with GRU unless you have a specific reason to prefer LSTM.
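The parameter gap is easy to verify: a GRU layer has three gate weight sets where an LSTM has four, so for the same sizes the GRU has exactly 25% fewer parameters. A small check, using the same illustrative sizes as above:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
gru = nn.GRU(input_size=32, hidden_size=64, num_layers=2, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm), count(gru))  # GRU: 3 gate sets vs the LSTM's 4
```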
Full Classification Pattern¶
```python
class SequenceClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            num_layers=2, dropout=0.2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        output, (h_n, c_n) = self.lstm(x)
        # use the last hidden state for classification
        last_hidden = h_n[-1]
        return self.classifier(last_hidden)
```
Key decisions¶
- Last hidden state (`h_n[-1]`): the summary of the entire sequence → for classification
- All outputs (`output`): hidden state at every step → for sequence-to-sequence or tagging
- Bidirectional: process the sequence forward AND backward, concatenate hidden states
Bidirectional RNNs¶
```python
lstm = nn.LSTM(input_size=32, hidden_size=64, bidirectional=True, batch_first=True)
# output shape: (batch, seq_len, 2 * hidden_size)
# h_n shape: (2 * num_layers, batch, hidden_size)
```
Bidirectional models see the full sequence at every position — useful for classification and tagging, but not for generation (where you cannot look ahead).
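A common stumbling block is extracting the sequence summary from a bidirectional model: the forward direction ends at the last step, but the backward direction ends at the first step. A sketch of how the pieces line up (batch of 4, length 10 assumed for illustration):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, bidirectional=True, batch_first=True)
x = torch.randn(4, 10, 32)
output, (h_n, c_n) = lstm(x)

# h_n is (num_layers * 2, batch, hidden): -2 = forward final, -1 = backward final
summary = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (4, 128)

# Equivalent views from `output`: forward at the LAST step, backward at the FIRST
fwd_last = output[:, -1, :64]
bwd_first = output[:, 0, 64:]
assert torch.allclose(summary, torch.cat([fwd_last, bwd_first], dim=1))
```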
Handling Variable-Length Sequences¶
Real sequences have different lengths. PyTorch provides:
```python
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Pack: skip padding during computation
packed = pack_padded_sequence(padded_input, lengths, batch_first=True,
                              enforce_sorted=False)
output, (h_n, c_n) = lstm(packed)

# Unpack: restore to padded form
output_padded, _ = pad_packed_sequence(output, batch_first=True)
```
This prevents padding tokens from polluting the hidden state.
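To see this end to end, here is a self-contained sketch with two toy sequences of lengths 5 and 3; it verifies that `h_n` holds each sequence's state at its true last step rather than at a padded position:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Two sequences: lengths 5 and 3, zero-padded to length 5
padded = torch.randn(2, 5, 8)
padded[1, 3:] = 0.0
lengths = torch.tensor([5, 3])

packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, (h_n, _) = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# h_n[-1] is each sequence's hidden state at its TRUE last step
assert torch.allclose(h_n[-1][0], out[0, 4])  # length-5 sequence ends at index 4
assert torch.allclose(h_n[-1][1], out[1, 2])  # length-3 sequence ends at index 2
assert (out[1, 3:] == 0).all()                # padded positions come back as zeros
```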
RNNs vs Transformers¶
| Aspect | RNNs | Transformers |
|---|---|---|
| Processing | sequential (step by step) | parallel (all at once) |
| Long-range deps | struggle (vanishing gradients) | handle well (direct attention) |
| Speed (training) | slower (cannot parallelize) | faster |
| Speed (inference) | efficient for streaming | expensive for long sequences |
| Data needs | work on smaller datasets | often need more data |
Quick Quiz¶
1. What's the main advantage of LSTM over vanilla RNN?
    a) Faster training
    b) Better handling of long-range dependencies
    c) Fewer parameters
    d) Simpler implementation
2. When should you use bidirectional RNNs?
    a) For text generation
    b) For sequence classification where full context helps
    c) For real-time processing
    d) For variable-length sequences
3. What does `pack_padded_sequence` do?
    a) Adds padding to sequences
    b) Removes padding during computation to avoid processing dummy tokens
    c) Sorts sequences by length
    d) Converts sequences to fixed length
Failure Pattern¶
Using a vanilla RNN on a long sequence and expecting it to capture dependencies from 50+ steps ago. The vanishing gradient problem makes this nearly impossible — use LSTM or GRU instead.
Another failure: not sorting or packing variable-length sequences, which makes the model process padding tokens as if they were real data.
Common Mistakes¶
- forgetting `batch_first=True` and getting confused by tensor shapes
- using the full output tensor when you only need the last hidden state
- applying dropout with only one layer (dropout between layers only works with `num_layers > 1`)
- trying to use a bidirectional RNN for autoregressive generation
Practice¶
- Build an LSTM classifier for a simple sequence task and inspect the last hidden state.
- Compare LSTM vs GRU on the same task and measure whether the difference matters.
- Make the model bidirectional and explain when this would help or hurt.
- Handle variable-length sequences with `pack_padded_sequence` and verify the output.
- Explain why a vanilla RNN struggles with long sequences but LSTM does not.
Checkpoint¶
- [ ] Implement LSTM and GRU for sequence classification
- [ ] Handle variable-length sequences with packing/unpacking
- [ ] Use bidirectional RNNs for full context
- [ ] Compare RNN vs transformer approaches
- [ ] Apply teacher forcing for sequence generation
Runnable Example¶
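A self-contained sketch you can run as-is. The task is a toy one invented for this example — classify whether a sequence's overall mean is positive — but the pattern (LSTM → last hidden state → linear head) is exactly the classifier above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class SequenceClassifier(nn.Module):
    def __init__(self, input_dim=8, hidden_dim=32, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)          # last hidden state summarizes the sequence
        return self.classifier(h_n[-1])

# Toy data: label 1 if the sequence's mean is positive
x = torch.randn(256, 20, 8)                 # (batch, seq_len, features)
y = (x.mean(dim=(1, 2)) > 0).long()

model = SequenceClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(30):                     # full-batch training, enough for a toy task
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

acc = (model(x).argmax(dim=1) == y).float().mean()
print(f"train accuracy: {acc:.2f}")
```

Swapping `nn.LSTM` for `nn.GRU` here is a one-line change, which makes this a convenient harness for the LSTM-vs-GRU comparison in the Practice list.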
Longer Connection¶
Continue with Attention and Transformers for the architecture that largely replaced RNNs for NLP, and Sequential Splits and Lag Features for the data preparation side of sequence tasks.
Further Reading¶
- Understanding LSTM Networks
- The Unreasonable Effectiveness of Recurrent Neural Networks
- PyTorch RNN documentation for implementation details