
Recurrent Networks and Sequences

Scenario: Predicting Stock Prices from Historical Data

You're building a model to forecast stock prices using daily trading data. The price tomorrow depends on recent trends—use RNNs to capture temporal dependencies and predict future values based on sequence patterns.

What This Is

Recurrent neural networks process sequences one step at a time, maintaining a hidden state that carries information from previous steps. They were the standard architecture for text, audio, and time series before transformers — and they are still useful when you need to understand how sequential models work at a fundamental level.

Objectives

  • Understand how RNNs, LSTMs, and GRUs process sequential data
  • Implement recurrent networks for sequence prediction tasks
  • Handle variable-length sequences with padding and packing
  • Apply teacher forcing for stable training
  • Compare RNNs vs transformers for different sequence tasks

When You Use It

  • understanding the foundations of sequence modeling
  • processing variable-length sequences
  • working with time series where each step depends on the recent past
  • building a baseline before reaching for a transformer
  • tasks where the sequential nature of processing is a feature, not a limitation

The Core Idea

At each time step t, the network:

  1. Takes the current input x_t and the previous hidden state h_{t-1}
  2. Produces a new hidden state h_t
  3. Optionally produces an output

h₀ → [RNN] → h₁ → [RNN] → h₂ → [RNN] → h₃ → output
       ↑            ↑            ↑
       x₁           x₂           x₃

The same weights are used at every step — this is weight sharing across time.
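The recurrence can be written out by hand to make the weight sharing explicit. This is a minimal sketch, not `nn.RNN` itself; the dimensions are illustrative:

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 32, 64

# One set of weights, reused at every time step
W_xh = torch.randn(input_dim, hidden_dim) * 0.1
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
b = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b)

h = torch.zeros(hidden_dim)           # h_0
sequence = torch.randn(5, input_dim)  # 5 time steps
for x_t in sequence:
    h = rnn_step(x_t, h)              # same W_xh, W_hh, b each step
print(h.shape)  # torch.Size([64])
```

The final `h` summarizes the whole sequence; this is exactly the value a classifier head would consume.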

Vanilla RNN vs LSTM vs GRU

| Architecture | Memory | Key Idea | Weakness |
|---|---|---|---|
| Vanilla RNN | very short | simplest recurrence | vanishing gradients on long sequences |
| LSTM | long-range | cell state + gates control what to remember/forget | more parameters, slower |
| GRU | long-range | simplified gates, no separate cell state | slightly less expressive than LSTM |

LSTM Gates

An LSTM has three gates that control information flow:

  • Forget gate: what to discard from the cell state
  • Input gate: what new information to add
  • Output gate: what part of the cell state to expose

import torch.nn as nn

lstm = nn.LSTM(
    input_size=32,      # feature dimension per step
    hidden_size=64,     # hidden state dimension
    num_layers=2,       # stacked LSTM layers
    batch_first=True,   # input shape: (batch, seq_len, features)
    dropout=0.2,        # dropout between layers
    bidirectional=False,
)
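A quick forward pass shows what the module returns. Batch size and sequence length here are arbitrary:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
               batch_first=True, dropout=0.2)

x = torch.randn(8, 20, 32)           # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([8, 20, 64])  hidden state at every step
print(h_n.shape)     # torch.Size([2, 8, 64])   final hidden state per layer
print(c_n.shape)     # torch.Size([2, 8, 64])   final cell state per layer
```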

GRU — The Simpler Alternative

gru = nn.GRU(
    input_size=32,
    hidden_size=64,
    num_layers=2,
    batch_first=True,
    dropout=0.2,
)

GRU has fewer parameters than LSTM and often performs comparably. Start with GRU unless you have a specific reason to prefer LSTM.
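The size difference is easy to check directly. A GRU layer has three gate-like weight blocks to an LSTM's four, so with matching dimensions (arbitrary here) it has roughly three quarters of the parameters:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
gru = nn.GRU(input_size=32, hidden_size=64, num_layers=2, batch_first=True)

print(n_params(lstm), n_params(gru))
print(n_params(gru) / n_params(lstm))  # ~0.75
```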

Full Classification Pattern

class SequenceClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, num_layers=2, dropout=0.2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        output, (h_n, c_n) = self.lstm(x)
        # use the last hidden state for classification
        last_hidden = h_n[-1]
        return self.classifier(last_hidden)
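A quick forward pass through this classifier confirms the shapes. Batch size, sequence length, and dimensions here are arbitrary:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            num_layers=2, dropout=0.2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        output, (h_n, c_n) = self.lstm(x)
        return self.classifier(h_n[-1])   # last layer's final hidden state

model = SequenceClassifier(input_dim=32, hidden_dim=64, num_classes=5)
x = torch.randn(8, 20, 32)                # (batch, seq_len, features)
logits = model(x)
print(logits.shape)  # torch.Size([8, 5]) one score per class
```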

Key decisions

  • Last hidden state (h_n[-1]): the summary of the entire sequence → for classification
  • All outputs (output): hidden state at every step → for sequence-to-sequence or tagging
  • Bidirectional: process the sequence forward AND backward, concatenate hidden states

Bidirectional RNNs

lstm = nn.LSTM(input_size=32, hidden_size=64, bidirectional=True, batch_first=True)
# output shape: (batch, seq_len, 2 * hidden_size)
# h_n shape: (2 * num_layers, batch, hidden_size)

Bidirectional models see the full sequence at every position — useful for classification and tagging, but not for generation (where you cannot look ahead).
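A common trip-up is extracting the final states from `h_n` of a bidirectional model: its layout is (num_layers * 2, batch, hidden), ordered forward then backward per layer. A sketch with arbitrary dimensions:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
               bidirectional=True, batch_first=True)
x = torch.randn(8, 20, 32)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([8, 20, 128]) forward and backward concatenated
print(h_n.shape)     # torch.Size([4, 8, 64])   (num_layers * 2, batch, hidden)

# For classification, concatenate the top layer's two directions:
h_fwd, h_bwd = h_n[-2], h_n[-1]              # last layer, forward and backward
summary = torch.cat([h_fwd, h_bwd], dim=1)   # (batch, 2 * hidden_size)
print(summary.shape)  # torch.Size([8, 128])
```

Note that the classifier head must then take `2 * hidden_size` inputs, not `hidden_size`.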

Handling Variable-Length Sequences

Real sequences have different lengths. PyTorch provides:

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Pack: skip padding during computation
packed = pack_padded_sequence(padded_input, lengths, batch_first=True, enforce_sorted=False)
output, (h_n, c_n) = lstm(packed)

# Unpack: restore to padded form
output_padded, _ = pad_packed_sequence(output, batch_first=True)

This prevents padding tokens from polluting the hidden state.
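A minimal end-to-end check, with arbitrary shapes and lengths: after unpacking, positions beyond each sequence's true length come back as zeros, because the LSTM never processed them.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Two sequences of lengths 5 and 3, padded to length 5
padded_input = torch.randn(2, 5, 8)
padded_input[1, 3:] = 0.0             # padding positions
lengths = torch.tensor([5, 3])

packed = pack_padded_sequence(padded_input, lengths, batch_first=True,
                              enforce_sorted=False)
output, (h_n, c_n) = lstm(packed)
output_padded, out_lengths = pad_packed_sequence(output, batch_first=True)

print(out_lengths)                       # tensor([5, 3])
print(output_padded[1, 3:].abs().sum())  # tensor(0.)  padding was skipped
```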

RNNs vs Transformers

| Aspect | RNNs | Transformers |
|---|---|---|
| Processing | sequential (step by step) | parallel (all at once) |
| Long-range deps | struggle (vanishing gradients) | handle well (direct attention) |
| Speed (training) | slower (cannot parallelize across time) | faster |
| Speed (inference) | efficient for streaming | expensive for long sequences |
| Data needs | work on smaller datasets | often need more data |

Quick Quiz

  1. What's the main advantage of LSTM over vanilla RNN?
    a) Faster training
    b) Better handling of long-range dependencies
    c) Fewer parameters
    d) Simpler implementation

  2. When should you use bidirectional RNNs?
    a) For text generation
    b) For sequence classification where full context helps
    c) For real-time processing
    d) For variable-length sequences

  3. What does pack_padded_sequence do?
    a) Adds padding to sequences
    b) Removes padding during computation to avoid processing dummy tokens
    c) Sorts sequences by length
    d) Converts sequences to fixed length

Failure Pattern

Using a vanilla RNN on a long sequence and expecting it to capture dependencies from 50+ steps ago. The vanishing gradient problem makes this nearly impossible — use LSTM or GRU instead.

Another failure: not sorting or packing variable-length sequences, which makes the model process padding tokens as if they were real data.

Common Mistakes

  • forgetting batch_first=True and getting confused by tensor shapes
  • using the full output tensor when you only need the last hidden state
  • applying dropout with only one layer (dropout between layers only works with num_layers > 1)
  • trying to use a bidirectional RNN for autoregressive generation

Practice

  1. Build an LSTM classifier for a simple sequence task and inspect the last hidden state.
  2. Compare LSTM vs GRU on the same task and measure whether the difference matters.
  3. Make the model bidirectional and explain when this would help or hurt.
  4. Handle variable-length sequences with pack_padded_sequence and verify the output.
  5. Explain why a vanilla RNN struggles with long sequences but LSTM does not.

Checkpoint

  • [ ] Implement LSTM and GRU for sequence classification
  • [ ] Handle variable-length sequences with packing/unpacking
  • [ ] Use bidirectional RNNs for full context
  • [ ] Compare RNN vs transformer approaches
  • [ ] Apply teacher forcing for sequence generation

Runnable Example

This example stays local-only for now because the browser runner does not yet include PyTorch.

Longer Connection

Continue with Attention and Transformers for the architecture that largely replaced RNNs for NLP, and Sequential Splits and Lag Features for the data preparation side of sequence tasks.

Further Reading