
Recurrent Networks and Sequences

Scenario: Predicting Stock Prices from Historical Data

You're building a model to forecast stock prices using daily trading data. The price tomorrow depends on recent trends—use RNNs to capture temporal dependencies and predict future values based on sequence patterns.

What This Is

Recurrent neural networks process sequences one step at a time, maintaining a hidden state that carries information from previous steps. They were the standard architecture for text, audio, and time series before transformers — and they are still useful when you need to understand how sequential models work at a fundamental level.

Objectives

  • Understand how RNNs, LSTMs, and GRUs process sequential data
  • Implement recurrent networks for sequence prediction tasks
  • Handle variable-length sequences with padding and packing
  • Apply teacher forcing for stable training
  • Compare RNNs vs transformers for different sequence tasks

When You Use It

  • understanding the foundations of sequence modeling
  • processing variable-length sequences
  • working with time series where each step depends on the recent past
  • building a baseline before reaching for a transformer
  • tasks where the sequential nature of processing is a feature, not a limitation

The Core Idea

At each time step t, the network:

  1. Takes the current input x_t and the previous hidden state h_{t-1}
  2. Produces a new hidden state h_t
  3. Optionally produces an output

h₀ → [RNN] → h₁ → [RNN] → h₂ → [RNN] → h₃ → output
       ↑            ↑            ↑
       x₁           x₂           x₃

The same weights are used at every step — this is weight sharing across time.
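The recurrence can be written out by hand to make the weight sharing explicit. This is a minimal sketch, not `nn.RNN` itself; the dimensions are illustrative:

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 32, 64

# One set of weights, reused at every time step
W_xh = torch.randn(input_dim, hidden_dim) * 0.1
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
b = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b)

h = torch.zeros(hidden_dim)           # h_0
sequence = torch.randn(5, input_dim)  # 5 time steps
for x_t in sequence:
    h = rnn_step(x_t, h)              # same W_xh, W_hh, b each step
print(h.shape)  # torch.Size([64])
```

The final `h` summarizes the whole sequence; this is exactly the value a classifier head would consume.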

Vanilla RNN vs LSTM vs GRU

| Architecture | Memory | Key Idea | Weakness |
|---|---|---|---|
| Vanilla RNN | very short | simplest recurrence | vanishing gradients on long sequences |
| LSTM | long-range | cell state + gates control what to remember/forget | more parameters, slower |
| GRU | long-range | simplified gates, no separate cell state | slightly less expressive than LSTM |

LSTM Gates

An LSTM has three gates that control information flow:

  • Forget gate: what to discard from the cell state
  • Input gate: what new information to add
  • Output gate: what part of the cell state to expose

import torch.nn as nn

lstm = nn.LSTM(
    input_size=32,      # feature dimension per step
    hidden_size=64,     # hidden state dimension
    num_layers=2,       # stacked LSTM layers
    batch_first=True,   # input shape: (batch, seq_len, features)
    dropout=0.2,        # dropout between layers
    bidirectional=False,
)
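A quick forward pass shows what the module returns. Batch size and sequence length here are arbitrary:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
               batch_first=True, dropout=0.2)

x = torch.randn(8, 20, 32)           # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([8, 20, 64])  hidden state at every step
print(h_n.shape)     # torch.Size([2, 8, 64])   final hidden state per layer
print(c_n.shape)     # torch.Size([2, 8, 64])   final cell state per layer
```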

GRU — The Simpler Alternative

gru = nn.GRU(
    input_size=32,
    hidden_size=64,
    num_layers=2,
    batch_first=True,
    dropout=0.2,
)

GRU has fewer parameters than LSTM and often performs comparably. Start with GRU unless you have a specific reason to prefer LSTM.
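The size difference is easy to check directly. A GRU layer has three gate-like weight blocks to an LSTM's four, so with matching dimensions (arbitrary here) it has roughly three quarters of the parameters:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
gru = nn.GRU(input_size=32, hidden_size=64, num_layers=2, batch_first=True)

print(n_params(lstm), n_params(gru))
print(n_params(gru) / n_params(lstm))  # ~0.75
```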

Full Classification Pattern

class SequenceClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, num_layers=2, dropout=0.2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        output, (h_n, c_n) = self.lstm(x)
        # use the last hidden state for classification
        last_hidden = h_n[-1]
        return self.classifier(last_hidden)
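A quick forward pass through this classifier confirms the shapes. Batch size, sequence length, and dimensions here are arbitrary:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            num_layers=2, dropout=0.2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        output, (h_n, c_n) = self.lstm(x)
        return self.classifier(h_n[-1])   # last layer's final hidden state

model = SequenceClassifier(input_dim=32, hidden_dim=64, num_classes=5)
x = torch.randn(8, 20, 32)                # (batch, seq_len, features)
logits = model(x)
print(logits.shape)  # torch.Size([8, 5]) one score per class
```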

Key decisions

  • Last hidden state (h_n[-1]): the summary of the entire sequence → for classification
  • All outputs (output): hidden state at every step → for sequence-to-sequence or tagging
  • Bidirectional: process the sequence forward AND backward, concatenate hidden states

Bidirectional RNNs

lstm = nn.LSTM(input_size=32, hidden_size=64, bidirectional=True, batch_first=True)
# output shape: (batch, seq_len, 2 * hidden_size)
# h_n shape: (2 * num_layers, batch, hidden_size)

Bidirectional models see the full sequence at every position — useful for classification and tagging, but not for generation (where you cannot look ahead).
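A common trip-up is extracting the final states from `h_n` of a bidirectional model: its layout is (num_layers * 2, batch, hidden), ordered forward then backward per layer. A sketch with arbitrary dimensions:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
               bidirectional=True, batch_first=True)
x = torch.randn(8, 20, 32)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([8, 20, 128]) forward and backward concatenated
print(h_n.shape)     # torch.Size([4, 8, 64])   (num_layers * 2, batch, hidden)

# For classification, concatenate the top layer's two directions:
h_fwd, h_bwd = h_n[-2], h_n[-1]              # last layer, forward and backward
summary = torch.cat([h_fwd, h_bwd], dim=1)   # (batch, 2 * hidden_size)
print(summary.shape)  # torch.Size([8, 128])
```

Note that the classifier head must then take `2 * hidden_size` inputs, not `hidden_size`.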

Handling Variable-Length Sequences

Real sequences have different lengths. PyTorch provides:

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Pack: skip padding during computation
packed = pack_padded_sequence(padded_input, lengths, batch_first=True, enforce_sorted=False)
output, (h_n, c_n) = lstm(packed)

# Unpack: restore to padded form
output_padded, _ = pad_packed_sequence(output, batch_first=True)

This prevents padding tokens from polluting the hidden state.
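A minimal end-to-end check, with arbitrary shapes and lengths: after unpacking, positions beyond each sequence's true length come back as zeros, because the LSTM never processed them.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Two sequences of lengths 5 and 3, padded to length 5
padded_input = torch.randn(2, 5, 8)
padded_input[1, 3:] = 0.0             # padding positions
lengths = torch.tensor([5, 3])

packed = pack_padded_sequence(padded_input, lengths, batch_first=True,
                              enforce_sorted=False)
output, (h_n, c_n) = lstm(packed)
output_padded, out_lengths = pad_packed_sequence(output, batch_first=True)

print(out_lengths)                       # tensor([5, 3])
print(output_padded[1, 3:].abs().sum())  # tensor(0.)  padding was skipped
```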

RNNs vs Transformers

| Aspect | RNNs | Transformers |
|---|---|---|
| Processing | sequential (step by step) | parallel (all at once) |
| Long-range deps | struggle (vanishing gradients) | handle well (direct attention) |
| Speed (training) | slower (cannot parallelize across time) | faster |
| Speed (inference) | efficient for streaming | expensive for long sequences |
| Data needs | work on smaller datasets | often need more data |

Quick Quiz

  1. What's the main advantage of LSTM over vanilla RNN?
    a) Faster training
    b) Better handling of long-range dependencies
    c) Fewer parameters
    d) Simpler implementation

  2. When should you use bidirectional RNNs?
    a) For text generation
    b) For sequence classification where full context helps
    c) For real-time processing
    d) For variable-length sequences

  3. What does pack_padded_sequence do?
    a) Adds padding to sequences
    b) Removes padding during computation to avoid processing dummy tokens
    c) Sorts sequences by length
    d) Converts sequences to fixed length

Failure Pattern

Using a vanilla RNN on a long sequence and expecting it to capture dependencies from 50+ steps ago. The vanishing gradient problem makes this nearly impossible — use LSTM or GRU instead.

Another failure: not sorting or packing variable-length sequences, which makes the model process padding tokens as if they were real data.

Common Mistakes

  • forgetting batch_first=True and getting confused by tensor shapes
  • using the full output tensor when you only need the last hidden state
  • applying dropout with only one layer (dropout between layers only works with num_layers > 1)
  • trying to use a bidirectional RNN for autoregressive generation

Practice

  1. Build an LSTM classifier for a simple sequence task and inspect the last hidden state.
  2. Compare LSTM vs GRU on the same task and measure whether the difference matters.
  3. Make the model bidirectional and explain when this would help or hurt.
  4. Handle variable-length sequences with pack_padded_sequence and verify the output.
  5. Explain why a vanilla RNN struggles with long sequences but LSTM does not.

Checkpoint

  • [ ] Implement LSTM and GRU for sequence classification
  • [ ] Handle variable-length sequences with packing/unpacking
  • [ ] Use bidirectional RNNs for full context
  • [ ] Compare RNN vs transformer approaches
  • [ ] Apply teacher forcing for sequence generation

Runnable Example

This example stays local-only for now because the browser runner does not yet include PyTorch.

Longer Connection

Continue with Attention and Transformers for the architecture that largely replaced RNNs for NLP, and Sequential Splits and Lag Features for the data preparation side of sequence tasks.

Further Reading