
Convolutional Neural Networks

Scenario: Detecting Pneumonia in Chest X-Rays

You're a machine learning engineer at a hospital. Radiologists are overwhelmed, so you're building an AI to classify chest X-rays as normal or pneumonia-positive. The images are 224×224 pixels, and you need a model that captures local patterns like lung textures and fluid accumulations. A fully connected network would ignore those spatial relationships; this module covers the CNN building blocks you need for an effective image classifier.

Learning Objectives

By the end of this module (30–45 minutes), you should be able to:

  • Explain how convolutions extract spatial features from images.
  • Build a basic CNN architecture with conv layers, pooling, and batch norm.
  • Choose appropriate kernel sizes, strides, and padding for different tasks.
  • Diagnose common CNN training issues like vanishing gradients or overfitting.
  • Implement global average pooling to reduce parameters in the classifier head.

Prerequisites: Basic PyTorch (nn.Module, Sequential); understanding of images as tensors. Difficulty: Intermediate.

What This Is

Convolutional neural networks detect spatial patterns — edges, textures, shapes — by sliding small learned filters across the input. Unlike fully connected layers, convolutions share weights across positions, which makes them efficient and translation-invariant.

The core insight is that local patterns matter more than global position. A vertical edge is a vertical edge whether it appears on the left or the right of the image.
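Weight sharing makes this concrete: shifting the input shifts the convolution's output by the same amount, because the same filter is applied at every position. A minimal sketch (a single bright pixel and an arbitrary random filter):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                        # one bright pixel
shifted = torch.roll(x, shifts=3, dims=3)  # same pixel, 3 columns to the right

y = conv(x)
y_shifted = conv(shifted)

# Shifting the input shifts the response identically (the pixel is far
# enough from the border that padding effects don't interfere)
print(torch.allclose(torch.roll(y, shifts=3, dims=3), y_shifted))  # True
```

A fully connected layer has no such guarantee: it learns a separate weight for every input position.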

When You Use It

  • classifying or analyzing images
  • extracting features from grid-structured data
  • building a backbone for detection, segmentation, or generation
  • when the input has spatial structure that a fully connected layer would ignore

Architecture Building Blocks

| Layer | What It Does | Key Parameters |
| --- | --- | --- |
| nn.Conv2d | applies learned filters to extract spatial features | in_channels, out_channels, kernel_size, stride, padding |
| nn.MaxPool2d | downsamples by taking the maximum in each window | kernel_size, stride |
| nn.AdaptiveAvgPool2d | pools to a fixed output size regardless of input size | output_size |
| nn.BatchNorm2d | normalizes per channel for stable training | num_features |
| nn.Dropout2d | drops entire channels to regularize | p |
| nn.Flatten | reshapes spatial output into a vector for the classifier head | start_dim, end_dim |

How Convolution Works

A 3×3 filter slides across the input, computing a dot product at each position:

Input (5×5)         Filter (3×3)        Output (3×3)
┌─────────────┐     ┌─────────┐         ┌─────────┐
│ . . . . .   │     │ w w w   │         │ o o o   │
│ . . . . .   │  ×  │ w w w   │    =    │ o o o   │
│ . . . . .   │     │ w w w   │         │ o o o   │
│ . . . . .   │     └─────────┘         └─────────┘
│ . . . . .   │
└─────────────┘
  • Stride controls how far the filter moves each step (stride=2 halves the spatial size)
  • Padding adds zeros around the input to control the output size
  • padding="same" keeps the spatial dimensions unchanged
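These rules reduce to output = floor((n + 2p − k) / s) + 1 for input size n, kernel k, stride s, and padding p. A quick sketch checking the formula against PyTorch, with sizes matching the diagram above:

```python
import torch
import torch.nn as nn

def conv_out(n, k, s=1, p=0):
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

x = torch.randn(1, 1, 5, 5)

# No padding: 5x5 input, 3x3 kernel -> 3x3 output, as in the diagram
y = nn.Conv2d(1, 1, kernel_size=3)(x)
print(y.shape[-1], conv_out(5, k=3))            # 3 3

# Stride 2 roughly halves the spatial size: 5 -> 3 here
y = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)(x)
print(y.shape[-1], conv_out(5, k=3, s=2, p=1))  # 3 3

# padding="same" keeps dimensions unchanged (requires stride 1)
y = nn.Conv2d(1, 1, kernel_size=3, padding="same")(x)
print(y.shape[-1])                              # 5
```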

Minimal Architecture

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),    # 3 channels in, 32 out
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),                                 # halve spatial size

    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(2),

    nn.AdaptiveAvgPool2d(1),                         # global average pool
    nn.Flatten(),
    nn.Linear(64, 10),                               # 10-class output
)
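A quick sanity check of the shapes (the model above restated so the snippet runs on its own; the 32×32 input size is an arbitrary choice):

```python
import torch
import torch.nn as nn

# Same architecture as above
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

out = model(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 10])

# Because AdaptiveAvgPool2d collapses any spatial size to 1x1,
# the same model also accepts 224x224 X-ray-sized inputs:
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])
```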

The Design Ladder

  1. Start small: 2–3 conv blocks with batch norm and pooling
  2. Check if it overfits one batch — if not, the architecture is broken
  3. Add depth gradually — more layers, more filters
  4. Use pretrained backbones (ResNet, EfficientNet) before designing from scratch
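Step 2 can be sketched as a smoke test on random data (the shapes and hyperparameters here are arbitrary choices, not a recipe):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)      # one fixed batch
y = torch.randint(0, 10, (8,))     # arbitrary labels to memorize

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(300):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# A correctly wired model should drive the loss toward zero on one batch;
# if the loss plateaus near its initial value, something is miswired.
print(f"final loss: {loss.item():.4f}")
```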

Common Architectures

| Architecture | Depth | Key Idea |
| --- | --- | --- |
| LeNet | shallow | the original: conv → pool → conv → pool → FC |
| VGG | 16–19 layers | stack 3×3 convs, double channels |
| ResNet | 18–152 layers | skip connections to enable extreme depth |
| EfficientNet | scalable | compound scaling of width, depth, resolution |

Receptive Field

Each layer "sees" a larger region of the input. Stacking 3×3 convolutions grows the receptive field:

  • 1 layer: 3×3
  • 2 layers: 5×5
  • 3 layers: 7×7

This is why small filters stacked deep are preferred over large filters — same receptive field, fewer parameters.
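The trade-off is easy to quantify by counting parameters. A sketch, using an arbitrary channel width of 64:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

C = 64  # channels in and out; an arbitrary example width

# Three stacked 3x3 convs: 7x7 receptive field
stacked = nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1) for _ in range(3)])

# One 7x7 conv: same receptive field
single = nn.Conv2d(C, C, 7, padding=3)

print(n_params(stacked))  # 3 * (64*64*3*3 + 64) = 110784
print(n_params(single))   # 64*64*7*7 + 64      = 200768
```

The stacked version also interleaves nonlinearities between the convs in a real network, which adds expressive power the single large filter lacks.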

Failure Pattern

Building a CNN with no pooling or stride, keeping the spatial size constant throughout. The model becomes extremely slow and memory-hungry without learning hierarchical features.

Another failure: flattening too early. If you flatten a large feature map, the classifier head has millions of parameters and overfits immediately.
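The cost of flattening early shows up directly in the classifier head's parameter count. A sketch, assuming a 64-channel 56×56 feature map:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Flattening a 64x56x56 feature map straight into a linear layer:
early_flatten = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 56 * 56, 10),   # 64*56*56*10 + 10 = 2,007,050 parameters
)

# Global average pooling first:
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),             # 64*10 + 10 = 650 parameters
)

print(n_params(early_flatten), n_params(gap_head))  # 2007050 650
```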

Common Mistakes

  • forgetting padding=1 with 3×3 kernels, which shrinks the feature map each layer
  • using too many fully connected layers after convolutions instead of global average pooling
  • not using batch normalization, which makes deep CNNs much harder to train
  • applying the model to inputs of the wrong spatial size without AdaptiveAvgPool2d

Practice

  1. Build a 3-layer CNN and train it on a small image dataset.
  2. Replace MaxPool2d with a strided convolution and compare.
  3. Remove batch normalization and observe the effect on convergence.
  4. Overfit a single batch to verify the architecture is correctly wired.
  5. Swap your custom CNN for a pretrained ResNet and compare accuracy.

Case Study: CNNs in Self-Driving Cars

Tesla's Autopilot uses CNNs to process camera feeds, detecting lanes, pedestrians, and traffic signs. Early versions struggled with edge cases like rain or low light, but adding data augmentation and deeper architectures improved robustness—showing how CNN design impacts real-world safety.

Quick Quiz

  1. Why do CNNs use shared weights across positions?
    a) To increase model size
    b) To make the model translation-invariant and reduce parameters
    c) To handle variable input sizes
    d) To speed up training

  2. What happens if you forget padding in a 3×3 conv layer?
    a) The model becomes slower
    b) Spatial dimensions shrink each layer
    c) Memory usage increases
    d) Accuracy improves

  3. How does global average pooling help in CNN classifiers?
    a) It increases the number of parameters
    b) It reduces the feature map to a fixed-size vector without millions of parameters
    c) It adds more convolutional layers
    d) It normalizes the features

  4. In the pneumonia detection scenario, why is a CNN better than an MLP?
    a) CNNs are faster to train
    b) CNNs capture spatial patterns like lung textures
    c) CNNs use less memory
    d) CNNs are easier to implement

Checkpoint

  • [ ] Build and train a basic CNN on image data
  • [ ] Experiment with different kernel sizes and observe output shapes
  • [ ] Implement global average pooling for efficient classification
  • [ ] Add batch normalization and dropout for stable training
  • [ ] Compare custom CNN vs pretrained backbone performance

Runnable Example

This example stays local-only for now because the browser runner does not yet include PyTorch.

Longer Connection

Continue with Transfer and Fine-Tuning for using pretrained CNN backbones, and Vision Augmentation and Shift Robustness for making CNNs more robust.