
Convolutional Neural Networks

Scenario: Detecting Pneumonia in Chest X-Rays

You're a machine learning engineer at a hospital. Radiologists are overwhelmed, so you're building an AI to classify chest X-rays as normal or pneumonia-positive. The images are 224×224 pixels, and you need a model that captures local patterns like lung textures and fluid accumulations. A fully connected network would ignore those spatial relationships; this module covers the CNN building blocks you need for an effective image classifier.

Learning Objectives

By the end of this module (30–45 minutes), you should be able to:

  • Explain how convolutions extract spatial features from images.
  • Build a basic CNN architecture with conv layers, pooling, and batch norm.
  • Choose appropriate kernel sizes, strides, and padding for different tasks.
  • Diagnose common CNN training issues like vanishing gradients or overfitting.
  • Implement global average pooling to reduce parameters in the classifier head.

Prerequisites: Basic PyTorch (nn.Module, Sequential); understanding of images as tensors. Difficulty: Intermediate.

What This Is

Convolutional neural networks detect spatial patterns — edges, textures, shapes — by sliding small learned filters across the input. Unlike fully connected layers, convolutions share weights across positions, which makes them efficient and translation-invariant.

The core insight is that local patterns matter more than global position. A vertical edge is a vertical edge whether it appears on the left or the right of the image.
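Weight sharing makes this concrete: shifting the input shifts the convolution's output by the same amount, because the same filter is applied at every position. A minimal sketch (a single bright pixel and an arbitrary random filter):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                        # one bright pixel
shifted = torch.roll(x, shifts=3, dims=3)  # same pixel, 3 columns to the right

y = conv(x)
y_shifted = conv(shifted)

# Shifting the input shifts the response identically (the pixel is far
# enough from the border that padding effects don't interfere)
print(torch.allclose(torch.roll(y, shifts=3, dims=3), y_shifted))  # True
```

A fully connected layer has no such guarantee: it learns a separate weight for every input position.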

When You Use It

  • classifying or analyzing images
  • extracting features from grid-structured data
  • building a backbone for detection, segmentation, or generation
  • when the input has spatial structure that a fully connected layer would ignore

Architecture Building Blocks

| Layer | What It Does | Key Parameters |
| --- | --- | --- |
| nn.Conv2d | applies learned filters to extract spatial features | in_channels, out_channels, kernel_size, stride, padding |
| nn.MaxPool2d | downsamples by taking the maximum in each window | kernel_size, stride |
| nn.AdaptiveAvgPool2d | pools to a fixed output size regardless of input size | output_size |
| nn.BatchNorm2d | normalizes per channel for stable training | num_features |
| nn.Dropout2d | drops entire channels to regularize | p |
| nn.Flatten | reshapes spatial output into a vector for the classifier head | start_dim, end_dim |

How Convolution Works

A 3×3 filter slides across the input, computing a dot product at each position:

Input (5×5)         Filter (3×3)        Output (3×3)
┌─────────────┐     ┌─────────┐         ┌─────────┐
│ . . . . .   │     │ w w w   │         │ o o o   │
│ . . . . .   │  ×  │ w w w   │    =    │ o o o   │
│ . . . . .   │     │ w w w   │         │ o o o   │
│ . . . . .   │     └─────────┘         └─────────┘
│ . . . . .   │
└─────────────┘
  • Stride controls how far the filter moves each step (stride=2 halves the spatial size)
  • Padding adds zeros around the input to control the output size
  • padding="same" keeps the spatial dimensions unchanged
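These rules reduce to output = floor((n + 2p − k) / s) + 1 for input size n, kernel k, stride s, and padding p. A quick sketch checking the formula against PyTorch, with sizes matching the diagram above:

```python
import torch
import torch.nn as nn

def conv_out(n, k, s=1, p=0):
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

x = torch.randn(1, 1, 5, 5)

# No padding: 5x5 input, 3x3 kernel -> 3x3 output, as in the diagram
y = nn.Conv2d(1, 1, kernel_size=3)(x)
print(y.shape[-1], conv_out(5, k=3))            # 3 3

# Stride 2 roughly halves the spatial size: 5 -> 3 here
y = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)(x)
print(y.shape[-1], conv_out(5, k=3, s=2, p=1))  # 3 3

# padding="same" keeps dimensions unchanged (requires stride 1)
y = nn.Conv2d(1, 1, kernel_size=3, padding="same")(x)
print(y.shape[-1])                              # 5
```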

Minimal Architecture

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),    # 3 channels in, 32 out
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),                                 # halve spatial size

    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(2),

    nn.AdaptiveAvgPool2d(1),                         # global average pool
    nn.Flatten(),
    nn.Linear(64, 10),                               # 10-class output
)
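A quick sanity check of the shapes (the model above restated so the snippet runs on its own; the 32×32 input size is an arbitrary choice):

```python
import torch
import torch.nn as nn

# Same architecture as above
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

out = model(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 10])

# Because AdaptiveAvgPool2d collapses any spatial size to 1x1,
# the same model also accepts 224x224 X-ray-sized inputs:
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])
```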

The Design Ladder

  1. Start small: 2–3 conv blocks with batch norm and pooling
  2. Check if it overfits one batch — if not, the architecture is broken
  3. Add depth gradually — more layers, more filters
  4. Use pretrained backbones (ResNet, EfficientNet) before designing from scratch
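Step 2 can be sketched as a smoke test on random data (the shapes and hyperparameters here are arbitrary choices, not a recipe):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)      # one fixed batch
y = torch.randint(0, 10, (8,))     # arbitrary labels to memorize

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(300):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# A correctly wired model should drive the loss toward zero on one batch;
# if the loss plateaus near its initial value, something is miswired.
print(f"final loss: {loss.item():.4f}")
```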

Common Architectures

| Architecture | Depth | Key Idea |
| --- | --- | --- |
| LeNet | shallow | the original: conv → pool → conv → pool → FC |
| VGG | 16–19 layers | stack 3×3 convs, double channels |
| ResNet | 18–152 layers | skip connections to enable extreme depth |
| EfficientNet | scalable | compound scaling of width, depth, resolution |

Receptive Field

Each layer "sees" a larger region of the input. Stacking 3×3 convolutions grows the receptive field:

  • 1 layer: 3×3
  • 2 layers: 5×5
  • 3 layers: 7×7

This is why small filters stacked deep are preferred over large filters — same receptive field, fewer parameters.
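The trade-off is easy to quantify by counting parameters. A sketch, using an arbitrary channel width of 64:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

C = 64  # channels in and out; an arbitrary example width

# Three stacked 3x3 convs: 7x7 receptive field
stacked = nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1) for _ in range(3)])

# One 7x7 conv: same receptive field
single = nn.Conv2d(C, C, 7, padding=3)

print(n_params(stacked))  # 3 * (64*64*3*3 + 64) = 110784
print(n_params(single))   # 64*64*7*7 + 64      = 200768
```

The stacked version also interleaves nonlinearities between the convs in a real network, which adds expressive power the single large filter lacks.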

Failure Pattern

Building a CNN with no pooling or stride, keeping the spatial size constant throughout. The model becomes extremely slow and memory-hungry without learning hierarchical features.

Another failure: flattening too early. If you flatten a large feature map, the classifier head has millions of parameters and overfits immediately.
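The cost of flattening early shows up directly in the classifier head's parameter count. A sketch, assuming a 64-channel 56×56 feature map:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Flattening a 64x56x56 feature map straight into a linear layer:
early_flatten = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 56 * 56, 10),   # 64*56*56*10 + 10 = 2,007,050 parameters
)

# Global average pooling first:
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),             # 64*10 + 10 = 650 parameters
)

print(n_params(early_flatten), n_params(gap_head))  # 2007050 650
```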

Common Mistakes

  • forgetting padding=1 with 3×3 kernels, which shrinks the feature map each layer
  • using too many fully connected layers after convolutions instead of global average pooling
  • not using batch normalization, which makes deep CNNs much harder to train
  • applying the model to inputs of the wrong spatial size without AdaptiveAvgPool2d

Practice

  1. Build a 3-layer CNN and train it on a small image dataset.
  2. Replace MaxPool2d with a strided convolution and compare.
  3. Remove batch normalization and observe the effect on convergence.
  4. Overfit a single batch to verify the architecture is correctly wired.
  5. Swap your custom CNN for a pretrained ResNet and compare accuracy.

Case Study: CNNs in Self-Driving Cars

Tesla's Autopilot uses CNNs to process camera feeds, detecting lanes, pedestrians, and traffic signs. Early versions struggled with edge cases like rain or low light, but adding data augmentation and deeper architectures improved robustness—showing how CNN design impacts real-world safety.

Quick Quiz

  1. Why do CNNs use shared weights across positions?
    a) To increase model size
    b) To make the model translation-invariant and reduce parameters
    c) To handle variable input sizes
    d) To speed up training

  2. What happens if you forget padding in a 3×3 conv layer?
    a) The model becomes slower
    b) Spatial dimensions shrink each layer
    c) Memory usage increases
    d) Accuracy improves

  3. How does global average pooling help in CNN classifiers?
    a) It increases the number of parameters
    b) It reduces the feature map to a fixed-size vector without millions of parameters
    c) It adds more convolutional layers
    d) It normalizes the features

  4. In the pneumonia detection scenario, why is a CNN better than an MLP?
    a) CNNs are faster to train
    b) CNNs capture spatial patterns like lung textures
    c) CNNs use less memory
    d) CNNs are easier to implement

Checkpoint

  • [ ] Build and train a basic CNN on image data
  • [ ] Experiment with different kernel sizes and observe output shapes
  • [ ] Implement global average pooling for efficient classification
  • [ ] Add batch normalization and dropout for stable training
  • [ ] Compare custom CNN vs pretrained backbone performance

Runnable Example

This example stays local-only for now because the browser runner does not yet include PyTorch.

Longer Connection

Continue with Transfer and Fine-Tuning for using pretrained CNN backbones, and Vision Augmentation and Shift Robustness for making CNNs more robust.