Object Detection Basics

What This Is

Object detection goes beyond classification — instead of asking "what is in the image?", it asks "what is in the image, and where is it?" The model outputs bounding boxes with class labels and confidence scores.

The key shift: classification produces one label per image; detection produces multiple boxes per image, each with its own label.

When You Use It

  • locating and counting objects in images
  • building surveillance, autonomous driving, or quality inspection systems
  • when you need to know where things are, not just what they are
  • when multiple objects of different classes appear in a single image

Detection vs Classification vs Segmentation

Task                    Output                     Granularity
Classification          one label per image        image-level
Object Detection        bounding boxes + labels    box-level
Semantic Segmentation   class label per pixel      pixel-level
Instance Segmentation   object mask per instance   pixel-level, instance-aware

How Detection Works

Most modern detectors follow one of two paradigms:

One-Stage Detectors (YOLO, SSD, RetinaNet)

  • Process the entire image in a single pass
  • Fast, suitable for real-time applications
  • Divide the image into a grid, predict boxes and classes at each cell

Two-Stage Detectors (Faster R-CNN, Mask R-CNN)

  • First stage proposes candidate regions
  • Second stage classifies and refines each region
  • More accurate but slower

Key Concepts

Bounding Boxes

A detection output is typically: [x_min, y_min, x_max, y_max, class_id, confidence]

Intersection over Union (IoU)

IoU measures how much a predicted box overlaps with the ground truth:

IoU = Area of Overlap / Area of Union
  • IoU = 1.0: perfect match
  • IoU > 0.5: typically considered a correct detection
  • IoU = 0.0: no overlap at all
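
The formula above can be sketched as a small helper, assuming boxes in corner format [x_min, y_min, x_max, y_max]:

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x_min, y_min, x_max, y_max] boxes."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two identical boxes give 1.0, and a box shifted halfway across an equal-size box gives 25 / 175 ≈ 0.14 — less overlap than intuition often suggests.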

Non-Maximum Suppression (NMS)

Detectors often produce multiple overlapping boxes for the same object. NMS keeps only the highest-confidence box and suppresses duplicates:

  1. Sort boxes by confidence
  2. Keep the top box
  3. Remove all boxes with IoU > threshold with the kept box
  4. Repeat for remaining boxes
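
The four steps above map directly to a greedy loop; a minimal sketch, assuming detections as [x_min, y_min, x_max, y_max, confidence]:

```python
def nms(detections, iou_threshold=0.5):
    """Greedy NMS over [x_min, y_min, x_max, y_max, confidence] boxes."""
    def iou(a, b):
        # Intersection rectangle, clamped to zero if the boxes are disjoint
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    boxes = sorted(detections, key=lambda b: b[4], reverse=True)  # 1. sort by confidence
    kept = []
    while boxes:                       # 4. repeat for remaining boxes
        top = boxes.pop(0)             # 2. keep the top box
        kept.append(top)
        # 3. remove boxes with IoU > threshold against the kept box
        boxes = [b for b in boxes if iou(top, b) <= iou_threshold]
    return kept
```

Note that production NMS is usually applied per class, so a car box never suppresses an overlapping pedestrian box.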

Mean Average Precision (mAP)

The standard detection metric:

  1. For each class, compute the precision-recall curve
  2. Compute average precision (area under the PR curve)
  3. Average across all classes
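
One common (non-interpolated) way to compute the per-class AP in step 2, assuming detections have already been sorted by descending confidence and matched to ground truth as true/false positives:

```python
def average_precision(tp_flags, num_gt):
    """AP for one class.

    tp_flags: per-detection True/False (true positive or not), sorted by
              descending confidence.
    num_gt:   number of ground-truth objects of this class.
    """
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for is_tp in tp_flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / num_gt
        precision = tp / (tp + fp)
        # Accumulate the area under the PR curve as a rectangle per step
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

Benchmarks such as COCO additionally interpolate the precision curve and average AP over several IoU thresholds, but the area-under-PR idea is the same.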

Common Architectures

Model         Type               Speed      Accuracy   Best For
YOLOv8        one-stage          very fast  good       real-time applications
SSD           one-stage          fast       moderate   embedded/mobile
RetinaNet     one-stage          medium     good       handling class imbalance (focal loss)
Faster R-CNN  two-stage          slower     high       when accuracy matters most
DETR          transformer-based  medium     high       end-to-end, no NMS needed

Anchor Boxes

Most detectors use predefined anchor boxes at multiple scales and aspect ratios:

  • Small anchors catch small objects
  • Large anchors catch large objects
  • The model predicts offsets from these anchors, not absolute coordinates

DETR is a notable exception — it uses learned object queries instead of anchors.
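
A sketch of how predicted offsets are turned back into a box, using the center/log-scale parameterization popularized by Faster R-CNN-style detectors (other detectors use variants of this scheme):

```python
import math

def decode(anchor, offsets):
    """Decode offsets (tx, ty, tw, th) against an anchor [x_min, y_min, x_max, y_max]."""
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2  # anchor center
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]              # anchor size
    tx, ty, tw, th = offsets
    cx, cy = ax + tx * aw, ay + ty * ah          # shift center by a fraction of anchor size
    w, h = aw * math.exp(tw), ah * math.exp(th)  # scale size; exp keeps width/height positive
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
```

Zero offsets reproduce the anchor exactly, which is why anchors that already match an object's scale and shape are easy for the model to fit.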

Failure Pattern

Training a detector on images where objects are always centered and large, then deploying on images with small, occluded, or densely packed objects. The model never learned to detect those cases.

Another failure: using an IoU threshold of 0.5 during training but evaluating at 0.75, which makes the model look much worse than expected.

Common Mistakes

  • confusing detection confidence with classification accuracy
  • not applying NMS, leading to many duplicate detections
  • training on unbalanced classes without techniques like focal loss
  • evaluating mAP at only one IoU threshold when the task requires precise localization

Practice

  1. Explain the difference between one-stage and two-stage detectors and when each is preferable.
  2. Compute IoU between two bounding boxes by hand.
  3. Describe what NMS does and why it is necessary.
  4. Explain why mAP is preferred over simple accuracy for detection tasks.
  5. Compare YOLO and Faster R-CNN for a specific use case and justify your choice.

Runnable Example
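
A minimal, self-contained sketch that exercises the concepts above: confidence filtering followed by per-class NMS on the [x_min, y_min, x_max, y_max, class_id, confidence] output format. The detections, class IDs, and thresholds are made up for illustration; in practice the raw boxes would come from a trained model's forward pass.

```python
CONF_THRESHOLD = 0.5  # hypothetical operating point
IOU_THRESHOLD = 0.5

def iou(a, b):
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def postprocess(raw):
    """raw: list of [x_min, y_min, x_max, y_max, class_id, confidence]."""
    strong = [d for d in raw if d[5] >= CONF_THRESHOLD]  # drop low-confidence boxes
    final = []
    for cls in {d[4] for d in strong}:                   # NMS runs per class
        boxes = sorted((d for d in strong if d[4] == cls),
                       key=lambda d: d[5], reverse=True)
        while boxes:
            top = boxes.pop(0)
            final.append(top)
            boxes = [b for b in boxes if iou(top[:4], b[:4]) <= IOU_THRESHOLD]
    return final

raw_detections = [
    [10, 10, 110, 110, 0, 0.92],   # class 0, strong detection
    [12, 14, 108, 112, 0, 0.81],   # near-duplicate of the box above
    [200, 50, 280, 140, 1, 0.77],  # class 1
    [60, 60, 90, 90, 1, 0.30],     # below the confidence threshold
]

for det in postprocess(raw_detections):
    print(det)
```

Running this keeps two boxes: the duplicate class-0 box is suppressed by NMS, and the low-confidence class-1 box never reaches NMS at all.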

Longer Connection

Continue with Convolutional Neural Networks for the backbone architectures that power detectors, and Vision Augmentation and Shift Robustness for making detection robust to real-world conditions.