Object Detection Basics¶
What This Is¶
Object detection goes beyond classification — instead of asking "what is in the image?", it asks "what is in the image, and where is it?" The model outputs bounding boxes with class labels and confidence scores.
The key shift: classification produces one label per image; detection produces multiple boxes per image, each with its own label.
When You Use It¶
- locating and counting objects in images
- building surveillance, autonomous driving, or quality inspection systems
- when you need to know where things are, not just what they are
- when multiple objects of different classes appear in a single image
Detection vs Classification vs Segmentation¶
| Task | Output | Granularity |
|---|---|---|
| Classification | one label per image | image-level |
| Object Detection | bounding boxes + labels | box-level |
| Semantic Segmentation | class label per pixel | pixel-level |
| Instance Segmentation | object mask per instance | pixel-level, instance-aware |
How Detection Works¶
Most modern detectors follow one of two paradigms:
One-Stage Detectors (YOLO, SSD, RetinaNet)¶
- Process the entire image in a single pass
- Fast, suitable for real-time applications
- Divide the image into a grid, predict boxes and classes at each cell
Two-Stage Detectors (Faster R-CNN, Mask R-CNN)¶
- First stage proposes candidate regions
- Second stage classifies and refines each region
- More accurate but slower
Key Concepts¶
Bounding Boxes¶
A detection output is typically: `[x_min, y_min, x_max, y_max, class_id, confidence]`
Intersection over Union (IoU)¶
IoU measures how much a predicted box overlaps with the ground truth:
IoU = Area of Overlap / Area of Union
- IoU = 1.0: perfect match
- IoU > 0.5: typically considered a correct detection
- IoU = 0.0: no overlap at all
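The formula above can be sketched directly in code. This is a minimal, framework-free implementation assuming boxes in `[x_min, y_min, x_max, y_max]` format:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in [x_min, y_min, x_max, y_max] format."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # If the boxes don't overlap, width or height goes negative; clamp to zero
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143
```

Note the clamp to zero: without it, disjoint boxes would produce a spurious positive overlap from two negative side lengths.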
Non-Maximum Suppression (NMS)¶
Detectors often produce multiple overlapping boxes for the same object. NMS keeps only the highest-confidence box and suppresses duplicates:
- Sort boxes by confidence
- Keep the top box
- Remove all boxes with IoU > threshold with the kept box
- Repeat for remaining boxes
Mean Average Precision (mAP)¶
The standard detection metric:
1. For each class, compute the precision-recall curve
2. Compute average precision (AP), the area under the PR curve
3. Average AP across all classes
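A per-class AP computation can be sketched as follows. This simplified version assumes each detection has already been matched against ground truth (1 = true positive, 0 = false positive) and integrates the PR curve with the rectangle rule; real benchmarks like COCO use interpolated variants, but the idea is the same:

```python
def average_precision(scores, matches, num_gt):
    """AP for one class.
    scores: confidence per detection
    matches: 1 if the detection matched a ground-truth box (TP), else 0 (FP)
    num_gt: total ground-truth boxes for this class"""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:  # sweep the confidence threshold from high to low
        if matches[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under the PR curve
        prev_recall = recall
    return ap

# 3 detections, 2 ground-truth boxes; the middle detection is a false positive
print(average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2))  # 5/6 ≈ 0.833
```

mAP is then just the mean of this value over all classes (and, for COCO-style mAP, over several IoU thresholds as well).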
Common Architectures¶
| Model | Type | Speed | Accuracy | Best For |
|---|---|---|---|---|
| YOLOv8 | one-stage | very fast | good | real-time applications |
| SSD | one-stage | fast | moderate | embedded/mobile |
| RetinaNet | one-stage | medium | good | handling class imbalance (focal loss) |
| Faster R-CNN | two-stage | slower | high | when accuracy matters most |
| DETR | transformer-based | medium | high | end-to-end, no NMS needed |
Anchor Boxes¶
Most detectors use predefined anchor boxes at multiple scales and aspect ratios:
- Small anchors catch small objects
- Large anchors catch large objects
- The model predicts offsets from these anchors, not absolute coordinates
DETR is a notable exception — it uses learned object queries instead of anchors.
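Predicting offsets rather than absolute coordinates looks like this at decode time. A sketch using the Faster R-CNN style parameterization, where anchors are given as center/size `(cx, cy, w, h)` and the network outputs `(tx, ty, tw, th)`:

```python
import math

def decode(anchor, offsets):
    """Turn predicted offsets (tx, ty, tw, th) plus an anchor (cx, cy, w, h)
    into a corner-format box, Faster R-CNN style."""
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx = acx + tx * aw           # shift the center, scaled by anchor size
    cy = acy + ty * ah
    w = aw * math.exp(tw)        # exp keeps width/height positive
    h = ah * math.exp(th)
    # center/size back to corner format
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(decode((50, 50, 20, 20), (0.0, 0.0, 0.0, 0.0)))  # zero offsets → the anchor itself
```

Because the targets are small offsets around a reasonable prior, regression is easier to learn than predicting raw pixel coordinates.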
Failure Pattern¶
Training a detector on images where objects are always centered and large, then deploying on images with small, occluded, or densely packed objects. The model never learned to detect those cases.
Another failure: using an IoU threshold of 0.5 during development but evaluating at 0.75, which makes the model look much worse than expected.
Common Mistakes¶
- confusing detection confidence with classification accuracy
- not applying NMS, leading to many duplicate detections
- training on unbalanced classes without techniques like focal loss
- evaluating mAP at only one IoU threshold when the task requires precise localization
Practice¶
- Explain the difference between one-stage and two-stage detectors and when each is preferable.
- Compute IoU between two bounding boxes by hand.
- Describe what NMS does and why it is necessary.
- Explain why mAP is preferred over simple accuracy for detection tasks.
- Compare YOLO and Faster R-CNN for a specific use case and justify your choice.
Runnable Example¶
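A minimal end-to-end post-processing sketch, using hand-made detections in place of real model output: it filters low-confidence boxes, applies greedy NMS within each class, and prints the survivors. Function names and thresholds here are illustrative, not from any particular library:

```python
def iou(a, b):
    """IoU for corner-format boxes [x_min, y_min, x_max, y_max]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def postprocess(detections, conf_threshold=0.5, iou_threshold=0.5):
    """Confidence filtering followed by per-class greedy NMS."""
    survivors = [d for d in detections if d["confidence"] >= conf_threshold]
    kept = []
    for cls in {d["class_id"] for d in survivors}:
        boxes = sorted((d for d in survivors if d["class_id"] == cls),
                       key=lambda d: d["confidence"], reverse=True)
        while boxes:
            best = boxes.pop(0)
            kept.append(best)
            boxes = [b for b in boxes if iou(best["box"], b["box"]) <= iou_threshold]
    return kept

# Hand-made "model output": two overlapping boxes for one dog (class 0),
# one cat (class 1), and one box below the confidence threshold.
raw = [
    {"box": (10, 10, 50, 50), "class_id": 0, "confidence": 0.9},
    {"box": (12, 12, 52, 52), "class_id": 0, "confidence": 0.7},  # duplicate dog
    {"box": (60, 60, 90, 90), "class_id": 1, "confidence": 0.8},
    {"box": (0, 0, 5, 5),     "class_id": 1, "confidence": 0.2},  # filtered out
]
for det in postprocess(raw):
    print(det["class_id"], det["confidence"], det["box"])
```

Two detections survive: the duplicate dog box is suppressed by NMS and the low-confidence box never reaches it. Swapping in boxes from a real detector (e.g. a YOLO or Faster R-CNN forward pass) is the only change needed to use this pipeline in practice.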
Longer Connection¶
Continue with Convolutional Neural Networks for the backbone architectures that power detectors, and Vision Augmentation and Shift Robustness for making detection robust to real-world conditions.