Object Detection Basics

What This Is

Object detection goes beyond classification — instead of asking "what is in the image?", it asks "what is in the image, and where is it?" The model outputs bounding boxes with class labels and confidence scores.

The key shift: classification produces one label per image; detection produces multiple boxes per image, each with its own label.

When You Use It

  • locating and counting objects in images
  • building surveillance, autonomous driving, or quality inspection systems
  • when you need to know where things are, not just what they are
  • when multiple objects of different classes appear in a single image

Detection vs Classification vs Segmentation

Task                    Output                     Granularity
Classification          one label per image        image-level
Object Detection        bounding boxes + labels    box-level
Semantic Segmentation   class label per pixel      pixel-level
Instance Segmentation   object mask per instance   pixel-level, instance-aware

How Detection Works

Most modern detectors follow one of two paradigms:

One-Stage Detectors (YOLO, SSD, RetinaNet)

  • Process the entire image in a single pass
  • Fast, suitable for real-time applications
  • Divide the image into a grid, predict boxes and classes at each cell

Two-Stage Detectors (Faster R-CNN, Mask R-CNN)

  • First stage proposes candidate regions
  • Second stage classifies and refines each region
  • More accurate but slower

Key Concepts

Bounding Boxes

A detection output is typically: [x_min, y_min, x_max, y_max, class_id, confidence]

Intersection over Union (IoU)

IoU measures how much a predicted box overlaps with the ground truth:

IoU = Area of Overlap / Area of Union
  • IoU = 1.0: perfect match
  • IoU > 0.5: typically considered a correct detection
  • IoU = 0.0: no overlap at all
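
The formula above can be sketched as a small helper, assuming boxes in corner format [x_min, y_min, x_max, y_max]:

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x_min, y_min, x_max, y_max] boxes."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two identical boxes give 1.0, and a box shifted halfway across an equal-size box gives 25 / 175 ≈ 0.14 — less overlap than intuition often suggests.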

Non-Maximum Suppression (NMS)

Detectors often produce multiple overlapping boxes for the same object. NMS keeps only the highest-confidence box and suppresses duplicates:

  1. Sort boxes by confidence
  2. Keep the top box
  3. Remove all boxes with IoU > threshold with the kept box
  4. Repeat for remaining boxes
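
The four steps above map directly to a greedy loop; a minimal sketch, assuming detections as [x_min, y_min, x_max, y_max, confidence]:

```python
def nms(detections, iou_threshold=0.5):
    """Greedy NMS over [x_min, y_min, x_max, y_max, confidence] boxes."""
    def iou(a, b):
        # Intersection rectangle, clamped to zero if the boxes are disjoint
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    boxes = sorted(detections, key=lambda b: b[4], reverse=True)  # 1. sort by confidence
    kept = []
    while boxes:                       # 4. repeat for remaining boxes
        top = boxes.pop(0)             # 2. keep the top box
        kept.append(top)
        # 3. remove boxes with IoU > threshold against the kept box
        boxes = [b for b in boxes if iou(top, b) <= iou_threshold]
    return kept
```

Note that production NMS is usually applied per class, so a car box never suppresses an overlapping pedestrian box.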

Mean Average Precision (mAP)

The standard detection metric:

  1. For each class, compute the precision-recall curve
  2. Compute average precision (area under the PR curve)
  3. Average across all classes
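
One common (non-interpolated) way to compute the per-class AP in step 2, assuming detections have already been sorted by descending confidence and matched to ground truth as true/false positives:

```python
def average_precision(tp_flags, num_gt):
    """AP for one class.

    tp_flags: per-detection True/False (true positive or not), sorted by
              descending confidence.
    num_gt:   number of ground-truth objects of this class.
    """
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for is_tp in tp_flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / num_gt
        precision = tp / (tp + fp)
        # Accumulate the area under the PR curve as a rectangle per step
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

Benchmarks such as COCO additionally interpolate the precision curve and average AP over several IoU thresholds, but the area-under-PR idea is the same.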

Common Architectures

Model         Type               Speed      Accuracy   Best For
YOLOv8        one-stage          very fast  good       real-time applications
SSD           one-stage          fast       moderate   embedded/mobile
RetinaNet     one-stage          medium     good       handling class imbalance (focal loss)
Faster R-CNN  two-stage          slower     high       when accuracy matters most
DETR          transformer-based  medium     high       end-to-end, no NMS needed

Anchor Boxes

Most detectors use predefined anchor boxes at multiple scales and aspect ratios:

  • Small anchors catch small objects
  • Large anchors catch large objects
  • The model predicts offsets from these anchors, not absolute coordinates

DETR is a notable exception — it uses learned object queries instead of anchors.
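
A sketch of how predicted offsets are turned back into a box, using the center/log-scale parameterization popularized by Faster R-CNN-style detectors (other detectors use variants of this scheme):

```python
import math

def decode(anchor, offsets):
    """Decode offsets (tx, ty, tw, th) against an anchor [x_min, y_min, x_max, y_max]."""
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2  # anchor center
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]              # anchor size
    tx, ty, tw, th = offsets
    cx, cy = ax + tx * aw, ay + ty * ah          # shift center by a fraction of anchor size
    w, h = aw * math.exp(tw), ah * math.exp(th)  # scale size; exp keeps width/height positive
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
```

Zero offsets reproduce the anchor exactly, which is why anchors that already match an object's scale and shape are easy for the model to fit.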

Failure Pattern

Training a detector on images where objects are always centered and large, then deploying on images with small, occluded, or densely packed objects. The model never learned to detect those cases.

Another failure: using an IoU threshold of 0.5 during training but evaluating at 0.75, which makes the model look much worse than expected.

Common Mistakes

  • confusing detection confidence with classification accuracy
  • not applying NMS, leading to many duplicate detections
  • training on unbalanced classes without techniques like focal loss
  • evaluating mAP at only one IoU threshold when the task requires precise localization

Practice

  1. Explain the difference between one-stage and two-stage detectors and when each is preferable.
  2. Compute IoU between two bounding boxes by hand.
  3. Describe what NMS does and why it is necessary.
  4. Explain why mAP is preferred over simple accuracy for detection tasks.
  5. Compare YOLO and Faster R-CNN for a specific use case and justify your choice.

Runnable Example
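
A minimal, self-contained sketch that exercises the concepts above: confidence filtering followed by per-class NMS on the [x_min, y_min, x_max, y_max, class_id, confidence] output format. The detections, class IDs, and thresholds are made up for illustration; in practice the raw boxes would come from a trained model's forward pass.

```python
CONF_THRESHOLD = 0.5  # hypothetical operating point
IOU_THRESHOLD = 0.5

def iou(a, b):
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def postprocess(raw):
    """raw: list of [x_min, y_min, x_max, y_max, class_id, confidence]."""
    strong = [d for d in raw if d[5] >= CONF_THRESHOLD]  # drop low-confidence boxes
    final = []
    for cls in {d[4] for d in strong}:                   # NMS runs per class
        boxes = sorted((d for d in strong if d[4] == cls),
                       key=lambda d: d[5], reverse=True)
        while boxes:
            top = boxes.pop(0)
            final.append(top)
            boxes = [b for b in boxes if iou(top[:4], b[:4]) <= IOU_THRESHOLD]
    return final

raw_detections = [
    [10, 10, 110, 110, 0, 0.92],   # class 0, strong detection
    [12, 14, 108, 112, 0, 0.81],   # near-duplicate of the box above
    [200, 50, 280, 140, 1, 0.77],  # class 1
    [60, 60, 90, 90, 1, 0.30],     # below the confidence threshold
]

for det in postprocess(raw_detections):
    print(det)
```

Running this keeps two boxes: the duplicate class-0 box is suppressed by NMS, and the low-confidence class-1 box never reaches NMS at all.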

Longer Connection

Continue with Convolutional Neural Networks for the backbone architectures that power detectors, and Vision Augmentation and Shift Robustness for making detection robust to real-world conditions.