Audio Models

What This Is

This page is about the first real model choice after handcrafted audio features stop being enough. The main question is not "which giant model is popular?" The main question is:

  • do I need a spectral baseline
  • a frozen pretrained audio encoder
  • or a fine-tuned audio model

The right answer depends on the task, the clip structure, and how much labeled data you actually have.

When You Use It

  • clip-level sound classification
  • speech recognition or transcription
  • audio embedding reuse for similarity or retrieval
  • deciding when to move beyond FFT or spectrogram features

Start With The Task Type

Use the task to choose the first serious model:

  • short sound classification: spectral baseline, then frozen encoder if needed. Inspect confusion between similar sound classes.
  • speech transcription: pretrained speech model. Inspect decoded examples and word error rate.
  • audio retrieval or similarity: frozen embeddings. Inspect nearest-neighbor quality and cluster structure.
  • small labeled dataset: frozen encoder before fine-tuning. Inspect whether reuse already solves most of the task.

Do not jump straight to fine-tuning if the task can be answered by a cheaper baseline.

Representation Ladder

Use this ladder in order:

  1. handcrafted spectral features
  2. frozen pretrained encoder
  3. partial or full fine-tuning

That ordering matters because each step adds cost, failure modes, and debugging burden.

If handcrafted features already separate the task well, the encoder may be unnecessary. If the frozen encoder already works, full fine-tuning may be wasted effort.
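Step 1 of the ladder can be surprisingly small. A hedged numpy-only sketch: averaged log-magnitude FFT frames as the clip feature, plus a nearest-centroid rule as the classifier. The frame size, hop, and toy signals are illustrative choices, not a recommendation:

```python
import numpy as np

def spectral_features(clip, frame=512, hop=256):
    """Average log-magnitude spectrum over windowed frames: one vector per clip."""
    frames = [clip[i:i + frame] for i in range(0, len(clip) - frame + 1, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(frame), axis=1))
    return np.log1p(mags).mean(axis=0)

def nearest_centroid(train_feats, train_labels, test_feat):
    """Predict the label whose mean training feature is closest to the test clip."""
    labels = sorted(set(train_labels))
    centroids = {l: np.mean([f for f, y in zip(train_feats, train_labels) if y == l], axis=0)
                 for l in labels}
    return min(labels, key=lambda l: np.linalg.norm(test_feat - centroids[l]))

# Toy clips at 16 kHz: a low tone versus broadband noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 220 * t)
noise = rng.standard_normal(16000)

feats = [spectral_features(tone), spectral_features(noise)]
pred = nearest_centroid(feats, ["tone", "noise"],
                        spectral_features(np.sin(2 * np.pi * 225 * t)))
print(pred)  # a nearby 225 Hz tone should land closer to the tone centroid
```

If a baseline like this already separates the classes, the encoder steps above it may not be needed at all.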

Minimal Encoder Pattern

This is the shortest frozen-encoder pattern:

import torch
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
encoder = AutoModel.from_pretrained("facebook/wav2vec2-base")

# waveform: a 1-D float array of raw audio at the rate the model expects
# (16 kHz here); a random tensor stands in for a real clip.
waveform = torch.randn(16000)

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, frames, features)

clip_embedding = hidden.mean(dim=1)  # mean-pool frames into one clip vector

The point of this pattern is not the exact model name. The point is to extract a reusable representation first, then test whether a small classifier or retrieval rule is already enough.
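The cheapest test of a reusable representation is cosine nearest neighbors over the clip embeddings. A minimal numpy sketch, assuming an `embeddings` matrix with one row per clip (the toy values here are illustrative):

```python
import numpy as np

def nearest_clips(embeddings, query_idx, k=2):
    """Return indices of the k clips most cosine-similar to the query clip."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf          # exclude the query itself
    return np.argsort(sims)[::-1][:k]

# Toy embeddings: rows 0 and 1 point the same way, row 2 is orthogonal.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(nearest_clips(embeddings, query_idx=0, k=1))
```

If the neighbors returned this way already group clips the way the task needs, a small classifier or retrieval rule on frozen embeddings may be enough.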

What To Inspect First

Before you trust the model, inspect:

  • sample rate and any resampling step
  • clip duration and padding
  • how much silence or background noise is present
  • whether the task is actually speech or non-speech
  • whether the frozen encoder beats the simpler spectral baseline

If those checks are missing, model choice is still guesswork.
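Several of these checks can be automated before any model runs. A hedged numpy sketch; the silence threshold and expected rate are assumptions to adapt per dataset:

```python
import numpy as np

def audit_clip(waveform, sample_rate, expected_rate=16000, silence_thresh=1e-3):
    """Flag the preprocessing problems that most often invalidate comparisons."""
    return {
        "sample_rate_ok": sample_rate == expected_rate,
        "duration_s": len(waveform) / sample_rate,
        "silence_fraction": float(np.mean(np.abs(waveform) < silence_thresh)),
        "peak": float(np.max(np.abs(waveform))),
    }

# Toy clip: half a second of silence followed by half a second of tone.
clip = np.concatenate([np.zeros(8000), 0.5 * np.sin(np.linspace(0, 100, 8000))])
report = audit_clip(clip, sample_rate=16000)
print(report)
```

Running a report like this over the whole dataset makes sample-rate mismatches and mostly-silent clips visible before they silently skew the model comparison.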

A Good Decision Note

After one serious run, the learner should be able to say:

  • what the task was
  • which baseline was tried first
  • whether frozen features were already strong enough
  • what error pattern remained
  • whether fine-tuning is justified or not

Failure Pattern

The common failure is using a speech model just because it is available.

That usually breaks in one of these ways:

  • the clip is not speech, so the encoder prior is mismatched
  • the sample rate or waveform normalization is wrong
  • the model is fine-tuned on too little labeled data
  • one impressive decoded example hides unstable overall behavior

Common Mistakes

  • skipping the spectral baseline from Audio Windows and Spectral Features
  • using the wrong sample rate for the pretrained model
  • treating padding and truncation as harmless
  • comparing transcripts by eye without checking a real metric
  • fine-tuning a large encoder before testing frozen embeddings
  • assuming a pretrained speech encoder is automatically a good environmental-sound model
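For the transcript-comparison mistake in particular, the real metric is word error rate: word-level edit distance divided by the reference word count. A minimal pure-Python sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[-1][-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # one substitution: 1/6
```

One impressive decoded example can coexist with a high corpus-level WER, which is exactly why the metric has to be computed rather than eyeballed.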

Runnable Example

Use the local example:

academy/.venv/bin/python academy/examples/audio-models-demo.py

That example is intentionally small. Its job is to make loading, preprocessing, and encoder inference visible before the full workflow gets heavier.

Practice

  1. Run a spectral baseline and write down where it fails.
  2. Extract frozen audio embeddings and compare them against the baseline.
  3. Decide whether the task looks like classification, transcription, or retrieval.
  4. Explain whether fine-tuning is justified with the amount of labeled data you have.
  5. Name one preprocessing mistake that could invalidate the comparison.

Longer Connection

Use Audio Windows and Spectral Features for the baseline feature ladder, Vision and Audio Workflows for full multimodal workflow practice, and Speech and Audio Encoders when the task clearly needs a deeper encoder-reuse route.