Audio Models¶
What This Is¶
This page covers the first real model choice you face once handcrafted audio features stop being enough. The main question is not "which giant model is popular?" The main question is whether you need:
- a spectral baseline
- a frozen pretrained audio encoder
- or a fine-tuned audio model
The right answer depends on the task, the clip structure, and how much labeled data you actually have.
When You Use It¶
- clip-level sound classification
- speech recognition or transcription
- audio embedding reuse for similarity or retrieval
- deciding when to move beyond FFT or spectrogram features
Start With The Task Type¶
Use the task to choose the first serious model:
| Task | Best first move | What to inspect |
|---|---|---|
| short sound classification | spectral baseline, then frozen encoder if needed | confusion between similar sound classes |
| speech transcription | pretrained speech model | decoded examples and word error rate |
| audio retrieval or similarity | frozen embeddings | nearest-neighbor quality and cluster structure |
| small labeled dataset | frozen encoder before fine-tuning | whether reuse already solves most of the task |
Do not jump straight to fine-tuning if the task can be answered by a cheaper baseline.
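For the retrieval row above, "nearest-neighbor quality" can be checked directly on frozen embeddings before any training. A minimal cosine-similarity sketch, using made-up embedding vectors as stand-ins for real encoder outputs:

```python
import numpy as np

# Hypothetical frozen clip embeddings (4 clips, 4 dims); the values
# are illustrative only, not real encoder output.
embeddings = np.array([
    [1.0, 0.1, 0.0, 0.0],   # clip 0: e.g. a dog bark
    [0.9, 0.2, 0.1, 0.0],   # clip 1: another bark
    [0.0, 0.1, 1.0, 0.9],   # clip 2: e.g. a siren
    [0.1, 0.0, 0.8, 1.0],   # clip 3: another siren
])

def nearest_neighbor(query_idx: int, emb: np.ndarray) -> int:
    """Return the index of the most cosine-similar other clip."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf  # exclude the query itself
    return int(np.argmax(sims))

print(nearest_neighbor(0, embeddings))  # expect clip 1, the other bark
```

If the nearest neighbors already group the clips the way the task needs, the retrieval problem may be solved without any fine-tuning.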
Representation Ladder¶
Use this ladder in order:
- handcrafted spectral features
- frozen pretrained encoder
- partial or full fine-tuning
That ordering matters because each step adds cost, failure modes, and debugging burden.
If handcrafted features already separate the classes well, the encoder may be unnecessary. If the frozen encoder already works, full fine-tuning may be wasted effort.
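The first rung of the ladder can be a few lines of numpy. A minimal sketch of coarse spectral band energies, where the frame length and number of bands are arbitrary illustrative choices:

```python
import numpy as np

def band_energies(waveform: np.ndarray, sr: int = 16000,
                  frame_len: int = 512, n_bands: int = 4) -> np.ndarray:
    """Handcrafted feature: average magnitude-spectrum energy per band.

    Frames the signal, takes an FFT magnitude per frame, then averages
    a few coarse frequency bands over time.
    """
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.abs(np.fft.rfft(frames, axis=1))      # (frames, bins)
    bands = np.array_split(spectra, n_bands, axis=1)   # coarse bands
    return np.array([b.mean() for b in bands])

# A 440 Hz tone should concentrate energy in the lowest band.
sr = 16000
t = np.arange(sr) / sr
feats = band_energies(np.sin(2 * np.pi * 440 * t), sr)
print(feats.argmax())  # lowest band dominates
```

If features like these already separate the classes, the rest of the ladder may never be needed for this task.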
Minimal Encoder Pattern¶
This is the shortest frozen-encoder pattern:
```python
import torch
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
encoder = AutoModel.from_pretrained("facebook/wav2vec2-base")

# waveform: a 1-D float array at 16 kHz, loaded elsewhere
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, frames, dim)
clip_embedding = hidden.mean(dim=1)               # one vector per clip
```
The point of this pattern is not the exact model name. The point is to extract a reusable representation first, then test whether a small classifier or retrieval rule is already enough.
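The "small classifier" on top of frozen embeddings can be as simple as a nearest-centroid rule. A minimal sketch with made-up two-dimensional embeddings standing in for real clip_embedding vectors:

```python
import numpy as np

# Made-up "frozen embeddings" for labeled clips; stand-ins for
# clip_embedding vectors from a real encoder.
train_emb = np.array([
    [1.0, 0.0], [0.9, 0.1],   # class 0
    [0.0, 1.0], [0.1, 0.9],   # class 1
])
train_labels = np.array([0, 0, 1, 1])

def nearest_centroid_predict(x: np.ndarray) -> int:
    """Classify one embedding by distance to each class mean."""
    centroids = np.array([
        train_emb[train_labels == c].mean(axis=0)
        for c in np.unique(train_labels)
    ])
    dists = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(dists))

print(nearest_centroid_predict(np.array([0.8, 0.2])))  # closer to class 0
```

If a rule this cheap already reaches acceptable accuracy on frozen embeddings, fine-tuning the encoder is hard to justify.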
What To Inspect First¶
Before you trust the model, inspect:
- sample rate and any resampling step
- clip duration and padding
- how much silence or background noise is present
- whether the task is actually speech or non-speech
- whether the frozen encoder beats the simpler spectral baseline
If those checks are missing, model choice is still guesswork.
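Most of those checks can be scripted before any model call. A minimal sketch over a raw numpy waveform, where the 16 kHz target and the silence threshold are illustrative assumptions:

```python
import numpy as np

def inspect_clip(waveform: np.ndarray, sr: int,
                 target_sr: int = 16000,
                 silence_thresh: float = 0.01) -> dict:
    """Basic sanity checks to run before trusting any audio model."""
    duration = len(waveform) / sr
    # Fraction of samples near zero: a rough silence / dead-air estimate.
    silence_frac = float(np.mean(np.abs(waveform) < silence_thresh))
    return {
        "sample_rate_ok": sr == target_sr,  # resample first if False
        "duration_s": duration,
        "silence_fraction": silence_frac,
    }

sr = 8000                    # wrong rate on purpose
clip = np.zeros(sr)          # one second of pure silence
print(inspect_clip(clip, sr))
```

A report like this per clip makes rate mismatches and mostly-silent recordings visible before they show up as mysterious model errors.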
A Good Decision Note¶
After one serious run, the learner should be able to say:
- what the task was
- which baseline was tried first
- whether frozen features were already strong enough
- what error pattern remained
- whether fine-tuning is justified or not
Failure Pattern¶
The common failure is using a speech model just because it is available.
That usually breaks in one of these ways:
- the clip is not speech, so the encoder prior is mismatched
- the sample rate or waveform normalization is wrong
- the model is fine-tuned on too little labeled data
- one impressive decoded example hides unstable overall behavior
Common Mistakes¶
- skipping the spectral baseline from Audio Windows and Spectral Features
- using the wrong sample rate for the pretrained model
- treating padding and truncation as harmless
- comparing transcripts by eye without checking a real metric
- fine-tuning a large encoder before testing frozen embeddings
- assuming a pretrained speech encoder is automatically a good environmental-sound model
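The "real metric" for transcript comparison is usually word error rate. A minimal sketch using a plain word-level edit distance, with no external dependencies:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # one substitution
```

Even a crude metric like this beats eyeballing transcripts, because one clean decoded example can hide a high average error rate.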
Runnable Example¶
Use the local example:
academy/.venv/bin/python academy/examples/audio-models-demo.py
That example is intentionally small. Its job is to make loading, preprocessing, and encoder inference visible before the full workflow gets heavier.
Practice¶
- Run a spectral baseline and write down where it fails.
- Extract frozen audio embeddings and compare them against the baseline.
- Decide whether the task looks like classification, transcription, or retrieval.
- Explain whether fine-tuning is justified with the amount of labeled data you have.
- Name one preprocessing mistake that could invalidate the comparison.
Longer Connection¶
Use Audio Windows and Spectral Features for the baseline feature ladder, Vision and Audio Workflows for full multimodal workflow practice, and Speech and Audio Encoders when the task clearly needs a deeper encoder-reuse route.