Track 17

Speech and Audio Encoders

This track moves past clip-level summaries into timing-aware audio work: pooling, weak labels, boundary inspection, and reuse decisions justified by artifacts rather than guesswork.

Primary Goal

Keep The Boundary Visible

If the representation erases the timing signal, the model is already being asked to recover information that was thrown away upstream.

You Will Practice

Pooling, Weak Labels, Probes

Log-band features, clip-versus-frame behavior, weak-label encoders, frozen probes, and small adaptation choices.
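As a warm-up for log-band features, here is a minimal sketch of the idea: frame the waveform, take magnitude spectra, pool FFT bins into a few coarse bands, and log-compress. The window, hop, and band count are illustrative stand-ins, and the equal-width bands substitute for a real mel filterbank; this is not the track's actual extractor.

```python
import numpy as np

def log_band_features(signal, sr=16000, win=400, hop=160, n_bands=8):
    """Frame the signal, take magnitude spectra, pool into coarse bands, log-compress."""
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))  # (frames, bins)
    # Pool FFT bins into n_bands equal-width bands (a stand-in for a mel filterbank).
    bands = np.stack([b.sum(axis=1) for b in np.array_split(spec, n_bands, axis=1)], axis=1)
    return np.log(bands + 1e-8)  # (frames, n_bands): the time axis is still visible

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # one second of a 440 Hz tone
feats = log_band_features(tone)
print(feats.shape)  # one row per frame, one column per band
```

The point of the shape is the first axis: nothing here has pooled over time yet, so every downstream choice about collapsing frames is still open.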

Best First Move

Compare Clip And Frame Views

Run the short encoder demo first, then inspect what the clip summary hides before you touch the full workflow.

Use This Track When

  • clip-level audio accuracy is no longer enough to trust the result
  • the task depends on local timing, onset, or short events inside a longer clip
  • your labels are weak at the clip level and you still need boundary evidence
  • you need to decide whether a frozen encoder is enough or a small update is justified

What This Track Is Training

This track trains one practical rule:

  • keep the time axis visible for as long as the task needs it

That means the learner should move through this order:

  1. clip summary and pooling choice
  2. frame-level score behavior
  3. weak-label encoder training
  4. boundary inspection on positive clips
  5. frozen encoder plus probe
  6. light adaptation only if the frozen path stops short
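Step 5 in the order above can be sketched in a few lines: train only a linear head on fixed features. The embeddings and labels below are synthetic stand-ins (in the track they would come from the weak-label encoder), and the plain logistic-regression loop is an assumed minimal probe, not the lab's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 200 frozen clip embeddings with a class-dependent shift.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)
emb = rng.normal(size=(n, d)) + labels[:, None] * 0.8

# Frozen encoder plus probe: the embeddings never change, only the linear head does.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(emb @ w + b)))  # sigmoid predictions
    grad = p - labels                          # logistic-loss gradient
    w -= 0.1 * emb.T @ grad / n
    b -= 0.1 * grad.mean()

acc = (((emb @ w + b) > 0) == labels).mean()
print(f"frozen-probe accuracy: {acc:.2f}")
```

If a probe this small already answers the task, step 6 is unnecessary; that is exactly the frozen-versus-adapted decision the track asks you to defend.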

First Session

Use this order:

  1. Audio Windows and Spectral Features
  2. Audio Models
  3. run:
academy/.venv/bin/python academy/examples/speech-audio-encoder-recipes/audio_encoder_demo.py
  4. inspect whether mean pooling and max pooling tell the same event story
  5. write one short note on what the clip summary throws away and what the frame view recovers
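The mean-versus-max comparison can be made concrete with synthetic frame scores: a clip that is quiet except for one short event. The numbers below are illustrative, not output from the demo script.

```python
import numpy as np

# A 100-frame clip with one short event spanning frames 40-45.
frame_scores = np.full(100, 0.05)
frame_scores[40:46] = 0.9

clip_mean = frame_scores.mean()       # the short event is diluted by the quiet frames
clip_max = frame_scores.max()         # the event survives, but its timing is gone
onset = int(np.argmax(frame_scores))  # the frame view still locates the event

print(f"mean={clip_mean:.3f}  max={clip_max:.3f}  first peak frame={onset}")
```

Mean pooling reports a near-negative clip, max pooling reports a confident positive, and only the un-pooled frame scores say where the event happened. That is the gap the short note should record.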

Full Track Loop

For the complete workflow:

  1. run the short encoder demo first
  2. run the lab entry point:
academy/.venv/bin/python academy/labs/speech-and-audio-encoders/src/speech_audio_encoder_workflow.py
  3. inspect the clip-level summary before reading the boundary plot
  4. inspect frame-boundary scores before deciding whether to fine-tune the encoder
  5. finish the matching exercises in academy/exercises/speech-and-audio-encoders/
  6. keep one short note with the chosen pooling rule, the boundary evidence, and the frozen-versus-adapted decision

What To Inspect

By the end of the track, the learner should have inspected:

  • whether mean pooling or max pooling keeps the event signal more visible
  • whether clip accuracy and boundary quality disagree
  • whether frame scores peak near the true event boundary on positive clips
  • whether the frozen probe beats the scratch baseline on the same split
  • whether adaptation helps enough to justify changing the encoder
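The boundary-peak check in the list above reduces to one comparison: where the frame scores peak versus where the annotation says the event starts. The scores and the annotated boundary below are synthetic stand-ins for the lab's real outputs.

```python
import numpy as np

true_boundary = 62  # annotated onset frame (assumed for illustration)
frames = np.arange(100)
scores = np.exp(-0.5 * ((frames - 60) / 4.0) ** 2)  # frame scores peaking near frame 60

peak = int(np.argmax(scores))
error = abs(peak - true_boundary)
print(f"peak frame {peak}, boundary {true_boundary}, offset {error} frames")
```

Run only on positive clips: on negatives there is no true boundary, so the argmax is noise and would poison any aggregate offset statistic.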

Common Failure Modes

  • collapsing the time axis too early and then blaming the model
  • reading clip-level accuracy as if it proved the boundary estimate is good
  • checking boundary behavior without separating positive clips from the rest
  • fine-tuning before the frozen-probe result is understood
  • adopting the stronger model when the frozen encoder already answers the downstream task

Exit Standard

Before leaving this track, the learner should be able to:

  • explain what the clip summary hid and what the frame view recovered
  • justify the chosen pooling rule with a real artifact
  • say whether the weak-label encoder produced a usable boundary signal
  • defend a frozen, probed, or adapted encoder path without guessing
  • name the smallest next change worth testing if the current representation still falls short