Track 08

Vision and Audio Workflows

This expansion track turns modality work into two concrete decisions: does the vision recipe survive the shift that matters, and does the audio representation keep the local structure the label depends on?

Primary Goal

Match Representation To Structure

The point is to keep the useful spatial or temporal structure visible long enough that the model choice is honest.

You Will Practice

Shift Tests, Augmentation, Windows

Clean-versus-shifted vision evaluation, training-only augmentation, and raw-versus-windowed audio features on the same split.

Best First Move

Run One Vision Demo And One Audio Demo

Use the short examples first so the full lab feels like a decision chain, not a jump into heavier modality code.

Use This Track When

  • you already know the core workflow and now need modality-specific judgment
  • clean vision accuracy feels better than the real robustness story
  • a global audio feature view may be hiding the local pattern the label depends on
  • you want to compare representation choices before escalating model size

What This Track Is Training

This track trains one practical rule:

  • do not let the representation throw away the signal before the model even sees it

That means the learner should move through this order:

  1. clean versus shifted vision behavior
  2. training-only augmentation as an invariance decision
  3. raw, global spectral, and windowed audio views
  4. one fixed split across all recipe comparisons
  5. one selected recipe per modality with an artifact that justifies it
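The first two steps of this order can be sketched as a minimal clean-versus-shifted check. Everything below is a toy stand-in, not the lab's code: `shift_images` is a hypothetical distribution shift (a small translation), and the "model" is a deliberately brittle single-pixel rule, so the clean score looks perfect while the shifted score collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_images(x, dy=2, dx=2):
    # Hypothetical shift-that-matters: translate every image by (dy, dx).
    return np.roll(x, shift=(dy, dx), axis=(1, 2))

def evaluate(predict, x, y):
    # Accuracy on a fixed evaluation split.
    return float((predict(x) == y).mean())

# Toy evaluation split whose label depends on one pixel.
x_eval = rng.normal(size=(64, 8, 8))
y_eval = (x_eval[:, 0, 0] > 0).astype(int)

def predict(x):
    # Brittle recipe: reads a single pixel, so it breaks under translation.
    return (x[:, 0, 0] > 0).astype(int)

clean = evaluate(predict, x_eval, y_eval)
shifted = evaluate(predict, shift_images(x_eval), y_eval)
print(f"clean={clean:.2f} shifted={shifted:.2f} gap={clean - shifted:.2f}")
```

The point of the sketch is the gap, not the numbers: a recipe chosen by the clean score alone can be exactly this kind of single-pixel rule.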

First Session

Use this order:

  1. Vision Augmentation and Shift Robustness
  2. Audio Windows and Spectral Features
  3. run:
academy/.venv/bin/python academy/examples/vision-audio-recipes/vision_shift_augmentation_demo.py
academy/.venv/bin/python academy/examples/vision-audio-recipes/audio_window_fft_demo.py
  4. inspect where the clean score misleads you and where the global audio view loses local structure
  5. write one short note on whether the main risk is invariance, representation, or both

Full Track Loop

For the complete workflow:

  1. run the two short modality examples first
  2. run the lab entry point:
academy/.venv/bin/python academy/labs/vision-and-audio-workflows/src/vision_audio_workflow.py
  3. inspect the clean-versus-shifted gap before reading the "winner" in vision
  4. inspect the raw_wave, global_fft, and windowed_fft comparison before increasing model size in audio
  5. finish the matching exercises in academy/exercises/vision-and-audio-workflows/
  6. keep one short note with the selected recipe per modality and the artifact that justified each decision

What To Inspect

By the end of the track, the learner should have inspected:

  • the clean-to-shifted drop for each vision recipe
  • whether augmentation helped the simple baseline enough that a larger model stopped being necessary
  • whether the shifted evaluation stayed fixed while the training recipe changed
  • whether a global FFT erased the ordering information that a windowed FFT preserved
  • whether the selected recipe is also the smallest one that still keeps the signal visible
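The global-versus-windowed FFT point above can be seen with two toy signals that contain the same frequencies in a different temporal order. This is a NumPy sketch, not the lab's feature code: a circular reordering leaves the global magnitude spectrum unchanged, while a windowed view keeps the ordering visible.

```python
import numpy as np

sr = 100
t = np.arange(sr) / sr
lo = np.sin(2 * np.pi * 5 * t)   # 5 Hz segment
hi = np.sin(2 * np.pi * 20 * t)  # 20 Hz segment

a = np.concatenate([lo, hi])     # low frequency first
b = np.concatenate([hi, lo])     # same content, reversed order

# Global FFT magnitudes: identical, because b is a circular shift of a.
ga = np.abs(np.fft.rfft(a))
gb = np.abs(np.fft.rfft(b))
print("global view difference:", np.abs(ga - gb).max())

def windowed_fft(x, n_windows=2):
    # One magnitude spectrum per window, so temporal order survives.
    return np.abs(np.fft.rfft(x.reshape(n_windows, -1), axis=1))

wa, wb = windowed_fft(a), windowed_fft(b)
print("windowed view difference:", np.abs(wa - wb).max())
```

If the label depends on whether the low tone comes before the high tone, the global view has already erased the answer before any model sees it.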

Common Failure Modes

  • choosing the vision recipe by clean accuracy alone
  • applying augmentation or perturbation logic to the evaluation side
  • changing the shifted evaluation while also changing the training recipe
  • using a global spectral feature on an order-sensitive audio task
  • reaching for a bigger model before checking whether the input view is already wrong
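The second and third failure modes come down to one structural rule: perturbation lives in the training loader, never in the evaluation path. A minimal sketch of that separation, with hypothetical helper names rather than the lab's API:

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(x, rng):
    # Invariance decision: small random translations, training side only.
    dy, dx = rng.integers(-2, 3, size=2)
    return np.roll(x, shift=(int(dy), int(dx)), axis=(1, 2))

def train_batches(x, y, rng, epochs=3):
    for _ in range(epochs):
        yield augment(x, rng), y   # perturbed views, unchanged labels

def eval_batch(x, y):
    return x, y                    # evaluation stays fixed and untouched

x = rng.normal(size=(4, 8, 8))
y = np.array([0, 1, 0, 1])

for xb, yb in train_batches(x, y, rng):
    assert xb.shape == x.shape     # same shape, shifted content

xe, ye = eval_batch(x, y)
assert np.array_equal(xe, x)       # no augmentation leaked into evaluation
print("train side augmented, eval side fixed")
```

Keeping `eval_batch` a pass-through is what makes recipe comparisons honest: if the shifted evaluation changes while the training recipe changes, the gap no longer isolates the recipe.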

Exit Standard

Before leaving this track, the learner should be able to:

  • explain why clean vision accuracy is weak evidence when shift matters
  • show which augmentation or model change actually reduced the shifted gap
  • explain why the chosen audio feature view preserves the task structure
  • defend one selected recipe per modality with a table or plot instead of a slogan
  • name the smallest next change worth testing if the current recipe still falls short