Track 08
Vision and Audio Workflows
This expansion track turns modality work into two concrete decisions: does the vision recipe survive the shift that matters, and does the audio representation keep the local structure the label depends on?
Primary Goal
Match Representation To Structure
The point is to keep the useful spatial or temporal structure in the input, so that a model comparison reflects the models and not a representation that already discarded the signal.
You Will Practice
Shift Tests, Augmentation, Windows
Clean-versus-shifted vision evaluation, training-only augmentation, and raw-versus-windowed audio features on the same split.
Best First Move
Run One Vision Demo And One Audio Demo
Use the short examples first so the full lab feels like a decision chain, not a jump into heavier modality code.
Use This Track When
- you already know the core workflow and now need modality-specific judgment
- clean vision accuracy looks better than the model's actual robustness under shift
- a global audio feature view may be hiding the local pattern the label depends on
- you want to compare representation choices before escalating model size
What This Track Is Training
This track trains one practical rule:
- do not let the representation throw away the signal before the model even sees it
That means the learner should move through this order:
- clean versus shifted vision behavior
- training-only augmentation as an invariance decision
- raw, global spectral, and windowed audio views
- one fixed split across all recipe comparisons
- one selected recipe per modality with an artifact that justifies it
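The vision half of that order can be sketched on a toy problem. This is an illustration under stated assumptions, not the lab's implementation: the 1-D "images", the nearest-neighbor model, and every function name here (`shift_right`, `fit_with_shift_augmentation`, `compare_recipes`) are hypothetical. It shows augmentation applied inside the training recipe only, while the clean and shifted evaluation sets are built once and reused for every recipe.

```python
def shift_right(pixels, steps=1):
    """Shift a 1-D 'image' right by `steps`, padding with zeros on the left."""
    return [0] * steps + pixels[:len(pixels) - steps]


def fit_nearest_neighbor(train_set):
    """Return a 1-nearest-neighbor predictor using L1 distance on raw pixels."""
    def predict(pixels):
        def distance(example):
            return sum(abs(a - b) for a, b in zip(example[0], pixels))
        return min(train_set, key=distance)[1]
    return predict


def fit_with_shift_augmentation(train_set):
    """Augment the TRAINING set only, then fit the same model."""
    augmented = train_set + [(shift_right(x), y) for x, y in train_set]
    return fit_nearest_neighbor(augmented)


def accuracy(predict, examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)


def compare_recipes(recipes, train_set, clean_eval, shifted_eval):
    """Score every recipe on the same two evaluation sets."""
    rows = []
    for name, fit in recipes.items():
        predict = fit(train_set)
        clean = accuracy(predict, clean_eval)
        shifted = accuracy(predict, shifted_eval)
        rows.append({"recipe": name, "clean": clean,
                     "shifted": shifted, "gap": clean - shifted})
    return rows


# Toy data (assumption): class 0 is one bright pixel, class 1 is two adjacent ones.
train_set = [([0, 0, 1, 0, 0, 0], 0), ([0, 0, 1, 1, 0, 0], 1)]
clean_eval = [([0, 0, 0.9, 0, 0, 0], 0), ([0, 0, 0.9, 0.9, 0, 0], 1)]
# The shifted set is derived once from the clean set and never touched again.
shifted_eval = [(shift_right(x), y) for x, y in clean_eval]

recipes = {
    "baseline": fit_nearest_neighbor,
    "shift_augmented": fit_with_shift_augmentation,
}
rows = compare_recipes(recipes, train_set, clean_eval, shifted_eval)
for row in rows:
    print(row)
```

On this toy data the baseline scores 1.0 clean and 0.5 shifted, while the augmented recipe scores 1.0 on both, so the clean column alone cannot separate the two recipes.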
First Session
Use this order:
academy/.venv/bin/python academy/examples/vision-audio-recipes/vision_shift_augmentation_demo.py
academy/.venv/bin/python academy/examples/vision-audio-recipes/audio_window_fft_demo.py
- inspect where the clean score misleads you and where the global audio view loses local structure
- write one short note on whether the main risk is invariance, representation, or both
Full Track Loop
For the complete workflow:
- run the two short modality examples first
- run the lab entry point:
academy/.venv/bin/python academy/labs/vision-and-audio-workflows/src/vision_audio_workflow.py
- inspect the clean-versus-shifted gap before reading the "winner" in vision
- inspect the raw_wave, global_fft, and windowed_fft comparison before increasing model size in audio
- finish the matching exercises in academy/exercises/vision-and-audio-workflows/
- keep one short note with the selected recipe per modality and the artifact that justified each decision
What To Inspect
By the end of the track, the learner should have inspected:
- the clean-to-shifted drop for each vision recipe
- whether augmentation helped the simple baseline enough that a larger model stopped being necessary
- whether the shifted evaluation stayed fixed while the training recipe changed
- whether a global FFT erased the ordering information that a windowed FFT preserved
- whether the selected recipe is also the smallest one that still keeps the signal visible
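The global-versus-windowed FFT check can be reproduced on a synthetic signal. The sketch below is a minimal illustration, not the lab's feature code: the function names and the two-tone signal are assumptions, and a plain DFT from the standard library stands in for a real FFT routine such as `numpy.fft.rfft`. A signal and its time reversal have the same global magnitude spectrum, so a global view cannot tell "low then high" from "high then low", while per-window spectra can.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude spectrum (bins 0..N/2) of a real signal via a direct DFT."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        magnitudes.append(abs(total))
    return magnitudes


def windowed_magnitudes(samples, window_size):
    """One magnitude spectrum per non-overlapping window, in time order."""
    starts = range(0, len(samples) - window_size + 1, window_size)
    return [dft_magnitudes(samples[s:s + window_size]) for s in starts]


def peak_bins(frames):
    """Index of the strongest frequency bin in each window."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in frames]


def tone(freq_bin, length):
    """A sine that completes `freq_bin` cycles over `length` samples."""
    return [math.sin(2 * math.pi * freq_bin * t / length) for t in range(length)]


# Two signals with the same content in opposite order (synthetic assumption).
low_then_high = tone(2, 32) + tone(6, 32)
high_then_low = low_then_high[::-1]

global_difference = max(
    abs(a - b)
    for a, b in zip(dft_magnitudes(low_then_high), dft_magnitudes(high_then_low))
)
print("largest global magnitude difference:", global_difference)
print("windowed peaks, low then high:", peak_bins(windowed_magnitudes(low_then_high, 32)))
print("windowed peaks, high then low:", peak_bins(windowed_magnitudes(high_then_low, 32)))
```

The global difference is zero up to floating-point error, while the windowed peak bins come out in opposite order for the two signals. If the label depends on that order, only the windowed view carries it.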
Common Failure Modes
- choosing the vision recipe by clean accuracy alone
- applying augmentation or perturbation logic to the evaluation side
- changing the shifted evaluation while also changing the training recipe
- using a global spectral feature on an order-sensitive audio task
- reaching for a bigger model before checking whether the input view is already wrong
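One way to guard against the second and third failure modes is to fingerprint the evaluation set once and check it before every recipe run. This is a hypothetical helper, not part of the lab: `fingerprint` and `assert_same_eval` are illustrative names, and the sketch assumes the evaluation examples are JSON-serializable.

```python
import hashlib
import json


def fingerprint(examples):
    """Return a short, stable hash of an evaluation set's contents and order."""
    payload = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def assert_same_eval(expected_fingerprint, examples):
    """Raise if the evaluation set no longer matches the frozen fingerprint."""
    actual = fingerprint(examples)
    if actual != expected_fingerprint:
        raise ValueError(
            f"evaluation set changed: {expected_fingerprint} -> {actual}"
        )


# Freeze the shifted evaluation set before comparing any recipes.
shifted_eval = [([0, 0, 0, 1, 0, 0], 0), ([0, 0, 0, 1, 1, 0], 1)]
frozen = fingerprint(shifted_eval)

# Passes silently: nothing touched the evaluation side.
assert_same_eval(frozen, shifted_eval)

# Simulate an accidental perturbation of the evaluation side.
tampered = [([0, 0, 0, 0, 1, 0], 0)] + shifted_eval[1:]
try:
    assert_same_eval(frozen, tampered)
except ValueError as error:
    print(error)
```

Storing the fingerprint next to each results table makes it easy to confirm later that two recipes were scored on the same shifted data.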
Exit Standard
Before leaving this track, the learner should be able to:
- explain why clean vision accuracy is weak evidence when shift matters
- show which augmentation or model change actually reduced the shifted gap
- explain why the chosen audio feature view preserves the task structure
- defend one selected recipe per modality with a table or plot instead of a slogan
- name the smallest next change worth testing if the current recipe still falls short
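A table that supports this kind of defense can be very small. The sketch below is one possible shape for it, under loud assumptions: the row fields (`params`, `clean`, `shifted`), the selection rule, and the tolerance are hypothetical, and the numbers are placeholders for illustration, not results from the lab.

```python
def select_smallest_adequate(rows, tolerance=0.02):
    """Pick the smallest recipe whose shifted score is within `tolerance` of the best."""
    best = max(row["shifted"] for row in rows)
    adequate = [row for row in rows if row["shifted"] >= best - tolerance]
    return min(adequate, key=lambda row: row["params"])


def format_table(rows):
    """Render recipe rows as a fixed-width text table with the clean-to-shifted gap."""
    lines = [f"{'recipe':<20}{'params':>8}{'clean':>8}{'shifted':>9}{'gap':>7}"]
    for row in rows:
        gap = row["clean"] - row["shifted"]
        lines.append(
            f"{row['recipe']:<20}{row['params']:>8}"
            f"{row['clean']:>8.2f}{row['shifted']:>9.2f}{gap:>7.2f}"
        )
    return "\n".join(lines)


# Placeholder numbers only; replace them with scores from your own fixed split.
rows = [
    {"recipe": "small", "params": 1000, "clean": 0.95, "shifted": 0.70},
    {"recipe": "small_augmented", "params": 1000, "clean": 0.94, "shifted": 0.88},
    {"recipe": "large_augmented", "params": 50000, "clean": 0.96, "shifted": 0.89},
]

print(format_table(rows))
print("selected:", select_smallest_adequate(rows)["recipe"])
```

With these placeholder rows the rule selects `small_augmented`: the larger model has the best shifted score, but the small augmented recipe is within the tolerance at a fraction of the size. The tolerance is a judgment call and belongs in the note that accompanies the table.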