Track 08

Vision and Audio Workflows

This expansion track turns modality work into two concrete decisions: does the vision recipe survive the shift that matters, and does the audio representation keep the local structure the label depends on?

Primary Goal

Match Representation To Structure

The point is to keep the useful spatial or temporal structure visible long enough that the model choice is honest.

You Will Practice

Shift Tests, Augmentation, Windows

Clean-versus-shifted vision evaluation, training-only augmentation, and raw-versus-windowed audio features on the same split.

Best First Move

Run One Vision Demo And One Audio Demo

Use the short examples first so the full lab feels like a decision chain, not a jump into heavier modality code.

Use This Track When

  • you already know the core workflow and now need modality-specific judgment
  • clean vision accuracy feels better than the real robustness story
  • a global audio feature view may be hiding the local pattern the label depends on
  • you want to compare representation choices before escalating model size

What This Track Is Training

This track trains one practical rule:

  • do not let the representation throw away the signal before the model even sees it

That means the learner should move through this order:

  1. clean versus shifted vision behavior
  2. training-only augmentation as an invariance decision
  3. raw, global spectral, and windowed audio views
  4. one fixed split across all recipe comparisons
  5. one selected recipe per modality with an artifact that justifies it
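The first two steps of this order can be sketched as a minimal clean-versus-shifted check. Everything below is a toy stand-in, not the lab's code: `shift_images` is a hypothetical distribution shift (a small translation), and the "model" is a deliberately brittle single-pixel rule, so the clean score looks perfect while the shifted score collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_images(x, dy=2, dx=2):
    # Hypothetical shift-that-matters: translate every image by (dy, dx).
    return np.roll(x, shift=(dy, dx), axis=(1, 2))

def evaluate(predict, x, y):
    # Accuracy on a fixed evaluation split.
    return float((predict(x) == y).mean())

# Toy evaluation split whose label depends on one pixel.
x_eval = rng.normal(size=(64, 8, 8))
y_eval = (x_eval[:, 0, 0] > 0).astype(int)

def predict(x):
    # Brittle recipe: reads a single pixel, so it breaks under translation.
    return (x[:, 0, 0] > 0).astype(int)

clean = evaluate(predict, x_eval, y_eval)
shifted = evaluate(predict, shift_images(x_eval), y_eval)
print(f"clean={clean:.2f} shifted={shifted:.2f} gap={clean - shifted:.2f}")
```

The point of the sketch is the gap, not the numbers: a recipe chosen by the clean score alone can be exactly this kind of single-pixel rule.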

First Session

Use this order:

  1. Vision Augmentation and Shift Robustness
  2. Audio Windows and Spectral Features
  3. run:
academy/.venv/bin/python academy/examples/vision-audio-recipes/vision_shift_augmentation_demo.py
academy/.venv/bin/python academy/examples/vision-audio-recipes/audio_window_fft_demo.py
  4. inspect where the clean score misleads you and where the global audio view loses local structure
  5. write one short note on whether the main risk is invariance, representation, or both

Full Track Loop

For the complete workflow:

  1. run the two short modality examples first
  2. run the lab entry point:
academy/.venv/bin/python academy/labs/vision-and-audio-workflows/src/vision_audio_workflow.py
  3. inspect the clean-versus-shifted gap before reading the "winner" in vision
  4. inspect the raw_wave, global_fft, and windowed_fft comparison before increasing model size in audio
  5. finish the matching exercises in academy/exercises/vision-and-audio-workflows/
  6. keep one short note with the selected recipe per modality and the artifact that justified each decision

What To Inspect

By the end of the track, the learner should have inspected:

  • the clean-to-shifted drop for each vision recipe
  • whether augmentation helped the simple baseline enough that a larger model stopped being necessary
  • whether the shifted evaluation stayed fixed while the training recipe changed
  • whether a global FFT erased the ordering information that a windowed FFT preserved
  • whether the selected recipe is also the smallest one that still keeps the signal visible
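The global-versus-windowed FFT point above can be seen with two toy signals that contain the same frequencies in a different temporal order. This is a NumPy sketch, not the lab's feature code: a circular reordering leaves the global magnitude spectrum unchanged, while a windowed view keeps the ordering visible.

```python
import numpy as np

sr = 100
t = np.arange(sr) / sr
lo = np.sin(2 * np.pi * 5 * t)   # 5 Hz segment
hi = np.sin(2 * np.pi * 20 * t)  # 20 Hz segment

a = np.concatenate([lo, hi])     # low frequency first
b = np.concatenate([hi, lo])     # same content, reversed order

# Global FFT magnitudes: identical, because b is a circular shift of a.
ga = np.abs(np.fft.rfft(a))
gb = np.abs(np.fft.rfft(b))
print("global view difference:", np.abs(ga - gb).max())

def windowed_fft(x, n_windows=2):
    # One magnitude spectrum per window, so temporal order survives.
    return np.abs(np.fft.rfft(x.reshape(n_windows, -1), axis=1))

wa, wb = windowed_fft(a), windowed_fft(b)
print("windowed view difference:", np.abs(wa - wb).max())
```

If the label depends on whether the low tone comes before the high tone, the global view has already erased the answer before any model sees it.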

Common Failure Modes

  • choosing the vision recipe by clean accuracy alone
  • applying augmentation or perturbation logic to the evaluation side
  • changing the shifted evaluation while also changing the training recipe
  • using a global spectral feature on an order-sensitive audio task
  • reaching for a bigger model before checking whether the input view is already wrong
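The second and third failure modes come down to one structural rule: perturbation lives in the training loader, never in the evaluation path. A minimal sketch of that separation, with hypothetical helper names rather than the lab's API:

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(x, rng):
    # Invariance decision: small random translations, training side only.
    dy, dx = rng.integers(-2, 3, size=2)
    return np.roll(x, shift=(int(dy), int(dx)), axis=(1, 2))

def train_batches(x, y, rng, epochs=3):
    for _ in range(epochs):
        yield augment(x, rng), y   # perturbed views, unchanged labels

def eval_batch(x, y):
    return x, y                    # evaluation stays fixed and untouched

x = rng.normal(size=(4, 8, 8))
y = np.array([0, 1, 0, 1])

for xb, yb in train_batches(x, y, rng):
    assert xb.shape == x.shape     # same shape, shifted content

xe, ye = eval_batch(x, y)
assert np.array_equal(xe, x)       # no augmentation leaked into evaluation
print("train side augmented, eval side fixed")
```

Keeping `eval_batch` a pass-through is what makes recipe comparisons honest: if the shifted evaluation changes while the training recipe changes, the gap no longer isolates the recipe.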

Exit Standard

Before leaving this track, the learner should be able to:

  • explain why clean vision accuracy is weak evidence when shift matters
  • show which augmentation or model change actually reduced the shifted gap
  • explain why the chosen audio feature view preserves the task structure
  • defend one selected recipe per modality with a table or plot instead of a slogan
  • name the smallest next change worth testing if the current recipe still falls short