Audio Windows and Spectral Features¶
What This Is¶
Audio tasks often depend on how frequencies change over time, not only which frequencies exist overall. Windowed spectral features keep local structure that one global spectrum can hide.
The key distinction is between "what is present somewhere in the clip" and "what happens at a particular time." If the label depends on order, onset, or short-lived events, a single global transform is often too blunt. If the label mostly depends on stable tone content, a clip-level spectral summary can be enough.
When You Use It¶
- building a fast audio baseline
- checking whether local time order matters
- comparing raw waveform features against spectral features
- deciding whether the bottleneck is the feature view or the model family
- turning variable-length clips into a fixed-size matrix for a linear model
Core Tools¶
- `numpy.fft.rfft` for real-valued FFT features
- `numpy.fft.rfftfreq` for interpreting the frequency bins
- `scipy.signal.windows.get_window` for Hann, Hamming, Kaiser, and similar windows
- `scipy.signal.stft` or `scipy.signal.spectrogram` for frame-by-frame time-frequency views
- `scipy.signal.welch` for a more stable clip-level spectral summary
- `StandardScaler` and `Pipeline` for honest preprocessing
- `LogisticRegression` as a strong first baseline
- `FunctionTransformer` when you want log-compression inside the model path
Window First¶
Before you take an FFT, decide how long each frame should be and what window to apply. The window controls leakage between neighboring frequencies; the frame length controls how much local time you keep.
from scipy.signal.windows import get_window

# x is the raw waveform, start the frame offset, nperseg the frame length
frame = x[start : start + nperseg]
window = get_window("hann", nperseg, fftbins=True)
windowed_frame = frame * window
Useful choices:
- `hann` is a good default when you want a clean first baseline.
- `hamming` is similar and also common for short audio frames.
- `kaiser` is useful when you want to trade main-lobe width against sidelobe suppression more deliberately.
Useful tricks:
- keep the window periodic for FFT work, which is what `get_window(..., fftbins=True)` gives you by default.
- make the frame long enough to capture the event, but not so long that onset timing disappears.
- if the class depends on order, use multiple windows and keep them in time order.
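To see what the window buys you, compare an off-bin tone under a rectangular (no) window and a Hann window. This is a minimal sketch with assumed values (8 kHz sample rate, 256-sample frame, 450 Hz tone); the concentration metric is just an illustrative leakage proxy, not a standard measure.

```python
import numpy as np
from scipy.signal.windows import get_window

sample_rate = 8000  # assumed
nperseg = 256
t = np.arange(nperseg) / sample_rate

# A 450 Hz tone sits between bins (the bin width is 31.25 Hz here), so the
# unwindowed spectrum leaks energy across many neighboring bins.
frame = np.sin(2 * np.pi * 450.0 * t)

rect_mag = np.abs(np.fft.rfft(frame))
hann_mag = np.abs(np.fft.rfft(frame * get_window("hann", nperseg, fftbins=True)))

def peak_concentration(mag):
    """Fraction of spectral energy within 2 bins of the peak."""
    k = int(np.argmax(mag))
    lo, hi = max(k - 2, 0), min(k + 3, len(mag))
    return float((mag[lo:hi] ** 2).sum() / (mag ** 2).sum())

# The Hann window concentrates the tone's energy near its true bin.
print(peak_concentration(rect_mag), peak_concentration(hann_mag))
```

The exact numbers depend on how far the tone sits from a bin center; the pattern (Hann concentrates, rectangular leaks) does not.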
Resolution And Hop Size¶
Three numbers control most of the first audio baseline:
- `nperseg`: how many samples belong to one frame
- `noverlap`: how much neighboring frames overlap
- `hop = nperseg - noverlap`: how far the window moves each step
Two quantities are worth computing explicitly:
frequency_resolution = sample_rate / nperseg
nyquist = sample_rate / 2
hop_seconds = (nperseg - noverlap) / sample_rate
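With assumed settings (8 kHz audio, 256-sample frames, 50% overlap), those formulas give concrete values you can sanity-check:

```python
# Assumed settings: 8 kHz audio, 256-sample frames, 50% overlap.
sample_rate = 8000
nperseg = 256
noverlap = 128

frequency_resolution = sample_rate / nperseg      # Hz between neighboring bins
nyquist = sample_rate / 2                         # highest representable frequency
hop_seconds = (nperseg - noverlap) / sample_rate  # time step between frames

print(frequency_resolution, nyquist, hop_seconds)  # 31.25 4000.0 0.016
```

So each frame covers 32 ms of audio, bins are 31.25 Hz apart, and a new frame starts every 16 ms.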
Interpret them like this:
- larger `nperseg` improves frequency resolution but blurs short events in time
- smaller `nperseg` improves time localization but makes nearby frequencies harder to separate
- Nyquist is the highest frequency you can represent without aliasing
- a smaller hop gives denser time coverage but more correlated frames and more compute
That is the basic time-frequency tradeoff. You are choosing how much detail to keep along each axis.
FFT Features¶
import numpy as np
spectrum = np.abs(np.fft.rfft(windowed_frame))
freqs = np.fft.rfftfreq(nperseg, d=1 / sample_rate)
What this gives you:
- `rfft` is the efficient choice for real-valued audio.
- `rfftfreq` lets you map bins back to real frequency units.
- `abs(...)` turns the complex FFT into magnitudes that are easier to use in a linear model.
Common follow-ups:
- apply `np.log1p` to compress the dynamic range when a few bins dominate.
- keep the FFT length fixed across clips so the feature columns line up.
- if the clip length varies, extract a fixed number of windows or aggregate the per-window features.
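Those last two points can be combined: frame each clip, take per-frame magnitudes, and aggregate over frames so every clip yields the same feature length. The helper names (`frame_clip`, `clip_features`) and the 256/128 settings below are illustrative choices, not fixed conventions.

```python
import numpy as np
from scipy.signal.windows import get_window

def frame_clip(x, nperseg=256, hop=128):
    """Slice a 1-D clip into overlapping frames; drops the ragged tail."""
    n_frames = 1 + (len(x) - nperseg) // hop
    return np.stack([x[i * hop : i * hop + nperseg] for i in range(n_frames)])

def clip_features(x, nperseg=256, hop=128):
    """Per-frame log-magnitude spectra, averaged over frames so every clip
    yields the same number of columns regardless of its length."""
    frames = frame_clip(x, nperseg, hop) * get_window("hann", nperseg, fftbins=True)
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mags).mean(axis=0)  # shape (nperseg // 2 + 1,)

rng = np.random.default_rng(0)
short = clip_features(rng.normal(size=4000))   # two clips of different length...
long_ = clip_features(rng.normal(size=9000))
print(short.shape, long_.shape)  # (129,) (129,) ...same feature shape
```

Averaging over frames discards time order; keep the per-frame matrix instead when order matters.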
Spectral Summaries¶
For a stable clip-level baseline, Welch-style summaries are often easier to defend than one raw FFT of the entire clip.
from scipy.signal import welch
freqs, power = welch(x, fs=sample_rate, window="hann", nperseg=256, noverlap=128)
log_power = np.log1p(power)
This is useful when:
- the clip is noisy and you want a smoother estimate than one frame gives you
- the signal is mostly stationary over the clip
- you want a compact baseline before trying a time-frequency matrix
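One way to see the stability gain: on white noise, whose true spectrum is flat, a single full-clip periodogram fluctuates wildly from bin to bin while the Welch average is much smoother. A minimal sketch with assumed settings:

```python
import numpy as np
from scipy.signal import periodogram, welch

rng = np.random.default_rng(0)
fs = 8000
x = rng.normal(size=8000)  # white noise: the true spectrum is flat

_, p_single = periodogram(x, fs=fs)                      # one full-clip estimate
_, p_welch = welch(x, fs=fs, nperseg=256, noverlap=128)  # averaged segments

# Relative spread around the flat level (skip the DC bin): lower is smoother.
cv_single = p_single[1:].std() / p_single[1:].mean()
cv_welch = p_welch[1:].std() / p_welch[1:].mean()
print(cv_single, cv_welch)  # Welch is far smoother
```

The price is frequency resolution: Welch's bins are as wide as one segment allows, not the full clip.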
If the task is more order-sensitive, use a windowed matrix instead of collapsing immediately:
from scipy import signal
freqs, times, spec = signal.spectrogram(
x,
fs=sample_rate,
window="hann",
nperseg=256,
noverlap=128,
scaling="density",
mode="psd",
)
Then summarize the matrix with a small set of statistics:
spec_mean = spec.mean(axis=1)
spec_std = spec.std(axis=1)
spec_max = spec.max(axis=1)
features = np.concatenate([spec_mean, spec_std, spec_max], axis=0)
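Stacked this way, the feature vector has a fixed length that depends only on `nperseg`, never on the clip length, which is what lets a linear model consume it. A quick shape check on a synthetic noise clip (the 4000-sample length is arbitrary):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
x = rng.normal(size=4000)  # hypothetical clip
sample_rate = 8000

_, _, spec = signal.spectrogram(
    x, fs=sample_rate, window="hann", nperseg=256, noverlap=128,
    scaling="density", mode="psd",
)
features = np.concatenate([spec.mean(axis=1), spec.std(axis=1), spec.max(axis=1)])

# 3 statistics x (nperseg // 2 + 1) frequency bins, independent of clip length
print(features.shape)  # (387,)
```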
Why The Demo Matters¶
The AI Academy audio demo is built around a deliberate counterexample:
- both classes contain the same two tones overall
- the label depends on which tone appears first and which appears second
That means a global FFT sees almost the same clip-level frequency content for both classes. A windowed FFT can separate them because it keeps the order of the two halves.
This is the central lesson of the topic:
- if the label depends on temporal order, clip-level spectral summaries can be too blunt even when the right frequencies are present
- a much bigger model is not the first fix if the representation already threw away the time structure
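The counterexample is easy to reproduce with synthetic tones. The 400/900 Hz values and the half-and-half layout below are assumptions for illustration, not the demo's exact parameters. Note that swapping the two halves is exactly a circular shift of the clip, and circular shifts leave global FFT magnitudes untouched:

```python
import numpy as np

sample_rate = 8000
half = 2000  # samples per half-clip (assumed layout)
t = np.arange(half) / sample_rate
tone_a = np.sin(2 * np.pi * 400 * t)
tone_b = np.sin(2 * np.pi * 900 * t)

clip_ab = np.concatenate([tone_a, tone_b])  # class 1: A then B
clip_ba = np.concatenate([tone_b, tone_a])  # class 2: B then A

# Swapping the halves is a circular shift, so global magnitudes match...
g_ab = np.abs(np.fft.rfft(clip_ab))
g_ba = np.abs(np.fft.rfft(clip_ba))
global_gap = np.abs(g_ab - g_ba).sum() / g_ab.sum()

# ...while the first half alone separates the classes immediately.
h_ab = np.abs(np.fft.rfft(clip_ab[:half]))
h_ba = np.abs(np.fft.rfft(clip_ba[:half]))
half_gap = np.abs(h_ab - h_ba).sum() / h_ab.sum()

print(global_gap, half_gap)  # global gap is ~0, half gap is large
```

No classifier, however large, can separate the classes from the global magnitudes alone; the first-half spectrum makes the task trivially linear.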
Feature Ladder¶
A practical audio-feature ladder is:
- raw waveform summary
- global FFT or Welch power summary
- windowed FFT or spectrogram statistics
- log-mel features
- MFCC-style compressed cepstral features
Move up the ladder only when the simpler rung is clearly missing the signal you need.
Baseline Ladder¶
The strongest first move is usually not a deep model. It is a clean baseline ladder:
- raw waveform summary
- clip-level spectral summary
- windowed spectral features
- simple linear classifier on the fixed feature matrix
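The first rung can be as simple as a handful of clip-level waveform statistics. The particular statistics below are illustrative choices, not a standard feature set:

```python
import numpy as np

def waveform_summary(x):
    """First rung: a few clip-level waveform statistics, no spectrum yet."""
    sign_changes = np.mean(np.abs(np.diff(np.signbit(x).astype(float))))
    return np.array([
        x.mean(),                  # DC offset
        x.std(),                   # overall spread
        np.sqrt(np.mean(x ** 2)),  # RMS energy
        np.abs(x).max(),           # peak amplitude
        sign_changes,              # zero-crossing rate: rough pitch/noisiness proxy
    ])

rng = np.random.default_rng(0)
feats = waveform_summary(rng.normal(size=4000))
print(feats.shape)  # (5,)
```

Each later rung swaps these statistics for richer spectral features while keeping the classifier fixed, which makes the comparison clean.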
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
model = Pipeline(
[
("log1p", FunctionTransformer(np.log1p, validate=False)),
("scale", StandardScaler()),
("clf", LogisticRegression(max_iter=1000)),
]
)
Why this pattern works:
- `FunctionTransformer` keeps the log compression inside the same model path.
- `StandardScaler` prevents a few large-magnitude features from dominating the objective.
- `LogisticRegression` is a strong test of whether the representation is good enough.
If the model only works after heavy tuning, the feature view is probably still weak.
End-To-End Baseline¶
The shortest defensible workflow is:
import numpy as np
from scipy import signal
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
def clip_to_window_stats(x, sample_rate):
_, _, spec = signal.spectrogram(
x,
fs=sample_rate,
window="hann",
nperseg=256,
noverlap=128,
scaling="density",
mode="psd",
)
return np.concatenate([spec.mean(axis=1), spec.std(axis=1), spec.max(axis=1)])
X_train = np.vstack([clip_to_window_stats(x, sample_rate) for x in train_clips])
X_valid = np.vstack([clip_to_window_stats(x, sample_rate) for x in valid_clips])
model = Pipeline(
[
("log1p", FunctionTransformer(np.log1p, validate=False)),
("scale", StandardScaler()),
("clf", LogisticRegression(max_iter=1000)),
]
)
model.fit(X_train, y_train)
valid_scores = model.predict_proba(X_valid)[:, 1]
Use that artifact to compare:
- a global FFT or Welch summary baseline
- a windowed summary baseline on the same split
- one short table that says whether the order-sensitive representation actually earned the extra complexity
Failure Pattern¶
Using one full-spectrum summary on an order-sensitive audio task, then concluding the problem needs a much bigger model when the feature view was the real issue.
Another failure mode is normalizing away the local differences you were trying to preserve. A good feature view should make the signal easier to read, not flatter than necessary.
Another trap is mixing different clip lengths without fixing the feature shape. If the model sees a different number of frequency bins or windows from one example to the next, the representation is still incomplete.
If a time shift does not change the features at all, you may have summarized away the very signal the task depends on. If a tiny shift destroys the result, the windowing choice is probably too brittle.
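A cheap probe for that last point: recompute the windowed features on a slightly shifted copy of the clip and check that they move a little, neither not at all nor catastrophically. The tone, shift size, and threshold below are assumptions for illustration.

```python
import numpy as np
from scipy import signal

def window_stats(x, fs):
    _, _, spec = signal.spectrogram(x, fs=fs, window="hann",
                                    nperseg=256, noverlap=128,
                                    scaling="density", mode="psd")
    return np.concatenate([spec.mean(axis=1), spec.std(axis=1), spec.max(axis=1)])

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(4000) / fs
clip = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.normal(size=4000)
shifted = np.roll(clip, 64)  # shift by half a hop

base = window_stats(clip, fs)
moved = window_stats(shifted, fs)
rel_change = np.abs(base - moved).sum() / np.abs(base).sum()
print(rel_change)  # small but nonzero for a mostly stationary clip
```

For a genuinely order-sensitive clip you would expect a larger, but still bounded, change.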
Practice¶
- Compare a global FFT baseline against a windowed FFT baseline.
- Explain why the global feature loses the class-defining information.
- Change the window size and report when the result gets worse.
- Try one stronger baseline only after the feature view is already sensible.
- Explain whether the task is really about frequency content, timing, or both.
- Name one case where a raw waveform model would still be the better first try.
- Compare `welch` against a per-window FFT table and explain which one is easier to defend.
- Try `hann` and `hamming` and check whether the result changes materially.
- Fit a linear model before you try a neural model and see whether the feature view is already strong enough.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Library Notes¶
- `np.fft.rfft` is the usual first choice when the signal is real-valued and you only need the nonnegative frequencies.
- `np.fft.rfftfreq` tells you where those bins sit in real frequency units.
- `np.abs(...)` turns the complex spectrum into magnitudes that are easier to feed into a linear model.
- `np.concatenate(...)` is the simplest way to stack window features once the windows are extracted.
- `StandardScaler` is often important when one frequency band has a much larger numeric range than the others.
- `Pipeline` helps keep preprocessing attached to the model so the comparison is more honest.
- `welch` is useful when you want a smoother spectral estimate from noisy audio.
- `LogisticRegression` is often enough to test whether the representation is the bottleneck.
Questions To Ask¶
- Does the label depend on the order of events, or just the presence of a tone or frequency band?
- What happens if you swap the two windows?
- Does a larger window make the model better on validation but worse on a shifted test?
- Would a simple linear classifier still be useful, or do you actually need a sequence model?
- Are you measuring one clip-level score when you should be measuring a window-aware score?
- Is the model failing because the representation is weak, or because the classifier is underpowered?
- Do the strongest bins align with the frequencies you expected from the task?
- Does log-compression help more than adding more windows?
- If you shuffle the windows, does the score collapse as expected?
- Would a shifted clip or a shorter clip still produce a sensible feature matrix?
Longer Connection¶
Continue with Vision and Audio Workflows for the full audio feature comparison alongside the vision robustness workflow.