Audio Windows and Spectral Features¶
What This Is¶
Audio tasks often depend on how frequencies change over time, not only which frequencies exist overall. Windowed spectral features keep local structure that one global spectrum can hide.
The key distinction is between "what is present somewhere in the clip" and "what happens at a particular time." If the label depends on order, onset, or short-lived events, a single global transform is often too blunt. If the label mostly depends on stable tone content, a clip-level spectral summary can be enough.
When You Use It¶
- building a fast audio baseline
- checking whether local time order matters
- comparing raw waveform features against spectral features
- deciding whether the bottleneck is the feature view or the model family
- turning variable-length clips into a fixed-size matrix for a linear model
Core Tools¶
- `numpy.fft.rfft` for real-valued FFT features
- `numpy.fft.rfftfreq` for interpreting the frequency bins
- `scipy.signal.windows.get_window` for Hann, Hamming, Kaiser, and similar windows
- `scipy.signal.stft` or `scipy.signal.spectrogram` for frame-by-frame time-frequency views
- `scipy.signal.welch` for a more stable clip-level spectral summary
- `StandardScaler` and `Pipeline` for honest preprocessing
- `LogisticRegression` as a strong first baseline
- `FunctionTransformer` when you want log-compression inside the model path
Window First¶
Before you take an FFT, decide how long each frame should be and what window to apply. The window controls leakage between neighboring frequencies; the frame length controls how much local time you keep.
from scipy.signal.windows import get_window

# x is the raw waveform, start the frame offset, nperseg the frame length
frame = x[start : start + nperseg]
window = get_window("hann", nperseg, fftbins=True)
windowed_frame = frame * window
Useful choices:
- `hann` is a good default when you want a clean first baseline.
- `hamming` is similar and also common for short audio frames.
- `kaiser` is useful when you want to trade main-lobe width against sidelobe suppression more deliberately.
Useful tricks:
- keep the window periodic for FFT work, which is what `get_window(..., fftbins=True)` gives you by default.
- make the frame long enough to capture the event, but not so long that onset timing disappears.
- if the class depends on order, use multiple windows and keep them in time order.
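To see what the window buys you, compare an off-bin tone under a rectangular (no) window and a Hann window. This is a minimal sketch with assumed values (8 kHz sample rate, 256-sample frame, 450 Hz tone); the concentration metric is just an illustrative leakage proxy, not a standard measure.

```python
import numpy as np
from scipy.signal.windows import get_window

sample_rate = 8000  # assumed
nperseg = 256
t = np.arange(nperseg) / sample_rate

# A 450 Hz tone sits between bins (the bin width is 31.25 Hz here), so the
# unwindowed spectrum leaks energy across many neighboring bins.
frame = np.sin(2 * np.pi * 450.0 * t)

rect_mag = np.abs(np.fft.rfft(frame))
hann_mag = np.abs(np.fft.rfft(frame * get_window("hann", nperseg, fftbins=True)))

def peak_concentration(mag):
    """Fraction of spectral energy within 2 bins of the peak."""
    k = int(np.argmax(mag))
    lo, hi = max(k - 2, 0), min(k + 3, len(mag))
    return float((mag[lo:hi] ** 2).sum() / (mag ** 2).sum())

# The Hann window concentrates the tone's energy near its true bin.
print(peak_concentration(rect_mag), peak_concentration(hann_mag))
```

The exact numbers depend on how far the tone sits from a bin center; the pattern (Hann concentrates, rectangular leaks) does not.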
Resolution And Hop Size¶
Three numbers control most of the first audio baseline:
- `nperseg`: how many samples belong to one frame
- `noverlap`: how much neighboring frames overlap
- `hop = nperseg - noverlap`: how far the window moves each step
Two quantities are worth computing explicitly:
frequency_resolution = sample_rate / nperseg
nyquist = sample_rate / 2
hop_seconds = (nperseg - noverlap) / sample_rate
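With assumed settings (8 kHz audio, 256-sample frames, 50% overlap), those formulas give concrete values you can sanity-check:

```python
# Assumed settings: 8 kHz audio, 256-sample frames, 50% overlap.
sample_rate = 8000
nperseg = 256
noverlap = 128

frequency_resolution = sample_rate / nperseg      # Hz between neighboring bins
nyquist = sample_rate / 2                         # highest representable frequency
hop_seconds = (nperseg - noverlap) / sample_rate  # time step between frames

print(frequency_resolution, nyquist, hop_seconds)  # 31.25 4000.0 0.016
```

So each frame covers 32 ms of audio, bins are 31.25 Hz apart, and a new frame starts every 16 ms.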
Interpret them like this:
- larger `nperseg` improves frequency resolution but blurs short events in time
- smaller `nperseg` improves time localization but makes nearby frequencies harder to separate
- Nyquist is the highest frequency you can represent without aliasing
- a smaller hop gives denser time coverage but more correlated frames and more compute
That is the basic time-frequency tradeoff. You are choosing how much detail to keep along each axis.
FFT Features¶
import numpy as np
spectrum = np.abs(np.fft.rfft(windowed_frame))
freqs = np.fft.rfftfreq(nperseg, d=1 / sample_rate)
What this gives you:
- `rfft` is the efficient choice for real-valued audio.
- `rfftfreq` lets you map bins back to real frequency units.
- `abs(...)` turns the complex FFT into magnitudes that are easier to use in a linear model.
Common follow-ups:
- apply `np.log1p` to compress the dynamic range when a few bins dominate.
- keep the FFT length fixed across clips so the feature columns line up.
- if the clip length varies, extract a fixed number of windows or aggregate the per-window features.
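Those last two points can be combined: frame each clip, take per-frame magnitudes, and aggregate over frames so every clip yields the same feature length. The helper names (`frame_clip`, `clip_features`) and the 256/128 settings below are illustrative choices, not fixed conventions.

```python
import numpy as np
from scipy.signal.windows import get_window

def frame_clip(x, nperseg=256, hop=128):
    """Slice a 1-D clip into overlapping frames; drops the ragged tail."""
    n_frames = 1 + (len(x) - nperseg) // hop
    return np.stack([x[i * hop : i * hop + nperseg] for i in range(n_frames)])

def clip_features(x, nperseg=256, hop=128):
    """Per-frame log-magnitude spectra, averaged over frames so every clip
    yields the same number of columns regardless of its length."""
    frames = frame_clip(x, nperseg, hop) * get_window("hann", nperseg, fftbins=True)
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mags).mean(axis=0)  # shape (nperseg // 2 + 1,)

rng = np.random.default_rng(0)
short = clip_features(rng.normal(size=4000))   # two clips of different length...
long_ = clip_features(rng.normal(size=9000))
print(short.shape, long_.shape)  # (129,) (129,) ...same feature shape
```

Averaging over frames discards time order; keep the per-frame matrix instead when order matters.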
Spectral Summaries¶
For a stable clip-level baseline, Welch-style summaries are often easier to defend than one raw FFT of the entire clip.
from scipy.signal import welch
freqs, power = welch(x, fs=sample_rate, window="hann", nperseg=256, noverlap=128)
log_power = np.log1p(power)
This is useful when:
- the clip is noisy and you want a smoother estimate than one frame gives you
- the signal is mostly stationary over the clip
- you want a compact baseline before trying a time-frequency matrix
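One way to see the stability gain: on white noise, whose true spectrum is flat, a single full-clip periodogram fluctuates wildly from bin to bin while the Welch average is much smoother. A minimal sketch with assumed settings:

```python
import numpy as np
from scipy.signal import periodogram, welch

rng = np.random.default_rng(0)
fs = 8000
x = rng.normal(size=8000)  # white noise: the true spectrum is flat

_, p_single = periodogram(x, fs=fs)                      # one full-clip estimate
_, p_welch = welch(x, fs=fs, nperseg=256, noverlap=128)  # averaged segments

# Relative spread around the flat level (skip the DC bin): lower is smoother.
cv_single = p_single[1:].std() / p_single[1:].mean()
cv_welch = p_welch[1:].std() / p_welch[1:].mean()
print(cv_single, cv_welch)  # Welch is far smoother
```

The price is frequency resolution: Welch's bins are as wide as one segment allows, not the full clip.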
If the task is more order-sensitive, use a windowed matrix instead of collapsing immediately:
from scipy import signal
freqs, times, spec = signal.spectrogram(
x,
fs=sample_rate,
window="hann",
nperseg=256,
noverlap=128,
scaling="density",
mode="psd",
)
Then summarize the matrix with a small set of statistics:
spec_mean = spec.mean(axis=1)
spec_std = spec.std(axis=1)
spec_max = spec.max(axis=1)
features = np.concatenate([spec_mean, spec_std, spec_max], axis=0)
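Stacked this way, the feature vector has a fixed length that depends only on `nperseg`, never on the clip length, which is what lets a linear model consume it. A quick shape check on a synthetic noise clip (the 4000-sample length is arbitrary):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
x = rng.normal(size=4000)  # hypothetical clip
sample_rate = 8000

_, _, spec = signal.spectrogram(
    x, fs=sample_rate, window="hann", nperseg=256, noverlap=128,
    scaling="density", mode="psd",
)
features = np.concatenate([spec.mean(axis=1), spec.std(axis=1), spec.max(axis=1)])

# 3 statistics x (nperseg // 2 + 1) frequency bins, independent of clip length
print(features.shape)  # (387,)
```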
Why The Demo Matters¶
The AI Academy audio demo is built around a deliberate counterexample:
- both classes contain the same two tones overall
- the label depends on which tone appears first and which appears second
That means a global FFT sees almost the same clip-level frequency content for both classes. A windowed FFT can separate them because it keeps the order of the two halves.
This is the central lesson of the topic:
- if the label depends on temporal order, clip-level spectral summaries can be too blunt even when the right frequencies are present
- a much bigger model is not the first fix if the representation already threw away the time structure
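The counterexample is easy to reproduce with synthetic tones. The 400/900 Hz values and the half-and-half layout below are assumptions for illustration, not the demo's exact parameters. Note that swapping the two halves is exactly a circular shift of the clip, and circular shifts leave global FFT magnitudes untouched:

```python
import numpy as np

sample_rate = 8000
half = 2000  # samples per half-clip (assumed layout)
t = np.arange(half) / sample_rate
tone_a = np.sin(2 * np.pi * 400 * t)
tone_b = np.sin(2 * np.pi * 900 * t)

clip_ab = np.concatenate([tone_a, tone_b])  # class 1: A then B
clip_ba = np.concatenate([tone_b, tone_a])  # class 2: B then A

# Swapping the halves is a circular shift, so global magnitudes match...
g_ab = np.abs(np.fft.rfft(clip_ab))
g_ba = np.abs(np.fft.rfft(clip_ba))
global_gap = np.abs(g_ab - g_ba).sum() / g_ab.sum()

# ...while the first half alone separates the classes immediately.
h_ab = np.abs(np.fft.rfft(clip_ab[:half]))
h_ba = np.abs(np.fft.rfft(clip_ba[:half]))
half_gap = np.abs(h_ab - h_ba).sum() / h_ab.sum()

print(global_gap, half_gap)  # global gap is ~0, half gap is large
```

No classifier, however large, can separate the classes from the global magnitudes alone; the first-half spectrum makes the task trivially linear.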
Feature Ladder¶
A practical audio-feature ladder is:
- raw waveform summary
- global FFT or Welch power summary
- windowed FFT or spectrogram statistics
- log-mel features
- MFCC-style compressed cepstral features
Move up the ladder only when the simpler rung is clearly missing the signal you need.
Baseline Ladder¶
The strongest first move is usually not a deep model. It is a clean baseline ladder:
- raw waveform summary
- clip-level spectral summary
- windowed spectral features
- simple linear classifier on the fixed feature matrix
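The first rung can be as simple as a handful of clip-level waveform statistics. The particular statistics below are illustrative choices, not a standard feature set:

```python
import numpy as np

def waveform_summary(x):
    """First rung: a few clip-level waveform statistics, no spectrum yet."""
    sign_changes = np.mean(np.abs(np.diff(np.signbit(x).astype(float))))
    return np.array([
        x.mean(),                  # DC offset
        x.std(),                   # overall spread
        np.sqrt(np.mean(x ** 2)),  # RMS energy
        np.abs(x).max(),           # peak amplitude
        sign_changes,              # zero-crossing rate: rough pitch/noisiness proxy
    ])

rng = np.random.default_rng(0)
feats = waveform_summary(rng.normal(size=4000))
print(feats.shape)  # (5,)
```

Each later rung swaps these statistics for richer spectral features while keeping the classifier fixed, which makes the comparison clean.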
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
model = Pipeline(
[
("log1p", FunctionTransformer(np.log1p, validate=False)),
("scale", StandardScaler()),
("clf", LogisticRegression(max_iter=1000)),
]
)
Why this pattern works:
- `FunctionTransformer` keeps the log compression inside the same model path.
- `StandardScaler` prevents a few large-magnitude features from dominating the objective.
- `LogisticRegression` is a strong test of whether the representation is good enough.
If the model only works after heavy tuning, the feature view is probably still weak.
End-To-End Baseline¶
The shortest defensible workflow is:
import numpy as np
from scipy import signal
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
def clip_to_window_stats(x, sample_rate):
_, _, spec = signal.spectrogram(
x,
fs=sample_rate,
window="hann",
nperseg=256,
noverlap=128,
scaling="density",
mode="psd",
)
return np.concatenate([spec.mean(axis=1), spec.std(axis=1), spec.max(axis=1)])
X_train = np.vstack([clip_to_window_stats(x, sample_rate) for x in train_clips])
X_valid = np.vstack([clip_to_window_stats(x, sample_rate) for x in valid_clips])
model = Pipeline(
[
("log1p", FunctionTransformer(np.log1p, validate=False)),
("scale", StandardScaler()),
("clf", LogisticRegression(max_iter=1000)),
]
)
model.fit(X_train, y_train)
valid_scores = model.predict_proba(X_valid)[:, 1]
Use that artifact to compare:
- a global FFT or Welch summary baseline
- a windowed summary baseline on the same split
- one short table that says whether the order-sensitive representation actually earned the extra complexity
Failure Pattern¶
Using one full-spectrum summary on an order-sensitive audio task, then concluding the problem needs a much bigger model when the feature view was the real issue.
Another failure mode is normalizing away the local differences you were trying to preserve. A good feature view should make the signal easier to read, not flatter than necessary.
Another trap is mixing different clip lengths without fixing the feature shape. If the model sees a different number of frequency bins or windows from one example to the next, the representation is still incomplete.
If a time shift does not change the features at all, you may have summarized away the very signal the task depends on. If a tiny shift destroys the result, the windowing choice is probably too brittle.
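A cheap probe for that last point: recompute the windowed features on a slightly shifted copy of the clip and check that they move a little, neither not at all nor catastrophically. The tone, shift size, and threshold below are assumptions for illustration.

```python
import numpy as np
from scipy import signal

def window_stats(x, fs):
    _, _, spec = signal.spectrogram(x, fs=fs, window="hann",
                                    nperseg=256, noverlap=128,
                                    scaling="density", mode="psd")
    return np.concatenate([spec.mean(axis=1), spec.std(axis=1), spec.max(axis=1)])

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(4000) / fs
clip = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.normal(size=4000)
shifted = np.roll(clip, 64)  # shift by half a hop

base = window_stats(clip, fs)
moved = window_stats(shifted, fs)
rel_change = np.abs(base - moved).sum() / np.abs(base).sum()
print(rel_change)  # small but nonzero for a mostly stationary clip
```

For a genuinely order-sensitive clip you would expect a larger, but still bounded, change.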
Practice¶
- Compare a global FFT baseline against a windowed FFT baseline.
- Explain why the global feature loses the class-defining information.
- Change the window size and report when the result gets worse.
- Try one stronger baseline only after the feature view is already sensible.
- Explain whether the task is really about frequency content, timing, or both.
- Name one case where a raw waveform model would still be the better first try.
- Compare `welch` against a per-window FFT table and explain which one is easier to defend.
- Try `hann` and `hamming` and check whether the result changes materially.
- Fit a linear model before you try a neural model and see whether the feature view is already strong enough.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Library Notes¶
- `np.fft.rfft` is the usual first choice when the signal is real-valued and you only need the nonnegative frequencies.
- `np.fft.rfftfreq` tells you where those bins sit in real frequency units.
- `np.abs(...)` turns the complex spectrum into magnitudes that are easier to feed into a linear model.
- `np.concatenate(...)` is the simplest way to stack window features once the windows are extracted.
- `StandardScaler` is often important when one frequency band has a much larger numeric range than the others.
- `Pipeline` helps keep preprocessing attached to the model so the comparison is more honest.
- `welch` is useful when you want a smoother spectral estimate from noisy audio.
- `LogisticRegression` is often enough to test whether the representation is the bottleneck.
Questions To Ask¶
- Does the label depend on the order of events, or just the presence of a tone or frequency band?
- What happens if you swap the two windows?
- Does a larger window make the model better on validation but worse on a shifted test?
- Would a simple linear classifier still be useful, or do you actually need a sequence model?
- Are you measuring one clip-level score when you should be measuring a window-aware score?
- Is the model failing because the representation is weak, or because the classifier is underpowered?
- Do the strongest bins align with the frequencies you expected from the task?
- Does log-compression help more than adding more windows?
- If you shuffle the windows, does the score collapse as expected?
- Would a shifted clip or a shorter clip still produce a sensible feature matrix?
Longer Connection¶
Continue with Vision and Audio Workflows for the full audio feature comparison alongside the vision robustness workflow.