Multi-Modal Fusion¶
What This Is¶
Multi-modal fusion is about one decision:
- should you combine modalities at all, and if so, should the combination happen early, late, or in the middle of the model?
Fusion is only useful when the second modality adds complementary signal. If one modality already solves the task, fusion adds complexity without improving performance.
When You Use It¶
- the task has text plus tabular, image plus text, or similar mixed inputs
- one modality alone has plateaued
- the modalities plausibly carry different evidence
- you are ready to compare fused models against single-modality baselines honestly
Start With The Baseline Rule¶
Do not fuse first. Start with:
- best single-modality baseline A
- best single-modality baseline B
- only then compare a fusion strategy
If you skip that order, you do not know whether fusion helped or merely added noise.
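The baseline-first order above can be sketched in a few lines. This is a minimal illustration with synthetic data: one generated dataset is split column-wise into two stand-in "modalities", and each gets its own baseline scored on the same splits and metric. The variable names and the logistic-regression choice are assumptions for the sketch, not part of the original.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: one dataset split into two "modalities".
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=10, random_state=0)
X_text, X_tab = X[:, :20], X[:, 20:]

# Baseline A and B: one model per modality, same CV folds, same metric.
auc_text = cross_val_score(LogisticRegression(max_iter=1000), X_text, y,
                           cv=5, scoring="roc_auc").mean()
auc_tab = cross_val_score(LogisticRegression(max_iter=1000), X_tab, y,
                          cv=5, scoring="roc_auc").mean()
print(f"text baseline AUC: {auc_text:.3f}, tabular baseline AUC: {auc_tab:.3f}")
```

Only after both numbers are on the table does a fused score mean anything.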
Fusion Choice Table¶
| Strategy | Better first use | Main risk |
|---|---|---|
| early fusion | feature spaces are already aligned and scale-controlled | one modality dominates the input |
| late fusion | each modality already has a decent standalone model | misses deeper cross-modal interactions |
| intermediate fusion | learned encoders need to interact directly | harder to debug and justify |
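The early-fusion risk in the table, one modality dominating the input, usually comes from raw feature magnitude. A minimal sketch of the mitigation, assuming hypothetical feature matrices at very different scales, is to standardize each modality before concatenating:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices for two modalities at very different scales.
X_text = np.random.RandomState(0).normal(0.0, 100.0, size=(200, 8))  # large scale
X_tab = np.random.RandomState(1).normal(0.0, 0.01, size=(200, 4))    # tiny scale

# Early fusion: standardize each modality separately before concatenating,
# so neither dominates purely through raw magnitude.
X_fused = np.hstack([
    StandardScaler().fit_transform(X_text),
    StandardScaler().fit_transform(X_tab),
])
print(X_fused.shape)  # (200, 12)
```

Without the scaling step, a downstream model regularized on raw features would see the tabular columns as near-constant noise.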
Minimal Pattern¶
Late fusion is often the first honest comparison:
```python
# Hypothetical trained models, one per modality, scored on the same held-out set.
text_proba = text_model.predict_proba(X_text)[:, 1]
tab_proba = tab_model.predict_proba(X_tab)[:, 1]

# Late fusion: a simple unweighted average of the predicted probabilities.
fused_proba = 0.5 * text_proba + 0.5 * tab_proba
```
The point is not that averaging is always best. The point is that it gives you a clean first test of whether the second modality is adding anything at all.
What To Inspect First¶
Before moving to more complex fusion, inspect:
- whether each modality is already strong on its own
- whether the new modality fixes specific failure cases
- whether probability scales are comparable
- whether the fused model wins on the same split and metric as the single-modality baselines
If the added modality does not rescue any meaningful slice, fusion may not be worth it.
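The "fixes specific failure cases" check above can be made concrete. A minimal sketch, using hypothetical held-out predictions rather than real model output: take the slice the stronger modality gets wrong and measure how often the second modality rescues it.

```python
import numpy as np

# Hypothetical held-out hard predictions (0/1) and true labels.
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
pred_a = np.where(rng.rand(200) < 0.8, y, 1 - y)  # modality A, ~80% accurate
pred_b = np.where(rng.rand(200) < 0.7, y, 1 - y)  # modality B, ~70% accurate

# The slice that matters: cases A gets wrong. Does B rescue any of them?
a_wrong = pred_a != y
rescued = (pred_b[a_wrong] == y[a_wrong]).mean()
print(f"A errors: {a_wrong.sum()}, fraction B rescues: {rescued:.2f}")
```

If the rescue fraction is near the second modality's base accuracy on random cases, B is not contributing complementary evidence, just independent noise.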
Failure Pattern¶
The common failure is adding a second modality because it sounds richer.
That often creates these problems:
- one modality's raw feature scale dominates the other's
- a noisy modality drags down an otherwise clean baseline
- the fused system becomes harder to explain without real gain
- calibration is lost when probabilities are combined carelessly
Common Mistakes¶
- fusing before building the strongest single-modality baselines
- concatenating features without normalization or scale checks
- evaluating fused and single-modality models on different splits
- assuming a second modality always adds complementary signal
- using intermediate fusion before a simple late-fusion comparison
A Good Fusion Note¶
After one comparison, the learner should be able to say:
- what each modality contributed on its own
- why fusion was tried
- which strategy was used first
- which slice improved, if any
- whether the extra complexity is justified
Practice¶
- Build the strongest baseline for each modality alone.
- Compare late fusion against the better single-modality baseline.
- Decide whether early fusion is justified by the feature structure.
- Name one case where the second modality would probably hurt.
Runnable Example¶
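A minimal, self-contained sketch of the full comparison, using synthetic data in place of real modalities. Splitting one generated dataset into a "text" block and a "tabular" block is an illustrative assumption, as are the logistic-regression models; the point is the shape of the comparison: two single-modality baselines and a late-fusion average, all on the same split and metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: split one dataset column-wise into two "modalities".
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=12, random_state=0)
X_text, X_tab = X[:, :20], X[:, 20:]

idx = np.arange(len(y))
train, test = train_test_split(idx, test_size=0.3, random_state=0, stratify=y)

# One baseline model per modality, trained on the same split.
text_model = LogisticRegression(max_iter=1000).fit(X_text[train], y[train])
tab_model = LogisticRegression(max_iter=1000).fit(X_tab[train], y[train])

text_proba = text_model.predict_proba(X_text[test])[:, 1]
tab_proba = tab_model.predict_proba(X_tab[test])[:, 1]

# Late fusion: unweighted average of predicted probabilities.
fused_proba = 0.5 * text_proba + 0.5 * tab_proba

# Same split, same metric, for all three systems.
for name, proba in [("text", text_proba), ("tab", tab_proba), ("fused", fused_proba)]:
    print(f"{name}: AUC = {roc_auc_score(y[test], proba):.3f}")
```

If the fused AUC does not beat the better single-modality baseline here, the extra complexity is not yet justified.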
Longer Connection¶
Continue with Baseline-First Task Solving for the single-modality baseline discipline and Vision and Text Encoders when the fusion question depends on pretrained representations.