Multi-Modal Fusion¶
What This Is¶
Multi-modal fusion is about one decision:
- should you combine modalities at all, and if so, should the combination happen early, late, or in the middle of the model?
Fusion is only useful when the second modality adds complementary signal. If one modality already solves the task, fusion adds complexity without improving performance.
When You Use It¶
- the task has text plus tabular, image plus text, or similar mixed inputs
- one modality alone has plateaued
- the modalities plausibly carry different evidence
- you are ready to compare fused models against single-modality baselines honestly
Start With The Baseline Rule¶
Do not fuse first. Start with:
- best single-modality baseline A
- best single-modality baseline B
- only then compare a fusion strategy
If you skip that order, you do not know whether fusion helped or merely added noise.
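The baseline-first order above can be sketched in a few lines. This is a minimal illustration with synthetic data: one generated dataset is split column-wise into two stand-in "modalities", and each gets its own baseline scored on the same splits and metric. The variable names and the logistic-regression choice are assumptions for the sketch, not part of the original.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: one dataset split into two "modalities".
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=10, random_state=0)
X_text, X_tab = X[:, :20], X[:, 20:]

# Baseline A and B: one model per modality, same CV folds, same metric.
auc_text = cross_val_score(LogisticRegression(max_iter=1000), X_text, y,
                           cv=5, scoring="roc_auc").mean()
auc_tab = cross_val_score(LogisticRegression(max_iter=1000), X_tab, y,
                          cv=5, scoring="roc_auc").mean()
print(f"text baseline AUC: {auc_text:.3f}, tabular baseline AUC: {auc_tab:.3f}")
```

Only after both numbers are on the table does a fused score mean anything.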
Fusion Choice Table¶
| Strategy | Better first use | Main risk |
|---|---|---|
| early fusion | feature spaces are already aligned and scale-controlled | one modality dominates the input |
| late fusion | each modality already has a decent standalone model | misses deeper cross-modal interactions |
| intermediate fusion | learned encoders need to interact directly | harder to debug and justify |
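The early-fusion risk in the table, one modality dominating the input, usually comes from raw feature magnitude. A minimal sketch of the mitigation, assuming hypothetical feature matrices at very different scales, is to standardize each modality before concatenating:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices for two modalities at very different scales.
X_text = np.random.RandomState(0).normal(0.0, 100.0, size=(200, 8))  # large scale
X_tab = np.random.RandomState(1).normal(0.0, 0.01, size=(200, 4))    # tiny scale

# Early fusion: standardize each modality separately before concatenating,
# so neither dominates purely through raw magnitude.
X_fused = np.hstack([
    StandardScaler().fit_transform(X_text),
    StandardScaler().fit_transform(X_tab),
])
print(X_fused.shape)  # (200, 12)
```

Without the scaling step, a downstream model regularized on raw features would see the tabular columns as near-constant noise.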
Minimal Pattern¶
Late fusion is often the first honest comparison:
```python
# Hypothetical trained models, one per modality, scored on the same held-out set.
text_proba = text_model.predict_proba(X_text)[:, 1]
tab_proba = tab_model.predict_proba(X_tab)[:, 1]

# Late fusion: a simple unweighted average of the predicted probabilities.
fused_proba = 0.5 * text_proba + 0.5 * tab_proba
```
The point is not that averaging is always best. The point is that it gives you a clean first test of whether the second modality is adding anything at all.
What To Inspect First¶
Before moving to more complex fusion, inspect:
- whether each modality is already strong on its own
- whether the new modality fixes specific failure cases
- whether probability scales are comparable
- whether the fused model wins on the same split and metric as the single-modality baselines
If the added modality does not rescue any meaningful slice, fusion may not be worth it.
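The "fixes specific failure cases" check above can be made concrete. A minimal sketch, using hypothetical held-out predictions rather than real model output: take the slice the stronger modality gets wrong and measure how often the second modality rescues it.

```python
import numpy as np

# Hypothetical held-out hard predictions (0/1) and true labels.
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
pred_a = np.where(rng.rand(200) < 0.8, y, 1 - y)  # modality A, ~80% accurate
pred_b = np.where(rng.rand(200) < 0.7, y, 1 - y)  # modality B, ~70% accurate

# The slice that matters: cases A gets wrong. Does B rescue any of them?
a_wrong = pred_a != y
rescued = (pred_b[a_wrong] == y[a_wrong]).mean()
print(f"A errors: {a_wrong.sum()}, fraction B rescues: {rescued:.2f}")
```

If the rescue fraction is near the second modality's base accuracy on random cases, B is not contributing complementary evidence, just independent noise.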
Failure Pattern¶
The common failure is adding a second modality because it sounds richer.
That often creates these problems:
- one modality's raw feature scale dominates the other's
- a noisy modality drags down an otherwise clean baseline
- the fused system becomes harder to explain without real gain
- calibration is lost when probabilities are combined carelessly
Common Mistakes¶
- fusing before building the strongest single-modality baselines
- concatenating features without normalization or scale checks
- evaluating fused and single-modality models on different splits
- assuming a second modality always adds complementary signal
- using intermediate fusion before a simple late-fusion comparison
A Good Fusion Note¶
After one comparison, the learner should be able to say:
- what each modality contributed on its own
- why fusion was tried
- which strategy was used first
- which slice improved, if any
- whether the extra complexity is justified
Practice¶
- Build the strongest baseline for each modality alone.
- Compare late fusion against the better single-modality baseline.
- Decide whether early fusion is justified by the feature structure.
- Name one case where the second modality would probably hurt.
Runnable Example¶
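A minimal, self-contained sketch of the full comparison, using synthetic data in place of real modalities. Splitting one generated dataset into a "text" block and a "tabular" block is an illustrative assumption, as are the logistic-regression models; the point is the shape of the comparison: two single-modality baselines and a late-fusion average, all on the same split and metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: split one dataset column-wise into two "modalities".
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=12, random_state=0)
X_text, X_tab = X[:, :20], X[:, 20:]

idx = np.arange(len(y))
train, test = train_test_split(idx, test_size=0.3, random_state=0, stratify=y)

# One baseline model per modality, trained on the same split.
text_model = LogisticRegression(max_iter=1000).fit(X_text[train], y[train])
tab_model = LogisticRegression(max_iter=1000).fit(X_tab[train], y[train])

text_proba = text_model.predict_proba(X_text[test])[:, 1]
tab_proba = tab_model.predict_proba(X_tab[test])[:, 1]

# Late fusion: unweighted average of predicted probabilities.
fused_proba = 0.5 * text_proba + 0.5 * tab_proba

# Same split, same metric, for all three systems.
for name, proba in [("text", text_proba), ("tab", tab_proba), ("fused", fused_proba)]:
    print(f"{name}: AUC = {roc_auc_score(y[test], proba):.3f}")
```

If the fused AUC does not beat the better single-modality baseline here, the extra complexity is not yet justified.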
Longer Connection¶
Continue with Baseline-First Task Solving for the single-modality baseline discipline and Vision and Text Encoders when the fusion question depends on pretrained representations.