Dimensionality Reduction¶
Scenario: Visualizing High-Dimensional Genomics Data¶
You're analyzing gene expression data with thousands of features per sample. The raw matrix is far too high-dimensional to visualize directly; dimensionality reduction compresses it into 2D/3D views that reveal disease clusters and guide biomarker discovery.
What This Is¶
Dimensionality reduction projects high-dimensional data into fewer dimensions while preserving as much useful structure as possible. The goal is to simplify the data for visualization, faster modeling, or noise reduction — not to create a black box.
When You Use It¶
- visualizing high-dimensional clusters or class separations
- removing noisy or redundant features before modeling
- compressing features for faster training
- debugging a model by inspecting what the data "looks like" in 2D
- checking whether classes are separable before building a complex model
The Three Main Tools¶
| Method | What It Preserves | Best For |
|---|---|---|
| PCA | global variance | linear compression, denoising, preprocessing |
| t-SNE | local neighborhood structure | 2D visualization of clusters |
| UMAP | local + some global structure | 2D/3D visualization, faster than t-SNE |
PCA — Start Here¶
PCA finds the directions of maximum variance and projects the data onto them. It is linear, fast, and reversible: `inverse_transform` recovers the original data exactly if you keep all components, and approximately otherwise.
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total: {pca.explained_variance_ratio_.sum():.3f}")
```
When PCA helps¶
- features are correlated and redundant
- the first few components capture most of the variance
- you need a fast, interpretable reduction step
When PCA fails¶
- the structure is nonlinear (e.g., spirals, nested clusters)
- most of the variance is noise, and the signal lives in low-variance directions that PCA discards
- you use too few components and lose the discriminative structure
Choosing n_components¶
```python
pca = PCA(n_components=0.95)  # keep 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Components kept: {pca.n_components_}")
```
The explained variance curve is your decision tool. If it drops sharply, you can compress aggressively. If it decays slowly, compression will lose signal.
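A minimal sketch of reading that curve numerically, using scikit-learn's digits dataset as a stand-in for your own matrix (an assumption; swap in your data). It fits a full PCA, accumulates the explained variance ratios, and finds the smallest number of components that clears a 90% threshold:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components so we can inspect the whole variance curve
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(f"Components for 90% variance: {n_90} of {X.shape[1]}")
```

Plotting `cumulative` against the component index gives the elbow plot described above; the same array drives both the picture and the numeric cutoff.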
t-SNE — For Visualization¶
t-SNE maps points so that nearby neighbors in high-dimensional space stay close in 2D. It is nonlinear and non-invertible.
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X_scaled)
```
What perplexity controls¶
- low perplexity (~5): focuses on very local neighborhoods, can create tight artificial clusters
- medium perplexity (~30): default, balances local and moderate-range structure
- high perplexity (~100): reveals broader patterns but can blur close neighbors
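A quick way to see these effects is to sweep perplexity on the same data and compare the resulting embeddings side by side. A small sketch using a subsample of scikit-learn's digits dataset (an assumption; any scaled matrix works, as long as perplexity stays below the sample count):

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Subsample to keep three t-SNE runs fast; perplexity must be < n_samples
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X[:300])

embeddings = {}
for perp in (5, 30, 100):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=0)
    embeddings[perp] = tsne.fit_transform(X_scaled)
```

Scatter-plot each entry of `embeddings` in its own panel; remember that the three pictures are not directly comparable point for point, only in the qualitative patterns they reveal.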
What to trust and what not to trust¶
- ✅ cluster membership: if two groups are clearly separated, that usually means something
- ❌ distances between clusters: these are not meaningful
- ❌ cluster sizes and shapes: these depend on density, not real geometry
- ❌ axes: they have no fixed interpretation
UMAP — Faster Alternative¶
UMAP is similar in spirit to t-SNE but usually faster and can preserve more global structure.
```python
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
X_2d = reducer.fit_transform(X_scaled)
```
Key parameters¶
- `n_neighbors`: how many neighbors define "local" (higher = more global structure)
- `min_dist`: how tightly points cluster (lower = tighter, more cluster separation)
The Reduction Ladder¶
- Start with PCA to check variance structure and get a fast linear baseline
- Try t-SNE or UMAP only for visualization — do not use them as preprocessing for classifiers
- Always scale first with `StandardScaler` before any reduction method
Failure Pattern¶
Using t-SNE output as features for a classifier. t-SNE is a visualization tool, not a feature engineering step. The embedding changes with hyperparameters and random seeds, and it cannot transform new data.
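The "cannot transform new data" point is visible directly in scikit-learn's API: fitted PCA exposes `transform` for unseen samples, while `TSNE` only offers `fit_transform` on a fixed dataset. A small sketch with random data (the arrays are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_new = rng.normal(size=(5, 10))

# PCA learns a fixed projection, so unseen samples can be mapped later
pca = PCA(n_components=2).fit(X_train)
print(pca.transform(X_new).shape)  # (5, 2)

# TSNE has no transform method at all — the embedding exists
# only for the exact points it was fit on
print(hasattr(TSNE(n_components=2), "transform"))  # False
```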
Another failure: interpreting inter-cluster distances in a t-SNE plot as meaningful. They are not.
Common Mistakes¶
- forgetting to scale data before PCA (PCA is sensitive to feature scales)
- using too many PCA components and keeping all the noise
- treating t-SNE clusters as ground truth for labeling
- using t-SNE on very large datasets without downsampling first (it is slow)
- comparing two t-SNE plots with different perplexity values as if they show the same thing
Practice¶
- Apply PCA to a dataset and plot the explained variance curve. How many components capture 90%?
- Visualize the same data with t-SNE at perplexity 5, 30, and 100. How does the picture change?
- Compare PCA 2D versus UMAP 2D on the same data. Which reveals more structure?
- Use PCA as a preprocessing step before logistic regression and compare against the full feature set.
- Explain why you should not use t-SNE embeddings as input features for a model.
Case Study: Image Compression with PCA¶
Classic image codecs such as JPEG rely on the discrete cosine transform, a fixed linear transform closely related to PCA: both concentrate most of a signal's energy in a few coefficients so the rest can be discarded with little visible loss. PCA itself makes a good compression experiment — reconstructing images from only the top components shows how much visual quality survives aggressive reduction.
Expanded Quick Quiz¶
Why scale data before PCA?
Answer: PCA is sensitive to feature scales; unscaled data can make high-variance features dominate the components.
When should you use t-SNE over PCA?
Answer: For visualizing non-linear structures and local neighborhoods; PCA is better for linear variance preservation.
What does UMAP offer over t-SNE?
Answer: Faster computation and better preservation of global structure, making it scalable for larger datasets.
In the genomics scenario, why reduce dimensions?
Answer: To visualize and analyze high-dimensional gene data, identifying clusters that might indicate diseases.
Progress Checkpoint¶
- [ ] Applied PCA and analyzed explained variance.
- [ ] Generated t-SNE and UMAP visualizations.
- [ ] Compared methods on the same dataset.
- [ ] Used reduction for preprocessing in a model.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Clustering and Low-Dimensional Views" in the Classical ML track. Share your dimensionality reduction plot in the academy Discord!
Further Reading¶
- Scikit-Learn Decomposition Guide.
- t-SNE and UMAP papers.
- PCA tutorials for beginners.
Runnable Example¶
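A minimal end-to-end sketch tying the ladder together, using a subsample of scikit-learn's digits dataset as a stand-in for the genomics scenario (64 features per sample instead of thousands of genes; substitute your own expression matrix):

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Digits stands in for high-dimensional expression data
X, y = load_digits(return_X_y=True)
X_sub, y_sub = X[:500], y[:500]          # subsample: t-SNE is slow on large n
X_scaled = StandardScaler().fit_transform(X_sub)

# Step 1: PCA baseline - how compressible is the data?
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"{X.shape[1]} features -> {pca.n_components_} components for 95% variance")

# Step 2: nonlinear 2D view, for visualization only
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)
print(f"t-SNE embedding shape: {X_2d.shape}")
```

Color a scatter plot of `X_2d` by `y_sub` to check whether the classes separate; if they do, that is evidence the labels are learnable, not a feature set to train on.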
Longer Connection¶
Continue with Clustering and Low-Dimensional Views for the clustering step that often follows reduction, and Feature Selection for an alternative way to reduce dimensionality by dropping features instead of transforming them.