Dimensionality Reduction¶
Scenario: Visualizing High-Dimensional Genomics Data¶
You're analyzing gene expression data with thousands of features per sample. The raw matrix is far too high-dimensional to visualize directly; dimensionality reduction compresses it into 2D/3D views that reveal disease clusters and guide biomarker discovery.
What This Is¶
Dimensionality reduction projects high-dimensional data into fewer dimensions while preserving as much useful structure as possible. The goal is to simplify the data for visualization, faster modeling, or noise reduction — not to create a black box.
When You Use It¶
- visualizing high-dimensional clusters or class separations
- removing noisy or redundant features before modeling
- compressing features for faster training
- debugging a model by inspecting what the data "looks like" in 2D
- checking whether classes are separable before building a complex model
The Three Main Tools¶
| Method | What It Preserves | Best For |
|---|---|---|
| PCA | global variance | linear compression, denoising, preprocessing |
| t-SNE | local neighborhood structure | 2D visualization of clusters |
| UMAP | local + some global structure | 2D/3D visualization, faster than t-SNE |
PCA — Start Here¶
PCA finds the directions of maximum variance and projects the data onto them. It is linear, fast, and reversible: `inverse_transform` recovers the original data exactly if you keep all components, and approximately otherwise.
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total: {pca.explained_variance_ratio_.sum():.3f}")
```
When PCA helps¶
- features are correlated and redundant
- the first few components capture most of the variance
- you need a fast, interpretable reduction step
When PCA fails¶
- the structure is nonlinear (e.g., spirals, nested clusters)
- most of the variance is noise, and the signal lives in low-variance directions that PCA discards
- you use too few components and lose the discriminative structure
Choosing n_components¶
```python
pca = PCA(n_components=0.95)  # keep 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Components kept: {pca.n_components_}")
```
The explained variance curve is your decision tool. If it drops sharply, you can compress aggressively. If it decays slowly, compression will lose signal.
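A minimal sketch of reading that curve numerically, using scikit-learn's digits dataset as a stand-in for your own matrix (an assumption; swap in your data). It fits a full PCA, accumulates the explained variance ratios, and finds the smallest number of components that clears a 90% threshold:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components so we can inspect the whole variance curve
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(f"Components for 90% variance: {n_90} of {X.shape[1]}")
```

Plotting `cumulative` against the component index gives the elbow plot described above; the same array drives both the picture and the numeric cutoff.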
t-SNE — For Visualization¶
t-SNE maps points so that nearby neighbors in high-dimensional space stay close in 2D. It is nonlinear and non-invertible.
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X_scaled)
```
What perplexity controls¶
- low perplexity (~5): focuses on very local neighborhoods, can create tight artificial clusters
- medium perplexity (~30): default, balances local and moderate-range structure
- high perplexity (~100): reveals broader patterns but can blur close neighbors
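A quick way to see these effects is to sweep perplexity on the same data and compare the resulting embeddings side by side. A small sketch using a subsample of scikit-learn's digits dataset (an assumption; any scaled matrix works, as long as perplexity stays below the sample count):

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# Subsample to keep three t-SNE runs fast; perplexity must be < n_samples
X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X[:300])

embeddings = {}
for perp in (5, 30, 100):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=0)
    embeddings[perp] = tsne.fit_transform(X_scaled)
```

Scatter-plot each entry of `embeddings` in its own panel; remember that the three pictures are not directly comparable point for point, only in the qualitative patterns they reveal.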
What to trust and what not to trust¶
- ✅ cluster membership: if two groups are clearly separated, that usually means something
- ❌ distances between clusters: these are not meaningful
- ❌ cluster sizes and shapes: these depend on density, not real geometry
- ❌ axes: they have no fixed interpretation
UMAP — Faster Alternative¶
UMAP is similar in spirit to t-SNE but usually faster and can preserve more global structure.
```python
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
X_2d = reducer.fit_transform(X_scaled)
```
Key parameters¶
- `n_neighbors`: how many neighbors define "local" (higher = more global structure)
- `min_dist`: how tightly points cluster (lower = tighter, more cluster separation)
The Reduction Ladder¶
- Start with PCA to check variance structure and get a fast linear baseline
- Try t-SNE or UMAP only for visualization — do not use them as preprocessing for classifiers
- Always scale first with `StandardScaler` before any reduction method
Failure Pattern¶
Using t-SNE output as features for a classifier. t-SNE is a visualization tool, not a feature engineering step. The embedding changes with hyperparameters and random seeds, and it cannot transform new data.
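The "cannot transform new data" point is visible directly in scikit-learn's API: fitted PCA exposes `transform` for unseen samples, while `TSNE` only offers `fit_transform` on a fixed dataset. A small sketch with random data (the arrays are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_new = rng.normal(size=(5, 10))

# PCA learns a fixed projection, so unseen samples can be mapped later
pca = PCA(n_components=2).fit(X_train)
print(pca.transform(X_new).shape)  # (5, 2)

# TSNE has no transform method at all — the embedding exists
# only for the exact points it was fit on
print(hasattr(TSNE(n_components=2), "transform"))  # False
```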
Another failure: interpreting inter-cluster distances in a t-SNE plot as meaningful. They are not.
Common Mistakes¶
- forgetting to scale data before PCA (PCA is sensitive to feature scales)
- using too many PCA components and keeping all the noise
- treating t-SNE clusters as ground truth for labeling
- using t-SNE on very large datasets without downsampling first (it is slow)
- comparing two t-SNE plots with different perplexity values as if they show the same thing
Practice¶
- Apply PCA to a dataset and plot the explained variance curve. How many components capture 90%?
- Visualize the same data with t-SNE at perplexity 5, 30, and 100. How does the picture change?
- Compare PCA 2D versus UMAP 2D on the same data. Which reveals more structure?
- Use PCA as a preprocessing step before logistic regression and compare against the full feature set.
- Explain why you should not use t-SNE embeddings as input features for a model.
Case Study: Image Compression with PCA¶
Classic image codecs such as JPEG rely on the discrete cosine transform, a fixed linear transform closely related to PCA: both concentrate most of a signal's energy in a few coefficients so the rest can be discarded with little visible loss. PCA itself makes a good compression experiment — reconstructing images from only the top components shows how much visual quality survives aggressive reduction.
Expanded Quick Quiz¶
Why scale data before PCA?
Answer: PCA is sensitive to feature scales; unscaled data can make high-variance features dominate the components.
When should you use t-SNE over PCA?
Answer: For visualizing non-linear structures and local neighborhoods; PCA is better for linear variance preservation.
What does UMAP offer over t-SNE?
Answer: Faster computation and better preservation of global structure, making it scalable for larger datasets.
In the genomics scenario, why reduce dimensions?
Answer: To visualize and analyze high-dimensional gene data, identifying clusters that might indicate diseases.
Progress Checkpoint¶
- [ ] Applied PCA and analyzed explained variance.
- [ ] Generated t-SNE and UMAP visualizations.
- [ ] Compared methods on the same dataset.
- [ ] Used reduction for preprocessing in a model.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Clustering and Low-Dimensional Views" in the Classical ML track. Share your dimensionality reduction plot in the academy Discord!
Further Reading¶
- Scikit-Learn Decomposition Guide.
- t-SNE and UMAP papers.
- PCA tutorials for beginners.
Runnable Example¶
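A minimal end-to-end sketch tying the ladder together, using a subsample of scikit-learn's digits dataset as a stand-in for the genomics scenario (64 features per sample instead of thousands of genes; substitute your own expression matrix):

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Digits stands in for high-dimensional expression data
X, y = load_digits(return_X_y=True)
X_sub, y_sub = X[:500], y[:500]          # subsample: t-SNE is slow on large n
X_scaled = StandardScaler().fit_transform(X_sub)

# Step 1: PCA baseline - how compressible is the data?
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"{X.shape[1]} features -> {pca.n_components_} components for 95% variance")

# Step 2: nonlinear 2D view, for visualization only
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)
print(f"t-SNE embedding shape: {X_2d.shape}")
```

Color a scatter plot of `X_2d` by `y_sub` to check whether the classes separate; if they do, that is evidence the labels are learnable, not a feature set to train on.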
Longer Connection¶
Continue with Clustering and Low-Dimensional Views for the clustering step that often follows reduction, and Feature Selection for an alternative way to reduce dimensionality by dropping features instead of transforming them.