
Advanced Clustering and Dimensionality Reduction

Scenario: Segmenting Customer Behavior for Marketing

You're analyzing customer purchase data for a retail company with thousands of features (demographics, browsing history, transactions). Simple methods such as k-means fail on non-spherical groups, so you need advanced clustering and dimensionality reduction to uncover hidden segments, visualize patterns, and target marketing campaigns effectively.

What This Is

Clustering groups data points by similarity, while dimensionality reduction compresses high-dimensional data into fewer features. These techniques reveal patterns in unlabeled data, reduce noise, and prepare data for visualization or downstream tasks. Unlike simple k-means, advanced methods handle non-spherical clusters, high dimensions, and manifold structures.
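As a concrete contrast with simple k-means, this sketch (using scikit-learn's two-moons toy data; the parameters are illustrative) shows DBSCAN recovering non-spherical clusters that k-means splits incorrectly:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: non-spherical clusters
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# k-means assumes roughly spherical clusters and cuts straight through the moons
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups by density and follows the arbitrary shapes
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means ARI:", round(adjusted_rand_score(y_true, km_labels), 3))
print("DBSCAN ARI:", round(adjusted_rand_score(y_true, db_labels), 3))
```

Adjusted Rand Index (ARI) measures agreement with the true grouping; DBSCAN should score near 1.0 here while k-means scores much lower.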

When You Use It

  • exploring unlabeled datasets for hidden groups
  • visualizing high-dimensional data (e.g., t-SNE for embeddings)
  • preprocessing before supervised learning (e.g., reducing features)
  • detecting anomalies or outliers in complex data
  • comparing cluster quality across methods

Learning Objectives

By the end of this topic, you should be able to:

  • Choose appropriate clustering algorithms for different data shapes.
  • Apply dimensionality reduction for visualization and preprocessing.
  • Evaluate cluster quality using metrics and visualizations.
  • Handle scalability issues in large datasets.
  • Interpret results to guide model decisions.

Tooling

  • sklearn.cluster.DBSCAN for density-based clustering
  • sklearn.cluster.AgglomerativeClustering for hierarchical clustering
  • sklearn.manifold.TSNE for 2D/3D visualization
  • umap.UMAP (from the umap-learn package) for faster, scalable manifold learning
  • sklearn.decomposition.PCA for linear reduction (baseline)
  • sklearn.metrics.silhouette_score for cluster evaluation
  • matplotlib or seaborn for plotting clusters

Minimal Example

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Synthetic stand-in for your high-dimensional data (e.g., 50 features)
X, _ = make_blobs(n_samples=500, n_features=50, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)  # DBSCAN's eps is a distance, so scale first

# Reduce to 2D for visualization
tsne = TSNE(n_components=2, random_state=42)
X_embedded = tsne.fit_transform(X)

# Cluster in the original space (or the embedded one)
# eps is tuned for this synthetic data; use a k-distance plot on real data
dbscan = DBSCAN(eps=4.0, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the embedding colored by cluster label (-1 marks noise)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=labels, cmap='viridis')
plt.title('Clusters via DBSCAN + t-SNE')
plt.show()

Key Concepts Explained

Clustering Types

  • Density-Based (DBSCAN): Finds arbitrary-shaped clusters by density. Great for non-spherical data but sensitive to eps and min_samples.
  • Hierarchical (Agglomerative): Builds a tree of clusters. Useful for understanding nested groups; visualize with dendrograms.
  • Manifold-Assisted (UMAP/t-SNE): Strictly dimensionality reduction rather than clustering, but commonly paired with it: embed first, then cluster in the embedding. t-SNE for exploration; UMAP for preprocessing (faster, preserves more global structure).
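To make the dendrogram idea concrete, here is a minimal sketch (synthetic data and parameters are illustrative) using SciPy's linkage, which implements the same Ward criterion as AgglomerativeClustering:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Three well-separated blobs; hierarchical clustering should recover them
X, _ = make_blobs(n_samples=60, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=7)

# Ward linkage merges the pair of clusters that least increases total variance
Z = linkage(X, method='ward')

# Cut the tree into 3 flat clusters; pass Z to scipy's dendrogram() to visualize it
labels = fcluster(Z, t=3, criterion='maxclust')
print(np.bincount(labels)[1:])  # cluster sizes (fcluster labels start at 1)
```

Cutting the same tree at different depths gives the nested groupings that flat methods like k-means cannot express.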

Dimensionality Reduction

  • Linear (PCA): Fast baseline; assumes linear relationships.
  • Non-Linear (UMAP/t-SNE): Captures complex manifolds. Use t-SNE for viz (slow); UMAP for embedding (scalable).
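Before reaching for a non-linear method, it is worth checking how much the linear baseline already captures; a sketch with illustrative synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# 20-dimensional blob data whose structure is mostly linear
X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# For blob-like data, the directions separating the centers dominate the variance
print(X_2d.shape)
print(round(pca.explained_variance_ratio_.sum(), 3))
```

If two components already explain most of the variance, a PCA projection may suffice; if not, a manifold method like UMAP is worth the extra cost.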

Evaluation

  • Silhouette Score: Measures cluster cohesion and separation (-1 to 1; higher is better).
  • Visual Inspection: Plot embeddings; check for meaningful separations.
  • Stability: Re-run with different seeds; compare consistency.
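One workflow these bullets support is scanning candidate cluster counts with the silhouette score; a minimal sketch (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four blobs on a grid; the score should peak at the true cluster count
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.8, random_state=1)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

On real data the peak is rarely this clean, which is why the score should be combined with visual inspection and stability checks.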

What Can Go Wrong

  • Wrong Algorithm: Spherical clusters? Use k-means. Arbitrary shapes? DBSCAN.
  • Parameter Tuning: DBSCAN's eps via k-distance plot; t-SNE perplexity affects local structure.
  • Scalability: Exact t-SNE is O(n²); scikit-learn's default Barnes-Hut variant is O(n log n) but still slow in practice. Use UMAP for large data.
  • Over-Interpretation: Clusters may not reflect real groups—validate with domain knowledge.
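The k-distance heuristic for choosing eps can be sketched as follows (synthetic data; the elbow is normally read off a plot):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8]],
                  cluster_std=0.6, random_state=3)

k = 5  # match DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)     # column 0 is each point's distance to itself (0.0)
k_dist = np.sort(distances[:, -1])  # ascending k-th neighbor distances

# Plot k_dist and pick eps near the elbow; most points sit below it
print(round(float(np.median(k_dist)), 2))
```

Points past the elbow are in sparse regions and will tend to become noise (-1) at that eps.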

Inspection Habits

  • Visualize before/after reduction.
  • Compute silhouette scores for each method.
  • Check cluster sizes—imbalanced clusters may indicate issues.
  • Compare with random baselines.
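The random-baseline habit can be checked directly: shuffle the labels and compare silhouette scores (a sketch with illustrative synthetic data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                  cluster_std=0.7, random_state=5)
labels = KMeans(n_clusters=2, n_init=10, random_state=5).fit_predict(X)

# Permuting labels keeps cluster sizes but destroys any real structure
rng = np.random.default_rng(5)
shuffled = rng.permutation(labels)

print(round(silhouette_score(X, labels), 3))    # high: well-separated clusters
print(round(silhouette_score(X, shuffled), 3))  # near 0: no real structure
```

If your real clustering barely beats the shuffled baseline, the "segments" are probably artifacts.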

Quick Quiz

When should you use DBSCAN over k-means?

Answer: For non-spherical or noisy clusters. DBSCAN handles arbitrary shapes and outliers without assuming cluster count.

What's the main difference between t-SNE and UMAP?

Answer: t-SNE preserves local structure better for viz; UMAP is faster, preserves global structure, and scales better.

How do you evaluate clustering quality?

Answer: Use silhouette score for cohesion/separation, and visualize embeddings to check meaningful groupings.

Further Reading

  • Scikit-Learn clustering user guide.
  • UMAP documentation for manifold learning.
  • DBSCAN paper (Ester et al., 1996) for density-based clustering.
  • t-SNE paper (van der Maaten & Hinton, 2008) for visualization.
  • Experiment with parameters on toy datasets (e.g., moons or blobs from sklearn).

See examples/classical-ml-recipes/advanced_clustering_demo.py for a complete workflow.

Case Study: Netflix Recommendations via Clustering

Netflix has applied clustering to user viewing data to group similar preferences, enabling personalized recommendations. Dimensionality reduction helps visualize and refine these groups, supporting measurable improvements in user engagement.

Expanded Quick Quiz

How does t-SNE differ from PCA for dimensionality reduction?

Answer: t-SNE preserves local structure for visualization, while PCA focuses on global variance for linear reduction.

What does silhouette score measure?

Answer: How similar an object is to its own cluster vs. other clusters; higher scores indicate better-defined clusters.

In the customer segmentation scenario, why reduce dimensions first?

Answer: To remove noise and speed up clustering, making it easier to identify meaningful customer groups.
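A reduce-then-cluster pipeline for the customer segmentation scenario might look like this sketch (the component counts and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for high-dimensional customer data: 50 features, 5 latent segments
X, _ = make_blobs(n_samples=400, n_features=50, centers=5, random_state=2)

# Scale, strip noise dimensions with PCA, then cluster in the reduced space
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KMeans(n_clusters=5, n_init=10, random_state=2),
)
labels = pipe.fit_predict(X)
print(len(set(labels)))  # number of recovered segments
```

On real customer data you would choose n_components from PCA's explained variance and validate the resulting segments against domain knowledge.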

Progress Checkpoint

  • [ ] Applied DBSCAN and hierarchical clustering to sample data.
  • [ ] Used t-SNE or UMAP for dimensionality reduction and visualization.
  • [ ] Evaluated clusters with silhouette scores and plots.
  • [ ] Interpreted results for decision-making (e.g., marketing segments).
  • [ ] Answered quiz questions without peeking.

Milestone: Complete this to unlock "Dimensionality Reduction" in the Classical ML track. Share your cluster visualization in the academy Discord!
