
Advanced Clustering and Dimensionality Reduction

Scenario: Segmenting Customer Behavior for Marketing

You're analyzing customer purchase data for a retail company with thousands of features (demographics, browsing history, transactions). Simple methods such as k-means fail on non-spherical groups, so you need advanced clustering and dimensionality reduction to uncover hidden segments, visualize patterns, and target marketing campaigns effectively.

What This Is

Clustering groups data points by similarity, while dimensionality reduction compresses high-dimensional data into fewer features. These techniques reveal patterns in unlabeled data, reduce noise, and prepare data for visualization or downstream tasks. Unlike simple k-means, advanced methods handle non-spherical clusters, high dimensions, and manifold structures.
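As a concrete contrast with simple k-means, this sketch (using scikit-learn's two-moons toy data; the parameters are illustrative) shows DBSCAN recovering non-spherical clusters that k-means splits incorrectly:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: non-spherical clusters
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# k-means assumes roughly spherical clusters and cuts straight through the moons
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups by density and follows the arbitrary shapes
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means ARI:", round(adjusted_rand_score(y_true, km_labels), 3))
print("DBSCAN ARI:", round(adjusted_rand_score(y_true, db_labels), 3))
```

Adjusted Rand Index (ARI) measures agreement with the true grouping; DBSCAN should score near 1.0 here while k-means scores much lower.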

When You Use It

  • exploring unlabeled datasets for hidden groups
  • visualizing high-dimensional data (e.g., t-SNE for embeddings)
  • preprocessing before supervised learning (e.g., reducing features)
  • detecting anomalies or outliers in complex data
  • comparing cluster quality across methods

Learning Objectives

By the end of this topic, you should be able to:

  • Choose appropriate clustering algorithms for different data shapes.
  • Apply dimensionality reduction for visualization and preprocessing.
  • Evaluate cluster quality using metrics and visualizations.
  • Handle scalability issues in large datasets.
  • Interpret results to guide model decisions.

Tooling

  • sklearn.cluster.DBSCAN for density-based clustering
  • sklearn.cluster.AgglomerativeClustering for hierarchical clustering
  • sklearn.manifold.TSNE for 2D/3D visualization
  • umap.UMAP (from the umap-learn package) for faster, scalable manifold learning
  • sklearn.decomposition.PCA for linear reduction (baseline)
  • sklearn.metrics.silhouette_score for cluster evaluation
  • matplotlib or seaborn for plotting clusters

Minimal Example

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Synthetic stand-in for your high-dimensional data (e.g., 50 features)
X, _ = make_blobs(n_samples=500, n_features=50, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)  # DBSCAN's eps is a distance, so scale first

# Reduce to 2D for visualization
tsne = TSNE(n_components=2, random_state=42)
X_embedded = tsne.fit_transform(X)

# Cluster in the original space (or the embedded one)
# eps is tuned for this synthetic data; use a k-distance plot on real data
dbscan = DBSCAN(eps=4.0, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the embedding colored by cluster label (-1 marks noise)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=labels, cmap='viridis')
plt.title('Clusters via DBSCAN + t-SNE')
plt.show()

Key Concepts Explained

Clustering Types

  • Density-Based (DBSCAN): Finds arbitrary-shaped clusters by density. Great for non-spherical data but sensitive to eps and min_samples.
  • Hierarchical (Agglomerative): Builds a tree of clusters. Useful for understanding nested groups; visualize with dendrograms.
  • Manifold-Assisted (UMAP/t-SNE): Strictly dimensionality reduction rather than clustering, but commonly paired with it: embed first, then cluster in the embedding. t-SNE for exploration; UMAP for preprocessing (faster, preserves more global structure).
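To make the dendrogram idea concrete, here is a minimal sketch (synthetic data and parameters are illustrative) using SciPy's linkage, which implements the same Ward criterion as AgglomerativeClustering:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Three well-separated blobs; hierarchical clustering should recover them
X, _ = make_blobs(n_samples=60, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=7)

# Ward linkage merges the pair of clusters that least increases total variance
Z = linkage(X, method='ward')

# Cut the tree into 3 flat clusters; pass Z to scipy's dendrogram() to visualize it
labels = fcluster(Z, t=3, criterion='maxclust')
print(np.bincount(labels)[1:])  # cluster sizes (fcluster labels start at 1)
```

Cutting the same tree at different depths gives the nested groupings that flat methods like k-means cannot express.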

Dimensionality Reduction

  • Linear (PCA): Fast baseline; assumes linear relationships.
  • Non-Linear (UMAP/t-SNE): Captures complex manifolds. Use t-SNE for viz (slow); UMAP for embedding (scalable).
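Before reaching for a non-linear method, it is worth checking how much the linear baseline already captures; a sketch with illustrative synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# 20-dimensional blob data whose structure is mostly linear
X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# For blob-like data, the directions separating the centers dominate the variance
print(X_2d.shape)
print(round(pca.explained_variance_ratio_.sum(), 3))
```

If two components already explain most of the variance, a PCA projection may suffice; if not, a manifold method like UMAP is worth the extra cost.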

Evaluation

  • Silhouette Score: Measures cluster cohesion and separation (-1 to 1; higher is better).
  • Visual Inspection: Plot embeddings; check for meaningful separations.
  • Stability: Re-run with different seeds; compare consistency.
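One workflow these bullets support is scanning candidate cluster counts with the silhouette score; a minimal sketch (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four blobs on a grid; the score should peak at the true cluster count
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.8, random_state=1)

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

On real data the peak is rarely this clean, which is why the score should be combined with visual inspection and stability checks.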

What Can Go Wrong

  • Wrong Algorithm: Spherical clusters? Use k-means. Arbitrary shapes? DBSCAN.
  • Parameter Tuning: DBSCAN's eps via k-distance plot; t-SNE perplexity affects local structure.
  • Scalability: Exact t-SNE is O(n²); scikit-learn's default Barnes-Hut variant is O(n log n) but still slow in practice. Use UMAP for large data.
  • Over-Interpretation: Clusters may not reflect real groups—validate with domain knowledge.
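The k-distance heuristic for choosing eps can be sketched as follows (synthetic data; the elbow is normally read off a plot):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8]],
                  cluster_std=0.6, random_state=3)

k = 5  # match DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)     # column 0 is each point's distance to itself (0.0)
k_dist = np.sort(distances[:, -1])  # ascending k-th neighbor distances

# Plot k_dist and pick eps near the elbow; most points sit below it
print(round(float(np.median(k_dist)), 2))
```

Points past the elbow are in sparse regions and will tend to become noise (-1) at that eps.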

Inspection Habits

  • Visualize before/after reduction.
  • Compute silhouette scores for each method.
  • Check cluster sizes—imbalanced clusters may indicate issues.
  • Compare with random baselines.
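The random-baseline habit can be checked directly: shuffle the labels and compare silhouette scores (a sketch with illustrative synthetic data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                  cluster_std=0.7, random_state=5)
labels = KMeans(n_clusters=2, n_init=10, random_state=5).fit_predict(X)

# Permuting labels keeps cluster sizes but destroys any real structure
rng = np.random.default_rng(5)
shuffled = rng.permutation(labels)

print(round(silhouette_score(X, labels), 3))    # high: well-separated clusters
print(round(silhouette_score(X, shuffled), 3))  # near 0: no real structure
```

If your real clustering barely beats the shuffled baseline, the "segments" are probably artifacts.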

Quick Quiz

When should you use DBSCAN over k-means?

Answer: For non-spherical or noisy clusters. DBSCAN handles arbitrary shapes and outliers without assuming cluster count.

What's the main difference between t-SNE and UMAP?

Answer: t-SNE preserves local structure better for viz; UMAP is faster, preserves global structure, and scales better.

How do you evaluate clustering quality?

Answer: Use silhouette score for cohesion/separation, and visualize embeddings to check meaningful groupings.

Further Reading

  • Scikit-Learn clustering user guide.
  • UMAP documentation for manifold learning.
  • DBSCAN paper (Ester et al., 1996) for density-based clustering.
  • t-SNE paper (van der Maaten & Hinton, 2008) for visualization.
  • Experiment with parameters on toy datasets (e.g., moons or blobs from sklearn).

See examples/classical-ml-recipes/advanced_clustering_demo.py for a complete workflow.

Case Study: Netflix Recommendations via Clustering

Netflix has applied clustering to user viewing data to group similar preferences, enabling personalized recommendations. Dimensionality reduction helps visualize and refine these groups, supporting measurable improvements in user engagement.

Expanded Quick Quiz

How does t-SNE differ from PCA for dimensionality reduction?

Answer: t-SNE preserves local structure for visualization, while PCA focuses on global variance for linear reduction.

What does silhouette score measure?

Answer: How similar an object is to its own cluster vs. other clusters; higher scores indicate better-defined clusters.

In the customer segmentation scenario, why reduce dimensions first?

Answer: To remove noise and speed up clustering, making it easier to identify meaningful customer groups.
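A reduce-then-cluster pipeline for the customer segmentation scenario might look like this sketch (the component counts and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for high-dimensional customer data: 50 features, 5 latent segments
X, _ = make_blobs(n_samples=400, n_features=50, centers=5, random_state=2)

# Scale, strip noise dimensions with PCA, then cluster in the reduced space
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KMeans(n_clusters=5, n_init=10, random_state=2),
)
labels = pipe.fit_predict(X)
print(len(set(labels)))  # number of recovered segments
```

On real customer data you would choose n_components from PCA's explained variance and validate the resulting segments against domain knowledge.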

Progress Checkpoint

  • [ ] Applied DBSCAN and hierarchical clustering to sample data.
  • [ ] Used t-SNE or UMAP for dimensionality reduction and visualization.
  • [ ] Evaluated clusters with silhouette scores and plots.
  • [ ] Interpreted results for decision-making (e.g., marketing segments).
  • [ ] Answered quiz questions without peeking.

Milestone: Complete this to unlock "Dimensionality Reduction" in the Classical ML track. Share your cluster visualization in the academy Discord!
