Clustering and Low-Dimensional Views¶
Scenario: Analyzing Social Media User Groups¶
You're studying user engagement data from a social platform with features like post frequency, interaction types, and demographics. Data lacks labels—use clustering to identify user segments and low-dimensional views to visualize them, informing targeted content strategies.
What This Is¶
Clustering groups unlabeled data by structure, while low-dimensional views help you inspect patterns visually. Both are useful, but neither should be treated as proof by itself.
The second layer is the geometry story: you are deciding whether the data wants centroid groups, density groups, or a hierarchical merge story. Low-dimensional views help you inspect that story before you commit to a cluster label as if it were ground truth.
When You Use It¶
- exploring unlabeled structure
- checking whether groups look compact, elongated, nested, or density-based
- comparing a label assignment against a visual summary
- inspecting patterns before choosing a method or a parameter value
- validating whether a plot is telling the same story as a non-visual metric
Tooling¶
KMeans, DBSCAN, AgglomerativeClustering, PCA, TSNE, silhouette_score, silhouette_samples, pairwise_distances
Library Notes¶
- KMeans is a centroid-based default when you expect compact, roughly spherical groups. It exposes labels_, cluster_centers_, and inertia_, so you can inspect both assignment and fit quality.
- DBSCAN is useful when density matters or you want noise points to stay unlabeled. It uses eps and min_samples, and its labels_ can include -1 for noise.
- AgglomerativeClustering helps when nested or hierarchical group structure makes sense. It can be read as a merge story through children_ and, when computed, distances_.
- PCA is a linear projection. It is the most stable low-dimensional view here, and explained_variance_ratio_ tells you how much of the original variation your view keeps.
- TSNE is a local neighborhood view. It is good for inspection, but the picture can look more separated than the true geometry deserves.
- silhouette_score summarizes how well points sit inside their assigned cluster compared with their nearest other cluster.
- silhouette_samples lets you inspect the score per point.
- pairwise_distances is useful when you want to inspect geometry directly or when a method can use a precomputed distance matrix.
Method Choice Heuristics¶
- use KMeans when the groups are roughly compact and you want a fast baseline
- use DBSCAN when the cluster shape is irregular, when density matters, or when you want to mark noise explicitly
- use AgglomerativeClustering when nested structure is plausible, when you want a merge hierarchy, or when you need to compare linkage choices
- use PCA when you want a stable inspection view that is easy to reason about
- use TSNE when you want a local neighborhood picture and can tolerate more visual drama
- use silhouette_score when you want a quick non-visual check across candidate clusterings
- use pairwise_distances when you need to inspect the geometry itself instead of trusting a plot
Minimal Example¶
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
clusters = make_pipeline(
StandardScaler(),
KMeans(n_clusters=3, n_init=10, random_state=0),
).fit_predict(X)
This is the safe baseline pattern when feature scales differ: scaling first keeps any one feature from dominating the distances. n_init matters because k-means restarts from several initializations and keeps the run with the lowest inertia.
Core Functions¶
KMeans¶
Use KMeans when you want a strong baseline for compact clusters. It is easy to explain, fast enough for many teaching cases, and it gives you direct inspection hooks.
What to inspect:
- labels_ to see which samples were assigned together
- cluster_centers_ to see what each cluster actually means
- inertia_ to compare fit quality across different n_clusters
Common mistakes:
- clustering raw features with very different scales
- treating cluster_centers_ like a guarantee of truth instead of a summary of the fit
- using one run and assuming the answer is stable
- reading a low inertia score as if it automatically means the cluster story is meaningful
Good questions:
- Are the groups compact enough for centroid-based splitting?
- Does changing n_init or random_state change the answer?
- Do the centers make sense in the original feature space?
Practical trick:
- if the problem is sparse or high-dimensional, try several initializations and compare the labels before you trust the result
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
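The stability check from the practical trick above can be sketched directly: fit the same model under several seeds and compare the label agreement. This is a sketch, with a synthetic make_blobs dataset standing in for the scaled data; adjusted_rand_score compares two labelings while ignoring the arbitrary cluster numbering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled; the real data would come from your pipeline.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit under different seeds and compare label agreement between the runs.
runs = [
    KMeans(n_clusters=4, n_init=10, random_state=s).fit_predict(X_scaled)
    for s in (0, 1, 2)
]
agreement = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
print(agreement)  # values near 1.0 suggest the assignment is stable across seeds
```

If the agreement drops well below 1.0, the geometry may not be centroid-like, or the preprocessing may be letting one feature dominate.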
DBSCAN¶
Use DBSCAN when the groups are not compact spheres, when you expect noise, or when clusters are better defined by density than by distance to a center.
What to inspect:
- labels_ for cluster membership
- -1 labels for noise points
- how sensitive the result is to eps
- how many points become noise when min_samples changes
Common mistakes:
- choosing eps by habit instead of by the feature scale
- setting eps too large and merging everything
- setting min_samples too low and calling weak structure a cluster
- using distance-based methods on unscaled data and then blaming the algorithm
Good questions:
- Do I want noise to remain noise?
- Are the clusters shaped like blobs, chains, or irregular patches?
- Does the result still make sense if eps shifts a little?
Practical trick:
- if you already have a distance matrix, metric="precomputed" is available
- pairwise_distances helps you inspect the scale before you pick eps
from sklearn.cluster import DBSCAN
labels = DBSCAN(eps=0.6, min_samples=8).fit_predict(X_scaled)
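Choosing eps by the feature scale rather than by habit can be sketched with a k-distance check: sort each point's distance to its k-th nearest neighbor and look for where the curve bends. This is a common heuristic, not a rule; make_moons here is an assumed stand-in for the scaled data, and k is conventionally tied to min_samples.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Distance to the k-th nearest neighbor, sorted ascending: a knee in this
# curve is a common starting point for eps (k is often set to min_samples).
k = 8
dists, _ = NearestNeighbors(n_neighbors=k).fit(X_scaled).kneighbors(X_scaled)
k_dist = np.sort(dists[:, -1])
print(float(k_dist[len(k_dist) // 2]))  # median k-distance shows the scale
```

An eps chosen well below the typical k-distance tends to label most points as noise; one chosen well above it tends to merge everything.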
AgglomerativeClustering¶
Use AgglomerativeClustering when you want a merge tree, when you suspect nested structure, or when you want to compare linkage behavior.
What to inspect:
- labels_ for the final grouping
- children_ for the merge structure
- distances_ when you want to reason about the tree or build a dendrogram-like inspection
Common mistakes:
- assuming every hierarchical method behaves the same way
- assuming every hierarchical method behaves the same way
- using ward with a non-Euclidean metric
- asking for a fixed n_clusters when a merge threshold would be a better teaching signal
- forgetting that connectivity and metric choices shape the result
Good questions:
- Do I want a fixed number of clusters, or do I want to stop merging at a distance threshold?
- Does the hierarchy tell a meaningful story, or is it just a forced tree?
- Which linkage (average, complete, single, or ward) matches the data geometry?
Practical trick:
- distance_threshold is useful when the goal is to inspect where merging should stop instead of guessing a cluster count first (set n_clusters=None when you use it)
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(X_scaled)
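The distance_threshold trick can be sketched like this: stop merging at a chosen distance and let the model report how many clusters that implies. The threshold value here is an arbitrary assumption for illustration, and make_blobs stands in for the scaled data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# With distance_threshold, n_clusters must be None: merging stops at the
# threshold and the model reports the cluster count that implies.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="ward"
)
labels = model.fit_predict(X_scaled)
print(model.n_clusters_)      # cluster count implied by the threshold
print(model.distances_[-3:])  # the largest merge distances in the tree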
PCA¶
Use PCA when you want a stable, linear summary of the data and a reliable inspection view.
What to inspect:
- explained_variance_ratio_ to see how much information the projection keeps
- components_ to understand which directions are driving the view
- the 2D scatter to see broad structure, not final proof
Common mistakes:
- treating a 2D PCA plot as final evidence that clusters are real
- forgetting that PCA centers the data but does not scale each feature for you
- assuming a high-looking separation in 2D means the original space is equally clean
Good questions:
- How much variance did the first two components keep?
- Is the structure visible in PCA because the signal is truly linear, or because the data is simple?
- Does the same grouping survive when I inspect another view?
Practical trick:
- use PCA first when you need a stable map, then compare it with a more local method
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_2d = pca.fit_transform(X_scaled)
kept = pca.explained_variance_ratio_.sum()
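Inspecting components_ alongside explained_variance_ratio_ can be sketched as follows. The iris dataset is an assumed stand-in here; the point is reading which original feature dominates each component direction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset; any scaled feature matrix works the same way.
data = load_iris()
X_scaled = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2).fit(X_scaled)
kept = float(pca.explained_variance_ratio_.sum())
print("variance kept:", round(kept, 3))

# Loadings: which original feature carries the most weight per component.
for name, weights in zip(["PC1", "PC2"], pca.components_):
    top = data.feature_names[int(np.argmax(np.abs(weights)))]
    print(name, "is driven most by", top)
```

If the kept variance is low, the 2D scatter is summarizing a thin slice of the geometry, and separation in the plot deserves extra suspicion.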
TSNE¶
Use TSNE when you want to inspect local neighborhoods. It is often a better visual story than PCA, but it is not a reliable clustering validator.
What to inspect:
- whether nearby points stay nearby
- whether apparent islands survive small parameter changes
- whether the picture changes a lot across random seeds
Common mistakes:
- reading global distances as if the map were metric-preserving
- comparing cluster sizes or gaps too literally
- trusting one run without checking stability
Good questions:
- Do I care about local neighborhood structure more than global shape?
- Does the apparent separation survive a second run?
- Am I using this for inspection or pretending it is proof?
Practical trick:
- perplexity is a major control knob, and different values can change the result substantially
- init="pca" is often a more globally stable start than random initialization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
tsne_2d = tsne.fit_transform(X_scaled)
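The "does the separation survive a second run" question can be sketched by comparing neighborhoods rather than coordinates, since exact point positions are not comparable across t-SNE runs. This is a sketch with assumed choices: iris as the data, random init so the seeds actually differ, and 10-nearest-neighbor overlap as the stability measure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

def knn_sets(embedding, k=10):
    """Index sets of each point's k nearest neighbors in the embedding."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(embedding).kneighbors(embedding)
    return [set(row[1:]) for row in idx]  # drop each point itself

# Random init so the seeds differ; compare neighborhoods, not coordinates.
emb_a = TSNE(n_components=2, init="random", random_state=0).fit_transform(X_scaled)
emb_b = TSNE(n_components=2, init="random", random_state=1).fit_transform(X_scaled)

overlap = float(np.mean(
    [len(a & b) / 10 for a, b in zip(knn_sets(emb_a), knn_sets(emb_b))]
))
print(round(overlap, 2))  # fraction of shared 10-nearest-neighbors across runs
```

High overlap means the local neighborhood story is stable even though the global layout moved; low overlap means the islands in the plot deserve little trust.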
silhouette_score and silhouette_samples¶
Use these when you want a non-visual check that can compare clusterings on the same data. The score is easier to compare than a plot, but it still reflects the assumptions of the chosen distance metric.
What to inspect:
- whether the mean score is positive and meaningfully above zero
- whether a few weak points are dragging the score down
- whether a cluster has many negative or near-zero sample scores
Common mistakes:
- treating silhouette as universal truth
- comparing scores from incompatible preprocessing pipelines
- forgetting that the score only makes sense when there are at least two labels and not every point has its own label
Practical trick:
- compare two clusterings on the same data, then inspect the per-sample scores for the weak points
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, labels)
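The per-point inspection described above can be sketched with silhouette_samples: the mean score is the average of the per-sample scores, so a decent mean can hide one weak cluster. make_blobs stands in for the scaled data here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled and labels.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

per_point = silhouette_samples(X_scaled, labels)
print("mean:", round(float(silhouette_score(X_scaled, labels)), 3))
print("weak points (score < 0):", int((per_point < 0).sum()))

# Per-cluster view: one weak cluster can hide behind a decent average.
for c in np.unique(labels):
    print("cluster", int(c), "mean:", round(float(per_point[labels == c].mean()), 3))
```

Points with negative scores sit closer to a neighboring cluster than to their own, which is exactly the region worth inspecting in the 2D views.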
pairwise_distances¶
Use this when you want to inspect the geometry directly, or when you want a method to work from a distance matrix instead of raw features.
Useful patterns:
- checking whether the scale of distances is sensible before choosing eps
- feeding metric="precomputed" workflows
- understanding whether the data has a natural neighborhood structure
Common mistake:
- using the same distance metric for every problem without checking whether it matches the representation
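Both patterns above can be sketched together: inspect the distance scale first, then reuse the same matrix in a precomputed-metric workflow. make_moons and the specific eps value are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Inspect the distance scale before choosing eps.
D = pairwise_distances(X_scaled)
print("median distance:", round(float(np.median(D)), 2))
print("5th percentile:", round(float(np.percentile(D, 5)), 2))

# The same matrix can drive a precomputed-metric workflow.
labels = DBSCAN(eps=0.3, min_samples=8, metric="precomputed").fit_predict(D)
print("clusters found:", len(set(labels) - {-1}))
```

Seeing that eps sits near the low percentiles of the distance distribution, not near the median, is a quick sanity check before blaming the algorithm.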
Worked Pattern¶
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
labels_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
labels_dbscan = DBSCAN(eps=0.5, min_samples=8).fit_predict(X_scaled)
labels_agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_scaled)
pca_2d = PCA(n_components=2).fit_transform(X_scaled)
tsne_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X_scaled)
score = silhouette_score(X_scaled, labels_kmeans)
PCA is the more stable summary. t-SNE is often the more dramatic inspection view. You usually want both, and you usually want at least one non-visual metric beside them.
What To Inspect¶
- whether the clusters are compact, elongated, nested, or density-based
- whether noise points should remain separate from main groups
- whether a 2D projection preserves the neighborhood story you care about
- whether the cluster count came from data structure or from habit
- whether cluster labels change when you change random_state, n_init, or the preprocessing
- whether the silhouette pattern is consistent across clusters or weak in one region
- whether the feature scales are dominating the geometry
- whether the chosen metric matches the story you want to tell
Common Misreads¶
- a clean 2D picture does not prove clean clusters
- a noisy picture does not mean the method failed
- a good KMeans silhouette does not mean the geometry is actually centroid-like
- t-SNE can make almost any dataset look more separated than it really is
- a low silhouette score can mean the method is wrong, or it can mean the data does not cluster cleanly at all
- a stable hierarchy does not mean the leaves are meaningful classes
- cluster count is not a substitute for understanding geometry
The correct question is not "Does the plot look good?" It is "What part of the method story does the plot actually support?"
Failure Pattern¶
Reading too much meaning into a pretty 2D plot. A low-dimensional view is a tool for inspection, not final evidence about the geometry of the data.
Another failure pattern is forcing every dataset into KMeans because it is familiar. Geometry should drive the choice, not popularity.
Another failure pattern is skipping scaling and then trying to interpret the output as if the algorithm made a deep discovery. Often it just followed the largest raw units.
Another failure pattern is treating one score as final. A good workflow checks labels, centers or merges, the projection, and at least one distance-based score together.
Questions To Ask¶
- What shape do I expect before I run the method?
- Do I want centers, density, or a hierarchy?
- What is the smallest change I can make to test stability?
- Are the clusters meaningful in the original feature space, or only in the plot?
- Does the noise label matter here, or is it a warning sign that the data representation is weak?
- If the method changed its answer, would I know whether the cause was geometry, scaling, or random initialization?
Practice¶
- Compare KMeans and DBSCAN on differently shaped data.
- Generate both PCA and t-SNE views for the same dataset.
- Explain one thing each method reveals and one thing it can hide.
- Describe a situation where a cluster plot looks good but the method choice is still wrong.
- Explain what noise points tell you that KMeans cannot.
- Name one reason to distrust a 2D plot even when it is visually neat.
- Explain how you would decide between KMeans and DBSCAN on an unfamiliar dataset.
- Describe one reason a good PCA plot can still be misleading.
- Explain why silhouette_samples can be more useful than a single average score.
- Describe what you would inspect before deciding that eps is too small or too large.
- Explain why ward linkage has a stricter metric requirement than some other linkage choices.
- Describe a case where distance_threshold gives a better lesson than picking n_clusters first.
Case Study: Customer Segmentation in E-Commerce¶
E-commerce platforms such as Amazon use clustering and dimensionality reduction to segment customers by purchase history, enabling personalized recommendations. Visual inspection helps confirm that segments reflect real behavior before they drive targeted campaigns.
Expanded Quick Quiz¶
Why use DBSCAN over KMeans?
Answer: DBSCAN handles non-spherical clusters and identifies noise, while KMeans assumes spherical groups.
How does t-SNE differ from PCA?
Answer: t-SNE preserves local neighborhoods for visualization; PCA preserves global variance linearly.
What does a high silhouette score indicate?
Answer: Points are well-matched to their clusters and poorly matched to others, indicating good clustering.
In the social media scenario, why visualize clusters?
Answer: To inspect group structures and ensure they align with user behavior patterns for effective strategies.
Progress Checkpoint¶
- [ ] Applied KMeans, DBSCAN, and AgglomerativeClustering to data.
- [ ] Generated PCA and t-SNE visualizations.
- [ ] Computed silhouette scores and analyzed per-point samples.
- [ ] Interpreted results to choose the best method.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Advanced Clustering and Dimensionality Reduction" in the Classical ML track. Share your cluster visualization in the academy Discord!
Further Reading¶
- Scikit-Learn Clustering Guide.
- t-SNE and UMAP papers.
- Dimensionality Reduction tutorials.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Inspect how the cluster labels change across methods and how the 2D view can simplify the geometry. Check whether the same dataset looks stable under PCA but more variable under t-SNE.
Common Trick¶
If the 2D picture looks too clean, check the original feature space and the method assumptions again. Good visualization can make weak geometry look stronger than it is.
Try at least one non-visual metric as a check. If the plot and the metric disagree, investigate both before deciding the cluster story is settled.
If KMeans looks unstable, increase n_init, scale the data, and compare the silhouette score before changing the number of clusters.
If DBSCAN labels almost everything as noise, check the feature scale and then re-check eps and min_samples rather than switching methods immediately.
If t-SNE looks dramatic, rerun it with another seed and compare the broad neighborhood story instead of the exact point locations.
If hierarchical clustering seems too rigid, try a different linkage or use distance_threshold to inspect where the merge story actually changes.
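The first trick above, raising n_init before touching n_clusters, can be sketched as a small sweep. make_blobs with deliberately overlapping clusters is an assumed stand-in; the point is comparing the silhouette across n_init values at a fixed cluster count.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Overlapping synthetic clusters stand in for unstable real data.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# If KMeans looks unstable, raise n_init and compare the silhouette
# before changing the number of clusters.
scores = {}
for n_init in (1, 10, 30):
    labels = KMeans(n_clusters=4, n_init=n_init, random_state=0).fit_predict(X_scaled)
    scores[n_init] = float(silhouette_score(X_scaled, labels))
    print(n_init, round(scores[n_init], 3))
```

If the score keeps shifting as n_init grows, the optimizer is still finding different local minima, which is a scaling or geometry signal rather than a cluster-count signal.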
Longer Connection¶
Continue with SVM and Advanced Clustering for a fuller classical-method workflow.