Clustering and Low-Dimensional Views¶
Scenario: Analyzing Social Media User Groups¶
You're studying user engagement data from a social platform with features like post frequency, interaction types, and demographics. Data lacks labels—use clustering to identify user segments and low-dimensional views to visualize them, informing targeted content strategies.
What This Is¶
Clustering groups unlabeled data by structure, while low-dimensional views help you inspect patterns visually. Both are useful, but neither should be treated as proof by itself.
The second layer is the geometry story: you are deciding whether the data wants centroid groups, density groups, or a hierarchical merge story. Low-dimensional views help you inspect that story before you commit to a cluster label as if it were ground truth.
When You Use It¶
- exploring unlabeled structure
- checking whether groups look compact, elongated, nested, or density-based
- comparing a label assignment against a visual summary
- inspecting patterns before choosing a method or a parameter value
- validating whether a plot is telling the same story as a non-visual metric
Tooling¶
KMeans, DBSCAN, AgglomerativeClustering, PCA, TSNE, silhouette_score, silhouette_samples, pairwise_distances
Library Notes¶
- KMeans is a centroid-based default when you expect compact, roughly spherical groups. It exposes labels_, cluster_centers_, and inertia_, so you can inspect both assignment and fit quality.
- DBSCAN is useful when density matters or you want noise points to stay unlabeled. It uses eps and min_samples, and its labels_ can include -1 for noise.
- AgglomerativeClustering helps when nested or hierarchical group structure makes sense. It can be read as a merge story through children_ and, when computed, distances_.
- PCA is a linear projection. It is the most stable low-dimensional view here, and explained_variance_ratio_ tells you how much of the original variation your view keeps.
- TSNE is a local neighborhood view. It is good for inspection, but the picture can look more separated than the true geometry deserves.
- silhouette_score summarizes how well points sit inside their assigned cluster compared with their nearest other cluster.
- silhouette_samples lets you inspect the score per point.
- pairwise_distances is useful when you want to inspect geometry directly or when a method can use a precomputed distance matrix.
Method Choice Heuristics¶
- use KMeans when the groups are roughly compact and you want a fast baseline
- use DBSCAN when the cluster shape is irregular, when density matters, or when you want to mark noise explicitly
- use AgglomerativeClustering when nested structure is plausible, when you want a merge hierarchy, or when you need to compare linkage choices
- use PCA when you want a stable inspection view that is easy to reason about
- use TSNE when you want a local neighborhood picture and can tolerate more visual drama
- use silhouette_score when you want a quick non-visual check across candidate clusterings
- use pairwise_distances when you need to inspect the geometry itself instead of trusting a plot
Minimal Example¶
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
clusters = make_pipeline(
StandardScaler(),
KMeans(n_clusters=3, n_init=10, random_state=0),
).fit_predict(X)
This is the safe baseline pattern when feature scales differ: scaling first keeps any one feature from dominating the distances. n_init matters because k-means restarts from several initializations and keeps the run with the lowest inertia.
Core Functions¶
KMeans¶
Use KMeans when you want a strong baseline for compact clusters. It is easy to explain, fast enough for many teaching cases, and it gives you direct inspection hooks.
What to inspect:
- labels_ to see which samples were assigned together
- cluster_centers_ to see what each cluster actually means
- inertia_ to compare fit quality across different n_clusters
Common mistakes:
- clustering raw features with very different scales
- treating cluster_centers_ like a guarantee of truth instead of a summary of the fit
- using one run and assuming the answer is stable
- reading a low inertia score as if it automatically means the cluster story is meaningful
Good questions:
- Are the groups compact enough for centroid-based splitting?
- Does changing n_init or random_state change the answer?
- Do the centers make sense in the original feature space?
Practical trick:
- if the problem is sparse or high-dimensional, try several initializations and compare the labels before you trust the result
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
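The stability check from the practical trick above can be sketched directly: fit the same model under several seeds and compare the label agreement. This is a sketch, with a synthetic make_blobs dataset standing in for the scaled data; adjusted_rand_score compares two labelings while ignoring the arbitrary cluster numbering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled; the real data would come from your pipeline.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit under different seeds and compare label agreement between the runs.
runs = [
    KMeans(n_clusters=4, n_init=10, random_state=s).fit_predict(X_scaled)
    for s in (0, 1, 2)
]
agreement = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
print(agreement)  # values near 1.0 suggest the assignment is stable across seeds
```

If the agreement drops well below 1.0, the geometry may not be centroid-like, or the preprocessing may be letting one feature dominate.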
DBSCAN¶
Use DBSCAN when the groups are not compact spheres, when you expect noise, or when clusters are better defined by density than by distance to a center.
What to inspect:
- labels_ for cluster membership
- -1 labels for noise points
- how sensitive the result is to eps
- how many points become noise when min_samples changes
Common mistakes:
- choosing eps by habit instead of by the feature scale
- setting eps too large and merging everything
- setting min_samples too low and calling weak structure a cluster
- using distance-based methods on unscaled data and then blaming the algorithm
Good questions:
- Do I want noise to remain noise?
- Are the clusters shaped like blobs, chains, or irregular patches?
- Does the result still make sense if eps shifts a little?
Practical trick:
- if you already have a distance matrix, metric="precomputed" is available
- pairwise_distances helps you inspect the scale before you pick eps
from sklearn.cluster import DBSCAN
labels = DBSCAN(eps=0.6, min_samples=8).fit_predict(X_scaled)
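Choosing eps by the feature scale rather than by habit can be sketched with a k-distance check: sort each point's distance to its k-th nearest neighbor and look for where the curve bends. This is a common heuristic, not a rule; make_moons here is an assumed stand-in for the scaled data, and k is conventionally tied to min_samples.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Distance to the k-th nearest neighbor, sorted ascending: a knee in this
# curve is a common starting point for eps (k is often set to min_samples).
k = 8
dists, _ = NearestNeighbors(n_neighbors=k).fit(X_scaled).kneighbors(X_scaled)
k_dist = np.sort(dists[:, -1])
print(float(k_dist[len(k_dist) // 2]))  # median k-distance shows the scale
```

An eps chosen well below the typical k-distance tends to label most points as noise; one chosen well above it tends to merge everything.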
AgglomerativeClustering¶
Use AgglomerativeClustering when you want a merge tree, when you suspect nested structure, or when you want to compare linkage behavior.
What to inspect:
- labels_ for the final grouping
- children_ for the merge structure
- distances_ when you want to reason about the tree or build a dendrogram-like inspection
Common mistakes:
- assuming every hierarchical method behaves the same way
- assuming every hierarchical method behaves the same way
- using ward with a non-Euclidean metric
- asking for a fixed n_clusters when a merge threshold would be a better teaching signal
- forgetting that connectivity and metric choices shape the result
Good questions:
- Do I want a fixed number of clusters, or do I want to stop merging at a distance threshold?
- Does the hierarchy tell a meaningful story, or is it just a forced tree?
- Which linkage (average, complete, single, or ward) matches the data geometry?
Practical trick:
- distance_threshold is useful when the goal is to inspect where merging should stop instead of guessing a cluster count first (set n_clusters=None when you use it)
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(X_scaled)
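The distance_threshold trick can be sketched like this: stop merging at a chosen distance and let the model report how many clusters that implies. The threshold value here is an arbitrary assumption for illustration, and make_blobs stands in for the scaled data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# With distance_threshold, n_clusters must be None: merging stops at the
# threshold and the model reports the cluster count that implies.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="ward"
)
labels = model.fit_predict(X_scaled)
print(model.n_clusters_)      # cluster count implied by the threshold
print(model.distances_[-3:])  # the largest merge distances in the tree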
PCA¶
Use PCA when you want a stable, linear summary of the data and a reliable inspection view.
What to inspect:
- explained_variance_ratio_ to see how much information the projection keeps
- components_ to understand which directions are driving the view
- the 2D scatter to see broad structure, not final proof
Common mistakes:
- treating a 2D PCA plot as final evidence that clusters are real
- forgetting that PCA centers the data but does not scale each feature for you
- assuming a high-looking separation in 2D means the original space is equally clean
Good questions:
- How much variance did the first two components keep?
- Is the structure visible in PCA because the signal is truly linear, or because the data is simple?
- Does the same grouping survive when I inspect another view?
Practical trick:
- use PCA first when you need a stable map, then compare it with a more local method
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_2d = pca.fit_transform(X_scaled)
kept = pca.explained_variance_ratio_.sum()
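Inspecting components_ alongside explained_variance_ratio_ can be sketched as follows. The iris dataset is an assumed stand-in here; the point is reading which original feature dominates each component direction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset; any scaled feature matrix works the same way.
data = load_iris()
X_scaled = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2).fit(X_scaled)
kept = float(pca.explained_variance_ratio_.sum())
print("variance kept:", round(kept, 3))

# Loadings: which original feature carries the most weight per component.
for name, weights in zip(["PC1", "PC2"], pca.components_):
    top = data.feature_names[int(np.argmax(np.abs(weights)))]
    print(name, "is driven most by", top)
```

If the kept variance is low, the 2D scatter is summarizing a thin slice of the geometry, and separation in the plot deserves extra suspicion.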
TSNE¶
Use TSNE when you want to inspect local neighborhoods. It is often a better visual story than PCA, but it is not a reliable clustering validator.
What to inspect:
- whether nearby points stay nearby
- whether apparent islands survive small parameter changes
- whether the picture changes a lot across random seeds
Common mistakes:
- reading global distances as if the map were metric-preserving
- comparing cluster sizes or gaps too literally
- trusting one run without checking stability
Good questions:
- Do I care about local neighborhood structure more than global shape?
- Does the apparent separation survive a second run?
- Am I using this for inspection or pretending it is proof?
Practical trick:
- perplexity is a major control knob, and different values can change the result substantially
- init="pca" is often a more globally stable start than random initialization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
tsne_2d = tsne.fit_transform(X_scaled)
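The "does the separation survive a second run" question can be sketched by comparing neighborhoods rather than coordinates, since exact point positions are not comparable across t-SNE runs. This is a sketch with assumed choices: iris as the data, random init so the seeds actually differ, and 10-nearest-neighbor overlap as the stability measure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

def knn_sets(embedding, k=10):
    """Index sets of each point's k nearest neighbors in the embedding."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(embedding).kneighbors(embedding)
    return [set(row[1:]) for row in idx]  # drop each point itself

# Random init so the seeds differ; compare neighborhoods, not coordinates.
emb_a = TSNE(n_components=2, init="random", random_state=0).fit_transform(X_scaled)
emb_b = TSNE(n_components=2, init="random", random_state=1).fit_transform(X_scaled)

overlap = float(np.mean(
    [len(a & b) / 10 for a, b in zip(knn_sets(emb_a), knn_sets(emb_b))]
))
print(round(overlap, 2))  # fraction of shared 10-nearest-neighbors across runs
```

High overlap means the local neighborhood story is stable even though the global layout moved; low overlap means the islands in the plot deserve little trust.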
silhouette_score and silhouette_samples¶
Use these when you want a non-visual check that can compare clusterings on the same data. The score is easier to compare than a plot, but it still reflects the assumptions of the chosen distance metric.
What to inspect:
- whether the mean score is positive and meaningfully above zero
- whether a few weak points are dragging the score down
- whether a cluster has many negative or near-zero sample scores
Common mistakes:
- treating silhouette as universal truth
- comparing scores from incompatible preprocessing pipelines
- forgetting that the score only makes sense when there are at least two labels and not every point has its own label
Practical trick:
- compare two clusterings on the same data, then inspect the per-sample scores for the weak points
from sklearn.metrics import silhouette_score
score = silhouette_score(X_scaled, labels)
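The per-point inspection described above can be sketched with silhouette_samples: the mean score is the average of the per-sample scores, so a decent mean can hide one weak cluster. make_blobs stands in for the scaled data here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled and labels.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

per_point = silhouette_samples(X_scaled, labels)
print("mean:", round(float(silhouette_score(X_scaled, labels)), 3))
print("weak points (score < 0):", int((per_point < 0).sum()))

# Per-cluster view: one weak cluster can hide behind a decent average.
for c in np.unique(labels):
    print("cluster", int(c), "mean:", round(float(per_point[labels == c].mean()), 3))
```

Points with negative scores sit closer to a neighboring cluster than to their own, which is exactly the region worth inspecting in the 2D views.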
pairwise_distances¶
Use this when you want to inspect the geometry directly, or when you want a method to work from a distance matrix instead of raw features.
Useful patterns:
- checking whether the scale of distances is sensible before choosing eps
- feeding metric="precomputed" workflows
- understanding whether the data has a natural neighborhood structure
Common mistake:
- using the same distance metric for every problem without checking whether it matches the representation
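Both patterns above can be sketched together: inspect the distance scale first, then reuse the same matrix in a precomputed-metric workflow. make_moons and the specific eps value are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Inspect the distance scale before choosing eps.
D = pairwise_distances(X_scaled)
print("median distance:", round(float(np.median(D)), 2))
print("5th percentile:", round(float(np.percentile(D, 5)), 2))

# The same matrix can drive a precomputed-metric workflow.
labels = DBSCAN(eps=0.3, min_samples=8, metric="precomputed").fit_predict(D)
print("clusters found:", len(set(labels) - {-1}))
```

Seeing that eps sits near the low percentiles of the distance distribution, not near the median, is a quick sanity check before blaming the algorithm.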
Worked Pattern¶
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
labels_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
labels_dbscan = DBSCAN(eps=0.5, min_samples=8).fit_predict(X_scaled)
labels_agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_scaled)
pca_2d = PCA(n_components=2).fit_transform(X_scaled)
tsne_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X_scaled)
score = silhouette_score(X_scaled, labels_kmeans)
PCA is the more stable summary. t-SNE is often the more dramatic inspection view. You usually want both, and you usually want at least one non-visual metric beside them.
What To Inspect¶
- whether the clusters are compact, elongated, nested, or density-based
- whether noise points should remain separate from main groups
- whether a 2D projection preserves the neighborhood story you care about
- whether the cluster count came from data structure or from habit
- whether cluster labels change when you change random_state, n_init, or the preprocessing
- whether the silhouette pattern is consistent across clusters or weak in one region
- whether the feature scales are dominating the geometry
- whether the chosen metric matches the story you want to tell
Common Misreads¶
- a clean 2D picture does not prove clean clusters
- a noisy picture does not mean the method failed
- a good KMeans silhouette does not mean the geometry is actually centroid-like
- t-SNE can make almost any dataset look more separated than it really is
- a low silhouette score can mean the method is wrong, or it can mean the data does not cluster cleanly at all
- a stable hierarchy does not mean the leaves are meaningful classes
- cluster count is not a substitute for understanding geometry
The correct question is not "Does the plot look good?" It is "What part of the method story does the plot actually support?"
Failure Pattern¶
Reading too much meaning into a pretty 2D plot. A low-dimensional view is a tool for inspection, not final evidence about the geometry of the data.
Another failure pattern is forcing every dataset into KMeans because it is familiar. Geometry should drive the choice, not popularity.
Another failure pattern is skipping scaling and then trying to interpret the output as if the algorithm made a deep discovery. Often it just followed the largest raw units.
Another failure pattern is treating one score as final. A good workflow checks labels, centers or merges, the projection, and at least one distance-based score together.
Questions To Ask¶
- What shape do I expect before I run the method?
- Do I want centers, density, or a hierarchy?
- What is the smallest change I can make to test stability?
- Are the clusters meaningful in the original feature space, or only in the plot?
- Does the noise label matter here, or is it a warning sign that the data representation is weak?
- If the method changed its answer, would I know whether the cause was geometry, scaling, or random initialization?
Practice¶
- Compare KMeans and DBSCAN on differently shaped data.
- Generate both PCA and t-SNE views for the same dataset.
- Explain one thing each method reveals and one thing it can hide.
- Describe a situation where a cluster plot looks good but the method choice is still wrong.
- Explain what noise points tell you that KMeans cannot.
- Name one reason to distrust a 2D plot even when it is visually neat.
- Explain how you would decide between KMeans and DBSCAN on an unfamiliar dataset.
- Describe one reason a good PCA plot can still be misleading.
- Explain why silhouette_samples can be more useful than a single average score.
- Describe what you would inspect before deciding that eps is too small or too large.
- Explain why ward linkage has a stricter metric requirement than some other linkage choices.
- Describe a case where distance_threshold gives a better lesson than picking n_clusters first.
Case Study: Customer Segmentation in E-Commerce¶
E-commerce platforms such as Amazon use clustering and dimensionality reduction to segment customers by purchase history, enabling personalized recommendations. Visual inspection helps confirm that segments reflect real behavior before they drive targeted campaigns.
Expanded Quick Quiz¶
Why use DBSCAN over KMeans?
Answer: DBSCAN handles non-spherical clusters and identifies noise, while KMeans assumes spherical groups.
How does t-SNE differ from PCA?
Answer: t-SNE preserves local neighborhoods for visualization; PCA preserves global variance linearly.
What does a high silhouette score indicate?
Answer: Points are well-matched to their clusters and poorly matched to others, indicating good clustering.
In the social media scenario, why visualize clusters?
Answer: To inspect group structures and ensure they align with user behavior patterns for effective strategies.
Progress Checkpoint¶
- [ ] Applied KMeans, DBSCAN, and AgglomerativeClustering to data.
- [ ] Generated PCA and t-SNE visualizations.
- [ ] Computed silhouette scores and analyzed per-point samples.
- [ ] Interpreted results to choose the best method.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Advanced Clustering and Dimensionality Reduction" in the Classical ML track. Share your cluster visualization in the academy Discord!
Further Reading¶
- Scikit-Learn Clustering Guide.
- t-SNE and UMAP papers.
- Dimensionality Reduction tutorials.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Inspect how the cluster labels change across methods and how the 2D view can simplify the geometry. Check whether the same dataset looks stable under PCA but more variable under t-SNE.
Common Trick¶
If the 2D picture looks too clean, check the original feature space and the method assumptions again. Good visualization can make weak geometry look stronger than it is.
Try at least one non-visual metric as a check. If the plot and the metric disagree, investigate both before deciding the cluster story is settled.
If KMeans looks unstable, increase n_init, scale the data, and compare the silhouette score before changing the number of clusters.
If DBSCAN labels almost everything as noise, check the feature scale and then re-check eps and min_samples rather than switching methods immediately.
If t-SNE looks dramatic, rerun it with another seed and compare the broad neighborhood story instead of the exact point locations.
If hierarchical clustering seems too rigid, try a different linkage or use distance_threshold to inspect where the merge story actually changes.
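The first trick above, raising n_init before touching n_clusters, can be sketched as a small sweep. make_blobs with deliberately overlapping clusters is an assumed stand-in; the point is comparing the silhouette across n_init values at a fixed cluster count.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Overlapping synthetic clusters stand in for unstable real data.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# If KMeans looks unstable, raise n_init and compare the silhouette
# before changing the number of clusters.
scores = {}
for n_init in (1, 10, 30):
    labels = KMeans(n_clusters=4, n_init=n_init, random_state=0).fit_predict(X_scaled)
    scores[n_init] = float(silhouette_score(X_scaled, labels))
    print(n_init, round(scores[n_init], 3))
```

If the score keeps shifting as n_init grows, the optimizer is still finding different local minima, which is a scaling or geometry signal rather than a cluster-count signal.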
Longer Connection¶
Continue with SVM and Advanced Clustering for a fuller classical-method workflow.