Precalculated app: incomplete hover tooltips and KMeans labels overwritten when clustering by column #20

@NetZissou

Summary

Two bugs in the precalculated embeddings app: (1) hover tooltips show only partial metadata, and (2) when clustering by column values, KMeans labels are overwritten with raw column values -- destroying the ability to evaluate whether KMeans successfully separated records by category.

Bug 1: Hover tooltip shows only partial metadata

Root cause: Hard-coded [:8] slice in shared/components/visualization.py:87:

metadata_cols = [c for c in df_plot.columns if c not in skip_cols][:8]

With the example dataset's 17 metadata columns, fields such as species, scientific_name, and common_name are cut off, so users must click a point and check the Data Preview panel to see them.

Fix: Remove or raise the cap, or add a user-selectable tooltip column picker.
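A minimal sketch of what the fix could look like, assuming the column choice is pulled out into a helper. The name `select_tooltip_columns` and its `selected` and `max_cols` parameters are hypothetical and do not exist in the repository; only `skip_cols` and the list-comprehension filter come from the current code.

```python
from typing import Iterable, List, Optional, Sequence


def select_tooltip_columns(
    columns: Iterable[str],
    skip_cols: Iterable[str],
    selected: Optional[Sequence[str]] = None,  # hypothetical: user's picker choice
    max_cols: Optional[int] = None,            # hypothetical: None means no cap
) -> List[str]:
    """Return the metadata columns to show in the hover tooltip."""
    skip = set(skip_cols)
    # Same filter as the current code, minus the hard-coded [:8] slice.
    available = [c for c in columns if c not in skip]
    if selected is not None:
        # Keep the user's ordering and ignore names that are not available.
        chosen = [c for c in selected if c in available]
    else:
        chosen = available
    return chosen if max_cols is None else chosen[:max_cols]
```

With `selected=None` and `max_cols=None` this returns every non-skipped column, which corresponds to the "remove the cap" option; passing the picker's selection corresponds to the second option.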

Bug 2: KMeans labels overwritten + "null" in tooltips when clustering by column

The goal of "Use column values" clustering is to evaluate whether KMeans can separate records by a column's categories (column = ground truth, KMeans = prediction). The current code defeats this by overwriting the KMeans labels with the raw column values (apps/precalculated/components/sidebar.py:691-693).

Example -- clustering by kingdom (5 unique values, n_clusters=5)

Row   kingdom    KMeans label        After overwrite   Problem
                 (from embeddings)   (from column)
----  ---------  ------------------  ----------------  ------------------------------------
0     Animalia   2                   0 (Animalia)      KMeans said cluster 2
1     Plantae    2                   3 (Plantae)       Same KMeans cluster, different color
2     Animalia   0                   0 (Animalia)      KMeans said cluster 0
3     NaN        1                   5 (Unknown)       Phantom 6th cluster

After the overwrite:

  • Colors show taxonomy, not KMeans output -- you can't see what KMeans actually found
  • Two points in the same KMeans cluster get different colors (rows 0 and 1) -- the fact that KMeans grouped them together is exactly what the user wants to see, but it's invisible
  • Rows 0 and 2 look identical (both "Animalia") but KMeans put them in different clusters -- this disagreement is the evaluation signal, and it's destroyed
  • Null rows create a phantom 6th cluster that KMeans never assigned
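To make the lost evaluation signal concrete, here is a small sketch that cross-tabulates column values against the original KMeans labels for the four example rows. It assumes the KMeans labels are kept in their own field instead of being overwritten; `cross_tabulate` is a hypothetical helper, and missing values are assumed to arrive as `None`.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def cross_tabulate(
    column_values: Iterable[Optional[str]],
    kmeans_labels: Iterable[Hashable],
    missing: str = "Unknown",
) -> Counter:
    """Count rows per (column value, KMeans label) pair."""
    counts: Counter = Counter()
    for value, label in zip(column_values, kmeans_labels):
        # Missing values get a display name but never become a cluster id.
        key = missing if value is None else value
        counts[(key, label)] += 1
    return counts


# The four rows from the example table: (kingdom, KMeans label).
rows = [("Animalia", 2), ("Plantae", 2), ("Animalia", 0), (None, 1)]
table = cross_tabulate([k for k, _ in rows], [c for _, c in rows])

for (kingdom, cluster), n in sorted(table.items()):
    print(f"{kingdom:<10} cluster {cluster}: {n}")
```

Animalia appearing under two clusters, and cluster 2 containing two kingdoms, are the disagreements that the overwrite currently hides.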

Additionally, raw NaN values in the original column leak into the tooltip and render as "null" in Altair/Vega-Lite, which users interpret as the cluster name being null (even though cluster_name is correctly set to "Unknown").
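One possible way to stop the leak is to sanitize values before they reach the tooltip. The sketch below assumes missing values show up as `None` or a float NaN; `tooltip_value` is a hypothetical helper, not a function in the codebase, and the real fix may belong inside create_cluster_dataframe() instead.

```python
import math
from typing import Any


def tooltip_value(value: Any, placeholder: str = "Unknown") -> Any:
    """Replace missing values with a readable placeholder for display."""
    if value is None:
        return placeholder
    # NaN is a float that is not equal to itself; math.isnan checks it safely.
    if isinstance(value, float) and math.isnan(value):
        return placeholder
    return value


raw_kingdoms = ["Animalia", "Plantae", "Animalia", float("nan")]
display_kingdoms = [tooltip_value(v) for v in raw_kingdoms]
```

Applying this to a display-only copy keeps the raw column intact for the Data Preview panel. With pandas, `Series.fillna("Unknown")` on the tooltip copy would likely be the vectorized equivalent.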

Proposed fix: two distinct modes

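Mode                                    KMeans                       Colors                                     Purpose
--------------------------------------  ---------------------------  -----------------------------------------  ---------------------------------------------------------
Cluster by column count (fix existing)  Yes, n_clusters from column  KMeans labels (preserve, don't overwrite)  Evaluate: do embedding clusters align with taxonomy?
Label by column (new)                   No, skip KMeans              Raw column values                          Explore: how does a column distribute in embedding space?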

For the evaluation mode, column values become comparison metadata (in tooltip + Data Preview), not the cluster assignment.
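The split between the two modes could be sketched roughly as follows. Everything here is illustrative: the mode strings, `build_color_field`, and the `run_kmeans` callable (standing in for the app's actual clustering call) are assumptions, and missing values are assumed to be `None`.

```python
from typing import Callable, Dict, List, Optional, Sequence

UNKNOWN = "Unknown"


def build_color_field(
    mode: str,
    column_values: Sequence[Optional[str]],
    run_kmeans: Callable[[int], List[int]],  # hypothetical: n_clusters -> labels
) -> Dict[str, object]:
    """Decide what colors the plot and what is kept as comparison metadata."""
    readable = [UNKNOWN if v is None else v for v in column_values]
    if mode == "cluster_by_column_count":
        # n_clusters comes from the real categories; missing values do not count.
        n_clusters = len({v for v in column_values if v is not None})
        labels = run_kmeans(n_clusters)
        return {
            "n_clusters": n_clusters,
            "color": [str(label) for label in labels],  # KMeans labels, preserved
            "comparison": readable,                     # tooltip + Data Preview only
        }
    if mode == "label_by_column":
        # KMeans is skipped entirely; the column itself drives the colors.
        return {"n_clusters": None, "color": readable, "comparison": None}
    raise ValueError(f"unknown mode: {mode!r}")
```

In the evaluation mode the column never touches the `color` field, so a missing value cannot create an extra cluster.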

Affected Files

  • shared/components/visualization.py:87 -- tooltip [:8] cap
  • apps/precalculated/components/sidebar.py:555-602 -- cluster method UI
  • apps/precalculated/components/sidebar.py:672-697 -- label overwrite logic
  • apps/precalculated/components/sidebar.py:732-747 -- create_cluster_dataframe() null handling

Metadata

Labels: bug (Something isn't working)
