HDBSCAN stability selection over-fragments mobility stops #289

@paco-barreras

Description

HDBSCAN currently over-fragments a realistic mobility fixture when it uses the default stability selection path.

The test that exposes this is test_st_hdbscan_ground_truth in nomad/tests/test_stop_detection.py. With the current _base_cdf, the baseline run produced 36 clusters. An unrelated chronology-drop experiment only reduced that to 30, so it is not the right fix.

The graph diagnostics are important: this is not the same problem as the initially disconnected two-stop fixture. On gc_3_stops.csv, both the neighbor graph G and the HDBSCAN hierarchy graph H start as one connected component. The fragmentation happens later through stability selection.

The likely root cause is the stability measure. _base_cdf(eps) = 1 - 1/eps is the CDF of a density proportional to 1/eps^2, and cluster lifetime contribution is computed as F(eps_max) - F(eps_min), which here equals 1/eps_min - 1/eps_max. That difference grows rapidly as eps_min shrinks, so very small epsilon lifetimes can dominate the score, which is not what we want for human-mobility stops. A cluster surviving only between sub-meter or few-meter scales can beat a more meaningful stop-scale cluster.
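To make the imbalance concrete, here is a minimal sketch that evaluates the lifetime contribution for two hypothetical clusters. Only the formula 1 - 1/eps comes from this issue; the helper names base_cdf and lifetime_mass and the epsilon ranges are illustrative assumptions, not code from hdbscan.py.

```python
def base_cdf(eps):
    """F(eps) = 1 - 1/eps, the formula quoted in this issue for _base_cdf."""
    return 1.0 - 1.0 / eps


def lifetime_mass(eps_min, eps_max, cdf=base_cdf):
    """Lifetime contribution F(eps_max) - F(eps_min) for one cluster (hypothetical helper)."""
    return cdf(eps_max) - cdf(eps_min)


# Hypothetical cluster that only exists between 1 m and 2 m.
small_scale = lifetime_mass(1.0, 2.0)

# Hypothetical cluster that persists from 20 m to 200 m, a plausible stop scale.
stop_scale = lifetime_mass(20.0, 200.0)

print(f"1-2 m lifetime mass:    {small_scale:.3f}")
print(f"20-200 m lifetime mass: {stop_scale:.3f}")
```

Under these assumed ranges the 1-2 m cluster gets a mass of 0.5 and the 20-200 m cluster gets 0.045, so the few-meter cluster outweighs the stop-scale one by more than a factor of ten per point.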

A quick diagnostic sweep, without changing the implementation, found that the old piecewise 5-200 style CDF, a simple linear 5-200 window, and a truncated exponential 5-200 window each produced 3 clusters on the ground-truth fixture. That is encouraging, but this needs a careful design rather than a quick patch.
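For discussion, here is one possible reading of two of those windowed shapes. The exact forms used in the sweep are not recorded in this issue, so the function names, the clipping behavior, and the scale parameter of the truncated exponential are all assumptions.

```python
import math


def linear_window_cdf(eps, lo=5.0, hi=200.0):
    """Assumed linear window: 0 below lo, 1 above hi, linear in between."""
    if eps <= lo:
        return 0.0
    if eps >= hi:
        return 1.0
    return (eps - lo) / (hi - lo)


def truncated_exp_cdf(eps, lo=5.0, hi=200.0, scale=50.0):
    """Assumed exponential truncated to [lo, hi]; scale is a hypothetical parameter."""
    if eps <= lo:
        return 0.0
    if eps >= hi:
        return 1.0
    norm = 1.0 - math.exp(-(hi - lo) / scale)
    return (1.0 - math.exp(-(eps - lo) / scale)) / norm


def lifetime_mass(eps_min, eps_max, cdf):
    """Lifetime contribution F(eps_max) - F(eps_min) under a given CDF."""
    return cdf(eps_max) - cdf(eps_min)


for cdf in (linear_window_cdf, truncated_exp_cdf):
    below_window = lifetime_mass(1.0, 2.0, cdf)    # entirely below 5 m
    stop_scale = lifetime_mass(20.0, 200.0, cdf)   # inside the 5-200 m window
    print(cdf.__name__, below_window, round(stop_scale, 3))
```

With either assumed shape, a lifetime that lies entirely below the lower bound contributes nothing, while a lifetime inside the window keeps most of the mass. That is consistent with the sweep result, but it does not settle which shape or which bounds are scientifically justified.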

Things to resolve before implementing:

  • Keep the delta_roam is None case as stability-based cluster selection, not a flat epsilon cut.
  • Decide the mobility-specific CDF shape and defaults from the scientific interpretation of stop scale.
  • Handle units honestly: lat/lon graph weights are in meters, projected x/y may be in meters, and Garden City x/y is in 15 m blocks.
  • Avoid adding ad-hoc schema or coordinate fallback logic inside hdbscan.py; column resolution should stay in the existing helper path.
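As a rough illustration of the units point, the sketch below converts a stop-scale window given in meters into the units of the graph weights. The helper name window_in_data_units and its meters_per_unit argument are hypothetical. Where such a conversion should live is exactly the open design question, so this is not a proposal to put it in hdbscan.py.

```python
def window_in_data_units(lo_m, hi_m, meters_per_unit):
    """Convert a window in meters to data units (hypothetical helper).

    meters_per_unit is 1.0 when graph weights are already in meters, and
    15.0 for coordinates measured in 15 m blocks, as stated for Garden City.
    """
    if meters_per_unit <= 0:
        raise ValueError("meters_per_unit must be positive")
    return lo_m / meters_per_unit, hi_m / meters_per_unit


# Weights already in meters: the window is unchanged.
print(window_in_data_units(5.0, 200.0, 1.0))

# Garden City style 15 m blocks: the same window shrinks by a factor of 15.
print(window_in_data_units(5.0, 200.0, 15.0))
```

The point of the sketch is that a fixed 5-200 window means very different physical scales depending on the coordinate units, so any default has to be stated in meters and converted explicitly.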

For now the ground-truth HDBSCAN test is expected to fail so this branch can finish the separate scoped fixes.
