HDBSCAN currently over-fragments a realistic mobility fixture when it uses the default stability selection path.
The test that exposes this is test_st_hdbscan_ground_truth in nomad/tests/test_stop_detection.py. With the current _base_cdf, the baseline run produced 36 clusters. An unrelated chronology-drop experiment only reduced that to 30, so that change is not the right fix.
The graph diagnostics are important: this is not the same problem as the initially disconnected two-stop fixture. On gc_3_stops.csv, both the neighbor graph G and the HDBSCAN hierarchy graph H start as one connected component. The fragmentation happens later through stability selection.
The likely root cause is the stability measure. _base_cdf(eps) = 1 - 1/eps is the CDF of a density proportional to 1/eps^2, and cluster lifetime contribution is computed as F(eps_max) - F(eps_min). That means very small epsilon lifetimes can dominate the score, which is not what we want for human-mobility stops. A cluster surviving only between sub-meter or few-meter scales can beat a more meaningful stop-scale cluster.
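To make the imbalance concrete, here is a minimal sketch that evaluates the quoted formula on two made-up lifetimes. The function names and the example epsilon ranges are illustrative assumptions, not code from hdbscan.py, and the sketch looks only at the per-lifetime term F(eps_max) - F(eps_min), ignoring how many points a cluster holds.

```python
def base_cdf(eps: float) -> float:
    """Mirror of the formula quoted above: F(eps) = 1 - 1/eps."""
    return 1.0 - 1.0 / eps


def lifetime_mass(eps_min: float, eps_max: float) -> float:
    """Lifetime contribution of a cluster alive on [eps_min, eps_max]."""
    return base_cdf(eps_max) - base_cdf(eps_min)


# Hypothetical lifetimes in meters, chosen only for illustration.
tiny_scale = lifetime_mass(1.0, 2.0)     # survives from 1 m to 2 m
stop_scale = lifetime_mass(50.0, 200.0)  # survives from 50 m to 200 m

print(f"1-2 m lifetime:    {tiny_scale:.4f}")
print(f"50-200 m lifetime: {stop_scale:.4f}")
print(f"ratio:             {tiny_scale / stop_scale:.1f}")
```

Under these assumptions the 1-2 m lifetime scores about 0.5 while the 50-200 m lifetime scores about 0.015, so a one-meter-wide survival band outweighs a 150-meter-wide one by a factor of roughly 33.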
A quick diagnostic sweep, without changing the implementation, found that the old piecewise 5-200 style CDF, a simple linear 5-200 window, and a truncated exponential 5-200 window each produced 3 clusters on the ground-truth fixture. That is encouraging, but this needs a careful design rather than a quick patch.
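For reference, a windowed CDF of the kind swept above might look like the sketch below. It is an assumption about the general shape, not the code used in the sweep: the function names are hypothetical, the exponential scale of 50 m is a placeholder, and the old piecewise CDF is not reproduced because its exact form is not given here.

```python
import math

EPS_LO, EPS_HI = 5.0, 200.0  # window bounds in meters, as in the sweep


def linear_window_cdf(eps: float, lo: float = EPS_LO, hi: float = EPS_HI) -> float:
    """CDF of a uniform density on [lo, hi]: 0 below lo, 1 above hi."""
    if eps <= lo:
        return 0.0
    if eps >= hi:
        return 1.0
    return (eps - lo) / (hi - lo)


def truncated_exp_cdf(
    eps: float, lo: float = EPS_LO, hi: float = EPS_HI, scale: float = 50.0
) -> float:
    """CDF of an exponential density truncated to [lo, hi].

    `scale` (in meters) is an assumed placeholder value.
    """
    if eps <= lo:
        return 0.0
    if eps >= hi:
        return 1.0
    num = 1.0 - math.exp(-(eps - lo) / scale)
    den = 1.0 - math.exp(-(hi - lo) / scale)
    return num / den


for cdf in (linear_window_cdf, truncated_exp_cdf):
    below_window = cdf(2.0) - cdf(1.0)    # lifetime entirely below 5 m
    stop_scale = cdf(200.0) - cdf(50.0)   # lifetime at stop scale
    print(f"{cdf.__name__}: 1-2 m = {below_window:.3f}, 50-200 m = {stop_scale:.3f}")
```

With either shape, a lifetime that sits entirely below the 5 m bound contributes nothing, so sub-meter survival can no longer dominate the score. That matches the direction of the sweep result, but it does not by itself settle which shape or bounds are scientifically justified.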
Things to resolve before implementing:
- Keep delta_roam is None as stability-based cluster selection, not a flat epsilon cut.
- Decide the mobility-specific CDF shape and defaults from the scientific interpretation of stop scale.
- Handle units honestly: lat/lon graph weights are meters, projected x/y may be meters, and Garden City x/y is in 15m blocks.
- Avoid adding ad-hoc schema or coordinate fallback logic inside hdbscan.py; column resolution should stay in the existing helper path.
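On the units point in the list above, one way to keep a stop-scale window honest is to define it in meters and convert it explicitly to whatever unit the graph weights use. The helper below is a hypothetical sketch of that conversion, not an existing function in the codebase; the 5-200 m bounds are taken from the diagnostic sweep and the 15 m block size from the Garden City note.

```python
def window_in_coord_units(
    lo_m: float, hi_m: float, meters_per_unit: float
) -> tuple[float, float]:
    """Convert a window given in meters into coordinate units.

    meters_per_unit is 1.0 when weights are already meters (for example
    lat/lon graph weights) and 15.0 for Garden City block coordinates.
    """
    if meters_per_unit <= 0:
        raise ValueError("meters_per_unit must be positive")
    return lo_m / meters_per_unit, hi_m / meters_per_unit


# Weights already in meters: the window is unchanged.
print(window_in_coord_units(5.0, 200.0, meters_per_unit=1.0))

# Garden City x/y in 15 m blocks: the same window in block units.
print(window_in_coord_units(5.0, 200.0, meters_per_unit=15.0))
```

In block units the 5-200 m window becomes roughly 0.33 to 13.3, so applying meter-valued defaults directly to Garden City x/y would shift the window by a factor of 15.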
For now the ground-truth HDBSCAN test is expected to fail so this branch can finish the separate scoped fixes.