frs_habitat_classify path-1 should accept known-spawning as a rearing trigger #189

@NewGraphEnvironment

Problem

frs_habitat_classify path-1 (rearing-on-spawning) only triggers when h.spawning IS TRUE (modelled, rule-based). bcfishpass's habitat_linear_<sp> path-1 is broader:

WHERE (h.spawning IS TRUE or coalesce(hk.spawning_st, 0) = 1)

hk.spawning_st comes from bcfishpass.streams_habitat_known, populated from user_habitat_classification.csv. So bcfp's "rule-based" habitat_linear_<sp> already includes operator-known spawning as a path-1 trigger.

Link's config with apply_habitat_overlay: no is meant to match bcfp's rule-based output. But because fresh's classify doesn't read user_habitat_classification.csv, link misses the rearing credits that bcfp produces through the known-spawning trigger.
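The difference between the two triggers can be sketched in a few lines. This is an illustrative Python model, not code from fresh or bcfishpass: the function names are hypothetical, and only the two predicates are taken from the SQL quoted above.

```python
from typing import Optional


def path1_trigger_fresh(spawning: Optional[bool]) -> bool:
    """fresh today: only modelled spawning (h.spawning IS TRUE) triggers path-1."""
    return spawning is True


def path1_trigger_bcfp(spawning: Optional[bool], spawning_st: Optional[int]) -> bool:
    """bcfp: h.spawning IS TRUE OR coalesce(hk.spawning_st, 0) = 1."""
    # coalesce(hk.spawning_st, 0): a missing known-habitat row counts as 0
    known = (spawning_st if spawning_st is not None else 0) == 1
    return spawning is True or known


# A segment with no modelled spawning but operator-known spawning:
fresh_fires = path1_trigger_fresh(False)
bcfp_fires = path1_trigger_bcfp(False, 1)
```

For that segment the fresh-style trigger does not fire while the bcfp-style trigger does, which is the under-credit described here.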

Three logical modes (only two exist in fresh)

| Mode | Modelled rule | Known-spawning triggers rearing | Final overlay |
|---|---|---|---|
| bcfp habitat_linear_<sp> | yes | yes | no |
| bcfp streams_habitat_linear | yes | yes | yes |
| fresh apply_habitat_overlay: no | yes | no | no |
| fresh apply_habitat_overlay: yes | yes | no (only post-overlay) | yes |

Bcfp's rule-based table corresponds to a third mode that fresh doesn't currently produce.

Concrete case

link MORR ST gap, 2026-04-30. bcfp credits ~60 km of ST rearing across the top 10 streams alone via the hk-trigger path:

| blue_line_key | n rearing segs (with hk.spawning_st = 1) | rearing km |
|---|---|---|
| 360885316 (Morice River) | 179 | 35.24 |
| 360885021 (Gosnell Creek) | 59 | 12.14 |
| 360819468 | 24 | 5.44 |
| 360837468 | 16 | 3.71 |
| ... | | |

bcfishpass.streams_habitat_known on MORR has 353 ST segments with spawning_st = 1 (from 31 distinct rows in user_habitat_classification.csv expanded across DRM ranges). Link's classify path-1 doesn't see them, so the rearing-on-spawning credit on those streams doesn't fire.

Proposed solution — reuse frs_habitat_overlay, just call it earlier

frs_habitat_overlay already does exactly the OR-additive operation we need: source-table shape (blue_line_key, drm, urm, species_code, spawning, ...) matches user_habitat_classification.csv; bridge mode does a 3-way range join into streams_habitat; updates are FALSE → TRUE only, never reversed.

Today's pipeline order:

classify → cluster → connected_waterbody → overlay (apply_habitat_overlay=yes only)

Today's apply_habitat_overlay=yes mode mutates streams_habitat.spawning AFTER cluster + connected_waterbody have already run. By the time those phases query WHERE h.spawning IS TRUE, they only see modelled spawning — which is why fresh's apply_habitat_overlay=yes mode produces a different output than bcfp's habitat_linear_<sp> despite reaching the same final overlay state.

Fix: shift the overlay call to run BEFORE cluster + connected_waterbody when the user wants bcfp-style hk-trigger semantics. Once streams_habitat.spawning carries known-spawning at that point, every downstream phase reads it naturally — cluster's label_connect IS TRUE checks, .frs_connected_waterbody Phase 2's WHERE hs.spawning IS TRUE, classify's path-1 rearing-on-spawning. No code changes inside any of those phases.
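The ordering effect can be shown with a toy model. The sketch below is an assumption-heavy illustration in Python: toy_cluster is a deliberately simplified stand-in (it keeps rearing only when some segment on the stream has spawning) and is not the real cluster or connected_waterbody logic. The overlay function models only the FALSE → TRUE OR described above.

```python
from copy import deepcopy


def overlay(segments):
    """OR known spawning into the spawning flag (FALSE -> TRUE only)."""
    for seg in segments:
        seg["spawning"] = seg["spawning"] or seg["known_spawning"]


def toy_cluster(segments):
    """Toy stand-in for cluster: strip rearing unless the stream has spawning."""
    has_spawning = any(seg["spawning"] for seg in segments)
    for seg in segments:
        seg["rearing"] = seg["rearing"] and has_spawning


# Hypothetical two-segment stream: known spawning on 1, modelled rearing on 2.
stream = [
    {"id": 1, "spawning": False, "known_spawning": True, "rearing": False},
    {"id": 2, "spawning": False, "known_spawning": False, "rearing": True},
]

late = deepcopy(stream)   # today's order: cluster, then overlay
toy_cluster(late)
overlay(late)

early = deepcopy(stream)  # proposed order: overlay, then cluster
overlay(early)
toy_cluster(early)

same_spawning = [s["spawning"] for s in late] == [s["spawning"] for s in early]
late_rearing = [s["id"] for s in late if s["rearing"]]
early_rearing = [s["id"] for s in early if s["rearing"]]
```

In this toy case both orders end with the same spawning flags, but only the early order keeps the rearing on segment 2, because the cluster stand-in could see the known spawning when it ran.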

One real gap in frs_habitat_overlay

bcfp's hk-trigger uses range overlap, not strict containment:

-- frs_habitat_overlay's current bridge predicate (CONTAINMENT)
s.downstream_route_measure >= k.downstream_route_measure
AND s.upstream_route_measure   <= k.upstream_route_measure

-- bcfp hk-trigger semantics (OVERLAP)
s.upstream_route_measure   >= k.downstream_route_measure
AND s.downstream_route_measure <= k.upstream_route_measure

Add a range_mode = c("contain", "overlap") arg to frs_habitat_overlay. Default "contain" preserves today's apply_habitat_overlay=yes behaviour exactly. New "overlap" mode matches bcfp.
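To make the two predicates concrete, here is a minimal Python sketch. The function name segment_matches and its argument names are assumptions for illustration. Only the comparisons are taken from the SQL above, and range_mode mirrors the proposed argument.

```python
def segment_matches(seg_drm, seg_urm, known_drm, known_urm, range_mode="contain"):
    """Return True when a stream segment matches a known-habitat range.

    seg_* are the segment's downstream/upstream route measures (s.*),
    known_* are the known-habitat row's measures (k.*).
    """
    if range_mode == "contain":
        # segment lies entirely inside the known range
        return seg_drm >= known_drm and seg_urm <= known_urm
    if range_mode == "overlap":
        # segment and known range share at least one point
        return seg_urm >= known_drm and seg_drm <= known_urm
    raise ValueError(f"unknown range_mode: {range_mode!r}")


# Hypothetical known range 100-500 and a segment 400-600 that straddles its end.
contained = segment_matches(400, 600, 100, 500, range_mode="contain")
overlapping = segment_matches(400, 600, 100, 500, range_mode="overlap")
```

Here the straddling segment is rejected under "contain" and accepted under "overlap". Because both comparisons are inclusive, a segment that only touches an endpoint of the known range also counts as overlapping, which follows directly from the predicate as written.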

Orchestration — when to apply

Add a known_habitat_when = c("post-cluster", "post-classify", "both") option to frs_habitat (defaults to "post-cluster" = today's behaviour). When "post-classify" or "both", frs_habitat calls frs_habitat_overlay(known_habitat, range_mode = "overlap") between classify and cluster. With "both", it calls overlay twice — once post-classify, once post-cluster — and idempotent OR makes that safe (TRUE → TRUE is a no-op).

A config with apply_habitat_overlay = no keeps known_habitat_when = "post-cluster" and skips the overlay entirely (status quo).
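The resulting phase order can be sketched as a small function. This is a Python illustration of the proposed orchestration, not the frs_habitat implementation: the function name phase_order, the phase labels, and the way apply_habitat_overlay gates both overlay calls are assumptions made for the sketch.

```python
from typing import List

VALID_WHEN = ("post-cluster", "post-classify", "both")


def phase_order(apply_habitat_overlay: bool,
                known_habitat_when: str = "post-cluster") -> List[str]:
    """List the pipeline phases in the order they would run."""
    if known_habitat_when not in VALID_WHEN:
        raise ValueError(f"unknown known_habitat_when: {known_habitat_when!r}")

    phases = ["classify"]
    # Early overlay: known spawning becomes visible to cluster and later phases.
    if apply_habitat_overlay and known_habitat_when in ("post-classify", "both"):
        phases.append("overlay@post-classify")
    phases += ["cluster", "connected_waterbody"]
    # Late overlay: today's position in the pipeline.
    if apply_habitat_overlay and known_habitat_when in ("post-cluster", "both"):
        phases.append("overlay@post-cluster")
    return phases


today = phase_order(True)
bcfp_style = phase_order(True, "post-classify")
both = phase_order(True, "both")
no_overlay = phase_order(False)
```

With the default, the order matches today's pipeline. "post-classify" moves the single overlay call ahead of cluster, and "both" runs it in both positions.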

Implementation surface

| Change | File | Approx lines |
|---|---|---|
| range_mode arg + the alternative SQL predicate | R/frs_habitat_overlay.R | ~10–15 |
| known_habitat_when option threaded through frs_habitat orchestrator (with conditional pre-cluster overlay call) | R/frs_habitat.R (around line 1153 .frs_run_connectivity) | ~30 |
| Tests | tests/testthat/test-frs_habitat_overlay.R (new range-mode cases) + tests/testthat/test-frs_habitat.R (timing-mode cases) | ~50–80 |

Zero changes to frs_habitat_classify, frs_cluster, .frs_connected_waterbody. They keep reading h.spawning IS TRUE as before; the OR-in is upstream of them when timing is "post-classify" or "both".

Safety property

frs_habitat_overlay already mutates streams_habitat.spawning (in apply_habitat_overlay=yes mode). The proposed change shifts WHEN that mutation fires, not WHETHER. With default known_habitat_when = "post-cluster", behaviour is bit-identical to today. With "post-classify", the mutation happens earlier in the pipeline, giving cluster + connected_waterbody the augmented spawning set — same eventual streams_habitat.spawning content, just visible to those phases.

OR-additive throughout: rows can flip FALSE → TRUE, never reverse. No risk of shrinking output.
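Both properties, idempotence and never shrinking, fall out of a plain boolean OR. A minimal Python sketch follows. The or_overlay function and the sample flags are hypothetical and stand in for the update that frs_habitat_overlay applies to streams_habitat.spawning.

```python
def or_overlay(spawning, known):
    """Element-wise spawning OR known; a TRUE flag never reverts to FALSE."""
    return [bool(s) or bool(k) for s, k in zip(spawning, known)]


modelled = [True, False, False, True]   # hypothetical modelled spawning flags
known = [False, True, False, True]      # hypothetical known-spawning flags

once = or_overlay(modelled, known)
twice = or_overlay(once, known)

# Idempotent: applying the overlay a second time changes nothing.
is_idempotent = once == twice
# Monotonic: every segment that was TRUE before is still TRUE afterwards.
never_shrinks = all(after or not before for before, after in zip(modelled, once))
```

This is why calling the overlay twice under "both" is safe: the second call can only re-assert flags that are already TRUE.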

Test plan (fresh-side, self-contained)

| Test | Bar |
|---|---|
| All existing tests pass with default known_habitat_when = "post-cluster" | Bit-identical fresh suite output |
| range_mode = "overlap" produces the larger expected row set on a synthetic fixture (segment range partially intersects known range) | Unit test in test-frs_habitat_overlay.R |
| range_mode = "contain" (default) produces existing behaviour byte-identical | Unit test |
| known_habitat_when = "post-classify" on a synthetic fixture: cluster sees augmented spawning → segments that would have been stripped get preserved | Integration test in test-frs_habitat.R |
| known_habitat_when = "both" is idempotent (output identical to "post-classify" on the same input) | Integration test |
| Defensive: empty known_habitat, missing species column, table doesn't exist | Standard error-path tests |

These prove the safety property + the range-mode SQL + the timing semantics without needing the live bcfp tunnel.

Parity verification (link-side, follow-up)

The bcfp parity claim itself — "MORR ST credits ~60 km of rule-based rearing matching bcfishpass.habitat_linear_st via the hk-trigger path" — requires link's compare_bcfishpass_wsg.R + the live tunnel. Tracked separately in link#132. After fresh ships the timing arg, link's lnk_pipeline_classify exposes known_habitat_when = "post-classify" to bundles that want bcfp parity (and "post-cluster" for bundles that want today's apply_habitat_overlay semantics). MORR ST runs through the compare apparatus to confirm.

Reproduction

Tunnel DB: bcfishpass on localhost:63333 (db_newgraph, rebuilt Mondays). WSG: MORR. Species: ST. bcfishpass.streams_habitat_known provides the trigger source. Link's bcfishpass-bundle ships inst/extdata/configs/bcfishpass/overrides/user_habitat_classification.csv, which holds the same data, but classify will not consume it until link's wiring lands in #132.

Related

This is the third mechanism in the same parity-investigation slice. fresh#186/#187 fixed link's over-credits; this fixes the under-credit. Together they should close most of the remaining MORR ST / BABL ST / MORR CO gap.
