
frs_candidates_pick: score + filter + dedup candidates per key (v0.31.0)#209

Merged
NewGraphEnvironment merged 6 commits into main from 207-frs-candidates-pick-score-filter-dedup-c on May 11, 2026

@NewGraphEnvironment
Owner

Summary

Closes #207. Ships frs_candidates_pick() — fourth primitive in fresh's point-handling family (alongside frs_point_snap, frs_network_features, frs_point_match). Given a candidates table where multiple rows can share the same key, optionally score per row via a caller-supplied SQL expression, optionally filter via a caller-supplied WHERE clause, then keep one row per key via DISTINCT ON (col_key) ORDER BY ....

  • Generic over any "score + filter + dedup per key" workflow where column-to-column comparisons disambiguate matches: stream-name match, watershed-group agreement, species-code overlap, assessment-date proximity, channel-width × stream-order compatibility, etc.
  • Function signature follows the table_<role> / col_<role> / exp_<role> conventions (link/CLAUDE.md):
    frs_candidates_pick(
      conn, table_in, table_to, col_key,
      exp_score  = NULL,    # optional SQL fragment yielding a `score` column
      exp_filter = NULL,    # optional SQL WHERE clause for disqualifiers
      order_by              # vector of "expr ASC/DESC" strings; col_key prepended
    )
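The statement shape described above can be sketched as plain string composition. This is a minimal illustration, not the package's implementation: `sketch_pick_sql` is a hypothetical helper, it skips identifier validation and the reserved-column check, and the exact `DROP TABLE IF EXISTS` / `CREATE TABLE ... AS` wording is an assumption (the PR only says "DROP + CREATE").

```r
# Hypothetical helper: builds the two statements as character strings only.
# It never touches a database and is NOT the code in R/frs_candidates_pick.R.
sketch_pick_sql <- function(table_in, table_to, col_key, order_by,
                            exp_score = NULL, exp_filter = NULL) {
  # Optional CTE that adds a `score` column computed from exp_score
  if (is.null(exp_score)) {
    cte <- ""
    source_rel <- table_in
  } else {
    cte <- sprintf("WITH scored AS (SELECT *, (%s) AS score FROM %s) ",
                   exp_score, table_in)
    source_rel <- "scored"
  }

  # Optional disqualifier filter
  where <- if (is.null(exp_filter)) "" else sprintf(" WHERE %s", exp_filter)

  # DISTINCT ON requires its expression to lead the ORDER BY,
  # so col_key is prepended to the caller's clauses
  order_sql <- paste(c(col_key, order_by), collapse = ", ")

  select <- sprintf("%sSELECT DISTINCT ON (%s) * FROM %s%s ORDER BY %s",
                    cte, col_key, source_rel, where, order_sql)

  # Two separate statements (assumed wording), one per execute call
  list(
    drop   = sprintf("DROP TABLE IF EXISTS %s", table_to),
    create = sprintf("CREATE TABLE %s AS %s", table_to, select)
  )
}

sql <- sketch_pick_sql(
  table_in   = "s.cand",
  table_to   = "s.picked",
  col_key    = "id",
  order_by   = "score DESC",
  exp_score  = "a + b",
  exp_filter = "a > 0"
)
cat(sql$drop, sql$create, sep = "\n")
```

Returning the two statements separately mirrors the constraint noted in the commit message that DROP and CREATE have to be executed as separate calls.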

Live byte-identical parity vs bcfp

BULK PSCIS-to-stream selection: 102 / 102 ref picks identical, 0 missing. Closes the BULK 5-diff gap from fresh#206 at the dedup-step level.

Approach: stage bcfishpass.pscis_streams_150m (bcfp's pre-computed candidates with name_score, width_order_score, weighted_distance, multiple_match_ind) as input, call frs_candidates_pick with bcfp's exact filter + ORDER BY:

frs_candidates_pick(
  conn,
  table_in   = "fresh_test_207.candidates_bulk",
  table_to   = "fresh_test_207.picked_bulk",
  col_key    = "stream_crossing_id",
  exp_filter = "name_score != -100 AND width_order_score != -100 AND multiple_match_ind IS NULL",
  order_by   = c(
    "name_score DESC",
    "CASE WHEN modelled_xing_dist_instream IS NOT NULL
          THEN distance_to_stream - (distance_to_stream * 0.1)
          ELSE distance_to_stream END ASC"
  )
)

Diff vs bcfishpass.pscis (xref-excluded subset): ours 106 / ref 102 / identical 102 / only-ours 4 / only-ref 0. The 4 "extras" all live in bcfishpass.pscis_not_matched_to_streams — bcfp's downstream suspect_match filter (>50m distance) routes those to a separate table. That's a caller-level downstream filter, not a primitive responsibility.
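To make the pick semantics concrete, here is a rough in-memory analogue in base R of the filter and ordering used in the call above. The column names come from that call; the rows, their values, and the `candidate` label column are invented for illustration, and none of this is how the primitive runs (it executes SQL in PostgreSQL).

```r
# Toy candidates: every value here is invented for illustration only
cand <- data.frame(
  stream_crossing_id          = c(1, 1, 1, 2, 2),
  name_score                  = c(0, 100, 100, -100, 0),
  width_order_score           = c(0, 0, 0, 0, 0),
  multiple_match_ind          = NA,
  modelled_xing_dist_instream = c(NA, NA, 12, NA, NA),
  distance_to_stream          = c(5, 20, 21, 1, 30),
  candidate                   = c("a", "b", "c", "d", "e"),
  stringsAsFactors            = FALSE
)

# exp_filter analogue: drop disqualified rows
keep <- cand$name_score != -100 &
  cand$width_order_score != -100 &
  is.na(cand$multiple_match_ind)
cand <- cand[keep, ]

# CASE analogue: 10% distance discount when a modelled crossing distance exists
adj <- ifelse(
  !is.na(cand$modelled_xing_dist_instream),
  cand$distance_to_stream - cand$distance_to_stream * 0.1,
  cand$distance_to_stream
)

# ORDER BY key, name_score DESC, adjusted distance ASC
cand <- cand[order(cand$stream_crossing_id, -cand$name_score, adj), ]

# DISTINCT ON (key) analogue: keep the first row per key
picked <- cand[!duplicated(cand$stream_crossing_id), ]
print(picked[, c("stream_crossing_id", "candidate")])
```

Candidate "d" is removed by the filter, and for crossing 1 the two rows tied on `name_score` are separated by the discounted distance (21 * 0.9 = 18.9 versus 20), so "c" is picked. The sketch has no NULL sort keys; PostgreSQL sorts NULLs first under DESC by default while R's `order()` puts `NA` last, so the two would not agree on such rows without extra handling.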

Test plan

  • 25 mocked tests covering validation (required args, identifier sanitization, exp_* non-empty character, reserved-column collision when exp_score set, existing-score-column-allowed-when-exp_score-NULL edge case) and SQL composition (DROP + CREATE, WITH scored AS CTE present/absent, SELECT DISTINCT ON (col_key), WHERE present/absent, ORDER BY containing col_key first then caller's clauses).
  • Live byte-identical validation against bcfp BULK PSCIS dedup — 102/102.
  • Full suite: 1012 PASS / 2 FAIL / 3 WARN. Both FAILs and all 3 WARNs are pre-existing on main (the same set seen in #206, frs_point_match, v0.30.0).
  • R CMD check --no-tests: 0 errors / 4 warnings / 4 notes — identical to main.
  • lintr::lint("R/frs_candidates_pick.R"): zero lints.
  • DESCRIPTION 0.30.0 → 0.31.0 (minor bump — new exported function).
  • NEWS.md 0.31.0 entry covers semantics, composition, parity numbers, and link#154 as the downstream consumer.

Composition story

With #206 + #207 shipped, the bcfp PSCIS-build pipeline reproduces byte-identically via composition:

fresh::frs_point_snap(conn, ..., num_features = 5)        # multi-stream candidates per PSCIS
fresh::frs_candidates_pick(conn, ..., exp_score = ..., order_by = ...)  # pick best stream
fresh::frs_point_match(conn, ..., distance_max = 100, tiebreak = "planar")  # match to modelled

This three-step chain is exactly what link#154 wires into lnk_pipeline_crossings to close the BULK mapping_code parity gap.

🤖 Generated with Claude Code

NewGraphEnvironment and others added 6 commits May 11, 2026 10:06
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R/frs_candidates_pick.R + man/frs_candidates_pick.Rd + NAMESPACE
export. Function signature follows table_<role> / col_<role> /
exp_<role> conventions:

  frs_candidates_pick(
    conn, table_in, table_to, col_key,
    exp_score = NULL, exp_filter = NULL, order_by
  )

SQL: optional WITH scored AS (SELECT *, (<exp_score>) AS score
FROM <table_in>) CTE when exp_score supplied; SELECT DISTINCT ON
(<col_key>) * FROM (scored | table_in); optional WHERE <exp_filter>;
ORDER BY <col_key>, <caller's order_by clauses>. col_key prepended
to order_by to satisfy PostgreSQL DISTINCT ON requirement.

Reserved-column collision check on `score` when exp_score is set.
RPostgres can't run multi-statement SQL, so DROP and CREATE split
into separate .frs_db_execute calls.

Reuses fresh helpers: .frs_validate_identifier, .frs_db_execute,
.frs_table_columns. Mirrors frs_point_match scaffold structure
end-to-end.

Roxygen cross-refs frs_network_features and frs_point_match
(siblings in @family network) regenerated.

Lint clean; load_all signature verified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tests/testthat/test-frs_candidates_pick.R mirrors the test-frs_point_match
three-tier structure:

Tier 1 -- validation (7 tests): required args (table_in, table_to,
col_key, order_by); .frs_validate_identifier rejection for bad
identifiers; exp_score / exp_filter must be non-empty character
when supplied.

Tier 2 -- SQL composition via local_mocked_bindings on
.frs_db_execute + .frs_table_columns (5 tests):
  - full path: DROP, CREATE, WITH scored CTE, exp_score inline, DISTINCT
    ON (col_key), WHERE exp_filter, ORDER BY col_key + caller's clauses
  - exp_score=NULL omits WITH scored CTE
  - exp_filter=NULL omits WHERE clause
  - reserved-column collision: rejects table_in with `score` column
    when exp_score is set
  - existing `score` column allowed when exp_score is NULL (caller
    references it in order_by directly)

Total: 25 PASS / 0 FAIL on filter=frs_candidates_pick.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Validated frs_candidates_pick against bcfp's PSCIS-to-stream dedup
step using bcfp's pre-computed pscis_streams_150m as input (where
bcfp has already applied modelled-side dedup + computed name_score,
width_order_score, weighted_distance per 04_pscis.sql).

Acceptance: 102 / 102 ref picks identical, 0 missing.

  ours: 106 picks | ref: 102 picks
  identical pairs: 102
  only in ours: 4 (all in bcfishpass.pscis_not_matched_to_streams --
                   bcfp's downstream suspect_match >50m filter routes
                   these to a separate "not matched" table; that filter
                   is caller-level, not primitive-level)
  only in ref:  0

The dedup step is BYTE-IDENTICAL. The 4 "extras" are bcfp's downstream
routing logic that doesn't belong in this primitive.

Closes the BULK 5-diff gap from fresh#206 -- the snap-layer divergence
that fresh#206 documented as "needs frs_candidates_pick" is now
resolved at the dedup-step level. link#154 will compose
frs_point_snap(num_features=N) + frs_candidates_pick + frs_point_match
to reproduce the full bcfp PSCIS-build pipeline byte-identically (with
optional caller-level suspect_match filter on top for full
pscis-vs-pscis_not_matched routing).

Validation script: /tmp/fresh_207_live_validation.R

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DESCRIPTION 0.30.0 -> 0.31.0 (minor bump per R-package conventions
for new exported function).

NEWS.md 0.31.0 entry covers:
- generic over any score+filter+dedup-per-key workflow
- BULK byte-identical (102/102 ref picks identical, 0 missing)
- table_/col_/exp_<role> convention compliance
- 3-step composition story (frs_point_snap + frs_candidates_pick +
  frs_point_match) that reproduces bcfp PSCIS pipeline byte-identically
- first consumer link#154

R CMD check (--no-tests): 0 errors / 4 warnings / 4 notes -- identical
to main per side-by-side check (same as v0.30.0 baseline). lintr clean
on the new R file.

25 PASS / 0 FAIL on filter=frs_candidates_pick.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@NewGraphEnvironment NewGraphEnvironment merged commit 133908f into main May 11, 2026
1 check passed
@NewGraphEnvironment NewGraphEnvironment deleted the 207-frs-candidates-pick-score-filter-dedup-c branch May 11, 2026 17:17

Development

Successfully merging this pull request may close these issues.

frs_candidates_pick: score + filter + dedup candidates per key (e.g. multi-stream PSCIS-to-stream selection)
