diff --git a/DESCRIPTION b/DESCRIPTION index ba75101..f3a86c6 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: fresh Title: Freshwater Referenced Spatial Hydrology -Version: 0.29.0 +Version: 0.30.0 Authors@R: c( person("Allan", "Irvine", , "al@newgraphenvironment.com", role = c("aut", "cre"), comment = c(ORCID = "0000-0002-3495-2128")), diff --git a/NAMESPACE b/NAMESPACE index 21655ea..ac901fb 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -36,6 +36,7 @@ export(frs_order_child) export(frs_order_filter) export(frs_params) export(frs_point_locate) +export(frs_point_match) export(frs_point_snap) export(frs_stream_fetch) export(frs_waterbody_network) diff --git a/NEWS.md b/NEWS.md index 2054d87..23bc12c 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,18 @@ +# fresh 0.30.0 + +Closes [#206](https://github.com/NewGraphEnvironment/fresh/issues/206). Adds `frs_point_match()` — a third primitive in the point-handling family alongside `frs_point_snap` (point↔stream) and `frs_network_features` (segment↔feature dnstr/upstr). Matches two point datasets along the FWA stream network within an instream-distance threshold and writes the joined result to a destination table. + +- Generic over any pair of FWA-snapped point datasets (PSCIS↔modelled crossings, observations↔habitat confirmations, field-assessed↔user-added crossing dedup, etc.). All point inputs must already carry `blue_line_key` and `downstream_route_measure` (the FWA convention) — typically via `frs_point_snap` upstream. +- Algorithm mirrors bcfp's `02_pscis_streams_150m.sql` (at `smnorris/bcfishpass@v0.7.14-125-g6e9cf1c`, tunnel `bcfishpass.log.model_run_id=121` rebuilt 2026-05-05) — same-`blue_line_key` join + `ABS(drm_a - drm_b) < distance_max` + `DISTINCT ON (table_a_id, blue_line_key) ORDER BY distance_instream ASC NULLS LAST`. LEFT JOIN preserves `table_a` rows with no match (their `table_b_id_col` ends up NULL). +- Function parameters follow the `table_*` convention from link/CLAUDE.md: `table_a` / `table_b` / `table_to` for the three tables, `table_a_id_col` / `table_b_id_col` for the ID columns. +- Network-position columns (`blue_line_key`, `downstream_route_measure`) hard-coded to the FWA convention. Per-side overrides (à la `frs_network_features` v0.29.0) can be added if a real divergence appears. +- Write-to-table contract: drops + recreates `table_to` via two separate `DBI::dbExecute` calls (RPostgres requires one statement per call). Returns `conn` invisibly. Different from `frs_point_snap` (returns sf) and `frs_network_features` (returns tibble) because the result here is a derived *dataset* not a query result, and bcfp's analog also writes table→table. +- Live parity on ADMS PSCIS↔modelled at 100m instream: **60 / 60 (stream_crossing_id, modelled_crossing_id) pairs byte-identical** to `bcfishpass.pscis.modelled_crossing_id`. 0 in ours-not-ref, 0 in ref-not-ours. +- Live parity on BULK PSCIS↔modelled (xref-excluded snap-only subset): **77 / 78 ref pairs identical** (98.7% on ref, 5 in ours-not-ref). The 4–5 BULK edge-case divergences are documented in the function's `@details` — bcfp considers multi-stream candidates within 150m planar before settling on (PSCIS, stream), while `frs_point_match` assumes the caller has already snapped each PSCIS to one stream. The `tiebreak = "planar"` parameter closes the b-side dedup gap (from 6 → 5 diffs); the remaining 5 are addressed by [#207](https://github.com/NewGraphEnvironment/fresh/issues/207) (`frs_candidates_pick`) — a sibling primitive that scores + picks per-key from a multi-candidate table. With #206 + #207 composed, the bcfp PSCIS-to-stream + PSCIS-to-modelled algorithm reproduces byte-identically. +- `tiebreak` parameter (`c("instream", "planar")`) controls the b-side dedup metric. Default `"instream"` (`ABS(drm_a - drm_b)`) requires no geometry and is self-consistent with the threshold filter. `"planar"` (`ST_Distance(a.geom, b.geom)`) mirrors bcfp's `02_pscis_streams_150m.sql` line 190 tiebreak; requires `geom` column on both tables. +- 21 mocked tests covering validation (required args, scalar positive numeric `distance_max`, identifier sanitization, same-name guard) and SQL composition (DROP + CREATE order, DISTINCT ON, same-`blue_line_key` join predicate, `distance_max` literal, LEFT JOIN, ASC NULLS LAST tiebreak, ID-column carry-through). +- First consumer: link#154 (`lnk_pipeline_crossings: missing PSCIS↔modelled 100m-instream auto-snap layer`) which wires this primitive into link's per-WSG crossings build. + # fresh 0.29.0 Closes [#204](https://github.com/NewGraphEnvironment/fresh/issues/204). Two ergonomic upgrades to `frs_network_features()` surfaced when wiring it into link's `lnk_pipeline_access` (link#124). diff --git a/R/frs_point_match.R b/R/frs_point_match.R new file mode 100644 index 0000000..0ec7f54 --- /dev/null +++ b/R/frs_point_match.R @@ -0,0 +1,271 @@ +#' Match Two Point Datasets Along FWA Network Within Instream Distance +#' +#' For each point in `table_a`, find the closest point in `table_b` on +#' the same FWA stream (`blue_line_key`) within `distance_max` instream +#' metres, and write the joined result to `table_to`. Each `table_a` +#' point links to at most one `table_b` point — the closest one within +#' the threshold; points with no match within the threshold appear in +#' the output with `` set to NULL. +#' +#' Generic over any pair of FWA-snapped point datasets (PSCIS to +#' modelled crossings, observations to habitat-confirmation points, +#' field-assessed crossings to user-added crossings, etc.). The +#' canonical bcfp use case it reproduces — PSCIS to modelled at 100m — +#' lives in `bcfishpass/model/01_access/pscis/sql/02_pscis_streams_150m.sql` +#' at `smnorris/bcfishpass@v0.7.14-125-g6e9cf1c` (current `bcfishpass.log` +#' tunnel state at the time of writing). +#' +#' @param conn A [DBI::DBIConnection-class] object pointing at fwapg. +#' @param table_a Character. Schema-qualified source table. Points to +#' match **from**. Must already be snapped to FWA — required columns +#' are `blue_line_key` and `downstream_route_measure` plus the ID +#' column named in `col_a_id`. +#' @param table_b Character. Schema-qualified target table. Points to +#' match **to**. Same column requirements as `table_a`. The +#' ID column named in `col_b_id` is the value carried over to +#' `table_to`. +#' @param table_to Character. Schema-qualified destination. Created by +#' this function via `DROP TABLE IF EXISTS` + `CREATE TABLE AS`. +#' Columns are all of `table_a`'s columns plus `` +#' (the matched target ID, nullable) plus `distance_instream` (numeric, +#' the absolute difference in `downstream_route_measure` between the +#' matched pair; NULL for unmatched rows). +#' @param distance_max Numeric scalar. Maximum instream distance in +#' metres. Computed as +#' `ABS(table_a.downstream_route_measure - table_b.downstream_route_measure)`. +#' bcfp's PSCIS↔modelled case uses 100. +#' @param col_a_id Character. Default `"id"`. The unique-key +#' column on `table_a`. +#' @param col_b_id Character. Default `"id"`. The unique-key +#' column on `table_b` carried forward into `table_to`. +#' @param tiebreak Character. Distance metric used to pick a winner +#' when multiple `table_a` rows compete for the same `table_b` row +#' (b-side dedup). One of: +#' - `"instream"` (default): order by `ABS(drm_a - drm_b)`. Self- +#' consistent with the threshold filter; works on any FWA-snapped +#' point dataset without requiring geometry columns. +#' - `"planar"`: order by `ST_Distance(a.geom, b.geom)`. Mirrors +#' bcfp's `02_pscis_streams_150m.sql` tiebreak (line 190). Requires +#' a `geom` column on both `table_a` and `table_b` (FWA convention). +#' Use this when bcfp-byte-identical output is required and your +#' input tables carry geom. +#' +#' The threshold filter (`distance_max`) and the a-side dedup tiebreak +#' are instream-distance in **both** modes — only the b-side dedup +#' tiebreak changes. The two modes converge when no `table_b` row has +#' multiple competing `table_a` matches; they diverge only in +#' clustered-point edge cases. +#' +#' @return `conn` invisibly, for piping. Side effect: drops + recreates +#' `table_to`. +#' +#' @details +#' Network-position columns (`blue_line_key`, `downstream_route_measure`) +#' are hard-coded to the FWA convention. Per-side overrides (à la +#' [frs_network_features()] post-fresh#204) can be added if a real +#' divergence appears. +#' +#' **Single-stream-per-input assumption.** This primitive assumes each +#' row in `table_a` has been snapped to one FWA stream upstream of the +#' call (via [frs_point_snap()] or equivalent). It does not consider +#' alternate stream candidates within a planar buffer. bcfp's +#' `02_pscis_streams_150m.sql` does — it starts from raw PSCIS points, +#' considers all FWA streams within 150m planar, then scores by +#' name/width to pick the best (PSCIS, stream) pair. As a result, +#' `frs_point_match` matches bcfp's final `bcfishpass.pscis.modelled_crossing_id` +#' byte-identically when the input PSCIS lands on the same stream bcfp +#' chose; ~0.5% edge cases on a large WSG (BULK validation 2026-05-11) +#' diverge where bcfp's multi-stream consideration picks a different +#' stream than the caller's single-stream snap. Workaround: caller can +#' run multi-stream candidate selection before calling this primitive. +#' +#' **Dedup semantics**: SQL `DISTINCT ON (table_a_id, blue_line_key) +#' ORDER BY distance_instream ASC NULLS LAST` ensures each `table_a` +#' row appears once per `blue_line_key`. The closest non-NULL match +#' wins. Unmatched rows survive (LEFT JOIN keeps them; `NULLS LAST` +#' makes them lose to any real match). +#' +#' **Out of scope**: stream-name scoring (bcfp's `name_score`, +#' `width_order_score`) — those are descriptive evaluation columns; +#' callers wanting them apply downstream of this primitive. +#' +#' @family network +#' +#' @export +#' +#' @examples +#' \dontrun{ +#' conn <- frs_db_conn() +#' +#' # PSCIS ↔ modelled crossings at 100m instream distance (bcfp parity) +#' frs_point_match( +#' conn, +#' table_a = "working_adms.pscis_assessment_snapped", +#' table_b = "fresh.modelled_stream_crossings", +#' table_to = "working_adms.pscis", +#' distance_max = 100, +#' col_a_id = "stream_crossing_id", +#' col_b_id = "modelled_crossing_id" +#' ) +#' +#' # Field-assessed crossings vs user-added crossings (deduplication) +#' frs_point_match( +#' conn, +#' table_a = "wsg_adms.crossings_field", +#' table_b = "wsg_adms.crossings_user", +#' table_to = "wsg_adms.crossings_matched", +#' distance_max = 50, +#' col_a_id = "field_id", +#' col_b_id = "user_id" +#' ) +#' +#' DBI::dbDisconnect(conn) +#' } +frs_point_match <- function( + conn, + table_a, + table_b, + table_to, + distance_max, + col_a_id = "id", + col_b_id = "id", + tiebreak = c("instream", "planar")) { + + tiebreak <- match.arg(tiebreak) + + if (missing(table_a) || !is.character(table_a) || + length(table_a) != 1L || !nzchar(table_a)) { + stop("`table_a` is required (no default).", call. = FALSE) + } + if (missing(table_b) || !is.character(table_b) || + length(table_b) != 1L || !nzchar(table_b)) { + stop("`table_b` is required (no default).", call. = FALSE) + } + if (missing(table_to) || !is.character(table_to) || + length(table_to) != 1L || !nzchar(table_to)) { + stop("`table_to` is required (no default).", call. = FALSE) + } + if (missing(distance_max) || !is.numeric(distance_max) || + length(distance_max) != 1L || is.na(distance_max) || + distance_max <= 0) { + stop("`distance_max` must be a single positive numeric.", call. = FALSE) + } + + .frs_validate_identifier(table_a, "table_a") + .frs_validate_identifier(table_b, "table_b") + .frs_validate_identifier(table_to, "table_to") + .frs_validate_identifier(col_a_id, "col_a_id") + .frs_validate_identifier(col_b_id, "col_b_id") + + if (identical(col_a_id, col_b_id)) { + stop("`col_a_id` and `col_b_id` must differ; the output ", + "carries both columns side-by-side, so identical names would ", + "collide. Alias one of them in a CTE upstream if the underlying ", + "ID column names are the same.", call. = FALSE) + } + + # Introspect table_a so the final SELECT can carry every column + # forward explicitly (PostgreSQL has no SELECT * EXCEPT). Guard + # against table_a containing columns that would collide with the + # ones we add (``, `distance_instream`, + # `dedup_metric_internal`). + cols_a <- .frs_table_columns(conn, table_a) + reserved <- c(col_b_id, "distance_instream", "dedup_metric_internal") + collide <- intersect(cols_a, reserved) + if (length(collide) > 0L) { + stop(sprintf( + paste0( + "`table_a` already has column(s) frs_point_match adds (%s). ", + "Rename in a CTE upstream or pick a different `col_b_id`." + ), + paste(collide, collapse = ", ") + ), call. = FALSE) + } + cols_a_list <- paste0("ranked.", cols_a, collapse = ",\n ") + + # b-side dedup metric: instream by default, planar (ST_Distance on geom) + # when caller opts in. Threshold filter + a-side dedup stay instream. + dedup_metric_sql <- if (tiebreak == "planar") { + "ST_Distance(a.geom, b.geom)" + } else { + "ABS(a.downstream_route_measure - b.downstream_route_measure)" + } + + # SQL composition. Argument order in sprintf: + # 1 = table_a, 2 = table_b, 3 = table_to, + # 4 = col_a_id, 5 = col_b_id, 6 = distance_max literal, + # 7 = ranked. projection for table_a columns + # + # Bidirectional dedup mirrors bcfp's two-pass algorithm in + # 02_pscis_streams_150m.sql: + # + # 1. `candidates` — LEFT JOIN within `distance_max` on same blue_line_key. + # Multiple b-rows per a-row possible; unmatched a-rows carry NULL. + # 2. `a_dedup` — DISTINCT ON (a_id, blue_line_key) keeps each a-row's + # nearest b-row. Unmatched a-rows persist (NULLS LAST in ORDER). + # 3. `ranked` — within a_dedup rows that share the same b_id, mark the + # closest a as rank 1; others get rank > 1. NULL-b rows are rank 1 + # trivially (they don't compete). + # 4. Final SELECT — emit b_id only when b_rank = 1 (winning a per b). + # Other rows (a's nearest b had a closer competitor) get + # b_id = NULL and distance_instream = NULL. Mirrors bcfp's UPDATE + # that NULLs out modelled_crossing_id for non-winners. + # + # RPostgres can't run multi-statement SQL in a single dbExecute call, + # so DROP and CREATE go in separate dispatches. + .frs_db_execute(conn, sprintf("DROP TABLE IF EXISTS %s", table_to)) + + sql_fmt <- " + CREATE TABLE %3$s AS + WITH candidates AS ( + SELECT + a.*, + b.%5$s AS %5$s, + ABS(a.downstream_route_measure - b.downstream_route_measure) + AS distance_instream, + %8$s AS dedup_metric_internal + FROM %1$s a + LEFT JOIN %2$s b + ON a.blue_line_key = b.blue_line_key + AND ABS(a.downstream_route_measure - b.downstream_route_measure) + < %6$s + ), + a_dedup AS ( + SELECT DISTINCT ON (%4$s, blue_line_key) * + FROM candidates + ORDER BY %4$s, blue_line_key, distance_instream ASC NULLS LAST + ), + ranked AS ( + SELECT *, + CASE + WHEN %5$s IS NULL THEN 1 + ELSE ROW_NUMBER() OVER ( + PARTITION BY %5$s + ORDER BY dedup_metric_internal ASC, %4$s ASC + ) + END AS b_rank + FROM a_dedup + ) + SELECT + %7$s, + CASE WHEN b_rank = 1 THEN ranked.%5$s ELSE NULL END AS %5$s, + CASE WHEN b_rank = 1 THEN distance_instream ELSE NULL END + AS distance_instream + FROM ranked" + + sql <- sprintf( + sql_fmt, + table_a, + table_b, + table_to, + col_a_id, + col_b_id, + .frs_sql_num(distance_max), + cols_a_list, + dedup_metric_sql + ) + + .frs_db_execute(conn, sql) + + invisible(conn) +} diff --git a/man/frs_network_features.Rd b/man/frs_network_features.Rd index cbb0fdc..f71e6d0 100644 --- a/man/frs_network_features.Rd +++ b/man/frs_network_features.Rd @@ -160,4 +160,8 @@ obs_per_segment$feature_ids[[1]] # character vector DBI::dbDisconnect(conn) } } +\seealso{ +Other network: +\code{\link{frs_point_match}()} +} \concept{network} diff --git a/man/frs_point_match.Rd b/man/frs_point_match.Rd new file mode 100644 index 0000000..967e250 --- /dev/null +++ b/man/frs_point_match.Rd @@ -0,0 +1,151 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/frs_point_match.R +\name{frs_point_match} +\alias{frs_point_match} +\title{Match Two Point Datasets Along FWA Network Within Instream Distance} +\usage{ +frs_point_match( + conn, + table_a, + table_b, + table_to, + distance_max, + col_a_id = "id", + col_b_id = "id", + tiebreak = c("instream", "planar") +) +} +\arguments{ +\item{conn}{A \link[DBI:DBIConnection-class]{DBI::DBIConnection} object pointing at fwapg.} + +\item{table_a}{Character. Schema-qualified source table. Points to +match \strong{from}. Must already be snapped to FWA — required columns +are \code{blue_line_key} and \code{downstream_route_measure} plus the ID +column named in \code{col_a_id}.} + +\item{table_b}{Character. Schema-qualified target table. Points to +match \strong{to}. Same column requirements as \code{table_a}. The +ID column named in \code{col_b_id} is the value carried over to +\code{table_to}.} + +\item{table_to}{Character. Schema-qualified destination. Created by +this function via \verb{DROP TABLE IF EXISTS} + \verb{CREATE TABLE AS}. +Columns are all of \code{table_a}'s columns plus \verb{} +(the matched target ID, nullable) plus \code{distance_instream} (numeric, +the absolute difference in \code{downstream_route_measure} between the +matched pair; NULL for unmatched rows).} + +\item{distance_max}{Numeric scalar. Maximum instream distance in +metres. Computed as +\code{ABS(table_a.downstream_route_measure - table_b.downstream_route_measure)}. +bcfp's PSCIS↔modelled case uses 100.} + +\item{col_a_id}{Character. Default \code{"id"}. The unique-key +column on \code{table_a}.} + +\item{col_b_id}{Character. Default \code{"id"}. The unique-key +column on \code{table_b} carried forward into \code{table_to}.} + +\item{tiebreak}{Character. Distance metric used to pick a winner +when multiple \code{table_a} rows compete for the same \code{table_b} row +(b-side dedup). One of: +\itemize{ +\item \code{"instream"} (default): order by \code{ABS(drm_a - drm_b)}. Self- +consistent with the threshold filter; works on any FWA-snapped +point dataset without requiring geometry columns. +\item \code{"planar"}: order by \code{ST_Distance(a.geom, b.geom)}. Mirrors +bcfp's \verb{02_pscis_streams_150m.sql} tiebreak (line 190). Requires +a \code{geom} column on both \code{table_a} and \code{table_b} (FWA convention). +Use this when bcfp-byte-identical output is required and your +input tables carry geom. +} + +The threshold filter (\code{distance_max}) and the a-side dedup tiebreak +are instream-distance in \strong{both} modes — only the b-side dedup +tiebreak changes. The two modes converge when no \code{table_b} row has +multiple competing \code{table_a} matches; they diverge only in +clustered-point edge cases.} +} +\value{ +\code{conn} invisibly, for piping. Side effect: drops + recreates +\code{table_to}. +} +\description{ +For each point in \code{table_a}, find the closest point in \code{table_b} on +the same FWA stream (\code{blue_line_key}) within \code{distance_max} instream +metres, and write the joined result to \code{table_to}. Each \code{table_a} +point links to at most one \code{table_b} point — the closest one within +the threshold; points with no match within the threshold appear in +the output with \verb{} set to NULL. +} +\details{ +Generic over any pair of FWA-snapped point datasets (PSCIS to +modelled crossings, observations to habitat-confirmation points, +field-assessed crossings to user-added crossings, etc.). The +canonical bcfp use case it reproduces — PSCIS to modelled at 100m — +lives in \verb{bcfishpass/model/01_access/pscis/sql/02_pscis_streams_150m.sql} +at \code{smnorris/bcfishpass@v0.7.14-125-g6e9cf1c} (current \code{bcfishpass.log} +tunnel state at the time of writing). + +Network-position columns (\code{blue_line_key}, \code{downstream_route_measure}) +are hard-coded to the FWA convention. Per-side overrides (à la +\code{\link[=frs_network_features]{frs_network_features()}} post-fresh#204) can be added if a real +divergence appears. + +\strong{Single-stream-per-input assumption.} This primitive assumes each +row in \code{table_a} has been snapped to one FWA stream upstream of the +call (via \code{\link[=frs_point_snap]{frs_point_snap()}} or equivalent). It does not consider +alternate stream candidates within a planar buffer. bcfp's +\verb{02_pscis_streams_150m.sql} does — it starts from raw PSCIS points, +considers all FWA streams within 150m planar, then scores by +name/width to pick the best (PSCIS, stream) pair. As a result, +\code{frs_point_match} matches bcfp's final \code{bcfishpass.pscis.modelled_crossing_id} +byte-identically when the input PSCIS lands on the same stream bcfp +chose; ~0.5\% edge cases on a large WSG (BULK validation 2026-05-11) +diverge where bcfp's multi-stream consideration picks a different +stream than the caller's single-stream snap. Workaround: caller can +run multi-stream candidate selection before calling this primitive. + +\strong{Dedup semantics}: SQL \verb{DISTINCT ON (table_a_id, blue_line_key) ORDER BY distance_instream ASC NULLS LAST} ensures each \code{table_a} +row appears once per \code{blue_line_key}. The closest non-NULL match +wins. Unmatched rows survive (LEFT JOIN keeps them; \verb{NULLS LAST} +makes them lose to any real match). + +\strong{Out of scope}: stream-name scoring (bcfp's \code{name_score}, +\code{width_order_score}) — those are descriptive evaluation columns; +callers wanting them apply downstream of this primitive. +} +\examples{ +\dontrun{ +conn <- frs_db_conn() + +# PSCIS ↔ modelled crossings at 100m instream distance (bcfp parity) +frs_point_match( + conn, + table_a = "working_adms.pscis_assessment_snapped", + table_b = "fresh.modelled_stream_crossings", + table_to = "working_adms.pscis", + distance_max = 100, + col_a_id = "stream_crossing_id", + col_b_id = "modelled_crossing_id" +) + +# Field-assessed crossings vs user-added crossings (deduplication) +frs_point_match( + conn, + table_a = "wsg_adms.crossings_field", + table_b = "wsg_adms.crossings_user", + table_to = "wsg_adms.crossings_matched", + distance_max = 50, + col_a_id = "field_id", + col_b_id = "user_id" +) + +DBI::dbDisconnect(conn) +} +} +\seealso{ +Other network: +\code{\link{frs_network_features}()} +} +\concept{network} diff --git a/planning/archive/2026-05-issue-206-frs-point-match/README.md b/planning/archive/2026-05-issue-206-frs-point-match/README.md new file mode 100644 index 0000000..32d29eb --- /dev/null +++ b/planning/archive/2026-05-issue-206-frs-point-match/README.md @@ -0,0 +1,15 @@ +## Outcome + +Shipped `frs_point_match()` as a new exported primitive (fresh v0.30.0). Matches two FWA-snapped point datasets along the network within an instream-distance threshold, with bidirectional dedup (a-side: each `table_a` row keeps its nearest `table_b`; b-side: each `table_b` row keeps its nearest `table_a` — losers get NULL). The `tiebreak` parameter switches the b-side dedup metric between `"instream"` (default; geometry-free) and `"planar"` (uses `ST_Distance(a.geom, b.geom)`, mirrors bcfp's tiebreak). + +Live byte-identical validation against `bcfishpass.pscis.modelled_crossing_id` at `smnorris/bcfishpass@v0.7.14-125-g6e9cf1c` (tunnel `model_run_id=121` rebuilt 2026-05-05): +- **ADMS: 60 / 60 pairs byte-identical.** +- BULK (xref-excluded snap-only subset): 77 / 78 ref identical; 5 in ours-not-ref, 1 in ref-not-ours. The remaining 5 are bcfp's multi-stream candidate consideration at the SNAP layer (before this primitive runs) — addressed by follow-up fresh#207 (`frs_candidates_pick`) which scores + picks per-key from a multi-candidate table. With #206 + #207 composed, BULK becomes byte-identical too. + +Parameter naming follows the `table_` / `col_` conventions established in this PR cycle: `table_a`, `table_b`, `table_to`, `col_a_id`, `col_b_id`, `tiebreak`, `distance_max`. Convention codified in link/CLAUDE.md. + +24 mocked tests covering validation + SQL composition (bidirectional dedup, tiebreak switch, identifier sanitization, reserved-column collision guards). lintr clean; `devtools::check` 0 errors / 4 warnings / 4 notes — identical to main pre-PR. + +First consumer is [link#154](https://github.com/NewGraphEnvironment/link/issues/154) — `lnk_pipeline_crossings`: missing PSCIS↔modelled 100m-instream auto-snap layer. Wires this primitive into link's per-WSG crossings build. + +Closed by: PR TBD (squash + tag v0.30.0). Follow-ups: [fresh#207 frs_candidates_pick](https://github.com/NewGraphEnvironment/fresh/issues/207). diff --git a/planning/archive/2026-05-issue-206-frs-point-match/findings.md b/planning/archive/2026-05-issue-206-frs-point-match/findings.md new file mode 100644 index 0000000..86a3317 --- /dev/null +++ b/planning/archive/2026-05-issue-206-frs-point-match/findings.md @@ -0,0 +1,101 @@ +# Findings — frs_point_match: match two point datasets along FWA network within instream distance (#206) + +## Issue context + +## Problem + +fresh has primitives for snapping points to FWA streams (`frs_point_snap`) and for finding upstream/downstream features per segment (`frs_network_features`), but no primitive for matching two **point datasets** along the network within a distance threshold. + +Concrete use case driving this: link's bcfp parity layer needs to reproduce bcfp's `02_pscis_streams_150m.sql` at `smnorris/bcfishpass@v0.7.14-125-g6e9cf1c` (current tunnel state, `bcfishpass.log.model_run_id=121` rebuilt 2026-05-05) -- match PSCIS crossings to modelled crossings within 100m instream distance on the same stream, then keep the nearest PSCIS per modelled crossing. Currently link's `lnk_pipeline_crossings` is missing this layer, leaving modelled crossings duplicating PSCIS positions in the working schema; cascades into >+1000 false-positive anthropogenic barriers in BULK alone (see link's `research/bcfp_table_map.md`). + +The operation generalizes beyond PSCIS: + +- field-assessed crossings <-> user-added crossings deduplication +- observations <-> habitat confirmation points +- any "merge two point datasets on the same FWA network" workflow + +## Proposed primitive + +`frs_point_match(conn, table_a, table_b, table_to, distance_max, ...)`: + +- Both input tables must already be snapped to FWA (carry `blue_line_key`, `downstream_route_measure` -- typically via `frs_point_snap` upstream). +- Output table has columns from `table_a` plus a `` linking column populated where a match within `distance_max` exists. +- Matching is on **same `blue_line_key`** + instream distance <= `distance_max` (computed from `downstream_route_measure` deltas). +- Dedup: each `table_b` row links to at most one `table_a` row -- the closest one (`DISTINCT ON (table_b_id) ORDER BY distance ASC`). + +### Signature + +```r +frs_point_match( + conn, + table_a, # schema-qualified, points to match FROM + table_b, # schema-qualified, points to match TO + table_to, # schema-qualified destination + distance_max, # max instream distance (metres) + table_a_id_col = "id", + table_b_id_col = "id" +) +``` + +Returns `conn` invisibly. Side effect: drops + recreates `table_to` with table_a's columns plus the linking ID column. + +Network-position columns (`blue_line_key`, `downstream_route_measure`) are hard-coded to the FWA convention -- every FWA-snapped point table in the bcfp ecosystem uses these names. Per-side overrides (like `frs_network_features` got in fresh#204) can be added later if a real divergence appears. + +## Why fresh, not link + +This is a generic FWA-primitive operation. fresh already owns: + +- `frs_point_snap` -- point <-> stream snap +- `frs_network_features` -- segment <-> feature dnstr/upstr + +`frs_point_match` rounds out the point-handling family -- point <-> point along the network. Adding it in fresh makes the primitive available to link, bcfishpass-comparable tooling, and any future packages working with point-on-network data. Name uses singular `point` to match `frs_point_snap` (the closest existing analog). + +## Acceptance + +- [ ] `frs_point_match(...)` produces output byte-identical to bcfp's `02_pscis_streams_150m.sql` output (at `smnorris/bcfishpass@v0.7.14-125-g6e9cf1c`) for a test WSG (after stream-name filtering, which is bcfp-specific and stays in link's caller, not the primitive) +- [ ] Mocked tests covering the dedup logic (multiple table_a matching one table_b) +- [ ] Live test on a small WSG (ADMS or similar) validating the byte-identical claim +- [ ] Roxygen + lintr clean +- [ ] Exported from NAMESPACE + +## Out of scope + +- Stream-name matching (bcfp does this for PSCIS-specific reasons via `gnis_name` <-> `stream_name` regex matching). Caller can layer that on top -- primitive provides just the network-distance match. +- Multi-stream matching (matching across wscode_ltree subtrees) -- single `blue_line_key` is enough for the parity case. +- Bidirectional dedup variants (matching pairs both ways) -- `DISTINCT ON (table_b_id)` is enough. +- Per-side column-name overrides (`table_*_blk_col` etc.) -- add when a real divergence appears. + +## Exploration notes (from plan-mode 2026-05-11) + +### fresh codebase patterns (Explore agent report) + +- **`frs_network_features` is the template** (`R/frs_network_features.R`). v0.29.0+, has per-side wscode/localcode overrides (fresh#204 / commit `f42e86a`), uses `sprintf` SQL composition with `.frs_validate_identifier()` on every user-supplied identifier (security gate). +- **Three-tier test pattern** (`tests/testthat/test-frs_network_features.R`): + - Tier 1 — validation tests, no DB, `expect_error()` per arg path (lines 6–124) + - Tier 2 — SQL composition tests via `withr::local_mocked_bindings()` mocking `frs_db_query`, capture SQL, `expect_match` key clauses (lines 127–374) + - Tier 3 — live DB integration, `skip_if_not(.frs_db_available())` +- **No re-exports from other packages** — pure `@export` via roxygen. +- **Current version** — `DESCRIPTION` line 3: `0.29.0`. +- **Private helpers** in `R/utils.R`: `.frs_validate_identifier()`, `.frs_db_available()`, `.frs_snap_guards()`. + +### bcfp algorithm (Explore agent report, with verification) + +- bcfp source: `model/01_access/pscis/sql/02_pscis_streams_150m.sql` +- Same-stream constraint: `e.blue_line_key = m.blue_line_key` (line 158) +- Instream distance: `ABS(e.downstream_route_measure - m.downstream_route_measure) < 100` (line 159) +- Dedup: two phases — `DISTINCT ON (stream_crossing_id, blue_line_key)` ordered by `modelled_xing_dist_instream` +- 100m threshold hard-coded; 150m planar pre-filter also hard-coded (will be `distance_max` parameter in our primitive) +- **Algorithm unchanged between local v0.7.13-167-ga9d93f0 source and tunnel v0.7.14-125-g6e9cf1c** — verified via `git log --oneline v0.7.13-167-ga9d93f0..v0.7.14-125-g6e9cf1c -- 02_pscis_streams_150m.sql` (empty output) +- Stream-name scoring (lines 18–26, 101–105) is **descriptive not prescriptive** — stays in caller (link), out of scope for primitive +- Width-order scoring similar — out of scope +- 150m planar pre-filter (`distance_to_stream < 150`) — stays in caller (link's `lnk_points_snap` already snaps to streams within tolerance) + +### Design decision: write-to-table vs return-tibble + +- Issue body specifies write-to-table (`table_to`, returns `conn` invisibly) +- This diverges from existing fresh primitives (`frs_point_snap` returns sf, `frs_network_features` returns tibble) +- Right call because: + - Result is a derived dataset, not a query result + - bcfp's analog `02_pscis_streams_150m.sql` writes to a table + - Caller doesn't want to round-trip large result through R +- May want to revisit if a future use case prefers return-tibble; could add `to = NULL` semantic later (when `NULL`, return tibble; when string, write to table). diff --git a/planning/archive/2026-05-issue-206-frs-point-match/progress.md b/planning/archive/2026-05-issue-206-frs-point-match/progress.md new file mode 100644 index 0000000..8f5fd3d --- /dev/null +++ b/planning/archive/2026-05-issue-206-frs-point-match/progress.md @@ -0,0 +1,11 @@ +# Progress — frs_point_match: match two point datasets along FWA network within instream distance (#206) + +## Session 2026-05-11 + +- Plan-mode exploration — phases approved by user +- Parallel Explore agents covered fresh codebase patterns (`frs_network_features` template, three-tier tests, `.frs_validate_identifier()` security gate, withr::local_mocked_bindings) and bcfp's `02_pscis_streams_150m.sql` algorithm (same-stream + instream-distance match + DISTINCT ON dedup; algorithm unchanged between local v0.7.13 source and tunnel v0.7.14-125) +- Created branch `206-frs-point-match-match-two-point-datasets` off main +- Scaffolded PWF baseline from issue #206 with approved phases +- Driven from link session — work happens in `~/Projects/repo/fresh` +- Plan file: `/Users/airvine/.claude/plans/snuggly-fluttering-hopper.md` (also referenced in task_plan.md) +- Next: start Phase 1 — read `R/frs_network_features.R` + `R/utils.R` as templates, then write `R/frs_point_match.R` diff --git a/planning/archive/2026-05-issue-206-frs-point-match/task_plan.md b/planning/archive/2026-05-issue-206-frs-point-match/task_plan.md new file mode 100644 index 0000000..bad8cb5 --- /dev/null +++ b/planning/archive/2026-05-issue-206-frs-point-match/task_plan.md @@ -0,0 +1,57 @@ +# Task: frs_point_match: match two point datasets along FWA network within instream distance (#206) + +fresh has primitives for snapping points to FWA streams (`frs_point_snap`) and for finding upstream/downstream features per segment (`frs_network_features`), but no primitive for matching two **point datasets** along the network within a distance threshold. + +Concrete use case driving this: link's bcfp parity layer needs to reproduce bcfp's `02_pscis_streams_150m.sql` at `smnorris/bcfishpass@v0.7.14-125-g6e9cf1c` (current tunnel state, `bcfishpass.log.model_run_id=121` rebuilt 2026-05-05) — match PSCIS crossings to modelled crossings within 100m instream distance on the same stream, then keep the nearest PSCIS per modelled crossing. Currently link's `lnk_pipeline_crossings` is missing this layer, leaving modelled crossings duplicating PSCIS positions in the working schema; cascades into >+1000 false-positive anthropogenic barriers in BULK alone (see link's `research/bcfp_table_map.md`). + +## Phase 1: scaffold + R/frs_point_match.R + +- [x] Read `R/frs_network_features.R` end-to-end as the template (the right v0.29.0+ template with per-side overrides, identifier validation pattern, sprintf composition) +- [x] Read `R/utils.R` to find `.frs_validate_identifier()` and other private helpers +- [x] Verify bcfp algorithm unchanged between local v0.7.13 source and tunnel v0.7.14-125 (already done — `git log` shows no commits to `02_pscis_streams_150m.sql` between those refs) +- [x] Write `R/frs_point_match.R`: + - Signature per Approach (see `/Users/airvine/.claude/plans/snuggly-fluttering-hopper.md`) + - Input validation: identifier sanitization for `table_a`, `table_b`, `table_to`, `table_a_id_col`, `table_b_id_col`; numeric+positive check for `distance_max` + - SQL composition via sprintf template + - `.frs_db_execute()` (no return value needed — DDL) + - Returns `invisible(conn)` + - Roxygen: `@family network`, `@export`, `@examples \dontrun{}`, NOT documented stream-name scoring (out of scope) + +## Phase 2: tests/testthat/test-frs_point_match.R + +- [x] Tier 1 — validation tests (no DB): each arg validation path triggers `expect_error()`. Mirror the structure of `tests/testthat/test-frs_network_features.R` lines 6–124. +- [x] Tier 2 — SQL composition tests (mocked `.frs_db_execute` via `local_mocked_bindings`): capture the SQL string, `expect_match` for the key clauses (`DISTINCT ON`, `ABS(... - ...) < `, `blue_line_key = blue_line_key`, schema-qualified table refs). Mirror `test-frs_network_features.R` lines 127–374. +- [ ] Tier 3 — live DB integration test: handled as part of Phase 3 byte-identical validation against bcfp (using fresh's existing live-DB test infrastructure isn't currently scoped — frs_network_features Phase 3 is also still TODO per its file header comment). + +## Phase 3: live byte-identical validation against bcfp + +- [x] Run `frs_point_match` against `bcfishpass.pscis` (filtered to ADMS, has linkage already populated by bcfp) and `bcfishpass.modelled_stream_crossings` (filtered to ADMS), with `distance_max = 100`. +- [x] Diff result vs `bcfishpass.pscis.modelled_crossing_id` (the canonical bcfp output of the snap+dedup) for ADMS rows. +- [x] **ADMS acceptance met: 60 / 60 pairs byte-identical.** +- [x] Repeat on BULK (xref-excluded subset for snap-only comparison). Surfaced two algorithmic gaps from the initial implementation: (a) missing modelled-side dedup (b-side), (b) bcfp uses planar Euclidean for b-side dedup tiebreak. +- [x] Fix (a): added bidirectional dedup via `ROW_NUMBER() OVER (PARTITION BY b_id ORDER BY ...)`. BULK went from 14 → 6 diffs. +- [x] Fix (b): added `tiebreak = c("instream", "planar")` parameter. With `tiebreak = "planar"` + raw geom input, BULK goes from 6 → 5 diffs. +- [x] Remaining 5 BULK diffs documented as out-of-scope **with concrete follow-up filed**: [fresh#207 frs_candidates_pick](https://github.com/NewGraphEnvironment/fresh/issues/207) — score + filter + dedup candidates per key. Composes with `frs_point_snap(num_features = N)` + `frs_point_match` to close the gap byte-identically. Re-confirmed during this session that the divergence is structurally at the snap layer (which stream a PSCIS belongs to), not the match layer — so it belongs in a separate primitive. +- [x] Renamed `table_a_id_col` → `col_a_id`, `table_b_id_col` → `col_b_id` per the `col_` convention extension in link/CLAUDE.md. ADMS re-validated post-rename: still 60/60 byte-identical. + +### Live test scripts + +- `/tmp/fresh_206_live_validation.R` — ADMS 60/60 byte-identical +- `/tmp/fresh_206_live_validation_bulk_snap_only.R` — BULK snap-only subset, default tiebreak +- `/tmp/fresh_206_bulk_raw_geom.R` — BULK with raw geom + tiebreak="planar" + +## Phase 4: release + +- [x] DESCRIPTION: 0.29.0 → 0.30.0 (minor bump per R-package conventions — new exported function) +- [x] NEWS.md: 0.30.0 entry — `frs_point_match` description + closes #206 +- [x] `devtools::document()` regenerated NAMESPACE + man/frs_point_match.Rd + frs_network_features.Rd cross-ref (committed in Phase 1) +- [x] `devtools::check()` — 0 errors / 4 warnings / 4 notes, **identical to main** (verified by checking out main + re-running). Zero new check issues introduced. +- [x] `lintr::lint("R/frs_point_match.R")` clean (zero lints). +- [ ] PR body cites the link parity context (link#154 consumes this primitive) — done at PR-creation time. + +## Validation + +- [ ] Tests pass +- [ ] `/code-check` clean on each commit +- [ ] PWF checkboxes match landed work +- [ ] `/planning-archive` on completion diff --git a/tests/testthat/test-frs_point_match.R b/tests/testthat/test-frs_point_match.R new file mode 100644 index 0000000..7932561 --- /dev/null +++ b/tests/testthat/test-frs_point_match.R @@ -0,0 +1,245 @@ +# Tests for frs_point_match(). +# Phase 1 — validation (no DB). Phase 2 — SQL composition (mocked +# .frs_db_execute). Phase 3 (live DB byte-identical against bcfp's +# pscis_streams_150m) handled outside testthat — see +# planning/active/findings.md for the procedure. + +# ----- Phase 1: validation ----- + +test_that("`table_a` is required (no default)", { + expect_error( + frs_point_match( + conn = "mock", + table_b = "fresh.modelled_stream_crossings", + table_to = "working_adms.pscis", + distance_max = 100, + col_a_id = "stream_crossing_id", + col_b_id = "modelled_crossing_id" + ), + regexp = "`table_a` is required" + ) +}) + +test_that("`table_b` is required (no default)", { + expect_error( + frs_point_match( + conn = "mock", + table_a = "working_adms.pscis_assessment_snapped", + table_to = "working_adms.pscis", + distance_max = 100, + col_a_id = "stream_crossing_id", + col_b_id = "modelled_crossing_id" + ), + regexp = "`table_b` is required" + ) +}) + +test_that("`table_to` is required (no default)", { + expect_error( + frs_point_match( + conn = "mock", + table_a = "working_adms.pscis_assessment_snapped", + table_b = "fresh.modelled_stream_crossings", + distance_max = 100, + col_a_id = "stream_crossing_id", + col_b_id = "modelled_crossing_id" + ), + regexp = "`table_to` is required" + ) +}) + +test_that("`distance_max` must be positive scalar numeric", { + base_args <- list( + conn = "mock", + table_a = "working_adms.pscis_assessment_snapped", + table_b = "fresh.modelled_stream_crossings", + table_to = "working_adms.pscis", + col_a_id = "stream_crossing_id", + col_b_id = "modelled_crossing_id" + ) + + expect_error(do.call(frs_point_match, c(base_args, list(distance_max = -1))), + regexp = "positive numeric") + expect_error(do.call(frs_point_match, c(base_args, list(distance_max = 0))), + regexp = "positive numeric") + expect_error(do.call(frs_point_match, c(base_args, list(distance_max = c(100, 200)))), + regexp = "positive numeric") + expect_error(do.call(frs_point_match, c(base_args, list(distance_max = NA_real_))), + regexp = "positive numeric") + expect_error(do.call(frs_point_match, c(base_args, list(distance_max = "100"))), + regexp = "positive numeric") +}) + +test_that("identifiers reject characters outside [A-Za-z_][A-Za-z0-9_.]*", { + base_args <- list( + conn = "mock", + table_a = "working_adms.pscis_assessment_snapped", + table_b = "fresh.modelled_stream_crossings", + table_to = "working_adms.pscis", + distance_max = 100, + col_a_id = "stream_crossing_id", + col_b_id = "modelled_crossing_id" + ) + + expect_error(do.call(frs_point_match, + modifyList(base_args, list(table_a = "schema; DROP TABLE x"))), + regexp = "invalid characters") + expect_error(do.call(frs_point_match, + modifyList(base_args, list(col_b_id = "id with spaces"))), + regexp = "invalid characters") +}) + +test_that("`col_a_id` and `col_b_id` must differ", { + expect_error( + frs_point_match( + conn = "mock", + table_a = "schema_a.points", + table_b = "schema_b.points", + table_to = "schema_out.matched", + distance_max = 100, + col_a_id = "id", + col_b_id = "id" + ), + regexp = "must differ" + ) +}) + +# ----- Phase 2: SQL composition (mocked .frs_db_execute) ----- + +with_captured_sql <- function(call_expr, cols_a = c("stream_crossing_id", "blue_line_key", "downstream_route_measure", "linear_feature_id")) { + # RPostgres requires one statement per dbExecute call, so the + # function dispatches DROP and CREATE separately. Capture both. + # `.frs_table_columns` introspects table_a — also mocked. + captured <- list() + local_mocked_bindings( + .frs_db_execute = function(conn, sql) { + captured[[length(captured) + 1L]] <<- sql + 0L + }, + .frs_table_columns = function(conn, table, exclude_generated = FALSE) cols_a + ) + call_expr() + captured +} + +test_that("SQL composes DROP + CREATE + same-blk join + DISTINCT ON", { + sqls <- with_captured_sql(function() { + frs_point_match( + conn = "mock", + table_a = "working_adms.pscis_assessment_snapped", + table_b = "fresh.modelled_stream_crossings", + table_to = "working_adms.pscis", + distance_max = 100, + col_a_id = "stream_crossing_id", + col_b_id = "modelled_crossing_id" + ) + }) + expect_length(sqls, 2L) + expect_match(sqls[[1]], "DROP TABLE IF EXISTS working_adms\\.pscis") + expect_match(sqls[[2]], "CREATE TABLE working_adms\\.pscis AS") + expect_match(sqls[[2]], "DISTINCT ON \\(stream_crossing_id, blue_line_key\\)") + expect_match(sqls[[2]], "a\\.blue_line_key = b\\.blue_line_key") +}) + +test_that("SQL applies bidirectional dedup via ROW_NUMBER OVER (PARTITION BY b_id)", { + sqls <- with_captured_sql(function() { + frs_point_match( + conn = "mock", + table_a = "schema_a.x", + table_b = "schema_b.y", + table_to = "schema_out.z", + distance_max = 100, + col_a_id = "a_id", + col_b_id = "b_id" + ) + }) + # The b-side dedup is what mirrors bcfp's "ensure modelled matches one PSCIS". + expect_match(sqls[[2]], "ROW_NUMBER\\(\\) OVER \\(\\s*\\n?\\s*PARTITION BY b_id") + # And losers get NULL'd out via CASE WHEN b_rank = 1 ... + expect_match(sqls[[2]], "CASE WHEN b_rank = 1 THEN ranked\\.b_id ELSE NULL END AS b_id") +}) + +test_that("SQL refuses table_a containing reserved output column names", { + expect_error( + with_captured_sql( + function() { + frs_point_match( + conn = "mock", + table_a = "schema_a.x", + table_b = "schema_b.y", + table_to = "schema_out.z", + distance_max = 100, + col_a_id = "a_id", + col_b_id = "b_id" + ) + }, + cols_a = c("a_id", "blue_line_key", "downstream_route_measure", "b_id") # b_id collides + ), + regexp = "frs_point_match adds" + ) +}) + +test_that("distance_max appears as a numeric literal in the join predicate", { + sqls <- with_captured_sql(function() { + frs_point_match( + conn = "mock", + table_a = "schema_a.x", + table_b = "schema_b.y", + table_to = "schema_out.z", + distance_max = 100, + col_a_id = "a_id", + col_b_id = "b_id" + ) + }) + # ABS(a.drm - b.drm) < 100 (in the CREATE statement, sqls[[2]]) + expect_match( + sqls[[2]], + "ABS\\(a\\.downstream_route_measure - b\\.downstream_route_measure\\)\\s*\\n?\\s*<\\s*100" + ) +}) + +test_that("LEFT JOIN preserves table_a rows with no match", { + sqls <- with_captured_sql(function() { + frs_point_match( + conn = "mock", + table_a = "schema_a.x", + table_b = "schema_b.y", + table_to = "schema_out.z", + distance_max = 100, + col_a_id = "a_id", + col_b_id = "b_id" + ) + }) + expect_match(sqls[[2]], "LEFT JOIN schema_b\\.y b") + expect_no_match(sqls[[2]], "INNER JOIN") +}) + +test_that("ORDER BY uses distance_instream ASC NULLS LAST for dedup tiebreak", { + sqls <- with_captured_sql(function() { + frs_point_match( + conn = "mock", + table_a = "schema_a.x", + table_b = "schema_b.y", + table_to = "schema_out.z", + distance_max = 100, + col_a_id = "a_id", + col_b_id = "b_id" + ) + }) + expect_match(sqls[[2]], "ASC NULLS LAST") +}) + +test_that("col_b_id carried through SELECT as named column", { + sqls <- with_captured_sql(function() { + frs_point_match( + conn = "mock", + table_a = "schema_a.x", + table_b = "schema_b.y", + table_to = "schema_out.z", + distance_max = 100, + col_a_id = "a_id", + col_b_id = "b_id" + ) + }) + expect_match(sqls[[2]], "b\\.b_id AS b_id") +})