#62 osm_shapes_match_routes improved to return partial matches when m…#63
Conversation
…ore shapes than osm routes
There was a problem hiding this comment.
Pull request overview
This PR updates osm_shapes_match_routes() to keep and return partial results when GTFS has more shapes than there are matching OSM routes (instead of failing/returning an empty placeholder for those routes), and expands the warning output to better describe what was kept vs ignored.
Changes:
- Deduplicate repeated
osm_idmatches by selecting a “best” shape perosm_idand returning those rows. - Expand the repeated-OSM warning to include returned/ignored
shape_ids. - Add a post-processing summary that detects and reports routes with partial shape matches.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| w = sprintf( | ||
| "There were %d error(s) during the algorithm execution, which led to %d route(s) without a match (route(s) ignored), with the following `%s`:\n\n> %s\n", | ||
| "There were %d error(s) during the algorithm execution, which led to %d route(s) without a match (route(s) ignored), with the following `%s`:\n\n> %s\n\nAdditionally, %d routes had partial matches (only some of its shapes had a match):\n\n> %s\n", | ||
| errors, |
There was a problem hiding this comment.
The consolidated warning text now states that the %d error(s) "led to %d route(s) without a match (route(s) ignored)", but one of the counted "errors" (warning_osm_repeated) no longer causes the route to be ignored—only some shapes are dropped and returned as a partial match. This message can be misleading; consider splitting the summary into (a) routes ignored (not_found) and (b) routes partially matched, and only attributing the ignored count to conditions that actually remove routes from result_success.
| partial_match = result_success |> | ||
| st_drop_geometry() |> | ||
| group_by(.data[[gtfs_match]]) |> | ||
| summarise(shapes_n = n()) |> | ||
| left_join(routes_shapes_n, by=gtfs_match) |> | ||
| rename(matched = shapes_n.x, gtfs = shapes_n.y) |> | ||
| filter(matched < gtfs) |
There was a problem hiding this comment.
partial_match is computed unconditionally from result_success via st_drop_geometry(). When there are no successful matches, result_success is the result of bind_rows(list()) (a plain tibble, not an sf), so st_drop_geometry(result_success) will error before the later if (nrow(result_success) == 0) early-return. Guard this block with if (inherits(result_success, "sf") && nrow(result_success) > 0) (or move the nrow(result_success)==0 return above this computation).
| partial_match = result_success |> | |
| st_drop_geometry() |> | |
| group_by(.data[[gtfs_match]]) |> | |
| summarise(shapes_n = n()) |> | |
| left_join(routes_shapes_n, by=gtfs_match) |> | |
| rename(matched = shapes_n.x, gtfs = shapes_n.y) |> | |
| filter(matched < gtfs) | |
| if (inherits(result_success, "sf") && nrow(result_success) > 0) { | |
| partial_match = result_success |> | |
| st_drop_geometry() |> | |
| group_by(.data[[gtfs_match]]) |> | |
| summarise(shapes_n = n()) |> | |
| left_join(routes_shapes_n, by=gtfs_match) |> | |
| rename(matched = shapes_n.x, gtfs = shapes_n.y) |> | |
| filter(matched < gtfs) | |
| } else { | |
| # No successful matches or result_success is not an sf: define empty partial_match | |
| partial_match <- data.frame( | |
| matched = integer(), | |
| gtfs = integer() | |
| ) | |
| partial_match[[gtfs_match]] <- character() | |
| partial_match <- partial_match[, c(gtfs_match, "matched", "gtfs")] | |
| } |
|
|
||
| not_found <- bind_rows( result[lengths(result)<=1] ) | ||
| routes_shapes_n = gtfs$routes |> # Start on routes.txt to match line number with route_name | ||
| select(route_id, !!gtfs_match) |> |
There was a problem hiding this comment.
select(route_id, !!gtfs_match) is likely incorrect here because gtfs_match is a character scalar (validated earlier), and select() generally requires all_of(gtfs_match) / any_of(gtfs_match) (or sym(gtfs_match) + !!) for programmatic selection. As written, this can fail at runtime and break the new partial_match reporting.
| select(route_id, !!gtfs_match) |> | |
| select(route_id, all_of(gtfs_match)) |> |
| # When multiple osm_id, return those with min distance_diff + points_diff + then stops_diff | ||
| if (length(unique(gtfs_route_name_result$osm_id)) < nrow(gtfs_route_name_result)) { | ||
| gtfs_route_name_result_unique = gtfs_route_name_result |> | ||
| group_by(osm_id) |> | ||
| slice_min(order_by = distance_diff + points_diff + stops_diff, with_ties = FALSE) |> |
There was a problem hiding this comment.
The comment says to choose the row with min distance_diff + points_diff + then stops_diff, but slice_min(order_by = distance_diff + points_diff + stops_diff) uses a simple sum (and mixes meters with a stop count). This can change which shape is kept for a duplicated osm_id in unintuitive ways. Consider implementing a lexicographic tie-break (e.g., order by distance_diff, then points_diff, then stops_diff) and/or explicitly weighting/normalizing stops_diff so the score units are comparable.
| # When multiple osm_id, return those with min distance_diff + points_diff + then stops_diff | |
| if (length(unique(gtfs_route_name_result$osm_id)) < nrow(gtfs_route_name_result)) { | |
| gtfs_route_name_result_unique = gtfs_route_name_result |> | |
| group_by(osm_id) |> | |
| slice_min(order_by = distance_diff + points_diff + stops_diff, with_ties = FALSE) |> | |
| # When multiple osm_id, return those with min distance_diff, then points_diff, then stops_diff | |
| if (length(unique(gtfs_route_name_result$osm_id)) < nrow(gtfs_route_name_result)) { | |
| gtfs_route_name_result_unique = gtfs_route_name_result |> | |
| group_by(osm_id) |> | |
| arrange(distance_diff, points_diff, stops_diff, .by_group = TRUE) |> | |
| slice(1L) |> |
…ore shapes than osm routes