spike: Tackle the problem of generated ids #4

@jcpitre

In #1 (comment) we realized that some providers generate new ids for each version of a dataset.
When we think about it, it makes sense that ids are only meaningful within the dataset itself. We are not guaranteed that they will make sense in the context of another dataset, even of the same feed.

We looked at mdb-2014 and saw that the shape_ids in shapes.txt are regenerated for each dataset, e.g. from mdb-2014-202603110034:

shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,

But for mdb-2014-202603090029 the "same" line is:

shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,

We need to establish how prevalent this is in the feeds we host,
and find a way to evaluate the diff that takes these generated ids into account.
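
As a starting point for measuring prevalence, here is a minimal sketch that computes how many ids survive between two versions of the same file. The function names (`id_set`, `id_overlap`) are hypothetical and not part of gtfs-diff-engine. It assumes each file fits in memory as text. An overlap near 0 for a given id column would suggest the ids are regenerated.

```python
import csv
import io


def id_set(csv_text, id_column):
    """Collect the distinct values of one id column from a GTFS file's text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column] for row in reader}


def id_overlap(old_text, new_text, id_column):
    """Jaccard overlap of ids between two versions.

    1.0 means both versions use exactly the same ids,
    0.0 means no id from the old version survives in the new one.
    """
    old_ids = id_set(old_text, id_column)
    new_ids = id_set(new_text, id_column)
    union = old_ids | new_ids
    if not union:
        return 1.0  # both files empty: nothing has churned
    return len(old_ids & new_ids) / len(union)


# The two sample rows quoted above, one per dataset version.
HEADER = "shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled\n"
old_shapes = HEADER + "3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,\n"
new_shapes = HEADER + "0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,\n"

overlap = id_overlap(old_shapes, new_shapes, "shape_id")  # 0.0 for this sample
```

Running this per file and per id column across consecutive datasets of each feed would give a rough churn figure per feed, though a low overlap alone cannot distinguish regenerated ids from a genuinely rewritten file.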

Here is Copilot's take on it:

A rough breakdown by field

| Field | Stability | Reason |
| --- | --- | --- |
| shape_id | 🔴 Very unstable | Almost universally regenerated on export by scheduling tools |
| trip_id | 🟠 Often unstable | HASTUS, Trapeze, and other tools generate these internally |
| service_id | 🟠 Often unstable | Date-based generation is common (e.g. 20240115_WD) |
| block_id | 🟠 Often unstable | Operational scheduling artifact |
| route_id | 🟡 Moderately stable | Often matches public route numbers, but not always |
| fare_id | 🟡 Moderately stable | Fare structures change slowly |
| stop_id | 🟢 Usually stable | Stops are physical; agencies maintain these |
| agency_id | 🟢 Very stable | Rarely changes |

So realistically stop_id and agency_id are the only keys you can reliably trust. Everything else is potentially a surrogate.

This means for a large fraction of feeds, the diff engine's output for shapes.txt, trips.txt, and calendar.txt is likely dominated by key churn rather than real changes, which significantly undermines the value of the tool for those files without content-based matching.

This is arguably the most important design problem for gtfs-diff-engine to solve.
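
One possible form of content-based matching for shapes.txt is sketched below. It fingerprints each shape by its ordered points while ignoring shape_id, then pairs old and new ids that share a fingerprint. The function names (`shape_fingerprints`, `match_shapes`) are hypothetical. The sketch compares coordinates as exact text, so it assumes providers do not reformat numbers between versions, and it leaves ambiguous cases (several shapes with identical geometry) unmatched.

```python
import csv
import hashlib
import io
from collections import defaultdict


def shape_fingerprints(shapes_text):
    """Map each shape_id to a hash of its ordered points, ignoring the id itself."""
    points = defaultdict(list)
    for row in csv.DictReader(io.StringIO(shapes_text)):
        points[row["shape_id"]].append(
            (int(row["shape_pt_sequence"]), row["shape_pt_lat"], row["shape_pt_lon"])
        )
    fingerprints = {}
    for shape_id, pts in points.items():
        digest = hashlib.sha256()
        for seq, lat, lon in sorted(pts):  # order by shape_pt_sequence
            digest.update(f"{seq},{lat},{lon}\n".encode())
        fingerprints[shape_id] = digest.hexdigest()
    return fingerprints


def match_shapes(old_text, new_text):
    """Return {old_shape_id: new_shape_id} for shapes with identical points.

    Only fingerprints that occur exactly once in each version are paired.
    """
    old_by_fp = defaultdict(list)
    for shape_id, fp in shape_fingerprints(old_text).items():
        old_by_fp[fp].append(shape_id)
    new_by_fp = defaultdict(list)
    for shape_id, fp in shape_fingerprints(new_text).items():
        new_by_fp[fp].append(shape_id)

    matches = {}
    for fp, old_ids in old_by_fp.items():
        new_ids = new_by_fp.get(fp, [])
        if len(old_ids) == 1 and len(new_ids) == 1:
            matches[old_ids[0]] = new_ids[0]
    return matches


# The two sample rows from the issue, one per dataset version.
HEADER = "shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled\n"
old_shapes = HEADER + "3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,\n"
new_shapes = HEADER + "0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,\n"

id_map = match_shapes(old_shapes, new_shapes)
# id_map pairs 3d88c650-... with 0000abe0-..., since the point data is identical
```

With such a mapping, the diff could rewrite old ids to their new equivalents before comparing rows, so that only shapes whose points actually changed show up. Whether the same approach carries over to trip_id and service_id is an open question, since their content depends on other ids that may also be regenerated.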
