Add a heuristic that tells if the standard diff is possible

In #1 we realized that some GTFS datasets use IDs (e.g. `shape_id`) that are regenerated for every dataset (see #4).
In that case the first version of the gtfs-diff engine cannot produce a meaningful diff, because almost every record looks like it was deleted and re-added.

As a stop-gap measure, we should have some kind of heuristic that quickly tells us if the diff engine can be used on a given dataset. 

Copilot Suggestion:

> Do a cheap O(N) pre-flight check that scans only the id columns of each file (no full row parsing). For every file present in both feeds, we compute:
> 
>  churn = size((base_ids OR new_ids) - (base_ids AND new_ids)) / size(base_ids OR new_ids)
> 
> If the weighted overall churn across all files reaches a threshold (50% by default, user-configurable), the diff is aborted with a clear error message listing which files have high churn. This prevents the engine from producing a meaningless diff when a publisher has fully regenerated all IDs between versions.
> 
> 
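
To make the suggestion concrete, here is a minimal Python sketch of the churn check. It is not part of gtfs-diff: the names `ID_COLUMNS`, `read_ids`, `churn` and `preflight` are hypothetical, feeds are passed as in-memory `{file name: CSV text}` dicts to keep the example self-contained, and "weighted" is assumed to mean weighting each file by the size of its ID union, which the suggestion does not specify. For simplicity it uses `csv.DictReader`, which parses full rows; a real implementation would probably read only the ID column.

```python
import csv
import io

# Hypothetical mapping of GTFS file name -> ID column to scan; extend as needed.
ID_COLUMNS = {
    "routes.txt": "route_id",
    "trips.txt": "trip_id",
    "stops.txt": "stop_id",
    "shapes.txt": "shape_id",
}


def read_ids(text, column):
    """Collect the distinct non-empty values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[column] for row in reader if row.get(column)}


def churn(base_ids, new_ids):
    """Share of IDs present in only one of the two sets (0.0 to 1.0)."""
    union = base_ids | new_ids
    if not union:
        return 0.0
    return len(base_ids ^ new_ids) / len(union)


def preflight(base_feed, new_feed, threshold=0.5):
    """Return (ok, overall_churn, per_file_churn) for two feeds.

    Only files present in both feeds are compared. The overall churn is
    weighted by the number of distinct IDs per file (an assumption).
    """
    per_file = {}
    changed = 0
    total = 0
    for name, column in ID_COLUMNS.items():
        if name not in base_feed or name not in new_feed:
            continue
        base_ids = read_ids(base_feed[name], column)
        new_ids = read_ids(new_feed[name], column)
        per_file[name] = churn(base_ids, new_ids)
        changed += len(base_ids ^ new_ids)
        total += len(base_ids | new_ids)
    overall = changed / total if total else 0.0
    # The diff is considered unusable once churn reaches the threshold.
    return overall < threshold, overall, per_file
```

A caller could abort when `ok` is false and list the files in `per_file` whose churn is at or above the threshold in the error message.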

Add a heuristic that tells if the standard diff is possible #5