Add a heuristic that tells if the standard diff is possible #5

@jcpitre

In #1 we realized that some GTFS datasets use IDs (e.g. shape_id) that are regenerated for every dataset (see #4).
In that case, the first version of the gtfs-diff engine cannot produce a meaningful diff, because rows can no longer be matched by ID between the two versions.

As a stop-gap measure, we should have a heuristic that quickly tells us whether the diff engine can be used on a given pair of datasets.

Copilot Suggestion:

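Do a cheap O(N) pre-flight check that scans only the ID columns of each file (no full row parsing). For every file present in both feeds, compute:

churn = size((base_ids OR new_ids) - (base_ids AND new_ids)) / size(base_ids OR new_ids)

That is, churn is the share of IDs that appear in only one of the two feeds: 0 means the ID sets are identical, 1 means the feeds have no IDs in common.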

If the weighted overall churn across all files reaches a threshold (50% by default, user-configurable), the diff is aborted with a clear error message listing the files with high churn. This prevents the engine from producing a meaningless diff when a publisher has regenerated all IDs between versions.
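
To make the suggestion concrete, here is a minimal Python sketch of such a pre-flight check. It is not the gtfs-diff implementation. The function names (`read_ids`, `churn`, `preflight`, `check_or_abort`), the `ID_COLUMNS` mapping, and the choice to weight each file by the size of its ID union are all assumptions made for illustration; the issue does not specify how the overall churn should be weighted. It also uses the standard `csv` reader for simplicity, which still tokenizes each row even though only one column is kept.

```python
import csv
from pathlib import Path

# Assumed subset of GTFS files and the ID column checked in each one.
ID_COLUMNS = {
    "routes.txt": "route_id",
    "stops.txt": "stop_id",
    "trips.txt": "trip_id",
    "shapes.txt": "shape_id",
}


def read_ids(path, column):
    """Collect the distinct values of a single column from a CSV file."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return set()
        header = [name.strip() for name in header]
        if column not in header:
            return set()
        idx = header.index(column)
        return {row[idx] for row in reader if len(row) > idx}


def churn(base_ids, new_ids):
    """Share of IDs found in only one of the two sets (0.0 if both are empty)."""
    union = base_ids | new_ids
    if not union:
        return 0.0
    return len(base_ids ^ new_ids) / len(union)


def preflight(base_dir, new_dir, threshold=0.5):
    """Return (overall_churn, files whose churn reaches the threshold)."""
    per_file = {}
    changed = total = 0
    for name, column in ID_COLUMNS.items():
        base_path = Path(base_dir) / name
        new_path = Path(new_dir) / name
        if not (base_path.is_file() and new_path.is_file()):
            continue  # only files present in both feeds are compared
        base_ids = read_ids(base_path, column)
        new_ids = read_ids(new_path, column)
        per_file[name] = churn(base_ids, new_ids)
        # Assumed weighting: each file counts in proportion to its ID union.
        changed += len(base_ids ^ new_ids)
        total += len(base_ids | new_ids)
    overall = changed / total if total else 0.0
    high = sorted(name for name, c in per_file.items() if c >= threshold)
    return overall, high


def check_or_abort(base_dir, new_dir, threshold=0.5):
    """Raise ValueError when the overall churn reaches the threshold."""
    overall, high = preflight(base_dir, new_dir, threshold)
    if overall >= threshold:
        raise ValueError(
            f"ID churn is {overall:.0%} (threshold {threshold:.0%}); "
            f"high-churn files: {', '.join(high)}"
        )
```

Calling `check_or_abort("feed_v1/", "feed_v2/")` on two unpacked feed directories (hypothetical paths) would raise before the diff runs if the IDs were regenerated. Weighting by ID count means large files such as trips.txt dominate the overall figure; a per-file average would be another reasonable choice.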
