Remove lossy edge deduplication from integration pipeline #2106
Merged
eKathleenCarter merged 26 commits into main on Mar 20, 2026
- Update get_unioned_edge_schema() unique constraint to include
primary_knowledge_source, allowing multiple edges with the same
(subject, predicate, object) from different knowledge sources
- Change normalize_edges() to exact-row dedup only (dropDuplicates()
without subset), preserving edges that share SPO+PKS but differ in
qualifiers or publications
- Rewrite union_edges() to replace the lossy groupBy().agg(F.first())
with dropDuplicates() on (subject, predicate, object,
primary_knowledge_source), keeping per-source edge attributes intact
- Compute primary_knowledge_sources via a separate non-lossy groupBy
join so each edge row carries cross-source provenance for its SPO
- Add logging to union_edges() for per-source edge counts and dedup delta
- Update test_unify_edges to assert edges from different PKS are
preserved as separate rows with correct primary_knowledge_sources
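The new union_edges() behaviour described in the list above can be sketched roughly as follows. This is a plain-Python stand-in for the Spark operations, not the pipeline code: `union_edges_sketch`, `key_of`, the `example:` identifiers, and the `infores:source-one` / `infores:source-two` names are all hypothetical. Note also that Spark's `dropDuplicates()` on a column subset keeps an arbitrary row per key, whereas this sketch keeps the first row seen.

```python
from collections import defaultdict

SPO = ("subject", "predicate", "object")
SPO_PKS = SPO + ("primary_knowledge_source",)


def key_of(edge, cols):
    """Build a hashable key from the given columns of an edge row."""
    return tuple(edge[c] for c in cols)


def union_edges_sketch(edges):
    # Step 1: dedup on (subject, predicate, object, primary_knowledge_source),
    # keeping the first row seen per key with all of its attributes intact.
    seen, kept = set(), []
    for edge in edges:
        k = key_of(edge, SPO_PKS)
        if k not in seen:
            seen.add(k)
            kept.append(dict(edge))

    # Step 2: a separate group-by over SPO that only collects the sources,
    # so no per-edge attribute is touched by the aggregation.
    sources_by_spo = defaultdict(set)
    for edge in kept:
        sources_by_spo[key_of(edge, SPO)].add(edge["primary_knowledge_source"])

    # Step 3: join the collected sources back onto every edge row.
    for edge in kept:
        edge["primary_knowledge_sources"] = sorted(sources_by_spo[key_of(edge, SPO)])
    return kept


# Hypothetical input: two sources assert the same triple, one row is repeated.
base = {"subject": "example:A", "predicate": "affects", "object": "example:B"}
edges = [
    {**base, "primary_knowledge_source": "infores:source-one", "object_direction_qualifier": "increased"},
    {**base, "primary_knowledge_source": "infores:source-two", "object_direction_qualifier": "decreased"},
    {**base, "primary_knowledge_source": "infores:source-one", "object_direction_qualifier": "increased"},
]

unioned = union_edges_sketch(edges)
dedup_delta = len(edges) - len(unioned)  # rows removed by the dedup step
```

Here both sources keep their own row and qualifier, and each row lists both sources in primary_knowledge_sources.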
only showing the top 25 most changed sources
See below for examples of what the source_edge_propteries contain.
The table below was generated from this BQ query, and Claude was used to format the result into a readable MD table.
…://github.com/everycure-org/matrix into ekcarter/xdata-278-remove-edge-de-duplication
Waiting to review this with EC tomorrow. After the discussion, this will be ready for review.
…ll match the new integration pipeline
…pairs duplicate (source, target) error in CII
The previous pool of 20 drugs/diseases caused ec_indications_list and off_label (both 100 rows) to sample with replacement, allowing duplicate (ec_id, target) pairs with differing on_label/off_label values. Because this PR changes normalize_edges from SPO-key dedup to all-column dedup, these duplicates survived into generate_pairs and caused a pandera unique=["source", "target"] violation.
Confirmed via BigQuery: no duplicate (source, target) pairs exist in the real ec_clinical_trials_edges_normalized, off_label_edges_normalized, or ec_indications_list_edges_normalized tables. The 110-row pool more accurately reflects production data for these sources.
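To illustrate why the dedup change surfaced this failure, here is a minimal plain-Python sketch (not the pipeline or pandera code). The helper names `dedup_all_columns`, `dedup_on_key`, and `duplicate_pairs`, and the `drug:1` / `disease:1` identifiers, are hypothetical; the rows mimic sampled test data in which the same pair appears with differing on_label/off_label values.

```python
from collections import Counter


def dedup_all_columns(rows):
    """Exact-row dedup: a row is dropped only if every column matches."""
    seen, out = set(), []
    for row in rows:
        k = tuple(sorted(row.items()))
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out


def dedup_on_key(rows, key_cols):
    """Key dedup: keep the first row seen per key, whatever the other columns say."""
    seen, out = set(), []
    for row in rows:
        k = tuple(row[c] for c in key_cols)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out


def duplicate_pairs(rows):
    """Return the (source, target) pairs that occur more than once."""
    counts = Counter((r["source"], r["target"]) for r in rows)
    return [pair for pair, n in counts.items() if n > 1]


# Hypothetical sampled rows: the same pair drawn twice with conflicting labels.
rows = [
    {"source": "drug:1", "target": "disease:1", "on_label": True, "off_label": False},
    {"source": "drug:1", "target": "disease:1", "on_label": False, "off_label": True},
    {"source": "drug:1", "target": "disease:1", "on_label": True, "off_label": False},
]

after_exact = dedup_all_columns(rows)                   # conflicting rows both survive
after_key = dedup_on_key(rows, ("source", "target"))    # collapses to one row per pair
```

Under exact-row dedup the conflicting rows both reach any downstream uniqueness check on (source, target); under key dedup they were silently collapsed, which is why the problem in the test data had gone unnoticed.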
JacquesVergine (Collaborator) left a comment:
Looks better, a few small comments and questions
eKathleenCarter commented Mar 18, 2026
JacquesVergine approved these changes Mar 20, 2026
Description of the changes
Removes the lossy edge deduplication from the integration pipeline (XDATA-278) and replaces it with a configurable opt-in filter in the filtering pipeline.
union_edges was collapsing all edges sharing the same (subject, predicate, object) triple into a single row using groupBy().agg(F.first()). This silently dropped object_direction_qualifier, knowledge_level, agent_type, and primary_knowledge_source from all but one source per triple, making it impossible to preserve conflicting evidence (e.g. three edges asserting "increased", "decreased", and null direction for the same TGF-β relationship).
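As a rough illustration of the old behaviour, the sketch below collapses edges per (subject, predicate, object) by keeping the first row seen, which is a plain-Python approximation of groupBy().agg(F.first()). The function name `collapse_first`, the `example:` identifiers, and the source names are hypothetical. In Spark, which row F.first() returns is not guaranteed without an explicit ordering, so the surviving value could differ between runs.

```python
def collapse_first(edges):
    """Keep one row per (subject, predicate, object): the first one seen."""
    merged = {}
    for edge in edges:
        key = (edge["subject"], edge["predicate"], edge["object"])
        merged.setdefault(key, dict(edge))
    return list(merged.values())


# Hypothetical edges: three sources disagree on direction for the same triple.
triple = {"subject": "example:TGFB", "predicate": "affects", "object": "example:target"}
edges = [
    {**triple, "primary_knowledge_source": "infores:source-one", "object_direction_qualifier": "increased"},
    {**triple, "primary_knowledge_source": "infores:source-two", "object_direction_qualifier": "decreased"},
    {**triple, "primary_knowledge_source": "infores:source-three", "object_direction_qualifier": None},
]

collapsed = collapse_first(edges)

# Directions present before and after the collapse.
directions_before = {e["object_direction_qualifier"] for e in edges}
directions_after = {e["object_direction_qualifier"] for e in collapsed}
```

Only one of the three assertions survives, and nothing in the output indicates that the other two ever existed.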
Changes
Integration pipeline
Filtering pipeline
Fixes / Resolves the following issues:
Checklist:
- …enhancement or bug)
- …pulling in latest main, uncomment the below "Merge Notification" section and describe steps necessary for people
- kedro run -e sample -p test_sample (see sample environment guide)