-
Notifications
You must be signed in to change notification settings - Fork 135
[curate apply-geolocation-rules] treat '?' values akin to empty values #1929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Allows '?' to be used interchangeably with the empty value for the purposes of geographic remappings. See the added tests for real-life occurrences which this commit fixes. In our default rules files there are n=5181 rules where only the final value (location) is empty and these rules can now apply to metadata values which set "location='?'". There are 44 rules with empty division & location and one rule (!) with empty country, division & location.
These didn't affect the test results (mainly as we used dummy rules which never conflicted) but the intention of these tests is to only use the rules defined in the actual test. (The --no-default-rules flag was added in c0575c6)
|
Here's a summary of changes to the seasonal-flu data when using this PR: n=3,131 records changed from having location="?" to now having a new division & location. Some examples (using country/division/location format):
n=9,803 records changed division value only, e.g.
n=757 records changed location value only, e.g.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1929 +/- ##
==========================================
+ Coverage 74.25% 74.27% +0.01%
==========================================
Files 82 82
Lines 9052 9059 +7
Branches 1847 1850 +3
==========================================
+ Hits 6722 6729 +7
Misses 2024 2024
Partials 306 306 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Updating augur to PR <nextstrain/augur#1929> allows our geo-location values with '?' to be appropriately parsed by `augur curate apply-geolocation-rules` Changes for seasonal-flu: n=3,131 records changed from having location="?" to now having a new division & location. Some examples (using country/division/location format): * New Zealand/West Coast/? → New Zealand/Canterbury/West Coast NZ (the coast is certainly not in canterbury, but let's let that slide!) * Norway/Buskerud/? → Norway/Viken/Buskerud n=9,803 records changed division value only, e.g. * Belgium/Flanders/? → Belgium/Vlaanderen/? n=757 records changed location value only, e.g. * Romania/Iasi/? → Romania/Iasi/Iasi
| # We want to match '?' values to the empty string as well to allow them to be used as empty values | ||
| # but if they are used it's expected they are used for all "empty" fields | ||
| annot_using_question_marks = tuple(['?' if x=='' else x for x in annot]) | ||
| if raw[1]=='' and raw[2]=='' and raw[3]=='': | ||
| geolocation_rules[raw[0]]['?']['?']['?'] = annot_using_question_marks | ||
| elif raw[2]=='' and raw[3]=='': | ||
| geolocation_rules[raw[0]][raw[1]]['?']['?'] = annot_using_question_marks | ||
| elif raw[3]=='': | ||
| geolocation_rules[raw[0]][raw[1]][raw[2]]['?'] = annot_using_question_marks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: instead of adding duplicate rule entries for a new null value, add logic for different null values on the data. Something like this: 5588acb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll defer to @joverlee521 here - my heads mostly focused on the downstream fixes this entails rather than the specific way it's fixed. I'm happy to change to your approach as desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @victorlin, I like 5588acb!
We can come back to this at a later date since the seasonal-flu ingest is changing the ? values to empty strings as discussed in nextstrain/seasonal-flu#267 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, easy fix!
The '?' null value meant that our geographic rules weren't applied as intended. For instance, we (Augur) have rules such as: 1. `Asia/Bangladesh/Jashore/` → `Asia/Bangladesh/Khulna/Jashore` 2. `Asia/Bangladesh/Jashore/*` → `Asia/Bangladesh/Khulna/*` and our geographic fixes layer would produce values like `Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore` however the second rule would be run instead resulting in `Asia/Bangladesh/Khulna/?`. Originally I implemented a fix in Augur for this to allow geo-location values with '?' to be appropriately parsed by `augur curate apply-geolocation-rules` <nextstrain/augur#1929> however on review we decided to go the other way and change the seasonal-flu geographic null value to the empty string <#267 (comment)> **Summary of changes to (seasonal-flu) ingested data** 393,688 / 514,895 records changed Value "?" → '' changes (very much as expected!): * n = 4 (country) * n = 102,408 (division) * n = 389,522 (location) Value "?" → valid geographic deme (via the rules doing their thing): * n = 0 (country) * n = 1,925 (division) * n = 3,888 (location) Value change from one deme to another (via the rules doing their thing): * n = 280 (country) * n = 8,904 (division) * n = 0 (location) **Some examples** The example given above (`Asia/Bangladesh/Jashore/?`) results in 38 records having a location "?" → "Jashore" 97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"
The '?' null value meant that our geographic rules weren't applied as intended. For instance, we (Augur) have rules such as: 1. `Asia/Bangladesh/Jashore/` → `Asia/Bangladesh/Khulna/Jashore` 2. `Asia/Bangladesh/Jashore/*` → `Asia/Bangladesh/Khulna/*` and our geographic fixes layer would produce values like `Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore` however the second rule would be run instead resulting in `Asia/Bangladesh/Khulna/?`. Originally I implemented a fix in Augur for this to allow geo-location values with '?' to be appropriately parsed by `augur curate apply-geolocation-rules` <nextstrain/augur#1929> however on review we decided to go the other way and change the seasonal-flu geographic null value to the empty string <#267 (comment)> **Summary of changes to (seasonal-flu) ingested data** 393,688 / 514,895 records changed Value "?" → '' changes (very much as expected!): * n = 4 (country) * n = 102,408 (division) * n = 389,522 (location) Value "?" → valid geographic deme (via the rules doing their thing): * n = 0 (country) * n = 1,925 (division) * n = 3,888 (location) Value change from one deme to another (via the rules doing their thing): * n = 280 (country) * n = 8,904 (division) * n = 0 (location) **Some examples** The example given above (`Asia/Bangladesh/Jashore/?`) results in 38 records having a location "?" → "Jashore" 97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"
Allows '?' to be used interchangeably with the empty value for the purposes of geographic remappings. See the added tests for real-life occurrences which this commit fixes.
In our default rules files there are n=5181 rules where only the final value (location) is empty and these rules can now apply to metadata values which set "location='?'". There are 44 rules with empty division & location and one rule (!) with empty country, division & location.