Skip to content

Conversation

@jameshadfield
Copy link
Member

Allows '?' to be used interchangeably with the empty value for the purposes of geographic remappings. See the added tests for real-life occurrences which this commit fixes.

In our default rules files there are n=5181 rules where only the final value (location) is empty and these rules can now apply to metadata values which set "location='?'". There are 44 rules with empty division & location and one rule (!) with empty country, division & location.

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

Allows '?' to be used interchangeably with the empty value for the
purposes of geographic remappings. See the added tests for real-life
occurrences which this commit fixes.

In our default rules files there are n=5181 rules where only the final
value (location) is empty and these rules can now apply to metadata
values which set "location='?'". There are 44 rules with empty division
& location and one rule (!) with empty country, division & location.
These didn't affect the test results (mainly as we used dummy rules
which never conflicted) but the intention of these tests is to only use
the rules defined in the actual test.

(The --no-default-rules flag was added in c0575c6)
@jameshadfield
Copy link
Member Author

jameshadfield commented Nov 27, 2025

Here's a summary of changes to the seasonal-flu data when using this PR:

n=3,131 records changed from having location="?" to now having a new division & location. Some examples (using country/division/location format):

  • New Zealand/West Coast/? → New Zealand/Canterbury/West Coast NZ (the coast is certainly not in canterbury, but let's let that slide!)
  • Norway/Buskerud/? → Norway/Viken/Buskerud

n=9,803 records changed division value only, e.g.

  • Belgium/Flanders/? → Belgium/Vlaanderen/?

n=757 records changed location value only, e.g.

  • Romania/Iasi/? → Romania/Iasi/Iasi

@codecov
Copy link

codecov bot commented Nov 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.27%. Comparing base (54e93d7) to head (e5fcd69).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1929      +/-   ##
==========================================
+ Coverage   74.25%   74.27%   +0.01%     
==========================================
  Files          82       82              
  Lines        9052     9059       +7     
  Branches     1847     1850       +3     
==========================================
+ Hits         6722     6729       +7     
  Misses       2024     2024              
  Partials      306      306              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jameshadfield added a commit to nextstrain/seasonal-flu that referenced this pull request Nov 27, 2025
Updating augur to PR <nextstrain/augur#1929>
allows our geo-location values with '?' to be appropriately parsed
by `augur curate apply-geolocation-rules`

Changes for seasonal-flu:

n=3,131 records changed from having location="?" to now having a new
division & location. Some examples (using country/division/location format):
 * New Zealand/West Coast/? → New Zealand/Canterbury/West Coast NZ
   (the coast is certainly not in canterbury, but let's let that slide!)
 * Norway/Buskerud/? → Norway/Viken/Buskerud

n=9,803 records changed division value only, e.g.
  * Belgium/Flanders/? → Belgium/Vlaanderen/?

n=757 records changed location value only, e.g.
  * Romania/Iasi/? → Romania/Iasi/Iasi
Comment on lines +76 to +84
# We want to match '?' values to the empty string as well to allow them to be used as empty values
# but if they are used it's expected they are used for all "empty" fields
annot_using_question_marks = tuple(['?' if x=='' else x for x in annot])
if raw[1]=='' and raw[2]=='' and raw[3]=='':
geolocation_rules[raw[0]]['?']['?']['?'] = annot_using_question_marks
elif raw[2]=='' and raw[3]=='':
geolocation_rules[raw[0]][raw[1]]['?']['?'] = annot_using_question_marks
elif raw[3]=='':
geolocation_rules[raw[0]][raw[1]][raw[2]]['?'] = annot_using_question_marks
Copy link
Member

@victorlin victorlin Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: instead of adding duplicate rule entries for a new null value, add logic for different null values on the data. Something like this: 5588acb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer to @joverlee521 here - my heads mostly focused on the downstream fixes this entails rather than the specific way it's fixed. I'm happy to change to your approach as desired.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @victorlin, I like 5588acb!

We can come back to this at a later date since the seasonal-flu ingest is changing the ? values to empty strings as discussed in nextstrain/seasonal-flu#267 (comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, easy fix!

jameshadfield added a commit to nextstrain/seasonal-flu that referenced this pull request Dec 3, 2025
The '?' null value meant that our geographic rules weren't applied as
intended. For instance, we (Augur) have rules such as:

  1. `Asia/Bangladesh/Jashore/` →	`Asia/Bangladesh/Khulna/Jashore`
  2. `Asia/Bangladesh/Jashore/*` →	`Asia/Bangladesh/Khulna/*`

and our geographic fixes layer would produce values like
`Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to
this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore`
however the second rule would be run instead resulting in
`Asia/Bangladesh/Khulna/?`.

Originally I implemented a fix in Augur for this to allow geo-location
values with '?' to be appropriately parsed by `augur curate
apply-geolocation-rules` <nextstrain/augur#1929>
however on review we decided to go the other way and change the
seasonal-flu geographic null value to the empty string
<#267 (comment)>

**Summary of changes to (seasonal-flu) ingested data**

393,688 / 514,895 records changed

Value "?" → '' changes (very much as expected!):
* n = 4 (country)
* n = 102,408 (division)
* n = 389,522 (location)

Value "?" → valid geographic deme (via the rules doing their thing):
* n = 0 (country)
* n = 1,925 (division)
* n = 3,888 (location)

Value change from one deme to another (via the rules doing their thing):
* n = 280 (country)
* n = 8,904 (division)
* n = 0 (location)

**Some examples**

The example given above (`Asia/Bangladesh/Jashore/?`) results in 38
records having a location "?" → "Jashore"

97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"
jameshadfield added a commit to nextstrain/seasonal-flu that referenced this pull request Dec 7, 2025
The '?' null value meant that our geographic rules weren't applied as
intended. For instance, we (Augur) have rules such as:

  1. `Asia/Bangladesh/Jashore/` →	`Asia/Bangladesh/Khulna/Jashore`
  2. `Asia/Bangladesh/Jashore/*` →	`Asia/Bangladesh/Khulna/*`

and our geographic fixes layer would produce values like
`Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to
this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore`
however the second rule would be run instead resulting in
`Asia/Bangladesh/Khulna/?`.

Originally I implemented a fix in Augur for this to allow geo-location
values with '?' to be appropriately parsed by `augur curate
apply-geolocation-rules` <nextstrain/augur#1929>
however on review we decided to go the other way and change the
seasonal-flu geographic null value to the empty string
<#267 (comment)>

**Summary of changes to (seasonal-flu) ingested data**

393,688 / 514,895 records changed

Value "?" → '' changes (very much as expected!):
* n = 4 (country)
* n = 102,408 (division)
* n = 389,522 (location)

Value "?" → valid geographic deme (via the rules doing their thing):
* n = 0 (country)
* n = 1,925 (division)
* n = 3,888 (location)

Value change from one deme to another (via the rules doing their thing):
* n = 280 (country)
* n = 8,904 (division)
* n = 0 (location)

**Some examples**

The example given above (`Asia/Bangladesh/Jashore/?`) results in 38
records having a location "?" → "Jashore"

97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants