Skip to content

Conversation

@cmyr
Copy link
Member

@cmyr cmyr commented Dec 9, 2025

This adds a little util binary that checks that python and rust are computing the same set of unicode script extensions for the full set of codepoints, and then adds a CI step to run that.

This job will fail if ever we are running different versions of unicode.

@cmyr cmyr force-pushed the ci-for-unicode-script-property branch from 4e1ee0b to 3492880 Compare December 9, 2025 16:51
- uses: getsentry/action-setup-venv@v1.0.0
id: venv
with:
python-version: 3.13.9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we seem to have multiple inconsistent python-version in this file. Can we use the same one? Do we have to be so specific or can we say e.g. 3.13 and leave .x to setup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will open a separate issue to update the fea-rs test code/tests more generally.

scripts = list(unicodeScriptExtensions(cp))
print(f"{{cp}}: {{','.join(scripts)}}")
"#,
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an advantage to doing this rather than having a .py file that can be run independently if one is so inclined?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, perhaps include_str a file in resources/scripts ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was the simplest approach I could think of, felt nice to just have a single file here. Can you think of a reason you might want to run this independently? It's easy enough to split out if a reason arises...

)
}

let stdout = String::from_utf8_lossy(&output.stdout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest from_utf8.expect, it shouldn't be bad utf8 but if it is we shouldn't ignore it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call!

// Test ALL valid Unicode codepoints (0x0000 to 0x10FFFF, excluding surrogates)
//the surrogates are 0xD800..=0xDFFF
let rangea = 0x0..0xD7FF;
let rangeb = 0xF000..0x10FFFF;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels harder to follow than 0x0..0x10FFFF.skip(if surrogate)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is about splitting up the work into chunks to send to python, and we only want to send the start/end bound, so we don't need to write every codepoint through stdout.

Copy link
Contributor

@rsheeter rsheeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a couple of suggestions

@cmyr cmyr force-pushed the ci-for-unicode-script-property branch 2 times, most recently from 4507cac to 72d327d Compare December 9, 2025 18:24
id: venv
with:
python-version: 3.13
cache-dependency-path: resour/scripts/requirements.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intentional or a typo?

Suggested change
cache-dependency-path: resour/scripts/requirements.txt
cache-dependency-path: resources/scripts/requirements.txt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

cmyr added 2 commits December 12, 2025 10:55
This will help us be confident that we aren't failing anything as a
result of a mismatch in our querying of these properties.

This currently just tests unicode script extensions.
This should mean that we will fail in CI if our unicode versions don't
match.
@cmyr cmyr force-pushed the ci-for-unicode-script-property branch from 72d327d to ab7c56b Compare December 12, 2025 15:56
@cmyr cmyr added this pull request to the merge queue Dec 12, 2025
Merged via the queue into main with commit 80461fa Dec 12, 2025
13 checks passed
@cmyr cmyr deleted the ci-for-unicode-script-property branch December 12, 2025 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants