tests: Add tests for utf-8 encoding for LiteralUTF8Char (#208) #240
xmnlab merged 5 commits into arxlang:main
Conversation
Hey @Vikash-Kumar-23, nice catch on the "ascii" codec bug!
Also, the CI linter is failing because the docstring in test_utf8_char_lowering_correctness isn't in Douki YAML format. This project requires docstrings to use the title: ... syntax. Change:
Hey @yuvimittal, I just left some review comments on #240: I caught a double-encoding issue and a Douki docstring format problem that's causing the linter to fail. Let me know if those were useful findings or if I'm missing something.
Thanks for the review, @Jaskirat-s7!
All tests pass and
0772b79 to 09d867b
@Vikash-Kumar-23 could you please rebase your branch on top of upstream main?
Also, please ensure that the CI is all green.
ff16348 to 0ae0ba3
Hi @xmnlab, done! I've rebased on top of the latest upstream main and fixed the Douki linter issues. The CI should be all green now. Ready for another look!
0ae0ba3 to ff44e11
Thanks, @Vikash-Kumar-23
🎉 This PR is included in version 1.9.0 🎉 The release is available on:
Your semantic-release bot 📦🚀
Pull Request description
This PR fixes a UnicodeEncodeError in the LLVMLiteIR builder that occurred when lowering LiteralUTF8Char nodes containing non-ASCII characters (e.g., é).
The builder was previously hard-coded to use the "ascii" codec when initializing the bytearray for the global string constant. I have updated this to use "utf-8" to correctly support international characters and symbols.
Fixes #208
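For reviewers who want to see the codec difference in isolation, here is a minimal sketch in plain Python, independent of irx and llvmlite. The helper name encode_literal is hypothetical and only mirrors the bytearray(string_value + "\0", codec) expression quoted in the traceback later in this description; it is not the builder's actual code.

```python
def encode_literal(value: str, codec: str) -> bytearray:
    """Encode a string plus a trailing NUL byte (hypothetical helper)."""
    return bytearray(value + "\0", codec)


# Old behavior: the "ascii" codec rejects any code point above 127.
try:
    encode_literal("\u00e9", "ascii")  # "\u00e9" is é
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# New behavior: "utf-8" emits two bytes for é, then the NUL terminator.
utf8_bytes = encode_literal("\u00e9", "utf-8")

# ASCII-only input produces identical bytes under both codecs.
same_for_ascii = encode_literal("A", "ascii") == encode_literal("A", "utf-8")

print(ascii_failed, list(utf8_bytes), same_for_ascii)
# prints: True [195, 169, 0] True
```

Since UTF-8 is a superset of ASCII, switching the codec should leave the bytes emitted for existing ASCII-only literals unchanged.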
How to test these changes
Run a reproduction script (e.g., parsing astx.LiteralUTF8Char("é")) and verify that builder.translate(module) no longer raises a UnicodeEncodeError.
Run the new regression test to verify the generated LLVM IR string contains the correct hex-encoded bytes for UTF-8 (for é, this is \c3\a9):
PowerShell
python -m pytest tests/test_string.py::test_utf8_char_lowering_correctness -v
Expected Output:
Plaintext
======================= test session starts =======================
platform win32 -- Python 3.14.0, pytest-9.0.2, pluggy-1.6.0
rootdir: C:...\irx
configfile: pyproject.toml
plugins: typeguard-4.5.1
collecting ... collected 1 item
tests/test_string.py::test_utf8_char_lowering_correctness PASSED [100%]
======================== 1 passed in 0.15s ========================
Pull Request checklists
This PR is a:
[x] bug-fix
[ ] new feature
[ ] maintenance
About this PR:
[x] it includes tests.
[ ] the tests are executed on CI.
[ ] the tests generate log file(s) (path).
[x] pre-commit hooks were executed locally.
[ ] this PR requires a project documentation update.
Author's checklist:
[x] I have reviewed the changes and it contains no misspelling.
[x] The code is well commented, especially in the parts that contain more complexity.
[x] New and old tests passed locally. (Core IR generation tests passed; binary linking tests skipped due to local environment constraints).
Additional information
Technical Evidence
Before Fix (Crash):
The builder failed when encountering multibyte characters because it was forced to use the ASCII codec.
Plaintext
Attempting to translate LiteralUTF8Char with value: é
Traceback (most recent call last):
  File "src/irx/builders/llvmliteir.py", line 1757, in visit_LiteralUTF8Char
    string_data_type, bytearray(string_value + "\0", "ascii")
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
After Fix (Successful IR Generation):
The generated IR now correctly encodes the character as a UTF-8 byte sequence:
Code snippet
@"str_ascii_..." = internal constant [3 x i8] c"\c3\a9\00"
The added test test_utf8_char_lowering_correctness verifies this hex sequence directly in the IR string to ensure technical correctness, addressing feedback from previous PR attempts.
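To make the expected [3 x i8] length and the \c3\a9\00 payload easier to check by hand, the following sketch derives both from the UTF-8 bytes. The function llvm_c_string is a hypothetical illustration written for this note; it imitates the escaped form shown in the IR above and is not llvmlite's own formatting routine.

```python
def llvm_c_string(value: str) -> tuple[int, str]:
    """Return (array length, escaped payload) for a NUL-terminated literal."""
    data = (value + "\0").encode("utf-8")
    parts = []
    for byte in data:
        ch = chr(byte)
        # Assumption: printable ASCII (except quote and backslash) is kept
        # as-is, and every other byte becomes a two-digit lowercase hex escape.
        if 32 <= byte < 127 and ch not in '"\\':
            parts.append(ch)
        else:
            parts.append(f"\\{byte:02x}")
    return len(data), "".join(parts)


length, payload = llvm_c_string("\u00e9")  # "\u00e9" is é
print(f'[{length} x i8] c"{payload}"')
# prints: [3 x i8] c"\c3\a9\00"
```

The length is 3 because é occupies two bytes in UTF-8 (0xC3 0xA9) and the terminator adds one more.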
Reviewer's checklist
Copy and paste this template for your review's note:
Reviewer's Checklist