Skip to content

tests: Add tests for utf-8 encoding for LiteralUTF8Char (#208)#240

Merged
xmnlab merged 5 commits intoarxlang:mainfrom
Vikash-Kumar-23:fix-utf8-char-lowering
Apr 1, 2026
Merged

tests: Add tests for utf-8 encoding for LiteralUTF8Char (#208)#240
xmnlab merged 5 commits intoarxlang:mainfrom
Vikash-Kumar-23:fix-utf8-char-lowering

Conversation

@Vikash-Kumar-23
Copy link
Copy Markdown
Contributor

@Vikash-Kumar-23 Vikash-Kumar-23 commented Mar 19, 2026

Pull Request description
This PR fixes a UnicodeEncodeError in the LLVMLiteIR builder that occurred when lowering LiteralUTF8Char nodes containing non-ASCII characters (e.g., é).

The builder was previously hard-coded to use the "ascii" codec when initializing the bytearray for the global string constant. I have updated this to use "utf-8" to correctly support international characters and symbols.

Fixes #208

How to test these changes
Run a reproduction script (e.g., parsing astx.LiteralUTF8Char("é")) and verify that builder.translate(module) no longer raises a UnicodeEncodeError.

Run the new regression test to verify the generated LLVM IR string contains the correct hex-encoded bytes for UTF-8 (for é, this is \c3\a9):

PowerShell
python -m pytest tests/test_string.py::test_utf8_char_lowering_correctness -v
Expected Output:

Plaintext
======================= test session starts =======================
platform win32 -- Python 3.14.0, pytest-9.0.2, pluggy-1.6.0
rootdir: C:...\irx
configfile: pyproject.toml
plugins: typeguard-4.5.1
collecting ... collected 1 item

tests/test_string.py::test_utf8_char_lowering_correctness PASSED [100%]
======================== 1 passed in 0.15s ========================
Pull Request checklists
This PR is a:

[x] bug-fix

[ ] new feature

[ ] maintenance

About this PR:

[x] it includes tests.

[ ] the tests are executed on CI.

[ ] the tests generate log file(s) (path).

[x] pre-commit hooks were executed locally.

[ ] this PR requires a project documentation update.

Author's checklist:

[x] I have reviewed the changes and it contains no misspelling.

[x] The code is well commented, especially in the parts that contain more complexity.

[x] New and old tests passed locally. (Core IR generation tests passed; binary linking tests skipped due to local environment constraints).

Additional information
Technical Evidence

Before Fix (Crash):
The builder failed when encountering multibyte characters because it was forced to use the ASCII codec.

Plaintext
Attempting to translate LiteralUTF8Char with value: é
Traceback (most recent call last):
File "src/irx/builders/llvmliteir.py", line 1757, in visit_LiteralUTF8Char
string_data_type, bytearray(string_value + "\0", "ascii")
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

After Fix (Successful IR Generation):
The generated IR now correctly encodes the character as a UTF-8 byte sequence:

Code snippet
@"str_ascii_..." = internal constant [3 x i8] c"\c3\a9\00"
The added test test_utf8_char_lowering_correctness verifies this hex sequence directly in the IR string to ensure technical correctness, addressing feedback from previous PR attempts.

Reviewer's checklist
Copy and paste this template for your review's note:

Reviewer's Checklist

  • I managed to reproduce the problem locally from the main branch
  • I managed to test the new changes locally
  • I confirm that the issues mentioned were fixed/resolved .

@Jaskirat-s7
Copy link
Copy Markdown
Contributor

Hey @Vikash-Kumar-23, nice catch on the "ascii" codec bug!
one small suggestion is that I noticed the utf8_bytes variable is already computed a few lines above (line 1917) but isn't used for the initializer. You could simplify the fix by reusing it:
string_data.initializer = ir.Constant(
string_data_type, bytearray(utf8_bytes + b"\0")
)
This avoids encoding the string twice and also matches the pattern used by the LiteralUTF8String visitor right below (line 1963–1964) and also, the docstring on line 1911 still says "Handle ASCII string literals" — might be worth updating it to something like "Handle UTF-8 character literals" since the visitor is for LiteralUTF8Char.
Test looks solid!

@Jaskirat-s7
Copy link
Copy Markdown
Contributor

also the CI linter is failing because the docstring in test_utf8_char_lowering_correctness isn't in Douki YAML format. This project requires docstrings to use the title: ... syntax. Change:
"""Verify LiteralUTF8Char correctly lowers to UTF-8 hex in IR."""
to:
"""title: Verify LiteralUTF8Char correctly lowers to UTF-8 hex in IR."""
You can see the pattern in the other tests in the same file (e.g. line 55, 90, etc.).

@Jaskirat-s7
Copy link
Copy Markdown
Contributor

Hey @yuvimittal, just left some review comments on #240 — caught a double-encoding issue and a Douki docstring format that's causing the linter to fail. Let me know if those were useful findings or if I'm missing something.

@Vikash-Kumar-23
Copy link
Copy Markdown
Contributor Author

Thanks for the review! @Jaskirat-s7

  • Reused utf8_bytes and removed redundant .encode() to avoid double encoding.
  • Updated initializer to bytearray(utf8_bytes + b"\0"), matching LiteralUTF8String.
  • Fixed test docstring to Douki YAML format.

All tests pass and ruff is clean. Let me know if you'd like additional UTF-8 edge case coverage.

@Vikash-Kumar-23 Vikash-Kumar-23 force-pushed the fix-utf8-char-lowering branch from 0772b79 to 09d867b Compare March 25, 2026 00:00
@xmnlab
Copy link
Copy Markdown
Contributor

xmnlab commented Mar 31, 2026

@Vikash-Kumar-23 could you rebase your branch on top of the upstream main please?

@xmnlab xmnlab marked this pull request as draft March 31, 2026 21:49
@xmnlab
Copy link
Copy Markdown
Contributor

xmnlab commented Mar 31, 2026

also ensure that the CI is all green

@Vikash-Kumar-23 Vikash-Kumar-23 force-pushed the fix-utf8-char-lowering branch from ff16348 to 0ae0ba3 Compare April 1, 2026 02:22
@Vikash-Kumar-23
Copy link
Copy Markdown
Contributor Author

Hi @xmnlab

Done! I've rebased on top of the latest upstream main and fixed the douki linter issues. The CI should be all green now. Ready for another look!

@xmnlab xmnlab force-pushed the fix-utf8-char-lowering branch from 0ae0ba3 to ff44e11 Compare April 1, 2026 03:36
@xmnlab xmnlab changed the title fix: use utf-8 encoding for LiteralUTF8Char lowering (#208) tests: Add tests for utf-8 encoding for LiteralUTF8Char (#208) Apr 1, 2026
@xmnlab xmnlab marked this pull request as ready for review April 1, 2026 03:39
@xmnlab xmnlab merged commit 4ecdb28 into arxlang:main Apr 1, 2026
13 checks passed
@xmnlab
Copy link
Copy Markdown
Contributor

xmnlab commented Apr 1, 2026

thanks @Vikash-Kumar-23

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2026

🎉 This PR is included in version 1.9.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix LiteralUTF8Char lowering for non-ASCII UTF-8 chars

3 participants