tests: Add tests for utf-8 encoding for LiteralUTF8Char (#208) by Vikash-Kumar-23 · Pull Request #240 · arxlang/irx

Vikash-Kumar-23 · 2026-03-19T06:47:42Z

Pull Request description
This PR fixes a UnicodeEncodeError in the LLVMLiteIR builder that occurred when lowering LiteralUTF8Char nodes containing non-ASCII characters (e.g., é).

The builder was previously hard-coded to use the "ascii" codec when initializing the bytearray for the global string constant. I have updated this to use "utf-8" to correctly support international characters and symbols.

Fixes #208

How to test these changes
Run a reproduction script (e.g., parsing astx.LiteralUTF8Char("é")) and verify that builder.translate(module) no longer raises a UnicodeEncodeError.

Run the new regression test to verify the generated LLVM IR string contains the correct hex-encoded bytes for UTF-8 (for é, this is \c3\a9):

PowerShell
python -m pytest tests/test_string.py::test_utf8_char_lowering_correctness -v
Expected Output:

Plaintext
======================= test session starts =======================
platform win32 -- Python 3.14.0, pytest-9.0.2, pluggy-1.6.0
rootdir: C:...\irx
configfile: pyproject.toml
plugins: typeguard-4.5.1
collecting ... collected 1 item

tests/test_string.py::test_utf8_char_lowering_correctness PASSED [100%]
======================== 1 passed in 0.15s ========================
Pull Request checklists
This PR is a:

[x] bug-fix

[ ] new feature

[ ] maintenance

About this PR:

[x] it includes tests.

[ ] the tests are executed on CI.

[ ] the tests generate log file(s) (path).

[x] pre-commit hooks were executed locally.

[ ] this PR requires a project documentation update.

Author's checklist:

[x] I have reviewed the changes and it contains no misspelling.

[x] The code is well commented, especially in the parts that contain more complexity.

[x] New and old tests passed locally. (Core IR generation tests passed; binary linking tests skipped due to local environment constraints).

Additional information
Technical Evidence

Before Fix (Crash):
The builder failed when encountering multibyte characters because it was forced to use the ASCII codec.

Plaintext
Attempting to translate LiteralUTF8Char with value: é
Traceback (most recent call last):
File "src/irx/builders/llvmliteir.py", line 1757, in visit_LiteralUTF8Char
string_data_type, bytearray(string_value + "\0", "ascii")
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

After Fix (Successful IR Generation):
The generated IR now correctly encodes the character as a UTF-8 byte sequence:

Code snippet
@"str_ascii_..." = internal constant [3 x i8] c"\c3\a9\00"
The added test test_utf8_char_lowering_correctness verifies this hex sequence directly in the IR string to ensure technical correctness, addressing feedback from previous PR attempts.

Reviewer's checklist
Copy and paste this template for your review's note:

Reviewer's Checklist

I managed to reproduce the problem locally from the main branch
I managed to test the new changes locally
I confirm that the issues mentioned were fixed/resolved .

Jaskirat-s7 · 2026-03-24T08:51:45Z

Hey @Vikash-Kumar-23, nice catch on the "ascii" codec bug!
one small suggestion is that I noticed the utf8_bytes variable is already computed a few lines above (line 1917) but isn't used for the initializer. You could simplify the fix by reusing it:
string_data.initializer = ir.Constant(
string_data_type, bytearray(utf8_bytes + b"\0")
)
This avoids encoding the string twice and also matches the pattern used by the LiteralUTF8String visitor right below (line 1963–1964) and also, the docstring on line 1911 still says "Handle ASCII string literals" — might be worth updating it to something like "Handle UTF-8 character literals" since the visitor is for LiteralUTF8Char.
Test looks solid!

Jaskirat-s7 · 2026-03-24T08:54:31Z

also the CI linter is failing because the docstring in test_utf8_char_lowering_correctness isn't in Douki YAML format. This project requires docstrings to use the title: ... syntax. Change:
"""Verify LiteralUTF8Char correctly lowers to UTF-8 hex in IR."""
to:
"""title: Verify LiteralUTF8Char correctly lowers to UTF-8 hex in IR."""
You can see the pattern in the other tests in the same file (e.g. line 55, 90, etc.).

Jaskirat-s7 · 2026-03-24T08:57:18Z

Hey @yuvimittal, just left some review comments on #240 — caught a double-encoding issue and a Douki docstring format that's causing the linter to fail. Let me know if those were useful findings or if I'm missing something.

Vikash-Kumar-23 · 2026-03-24T16:05:10Z

Thanks for the review! @Jaskirat-s7

Reused utf8_bytes and removed redundant .encode() to avoid double encoding.
Updated initializer to bytearray(utf8_bytes + b"\0"), matching LiteralUTF8String.
Fixed test docstring to Douki YAML format.

All tests pass and ruff is clean. Let me know if you'd like additional UTF-8 edge case coverage.

xmnlab · 2026-03-31T21:49:31Z

@Vikash-Kumar-23 could you rebase your branch on top of the upstream main please?

xmnlab · 2026-03-31T21:49:56Z

also ensure that the CI is all green

Vikash-Kumar-23 · 2026-04-01T02:29:57Z

Hi @xmnlab

Done! I've rebased on top of the latest upstream main and fixed the douki linter issues. The CI should be all green now. Ready for another look!

…e encoding

xmnlab · 2026-04-01T03:39:22Z

thanks @Vikash-Kumar-23

github-actions · 2026-04-02T04:15:17Z

🎉 This PR is included in version 1.9.0 🎉

The release is available on:

1.9.0
GitHub release

Your semantic-release bot 📦🚀

Vikash-Kumar-23 force-pushed the fix-utf8-char-lowering branch from 0772b79 to 09d867b Compare March 25, 2026 00:00

xmnlab marked this pull request as draft March 31, 2026 21:49

Vikash-Kumar-23 force-pushed the fix-utf8-char-lowering branch from ff16348 to 0ae0ba3 Compare April 1, 2026 02:22

Vikash-Kumar-23 added 5 commits April 1, 2026 03:35

fix: use utf-8 encoding for LiteralUTF8Char lowering (#208)

69855c1

chore: fix linting, formatting, and type hints for #208

c0e6559

fix: use utf-8 encoding for LiteralUTF8Char lowering and remove doubl…

21b1cd8

…e encoding

Run douki: sync docs and fix docstrings

52550b2

fix: align LiteralUTF8Char visit docstring for douki

ff44e11

xmnlab force-pushed the fix-utf8-char-lowering branch from 0ae0ba3 to ff44e11 Compare April 1, 2026 03:36

xmnlab changed the title ~~fix: use utf-8 encoding for LiteralUTF8Char lowering (#208)~~ tests: Add tests for utf-8 encoding for LiteralUTF8Char (#208) Apr 1, 2026

xmnlab marked this pull request as ready for review April 1, 2026 03:39

xmnlab merged commit 4ecdb28 into arxlang:main Apr 1, 2026
13 checks passed

github-actions bot added the released label Apr 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tests: Add tests for utf-8 encoding for LiteralUTF8Char (#208)#240

tests: Add tests for utf-8 encoding for LiteralUTF8Char (#208)#240
xmnlab merged 5 commits intoarxlang:mainfrom
Vikash-Kumar-23:fix-utf8-char-lowering

Vikash-Kumar-23 commented Mar 19, 2026 •

edited

Loading

Uh oh!

Jaskirat-s7 commented Mar 24, 2026

Uh oh!

Jaskirat-s7 commented Mar 24, 2026

Uh oh!

Jaskirat-s7 commented Mar 24, 2026

Uh oh!

Vikash-Kumar-23 commented Mar 24, 2026

Uh oh!

xmnlab commented Mar 31, 2026

Uh oh!

xmnlab commented Mar 31, 2026

Uh oh!

Vikash-Kumar-23 commented Apr 1, 2026

Uh oh!

Uh oh!

xmnlab commented Apr 1, 2026

Uh oh!

github-actions bot commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Vikash-Kumar-23 commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Checklist

Uh oh!

Jaskirat-s7 commented Mar 24, 2026

Uh oh!

Jaskirat-s7 commented Mar 24, 2026

Uh oh!

Jaskirat-s7 commented Mar 24, 2026

Uh oh!

Vikash-Kumar-23 commented Mar 24, 2026

Uh oh!

xmnlab commented Mar 31, 2026

Uh oh!

xmnlab commented Mar 31, 2026

Uh oh!

Vikash-Kumar-23 commented Apr 1, 2026

Uh oh!

Uh oh!

xmnlab commented Apr 1, 2026

Uh oh!

github-actions bot commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Vikash-Kumar-23 commented Mar 19, 2026 •

edited

Loading