Implement ES2025 RegExp pattern modifiers#1520

Closed
andreasrosdal wants to merge 3 commits into
quickjs-ng:masterfrom
nordstjernen-web:claude/regexp-inline-modifiers

@andreasrosdal
Contributor

@andreasrosdal andreasrosdal commented May 31, 2026

Add support for inline flag modifier groups (?ims:...), (?-ims:...), and mixed forms like (?i-m:...) that locally enable or disable the i (ignoreCase), m (multiline), and s (dotAll) flags for a subpattern.

Another Claude-coded improvement; I hope it is good.

Parsing:

  • re_parse_term now recognizes (? followed by i/m/s/- as a modifier group, parses the add flags, an optional '-' and remove flags, then ':'. It validates that only i/m/s are used, no flag is duplicated or appears on both sides, and at least one flag is present (so (?-:...) is a SyntaxError while the empty (?:...) remains the plain non-capturing group). The current i/m/s parser state is saved, the modifier applied, the disjunction parsed recursively, and the state restored afterwards so flags only affect the group.
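
The validation rules above can be sketched outside the engine. This is a minimal Python illustration, not the C code in the PR: `parse_modifiers` and `apply_modifiers` are hypothetical names, and the error messages are made up.

```python
from typing import Optional, Set, Tuple

VALID_FLAGS = "ims"


def parse_modifiers(src: str, pos: int) -> Optional[Tuple[str, str, int]]:
    """Parse the text that follows "(?" starting at src[pos].

    Returns (add_flags, remove_flags, position_after_colon) for a modifier
    group, None when the text is not a modifier group (for example "(?:" or
    "(?="), and raises SyntaxError for a malformed modifier group.
    """
    if pos >= len(src) or src[pos] not in VALID_FLAGS + "-":
        return None  # plain "(?:...)", lookaround, named group, ...
    add, remove = "", ""
    seen_dash = False
    while pos < len(src) and src[pos] != ":":
        c = src[pos]
        if c == "-":
            if seen_dash:
                raise SyntaxError("more than one '-' in modifier group")
            seen_dash = True
        elif c in VALID_FLAGS:
            # A flag may appear only once, on either side of the '-'.
            if c in add or c in remove:
                raise SyntaxError(f"repeated flag {c!r} in modifier group")
            if seen_dash:
                remove += c
            else:
                add += c
        else:
            raise SyntaxError(f"invalid flag {c!r} in modifier group")
        pos += 1
    if pos >= len(src):
        raise SyntaxError("expecting ':' in modifier group")
    if not add and not remove:
        raise SyntaxError("modifier group needs at least one flag")
    return add, remove, pos + 1


def apply_modifiers(state: Set[str], add: str, remove: str) -> Set[str]:
    """Return the flag set in effect inside the group; `state` is untouched,
    so the caller keeps the outer flags for after the closing parenthesis."""
    return (state | set(add)) - set(remove)
```

For example, `parse_modifiers("i-m:abc", 0)` yields `("i", "m", 4)`, and `apply_modifiers({"m"}, "i", "m")` yields `{"i"}`, while the outer `{"m"}` set is left as it was.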

The i and s flags are consumed at parse time (case folding and dot semantics), so toggling the parser state handles them directly.

Multiline (m) is decided at match time in the original engine, so supporting per-group m required moving the decision to parse time. ^ and $ now emit REOP_line_start / REOP_line_end (multiline semantics, matching at any line boundary) when multiline is in effect, or the new REOP_bol / REOP_eol opcodes (absolute string start/end) otherwise. The matcher handles all four unconditionally with no flag check.
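
A small Python sketch of that split, for illustration only. The `Op` members stand in for the four opcodes named above, and `emit_anchor` and `anchor_matches` are hypothetical helpers rather than functions from libregexp.

```python
from enum import Enum, auto


class Op(Enum):
    LINE_START = auto()  # stand-in for REOP_line_start ('^' with multiline)
    LINE_END = auto()    # stand-in for REOP_line_end   ('$' with multiline)
    BOL = auto()         # stand-in for REOP_bol        ('^' without multiline)
    EOL = auto()         # stand-in for REOP_eol        ('$' without multiline)


# ECMAScript line terminators: LF, CR, LS, PS.
LINE_TERMINATORS = "\n\r\u2028\u2029"


def emit_anchor(ch: str, multiline: bool) -> Op:
    """Choose the opcode at parse time from the multiline state in effect."""
    if ch == "^":
        return Op.LINE_START if multiline else Op.BOL
    if ch == "$":
        return Op.LINE_END if multiline else Op.EOL
    raise ValueError(f"not an anchor: {ch!r}")


def anchor_matches(op: Op, text: str, pos: int) -> bool:
    """Match-time check: depends on the opcode only, never on a flag."""
    if op is Op.BOL:
        return pos == 0
    if op is Op.EOL:
        return pos == len(text)
    if op is Op.LINE_START:
        return pos == 0 or text[pos - 1] in LINE_TERMINATORS
    return pos == len(text) or text[pos] in LINE_TERMINATORS  # Op.LINE_END
```

With the text `"a\nb"`, position 2 satisfies `Op.LINE_START` but not `Op.BOL`, which is the difference a `(?m:^)` group inside a non-multiline pattern relies on.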

Because case sensitivity can now differ between a group and the global flag, case folding can no longer be driven by a single match-time flag. Case-insensitive character, range and back reference matches now use dedicated opcodes (char*_ci, range*_ci, back_reference*_ci) that canonicalize the input, while the plain opcodes compare literally. The emitter chooses the variant from the effective ignore_case state, and the bytecode walkers, stack-size computation and dumper handle the new opcodes.
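
To illustrate why the choice has to be made per opcode, here is a toy Python matcher for the pattern `a(?i:b)c`. It is a sketch under assumptions: the tuple "bytecode", `emit_char`, and `match_at` are hypothetical, `canonicalize` approximates case folding with `str.upper()`, and storing the canonicalized operand at compile time is a simplification that may not match what the PR does.

```python
from typing import List, Tuple

Instr = Tuple[str, str]  # (opcode name, operand character)


def canonicalize(ch: str) -> str:
    """Rough stand-in for regexp case canonicalization (assumption:
    single-character uppercase mapping, otherwise leave unchanged)."""
    up = ch.upper()
    return up if len(up) == 1 else ch


def emit_char(ch: str, ignore_case: bool) -> Instr:
    """Pick the opcode variant from the ignore_case state in effect."""
    if ignore_case:
        return ("char_ci", canonicalize(ch))
    return ("char", ch)


def match_at(program: List[Instr], text: str, pos: int = 0) -> bool:
    """Match a straight-line program of char / char_ci instructions."""
    for op, operand in program:
        if pos >= len(text):
            return False
        c = text[pos]
        if op == "char_ci":
            c = canonicalize(c)  # only the _ci variant folds the input
        if c != operand:
            return False
        pos += 1
    return True


# a(?i:b)c : only the middle character is emitted as case-insensitive.
program = [emit_char("a", False), emit_char("b", True), emit_char("c", False)]
```

Here `match_at(program, "aBc")` succeeds while `match_at(program, "Abc")` fails, which a single match-time flag could not express.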

Default behavior for patterns without modifiers and for the global i/m/s flags is unchanged.

Enable the regexp-modifiers test262 feature in test262.conf.

claude added 3 commits May 31, 2026 09:48
https://claude.ai/code/session_01MhkkobYvut7A4oP4w8eV1b

The new opcodes for RegExp pattern modifiers (char*_ci, bol/eol,
back_reference_ci/backward_back_reference_ci, range*_ci) were inserted in
the middle of the opcode list, which renumbered every opcode after them.
lre-test.c builds bytecode from hardcoded opcode byte values (e.g. 0x0C =
REOP_save_start) to exercise the out-of-bounds save-index validation, so
the renumbering made that test's bytecode mean something else and the
assertion aborted.

Move all new opcodes to the end of libregexp-opcode.h so the existing
opcode values stay stable. The only adjacency constraint among the new
opcodes (backward_back_reference_ci must immediately follow
back_reference_ci, used as REOP_back_reference_ci + is_backward_dir) is
preserved. All opcodes are referenced by name elsewhere, so moving them is
otherwise transparent.
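
The effect of the insertion point on opcode values can be shown with a toy list in Python. The opcode names in `BASE` are an abbreviated, hypothetical list and not the real order in libregexp-opcode.h; only the two `*_ci` names come from this change.

```python
from typing import Dict, List

# Hypothetical, abbreviated opcode list; a value is the position in the list.
BASE = ["char", "any", "save_start", "save_end", "match"]
NEW = ["back_reference_ci", "backward_back_reference_ci"]


def number(ops: List[str]) -> Dict[str, int]:
    """Assign opcode values the way a sequential enum would."""
    return {name: value for value, name in enumerate(ops)}


def shifted(old: Dict[str, int], new: Dict[str, int]) -> List[str]:
    """Names of pre-existing opcodes whose numeric value changed."""
    return sorted(name for name in old if old[name] != new[name])


before = number(BASE)
inserted = number(BASE[:1] + NEW + BASE[1:])  # new opcodes in the middle
appended = number(BASE + NEW)                 # new opcodes at the end

print(shifted(before, inserted))  # every opcode after the insertion point
print(shifted(before, appended))  # empty: existing values are unchanged

# The adjacency the matcher relies on (base opcode + is_backward_dir).
assert appended["backward_back_reference_ci"] == appended["back_reference_ci"] + 1
```

Anything that hardcodes byte values, as the test mentioned above does, keeps working only in the appended layout.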

https://claude.ai/code/session_01MhkkobYvut7A4oP4w8eV1b

Two CI failures on this branch:

1. `make codegen` left gen/repl.c dirty. repl.js contains regexp literals
   whose compiled bytecode changed because the regexp opcodes were
   renumbered, but gen/repl.c was never regenerated. Regenerate it so the
   CI clean-tree check passes.

2. Enabling the regexp-modifiers test262 feature surfaced three tests that
   the engine cannot pass:
     - add-ignoreCase-affects-slash-lower-b.js  (\b after U+017F)
     - add-ignoreCase-affects-slash-lower-p.js  (\p{Lu} under i)
     - add-ignoreCase-affects-slash-upper-b.js  (\B between Z and U+017F)
   These are pre-existing limitations: \b/\B (is_word_char) and \p{...}
   character classes do not apply Unicode case folding under ignoreCase,
   and they fail identically with the global /i flag. They are not
   regressions from the modifiers feature, so record them in
   test262_errors.txt as known errors (matching how other known
   limitations are tracked).
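
For context on why U+017F shows up in the \b and \B tests: as I understand the specification, when ignoreCase is combined with Unicode mode, the word-character set also includes characters that are not ASCII word characters but canonicalize to one. The Python sketch below approximates this with `str.casefold()` restricted to single-character results, which is an assumption and not the engine's case-folding routine; `extra_word_chars` is a hypothetical helper.

```python
import string
from typing import List

# [A-Za-z0-9_], the word characters used by \b, \B and \w without folding.
BASIC_WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def extra_word_chars() -> List[int]:
    """Code points outside the basic set whose case-folded form is a single
    basic word character (approximation via Python's full case folding)."""
    extra = []
    for cp in range(0x110000):
        ch = chr(cp)
        if ch in BASIC_WORD_CHARS:
            continue
        folded = ch.casefold()
        if len(folded) == 1 and folded in BASIC_WORD_CHARS:
            extra.append(cp)
    return extra


print([f"U+{cp:04X}" for cp in extra_word_chars()])
```

Under these assumptions the scan reports U+017F (LATIN SMALL LETTER LONG S) and U+212A (KELVIN SIGN), so an is_word_char check that ignores case folding treats both as non-word characters.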

https://claude.ai/code/session_01MhkkobYvut7A4oP4w8eV1b
@saghul
Contributor

saghul commented May 31, 2026

Have you gone through the code yourself?

@andreasrosdalw

andreasrosdalw commented May 31, 2026

I have looked at the code; it was implemented by Claude. It's only a proposal. I hope it will be useful and that you will like it. It seems like a good idea.

Since I use QuickJS in the Nordstjernen.org browser, I am trying to improve the JavaScript engine implementation in several ways to improve the browser. This PR is part of that effort.

Feel free to give it a try, as it's the only current web browser using QuickJS:
https://github.com/nordstjernen-web/nordstjernen

@andreasrosdal
Contributor Author

Closing, since it's overly complicated.
