Skip to content

feat: add hex_alternation, base64_2pad, identifier, space_separated_charset, uri_scheme native ops#38

Merged
alltheseas merged 4 commits into
mainfrom
feat/hex-alternation-base64-2pad
Apr 2, 2026
Merged

feat: add hex_alternation, base64_2pad, identifier, space_separated_charset, uri_scheme native ops#38
alltheseas merged 4 commits into
mainfrom
feat/hex-alternation-base64-2pad

Conversation

@alltheseas
Copy link
Copy Markdown
Contributor

@alltheseas alltheseas commented Apr 1, 2026

Summary

  • Add 5 new PatternCheck ops to eliminate all remaining regex fallback patterns from schemata v0.3.2
  • All 12 language emitters updated with renderPatternCheck cases and helper implementations
  • Classifier coverage: 90/90 patterns native, zero regex fallbacks

Commit 1: hex_alternation + base64_2pad (closes #37)

  • hex_alternation: multi-length hex alternation (e.g. ^(?:[a-f0-9]{64}|[a-f0-9]{96}|[a-f0-9]{128})$)
  • base64_2pad: strict base64 with mandatory == padding

Commit 2: identifier + space_separated_charset + uri_scheme (closes #40)

Pattern Op
^[a-z][a-z0-9]*$ identifier
^[A-Z][a-zA-Z0-9]*$ identifier
^[a-z][a-z0-9-]*$ identifier
^!?[a-z][a-z0-9]*$ identifier
^!?[0-9]+$ identifier
^[a-z_]+( [a-z_]+)*$ space_separated_charset
^[A-Za-z][A-Za-z0-9+.-]*:// uri_scheme

Test plan

  • npm run build compiles cleanly
  • npm test — 927 tests pass, 0 failures
  • --all code generation works for all 13 languages
  • Fuzz-equivalence tests verify native checkers match regex behavior
  • Classifier coverage: 90/90 native, 0 regex fallbacks

Closes #37, closes #40

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Recognizes five new pattern types (hex alternation, strict base64 with "==", identifier forms, space-separated char runs, and URI scheme prefixes) and emits optimized native validators instead of generic regex fallbacks across all target languages.
  • Tests

    • Added unit and fuzz tests covering the new classifications and native-check behavior to ensure correctness and broad coverage.

Add two new PatternCheck ops for MIP-00/MIP-05 schema patterns that
previously fell back to regex:

- hex_alternation: multi-length hex string (e.g. 64|96|128 chars)
- base64_2pad: strict base64 with mandatory == padding

Both ops are implemented in all 11 non-TS emitters with matching
helper functions. Fuzz-equivalence tests verify correctness against
the actual regex patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed5ef1cc-6b0d-4291-a0ef-fcfe2bf2e88c

📥 Commits

Reviewing files that changed from the base of the PR and between 095d0ef and a2384ec.

📒 Files selected for processing (9)
  • src/classify-pattern.ts
  • src/emit-cpp.ts
  • src/emit-csharp.ts
  • src/emit-dart.ts
  • src/emit-java.ts
  • src/emit-php.ts
  • src/emit-python.ts
  • src/emit-ruby.ts
  • src/emit-rust.ts
✅ Files skipped from review due to trivial changes (1)
  • src/emit-cpp.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/classify-pattern.ts
  • src/emit-rust.ts
  • src/emit-ruby.ts
  • src/emit-csharp.ts

📝 Walkthrough

Walkthrough

The PR extends the pattern classifier with five new native PatternCheck ops—hex_alternation, base64_2pad, identifier, space_separated_charset, and uri_scheme—updates isNativeCheck accordingly, adds rendering and helper-generation for these ops across all language emitters, and expands tests and fuzz harness to exercise the new checks.

Changes

Cohort / File(s) Summary
Pattern Classification
src/classify-pattern.ts
Added five new PatternCheck union members; updated classifyRegex to detect multi-length hex alternations, strict base64 ending ==, identifier forms with optional prefix and separate first/rest charsets, space-separated single-charset tokens, and URI-scheme prefix; marked new ops native in isNativeCheck.
Emitters — Codegen (core)
src/emit-c.ts, src/emit-cpp.ts, src/emit-csharp.ts, src/emit-dart.ts, src/emit-go.ts, src/emit-java.ts, src/emit-kotlin.ts, src/emit-php.ts, src/emit-python.ts, src/emit-ruby.ts, src/emit-rust.ts, src/emit-swift.ts
Added rendering branches for the five new PatternCheck.op values across all language emitters; each emitter now emits calls for hex_alternation, base64_2pad, identifier, space_separated_charset, and uri_scheme, and registers corresponding helper names.
Emitters — Helper generation
src/emit-*.ts (helpers across languages)
Extended helper emission in each language to conditionally generate validators: strict base64 *_base64_2pad, identifier check, space-separated charset check, URI-scheme check, and per-length hex helpers (lower/mixed); helpers implement length, padding, charset, optional-prefix, and separator logic.
Tests — Classification & Fuzz
tests/classify-pattern.test.ts, tests/fuzz-equivalence.test.ts
Added unit tests asserting classification to the new ops and that they are native; extended fuzz-equivalence harness with native check implementations, input generators, and deterministic test cases for hex alternation and base64-2pad.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I hopped through regex fields at dawn,
Found hex and base64 and tokens drawn.
From scheme to name, five patterns bloom,
Across twelve lands they shed the regex gloom.
🥕 Cheers — native checks all snug and warm!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 51.72% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding five new native pattern check operations (hex_alternation, base64_2pad, identifier, space_separated_charset, uri_scheme) to the codebase.
Linked Issues check ✅ Passed The PR comprehensively addresses all coding requirements from linked issues #37 and #40: implements hex_alternation and base64_2pad classifiers and emitters [#37], and adds identifier, space_separated_charset, and uri_scheme ops with full classifier and emitter coverage across all 12 languages [#40].
Out of Scope Changes check ✅ Passed All changes are directly in scope: classifier updates, pattern check type extensions, emitter implementations for all languages, and corresponding test coverage—no unrelated refactoring or scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/hex-alternation-base64-2pad

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…40)

Add 3 new PatternCheck ops to eliminate the remaining 7 regex fallbacks
introduced in schemata v0.3.2:

- identifier: ^[prefix]?[firstCharset][restCharset]*$ (5 patterns)
- space_separated_charset: ^[charset]+( [charset]+)*$ (1 pattern)
- uri_scheme: ^[A-Za-z][A-Za-z0-9+.-]*:// (1 pattern)

All 12 language emitters updated with renderPatternCheck cases and
helper implementations. Classifier coverage: 90/90 patterns native,
zero regex fallbacks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alltheseas alltheseas changed the title feat: add hex_alternation and base64_2pad native ops feat: add hex_alternation, base64_2pad, identifier, space_separated_charset, uri_scheme native ops Apr 2, 2026
…ates per-branch case

P1: Ruby check_identifier, check_space_separated_charset, check_uri_scheme
were emitted as instance methods (`def check_*`) but called from module
singleton context. Changed to `def self.check_*` with correct 2/4-space
indentation matching all existing helpers.

P2: hex_alternation classifier assumed all alternation branches shared the
same case policy. Now checks each branch individually and falls through to
regex if policies differ, preventing silent case-widening.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/classify-pattern.test.ts (1)

516-656: ⚠️ Potential issue | 🟡 Minor

Please add explicit regex-vs-native cases for these new ops here.

The new assertions only verify the classified IR shape. This file still lacks adversarial equivalence coverage for hex_alternation, base64_2pad, identifier, space_separated_charset, and uri_scheme themselves, so semantic regressions can slip through unless the fuzz suite happens to hit them.

As per coding guidelines, "When adding a new PatternCheck op, add classification tests with regex-vs-native equivalence tests using adversarial inputs. Test BOTH filtered and unfiltered modes for ops that accept optional constraints, and anchor test regex to match validator semantics (include $ or end-anchor)."

Also applies to: 825-827

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/classify-pattern.test.ts` around lines 516 - 656, Add adversarial
regex-vs-native equivalence tests for the new PatternCheck ops: hex_alternation,
base64_2pad, identifier, space_separated_charset, and uri_scheme. For each op
(found via classifyRegex and validated with isNativeCheck), add tests that
anchor the regex to validator semantics (include end-anchor/$), generate
adversarial inputs (fuzz both matching and non-matching cases), and assert
equivalence between the native validator and the regex matcher; for ops that
accept optional constraints (e.g., identifier with optionalPrefix), include both
filtered and unfiltered modes. Follow the existing test harness style in this
file for creating adversarial equivalence assertions so the new ops get the same
coverage as prior pattern checks.
🧹 Nitpick comments (1)
src/emit-dart.ts (1)

233-239: Use dartString() for these emitted Dart literals.

These are the only new branches in this file that bypass the local Dart string escaper. JSON.stringify() leaves $ untouched, and the raw optional-prefix literal skips escaping entirely, so a charset or prefix containing interpolation/escape-sensitive characters will generate the wrong Dart source.

♻️ Proposed fix
     case 'identifier': {
       helpers.add('_checkIdentifier');
-      return { expr: `_checkIdentifier(${varExpr}, ${JSON.stringify(check.firstCharset)}, ${JSON.stringify(check.restCharset)}${check.optionalPrefix ? `, '${check.optionalPrefix}'` : ''})`, helpers };
+      return {
+        expr: `_checkIdentifier(${varExpr}, ${dartString(check.firstCharset)}, ${dartString(check.restCharset)}${check.optionalPrefix ? `, ${dartString(check.optionalPrefix)}` : ''})`,
+        helpers,
+      };
     }
     case 'space_separated_charset': {
       helpers.add('_checkSpaceSeparatedCharset');
-      return { expr: `_checkSpaceSeparatedCharset(${varExpr}, ${JSON.stringify(check.charset)})`, helpers };
+      return { expr: `_checkSpaceSeparatedCharset(${varExpr}, ${dartString(check.charset)})`, helpers };
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/emit-dart.ts` around lines 233 - 239, The emitted Dart literals in the
'identifier' and 'space_separated_charset' branches currently use JSON.stringify
and raw string interpolation for the optional prefix, which bypasses the local
Dart escaper; update both branches to call the dartString() helper when
embedding charset and optionalPrefix values (i.e., replace
JSON.stringify(check.firstCharset), JSON.stringify(check.restCharset),
JSON.stringify(check.charset) and the raw optionalPrefix with
dartString(check.firstCharset), dartString(check.restCharset),
dartString(check.charset) and dartString(check.optionalPrefix) respectively) so
the output uses the existing Dart escaper; keep the
helpers.add('_checkIdentifier') and helpers.add('_checkSpaceSeparatedCharset')
calls and ensure dartString is in scope where emit-dart.ts constructs these expr
strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/classify-pattern.ts`:
- Around line 110-119: The current hex_alternation branch uses a single check
(m[1].includes('A-F')) to decide case and thus can merge alternation arms with
different alphabets; change the logic to examine each alternation arm separately
(parse the arms from m[1] by iterating the alternation segments / each
class+quantifier pair) and verify all arms use the same character class (either
all include A-F or none do); only return { op: 'hex_alternation', lengths, case:
... } when every arm has the same alphabet, otherwise fall back to returning a
generic regex classification (e.g., do not match here and let higher-level code
return 'regex').

In `@src/emit-cpp.ts`:
- Around line 205-211: The hex_alternation branch builds helper names like
`check_hex64`/`check_hex64_mixed` but elsewhere the emitter declares helpers as
`check_hex_64`/`check_hex_64_mixed`, causing a mismatch; update the branch so
the same helper naming convention is used: change the helper name construction
used in the `hex_alternation` case (the `fn` variable passed to `helpers.add`
and returned in `${fn}(${varExpr})`) to match the declared helper pattern
(`check_hex_{len}` or `check_hex_{len}_mixed`), or alternatively update the
helper declarations to emit `check_hex{len}` names—ensure `helpers.add` and the
call sites use the identical symbol format (`check_hex_<len>` or `check_hex<
len>`) across `hex_alternation`, helper declarations, and any other references.

In `@src/emit-csharp.ts`:
- Around line 1319-1365: The three generated helpers (CheckIdentifier,
CheckSpaceSeparatedCharset, CheckUriScheme) currently dereference the input
string s before checking for null; add an explicit null guard to the start of
each function (e.g. if (s == null) return false;) so a null tag returns false
instead of throwing, leaving the rest of the existing logic unchanged; update
the start of CheckIdentifier, CheckSpaceSeparatedCharset, and CheckUriScheme to
perform this null check before any access to s.Length or s[i].

In `@src/emit-java.ts`:
- Around line 240-250: The new Java helper methods checkIdentifier,
checkSpaceSeparatedCharset, and checkUriScheme currently dereference the input
string before checking for null (because renderValueCheckJava passes raw
t.get(index) which may be null); update each helper to first guard against null
by returning false when s == null (or s == null || s.isEmpty() if appropriate to
match existing patterns) before any s.charAt(...) or other dereference, ensuring
the helpers follow the same null-check pattern as other helpers in the file and
do not throw NullPointerException.

In `@src/emit-php.ts`:
- Around line 239-249: The three helper names used in the switch cases
(check_identifier, check_space_separated_charset, check_uri_scheme) violate the
file's PHP helper naming convention and their string literals are escaped with
JSON.stringify instead of the file's phpString helper; update the cases in
emit-php.ts to register and call schemata_check_identifier,
schemata_check_space_separated_charset, and schemata_check_uri_scheme (replace
helpers.add('check_*') with helpers.add('schemata_check_*') and call the
schemata_ names in the generated expr) and replace JSON.stringify(...) arguments
with phpString(...) so the charset/prefix literals go through the file's
phpString escaping function; also ensure any other references to those helper
names in this file are renamed accordingly so the helper implementations and
registrations match the new schemata_* names.

In `@src/emit-python.ts`:
- Around line 201-207: The hex_alternation branch constructs helper names
without the underscore before the length, causing mismatch with
emitPythonHelpers which generates helpers like _check_hex_64; update the helper
name construction in the 'hex_alternation' case so the fn variable uses
`_check_hex_${len}` and `_check_hex_${len}_mixed` (preserving the existing
helpers.add(fn) and returned `${fn}(${varExpr})` usage) so emitted validators
reference the exact helper names that emitPythonHelpers generates.

In `@src/emit-ruby.ts`:
- Around line 205-212: The hex_alternation branch builds helper names like
check_hex{len} / check_hex{len}_mixed which don't match the helpers emitted by
emitRubyHelpers (which emit check_hex_{len} / check_hex_{len}_mixed); update the
helper name construction in the hex_alternation case (symbol: hex_alternation)
to include the underscore before the length (use check_hex_${len} and
check_hex_${len}_mixed naming that matches emitRubyHelpers) and ensure
helpers.add(...) is called with those corrected names so the generated Ruby
includes the required methods.

In `@src/emit-rust.ts`:
- Around line 259-265: In the 'hex_alternation' branch the helper names are
built without the underscore (producing check_hex64/check_hex64_mixed) which
don't match the helpers emitted by emitRustHelpers
(check_hex_<len>/check_hex_<len>_mixed); update the helper construction in the
case 'hex_alternation' block so the names include the underscore (e.g., use
`check_hex_${len}` and `check_hex_${len}_mixed` when calling helpers.add and
when building the `${fn}(${varExpr})` expressions) so helpers.add(...) matches
what emitRustHelpers emits.

In `@tests/fuzz-equivalence.test.ts`:
- Around line 775-797: The added hand-written oracle branches inside
buildNativeChecker (the 'hex_alternation' and 'base64_2pad' cases) must be
removed and replaced so the oracle validates strings using the actual native
regex engine (new RegExp(pattern)) rather than a JS reimplementation; locate the
case blocks for 'hex_alternation' and 'base64_2pad' in buildNativeChecker and
change them to return a function that tests the string against the compiled
RegExp for the generated pattern (or delegate to the common RegExp-based path
used elsewhere in buildNativeChecker) so the fuzz suite compares the emitter
output to the true native behavior instead of duplicating logic that could share
bugs.

---

Outside diff comments:
In `@tests/classify-pattern.test.ts`:
- Around line 516-656: Add adversarial regex-vs-native equivalence tests for the
new PatternCheck ops: hex_alternation, base64_2pad, identifier,
space_separated_charset, and uri_scheme. For each op (found via classifyRegex
and validated with isNativeCheck), add tests that anchor the regex to validator
semantics (include end-anchor/$), generate adversarial inputs (fuzz both
matching and non-matching cases), and assert equivalence between the native
validator and the regex matcher; for ops that accept optional constraints (e.g.,
identifier with optionalPrefix), include both filtered and unfiltered modes.
Follow the existing test harness style in this file for creating adversarial
equivalence assertions so the new ops get the same coverage as prior pattern
checks.

---

Nitpick comments:
In `@src/emit-dart.ts`:
- Around line 233-239: The emitted Dart literals in the 'identifier' and
'space_separated_charset' branches currently use JSON.stringify and raw string
interpolation for the optional prefix, which bypasses the local Dart escaper;
update both branches to call the dartString() helper when embedding charset and
optionalPrefix values (i.e., replace JSON.stringify(check.firstCharset),
JSON.stringify(check.restCharset), JSON.stringify(check.charset) and the raw
optionalPrefix with dartString(check.firstCharset),
dartString(check.restCharset), dartString(check.charset) and
dartString(check.optionalPrefix) respectively) so the output uses the existing
Dart escaper; keep the helpers.add('_checkIdentifier') and
helpers.add('_checkSpaceSeparatedCharset') calls and ensure dartString is in
scope where emit-dart.ts constructs these expr strings.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed1ab8ab-b492-426f-a9b2-363eac0280f6

📥 Commits

Reviewing files that changed from the base of the PR and between 8c44334 and 095d0ef.

📒 Files selected for processing (15)
  • src/classify-pattern.ts
  • src/emit-c.ts
  • src/emit-cpp.ts
  • src/emit-csharp.ts
  • src/emit-dart.ts
  • src/emit-go.ts
  • src/emit-java.ts
  • src/emit-kotlin.ts
  • src/emit-php.ts
  • src/emit-python.ts
  • src/emit-ruby.ts
  • src/emit-rust.ts
  • src/emit-swift.ts
  • tests/classify-pattern.test.ts
  • tests/fuzz-equivalence.test.ts

Comment thread src/classify-pattern.ts
Comment thread src/emit-cpp.ts
Comment thread src/emit-csharp.ts
Comment thread src/emit-java.ts
Comment thread src/emit-php.ts Outdated
Comment thread src/emit-python.ts
Comment thread src/emit-ruby.ts
Comment thread src/emit-rust.ts
Comment on lines +775 to +797
case 'hex_alternation': {
const validChars = check.case === 'lower' ? HEX_LOWER : HEX_MIXED;
const lengthSet = new Set(check.lengths);
return (s) => {
if (!lengthSet.has(s.length)) return false;
for (let i = 0; i < s.length; i++) {
if (!validChars.includes(s[i])) return false;
}
return true;
};
}

case 'base64_2pad': {
const isB64 = (c: string) => (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c === '+' || c === '/';
return (s) => {
if (s.length < 4 || s.length % 4 !== 0) return false;
if (s[s.length - 1] !== '=' || s[s.length - 2] !== '=') return false;
for (let i = 0; i < s.length - 2; i++) {
if (!isB64(s[i])) return false;
}
return true;
};
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t extend the hand-written oracle for new ops.

These branches add five more JS reimplementations to buildNativeChecker(). If an emitter/helper bug matches the same logic here, the fuzz suite still passes, so the new coverage can’t catch the regression it claims to guard against.

Based on learnings, NEVER use the native implementation as the fuzz-test oracle — the buildNativeChecker must compare against the actual regex (new RegExp(pattern)), not a reimplementation of the native check. If the oracle encodes the same bug as the implementation, the test can never detect the regression.

Also applies to: 902-957

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fuzz-equivalence.test.ts` around lines 775 - 797, The added
hand-written oracle branches inside buildNativeChecker (the 'hex_alternation'
and 'base64_2pad' cases) must be removed and replaced so the oracle validates
strings using the actual native regex engine (new RegExp(pattern)) rather than a
JS reimplementation; locate the case blocks for 'hex_alternation' and
'base64_2pad' in buildNativeChecker and change them to return a function that
tests the string against the compiled RegExp for the generated pattern (or
delegate to the common RegExp-based path used elsewhere in buildNativeChecker)
so the fuzz suite compares the emitter output to the true native behavior
instead of duplicating logic that could share bugs.

- C++/Python/Ruby/Rust: hex_alternation used check_hex{len} but helpers
  declared check_hex_{len} — add missing underscore before length
- C#/Java: add null guards to CheckIdentifier, CheckSpaceSeparatedCharset,
  CheckUriScheme matching all other helpers in those emitters
- PHP: rename check_identifier/check_space_separated_charset/check_uri_scheme
  to schemata_check_* matching file convention, use phpString() instead of
  JSON.stringify() for string escaping
- Dart: use dartString() instead of JSON.stringify() for charset literals

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alltheseas alltheseas merged commit 4de67b7 into main Apr 2, 2026
4 checks passed
@alltheseas alltheseas deleted the feat/hex-alternation-base64-2pad branch April 2, 2026 05:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Classifier gaps: 7 new regex patterns in schemata v0.3.2 Classifier: 2 new patterns fall back to regex

1 participant