feat: add hex_alternation, base64_2pad, identifier, space_separated_charset, uri_scheme native ops#38
Conversation
Add two new PatternCheck ops for MIP-00/MIP-05 schema patterns that previously fell back to regex: - hex_alternation: multi-length hex string (e.g. 64|96|128 chars) - base64_2pad: strict base64 with mandatory == padding Both ops are implemented in all 11 non-TS emitters with matching helper functions. Fuzz-equivalence tests verify correctness against the actual regex patterns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (9)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughThe PR extends the pattern classifier with five new native PatternCheck ops— Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…40) Add 3 new PatternCheck ops to eliminate the remaining 7 regex fallbacks introduced in schemata v0.3.2: - identifier: ^[prefix]?[firstCharset][restCharset]*$ (5 patterns) - space_separated_charset: ^[charset]+( [charset]+)*$ (1 pattern) - uri_scheme: ^[A-Za-z][A-Za-z0-9+.-]*:// (1 pattern) All 12 language emitters updated with renderPatternCheck cases and helper implementations. Classifier coverage: 90/90 patterns native, zero regex fallbacks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ates per-branch case P1: Ruby check_identifier, check_space_separated_charset, check_uri_scheme were emitted as instance methods (`def check_*`) but called from module singleton context. Changed to `def self.check_*` with correct 2/4-space indentation matching all existing helpers. P2: hex_alternation classifier assumed all alternation branches shared the same case policy. Now checks each branch individually and falls through to regex if policies differ, preventing silent case-widening. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/classify-pattern.test.ts (1)
516-656:⚠️ Potential issue | 🟡 MinorPlease add explicit regex-vs-native cases for these new ops here.
The new assertions only verify the classified IR shape. This file still lacks adversarial equivalence coverage for
hex_alternation,base64_2pad,identifier,space_separated_charset, anduri_schemethemselves, so semantic regressions can slip through unless the fuzz suite happens to hit them.As per coding guidelines, "When adding a new PatternCheck op, add classification tests with regex-vs-native equivalence tests using adversarial inputs. Test BOTH filtered and unfiltered modes for ops that accept optional constraints, and anchor test regex to match validator semantics (include
$or end-anchor)."Also applies to: 825-827
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/classify-pattern.test.ts` around lines 516 - 656, Add adversarial regex-vs-native equivalence tests for the new PatternCheck ops: hex_alternation, base64_2pad, identifier, space_separated_charset, and uri_scheme. For each op (found via classifyRegex and validated with isNativeCheck), add tests that anchor the regex to validator semantics (include end-anchor/$), generate adversarial inputs (fuzz both matching and non-matching cases), and assert equivalence between the native validator and the regex matcher; for ops that accept optional constraints (e.g., identifier with optionalPrefix), include both filtered and unfiltered modes. Follow the existing test harness style in this file for creating adversarial equivalence assertions so the new ops get the same coverage as prior pattern checks.
🧹 Nitpick comments (1)
src/emit-dart.ts (1)
233-239: UsedartString()for these emitted Dart literals.These are the only new branches in this file that bypass the local Dart string escaper.
JSON.stringify()leaves$untouched, and the raw optional-prefix literal skips escaping entirely, so a charset or prefix containing interpolation/escape-sensitive characters will generate the wrong Dart source.♻️ Proposed fix
case 'identifier': { helpers.add('_checkIdentifier'); - return { expr: `_checkIdentifier(${varExpr}, ${JSON.stringify(check.firstCharset)}, ${JSON.stringify(check.restCharset)}${check.optionalPrefix ? `, '${check.optionalPrefix}'` : ''})`, helpers }; + return { + expr: `_checkIdentifier(${varExpr}, ${dartString(check.firstCharset)}, ${dartString(check.restCharset)}${check.optionalPrefix ? `, ${dartString(check.optionalPrefix)}` : ''})`, + helpers, + }; } case 'space_separated_charset': { helpers.add('_checkSpaceSeparatedCharset'); - return { expr: `_checkSpaceSeparatedCharset(${varExpr}, ${JSON.stringify(check.charset)})`, helpers }; + return { expr: `_checkSpaceSeparatedCharset(${varExpr}, ${dartString(check.charset)})`, helpers }; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/emit-dart.ts` around lines 233 - 239, The emitted Dart literals in the 'identifier' and 'space_separated_charset' branches currently use JSON.stringify and raw string interpolation for the optional prefix, which bypasses the local Dart escaper; update both branches to call the dartString() helper when embedding charset and optionalPrefix values (i.e., replace JSON.stringify(check.firstCharset), JSON.stringify(check.restCharset), JSON.stringify(check.charset) and the raw optionalPrefix with dartString(check.firstCharset), dartString(check.restCharset), dartString(check.charset) and dartString(check.optionalPrefix) respectively) so the output uses the existing Dart escaper; keep the helpers.add('_checkIdentifier') and helpers.add('_checkSpaceSeparatedCharset') calls and ensure dartString is in scope where emit-dart.ts constructs these expr strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/classify-pattern.ts`:
- Around line 110-119: The current hex_alternation branch uses a single check
(m[1].includes('A-F')) to decide case and thus can merge alternation arms with
different alphabets; change the logic to examine each alternation arm separately
(parse the arms from m[1] by iterating the alternation segments / each
class+quantifier pair) and verify all arms use the same character class (either
all include A-F or none do); only return { op: 'hex_alternation', lengths, case:
... } when every arm has the same alphabet, otherwise fall back to returning a
generic regex classification (e.g., do not match here and let higher-level code
return 'regex').
In `@src/emit-cpp.ts`:
- Around line 205-211: The hex_alternation branch builds helper names like
`check_hex64`/`check_hex64_mixed` but elsewhere the emitter declares helpers as
`check_hex_64`/`check_hex_64_mixed`, causing a mismatch; update the branch so
the same helper naming convention is used: change the helper name construction
used in the `hex_alternation` case (the `fn` variable passed to `helpers.add`
and returned in `${fn}(${varExpr})`) to match the declared helper pattern
(`check_hex_{len}` or `check_hex_{len}_mixed`), or alternatively update the
helper declarations to emit `check_hex{len}` names—ensure `helpers.add` and the
call sites use the identical symbol format (`check_hex_<len>` or `check_hex<
len>`) across `hex_alternation`, helper declarations, and any other references.
In `@src/emit-csharp.ts`:
- Around line 1319-1365: The three generated helpers (CheckIdentifier,
CheckSpaceSeparatedCharset, CheckUriScheme) currently dereference the input
string s before checking for null; add an explicit null guard to the start of
each function (e.g. if (s == null) return false;) so a null tag returns false
instead of throwing, leaving the rest of the existing logic unchanged; update
the start of CheckIdentifier, CheckSpaceSeparatedCharset, and CheckUriScheme to
perform this null check before any access to s.Length or s[i].
In `@src/emit-java.ts`:
- Around line 240-250: The new Java helper methods checkIdentifier,
checkSpaceSeparatedCharset, and checkUriScheme currently dereference the input
string before checking for null (because renderValueCheckJava passes raw
t.get(index) which may be null); update each helper to first guard against null
by returning false when s == null (or s == null || s.isEmpty() if appropriate to
match existing patterns) before any s.charAt(...) or other dereference, ensuring
the helpers follow the same null-check pattern as other helpers in the file and
do not throw NullPointerException.
In `@src/emit-php.ts`:
- Around line 239-249: The three helper names used in the switch cases
(check_identifier, check_space_separated_charset, check_uri_scheme) violate the
file's PHP helper naming convention and their string literals are escaped with
JSON.stringify instead of the file's phpString helper; update the cases in
emit-php.ts to register and call schemata_check_identifier,
schemata_check_space_separated_charset, and schemata_check_uri_scheme (replace
helpers.add('check_*') with helpers.add('schemata_check_*') and call the
schemata_ names in the generated expr) and replace JSON.stringify(...) arguments
with phpString(...) so the charset/prefix literals go through the file's
phpString escaping function; also ensure any other references to those helper
names in this file are renamed accordingly so the helper implementations and
registrations match the new schemata_* names.
In `@src/emit-python.ts`:
- Around line 201-207: The hex_alternation branch constructs helper names
without the underscore before the length, causing mismatch with
emitPythonHelpers which generates helpers like _check_hex_64; update the helper
name construction in the 'hex_alternation' case so the fn variable uses
`_check_hex_${len}` and `_check_hex_${len}_mixed` (preserving the existing
helpers.add(fn) and returned `${fn}(${varExpr})` usage) so emitted validators
reference the exact helper names that emitPythonHelpers generates.
In `@src/emit-ruby.ts`:
- Around line 205-212: The hex_alternation branch builds helper names like
check_hex{len} / check_hex{len}_mixed which don't match the helpers emitted by
emitRubyHelpers (which emit check_hex_{len} / check_hex_{len}_mixed); update the
helper name construction in the hex_alternation case (symbol: hex_alternation)
to include the underscore before the length (use check_hex_${len} and
check_hex_${len}_mixed naming that matches emitRubyHelpers) and ensure
helpers.add(...) is called with those corrected names so the generated Ruby
includes the required methods.
In `@src/emit-rust.ts`:
- Around line 259-265: In the 'hex_alternation' branch the helper names are
built without the underscore (producing check_hex64/check_hex64_mixed) which
don't match the helpers emitted by emitRustHelpers
(check_hex_<len>/check_hex_<len>_mixed); update the helper construction in the
case 'hex_alternation' block so the names include the underscore (e.g., use
`check_hex_${len}` and `check_hex_${len}_mixed` when calling helpers.add and
when building the `${fn}(${varExpr})` expressions) so helpers.add(...) matches
what emitRustHelpers emits.
In `@tests/fuzz-equivalence.test.ts`:
- Around line 775-797: The added hand-written oracle branches inside
buildNativeChecker (the 'hex_alternation' and 'base64_2pad' cases) must be
removed and replaced so the oracle validates strings using the actual native
regex engine (new RegExp(pattern)) rather than a JS reimplementation; locate the
case blocks for 'hex_alternation' and 'base64_2pad' in buildNativeChecker and
change them to return a function that tests the string against the compiled
RegExp for the generated pattern (or delegate to the common RegExp-based path
used elsewhere in buildNativeChecker) so the fuzz suite compares the emitter
output to the true native behavior instead of duplicating logic that could share
bugs.
---
Outside diff comments:
In `@tests/classify-pattern.test.ts`:
- Around line 516-656: Add adversarial regex-vs-native equivalence tests for the
new PatternCheck ops: hex_alternation, base64_2pad, identifier,
space_separated_charset, and uri_scheme. For each op (found via classifyRegex
and validated with isNativeCheck), add tests that anchor the regex to validator
semantics (include end-anchor/$), generate adversarial inputs (fuzz both
matching and non-matching cases), and assert equivalence between the native
validator and the regex matcher; for ops that accept optional constraints (e.g.,
identifier with optionalPrefix), include both filtered and unfiltered modes.
Follow the existing test harness style in this file for creating adversarial
equivalence assertions so the new ops get the same coverage as prior pattern
checks.
---
Nitpick comments:
In `@src/emit-dart.ts`:
- Around line 233-239: The emitted Dart literals in the 'identifier' and
'space_separated_charset' branches currently use JSON.stringify and raw string
interpolation for the optional prefix, which bypasses the local Dart escaper;
update both branches to call the dartString() helper when embedding charset and
optionalPrefix values (i.e., replace JSON.stringify(check.firstCharset),
JSON.stringify(check.restCharset), JSON.stringify(check.charset) and the raw
optionalPrefix with dartString(check.firstCharset),
dartString(check.restCharset), dartString(check.charset) and
dartString(check.optionalPrefix) respectively) so the output uses the existing
Dart escaper; keep the helpers.add('_checkIdentifier') and
helpers.add('_checkSpaceSeparatedCharset') calls and ensure dartString is in
scope where emit-dart.ts constructs these expr strings.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ed1ab8ab-b492-426f-a9b2-363eac0280f6
📒 Files selected for processing (15)
src/classify-pattern.tssrc/emit-c.tssrc/emit-cpp.tssrc/emit-csharp.tssrc/emit-dart.tssrc/emit-go.tssrc/emit-java.tssrc/emit-kotlin.tssrc/emit-php.tssrc/emit-python.tssrc/emit-ruby.tssrc/emit-rust.tssrc/emit-swift.tstests/classify-pattern.test.tstests/fuzz-equivalence.test.ts
| case 'hex_alternation': { | ||
| const validChars = check.case === 'lower' ? HEX_LOWER : HEX_MIXED; | ||
| const lengthSet = new Set(check.lengths); | ||
| return (s) => { | ||
| if (!lengthSet.has(s.length)) return false; | ||
| for (let i = 0; i < s.length; i++) { | ||
| if (!validChars.includes(s[i])) return false; | ||
| } | ||
| return true; | ||
| }; | ||
| } | ||
|
|
||
| case 'base64_2pad': { | ||
| const isB64 = (c: string) => (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c === '+' || c === '/'; | ||
| return (s) => { | ||
| if (s.length < 4 || s.length % 4 !== 0) return false; | ||
| if (s[s.length - 1] !== '=' || s[s.length - 2] !== '=') return false; | ||
| for (let i = 0; i < s.length - 2; i++) { | ||
| if (!isB64(s[i])) return false; | ||
| } | ||
| return true; | ||
| }; | ||
| } |
There was a problem hiding this comment.
Don’t extend the hand-written oracle for new ops.
These branches add five more JS reimplementations to buildNativeChecker(). If an emitter/helper bug matches the same logic here, the fuzz suite still passes, so the new coverage can’t catch the regression it claims to guard against.
Based on learnings, NEVER use the native implementation as the fuzz-test oracle — the buildNativeChecker must compare against the actual regex (new RegExp(pattern)), not a reimplementation of the native check. If the oracle encodes the same bug as the implementation, the test can never detect the regression.
Also applies to: 902-957
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/fuzz-equivalence.test.ts` around lines 775 - 797, The added
hand-written oracle branches inside buildNativeChecker (the 'hex_alternation'
and 'base64_2pad' cases) must be removed and replaced so the oracle validates
strings using the actual native regex engine (new RegExp(pattern)) rather than a
JS reimplementation; locate the case blocks for 'hex_alternation' and
'base64_2pad' in buildNativeChecker and change them to return a function that
tests the string against the compiled RegExp for the generated pattern (or
delegate to the common RegExp-based path used elsewhere in buildNativeChecker)
so the fuzz suite compares the emitter output to the true native behavior
instead of duplicating logic that could share bugs.
- C++/Python/Ruby/Rust: hex_alternation used check_hex{len} but helpers
declared check_hex_{len} — add missing underscore before length
- C#/Java: add null guards to CheckIdentifier, CheckSpaceSeparatedCharset,
CheckUriScheme matching all other helpers in those emitters
- PHP: rename check_identifier/check_space_separated_charset/check_uri_scheme
to schemata_check_* matching file convention, use phpString() instead of
JSON.stringify() for string escaping
- Dart: use dartString() instead of JSON.stringify() for charset literals
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
PatternCheckops to eliminate all remaining regex fallback patterns from schemata v0.3.2renderPatternCheckcases and helper implementationsCommit 1:
hex_alternation+base64_2pad(closes #37)hex_alternation: multi-length hex alternation (e.g.^(?:[a-f0-9]{64}|[a-f0-9]{96}|[a-f0-9]{128})$)base64_2pad: strict base64 with mandatory==paddingCommit 2:
identifier+space_separated_charset+uri_scheme(closes #40)^[a-z][a-z0-9]*$identifier^[A-Z][a-zA-Z0-9]*$identifier^[a-z][a-z0-9-]*$identifier^!?[a-z][a-z0-9]*$identifier^!?[0-9]+$identifier^[a-z_]+( [a-z_]+)*$space_separated_charset^[A-Za-z][A-Za-z0-9+.-]*://uri_schemeTest plan
npm run buildcompiles cleanlynpm test— 927 tests pass, 0 failures--allcode generation works for all 13 languagesCloses #37, closes #40
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests