docs(parity): expand harness to all 19 Dynamo parser families #9261
keivenchang wants to merge 1 commit into main from
Walkthrough
This PR updates the parity test suite documentation to reflect expanded parser family coverage. The README was revised to display three additional families in the results table, increase aggregate cell counts from 140 to 200, and adjust narrative text to correctly reference the broader fixture corpus and its implications for test case overlap.
Changes
Parser Family Coverage Documentation Update
Estimated code review effort
🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/parity/README.md (2)
329-329: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the case count to match the expanded coverage.
Line 329 still references "~70 black-box tests", which was accurate for 7 families × 10 cases. With 10 families now covered, this should be ~100 to match line 380's updated reference.
📝 Proposed fix
-— for the ~70 black-box "given input X, parser returns Y" tests,
+— for the ~100 black-box "given input X, parser returns Y" tests,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parity/README.md` at line 329, Update the README text that still reads "~70 black-box \"given input X, parser returns Y\" tests" to reflect the expanded coverage; change that phrase to "~100 black-box \"given input X, parser returns Y\" tests" (or "about 100") so it matches the updated reference on line 380 and the current 10 families × 10 cases count; locate the exact string in tests/parity/README.md and replace it accordingly.
102-102: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the test count to reflect the expanded coverage.
The cost metric still references 210 tests, which was correct for 7 families (7 families × 3 implementations × 10 cases = 210). With the addition of pythonic, gemma4, and deepseek_v3, the total is now 10 families × 3 implementations × 10 cases = 300 tests.
📊 Proposed fix
-| **Cost** | (TBD) | ~3 s for 210 tests | ~60 s for 30 tests (server boot dominates) |
+| **Cost** | (TBD) | ~3 s for 300 tests | ~60 s for 30 tests (server boot dominates) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/parity/README.md` at line 102, The README table's Cost row still lists "210 tests"; update the test count to reflect the new total of 300 tests (10 families × 3 implementations × 10 cases) by replacing "210 tests" with "300 tests" in the table cell (the line starting with "| **Cost** | (TBD) | ~3 s for 210 tests | ..."). Ensure the text exactly reflects "300 tests".
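The count arithmetic behind both findings can be sketched in a few lines of Python. This is only an illustration of the "families × implementations × cases" reasoning quoted above; the function name `total_tests` is hypothetical and is not part of the parity harness.

```python
def total_tests(families: int, implementations: int, cases_per_family: int = 10) -> int:
    """Size of the parity matrix: one test per (family, implementation, case)."""
    return families * implementations * cases_per_family


# Counts quoted in the review comments above.
print(total_tests(7, 3))   # 210: the stale Cost-row figure (7 families)
print(total_tests(10, 3))  # 300: the suggested figure (10 families)
print(total_tests(10, 1))  # 100: black-box cases for a single implementation
```

The same formula applies to any later growth of the matrix, which is why a stale family count shows up in several README numbers at once.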
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/parity/README.md`:
- Line 329: Update the README text that still reads "~70 black-box \"given input
X, parser returns Y\" tests" to reflect the expanded coverage; change that
phrase to "~100 black-box \"given input X, parser returns Y\" tests" (or "about
100") so it matches the updated reference on line 380 and the current 10
families × 10 cases count; locate the exact string in tests/parity/README.md and
replace it accordingly.
- Line 102: The README table's Cost row still lists "210 tests"; update the test
count to reflect the new total of 300 tests (10 families × 3 implementations ×
10 cases) by replacing "210 tests" with "300 tests" in the table cell (the line
starting with "| **Cost** | (TBD) | ~3 s for 210 tests | ..."). Ensure the text
exactly reflects "300 tests".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f478b20b-d2f4-4675-9294-b97fade3e24e
📒 Files selected for processing (1)
tests/parity/README.md
@coderabbitai both outside-diff findings caught — fixed in 51ddee6, but with larger numbers than your suggestions because the matrix grew further in-PR after a7d8090:
Same fix shape, scaled up. Please verify and resolve.
One minor stale reference I noticed while scanning: Line 250 still reads
Resolving the original comments now. [resolve]
#9186 shipped the M2 cross-impl parity harness with 7 parser families; this PR adds the 12 remaining families Dynamo registers that have at least one upstream peer in vLLM or SGLang, bringing the matrix to full registry coverage.

* New family fixtures (10 cases each): pythonic, gemma4, deepseek_v3 (legacy), deepseek_v4, deepseek_v3_2, hermes, qwen25, mistral, jamba, llama3_json, phi4, nemotron_nano. Two with no upstream peer (nemotron_deci, nemotron_nano) listed for completeness.
* Wrapper map additions in tests/parity/parser/{vllm,sglang}.py for the matching peer parsers / detectors.
* KNOWN_DIVERGENCES: ~50 new entries from smoke runs in vllm + sglang containers. Three categories: impl-defined recovery contracts on PARSER.batch.4/5 (malformed args / missing end-token), trailing-text drops on PARSER.batch.8 (XML-style families), and format-detection failures on sglang/mistral/* and sglang/qwen25/* (those detectors return calls=[] on the bare wire formats Dynamo emits; root cause not investigated here, cells recorded as X so the matrix reflects observed behavior).
* README matrix restructured into a single table with bold Top-N models / Others section dividers; family names use compact bN column headers with a bN = PARSER.batch.N legend above; row labels carry (model name) annotations on the Top-N tier; whole-impl-na rows carry footnote markers (no vLLM peer / no SGLang peer).

No parser-side code change: lib/parsers/ untouched. Doc + harness wiring only.

Smoke against pinned containers (vllm 0.20.1, sglang 0.5.10.post1):

* vllm: 120 passed / 30 xfailed / 40 skipped
* sglang: 83 passed / 37 xfailed / 70 skipped
* dynamo: 190/190 passed (oracle)

Tally (excluding dynamo): 380 cells across 19 families x 10 cases x 2 impls: 203 parity, 67 divergence (xfailed), 110 n/a.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
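As a sanity check on the tally in the commit message, the per-implementation smoke counts can be summed and compared against the 19 x 10 x 2 cell total. The sketch below uses only the numbers quoted above; the `SmokeResult` class and `tally` function are hypothetical names for illustration, not harness code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokeResult:
    passed: int
    xfailed: int
    skipped: int

    @property
    def total(self) -> int:
        return self.passed + self.xfailed + self.skipped


FAMILIES, CASES = 19, 10

# Counts copied from the commit message (dynamo, the oracle, is excluded).
RESULTS = {
    "vllm": SmokeResult(passed=120, xfailed=30, skipped=40),
    "sglang": SmokeResult(passed=83, xfailed=37, skipped=70),
}


def tally(results: dict) -> dict:
    """Aggregate per-implementation outcomes into matrix cell categories."""
    return {
        "parity": sum(r.passed for r in results.values()),
        "divergence": sum(r.xfailed for r in results.values()),
        "n/a": sum(r.skipped for r in results.values()),
    }


# Each implementation should account for every (family, case) pair exactly once.
for name, result in RESULTS.items():
    assert result.total == FAMILIES * CASES, name

print(tally(RESULTS))  # {'parity': 203, 'divergence': 67, 'n/a': 110}
```

The three categories sum to 380, matching 19 families x 10 cases x 2 implementations.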
905c817 to 3abad8a
Overview:
#9186 shipped the M2 cross-impl parity harness with 7 parser families; this PR adds the 12 remaining families Dynamo registers that have at least one upstream peer, bringing the matrix to full registry coverage.
Details:
* Families added: `pythonic`, `gemma4`, `deepseek_v3` (legacy), `deepseek_v4`, `deepseek_v3_2`, `hermes`, `qwen25`, `mistral`, `jamba`, `llama3_json`, `phi4`, and `nemotron_nano`; matching vLLM keys and SGLang detectors wired through `tests/parity/parser/{vllm,sglang}.py`; fixtures regenerated via the live Dynamo binding; cross-impl divergences recorded as `KNOWN_DIVERGENCES` xfails.
* Divergences fall into three categories: impl-defined recovery contracts on `PARSER.batch.4`/`.5` (malformed args / missing end-token), trailing-text drops on `PARSER.batch.8` (XML-style families), and format-detection failures on `sglang/mistral/*` and `sglang/qwen25/*` (root cause not investigated here; cells recorded as `X` so the matrix reflects observed behavior).
* No parser-side code change: `lib/parsers/` untouched. Doc + harness wiring only.

Coverage matrix (as in `tests/parity/README.md`)

Cell legend: `✓` = both vLLM and SGLang match Dynamo; `V` = vLLM diverges (xfailed); `S` = SGLang diverges; `VS` = both diverge; `n/a` = no peer parser.
Column legend: `bN` = `PARSER.batch.N` (1 happy-path, 2 multiple, 3 no-tool, 4 malformed, 5 missing-end-token, 6 empty-args, 7 complex-args, 8 interleaved-text, 9 empty, 10 duplicate).
`†` = no vLLM peer (or vLLM returns `UNAVAILABLE` at runtime: harmony needs token IDs); `§` = no SGLang peer.

Verification:
Smoke against pinned containers — vllm 120/30/40, sglang 83/37/70, dynamo 190/190 (passed/xfailed/skipped); skips are families without a peer on that side.
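To make the cell legend concrete, here is a minimal sketch of how one matrix cell could be derived from the two peer outcomes. It is an illustration only: `matrix_cell` is a hypothetical helper, not the README generator, and it assumes that when just one peer exists the cell reflects that peer alone (the legend does not spell this case out).

```python
from typing import Optional


def matrix_cell(vllm_diverges: Optional[bool], sglang_diverges: Optional[bool]) -> str:
    """Return the legend symbol for one (family, case) cell.

    True means the implementation diverges from Dynamo, False means it
    matches, and None means there is no peer parser on that side.
    """
    if vllm_diverges is None and sglang_diverges is None:
        return "n/a"
    # Assumption: a missing peer (None) contributes no mark of its own.
    marks = ("V" if vllm_diverges else "") + ("S" if sglang_diverges else "")
    return marks or "✓"


print(matrix_cell(False, False))  # ✓
print(matrix_cell(True, False))   # V
print(matrix_cell(False, True))   # S
print(matrix_cell(True, True))    # VS
print(matrix_cell(None, None))    # n/a
```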
Where should the reviewer start?
`tests/parity/README.md` (matrix), then `tests/parity/parser/test_parity_parser.py` (`KNOWN_DIVERGENCES` block).
Related Issues:
Relates to DIS-1906
/coderabbit profile chill