Skip to content

docs(parity): expand harness to all 19 Dynamo parser families#9261

Draft
keivenchang wants to merge 1 commit intomainfrom
keivenchang/DIS-1906__matrix-update-3-new-families
Draft

docs(parity): expand harness to all 19 Dynamo parser families#9261
keivenchang wants to merge 1 commit intomainfrom
keivenchang/DIS-1906__matrix-update-3-new-families

Conversation

@keivenchang
Copy link
Copy Markdown
Contributor

@keivenchang keivenchang commented May 7, 2026

Overview:

#9186 shipped the M2 cross-impl parity harness with 7 parser families; this PR adds the 12 remaining families Dynamo registers that have at least one upstream peer, bringing the matrix to full registry coverage.

Details:

  • How is this fixed? New fixtures for pythonic, gemma4, deepseek_v3 (legacy), deepseek_v4, deepseek_v3_2, hermes, qwen25, mistral, jamba, llama3_json, phi4, and nemotron_nano; matching vLLM keys and SGLang detectors wired through tests/parity/parser/{vllm,sglang}.py; fixtures regenerated via the live Dynamo binding; cross-impl divergences recorded as KNOWN_DIVERGENCES xfails.
  • Divergence categories surfaced: impl-defined recovery contracts on PARSER.batch.4/.5 (malformed args / missing end-token), trailing-text drops on PARSER.batch.8 (XML-style families), and format-detection failures on sglang/mistral/* and sglang/qwen25/* (root cause not investigated here; cells recorded as X so the matrix reflects observed behavior).
  • No parser-side code change — lib/parsers/ untouched. Doc + harness wiring only.

Coverage matrix (as in tests/parity/README.md)

Cell legend: = both vLLM and SGLang match Dynamo; V = vLLM diverges (xfailed); S = SGLang diverges; VS = both diverge; n/a = no peer parser. Column legend: bN = PARSER.batch.N (1 happy-path, 2 multiple, 3 no-tool, 4 malformed, 5 missing-end-token, 6 empty-args, 7 complex-args, 8 interleaved-text, 9 empty, 10 duplicate).

family b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
Top-N models
deepseek_v4 § (DeepSeek V4) V V
gemma4 § (Gemma 4) V
glm47 (GLM 5.1) VS V
harmony † (gpt-oss) S S S S S S S S
kimi_k2 (Kimi K2.6) S VS
minimax_m2 (MiniMax 2.7) V VS VS
nemotron_deci †§ (Nemotron) n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a
qwen3_coder (Qwen 3.5) VS
Others
deepseek_v3 V VS S
deepseek_v3_1 V VS S
deepseek_v3_2 V V S
hermes VS V
jamba § V V
llama3_json § V V V V
mistral S S VS S S S VS S
nemotron_nano †§ n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a
phi4 § V V V
pythonic VS
qwen25 † S S S S S S S S

= no vLLM peer (or vLLM returns UNAVAILABLE at runtime — harmony needs token IDs); § = no SGLang peer.

Verification:

Smoke against pinned containers — vllm 120/30/40, sglang 83/37/70, dynamo 190/190 (passed/xfailed/skipped); skips are families without a peer on that side.

Where should the reviewer start?

tests/parity/README.md (matrix), then tests/parity/parser/test_parity_parser.py (KNOWN_DIVERGENCES block).

Related Issues:

Relates to DIS-1906

/coderabbit profile chill

@keivenchang keivenchang self-assigned this May 7, 2026
@github-actions github-actions Bot added documentation Improvements or additions to documentation docs labels May 7, 2026
Comment thread tests/parity/README.md Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack

Walkthrough

This PR updates the parity test suite documentation to reflect expanded parser family coverage. The README was revised to display three additional families in the results table, increase aggregate cell counts from 140 to 200, and adjust narrative text to correctly reference the broader fixture corpus and its implications for test case overlap.

Changes

Parser Family Coverage Documentation Update

Layer / File(s) Summary
Parity Status Table
tests/parity/README.md
The "Current parity status (M2, batch mode)" results table extended to include pythonic, gemma4, and deepseek_v3 families; aggregate cell count increased from 140 to 200 excluding Dynamo oracle, with corresponding revisions to parity, xfail, and n/a tallies.
Documentation References
tests/parity/README.md
Narrative text updated to state reviewers can grep case IDs across all 10 families; overlap metric against Rust suite revised from ~70 to ~100 cases to align with current fixture expansion.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately reflects the main change—expanding parity harness coverage to all 19 Dynamo parser families—matching the PR objectives.
Description check ✅ Passed The PR description provides comprehensive details including overview, implementation specifics, coverage matrix, verification results, and related issues.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/parity/README.md (2)

329-329: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the case count to match the expanded coverage.

Line 329 still references "~70 black-box tests", which was accurate for 7 families × 10 cases. With 10 families now covered, this should be ~100 to match line 380's updated reference.

📝 Proposed fix
-— for the ~70 black-box "given input X, parser returns Y" tests,
+— for the ~100 black-box "given input X, parser returns Y" tests,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/parity/README.md` at line 329, Update the README text that still reads
"~70 black-box \"given input X, parser returns Y\" tests" to reflect the
expanded coverage; change that phrase to "~100 black-box \"given input X, parser
returns Y\" tests" (or "about 100") so it matches the updated reference on line
380 and the current 10 families × 10 cases count; locate the exact string in
tests/parity/README.md and replace it accordingly.

102-102: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the test count to reflect the expanded coverage.

The cost metric still references 210 tests, which was correct for 7 families (7 families × 3 implementations × 10 cases = 210). With the addition of pythonic, gemma4, and deepseek_v3, the total is now 10 families × 3 implementations × 10 cases = 300 tests.

📊 Proposed fix
-| **Cost** | (TBD) | ~3 s for 210 tests | ~60 s for 30 tests (server boot dominates) |
+| **Cost** | (TBD) | ~3 s for 300 tests | ~60 s for 30 tests (server boot dominates) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/parity/README.md` at line 102, The README table's Cost row still lists
"210 tests"; update the test count to reflect the new total of 300 tests (10
families × 3 implementations × 10 cases) by replacing "210 tests" with "300
tests" in the table cell (the line starting with "| **Cost** | (TBD) | ~3 s for
210 tests | ..."). Ensure the text exactly reflects "300 tests".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/parity/README.md`:
- Line 329: Update the README text that still reads "~70 black-box \"given input
X, parser returns Y\" tests" to reflect the expanded coverage; change that
phrase to "~100 black-box \"given input X, parser returns Y\" tests" (or "about
100") so it matches the updated reference on line 380 and the current 10
families × 10 cases count; locate the exact string in tests/parity/README.md and
replace it accordingly.
- Line 102: The README table's Cost row still lists "210 tests"; update the test
count to reflect the new total of 300 tests (10 families × 3 implementations ×
10 cases) by replacing "210 tests" with "300 tests" in the table cell (the line
starting with "| **Cost** | (TBD) | ~3 s for 210 tests | ..."). Ensure the text
exactly reflects "300 tests".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f478b20b-d2f4-4675-9294-b97fade3e24e

📥 Commits

Reviewing files that changed from the base of the PR and between f5a9463 and a7d8090.

📒 Files selected for processing (1)
  • tests/parity/README.md

@pull-request-size pull-request-size Bot added size/M and removed size/S labels May 7, 2026
@keivenchang keivenchang changed the title docs(parity): refresh coverage matrix for 3 families merged in #9186 docs(parity): refresh coverage matrix for 3 families May 7, 2026
@keivenchang keivenchang changed the title docs(parity): refresh coverage matrix for 3 families docs(parity): refresh coverage matrix for parser families May 7, 2026
@keivenchang keivenchang changed the title docs(parity): refresh coverage matrix for parser families docs(parser): refresh coverage matrix for parser families May 7, 2026
@keivenchang keivenchang requested review from a team as code owners May 7, 2026 18:00
@pull-request-size pull-request-size Bot added size/L and removed size/M labels May 7, 2026
@keivenchang
Copy link
Copy Markdown
Contributor Author

@coderabbitai both outside-diff findings caught — fixed in 51ddee6, but with larger numbers than your suggestions because the matrix grew further in-PR after a7d8090:

  • Line 102 cost row: 210570 (19 families × 3 impls × 10 cases, not 300 — added hermes, qwen25, mistral, jamba, llama3_json, phi4, nemotron_nano, deepseek_v3_2 after your snapshot).
  • Line 346 black-box test count: ~70~190 (19 families × 10 cases, not ~100).

Same fix shape, scaled up. Please verify and resolve.

@keivenchang keivenchang changed the title docs(parser): refresh coverage matrix for parser families docs(parity): expand harness to all 19 Dynamo parser families May 7, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check current state of the README around the claimed fix locations
grep -n "570\|210\|300\|cost\|190\|~70\|~100\|families" tests/parity/README.md | head -60
echo "---LINE 95-115---"
sed -n '95,115p' tests/parity/README.md
echo "---LINE 340-355---"
sed -n '340,355p' tests/parity/README.md

Length of output: 2558


@keivenchang: Both fixes verified in the current branch ✅

  • Line 102~3 s for 570 tests (19 × 3 × 10 = 570) ✓
  • Line 346~190 black-box "given input X, parser returns Y" tests (19 × 10 = 190) ✓

One minor stale reference I noticed while scanning: Line 250 still reads "across all 10 families" while Line 144 now correctly states "19 parser families total". Not blocking, but worth a one-word fix if you do another pass.

Resolving the original comments now.

[resolve]

Copy link
Copy Markdown
Contributor

@furionw furionw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

#9186 shipped the M2 cross-impl parity harness with 7 parser families;
this PR adds the 12 remaining families Dynamo registers that have at
least one upstream peer in vLLM or SGLang, bringing the matrix to
full registry coverage.

* New family fixtures (10 cases each): pythonic, gemma4, deepseek_v3
  (legacy), deepseek_v4, deepseek_v3_2, hermes, qwen25, mistral, jamba,
  llama3_json, phi4, nemotron_nano. Two with no upstream peer
  (nemotron_deci, nemotron_nano) listed for completeness.
* Wrapper map additions in tests/parity/parser/{vllm,sglang}.py for
  the matching peer parsers / detectors.
* KNOWN_DIVERGENCES: ~50 new entries from smoke runs in vllm + sglang
  containers. Three categories: impl-defined recovery contracts on
  PARSER.batch.4/5 (malformed args / missing end-token), trailing-text
  drops on PARSER.batch.8 (XML-style families), and format-detection
  failures on sglang/mistral/* and sglang/qwen25/* (those detectors
  return calls=[] on the bare wire formats Dynamo emits — root cause
  not investigated here, cells recorded as X so the matrix reflects
  observed behavior).
* README matrix restructured into a single table with bold Top-N
  models / Others section dividers; family names use compact bN
  column headers with a bN = PARSER.batch.N legend above; row labels
  carry (model name) annotations on the Top-N tier; whole-impl-na rows
  carry footnote markers (no vLLM peer / no SGLang peer).

No parser-side code change — lib/parsers/ untouched. Doc + harness
wiring only.

Smoke against pinned containers (vllm 0.20.1, sglang 0.5.10.post1):
* vllm:   120 passed / 30 xfailed / 40 skipped
* sglang:  83 passed / 37 xfailed / 70 skipped
* dynamo: 190/190 passed (oracle)

Tally (excluding dynamo): 380 cells across 19 families x 10 cases x
2 impls -- 203 parity, 67 divergence (xfailed), 110 n/a.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
@keivenchang keivenchang force-pushed the keivenchang/DIS-1906__matrix-update-3-new-families branch from 905c817 to 3abad8a Compare May 7, 2026 18:56
@keivenchang keivenchang marked this pull request as draft May 8, 2026 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs documentation Improvements or additions to documentation size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants