feat(eval): model-judgment eval harness — fixture defects + grader + ensemble-eval skill#19
Merged
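…ensemble-eval skill

The last piece of the test pyramid: model judgment quality (reviewer detection rate + apply_fixes revisions). The deterministic surface is already fully covered by test/; this piece goes through eval: a fixture with planted defects → K real ensemble runs → tolerance assertions. Run manually, never in CI.

Added:
- eval/fixtures/stats-paper/: a synthetic paper with 4 planted defects (fabricated reference, wrong mean 5.42/4.42, wrong t 6.34/2.79, inconsistent N 120/102) + ground-truth CSV + manifest.
- bin/pai-eval-grade: the deterministic grader for the eval (detect: K-run tolerance aggregation / fix: revision verification; integrity findings are excluded from hits). test/pai-eval-grade.test.mjs (11).
- skills/ensemble-eval/SKILL.md: dev-tool skill. Iron rules: the fixture is read-only, the reviewer context is neutral (no eval wording leaks, to prevent prompted recall), never in CI.

End-to-end smoke (1 real ensemble run, codex off): all 5 agents completed, 42 findings, 4/4 defects hit (each corroborated by 3-4 lenses), grader exit 0.

bump 2.16.0 → 2.17.0 (plugin.json + marketplace.json).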
Summary
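This PR completes the last piece of the test pyramid: model judgment quality (whether the reviewer catches real defects, and whether apply_fixes revises the manuscript correctly). The deterministic surface is already fully covered by test/ (bats 59 + node 36). This piece is non-deterministic, expensive, and slow, so it is a poor fit for unit tests and goes through an eval harness instead: a fixture paper with deliberately planted defects → K real ensemble runs → tolerance assertions (each defect is caught at least minHits times). It is run manually and occasionally, never in CI.
Added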
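- eval/fixtures/stats-paper/: analysis/results.csv + manifest.json (match patterns / fix checks).
- bin/pai-eval-grade: detect = tolerance aggregation over K runs (match_any/match_all are evaluated against the same finding, integrity findings are excluded from hits, minHits defaults to more than half of the runs); fix = revision verification (the planted text is gone + the corrected value is present).
- test/pai-eval-grade.test.mjs
- skills/ensemble-eval/SKILL.md (--apply-fix verifies the revision on a temp copy).
Key design disciplines (written into the skill's iron rules)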
test/.
End-to-end smoke verification (1 real ensemble run)
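academic profile, codex off, neutral context, K=1 + --min-hits 1: 4/4 defects hit, each corroborated by 3–4 lenses. Bonus: the DA even pointed out the "fabrication fingerprint" of the fake numbers ({5.42, 6.34, p<.001} are self-consistent as a set, while d=0.55 alone is a leftover from the real analysis, and back-deriving from it gives t=2.78≈ground truth 2.79). This was not a planted test point; the ensemble reasoned it out on its own.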
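The back-derivation from d to t can be checked by hand. The sketch below assumes an independent-samples t-test, a pooled-SD Cohen's d, and an even split of N across two groups; none of these is stated in this PR, and `tFromCohensD` is an illustrative helper, not something in the repo.

```javascript
// Assumed relation for two independent groups with a pooled-SD Cohen's d:
//   t = d * sqrt(n1 * n2 / (n1 + n2))
function tFromCohensD(d, n1, n2) {
  return d * Math.sqrt((n1 * n2) / (n1 + n2));
}

// Round to two decimals, the precision used in the PR text.
const round2 = (x) => Math.round(x * 100) / 100;

// Assumption: each candidate N is split evenly across the two groups.
const tAtN102 = round2(tFromCohensD(0.55, 51, 51));
const tAtN120 = round2(tFromCohensD(0.55, 60, 60));

console.log({ tAtN102, tAtN120 }); // { tAtN102: 2.78, tAtN120: 3.01 }
```

Under those assumptions N = 102 gives t ≈ 2.78, matching the DA's figure and close to the ground truth of 2.79, while N = 120 gives about 3.01. Neither is near the planted 6.34.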
Verification (deterministic part)
Version
2.16.0→2.17.0 (plugin.json + marketplace.json kept in sync). CHANGELOG updated with [2.17.0].
Test pyramid (complete after this PR)
/ensemble-eval: manual
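For readers who want a concrete picture of the eval layer, the two grader modes described in the summary (detect and fix) might look roughly like the sketch below. It is not the code in bin/pai-eval-grade: the object shapes, the `category === 'integrity'` marker, the use of regex sources for patterns, and the function names `gradeDetect` and `gradeFix` are all assumptions made for illustration. Only the match_any/match_all names, the majority default for minHits, and the planted/corrected rule come from the PR text.

```javascript
// Hypothetical shapes for illustration only (not the real manifest schema):
//   defect  = { id, match_any?: string[], match_all?: string[] }  // regex sources
//   finding = { text: string, category?: string }
//   runs    = K arrays of findings, one per ensemble run

// True when ONE finding satisfies the defect's patterns: at least one
// match_any pattern (if given) and every match_all pattern (if given).
function findingMatches(defect, finding) {
  const any = defect.match_any ?? [];
  const all = defect.match_all ?? [];
  if (any.length === 0 && all.length === 0) return false; // no patterns, no hit
  const test = (src) => new RegExp(src, 'i').test(finding.text);
  return (any.length === 0 || any.some(test)) && all.every(test);
}

// detect: count the runs in which some non-integrity finding matches.
function gradeDetect(defects, runs, minHits) {
  const k = runs.length;
  const threshold = minHits ?? Math.floor(k / 2) + 1; // default: more than half
  const perDefect = defects.map((defect) => {
    const hits = runs.filter((findings) =>
      findings
        .filter((f) => f.category !== 'integrity') // assumed integrity marker
        .some((f) => findingMatches(defect, f))
    ).length;
    return { id: defect.id, hits, pass: hits >= threshold };
  });
  return { k, minHits: threshold, perDefect, pass: perDefect.every((d) => d.pass) };
}

// fix: the planted text must be gone and the corrected value present.
function gradeFix(checks, revisedText) {
  const results = checks.map((c) => ({
    id: c.id,
    plantedGone: !revisedText.includes(c.planted),
    correctedPresent: revisedText.includes(c.corrected),
  }));
  return { results, pass: results.every((r) => r.plantedGone && r.correctedPresent) };
}

// Made-up findings; only the numbers 5.42, 4.42 and 6.34 come from the PR text.
const defects = [
  { id: 'wrong-mean', match_all: ['mean', '5\\.42'] },
  { id: 'wrong-t', match_any: ['6\\.34', 't-statistic'] },
];
const runs = [
  [{ text: 'Reported mean 5.42 does not match the CSV.' }, { text: 't = 6.34 is implausible.' }],
  [{ text: 'The mean looks off.' }, { text: 'Check the t-statistic.' }],
  [{ text: 'Mean of 5.42 conflicts with data.', category: 'integrity' }, { text: 'No issues.' }],
];

const detectReport = gradeDetect(defects, runs);   // K = 3, so minHits defaults to 2
const smokeReport = gradeDetect(defects, runs, 1); // explicit threshold, like --min-hits 1
const fixReport = gradeFix(
  [{ id: 'wrong-mean', planted: 'M = 5.42', corrected: 'M = 4.42' }],
  'The treatment group scored higher (M = 4.42).'
);

console.log(detectReport.perDefect, smokeReport.pass, fixReport.pass);
```

In this toy data the wrong-mean defect is hit in only one of three runs (the second run never mentions 5.42 in the same finding, and the third run's match is an integrity finding), so the default threshold fails it while an explicit threshold of 1 passes. Plain substring checks in `gradeFix` are a simplification; a real check would likely need to guard against partial numeric matches.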