feat(eval): score prompts and skill outputs, not just souls #248
Open
prakashUXtech wants to merge 2 commits into
Adds a third eval case mode, `prompt`, that scores a plain prompt or a skill's output instead of a seeded soul. A `mode: prompt` case skips the soul: nothing is birthed, the `seed` block is ignored, and the case's `message` goes straight to the scorer.

New optional `CaseInputs.reference` field holds the text a skill was originally given. When set, the `judge` scorer shows it to the LLM under its own "Reference input" heading so criteria can ask whether a candidate output improved on where it started.

No new scoring kind: prompt and skill outputs reuse `JudgeScoring`. No new CLI command or flag; `soul eval` runs a prompt-mode spec the same way it runs a soul spec.

New reference spec `tests/eval_examples/humanizer_skill.yaml` scores the workspace `/humanize` skill: one deterministic regex gate plus four judge cases checking a humanized rewrite shed its AI tells and kept the meaning.

Addresses qbtrix/paw-workspace#47.
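The dispatch described in this commit can be pictured with a minimal sketch. It is not the code in `eval/runner.py`: the trimmed-down `CaseInputs` dataclass, the `build_scorer_input` helper, the `ask_soul` callback, and the `"respond"` default are all assumptions made for illustration. Only the mode names and the `message` and `reference` fields come from the PR text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CaseInputs:
    # Hypothetical stand-in for the real schema class; the default mode is assumed.
    message: str
    mode: str = "respond"  # "respond", "recall", or "prompt"
    reference: Optional[str] = None


def build_scorer_input(inputs: CaseInputs, ask_soul: Callable[[str], str]) -> str:
    """Return the text that would be handed to the scorer for one case."""
    if inputs.mode == "prompt":
        # Prompt mode: no soul is involved, the message is scored as-is.
        return inputs.message
    # Soul-backed modes: the (hypothetical) callback stands in for the soul interaction.
    return ask_soul(inputs.message)
```

In this sketch a prompt-mode case never invokes the callback, which mirrors the "nothing is birthed" behaviour described above.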
Security scan: review needed

Potentially dangerous code patterns detected in changed files. A maintainer should verify these are intentional and safe.

Files: `src/soul_protocol/eval/runner.py`, `tests/test_eval/test_runner.py`
`score_judge` picked the reference judge template whenever `case.inputs.reference` was truthy, ignoring the mode. `CaseInputs.reference` is documented as prompt-mode-only, so a respond/recall case carrying `reference` would silently use a prompt that omits the user's message. Gate the field on `mode == "prompt"` so the docstring contract holds.

Also drop a sentence duplicated back-to-back in the CHANGELOG soul optimize / autoresearch entry, and add a regression test: a respond-mode case with `reference` set must still get the plain judge prompt.
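A rough sketch of the gating this commit describes, under stated assumptions: `build_judge_prompt`, its parameters, and every section title other than "Reference input" are hypothetical and do not reproduce the real `score_judge` templates.

```python
from typing import Optional


def build_judge_prompt(
    mode: str,
    criteria: str,
    candidate: str,
    user_message: Optional[str] = None,
    reference: Optional[str] = None,
) -> str:
    """Assemble a judge prompt, using the reference template only in prompt mode."""
    # The fix: a truthy reference alone is not enough, the mode must also be "prompt".
    use_reference = mode == "prompt" and bool(reference)
    if use_reference:
        sections = [("Reference input", reference), ("Candidate output", candidate)]
    else:
        # Plain template (hypothetical layout): keeps the user's message in view.
        sections = [("User message", user_message or ""), ("Response", candidate)]
    sections.append(("Criteria", criteria))
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```

With this shape, a respond-mode case that carries `reference` still gets the plain prompt, which is the property the regression test is meant to pin down.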
What

The eval framework only knew how to evaluate a seeded soul. This PR adds a third case mode, `prompt`, that scores a plain prompt or a skill's output instead. A `mode: prompt` case skips the soul: nothing gets birthed, the `seed` block is ignored, and the case's `message` goes straight to the scorer.

There's also a new optional field, `CaseInputs.reference`, that holds the text a skill was originally given. Set it and the `judge` scorer puts that text in front of the LLM under its own "Reference input" heading, so the criteria can ask whether a candidate output improved on where it started instead of judging it cold.

No new scoring kind: prompt and skill outputs go through the existing `JudgeScoring`. No new CLI command or flag either. `soul eval` runs a prompt-mode spec the same way it runs a soul spec.

The new reference spec `tests/eval_examples/humanizer_skill.yaml` scores the workspace `/humanize` skill. It has one deterministic `regex` gate that runs with no engine, plus four `judge` cases that check a humanized rewrite shed its AI tells and kept the meaning.

Changes:

- `eval/schema.py`: `CaseInputs.mode` gains `prompt`, plus a new optional `reference` field.
- `eval/runner.py`: `_run_case` dispatches the prompt mode, building the scorer input from `message` with no soul interaction.
- `eval/scoring.py`: `score_judge` adds a "Reference input" block to its prompt when the case carries `reference`.
- `tests/eval_examples/humanizer_skill.yaml`: new reference spec for the `/humanize` skill.
- `eval-format.md`, `cli-reference.md`, `api-reference.md`, `CHANGELOG.md`.

Why
Our workspace prompts and skills live in version control but never get evaluated. Edit `/humanize` and the change ships blind. This is the read side of a prompt-evaluation pair: a way to tell whether an edit to a tracked skill made its output better or worse. soul-protocol already had a YAML eval framework with LLM-as-judge scoring, so extending it beat building a separate harness.

Testing
- `uv run pytest tests/test_eval/ -q`: 79 passed.
- `uv run pytest tests/ -q`: 3071 passed, 1 skipped (a pre-existing benchmark skip), no regressions.
- `uv run pytest tests/test_optimize/ -q`: 46 passed. The optimize module shares this schema and runner, so this checks the change didn't break it.
- `soul eval tests/eval_examples/humanizer_skill.yaml` runs end to end. Exit 0 with no engine (the judge cases skip), exit 0 with a deterministic judge engine (all 5 cases pass).
- `uv run ruff check` clean on all changed files.

Linked issue

Addresses qbtrix/paw-workspace#47. That issue lives in a different repo, so GitHub can't auto-close it from this PR. It will be closed by hand when this merges.