feat(eval): score prompts and skill outputs, not just souls #248

Open
prakashUXtech wants to merge 2 commits into dev from feat/prompt-eval

@prakashUXtech
Contributor

What

The eval framework only knew how to evaluate a seeded soul. This PR adds a third case mode, prompt, that scores a plain prompt or a skill's output instead. A mode: prompt case skips the soul: nothing gets birthed, the seed block is ignored, and the case's message goes straight to the scorer.

There's also a new optional field, CaseInputs.reference, that holds the text a skill was originally given. Set it and the judge scorer puts that text in front of the LLM under its own "Reference input" heading, so the criteria can ask whether a candidate output improved on where it started instead of judging it cold.
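A minimal sketch of how a judge prompt could place the reference text ahead of the candidate. The function name `build_judge_prompt`, the "Candidate output" and "Criteria" headings, and the overall layout are assumptions for illustration; only the "Reference input" heading comes from this PR, and the real template lives in `score_judge`.

```python
from typing import Optional


def build_judge_prompt(candidate: str, criteria: str, reference: Optional[str] = None) -> str:
    """Assemble a judge prompt. Hypothetical layout, not the real score_judge template."""
    sections = []
    if reference:
        # The text the skill was originally given goes first, under its own heading.
        sections.append(f"Reference input:\n{reference}")
    sections.append(f"Candidate output:\n{candidate}")
    sections.append(f"Criteria:\n{criteria}")
    return "\n\n".join(sections)


prompt = build_judge_prompt(
    candidate="We shipped the fix on Friday.",
    criteria="Does the candidate read more naturally than the reference while keeping its meaning?",
    reference="It is important to note that the fix was successfully shipped on Friday.",
)
print(prompt)
```

With no `reference`, the same function produces a prompt that judges the candidate cold, which matches the behavior before this field existed.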

No new scoring kind: prompt and skill outputs go through the existing JudgeScoring. No new CLI command or flag either. soul eval runs a prompt-mode spec the same way it runs a soul spec.

The new reference spec tests/eval_examples/humanizer_skill.yaml scores the workspace /humanize skill. It has one deterministic regex gate that runs with no engine, plus four judge cases that check a humanized rewrite shed its AI tells and kept the meaning.
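For a sense of what a deterministic regex gate looks like, here is a rough sketch. The phrase list and the `regex_gate` helper are hypothetical; the actual pattern in humanizer_skill.yaml is not reproduced here. The point is only that this kind of check needs no engine: it is a pure function of the text.

```python
import re

# Hypothetical "AI tell" phrases, chosen for illustration only.
AI_TELLS = re.compile(
    r"\b(?:delve|tapestry|it is important to note)\b",
    re.IGNORECASE,
)


def regex_gate(text: str) -> bool:
    """Return True when the text contains none of the flagged phrases."""
    return AI_TELLS.search(text) is None


print(regex_gate("We shipped the fix on Friday."))
print(regex_gate("Let's delve into the rich tapestry of deployment."))
```

Because the gate is deterministic, it can run in CI on every edit to the skill, while the judge cases wait for an engine.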

Changes:

  • eval/schema.py: CaseInputs.mode gains prompt, plus a new optional reference field.
  • eval/runner.py: _run_case dispatches the prompt mode, building the scorer input from message with no soul interaction.
  • eval/scoring.py: score_judge adds a "Reference input" block to its prompt when the case carries reference.
  • tests/eval_examples/humanizer_skill.yaml: new reference spec for the /humanize skill.
  • Docs: eval-format.md, cli-reference.md, api-reference.md, CHANGELOG.md.
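The schema and runner changes can be pictured roughly as follows. This is a sketch under assumptions: it uses a plain dataclass rather than whatever model class `eval/schema.py` really defines, and `build_scorer_input` and `run_with_soul` are hypothetical names standing in for the logic inside `_run_case`. The mode names `respond`, `recall`, and `prompt` are the ones this PR mentions.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Literal, Optional


@dataclass
class CaseInputs:
    """Illustrative stand-in for the real CaseInputs model."""
    mode: Literal["respond", "recall", "prompt"] = "respond"
    message: str = ""
    seed: dict = field(default_factory=dict)
    reference: Optional[str] = None  # only meaningful in prompt mode


async def build_scorer_input(
    inputs: CaseInputs,
    run_with_soul: Callable[[CaseInputs], Awaitable[str]],
) -> str:
    """Return the text handed to the scorer for one case."""
    if inputs.mode == "prompt":
        # Prompt mode: no soul is birthed and the seed block is ignored.
        return inputs.message
    # Soul-backed modes delegate to the (hypothetical) soul path.
    return await run_with_soul(inputs)


calls = []


async def fake_soul_run(inputs: CaseInputs) -> str:
    """Test double that records which modes reached the soul path."""
    calls.append(inputs.mode)
    return f"soul reply to: {inputs.message}"


prompt_text = asyncio.run(
    build_scorer_input(
        CaseInputs(mode="prompt", message="rewritten text", seed={"ignored": True}),
        fake_soul_run,
    )
)
soul_text = asyncio.run(
    build_scorer_input(CaseInputs(mode="respond", message="hello"), fake_soul_run)
)
```

Here the prompt-mode case never touches `fake_soul_run`, which is the property the smoke test in this PR checks for the real runner.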

Why

Our workspace prompts and skills live in version control but never get evaluated. Edit /humanize and the change ships blind. This is the read side of a prompt-evaluation pair: a way to tell whether an edit to a tracked skill made its output better or worse. soul-protocol already had a YAML eval framework with LLM-as-judge scoring, so extending it beat building a separate harness.

Testing

  • uv run pytest tests/test_eval/ -q: 79 passed.
  • uv run pytest tests/ -q: 3071 passed, 1 skipped (a pre-existing benchmark skip), no regressions.
  • uv run pytest tests/test_optimize/ -q: 46 passed. The optimize module shares this schema and runner, so this checks the change didn't break it.
  • Smoke: the new prompt mode parses through the schema, and the runner dispatches it without error, with and without an engine.
  • e2e: soul eval tests/eval_examples/humanizer_skill.yaml runs end to end. Exit 0 with no engine (the judge cases skip), exit 0 with a deterministic judge engine (all 5 cases pass).
  • uv run ruff check clean on all changed files.

Linked issue

Addresses qbtrix/paw-workspace#47. That issue lives in a different repo, so GitHub can't auto-close it from this PR. It will be closed by hand when this merges.

Adds a third eval case mode, `prompt`, that scores a plain prompt or a
skill's output instead of a seeded soul. A `mode: prompt` case skips the
soul: nothing is birthed, the `seed` block is ignored, and the case's
`message` goes straight to the scorer.

New optional `CaseInputs.reference` field holds the text a skill was
originally given. When set, the `judge` scorer shows it to the LLM under
its own "Reference input" heading so criteria can ask whether a candidate
output improved on where it started.

No new scoring kind — prompt and skill outputs reuse `JudgeScoring`. No
new CLI command or flag; `soul eval` runs a prompt-mode spec the same way
it runs a soul spec.

New reference spec tests/eval_examples/humanizer_skill.yaml scores the
workspace /humanize skill: one deterministic regex gate plus four judge
cases checking a humanized rewrite shed its AI tells and kept the meaning.

Addresses qbtrix/paw-workspace#47.
@github-actions

github-actions Bot commented May 21, 2026

Security scan: review needed

Potentially dangerous code patterns detected in changed files. A maintainer should verify these are intentional and safe.

src/soul_protocol/eval/runner.py

333:async def run_eval(
411:    return await run_eval(spec, engine=engine, case_filter=case_filter)

tests/test_eval/test_runner.py

88:    result = await run_eval(spec)
104:    result = await run_eval(spec)
120:    result = await run_eval(spec)
142:    result = await run_eval(spec)
162:    result = await run_eval(spec)
177:    result = await run_eval(spec)
188:    result = await run_eval(spec)
199:    result = await run_eval(spec)
209:    result = await run_eval(spec)
220:    result = await run_eval(spec)
232:    result = await run_eval(spec)
245:    result = await run_eval(spec, engine=engine)
257:    result = await run_eval(spec, engine=engine)
269:    result = await run_eval(spec)
279:    result = await run_eval(spec)
290:    result = await run_eval(spec)
322:    result = await run_eval(spec)
347:    result = await run_eval(spec, case_filter="alpha")
377:    result = await run_eval(spec)
397:    result = await run_eval(spec)
436:    result = await run_eval(spec)
454:    result = await run_eval(spec)
469:    result = await run_eval(spec, engine=engine)
491:    result = await run_eval(spec, engine=engine)

score_judge picked the reference judge template whenever
case.inputs.reference was truthy, ignoring the mode. CaseInputs.reference
is documented as prompt-mode-only, so a respond/recall case carrying
reference would silently use a prompt that omits the user's message.
Gate the field on mode == "prompt" so the docstring contract holds.

Also drop a sentence duplicated back-to-back in the CHANGELOG soul
optimize / autoresearch entry, and add a regression test: a respond-mode
case with reference set must still get the plain judge prompt.
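The gate described in this commit amounts to a two-condition check. A minimal sketch, assuming two module-level templates and a hypothetical `select_template` helper; the real template text and the selection code inside `score_judge` may differ.

```python
from typing import Optional

# Hypothetical template bodies. Per the commit message, the reference
# template omits the user's message, so it must not be used outside prompt mode.
PLAIN_TEMPLATE = "User message:\n{message}\n\nResponse:\n{output}\n\nCriteria:\n{criteria}"
REFERENCE_TEMPLATE = (
    "Reference input:\n{reference}\n\nCandidate output:\n{output}\n\nCriteria:\n{criteria}"
)


def select_template(mode: str, reference: Optional[str]) -> str:
    """Pick the reference template only for prompt-mode cases that carry a reference."""
    if mode == "prompt" and reference:
        return REFERENCE_TEMPLATE
    return PLAIN_TEMPLATE


# Regression check in the spirit of the added test: a respond-mode case
# with reference set must still get the plain judge prompt.
assert select_template("respond", "original text") is PLAIN_TEMPLATE
```

Before the fix, the equivalent check tested only whether `reference` was truthy, so the first branch fired for respond and recall cases too.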
@github-actions

github-actions Bot commented May 21, 2026

Security scan: review needed

Potentially dangerous code patterns detected in changed files. A maintainer should verify these are intentional and safe.

src/soul_protocol/eval/runner.py

333:async def run_eval(
411:    return await run_eval(spec, engine=engine, case_filter=case_filter)

tests/test_eval/test_runner.py

92:    result = await run_eval(spec)
108:    result = await run_eval(spec)
124:    result = await run_eval(spec)
146:    result = await run_eval(spec)
166:    result = await run_eval(spec)
181:    result = await run_eval(spec)
192:    result = await run_eval(spec)
203:    result = await run_eval(spec)
213:    result = await run_eval(spec)
224:    result = await run_eval(spec)
236:    result = await run_eval(spec)
249:    result = await run_eval(spec, engine=engine)
261:    result = await run_eval(spec, engine=engine)
273:    result = await run_eval(spec)
283:    result = await run_eval(spec)
294:    result = await run_eval(spec)
326:    result = await run_eval(spec)
351:    result = await run_eval(spec, case_filter="alpha")
381:    result = await run_eval(spec)
401:    result = await run_eval(spec)
440:    result = await run_eval(spec)
458:    result = await run_eval(spec)
473:    result = await run_eval(spec, engine=engine)
495:    result = await run_eval(spec, engine=engine)
522:    result = await run_eval(spec, engine=engine)
