feat(eval): score prompts and skill outputs, not just souls by prakashUXtech · Pull Request #248 · qbtrix/soul-protocol

prakashUXtech · 2026-05-21T09:16:56Z

What

The eval framework only knew how to evaluate a seeded soul. This PR adds a third case mode, `prompt`, that scores a plain prompt or a skill's output instead. A `mode: prompt` case skips the soul: nothing gets birthed, the `seed` block is ignored, and the case's `message` goes straight to the scorer.

There's also a new optional field, `CaseInputs.reference`, that holds the text a skill was originally given. Set it and the `judge` scorer puts that text in front of the LLM under its own "Reference input" heading, so the criteria can ask whether a candidate output improved on where it started instead of judging it cold.
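As a rough illustration of the "Reference input" heading, here is a minimal sketch of how a judge prompt could be assembled. The function name `build_judge_prompt`, the "Candidate output" and "Criteria" headings, and the section order are assumptions made for this example; only the "Reference input" heading comes from this PR.

```python
from typing import Optional


def build_judge_prompt(criteria: str, candidate: str,
                       reference: Optional[str] = None) -> str:
    """Assemble judge prompt text (hypothetical layout, not the real template)."""
    sections = []
    if reference:
        # The reference gets its own heading so the criteria can compare
        # the candidate against where the skill started.
        sections.append("Reference input:\n" + reference)
    sections.append("Candidate output:\n" + candidate)
    sections.append("Criteria:\n" + criteria)
    return "\n\n".join(sections)


# Judged cold: no reference, so no "Reference input" section appears.
cold = build_judge_prompt("Reads naturally.", "We fixed the bug.")

# Judged against the text the skill was originally given.
compared = build_judge_prompt(
    "Reads more naturally than the reference.",
    "We fixed the bug.",
    reference="We have successfully remediated the defect.",
)
```

With a reference present, the criteria can be phrased comparatively ("more natural than the reference") rather than in absolute terms.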

No new scoring kind: prompt and skill outputs go through the existing `JudgeScoring`. No new CLI command or flag either. `soul eval` runs a prompt-mode spec the same way it runs a soul spec.

The new reference spec tests/eval_examples/humanizer_skill.yaml scores the workspace /humanize skill. It has one deterministic regex gate that runs with no engine, plus four judge cases that check a humanized rewrite shed its AI tells and kept the meaning.
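To make the idea of a deterministic regex gate concrete, here is a small sketch that needs no engine. The phrase list and the names `AI_TELL_PATTERN` and `passes_regex_gate` are invented for illustration; the pattern actually used in humanizer_skill.yaml is not shown in this PR and may differ.

```python
import re

# Hypothetical "AI tell" phrases, chosen only for this example.
AI_TELL_PATTERN = re.compile(
    r"\b(delves?|tapestry|it is worth noting|in today's fast-paced world)\b",
    re.IGNORECASE,
)


def passes_regex_gate(text: str) -> bool:
    """Return True when the text contains none of the listed phrases."""
    return AI_TELL_PATTERN.search(text) is None


humanized = "We looked at the logs and found the bug."
untouched = "Let's delve into the rich tapestry of logging."

gate_results = (passes_regex_gate(humanized), passes_regex_gate(untouched))
```

A gate like this is cheap and repeatable, which is why it can run without a judge engine; the judge cases cover what a pattern match cannot, such as whether the meaning survived.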

Changes:

eval/schema.py: CaseInputs.mode gains prompt, plus a new optional reference field.
eval/runner.py: _run_case dispatches the prompt mode, building the scorer input from message with no soul interaction.
eval/scoring.py: score_judge adds a "Reference input" block to its prompt when the case carries reference.
tests/eval_examples/humanizer_skill.yaml: new reference spec for the /humanize skill.
Docs: eval-format.md, cli-reference.md, api-reference.md, CHANGELOG.md.
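The schema and runner changes listed above can be sketched roughly as follows. This is a simplified stand-in, not the code in eval/schema.py or eval/runner.py: the dataclass layout, the `build_scorer_input` helper, and the `ask_soul` callback are assumptions for illustration, and the real schema may use a different base class and defaults.

```python
from dataclasses import dataclass, field
from typing import Callable, Literal, Optional


@dataclass
class CaseInputs:
    # "respond" and "recall" are the soul modes; "prompt" is the new one.
    mode: Literal["respond", "recall", "prompt"]
    message: str
    reference: Optional[str] = None           # text the skill was originally given
    seed: dict = field(default_factory=dict)  # ignored in prompt mode


def build_scorer_input(inputs: CaseInputs,
                       ask_soul: Callable[[CaseInputs], str]) -> str:
    """Pick what the scorer sees (hypothetical helper)."""
    if inputs.mode == "prompt":
        # No soul is birthed; the message goes straight to the scorer.
        return inputs.message
    # Soul modes still go through the soul (stubbed here as a callback).
    return ask_soul(inputs)


soul_calls = []


def fake_soul(inputs: CaseInputs) -> str:
    soul_calls.append(inputs.mode)
    return "soul reply to: " + inputs.message


prompt_out = build_scorer_input(
    CaseInputs(mode="prompt", message="rewritten text", seed={"name": "x"}),
    fake_soul,
)
respond_out = build_scorer_input(
    CaseInputs(mode="respond", message="hello"), fake_soul
)
```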

Why

Our workspace prompts and skills live in version control but never get evaluated. Edit /humanize and the change ships blind. This is the read side of a prompt-evaluation pair: a way to tell whether an edit to a tracked skill made its output better or worse. soul-protocol already had a YAML eval framework with LLM-as-judge scoring, so extending it beat building a separate harness.

Testing

uv run pytest tests/test_eval/ -q: 79 passed.
uv run pytest tests/ -q: 3071 passed, 1 skipped (a pre-existing benchmark skip), no regressions.
uv run pytest tests/test_optimize/ -q: 46 passed. The optimize module shares this schema and runner, so this checks the change didn't break it.
Smoke: the new prompt mode parses through the schema, and the runner dispatches it without error, with and without an engine.
e2e: soul eval tests/eval_examples/humanizer_skill.yaml runs end to end. Exit 0 with no engine (the judge cases skip), exit 0 with a deterministic judge engine (all 5 cases pass).
uv run ruff check clean on all changed files.
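The two e2e outcomes (exit 0 with and without an engine) follow from skipped cases not counting as failures. A minimal sketch of that rule, with hypothetical names `case_status` and `exit_code`, and assuming a non-zero exit only when a case fails:

```python
from typing import Callable, Optional, Sequence


def case_status(kind: str, check: Callable[[], bool],
                engine: Optional[object]) -> str:
    """Judge cases need an engine; without one they skip instead of failing."""
    if kind == "judge" and engine is None:
        return "skipped"
    return "passed" if check() else "failed"


def exit_code(statuses: Sequence[str]) -> int:
    # Assumption for this sketch: only a failed case makes the run non-zero.
    return 1 if "failed" in statuses else 0


# One regex gate plus four judge cases, mirroring the shape of the spec.
cases = [("regex", lambda: True)] + [("judge", lambda: True)] * 4

no_engine = [case_status(kind, check, None) for kind, check in cases]
with_engine = [case_status(kind, check, object()) for kind, check in cases]
```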

Linked issue

Addresses qbtrix/paw-workspace#47. That issue lives in a different repo, so GitHub can't auto-close it from this PR. It will be closed by hand when this merges.

Adds a third eval case mode, `prompt`, that scores a plain prompt or a skill's output instead of a seeded soul. A `mode: prompt` case skips the soul: nothing is birthed, the `seed` block is ignored, and the case's `message` goes straight to the scorer.

New optional `CaseInputs.reference` field holds the text a skill was originally given. When set, the `judge` scorer shows it to the LLM under its own "Reference input" heading so criteria can ask whether a candidate output improved on where it started.

No new scoring kind: prompt and skill outputs reuse `JudgeScoring`. No new CLI command or flag; `soul eval` runs a prompt-mode spec the same way it runs a soul spec.

New reference spec tests/eval_examples/humanizer_skill.yaml scores the workspace /humanize skill: one deterministic regex gate plus four judge cases checking a humanized rewrite shed its AI tells and kept the meaning.

Addresses qbtrix/paw-workspace#47.

github-actions · 2026-05-21T09:17:09Z

Security scan: review needed

Potentially dangerous code patterns detected in changed files. A maintainer should verify these are intentional and safe.

src/soul_protocol/eval/runner.py

333:async def run_eval(
411:    return await run_eval(spec, engine=engine, case_filter=case_filter)

tests/test_eval/test_runner.py

88:    result = await run_eval(spec)
104:    result = await run_eval(spec)
120:    result = await run_eval(spec)
142:    result = await run_eval(spec)
162:    result = await run_eval(spec)
177:    result = await run_eval(spec)
188:    result = await run_eval(spec)
199:    result = await run_eval(spec)
209:    result = await run_eval(spec)
220:    result = await run_eval(spec)
232:    result = await run_eval(spec)
245:    result = await run_eval(spec, engine=engine)
257:    result = await run_eval(spec, engine=engine)
269:    result = await run_eval(spec)
279:    result = await run_eval(spec)
290:    result = await run_eval(spec)
322:    result = await run_eval(spec)
347:    result = await run_eval(spec, case_filter="alpha")
377:    result = await run_eval(spec)
397:    result = await run_eval(spec)
436:    result = await run_eval(spec)
454:    result = await run_eval(spec)
469:    result = await run_eval(spec, engine=engine)
491:    result = await run_eval(spec, engine=engine)

score_judge picked the reference judge template whenever case.inputs.reference was truthy, ignoring the mode. CaseInputs.reference is documented as prompt-mode-only, so a respond/recall case carrying reference would silently use a prompt that omits the user's message. Gate the field on mode == "prompt" so the docstring contract holds. Also drop a sentence duplicated back-to-back in the CHANGELOG soul optimize / autoresearch entry, and add a regression test: a respond-mode case with reference set must still get the plain judge prompt.
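The fix described above amounts to gating template choice on the mode as well as on the field. A minimal sketch, where `pick_judge_template` and the two template constants are hypothetical stand-ins for whatever score_judge uses internally:

```python
from typing import Optional

PLAIN_TEMPLATE = "plain"          # includes the user's message
REFERENCE_TEMPLATE = "reference"  # adds the "Reference input" block


def pick_judge_template(mode: str, reference: Optional[str]) -> str:
    """Use the reference template only for prompt-mode cases that carry one."""
    if mode == "prompt" and reference:
        return REFERENCE_TEMPLATE
    # respond/recall cases keep the plain prompt even if reference is set.
    return PLAIN_TEMPLATE


# Regression check in the spirit of the added test.
respond_with_ref = pick_judge_template("respond", "original text")
prompt_with_ref = pick_judge_template("prompt", "original text")
prompt_without_ref = pick_judge_template("prompt", None)
```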

github-actions · 2026-05-21T12:24:28Z

Security scan: review needed

Potentially dangerous code patterns detected in changed files. A maintainer should verify these are intentional and safe.

src/soul_protocol/eval/runner.py

333:async def run_eval(
411:    return await run_eval(spec, engine=engine, case_filter=case_filter)

tests/test_eval/test_runner.py

92:    result = await run_eval(spec)
108:    result = await run_eval(spec)
124:    result = await run_eval(spec)
146:    result = await run_eval(spec)
166:    result = await run_eval(spec)
181:    result = await run_eval(spec)
192:    result = await run_eval(spec)
203:    result = await run_eval(spec)
213:    result = await run_eval(spec)
224:    result = await run_eval(spec)
236:    result = await run_eval(spec)
249:    result = await run_eval(spec, engine=engine)
261:    result = await run_eval(spec, engine=engine)
273:    result = await run_eval(spec)
283:    result = await run_eval(spec)
294:    result = await run_eval(spec)
326:    result = await run_eval(spec)
351:    result = await run_eval(spec, case_filter="alpha")
381:    result = await run_eval(spec)
401:    result = await run_eval(spec)
440:    result = await run_eval(spec)
458:    result = await run_eval(spec)
473:    result = await run_eval(spec, engine=engine)
495:    result = await run_eval(spec, engine=engine)
522:    result = await run_eval(spec, engine=engine)

prakashUXtech wants to merge 2 commits into dev from feat/prompt-eval
