
Commit 5ae93a3

christso and claude authored
feat: code-grader plain-text fallback + workspace env preflight (#1209)
* feat: add shell grader and workspace env preflight checks (#1207, #1208)

Adds two new eval features:

**Shell grader** (`type: shell`): runs a shell command and checks its stdout.
- No `expected`: passes when exit code is 0
- `expected` with no `operator`: exact string match (trimmed stdout)
- `expected` + `operator` (>, <, >=, <=, ==, !=): numeric float comparison

**Workspace env preflight** (`workspace.env`): declares required system dependencies that are checked once before before_all hooks run. Fails fast with a clear diagnostic listing all missing commands/modules.

Example:
```yaml
workspace:
  env:
    required_commands: [ffmpeg, pandoc]
    required_python_modules: [PIL, openai]
assertions:
  - type: shell
    command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
    operator: ">="
    expected: "5"
```

Closes #1207, #1208

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve lint errors in shell grader and targets-validator imports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: replace shell grader with code-grader plain-text fallback (#1210)

Per design review: the `shell` grader type violated the "audit existing primitives first" principle — `code-grader` already runs shell commands. Promptfoo solves this the same way (javascript/python fallbacks, no dedicated shell type).

Remove the `shell` grader type entirely and instead extend `code-grader` to accept plain-text stdout without requiring the JSON protocol:

| stdout (trimmed, case-insensitive) | score |
|---|---|
| empty string | 1 if exit 0, 0 if exit non-zero |
| "true", "pass", "1" | 1 |
| "false", "fail", "0" | 0 |
| numeric string | clamped float |
| anything else | 1 if exit 0, 0 if exit non-zero |

Scripts that write to stderr on non-zero exit still surface as errors (existing behavior). Silent non-zero exits (e.g. `[ "$pages" -ge 5 ]`) use exit-code convention.

Usage:

    # numeric comparison via exit code
    - type: code-grader
      command: ["bash", "-c", "[ $(pdfinfo report.pdf | grep Pages | awk '{print $2}') -ge 5 ]"]

    # score from stdout
    - type: code-grader
      command: ["bash", "-c", "echo 0.75"]

Closes #1210

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix biome formatting in code-grader

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: simplify code-grader plain-text fallback to exit-code + assertion text

Replace the string/numeric score interpretation with a clean two-convention model:

- Exit code: 0 = score 1 (pass), non-zero = score 0 (fail)
- Stdout: becomes the assertion text (human-readable context for the result)
- Stderr on non-zero exit: still surfaces as an error

For numeric scores or multi-aspect results, use the JSON protocol. This removes the "0"/"1"/numeric string ambiguity and aligns with how Unix tooling (bats, make, shell builtins) already signals pass/fail.

Updates docs and tests to reflect the new model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix biome formatting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 47028ea commit 5ae93a3

6 files changed

Lines changed: 293 additions & 32 deletions

apps/web/src/content/docs/docs/graders/code-graders.mdx

Lines changed: 43 additions & 2 deletions
@@ -9,7 +9,7 @@ Code graders are scripts that evaluate agent responses deterministically. Write
 
 ## Contract
 
-Code graders communicate via stdin/stdout JSON:
+Code graders receive eval context via stdin JSON and return a result via stdout.
 
 **Input (stdin):**
 ```json
@@ -19,8 +19,12 @@ Code graders communicate via stdin/stdout JSON:
   "output": "The answer is 42.",
   "expected_output": "42"
 }
+```
+
+### JSON output (full protocol)
+
+Emit a JSON object for numeric scores or multi-aspect results:
 
-**Output (stdout):**
 ```json
 {
   "score": 1.0,
@@ -35,6 +39,43 @@ Code graders communicate via stdin/stdout JSON:
 | `score` | `number` | 0.0 to 1.0 |
 | `assertions` | `Array<{ text, passed, evidence? }>` | Per-aspect results with verdict and optional evidence |
 
+### Plain-text output (exit-code convention)
+
+For simple pass/fail checks, skip the JSON protocol entirely. The exit code determines the score and stdout becomes the assertion text:
+
+| Exit code | Score | Verdict |
+|-----------|-------|---------|
+| 0 | 1.0 | pass |
+| non-zero (no stderr) | 0.0 | fail |
+
+```bash
+#!/bin/bash
+# check-pages.sh — passes when PDF has at least 5 pages
+pages=$(pdfinfo report.pdf | grep Pages | awk '{print $2}')
+if [ "$pages" -ge 5 ]; then
+  echo "PDF has $pages pages (≥5 required)"
+else
+  echo "PDF has only $pages pages (<5 required)"
+  exit 1
+fi
+```
+
+```yaml
+assertions:
+  - type: code-grader
+    command: [bash, scripts/check-pages.sh]
+```
+
+Silent one-liners work too — stdout is optional:
+
+```yaml
+assertions:
+  - type: code-grader
+    command: ["bash", "-c", "[ $(wc -l < output.txt) -ge 10 ]"]
+```
+
+Scripts that write to stderr and exit non-zero surface as execution errors rather than quality failures.
+
 ## Python Example
 
 ```python

packages/core/src/evaluation/graders/code-grader.ts

Lines changed: 75 additions & 29 deletions
@@ -212,6 +212,8 @@ export class CodeGrader implements Grader {
 
     try {
       let stdout: string;
+      let exitCode = 0;
+      let execStderr = '';
       if (context.dockerConfig) {
         // Docker execution mode: run grader inside a container
         const { DockerWorkspaceProvider } = await import('../workspace/docker-workspace.js');
@@ -221,40 +223,68 @@ export class CodeGrader implements Grader {
           stdin: inputPayload,
           repoCheckouts: getRepoCheckoutTargets(context.evalCase.workspace?.repos),
         });
-        if (result.exitCode !== 0) {
-          const trimmedErr = result.stderr.trim();
-          throw new Error(
-            trimmedErr.length > 0
-              ? `Code evaluator exited with code ${result.exitCode}: ${trimmedErr}`
-              : `Code evaluator exited with code ${result.exitCode}`,
-          );
-        }
+        exitCode = result.exitCode;
         stdout = result.stdout.trim();
+        execStderr = result.stderr;
       } else {
-        stdout = await executeScript(
+        const result = await runScriptRaw(
           this.command,
           inputPayload,
           this.agentTimeoutMs,
           this.cwd,
           env,
         );
+        exitCode = result.exitCode;
+        stdout = result.stdout.trim();
+        execStderr = result.stderr;
+      }
+      // Non-zero exit with JSON stdout, or with stderr output, is treated as an error
+      // (script signaled failure through the protocol or wrote an error message).
+      // Non-zero exit with plain stdout and no stderr uses the exit-code convention —
+      // score 0 (fail), stdout becomes the assertion text.
+      const looksLikeJson = stdout.startsWith('{') || stdout.startsWith('[');
+      const hasStderr = execStderr.trim().length > 0;
+      if (exitCode !== 0 && (looksLikeJson || hasStderr)) {
+        const trimmedErr = formatStderr(execStderr);
+        throw new Error(
+          trimmedErr.length > 0
+            ? `Code evaluator exited with code ${exitCode}: ${trimmedErr}`
+            : `Code evaluator exited with code ${exitCode}`,
+        );
       }
-      const parsed = parseJsonSafe(stdout);
-      const score = clampScore(typeof parsed?.score === 'number' ? parsed.score : 0);
-      const assertions: AssertionEntry[] = Array.isArray(parsed?.assertions)
-        ? parsed.assertions
-            .filter(
-              (a: unknown): a is { text: string; passed: boolean; evidence?: string } =>
-                typeof a === 'object' &&
-                a !== null &&
-                typeof (a as Record<string, unknown>).text === 'string',
-            )
-            .map((a) => ({
-              text: String(a.text),
-              passed: Boolean(a.passed),
-              ...(typeof a.evidence === 'string' ? { evidence: a.evidence } : {}),
-            }))
-        : [];
+      const rawParsed = parseJsonSafe(stdout);
+      // Only treat stdout as the JSON protocol if it parsed as a plain object.
+      // Bare JSON scalars (numbers, booleans, strings) fall through to the plain-text path.
+      const parsed =
+        rawParsed != null && typeof rawParsed === 'object' && !Array.isArray(rawParsed)
+          ? rawParsed
+          : undefined;
+      // Plain-text fallback: exit code is pass/fail, stdout is the assertion text.
+      // For numeric scores or multi-aspect results, use the JSON protocol instead.
+      const passed = exitCode === 0;
+      const score =
+        parsed != null
+          ? clampScore(typeof parsed.score === 'number' ? parsed.score : 0)
+          : passed
+            ? 1
+            : 0;
+      const assertions: AssertionEntry[] =
+        parsed != null && Array.isArray(parsed?.assertions)
+          ? parsed.assertions
+              .filter(
+                (a: unknown): a is { text: string; passed: boolean; evidence?: string } =>
+                  typeof a === 'object' &&
+                  a !== null &&
+                  typeof (a as Record<string, unknown>).text === 'string',
+              )
+              .map((a) => ({
+                text: String(a.text),
+                passed: Boolean(a.passed),
+                ...(typeof a.evidence === 'string' ? { evidence: a.evidence } : {}),
+              }))
+          : parsed == null
+            ? [{ text: stdout.trim() || (passed ? 'exit 0' : `exit ${exitCode}`), passed }]
+            : [];
       // Capture optional structured details from code judge output
       const details =
         parsed?.details && typeof parsed.details === 'object' && !Array.isArray(parsed.details)
@@ -325,17 +355,33 @@ export class CodeGrader implements Grader {
   }
 }
 
+/** Run a script and return raw stdout/stderr/exitCode without throwing. */
+async function runScriptRaw(
+  scriptPath: readonly string[] | string,
+  input: string,
+  agentTimeoutMs?: number,
+  cwd?: string,
+  env?: Record<string, string>,
+): Promise<{ stdout: string; stderr: string; exitCode: number }> {
+  return typeof scriptPath === 'string'
+    ? execShellWithStdin(scriptPath, input, { cwd, timeoutMs: agentTimeoutMs, env })
+    : execFileWithStdin(scriptPath, input, { cwd, timeoutMs: agentTimeoutMs, env });
+}
+
 export async function executeScript(
   scriptPath: readonly string[] | string,
   input: string,
   agentTimeoutMs?: number,
   cwd?: string,
   env?: Record<string, string>,
 ): Promise<string> {
-  const { stdout, stderr, exitCode } =
-    typeof scriptPath === 'string'
-      ? await execShellWithStdin(scriptPath, input, { cwd, timeoutMs: agentTimeoutMs, env })
-      : await execFileWithStdin(scriptPath, input, { cwd, timeoutMs: agentTimeoutMs, env });
+  const { stdout, stderr, exitCode } = await runScriptRaw(
+    scriptPath,
+    input,
+    agentTimeoutMs,
+    cwd,
+    env,
+  );
 
   if (exitCode !== 0) {
     const trimmedErr = formatStderr(stderr);
packages/core/src/evaluation/orchestrator.ts

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -958,6 +958,20 @@ export async function runEvaluation(
958958
setupLog('Docker image pull complete');
959959
}
960960

961+
// Run preflight environment checks (fail fast before any hooks or test cases)
962+
if (suiteWorkspace?.env) {
963+
try {
964+
await runPreflightChecks(suiteWorkspace.env, sharedWorkspacePath ?? undefined, setupLog);
965+
setupLog('preflight checks passed');
966+
} catch (error) {
967+
const message = error instanceof Error ? error.message : String(error);
968+
if (sharedWorkspacePath && !useStaticWorkspace) {
969+
await cleanupWorkspace(sharedWorkspacePath).catch(() => {});
970+
}
971+
throw new Error(message);
972+
}
973+
}
974+
961975
// Execute before_all (runs ONCE before first test per workspace)
962976
const suiteHooksEnabled = hooksEnabled(suiteWorkspace);
963977
const suiteBeforeAllHook = suiteWorkspace?.hooks?.before_all;
@@ -3924,3 +3938,45 @@ function computeWeightedMean(
39243938

39253939
return totalWeight > 0 ? weightedSum / totalWeight : 0;
39263940
}
3941+
3942+
/**
3943+
* Run preflight environment checks for workspace.env config.
3944+
* Fails fast if any required command or Python module is missing.
3945+
* Called once before before_all hooks, so long evals abort immediately on missing deps.
3946+
*/
3947+
async function runPreflightChecks(
3948+
env: import('./types.js').WorkspaceEnvConfig,
3949+
cwd: string | undefined,
3950+
log: (msg: string) => void,
3951+
): Promise<void> {
3952+
const execFileAsync = promisify(execFile);
3953+
const missing: string[] = [];
3954+
3955+
for (const cmd of env.required_commands ?? []) {
3956+
log(`preflight: checking command "${cmd}"`);
3957+
try {
3958+
if (process.platform === 'win32') {
3959+
await execFileAsync('where', [cmd], { cwd });
3960+
} else {
3961+
await execFileAsync('sh', ['-c', `command -v ${cmd}`], { cwd });
3962+
}
3963+
} catch {
3964+
missing.push(`command: ${cmd}`);
3965+
}
3966+
}
3967+
3968+
for (const mod of env.required_python_modules ?? []) {
3969+
log(`preflight: checking Python module "${mod}"`);
3970+
try {
3971+
await execFileAsync('python3', ['-c', `import ${mod}`], { cwd });
3972+
} catch {
3973+
missing.push(`python module: ${mod}`);
3974+
}
3975+
}
3976+
3977+
if (missing.length > 0) {
3978+
throw new Error(
3979+
`Preflight checks failed — missing dependencies:\n${missing.map((m) => ` • ${m}`).join('\n')}\n\nInstall the missing dependencies before running this eval.`,
3980+
);
3981+
}
3982+
}
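
The preflight function above collects every missing dependency before throwing once, rather than stopping at the first failure. A minimal sketch of that collect-then-fail pattern follows. It assumes a synchronous, injected availability predicate so it can run without `ffmpeg` or `python3` installed; `EnvRequirements`, `Probe`, `findMissing`, and `assertPreflight` are hypothetical names, whereas the commit itself probes with `command -v`, `where`, and `python3 -c "import …"`.

```typescript
// Illustrative sketch only; names are hypothetical and the probe is injected.
type EnvRequirements = {
  required_commands?: readonly string[];
  required_python_modules?: readonly string[];
};

// Stand-in for the real process-spawning checks used in the commit.
type Probe = (kind: 'command' | 'python module', name: string) => boolean;

/** Check every requirement and return labels for the ones that are unavailable. */
function findMissing(env: EnvRequirements, isAvailable: Probe): string[] {
  const missing: string[] = [];
  for (const cmd of env.required_commands ?? []) {
    if (!isAvailable('command', cmd)) missing.push(`command: ${cmd}`);
  }
  for (const mod of env.required_python_modules ?? []) {
    if (!isAvailable('python module', mod)) missing.push(`python module: ${mod}`);
  }
  return missing;
}

/** Throw a single error that lists all missing dependencies at once. */
function assertPreflight(env: EnvRequirements, isAvailable: Probe): void {
  const missing = findMissing(env, isAvailable);
  if (missing.length > 0) {
    const list = missing.map((m) => `  - ${m}`).join('\n');
    throw new Error(`Preflight checks failed, missing dependencies:\n${list}`);
  }
}
```

Reporting the full list in one error means a user with several missing tools can install them all in one pass instead of rerunning the eval after each fix.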

packages/core/src/evaluation/types.ts

Lines changed: 21 additions & 0 deletions
@@ -339,6 +339,25 @@ export type DockerWorkspaceConfig = {
   readonly cpus?: number;
 };
 
+/**
+ * Preflight environment requirements for the workspace.
+ * Checked once before before_all hooks run. Fails fast if anything is missing.
+ *
+ * @example
+ * ```yaml
+ * workspace:
+ *   env:
+ *     required_commands: [ffmpeg, pandoc]
+ *     required_python_modules: [PIL, openai]
+ * ```
+ */
+export type WorkspaceEnvConfig = {
+  /** Shell commands that must be present in PATH (checked via `command -v`) */
+  readonly required_commands?: readonly string[];
+  /** Python modules that must be importable (checked via `python3 -c "import <module>"`) */
+  readonly required_python_modules?: readonly string[];
+};
+
 export type WorkspaceConfig = {
   /** Template directory or .code-workspace file. Directories are copied to temp workspace.
    * .code-workspace files are used by VS Code providers; CLI providers use the parent directory. */
@@ -359,6 +378,8 @@ export type WorkspaceConfig = {
    * Used as default cwd for hook commands so that file-referenced templates resolve
    * relative paths from their own directory, not the eval file's directory. */
   readonly workspaceFileDir?: string;
+  /** Preflight environment requirements. Checked before before_all hooks run. */
+  readonly env?: WorkspaceEnvConfig;
 };
 
 export type CodeGraderConfig = {

packages/core/src/evaluation/yaml-parser.ts

Lines changed: 23 additions & 1 deletion
@@ -54,6 +54,7 @@ import type {
   TrialsConfig,
   TurnFailurePolicy,
   WorkspaceConfig,
+  WorkspaceEnvConfig,
   WorkspaceHookConfig,
   WorkspaceHooksConfig,
   WorkspaceScriptConfig,
@@ -853,8 +854,9 @@ function parseWorkspaceConfig(raw: unknown, evalFileDir: string): WorkspaceConfi
   const mode = explicitMode ?? (workspacePath ? 'static' : undefined);
 
   const docker = parseDockerWorkspaceConfig(obj.docker);
+  const env = parseWorkspaceEnvConfig(obj.env);
 
-  if (!template && !isolation && !repos && !hooks && !mode && !workspacePath && !docker)
+  if (!template && !isolation && !repos && !hooks && !mode && !workspacePath && !docker && !env)
     return undefined;
 
   return {
@@ -865,6 +867,26 @@ function parseWorkspaceConfig(raw: unknown, evalFileDir: string): WorkspaceConfi
     ...(mode !== undefined && { mode }),
     ...(workspacePath !== undefined && { path: workspacePath }),
     ...(docker !== undefined && { docker }),
+    ...(env !== undefined && { env }),
+  };
+}
+
+function parseWorkspaceEnvConfig(raw: unknown): WorkspaceEnvConfig | undefined {
+  if (!isJsonObject(raw)) return undefined;
+  const obj = raw as Record<string, unknown>;
+
+  const required_commands = Array.isArray(obj.required_commands)
+    ? (obj.required_commands.filter((c) => typeof c === 'string') as string[])
+    : undefined;
+  const required_python_modules = Array.isArray(obj.required_python_modules)
+    ? (obj.required_python_modules.filter((m) => typeof m === 'string') as string[])
+    : undefined;
+
+  if (!required_commands?.length && !required_python_modules?.length) return undefined;
+
+  return {
+    ...(required_commands?.length && { required_commands }),
+    ...(required_python_modules?.length && { required_python_modules }),
   };
 }
 
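
The normalization in `parseWorkspaceEnvConfig` has two behaviours that are easy to miss in the diff: non-string list entries are dropped silently, and a config with no usable entries collapses to `undefined`. A minimal standalone sketch of those rules follows. `WorkspaceEnv`, `stringList`, and `parseEnv` are hypothetical names, and an inline object check stands in for the repo's `isJsonObject`, whose exact semantics are assumed here.

```typescript
// Illustrative sketch only; names are hypothetical.
type WorkspaceEnv = {
  required_commands?: string[];
  required_python_modules?: string[];
};

/** Keep only string entries; anything that is not an array yields an empty list. */
function stringList(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === 'string')
    : [];
}

function parseEnv(raw: unknown): WorkspaceEnv | undefined {
  // Assumed equivalent of isJsonObject: a non-null, non-array object.
  if (raw === null || typeof raw !== 'object' || Array.isArray(raw)) return undefined;
  const obj = raw as Record<string, unknown>;

  const commands = stringList(obj.required_commands);
  const modules = stringList(obj.required_python_modules);

  // Nothing usable declared: behave as if `env` was never set.
  if (commands.length === 0 && modules.length === 0) return undefined;

  return {
    ...(commands.length > 0 ? { required_commands: commands } : {}),
    ...(modules.length > 0 ? { required_python_modules: modules } : {}),
  };
}
```

Under these rules an entry such as an unquoted number in `required_commands` is ignored rather than reported, so no preflight check runs for it.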