Write incremental results after each task completion#93
Merged
olearycrew merged 6 commits into pinchbench:main on Apr 6, 2026
Conversation
Session transcripts were deleted between tasks by cleanup_agent_sessions,
making post-run debugging impossible. Now transcripts are copied to
results/{run_id}_transcripts/{task_id}.jsonl before cleanup.
Also fixes a pre-existing duplicate _remove_readonly function definition
that caused a SyntaxError on import.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
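The transcript-archival step described above might look roughly like the sketch below. `archive_transcripts` and its arguments are hypothetical names for illustration, not the actual helpers in this repository:

```python
import shutil
from pathlib import Path

def archive_transcripts(run_id: str, task_id: str,
                        session_dir: Path, results_dir: Path) -> Path:
    """Copy a task's session transcript into the results tree before
    cleanup_agent_sessions deletes it, so post-run debugging stays possible."""
    dest_dir = results_dir / f"{run_id}_transcripts"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{task_id}.jsonl"
    src = session_dir / f"{task_id}.jsonl"
    if src.exists():
        # copy2 preserves timestamps, which helps correlate with run logs
        shutil.copy2(src, dest)
    return dest
```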
When --judge is specified with a model ID, the judge calls the model API directly instead of running an OpenClaw agent session. This avoids OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only grading instructions, which previously caused all llm_judge tasks to score 0.

Supported model prefixes:
- openrouter/* -> OpenRouter API (OPENROUTER_API_KEY)
- anthropic/* -> Anthropic Messages API (ANTHROPIC_API_KEY)
- openai/* -> OpenAI chat completions (OPENAI_API_KEY)
- claude -> headless Claude CLI (claude -p)

Without --judge, behavior is unchanged (OpenClaw agent session).

Also fixes a pre-existing duplicate _remove_readonly function definition in lib_agent.py that caused an IndentationError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The function was defined twice on consecutive lines with the second definition shadowing the first. Also removed an extra bare func(path) call outside the try/except block.
Update the result JSON after every task finishes grading so external tools can poll progress while the benchmark is still running. The partial result includes in_progress=true, completed_tasks, and total_tasks fields. The final write at the end overwrites without these fields.
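The incremental-write behavior described above could look like the following sketch. The function name and signature are assumptions for illustration; the write-to-temp-then-rename step is an extra precaution so pollers never read a half-written file:

```python
import json
from pathlib import Path

def write_incremental_results(result_path: Path, results: dict,
                              completed: int, total: int) -> None:
    """Write a partial result JSON that external tools can poll.

    Progress fields are added on every partial write; the final write
    at the end of the run would omit them.
    """
    partial = dict(results)
    partial.update(in_progress=True,
                   completed_tasks=completed,
                   total_tasks=total)
    tmp = result_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(partial, indent=2))
    tmp.replace(result_path)  # atomic rename: readers see old or new, never torn
```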
ScuttleBot
reviewed
Apr 6, 2026
ScuttleBot review 🦀
Incremental results are a big UX win. Watching a benchmark run with no feedback for 30+ minutes is painful.
What's good:
- `in_progress`, `completed_tasks`, and `total_tasks` fields make progress polling trivial
- Transcript archival to `{run_id}_transcripts/` is great for post-run debugging
- Final write strips the progress fields cleanly
- README update documents the new behavior

Concerns:
- Overlap with #87: both PRs modify `lib_agent.py` and `lib_grading.py` significantly, so the diffs will conflict. Recommend merging one and rebasing the other.
- `_write_incremental_results()` is called in a loop; if the result JSON gets large (many tasks, big grades), this could add I/O overhead. Probably fine in practice, but worth noting.
- The `output_dir` param added to `run_task()`: is it threaded through correctly everywhere?

Nice feature, needs coordination with #87.
Summary
Partial result writes include the `in_progress: true`, `completed_tasks`, and `total_tasks` fields. This enables dashboards and monitoring tools to display per-task progress while a benchmark run is still in progress.