Write incremental results after each task completion#93
Merged
olearycrew merged 6 commits into pinchbench:main on Apr 6, 2026
Conversation
Session transcripts were deleted between tasks by cleanup_agent_sessions,
making post-run debugging impossible. Now transcripts are copied to
results/{run_id}_transcripts/{task_id}.jsonl before cleanup.
Also fixes a pre-existing duplicate _remove_readonly function definition
that caused a SyntaxError on import.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
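The transcript-archival step described above might look roughly like the sketch below. `archive_transcripts` and its arguments are hypothetical names for illustration, not the actual helpers in this repository:

```python
import shutil
from pathlib import Path

def archive_transcripts(run_id: str, task_id: str,
                        session_dir: Path, results_dir: Path) -> Path:
    """Copy a task's session transcript into the results tree before
    cleanup_agent_sessions deletes it, so post-run debugging stays possible."""
    dest_dir = results_dir / f"{run_id}_transcripts"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{task_id}.jsonl"
    src = session_dir / f"{task_id}.jsonl"
    if src.exists():
        # copy2 preserves timestamps, which helps correlate with run logs
        shutil.copy2(src, dest)
    return dest
```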
When --judge is specified with a model ID, the judge calls the model API directly instead of running an OpenClaw agent session. This avoids OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only grading instructions, which previously caused all llm_judge tasks to score 0.

Supported model prefixes:
- openrouter/* -> OpenRouter API (OPENROUTER_API_KEY)
- anthropic/* -> Anthropic Messages API (ANTHROPIC_API_KEY)
- openai/* -> OpenAI chat completions (OPENAI_API_KEY)
- claude -> headless Claude CLI (claude -p)

Without --judge, behavior is unchanged (OpenClaw agent session).

Also fixes a pre-existing duplicate _remove_readonly function definition in lib_agent.py that caused an IndentationError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The function was defined twice on consecutive lines with the second definition shadowing the first. Also removed an extra bare func(path) call outside the try/except block.
Update the result JSON after every task finishes grading so external tools can poll progress while the benchmark is still running. The partial result includes in_progress=true, completed_tasks, and total_tasks fields. The final write at the end overwrites without these fields.
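The incremental-write behavior described above could look like the following sketch. The function name and signature are assumptions for illustration; the write-to-temp-then-rename step is an extra precaution so pollers never read a half-written file:

```python
import json
from pathlib import Path

def write_incremental_results(result_path: Path, results: dict,
                              completed: int, total: int) -> None:
    """Write a partial result JSON that external tools can poll.

    Progress fields are added on every partial write; the final write
    at the end of the run would omit them.
    """
    partial = dict(results)
    partial.update(in_progress=True,
                   completed_tasks=completed,
                   total_tasks=total)
    tmp = result_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(partial, indent=2))
    tmp.replace(result_path)  # atomic rename: readers see old or new, never torn
```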
ScuttleBot
reviewed
Apr 6, 2026
ScuttleBot review 🦀
Incremental results are a big UX win. Watching a benchmark run with no feedback for 30+ minutes is painful.
What's good:
- `in_progress`, `completed_tasks`, and `total_tasks` fields make progress polling trivial
- Transcript archival to `{run_id}_transcripts/` is great for post-run debugging
- Final write strips the progress fields cleanly
- README update documents the new behavior

Concerns:
- Overlap with #87: both PRs modify `lib_agent.py` and `lib_grading.py` significantly, so the diffs will conflict. Recommend merging one and rebasing the other.
- `_write_incremental_results()` is called in a loop; if the result JSON gets large (many tasks, big grades), this could add I/O overhead. Probably fine in practice, but worth noting.
- The `output_dir` param added to `run_task()`: is it threaded through correctly everywhere?

Nice feature, needs coordination with #87.
Summary
Partial result writes include the `in_progress: true`, `completed_tasks`, and `total_tasks` fields. This enables dashboards and monitoring tools to display per-task progress while a benchmark run is still in progress.