
Write incremental results after each task completion#93

Merged
olearycrew merged 6 commits into pinchbench:main from juppytt:feat/incremental-results
Apr 6, 2026

Conversation

Contributor

@juppytt commented Apr 2, 2026

Summary

  • Write partial result JSON after each task finishes grading
  • External tools can poll the result file to show live progress
  • Partial results include in_progress: true, completed_tasks, and total_tasks fields
  • Final write at the end overwrites without these fields

This enables dashboards and monitoring tools to display per-task progress while a benchmark run is still in progress.
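The write described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name `write_results` and its signature are hypothetical; only the `in_progress`, `completed_tasks`, and `total_tasks` field names come from the PR description. The write-to-temp-then-rename step is one common way to keep pollers from reading a half-written file.

```python
import json
import os
import tempfile


def write_results(path, results, completed, total, final=False):
    """Write the result JSON; partial writes carry progress fields.

    Hypothetical sketch -- only the three progress field names are
    taken from the PR description; everything else is illustrative.
    """
    payload = dict(results)
    if not final:
        payload["in_progress"] = True
        payload["completed_tasks"] = completed
        payload["total_tasks"] = total
    # Write to a temp file in the same directory, then atomically
    # rename, so a poller never observes a partially written JSON.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f, indent=2)
    os.replace(tmp, path)
```

Calling this after each task with `final=False`, then once more at the end with `final=True`, reproduces the behavior described in the summary: the final write simply omits the progress fields.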

juppytt and others added 6 commits on April 1, 2026 at 20:24
Session transcripts were deleted between tasks by cleanup_agent_sessions,
making post-run debugging impossible. Now transcripts are copied to
results/{run_id}_transcripts/{task_id}.jsonl before cleanup.

Also fixes pre-existing duplicate _remove_readonly function definition
that caused a SyntaxError on import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
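The archival step this commit describes could look something like the sketch below. The function name `archive_transcript` and its parameters are assumptions for illustration; the destination layout `results/{run_id}_transcripts/{task_id}.jsonl` is taken from the commit message.

```python
import shutil
from pathlib import Path


def archive_transcript(session_path, results_dir, run_id, task_id):
    """Copy a session transcript into results/{run_id}_transcripts/
    before cleanup_agent_sessions deletes it.

    Illustrative sketch; only the destination path layout comes
    from the commit message.
    """
    dest_dir = Path(results_dir) / f"{run_id}_transcripts"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{task_id}.jsonl"
    # copy2 preserves timestamps, which helps when correlating
    # transcripts with per-task timing after the run.
    shutil.copy2(session_path, dest)
    return dest
```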
When --judge is specified with a model ID, the judge calls the model
API directly instead of running an OpenClaw agent session. This avoids
OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the
judge's JSON-only grading instructions, which caused all llm_judge
tasks to score 0.

Supported model prefixes:
  - openrouter/* -> OpenRouter API (OPENROUTER_API_KEY)
  - anthropic/*  -> Anthropic Messages API (ANTHROPIC_API_KEY)
  - openai/*     -> OpenAI chat completions (OPENAI_API_KEY)
  - claude       -> headless Claude CLI (claude -p)

Without --judge, behavior is unchanged (OpenClaw agent session).
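The prefix table above amounts to a small dispatch function, sketched here as an assumption-laden illustration (the function name `resolve_judge_backend` and the returned backend labels are hypothetical; the prefixes and environment variables are from the commit message):

```python
def resolve_judge_backend(judge: str) -> str:
    """Map a --judge model ID to an API backend, per the prefix table.

    Sketch only; the real dispatch lives in the benchmark code.
    """
    if judge.startswith("openrouter/"):
        return "openrouter"   # OpenRouter API, OPENROUTER_API_KEY
    if judge.startswith("anthropic/"):
        return "anthropic"    # Anthropic Messages API, ANTHROPIC_API_KEY
    if judge.startswith("openai/"):
        return "openai"       # OpenAI chat completions, OPENAI_API_KEY
    if judge == "claude":
        return "claude-cli"   # headless Claude CLI (claude -p)
    raise ValueError(f"unknown judge model: {judge}")
```

Routing the judge straight to a model API sidesteps the agent session entirely, which is what keeps SOUL.md and IDENTITY.md from contaminating the JSON-only grading prompt.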

Also fixes pre-existing duplicate _remove_readonly function definition
in lib_agent.py that caused an IndentationError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The function was defined twice on consecutive lines with the second
definition shadowing the first. Also removed an extra bare func(path)
call outside the try/except block.
Update the result JSON after every task finishes grading so external
tools can poll progress while the benchmark is still running. The
partial result includes in_progress=true, completed_tasks, and
total_tasks fields. The final write at the end overwrites without
these fields.

@ScuttleBot ScuttleBot left a comment


ScuttleBot review 🦀

Incremental results are a big UX win. Watching a benchmark run with no feedback for 30+ minutes is painful.

What's good:

  • in_progress, completed_tasks, total_tasks fields make progress polling trivial
  • Transcript archival to {run_id}_transcripts/ — great for post-run debugging
  • Final write strips the progress fields cleanly
  • README update documents the new behavior

Concerns:

  • Overlap with #87 — Both PRs modify lib_agent.py and lib_grading.py significantly. The diffs will conflict. Recommend merging one, rebasing the other.
  • _write_incremental_results() is called in a loop — if the result JSON gets large (many tasks, big grades), this could add I/O overhead. Probably fine in practice but worth noting.
  • The output_dir param added to run_task() — is this threaded through correctly everywhere?

Nice feature, needs coordination with #87.
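On the consumer side, a polling loop like the review envisions could be sketched as follows. This is a hypothetical external tool, not part of the PR; it assumes only the documented `in_progress`/`completed_tasks`/`total_tasks` fields and tolerates reading the file mid-write.

```python
import json
import time


def poll_progress(path, interval=2.0):
    """Poll the result file, printing progress until the final write
    (which drops the in_progress field). Illustrative consumer only.
    """
    while True:
        try:
            with open(path) as f:
                data = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            # File not created yet, or caught mid-write; retry.
            time.sleep(interval)
            continue
        if not data.get("in_progress"):
            print("run complete")
            return data
        print(f"{data['completed_tasks']}/{data['total_tasks']} tasks done")
        time.sleep(interval)
```

Note that if the producer does not write atomically, the `JSONDecodeError` branch is what saves the poller from crashing on a torn read, which speaks to the I/O-overhead concern above: atomic replace on the writer side is cheap insurance.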

@olearycrew olearycrew merged commit 76f0bb9 into pinchbench:main Apr 6, 2026
1 check passed