Skip to content

fix: Windows compat, grading correctness, and judge efficiency#74

Merged
olearycrew merged 2 commits intopinchbench:mainfrom
zhaixccc:fix/windows-compat-and-grading-bugs
Mar 31, 2026
Merged

fix: Windows compat, grading correctness, and judge efficiency#74
olearycrew merged 2 commits intopinchbench:mainfrom
zhaixccc:fix/windows-compat-and-grading-bugs

Conversation

@zhaixccc
Copy link
Copy Markdown

@zhaixccc zhaixccc commented Mar 24, 2026

4 bugs found while benchmarking qwen3.5 model on Windows with OpenClaw:

  1. fix(lib_agent): preserve bootstrap files across task workspace resets

    • save SOUL.md/BOOTSTRAP.md/USER.md/IDENTITY.md/HEARTBEAT.md/TOOLS.md before shutil.rmtree() and restore afterward
    • without this the agent re-enters onboarding flow on every task after the first, producing useless responses
  2. fix(lib_agent): handle read-only files when clearing workspace on Windows

    • add onerror handler to shutil.rmtree() calls that chmod+deletes read-only files (git objects, etc.)
    • tasks that create a git repo (e.g. task_11_clawdhub) mark .git/objects as read-only; rmtree would crash with [WinError 5] on the next task, causing all subsequent tasks to score 0
  3. fix(lib_grading): skip LLM judge when task execution failed with no transcript

    • when status != success and transcript is empty, return score=0.0 directly
    • previously every failed task still triggered a full judge agent run (~2-3 min each), wasting API quota and always returning 0 anyway
  4. fix(lib_tasks): skip TASK_TEMPLATE.md placeholder task

    • filter out tasks whose id contains XX or equals task_XX_name
    • the template was being loaded and run as a real task, consuming one benchmark slot and always scoring 0
  5. fix(task_08_memory): support OpenClaw read tool name in read_notes grading

    • grading code only checked read_file/readFile (Cursor/Windsurf names)
    • OpenClaw uses read as the tool name; add it to the check so the criterion scores correctly

4 bugs found while benchmarking lenovo/aly model on Windows with OpenClaw:

1. fix(lib_agent): preserve bootstrap files across task workspace resets
   - save SOUL.md/BOOTSTRAP.md/USER.md/IDENTITY.md/HEARTBEAT.md/TOOLS.md
     before shutil.rmtree() and restore afterward
   - without this the agent re-enters onboarding flow on every task after
     the first, producing useless responses

2. fix(lib_agent): handle read-only files when clearing workspace on Windows
   - add onerror handler to shutil.rmtree() calls that chmod+deletes
     read-only files (git objects, etc.)
   - tasks that create a git repo (e.g. task_11_clawdhub) mark .git/objects
     as read-only; rmtree would crash with [WinError 5] on the next task,
     causing all subsequent tasks to score 0

3. fix(lib_grading): skip LLM judge when task execution failed with no transcript
   - when status != success and transcript is empty, return score=0.0 directly
   - previously every failed task still triggered a full judge agent run
     (~2-3 min each), wasting API quota and always returning 0 anyway

4. fix(lib_tasks): skip TASK_TEMPLATE.md placeholder task
   - filter out tasks whose id contains _XX_ or equals task_XX_name
   - the template was being loaded and run as a real task, consuming one
     benchmark slot and always scoring 0

5. fix(task_08_memory): support OpenClaw read tool name in read_notes grading
   - grading code only checked read_file/readFile (Cursor/Windsurf names)
   - OpenClaw uses read as the tool name; add it to the check so the
     criterion scores correctly
capture_output=True,
text=True,
check=False,
shell=USE_SHELL,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: shell=USE_SHELL with a list argument is a security concern on Windows. When shell=True and a list is passed, Python joins the list into a single string for cmd.exe. Any argument containing shell metacharacters (&, |, >, etc.) could lead to command injection.

For the "openclaw", "agents", "list" invocation here this is low risk since args are hardcoded, but this same pattern is applied in execute_openclaw_task (line 735) and run_openclaw_prompt (line 930) where session_prompt/send_chunk contain task-controlled or transcript-derived content. On Windows, a malicious or unexpected prompt containing & del /s /q C:\ would be interpreted by cmd.exe.

Consider using subprocess.list2cmdline() to properly quote the argument list when shell=True, or better yet, avoid shell=True and resolve the executable path via shutil.which("openclaw") instead — this is typically why shell=True is needed on Windows (to locate executables on PATH).

# causing the message to be truncated after the first line.
# Escape newlines to literal \n sequences so the full prompt is received.
send_chunk = (
chunk.replace("\r\n", "\\n").replace("\n", "\\n").replace("\r", "\\n")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: This newline escaping is fragile and can corrupt prompt content. If the original chunk already contains literal \n text sequences (common in JSON strings within the judge prompt), those become indistinguishable from escaped actual newlines when the receiving side interprets the message.

For example, a JSON snippet {"key": "line1\nline2"} in the prompt would be double-escaped to {"key": "line1\\nline2"} — or if it contained real newlines alongside escaped ones, the escaping would be inconsistent.

Consider using base64 encoding, a temp file passed via --file argument, or stdin piping (subprocess.run(..., input=chunk)) instead, which avoids the need for shell escaping entirely.

if item.get("type") == "text":
content_chunks.append(item.get("text", ""))
raw_text = "\n".join(content_chunks).strip()
logger.info(" [VERBOSE] Judge raw response text (first 2000 chars):\n%s", raw_text[:2000])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: This logger.info call runs unconditionally on every judge grading, unlike the surrounding verbose-gated log lines. It will dump up to 2000 chars of raw judge response text in production logs even when verbose=False. This seems unintentional — the [VERBOSE] prefix suggests it should be guarded.

Note that verbose is not in scope here (it's a parameter of _grade_llm_judge, not _parse_judge_response). Either pass verbose to this function, or move this log into the caller after the call to _parse_judge_response.

}
skip_dirs = {".git", ".openclaw", "__pycache__", "node_modules", "skills"}
file_contents: List[str] = []
for f in sorted(workspace.rglob("*")):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: No bound on total file count or aggregate size. A workspace containing hundreds of files (e.g., after npm install leaks through the skip_dirs filter, or a code generation task creates many files) could produce an extremely large string that blows up the judge prompt context window and wastes API tokens.

Consider adding a total size cap, e.g.:

MAX_WORKSPACE_CONTENT_CHARS = 30_000
total_chars = 0
for f in sorted(workspace.rglob("*")):
    ...
    content = f.read_text(encoding="utf-8")
    snippet = f"### File: {rel}\n{content[:3000]}"
    total_chars += len(snippet)
    if total_chars > MAX_WORKSPACE_CONTENT_CHARS:
        file_contents.append("... (truncated, workspace too large)")
        break
    file_contents.append(snippet)

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot bot commented Mar 24, 2026

Code Review Summary

Status: 5 Issues Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 5
SUGGESTION 0
Issue Details (click to expand)

WARNING

File Line Issue
scripts/lib_agent.py 171 shell=USE_SHELL with list args enables command injection on Windows — task prompts and transcript content passed as subprocess args can contain shell metacharacters
scripts/lib_agent.py 370 _remove_readonly onerror handler can itself raise, aborting rmtree midway and leaving workspace partially deleted
scripts/lib_agent.py 910 Newline escaping for Windows is fragile — can corrupt prompts containing literal \n text (e.g., JSON strings), and doesn't handle other shell metacharacters
scripts/lib_grading.py 392 logger.info with [VERBOSE] prefix runs unconditionally (not gated by verbose flag), dumping 2000 chars of judge response in production logs
scripts/lib_grading.py 323 _read_workspace_files has no total size bound — workspaces with many files can produce enormous judge prompts
Files Reviewed (4 files)
  • scripts/lib_agent.py - 3 issues
  • scripts/lib_grading.py - 2 issues
  • scripts/lib_tasks.py - 0 issues
  • tasks/task_08_memory.md - 0 issues

Fix these issues in Kilo Cloud


Reviewed by claude-4.6-opus-20260205 · 684,611 tokens

@olearycrew
Copy link
Copy Markdown
Member

@zhaixccc will you take a look at the review agent's comments and either agree or disagree with them?

Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>
@olearycrew olearycrew merged commit 431d8cd into pinchbench:main Mar 31, 2026
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants