fix: Windows compat, grading correctness, and judge efficiency by zhaixccc · Pull Request #74 · pinchbench/skill

zhaixccc · 2026-03-24T10:43:35Z

4 bugs found while benchmarking qwen3.5 model on Windows with OpenClaw:

fix(lib_agent): preserve bootstrap files across task workspace resets
- save SOUL.md/BOOTSTRAP.md/USER.md/IDENTITY.md/HEARTBEAT.md/TOOLS.md before shutil.rmtree() and restore afterward
- without this the agent re-enters onboarding flow on every task after the first, producing useless responses
fix(lib_agent): handle read-only files when clearing workspace on Windows
- add onerror handler to shutil.rmtree() calls that chmod+deletes read-only files (git objects, etc.)
- tasks that create a git repo (e.g. task_11_clawdhub) mark .git/objects as read-only; rmtree would crash with [WinError 5] on the next task, causing all subsequent tasks to score 0
fix(lib_grading): skip LLM judge when task execution failed with no transcript
- when status != success and transcript is empty, return score=0.0 directly
- previously every failed task still triggered a full judge agent run (~2-3 min each), wasting API quota and always returning 0 anyway
fix(lib_tasks): skip TASK_TEMPLATE.md placeholder task
- filter out tasks whose id contains XX or equals task_XX_name
- the template was being loaded and run as a real task, consuming one benchmark slot and always scoring 0
fix(task_08_memory): support OpenClaw read tool name in read_notes grading
- grading code only checked read_file/readFile (Cursor/Windsurf names)
- OpenClaw uses read as the tool name; add it to the check so the criterion scores correctly

4 bugs found while benchmarking lenovo/aly model on Windows with OpenClaw: 1. fix(lib_agent): preserve bootstrap files across task workspace resets - save SOUL.md/BOOTSTRAP.md/USER.md/IDENTITY.md/HEARTBEAT.md/TOOLS.md before shutil.rmtree() and restore afterward - without this the agent re-enters onboarding flow on every task after the first, producing useless responses 2. fix(lib_agent): handle read-only files when clearing workspace on Windows - add onerror handler to shutil.rmtree() calls that chmod+deletes read-only files (git objects, etc.) - tasks that create a git repo (e.g. task_11_clawdhub) mark .git/objects as read-only; rmtree would crash with [WinError 5] on the next task, causing all subsequent tasks to score 0 3. fix(lib_grading): skip LLM judge when task execution failed with no transcript - when status != success and transcript is empty, return score=0.0 directly - previously every failed task still triggered a full judge agent run (~2-3 min each), wasting API quota and always returning 0 anyway 4. fix(lib_tasks): skip TASK_TEMPLATE.md placeholder task - filter out tasks whose id contains _XX_ or equals task_XX_name - the template was being loaded and run as a real task, consuming one benchmark slot and always scoring 0 5. fix(task_08_memory): support OpenClaw read tool name in read_notes grading - grading code only checked read_file/readFile (Cursor/Windsurf names) - OpenClaw uses read as the tool name; add it to the check so the criterion scores correctly

kilo-code-bot · 2026-03-24T10:46:14Z

scripts/lib_agent.py

            capture_output=True,
            text=True,
            check=False,
+            shell=USE_SHELL,


WARNING: shell=USE_SHELL with a list argument is a security concern on Windows. When shell=True and a list is passed, Python joins the list into a single string for cmd.exe. Any argument containing shell metacharacters (&, |, >, etc.) could lead to command injection.

For the "openclaw", "agents", "list" invocation here this is low risk since args are hardcoded, but this same pattern is applied in execute_openclaw_task (line 735) and run_openclaw_prompt (line 930) where session_prompt/send_chunk contain task-controlled or transcript-derived content. On Windows, a malicious or unexpected prompt containing & del /s /q C:\ would be interpreted by cmd.exe.

Consider using subprocess.list2cmdline() to properly quote the argument list when shell=True, or better yet, avoid shell=True and resolve the executable path via shutil.which("openclaw") instead — this is typically why shell=True is needed on Windows (to locate executables on PATH).

scripts/lib_agent.py

kilo-code-bot · 2026-03-24T10:46:14Z

scripts/lib_agent.py

+            # causing the message to be truncated after the first line.
+            # Escape newlines to literal \n sequences so the full prompt is received.
+            send_chunk = (
+                chunk.replace("\r\n", "\\n").replace("\n", "\\n").replace("\r", "\\n")


WARNING: This newline escaping is fragile and can corrupt prompt content. If the original chunk already contains literal \n text sequences (common in JSON strings within the judge prompt), those become indistinguishable from escaped actual newlines when the receiving side interprets the message.

For example, a JSON snippet {"key": "line1\nline2"} in the prompt would be double-escaped to {"key": "line1\\nline2"} — or if it contained real newlines alongside escaped ones, the escaping would be inconsistent.

Consider using base64 encoding, a temp file passed via --file argument, or stdin piping (subprocess.run(..., input=chunk)) instead, which avoids the need for shell escaping entirely.

kilo-code-bot · 2026-03-24T10:46:14Z

scripts/lib_grading.py

            if item.get("type") == "text":
                content_chunks.append(item.get("text", ""))
    raw_text = "\n".join(content_chunks).strip()
+    logger.info("   [VERBOSE] Judge raw response text (first 2000 chars):\n%s", raw_text[:2000])


WARNING: This logger.info call runs unconditionally on every judge grading, unlike the surrounding verbose-gated log lines. It will dump up to 2000 chars of raw judge response text in production logs even when verbose=False. This seems unintentional — the [VERBOSE] prefix suggests it should be guarded.

Note that verbose is not in scope here (it's a parameter of _grade_llm_judge, not _parse_judge_response). Either pass verbose to this function, or move this log into the caller after the call to _parse_judge_response.

kilo-code-bot · 2026-03-24T10:46:14Z

scripts/lib_grading.py

+    }
+    skip_dirs = {".git", ".openclaw", "__pycache__", "node_modules", "skills"}
+    file_contents: List[str] = []
+    for f in sorted(workspace.rglob("*")):


WARNING: No bound on total file count or aggregate size. A workspace containing hundreds of files (e.g., after npm install leaks through the skip_dirs filter, or a code generation task creates many files) could produce an extremely large string that blows up the judge prompt context window and wastes API tokens.

Consider adding a total size cap, e.g.:

MAX_WORKSPACE_CONTENT_CHARS = 30_000 total_chars = 0 for f in sorted(workspace.rglob("*")): ... content = f.read_text(encoding="utf-8") snippet = f"### File: {rel}\n{content[:3000]}" total_chars += len(snippet) if total_chars > MAX_WORKSPACE_CONTENT_CHARS: file_contents.append("... (truncated, workspace too large)") break file_contents.append(snippet)

kilo-code-bot · 2026-03-24T10:46:30Z

Code Review Summary

Status: 5 Issues Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	0
WARNING	5
SUGGESTION	0

Issue Details (click to expand)

WARNING

File	Line	Issue
`scripts/lib_agent.py`	171	`shell=USE_SHELL` with list args enables command injection on Windows — task prompts and transcript content passed as subprocess args can contain shell metacharacters
`scripts/lib_agent.py`	370	`_remove_readonly` onerror handler can itself raise, aborting `rmtree` midway and leaving workspace partially deleted
`scripts/lib_agent.py`	910	Newline escaping for Windows is fragile — can corrupt prompts containing literal `\n` text (e.g., JSON strings), and doesn't handle other shell metacharacters
`scripts/lib_grading.py`	392	`logger.info` with `[VERBOSE]` prefix runs unconditionally (not gated by `verbose` flag), dumping 2000 chars of judge response in production logs
`scripts/lib_grading.py`	323	`_read_workspace_files` has no total size bound — workspaces with many files can produce enormous judge prompts

Files Reviewed (4 files)

scripts/lib_agent.py - 3 issues
scripts/lib_grading.py - 2 issues
scripts/lib_tasks.py - 0 issues
tasks/task_08_memory.md - 0 issues

Fix these issues in Kilo Cloud

_{Reviewed by claude-4.6-opus-20260205 · 684,611 tokens}

olearycrew · 2026-03-24T17:32:32Z

@zhaixccc will you take a look at the review agent's comments and either agree or disagree with them?

Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>

kilo-code-bot bot reviewed Mar 24, 2026

View reviewed changes

Apply suggestions from code review

7d5eadb

Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>

olearycrew approved these changes Mar 31, 2026

View reviewed changes

olearycrew merged commit 431d8cd into pinchbench:main Mar 31, 2026
1 check failed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Windows compat, grading correctness, and judge efficiency#74

fix: Windows compat, grading correctness, and judge efficiency#74
olearycrew merged 2 commits intopinchbench:mainfrom
zhaixccc:fix/windows-compat-and-grading-bugs

zhaixccc commented Mar 24, 2026 •

edited

Loading

Uh oh!

kilo-code-bot bot Mar 24, 2026

Uh oh!

Uh oh!

kilo-code-bot bot Mar 24, 2026

Uh oh!

kilo-code-bot bot Mar 24, 2026

Uh oh!

kilo-code-bot bot Mar 24, 2026

Uh oh!

kilo-code-bot bot commented Mar 24, 2026 •

edited

Loading

WARNING

Uh oh!

olearycrew commented Mar 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

zhaixccc commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kilo-code-bot bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kilo-code-bot bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

kilo-code-bot bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

kilo-code-bot bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

kilo-code-bot bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Summary

Overview

WARNING

Uh oh!

olearycrew commented Mar 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zhaixccc commented Mar 24, 2026 •

edited

Loading

kilo-code-bot bot commented Mar 24, 2026 •

edited

Loading