fix: Windows compat, grading correctness, and judge efficiency #74
Conversation
5 bugs found while benchmarking lenovo/aly model on Windows with OpenClaw:
1. fix(lib_agent): preserve bootstrap files across task workspace resets
- save SOUL.md/BOOTSTRAP.md/USER.md/IDENTITY.md/HEARTBEAT.md/TOOLS.md
before shutil.rmtree() and restore afterward
- without this the agent re-enters onboarding flow on every task after
the first, producing useless responses
2. fix(lib_agent): handle read-only files when clearing workspace on Windows
- add an onerror handler to the shutil.rmtree() calls that chmods
  read-only files (git objects, etc.) writable and retries the delete
- tasks that create a git repo (e.g. task_11_clawdhub) mark .git/objects
as read-only; rmtree would crash with [WinError 5] on the next task,
causing all subsequent tasks to score 0
3. fix(lib_grading): skip LLM judge when task execution failed with no transcript
- when status != success and transcript is empty, return score=0.0 directly
- previously every failed task still triggered a full judge agent run
(~2-3 min each), wasting API quota and always returning 0 anyway
4. fix(lib_tasks): skip TASK_TEMPLATE.md placeholder task
- filter out tasks whose id contains _XX_ or equals task_XX_name
- the template was being loaded and run as a real task, consuming one
benchmark slot and always scoring 0
5. fix(task_08_memory): support OpenClaw read tool name in read_notes grading
- grading code only checked read_file/readFile (Cursor/Windsurf names)
- OpenClaw uses read as the tool name; add it to the check so the
criterion scores correctly
```python
capture_output=True,
text=True,
check=False,
shell=USE_SHELL,
```
WARNING: shell=USE_SHELL with a list argument is a security concern on Windows. When shell=True and a list is passed, Python joins the list into a single string for cmd.exe. Any argument containing shell metacharacters (&, |, >, etc.) could lead to command injection.
For the "openclaw", "agents", "list" invocation here this is low risk since args are hardcoded, but this same pattern is applied in execute_openclaw_task (line 735) and run_openclaw_prompt (line 930) where session_prompt/send_chunk contain task-controlled or transcript-derived content. On Windows, a malicious or unexpected prompt containing & del /s /q C:\ would be interpreted by cmd.exe.
Consider using subprocess.list2cmdline() to properly quote the argument list when shell=True, or better yet, avoid shell=True and resolve the executable path via shutil.which("openclaw") instead — this is typically why shell=True is needed on Windows (to locate executables on PATH).
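The second suggestion could look like this (a sketch only; the `run_cli` helper name is hypothetical, not from the codebase):

```python
import shutil
import subprocess


def run_cli(cmd_name: str, *args: str) -> subprocess.CompletedProcess:
    """Run a CLI without shell=True by resolving the executable first.

    shutil.which() finds .exe/.cmd shims on Windows, which removes the
    usual reason for shell=True; the argument list is then passed to the
    process verbatim, so shell metacharacters in args are inert.
    """
    exe = shutil.which(cmd_name)
    if exe is None:
        raise FileNotFoundError(f"{cmd_name} not found on PATH")
    return subprocess.run(
        [exe, *args],
        capture_output=True,
        text=True,
        check=False,
    )


# e.g. run_cli("openclaw", "agents", "list")
```

Because no shell ever parses the arguments, a prompt containing `& del /s /q C:\` stays a plain string.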
```python
# causing the message to be truncated after the first line.
# Escape newlines to literal \n sequences so the full prompt is received.
send_chunk = (
    chunk.replace("\r\n", "\\n").replace("\n", "\\n").replace("\r", "\\n")
```
WARNING: This newline escaping is fragile and can corrupt prompt content. If the original chunk already contains literal \n text sequences (common in JSON strings within the judge prompt), those become indistinguishable from escaped actual newlines when the receiving side interprets the message.
For example, a JSON snippet {"key": "line1\nline2"} in the prompt would be double-escaped to {"key": "line1\\nline2"} — or if it contained real newlines alongside escaped ones, the escaping would be inconsistent.
Consider using base64 encoding, a temp file passed via --file argument, or stdin piping (subprocess.run(..., input=chunk)) instead, which avoids the need for shell escaping entirely.
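The stdin option avoids the escaping layer entirely. Since openclaw's actual stdin-reading invocation isn't shown here, the mechanics are demonstrated with `cat` standing in for the CLI:

```python
import subprocess


def send_prompt(argv: list, prompt: str) -> str:
    """Pipe a multi-line prompt over stdin instead of a shell argument.

    With text=True, subprocess.run encodes `input` and decodes stdout;
    embedded real newlines and literal backslash-n sequences both pass
    through unchanged, so nothing needs to be escaped or unescaped.
    """
    result = subprocess.run(argv, input=prompt, capture_output=True, text=True)
    return result.stdout


# A payload mixing a literal \n text sequence and a real newline,
# exactly the case the escaping approach corrupts:
payload = '{"key": "line1\\nline2"}\nsecond real line'
```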
```python
if item.get("type") == "text":
    content_chunks.append(item.get("text", ""))
raw_text = "\n".join(content_chunks).strip()
logger.info("  [VERBOSE] Judge raw response text (first 2000 chars):\n%s", raw_text[:2000])
```
WARNING: This logger.info call runs unconditionally on every judge grading, unlike the surrounding verbose-gated log lines. It will dump up to 2000 chars of raw judge response text in production logs even when verbose=False. This seems unintentional — the [VERBOSE] prefix suggests it should be guarded.
Note that verbose is not in scope here (it's a parameter of _grade_llm_judge, not _parse_judge_response). Either pass verbose to this function, or move this log into the caller after the call to _parse_judge_response.
```python
}
skip_dirs = {".git", ".openclaw", "__pycache__", "node_modules", "skills"}
file_contents: List[str] = []
for f in sorted(workspace.rglob("*")):
```
WARNING: No bound on total file count or aggregate size. A workspace containing hundreds of files (e.g., after npm install leaks through the skip_dirs filter, or a code generation task creates many files) could produce an extremely large string that blows up the judge prompt context window and wastes API tokens.
Consider adding a total size cap, e.g.:
```python
MAX_WORKSPACE_CONTENT_CHARS = 30_000
total_chars = 0
for f in sorted(workspace.rglob("*")):
    ...
    content = f.read_text(encoding="utf-8")
    snippet = f"### File: {rel}\n{content[:3000]}"
    total_chars += len(snippet)
    if total_chars > MAX_WORKSPACE_CONTENT_CHARS:
        file_contents.append("... (truncated, workspace too large)")
        break
    file_contents.append(snippet)
```
Code Review Summary
Status: 5 Issues Found | Recommendation: Address before merge
Issue Details: 5 WARNING-level issues (detailed above)
Files Reviewed (4 files)
Reviewed by claude-4.6-opus-20260205 · 684,611 tokens
@zhaixccc will you take a look at the review agent's comments and either agree or disagree with them?
Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>
5 bugs found while benchmarking qwen3.5 model on Windows with OpenClaw:
fix(lib_agent): preserve bootstrap files across task workspace resets
fix(lib_agent): handle read-only files when clearing workspace on Windows
fix(lib_grading): skip LLM judge when task execution failed with no transcript
fix(lib_tasks): skip TASK_TEMPLATE.md placeholder task
fix(task_08_memory): support OpenClaw read tool name in read_notes grading