diff --git a/.claude/skills/nasde-benchmark-creator/SKILL.md b/.claude/skills/nasde-benchmark-creator/SKILL.md
index 03d4b7e..55a3c23 100644
--- a/.claude/skills/nasde-benchmark-creator/SKILL.md
+++ b/.claude/skills/nasde-benchmark-creator/SKILL.md
@@ -14,6 +14,52 @@ description: |
 
 Create and configure coding agent benchmarks for evaluation with `nasde`. A benchmark is a set of coding tasks that AI agents solve inside isolated Docker containers, scored both by functional tests (pass/fail) and by an LLM-as-a-Judge architecture assessment.
 
+## Critical: line endings on Windows (read this first)
+
+Benchmark scripts execute inside **Linux** sandboxes (Docker, Daytona). If `tests/test.sh`, `solution/solve.sh`, or `environment/Dockerfile` are checked out with **CRLF** line endings (the Windows git default when `core.autocrlf=true` and there is no `.gitattributes`), every trial fails immediately with:
+
+```
+bash: line 1: /tests/test.sh: cannot execute: required file not found
+```
+
+…because the kernel reads the shebang as `#!/bin/bash\r` and tries to execute a non-existent `/bin/bash\r`. The agent finishes its work, but the verifier never runs and Harbor reports `RewardFileNotFoundError`.
+
+**Mitigation (always do this for a new benchmark — `nasde init` does it for you, but verify):**
+
+1. The benchmark repo MUST have a `.gitattributes` file enforcing LF for shell scripts and Dockerfiles. The minimum content:
+   ```gitattributes
+   * text=auto eol=lf
+   *.sh text eol=lf
+   *.bash text eol=lf
+   Dockerfile text eol=lf
+   *.dockerfile text eol=lf
+   docker-compose.yaml text eol=lf
+   docker-compose.yml text eol=lf
+
+   *.ps1 text eol=crlf
+   *.bat text eol=crlf
+   *.cmd text eol=crlf
+   ```
+   `nasde init` writes this automatically. If you are adding a benchmark to an existing repo without `.gitattributes`, create one before adding any task.
+
+2. When **writing** `.sh` or `Dockerfile` content programmatically on Windows, write with explicit LF — not `path.write_text(content)` (which translates `\n`→`\r\n` on Windows), but `path.write_text(content, encoding="utf-8", newline="")` or open the file in binary mode.
+
+3. After committing on Windows for the first time, run:
+   ```bash
+   git add --renormalize .
+   git commit -m "normalize line endings"
+   ```
+   to fix any files that landed before `.gitattributes` was in place.
+
+4. Sanity check before pushing a new task:
+   ```bash
+   file tasks//tests/test.sh
+   # MUST say "with LF line terminators" or omit line-terminator info entirely.
+   # If it says "with CRLF line terminators" — fix it (`sed -i 's/\r$//' file`).
+   ```
+
+This applies equally when you're **adding tasks to a benchmark someone else created** — if their repo has no `.gitattributes` and you're on Windows, your contribution will silently break for them on Linux CI and vice versa.
+
 ## Step 1: Understand what to evaluate
 
 Before creating files, clarify with the user:
@@ -116,6 +162,8 @@ What the agent must NOT do (e.g., don't modify existing tests).
 
 ### environment/Dockerfile (required)
 
+> **Reminder for Windows authors:** the Dockerfile and any helper scripts it `COPY`s in must have LF line endings — Docker tolerates CRLF in some commands but not in `RUN` shell snippets, and any shell script copied with CRLF will hit the same shebang failure as `test.sh`.
+
 ```dockerfile
 FROM 
 
@@ -137,6 +185,8 @@ The Dockerfile MUST be self-contained — the agent starts working immediately.
 
 ### tests/test.sh (required — Harbor verifier)
 
+> **Reminder for Windows authors:** this file MUST be saved with LF line endings. See "Critical: line endings on Windows" at the top of this skill. CRLF here = `bash: required file not found` and a wasted trial.
+
 ```bash
 #!/bin/bash
 cd /app
@@ -319,3 +369,11 @@ Before running with a real agent:
    ```bash
    nasde run --variant vanilla --tasks --without-eval -C .
    ```
+
+4. **Final pre-flight for Windows authors** — verify no CRLF leaked in:
+   ```bash
+   find tasks -name '*.sh' -exec sh -c 'file "$1" | grep -q CRLF && echo "BAD: $1"' _ {} \;
+   find tasks -name 'Dockerfile' -exec sh -c 'file "$1" | grep -q CRLF && echo "BAD: $1"' _ {} \;
+   # Both should print nothing.
+   ```
+   If anything prints, fix with `sed -i 's/\r$//' ` and re-commit.
diff --git a/.claude/skills/nasde-benchmark-from-history/SKILL.md b/.claude/skills/nasde-benchmark-from-history/SKILL.md
index 05292cb..698275c 100644
--- a/.claude/skills/nasde-benchmark-from-history/SKILL.md
+++ b/.claude/skills/nasde-benchmark-from-history/SKILL.md
@@ -19,6 +19,15 @@ Generate NASDE benchmark tasks by mining git history. You analyze commits, diffs
 - An existing NASDE benchmark project (run `nasde init` first, or use the `nasde-benchmark-creator` skill)
 - If the benchmark project doesn't exist yet, create it first — this skill generates tasks, not the project scaffold
 
+## Critical: line endings on Windows (read this first)
+
+When generating `tests/test.sh`, `solution/solve.sh`, or `environment/Dockerfile` on a Windows host, write them with **LF** line endings or every trial fails with `bash: required file not found` (the kernel reads `#!/bin/bash\r` as the shebang). See the full explanation and `.gitattributes` template in the `nasde-benchmark-creator` skill.
+
+Quick rules:
+- The benchmark project MUST have a `.gitattributes` enforcing `*.sh text eol=lf` and `Dockerfile text eol=lf`. `nasde init` creates this. If the existing project lacks it, **create `.gitattributes` before generating any task files**.
+- When writing files programmatically, use `path.write_text(content, encoding="utf-8", newline="")` — never the bare default which translates `\n`→`\r\n` on Windows.
+- Sanity-check after generation: `find tasks/ -name '*.sh' -o -name 'Dockerfile' | xargs file | grep CRLF` should print nothing.
+
 ## Step 1: Identify the source repository and commit range
 
 Ask the user:
diff --git a/.claude/skills/nasde-benchmark-from-public-repos/SKILL.md b/.claude/skills/nasde-benchmark-from-public-repos/SKILL.md
index 48c5715..5fc106a 100644
--- a/.claude/skills/nasde-benchmark-from-public-repos/SKILL.md
+++ b/.claude/skills/nasde-benchmark-from-public-repos/SKILL.md
@@ -19,6 +19,15 @@ Build a diverse NASDE benchmark by curating tasks from multiple public GitHub re
 - A clear description of the skill being evaluated (what it does, what kinds of tasks it helps with)
 - Internet access (to browse and clone public repositories)
 
+## Critical: line endings on Windows (read this first)
+
+When generating `tests/test.sh`, `solution/solve.sh`, or `environment/Dockerfile` on a Windows host, write them with **LF** line endings or every trial fails with `bash: required file not found` (the kernel reads `#!/bin/bash\r` as the shebang). See the full explanation and `.gitattributes` template in the `nasde-benchmark-creator` skill.
+
+Quick rules:
+- The benchmark project MUST have a `.gitattributes` enforcing `*.sh text eol=lf` and `Dockerfile text eol=lf`. `nasde init` creates this. If the existing project lacks it, **create `.gitattributes` before generating any task files**.
+- When writing files programmatically, use `path.write_text(content, encoding="utf-8", newline="")` — never the bare default which translates `\n`→`\r\n` on Windows.
+- Sanity-check after generation: `find tasks/ -name '*.sh' -o -name 'Dockerfile' | xargs file | grep CRLF` should print nothing.
+
 ## Step 1: Understand the skill under test
 
 Ask the user:
diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..873b804
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1,46 @@
+# Default: let Git detect text vs binary, force LF in working tree.
+# Critical: shell scripts and Dockerfiles MUST be LF — they are executed by
+# Linux interpreters in benchmark sandboxes (Daytona, Docker). CRLF causes
+# `bash: required file not found` because the shebang becomes `#!/bin/bash\r`.
+* text=auto eol=lf
+
+# Source code that runs in Linux containers / cross-platform tooling: force LF.
+*.sh text eol=lf
+*.bash text eol=lf
+*.py text eol=lf
+Dockerfile text eol=lf
+*.dockerfile text eol=lf
+docker-compose.yaml text eol=lf
+docker-compose.yml text eol=lf
+*.toml text eol=lf
+*.yaml text eol=lf
+*.yml text eol=lf
+*.json text eol=lf
+*.md text eol=lf
+Makefile text eol=lf
+*.mk text eol=lf
+
+# PowerShell expects CRLF on Windows. Keep as-is so PS5.1 parses cleanly.
+*.ps1 text eol=crlf
+*.psd1 text eol=crlf
+*.psm1 text eol=crlf
+
+# Windows batch files require CRLF.
+*.bat text eol=crlf
+*.cmd text eol=crlf
+
+# Binary assets: never touch line endings.
+*.png binary
+*.jpg binary
+*.jpeg binary
+*.gif binary
+*.ico binary
+*.pdf binary
+*.zip binary
+*.gz binary
+*.tgz binary
+*.tar binary
+*.whl binary
+*.so binary
+*.dll binary
+*.exe binary
diff --git a/src/nasde_toolkit/scaffold/__init__.py b/src/nasde_toolkit/scaffold/__init__.py
index 647450d..b0deaa6 100644
--- a/src/nasde_toolkit/scaffold/__init__.py
+++ b/src/nasde_toolkit/scaffold/__init__.py
@@ -101,6 +101,49 @@ jobs/
 
 """
 
+GITATTRIBUTES_TEMPLATE = """\
+# Critical: files executed inside benchmark sandboxes (Linux containers via
+# Docker / Daytona / Modal / etc.) MUST be LF. CRLF on a shebang line causes
+# `bash: required file not found` because the kernel reads `#!/bin/bash\\r`.
+* text=auto eol=lf
+
+*.sh text eol=lf
+*.bash text eol=lf
+Dockerfile text eol=lf
+*.dockerfile text eol=lf
+docker-compose.yaml text eol=lf
+docker-compose.yml text eol=lf
+*.toml text eol=lf
+*.yaml text eol=lf
+*.yml text eol=lf
+*.json text eol=lf
+*.md text eol=lf
+*.py text eol=lf
+
+# PowerShell / Windows batch keep CRLF.
+*.ps1 text eol=crlf
+*.psd1 text eol=crlf
+*.psm1 text eol=crlf
+*.bat text eol=crlf
+*.cmd text eol=crlf
+
+# Binary assets — never touch line endings.
+*.png binary
+*.jpg binary
+*.jpeg binary
+*.gif binary
+*.ico binary
+*.pdf binary
+*.zip binary
+*.gz binary
+*.tar binary
+*.tgz binary
+*.whl binary
+*.so binary
+*.dll binary
+*.exe binary
+"""
+
 
 def create_project(project_dir: Path, name: str) -> None:
     """Scaffold a new evaluation project structure."""
@@ -111,6 +154,7 @@ def create_project(project_dir: Path, name: str) -> None:
     _write_if_missing(project_dir / "nasde.toml", NASDE_TOML_TEMPLATE.format(name=name))
     _write_if_missing(project_dir / "assessment_dimensions.json", ASSESSMENT_DIMENSIONS_TEMPLATE)
     _write_if_missing(project_dir / ".gitignore", GITIGNORE_TEMPLATE)
+    _write_if_missing(project_dir / ".gitattributes", GITATTRIBUTES_TEMPLATE)
 
     tasks_dir.mkdir(parents=True, exist_ok=True)
     variants_dir.mkdir(parents=True, exist_ok=True)
@@ -158,4 +202,4 @@ def _write_if_missing(path: Path, content: str) -> None:
         console.print(f" [yellow]Skipping[/yellow] {path.name} (already exists)")
         return
     path.parent.mkdir(parents=True, exist_ok=True)
-    path.write_text(content)
+    path.write_text(content, encoding="utf-8", newline="")
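
Reviewer sketch (not part of the patch): the final hunk relies on `newline=""` to stop Python's text mode from translating `\n` into the platform line separator. The snippet below is a minimal illustration of that behaviour plus a byte-level CRLF check that could back the "sanity check" steps in the skill docs. The helper names `write_lf` and `has_crlf` are hypothetical and do not exist in `nasde_toolkit`. It uses `open(..., newline="")` rather than `Path.write_text(..., newline="")` because the `newline` argument of `Path.write_text` was only added in Python 3.10.

```python
import tempfile
from pathlib import Path


def write_lf(path: Path, content: str) -> None:
    """Write text without newline translation, so "\n" stays LF on every OS.

    Hypothetical helper for illustration; not part of nasde_toolkit.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" disables the "\n" -> os.linesep translation of text mode.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(content)


def has_crlf(path: Path) -> bool:
    """Return True if the file contains at least one CRLF byte sequence."""
    return b"\r\n" in path.read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "tests" / "test.sh"
        write_lf(script, "#!/bin/bash\ncd /app\n")
        print(has_crlf(script))  # False
```

A check like `has_crlf` reads raw bytes, so it does not depend on the `file` utility being installed on a Windows host. If the project's minimum Python version is below 3.10, the `path.write_text(..., newline="")` call in the patch would raise `TypeError`, so that constraint may be worth confirming.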