58 changes: 58 additions & 0 deletions .claude/skills/nasde-benchmark-creator/SKILL.md
@@ -14,6 +14,52 @@ description: |

Create and configure coding agent benchmarks for evaluation with `nasde`. A benchmark is a set of coding tasks that AI agents solve inside isolated Docker containers, scored both by functional tests (pass/fail) and by an LLM-as-a-Judge architecture assessment.

## Critical: line endings on Windows (read this first)

Benchmark scripts execute inside **Linux** sandboxes (Docker, Daytona). If `tests/test.sh`, `solution/solve.sh`, or `environment/Dockerfile` is checked out with **CRLF** line endings (the usual result with Git for Windows when `core.autocrlf=true` and there is no `.gitattributes`), every trial fails immediately with:

```
bash: line 1: /tests/test.sh: cannot execute: required file not found
```

…because the kernel reads the shebang as `#!/bin/bash\r` and tries to execute a non-existent `/bin/bash\r`. The agent finishes its work, but the verifier never runs and Harbor reports `RewardFileNotFoundError`.
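
To see where the stray carriage return ends up, here is a minimal Python sketch that pulls the interpreter path out of a script's first line, roughly the way the kernel does. The `shebang_interpreter` helper is hypothetical and simplifies the real parsing: it trims only spaces and tabs and drops any interpreter argument.

```python
def shebang_interpreter(script: bytes) -> bytes:
    """Return the interpreter path from the shebang line, or b"" if absent.

    Simplified model: only spaces and tabs are trimmed, so a trailing
    carriage return stays part of the path.
    """
    first_line = script.split(b"\n", 1)[0]
    if not first_line.startswith(b"#!"):
        return b""
    # Trim blanks around the path, then drop any optional interpreter argument.
    return first_line[2:].strip(b" \t").split(b" ", 1)[0]


lf_script = b"#!/bin/bash\ncd /app\n"
crlf_script = b"#!/bin/bash\r\ncd /app\r\n"

print(shebang_interpreter(lf_script))    # b'/bin/bash'
print(shebang_interpreter(crlf_script))  # b'/bin/bash\r'
```

The second result is the path the sandbox tries to execute, which is why the error names a missing file even though `/bin/bash` exists.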

**Mitigation (always do this for a new benchmark — `nasde init` does it for you, but verify):**

1. The benchmark repo MUST have a `.gitattributes` file enforcing LF for shell scripts and Dockerfiles. The minimum content:
```gitattributes
* text=auto eol=lf
*.sh text eol=lf
*.bash text eol=lf
Dockerfile text eol=lf
*.dockerfile text eol=lf
docker-compose.yaml text eol=lf
docker-compose.yml text eol=lf

*.ps1 text eol=crlf
*.bat text eol=crlf
*.cmd text eol=crlf
```
`nasde init` writes this automatically. If you are adding a benchmark to an existing repo without `.gitattributes`, create one before adding any task.

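2. When **writing** `.sh` or `Dockerfile` content programmatically on Windows, write with explicit LF. Do not use `path.write_text(content)`, which translates `\n`→`\r\n` on Windows; use `path.write_text(content, encoding="utf-8", newline="")` (the `newline` parameter requires Python 3.10+) or open the file in binary mode. Note that `newline=""` only disables translation: any `\r\n` already present in `content` is written unchanged.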

3. After committing on Windows for the first time, run:
```bash
git add --renormalize .
git commit -m "normalize line endings"
```
to fix any files that landed before `.gitattributes` was in place.

4. Sanity check before pushing a new task:
```bash
file tasks/<task>/tests/test.sh
# MUST say "with LF line terminators" or omit line-terminator info entirely.
# If it says "with CRLF line terminators" — fix it (`sed -i 's/\r$//' file`).
```

This applies equally when you're **adding tasks to a benchmark someone else created** — if their repo has no `.gitattributes` and you're on Windows, your contribution will silently break for them on Linux CI and vice versa.
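
For rule 2 above, one possible helper is sketched below. It assumes you want LF output regardless of platform or Python version, so it normalizes the text and writes bytes instead of relying on the `newline` argument. The name `write_lf` is illustrative and not part of `nasde`.

```python
from pathlib import Path


def write_lf(path: Path, content: str) -> None:
    """Write content as UTF-8 with LF-only line endings on any platform."""
    # Normalize CRLF and lone CR to LF before encoding.
    normalized = content.replace("\r\n", "\n").replace("\r", "\n")
    path.parent.mkdir(parents=True, exist_ok=True)
    # Binary mode bypasses Python's newline translation entirely.
    path.write_bytes(normalized.encode("utf-8"))
```

A call such as `write_lf(Path("tasks/<task>/tests/test.sh"), script_text)` then produces the same bytes on Windows and Linux, even if `script_text` was assembled from CRLF sources.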

## Step 1: Understand what to evaluate

Before creating files, clarify with the user:
@@ -116,6 +162,8 @@ What the agent must NOT do (e.g., don't modify existing tests).

### environment/Dockerfile (required)

> **Reminder for Windows authors:** the Dockerfile and any helper scripts it `COPY`s in must have LF line endings — Docker tolerates CRLF in some commands but not in `RUN` shell snippets, and any shell script copied with CRLF will hit the same shebang failure as `test.sh`.

```dockerfile
FROM <base-image>

@@ -137,6 +185,8 @@ The Dockerfile MUST be self-contained — the agent starts working immediately.

### tests/test.sh (required — Harbor verifier)

> **Reminder for Windows authors:** this file MUST be saved with LF line endings. See "Critical: line endings on Windows" at the top of this skill. CRLF here = `bash: required file not found` and a wasted trial.

```bash
#!/bin/bash
cd /app
@@ -319,3 +369,11 @@ Before running with a real agent:
```bash
nasde run --variant vanilla --tasks <task-name> --without-eval -C .
```

4. **Final pre-flight for Windows authors.** Verify that no CRLF leaked in:
```bash
find tasks -name '*.sh' -exec sh -c 'file "$1" | grep -q CRLF && echo "BAD: $1"' _ {} \;
find tasks -name 'Dockerfile' -exec sh -c 'file "$1" | grep -q CRLF && echo "BAD: $1"' _ {} \;
# Both should print nothing.
```
If anything prints, fix with `sed -i 's/\r$//' <file>` and re-commit.
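
On hosts where `find` and `file` are not available (for example a plain Windows shell), the same check can be approximated in Python. This is a sketch under the assumption that only `*.sh` files and files named exactly `Dockerfile` matter; `find_crlf_files` is a hypothetical helper, not a `nasde` command.

```python
from pathlib import Path


def find_crlf_files(root: Path) -> list[Path]:
    """Return shell scripts and Dockerfiles under root that contain CRLF."""
    offenders = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".sh" or path.name == "Dockerfile":
            # Read raw bytes so no newline translation hides the problem.
            if b"\r\n" in path.read_bytes():
                offenders.append(path)
    return offenders
```

Calling `find_crlf_files(Path("tasks"))` should return an empty list before you push; any path it returns needs the same fix as above.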
9 changes: 9 additions & 0 deletions .claude/skills/nasde-benchmark-from-history/SKILL.md
@@ -19,6 +19,15 @@ Generate NASDE benchmark tasks by mining git history. You analyze commits, diffs
- An existing NASDE benchmark project (run `nasde init` first, or use the `nasde-benchmark-creator` skill)
- If the benchmark project doesn't exist yet, create it first — this skill generates tasks, not the project scaffold

## Critical: line endings on Windows (read this first)

When generating `tests/test.sh`, `solution/solve.sh`, or `environment/Dockerfile` on a Windows host, write them with **LF** line endings or every trial fails with `bash: required file not found` (the kernel reads `#!/bin/bash\r` as the shebang). See the full explanation and `.gitattributes` template in the `nasde-benchmark-creator` skill.

Quick rules:
- The benchmark project MUST have a `.gitattributes` enforcing `*.sh text eol=lf` and `Dockerfile text eol=lf`. `nasde init` creates this. If the existing project lacks it, **create `.gitattributes` before generating any task files**.
- When writing files programmatically, use `path.write_text(content, encoding="utf-8", newline="")` (Python 3.10+), never the bare default, which translates `\n`→`\r\n` on Windows.
- Sanity-check after generation: `find tasks/<new-task> -name '*.sh' -o -name 'Dockerfile' | xargs file | grep CRLF` should print nothing.
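
The first quick rule can be checked mechanically before generating any task files. The sketch below is a deliberately simplified reading of `.gitattributes`: it looks only for explicit `eol=lf` rules on the two patterns named above and does not reproduce Git's full pattern matching, so a catch-all `* text=auto eol=lf` line is not counted. `missing_lf_rules` and `REQUIRED_LF_PATTERNS` are hypothetical names.

```python
REQUIRED_LF_PATTERNS = ("*.sh", "Dockerfile")


def missing_lf_rules(gitattributes_text: str) -> list[str]:
    """Return required patterns that have no explicit `eol=lf` rule."""
    covered = set()
    for line in gitattributes_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        pattern, *attrs = line.split()
        if "eol=lf" in attrs:
            covered.add(pattern)
    return [p for p in REQUIRED_LF_PATTERNS if p not in covered]


print(missing_lf_rules("*.sh text eol=lf\n*.ps1 text eol=crlf\n"))  # ['Dockerfile']
```

For an authoritative answer on a specific file, `git check-attr eol -- <path>` reports the attribute Git actually applies.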

## Step 1: Identify the source repository and commit range

Ask the user:
9 changes: 9 additions & 0 deletions .claude/skills/nasde-benchmark-from-public-repos/SKILL.md
@@ -19,6 +19,15 @@ Build a diverse NASDE benchmark by curating tasks from multiple public GitHub re
- A clear description of the skill being evaluated (what it does, what kinds of tasks it helps with)
- Internet access (to browse and clone public repositories)

## Critical: line endings on Windows (read this first)

When generating `tests/test.sh`, `solution/solve.sh`, or `environment/Dockerfile` on a Windows host, write them with **LF** line endings or every trial fails with `bash: required file not found` (the kernel reads `#!/bin/bash\r` as the shebang). See the full explanation and `.gitattributes` template in the `nasde-benchmark-creator` skill.

Quick rules:
- The benchmark project MUST have a `.gitattributes` enforcing `*.sh text eol=lf` and `Dockerfile text eol=lf`. `nasde init` creates this. If the existing project lacks it, **create `.gitattributes` before generating any task files**.
- When writing files programmatically, use `path.write_text(content, encoding="utf-8", newline="")` (Python 3.10+), never the bare default, which translates `\n`→`\r\n` on Windows.
- Sanity-check after generation: `find tasks/<new-task> -name '*.sh' -o -name 'Dockerfile' | xargs file | grep CRLF` should print nothing.

## Step 1: Understand the skill under test

Ask the user:
46 changes: 46 additions & 0 deletions .gitattributes
@@ -0,0 +1,46 @@
# Default: let Git detect text vs binary, force LF in working tree.
# Critical: shell scripts and Dockerfiles MUST be LF — they are executed by
# Linux interpreters in benchmark sandboxes (Daytona, Docker). CRLF causes
# `bash: required file not found` because the shebang becomes `#!/bin/bash\r`.
* text=auto eol=lf

# Source code that runs in Linux containers / cross-platform tooling: force LF.
*.sh text eol=lf
*.bash text eol=lf
*.py text eol=lf
Dockerfile text eol=lf
*.dockerfile text eol=lf
docker-compose.yaml text eol=lf
docker-compose.yml text eol=lf
*.toml text eol=lf
*.yaml text eol=lf
*.yml text eol=lf
*.json text eol=lf
*.md text eol=lf
Makefile text eol=lf
*.mk text eol=lf

# PowerShell expects CRLF on Windows. Keep as-is so PS5.1 parses cleanly.
*.ps1 text eol=crlf
*.psd1 text eol=crlf
*.psm1 text eol=crlf

# Windows batch files require CRLF.
*.bat text eol=crlf
*.cmd text eol=crlf

# Binary assets: never touch line endings.
*.png binary
*.jpg binary
*.jpeg binary
*.gif binary
*.ico binary
*.pdf binary
*.zip binary
*.gz binary
*.tgz binary
*.tar binary
*.whl binary
*.so binary
*.dll binary
*.exe binary
46 changes: 45 additions & 1 deletion src/nasde_toolkit/scaffold/__init__.py
@@ -101,6 +101,49 @@
jobs/
"""

GITATTRIBUTES_TEMPLATE = """\
# Critical: files executed inside benchmark sandboxes (Linux containers via
# Docker / Daytona / Modal / etc.) MUST be LF. CRLF on a shebang line causes
# `bash: required file not found` because the kernel reads `#!/bin/bash\\r`.
* text=auto eol=lf

*.sh text eol=lf
*.bash text eol=lf
Dockerfile text eol=lf
*.dockerfile text eol=lf
docker-compose.yaml text eol=lf
docker-compose.yml text eol=lf
*.toml text eol=lf
*.yaml text eol=lf
*.yml text eol=lf
*.json text eol=lf
*.md text eol=lf
*.py text eol=lf

# PowerShell / Windows batch keep CRLF.
*.ps1 text eol=crlf
*.psd1 text eol=crlf
*.psm1 text eol=crlf
*.bat text eol=crlf
*.cmd text eol=crlf

# Binary assets — never touch line endings.
*.png binary
*.jpg binary
*.jpeg binary
*.gif binary
*.ico binary
*.pdf binary
*.zip binary
*.gz binary
*.tar binary
*.tgz binary
*.whl binary
*.so binary
*.dll binary
*.exe binary
"""


def create_project(project_dir: Path, name: str) -> None:
    """Scaffold a new evaluation project structure."""
@@ -111,6 +154,7 @@ def create_project(project_dir: Path, name: str) -> None:
    _write_if_missing(project_dir / "nasde.toml", NASDE_TOML_TEMPLATE.format(name=name))
    _write_if_missing(project_dir / "assessment_dimensions.json", ASSESSMENT_DIMENSIONS_TEMPLATE)
    _write_if_missing(project_dir / ".gitignore", GITIGNORE_TEMPLATE)
    _write_if_missing(project_dir / ".gitattributes", GITATTRIBUTES_TEMPLATE)

    tasks_dir.mkdir(parents=True, exist_ok=True)
    variants_dir.mkdir(parents=True, exist_ok=True)
@@ -158,4 +202,4 @@ def _write_if_missing(path: Path, content: str) -> None:
         console.print(f" [yellow]Skipping[/yellow] {path.name} (already exists)")
         return
     path.parent.mkdir(parents=True, exist_ok=True)
-    path.write_text(content)
+    path.write_text(content, encoding="utf-8", newline="")
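
The effect of the `newline=""` argument in `_write_if_missing` can be reproduced on any platform. The sketch below wraps an in-memory buffer in `io.TextIOWrapper`; passing `newline="\r\n"` stands in for what the default `newline=None` does on Windows. `encode_like_text_mode` is a hypothetical helper written for this illustration.

```python
import io
from typing import Optional


def encode_like_text_mode(content: str, newline: Optional[str]) -> bytes:
    """Return the bytes a text-mode write would produce for `newline`."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", newline=newline)
    stream.write(content)
    stream.flush()
    data = buffer.getvalue()
    stream.detach()  # keep the wrapper from closing the buffer on cleanup
    return data


script = "#!/bin/bash\ncd /app\n"

# newline="\r\n" mimics the default (newline=None) translation on Windows.
print(encode_like_text_mode(script, "\r\n"))  # b'#!/bin/bash\r\ncd /app\r\n'
# newline="" disables translation, matching the patched write_text call.
print(encode_like_text_mode(script, ""))      # b'#!/bin/bash\ncd /app\n'
```

Because the templates in this module use `\n` only, disabling translation is enough to make the scaffolded `.gitattributes` and other files LF on every host.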