diff --git a/README.md b/README.md index fd1c5b81..0c495d28 100644 --- a/README.md +++ b/README.md @@ -128,7 +128,7 @@ Options: --spec PATH Build a single spec file (skip discovery) --dry-run Discover specs and print plan, no execution --max-commits-per-pr N PR commit cap (default: 8, max: 100) - --max-loc-per-pr N PR line-of-code cap (default: 400, range: 50-5000) + --max-loc-per-pr N PR line-of-code cap (default: 250, range: 50-5000) --platform PLATFORM github, gitlab, or auto (default: auto) --parallel N Concurrent agentic loops, HF engine only (default: 1) --skip-auth-check Skip gh/glab auth validation (for CI with GITHUB_TOKEN) @@ -146,7 +146,7 @@ Environment variables: Codelicious produces small, focused, human-reviewable PRs by default: - **8 commits max per PR** (`--max-commits-per-pr`, range 1–100) -- **400 LOC max per PR** (`--max-loc-per-pr`, range 50–5000) +- **250 LOC max per PR** (`--max-loc-per-pr`, range 50–5000) When either cap is hit, the current PR is transitioned to review and a continuation branch (`-part-2`, `-part-3`, ...) is opened so work diff --git a/docs/specs/28_bite_sized_continuous_prs_v1.md b/docs/specs/28_bite_sized_continuous_prs_v1.md index e9fe27e6..a5ec9fda 100644 --- a/docs/specs/28_bite_sized_continuous_prs_v1.md +++ b/docs/specs/28_bite_sized_continuous_prs_v1.md @@ -42,7 +42,7 @@ many small PRs rather than a few large ones. the old default - [x] Update the banner print line that echoes the cap so it reflects the new default -- [x] Add a `--max-loc-per-pr` CLI flag, default `400`, range 50–5000, +- [x] Add a `--max-loc-per-pr` CLI flag, default `250`, range 50–5000, mapped via the same `_INT_KEYS` mechanism as `max_commits_per_pr` - [x] Pass `max_loc_per_pr` into `V2Orchestrator(...)` constructor (constructor accepts the kwarg now; cap-enforcement logic lands in Phase 2.2) @@ -50,9 +50,9 @@ many small PRs rather than a few large ones. **Claude Code prompt:** > Open `src/codelicious/cli.py`. In `_parse_args`, change the default > for `max_commits_per_pr` from 50 to 8. Add a new option -> `max_loc_per_pr` with default 400, validated to be between 50 and 5000, +> `max_loc_per_pr` with default 250, validated to be between 50 and 5000, > and add the flag `--max-loc-per-pr` to the arg map. Update `_INT_KEYS` -> to include it. In `main()`, pass `max_loc_per_pr=opts.get("max_loc_per_pr", 400)` +> to include it. In `main()`, pass `max_loc_per_pr=opts.get("max_loc_per_pr", 250)` > when constructing `V2Orchestrator`. Update the banner to print both caps. > Do not change behavior when the operator explicitly sets the cap. @@ -97,7 +97,7 @@ many small PRs rather than a few large ones. **File:** `src/codelicious/orchestrator.py` -- [x] Add `max_loc_per_pr: int = 400` parameter to `V2Orchestrator.__init__` +- [x] Add `max_loc_per_pr: int = 250` parameter to `V2Orchestrator.__init__` (done in Phase 1.1) - [x] In `run()`, after the existing commit-cap check (around line 1204), add a parallel check: if `max_loc_per_pr > 0` and @@ -111,7 +111,7 @@ many small PRs rather than a few large ones. **Claude Code prompt:** > In `src/codelicious/orchestrator.py`, extend `V2Orchestrator`. Add -> `max_loc_per_pr: int = 400` to `__init__`. Extract the existing +> `max_loc_per_pr: int = 250` to `__init__`. Extract the existing > commit-cap split logic in `run()` (around line 1206) into a helper > method `_split_pr_and_continue` that takes `spec_id_str`, `spec_title`, > current `pr_part`, and returns `(new_pr_part, new_pr_number)`. Then @@ -256,7 +256,7 @@ and `TestSkipCredentialProbeFlag` (2 cases). Full suite: 1901 passed. ## Acceptance Criteria - [ ] Running `codelicious .` on a multi-task spec produces PRs of ≤ 8 commits - and ≤ 400 LOC each, splitting into part-2 / part-3 branches as needed. + and ≤ 250 LOC each, splitting into part-2 / part-3 branches as needed. *Verified via unit tests; live end-to-end run requires a real repo + `gh` auth and is left as a manual smoke test.* - [x] Existing tests still pass: `pytest` → **1901 passed**. diff --git a/docs/specs/spec-v29_gap_closure_v1.md b/docs/specs/spec-v29_gap_closure_v1.md new file mode 100644 index 00000000..2d19467d --- /dev/null +++ b/docs/specs/spec-v29_gap_closure_v1.md @@ -0,0 +1,599 @@ +--- +version: 1.0.0 +status: Complete +created: 2026-05-03 +updated: 2026-05-04 +related_specs: ["27_codelicious_v2_rewrite.md", "28_bite_sized_continuous_prs_v1.md", "06_production_hardening.md", "07_sandbox_security_hardening.md"] +progress: + completed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] + pending: [] +--- + +# Spec v29 — Gap Closure (2026-05-03) + +## Background + +Codelicious has evolved across 28 specs and a v2 rewrite (spec 27) plus a +bite-sized-PR tightening pass (spec 28). The codebase is mostly aligned to +the v2 architecture, but a deep audit surfaced **18 concrete gaps** between +what is specified and what is implemented, plus several low-grade resilience +and consistency issues. + +Headline findings: + +* **Spec drift:** spec 28 documents a default LOC cap of 400, but a recent + commit (`258e5e90 lower default PR cap to 250 LOC`) silently changed the + runtime default to 250 without updating the spec. The spec and the code + now disagree. +* **Legacy 4-phase orchestrator (`Orchestrator` BUILD/MERGE/REVIEW/FIX) is + still present** alongside the v2 chunk-based `V2Orchestrator`, yet spec 27 + Phase 4.1 explicitly mandates removal. ~1,100 lines of dead code remain + reachable through the legacy `run_build_cycle()` engine path. +* **No jitter on exponential backoff.** `LLMClient` and `loop_controller` + use plain `2**attempt` backoff, which CLAUDE.md ("LLM calls use + exponential backoff with jitter for transient errors") explicitly forbids. +* **Two test files are missing for major modules** (`scaffolder.py`'s + ~558 LOC has shallow tests, `tools/audit_logger.py` has none of its own + test file), and `chunker.chunk_spec_with_llm` has no dedicated tests. +* The Claude engine's `execute_chunk` does **not pass `stream-json`, + `--output-format`, or the `--allowedTools` list specified in spec 27 + Phase 3.2** — the prompt is built but the CLI flags spec 27 mandates are + not visible at the call site (they may live in `agent_runner`, but the + spec wants them per-chunk). + +This spec closes those gaps. It is ordered by severity: critical security +and correctness items first, then resilience, then consistency / cleanup, +then test coverage. Each step is self-contained — pick one off the top, +execute it, ship a small PR, repeat. + +## Gap Inventory + +| # | Title | Severity | +|---|---|---| +| 1 | Spec 28 documents 400 LOC default but code uses 250 | High (consistency) — **DONE** | +| 2 | Legacy 4-phase `Orchestrator` not removed per spec 27 Phase 4.1 | High (drift) — **DONE** (orchestrator.py 1499→501 LOC; `V2Orchestrator` renamed to `Orchestrator`, alias retained; `tests/test_orchestrator.py` deleted) | +| 3 | LLM backoff has no jitter (CLAUDE.md violation) | High (resilience) — **DONE** | +| 4 | `LLMClient` rate-limit retry ignores `Retry-After` header | High (resilience) — **DONE** | +| 5 | `agent_runner` Claude rate-limit detection exits with retry_after_s=60 always — no provider-supplied delay | Medium (resilience) — **DONE** (`_parse_claude_reset_seconds`) | +| 6 | `verify_chunk` runs the entire repo's verifier — not chunk-scoped (spec 27 §3.1) | Medium (correctness) — **DONE** (`verify_paths` runs scoped ruff/bandit/pytest; both engines call it when chunk has files) | +| 7 | `chunk_spec_with_llm` truncates spec body at 5000 chars silently | Medium (correctness) — **DONE** (`_split_spec_for_llm` produces overlapping windows up to `_LLM_MAX_WINDOWS`; deduped on title) | +| 8 | `engines/base.py` still exposes the legacy `run_build_cycle` abstract method, blocking removal of legacy path | Medium (drift) — **DONE** (`run_build_cycle` and `BuildResult` removed; engine ABC has exactly 3 abstract methods) | +| 9 | `claude_engine.execute_chunk` doesn't surface `--allowedTools`/`--output-format stream-json` flags from spec 27 §3.2 at the chunk call-site | Medium (drift) — **DONE** (`_DEFAULT_ALLOWED_TOOLS` + `_ChunkConfig.allowed_tools/output_format`; `_build_agent_command` consumes them) | +| 10 | No deadline enforcement *per chunk* in `V2Orchestrator.run()` | Medium (resilience) — **DONE** (`>=` deadline gate before each `engine.execute_chunk` call) | +| 11 | `huggingface_engine.execute_chunk` lacks the reflection step's verification — model can fake `CHUNK_COMPLETE` | Medium (correctness) — **DONE** (`_HF_MAX_FIX_ATTEMPTS=2` cap on fix-cycle loop; verify must pass before success=True) | +| 12 | `audit_logger.py` has no dedicated unit-test file | Medium (testing) — **DONE** (`tests/test_audit_logger.py`, 15 cases) | +| 13 | `chunk_spec_with_llm` has no dedicated unit-test coverage | Medium (testing) — **DONE** (`TestChunkSpecWithLlm`, 10 cases) | +| 14 | `_probe_git_credentials` in `cli.py` defaults conservative on failure but doesn't distinguish `ssh-add` exit-1 (no agent) from exit-0-empty (agent but no keys) | Low (UX) — **DONE** (`ssh_agent_state` field) | +| 15 | Sandbox post-write `O_NOFOLLOW` re-check is conditional on `hasattr(os, "O_NOFOLLOW")` — no fallback warning when running on a platform without it (Windows) | Low (security) — **DONE** | +| 16 | `pyproject.toml` declares `Python :: 3.14` classifier without CI matrix coverage | Low (consistency) — **DONE** (CI matrix `3.10/3.11/3.12/3.13/3.14-dev` matches classifiers; 3.14-dev marked `continue-on-error`) | +| 17 | `prompts.py` retains multi-phase legacy templates that spec 27 §6.2 marked for removal | Low (cleanup) — **DONE** (templates already absent; live set documented in module docstring; AGENT_BUILD_SPEC stays until Step 2 lands) | +| 18 | `_DEFAULT_PR_LOC_CAP` constant is duplicated as a literal `250` in cli.py and `400` in the README/spec — single source of truth missing | Low (consistency) — **DONE** (Step 1 added `_DEFAULT_PR_LOC_CAP`; Step 18 added `_FORBIDDEN_PATTERNS_DOC` tuple in scaffolder) | + +--- + +## Step-by-step Fix Instructions + +### Step 1: Reconcile spec 28 with the code's 250 LOC default [High] + +**Context:** Commit `258e5e90` lowered the runtime default from 400 to 250 +LOC per PR but the spec 28 still says 400 and the README's "Bite-sized PR +mode" example uses 400. Operators reading the spec will be surprised the +tool splits earlier than documented. + +**Prompt for Claude Code:** +> Open `docs/specs/28_bite_sized_continuous_prs_v1.md` and update every +> reference from `400` LOC to `250` LOC: section 1.1 ("max_loc_per_pr CLI +> flag, default 400"), the Claude-Code prompt block in 1.1 that sets +> `max_loc_per_pr=opts.get("max_loc_per_pr", 400)`, the 2.2 default in the +> `__init__` signature, and the Acceptance Criteria line that says "≤ 400 +> LOC each". Also open `README.md`, find the "Bite-sized PR mode" section, +> and adjust the documented default. Then introduce a single-source-of- +> truth constant `_DEFAULT_PR_LOC_CAP = 250` (and `_DEFAULT_PR_COMMIT_CAP = +> 8`) in `src/codelicious/cli.py` near the existing `_INT_KEYS` +> declaration; replace the four magic-number occurrences in `_parse_args` +> defaults, banner print lines, and the `V2Orchestrator(...)` call site. +> Add a regression test to `tests/test_cli.py` that asserts the parsed +> `max_loc_per_pr` default equals the constant. + +**Acceptance criteria:** +- Spec 28 and README consistently quote 250 LOC. +- A single named constant drives the default in `cli.py`. +- Test asserts `_parse_args([])["max_loc_per_pr"] == 250`. + +--- + +### Step 2: Remove the legacy 4-phase `Orchestrator` [High] + +**Context:** Spec 27 Phase 4.1 mandates "Remove the 4-phase model +(BUILD → MERGE → REVIEW → FIX) — the new model is simpler". `V2Orchestrator` +exists in `orchestrator.py` (line 1124+) but the legacy `Orchestrator` +class (lines 1–1123) is still present, still exported in `__all__`, and +still entry-pointed via `engines/*.run_build_cycle()`. This is dead code +weight + drift risk. + +**Prompt for Claude Code:** +> Open `src/codelicious/orchestrator.py` and identify the legacy +> `Orchestrator` class plus its supporting helpers `REVIEWER_PROMPTS`, +> `_MERGE_ABORT_TIMEOUT_S`, and any worktree/merge code that +> `V2Orchestrator` does not call. Trace callers: `grep -rn +> "Orchestrator(" src/ tests/` to find them. Migrate any test that still +> targets the old class to `V2Orchestrator` if equivalent coverage is +> needed; otherwise delete the test. Remove the `Orchestrator` class and +> its helpers, drop them from `__all__`, and update `cli.py` and engine +> entry points so only `V2Orchestrator` is referenced. After deletion, +> rename `V2Orchestrator` → `Orchestrator` (since v2 is now THE +> orchestrator) but keep a `V2Orchestrator = Orchestrator` alias for one +> release cycle to avoid breaking any external import. Also update +> `engines/base.py` to drop `run_build_cycle` from the abstract surface +> (Step 8 covers this in detail). Run `pytest -x` and confirm green. + +**Acceptance criteria:** +- `grep -rn "BUILD.*MERGE.*REVIEW.*FIX" src/` returns zero hits. +- `pytest` still passes. +- Net diff is a deletion of >800 LOC. +- No test imports `Orchestrator` to mean the old 4-phase variant. + +--- + +### Step 3: Add jitter to all exponential backoff sites [High] + +**Context:** CLAUDE.md states "LLM calls use exponential backoff with +jitter for transient errors (429, 5xx)". Today +`src/codelicious/llm_client.py:221` uses `backoff = self._BACKOFF_BASE_S * +(2**attempt)` with no jitter, and `loop_controller.py:28` uses +`_LLM_BACKOFF_BASE_S` similarly. Without jitter, multiple concurrent +codelicious runs against the same provider thunder-herd on the same retry +cycle. + +**Prompt for Claude Code:** +> Add a `_jittered_backoff(base: float, attempt: int) -> float` helper to +> `src/codelicious/llm_client.py` that returns `base * (2**attempt) * +> (0.5 + secrets.SystemRandom().random())` — i.e. multiplicative jitter +> in the range [0.5x, 1.5x]. Use the `secrets` module per the security +> rules ("Use secrets module for random token generation, not random") — +> `SystemRandom().random()` is the correct primitive here. Replace both +> `time.sleep(backoff)` call sites in `chat_completion()` (HTTP retry +> branch and network-error retry branch) to use the new helper. Make the +> identical change in `src/codelicious/loop_controller.py` around line +> 200 where `_LLM_BACKOFF_BASE_S` drives a sleep. Add a unit test in +> `tests/test_llm_client.py` that monkey-patches `secrets.SystemRandom` +> and asserts the actual sleep duration falls within +> `[0.5*expected, 1.5*expected]` for three consecutive attempts. + +**Acceptance criteria:** +- Both retry call sites use jittered backoff. +- Jitter is sourced from `secrets`, not `random`. +- New test verifies jitter range. + +--- + +### Step 4: Honor provider-supplied `Retry-After` on 429 in LLMClient [High] + +**Context:** `llm_client.py` lines 213–232 handle HTTP 429 by using its +own backoff schedule. HuggingFace and most providers return a `Retry-After` +header (seconds or HTTP date). Ignoring it wastes our retry budget. + +**Prompt for Claude Code:** +> In `src/codelicious/llm_client.py`, inside the +> `except urllib.error.HTTPError as e:` branch, after sanitizing the +> error body, read `e.headers.get("Retry-After")`. Parse it as: integer +> seconds first, then HTTP-date (use +> `email.utils.parsedate_to_datetime`). Cap parsed values at 120 seconds +> (defensive: a malicious or buggy provider could send a huge value). If +> `Retry-After` parses successfully and `e.code == 429`, use that value +> instead of the computed exponential backoff (still apply jitter from +> Step 3). Log at INFO level when the provider-supplied value is used: +> "Honoring Retry-After: %.1fs for HTTP 429". Add a test in +> `tests/test_llm_client.py` that builds a mock `HTTPError` whose +> `headers` provides `Retry-After: 7` and asserts `time.sleep` was called +> with a value in [3.5, 10.5]. + +**Acceptance criteria:** +- `Retry-After` header (integer or HTTP-date) is parsed and respected. +- Cap of 120 s applied. +- Jitter still applied on top. +- New test verifies behavior. + +--- + +### Step 5: Surface provider Retry-After in agent_runner Claude rate-limit path [Medium] + +**Context:** `agent_runner.py:263` raises `ClaudeRateLimitError(..., +retry_after_s=60.0)` as a hardcoded constant. The Claude CLI's stderr +message often includes the actual reset window ("resets at 5pm"); we +should parse it where present. + +**Prompt for Claude Code:** +> In `src/codelicious/agent_runner.py`, near the rate-limit detection +> block at line 243–264, add a helper `_parse_claude_reset_seconds(text: +> str) -> float | None` that scans `text` for patterns like +> `resets in (\d+) seconds`, `try again in (\d+)m`, or +> `Retry-After: (\d+)`. Return parsed seconds (clamped to [10, 3600]) or +> None. In the existing rate-limit branch, call the parser on +> `(stderr_text + stdout_text)` and pass the result (or 60.0) to +> `ClaudeRateLimitError(retry_after_s=...)`. Add tests in +> `tests/test_agent_runner.py` covering: no match → 60.0, "resets in 30 +> seconds" → 30.0, "try again in 5m" → 300.0, value above cap → 3600.0. + +**Acceptance criteria:** +- Helper parses three formats and clamps. +- Tests cover all four cases. +- Existing rate-limit tests still pass. + +--- + +### Step 6: Make `verify_chunk` chunk-scoped, not whole-repo [Medium] + +**Context:** Spec 27 §3.1 specifies `verify_chunk` "Verify a *completed +chunk* passes lint/test/security checks". The current Claude implementation +(`engines/claude_engine.py:152`) and HF implementation both call the +repo-wide `verifier.verify(repo_path)`. On a 50-chunk run this re-runs the +entire test suite 50 times. + +**Prompt for Claude Code:** +> In `src/codelicious/verifier.py`, add a new function `verify_paths(repo: +> Path, paths: list[Path]) -> VerifyResult` that runs lint/security +> checks scoped to the provided paths (pass them as positional args to +> `ruff check` and `bandit`) and the test suite scoped via +> `pytest ` where `` are derived +> by mapping each modified `src/.py` to `tests/test_.py` +> if it exists, falling back to the full suite when no mapping exists. +> Update `engines/claude_engine.py:verify_chunk` and +> `engines/huggingface_engine.py:verify_chunk` to use the new +> `verify_paths(repo, chunk.estimated_files + result.files_modified)` +> when `files_modified` is non-empty; fall back to `verify(repo_path)` +> otherwise. Add tests in `tests/test_verifier.py::TestVerifyPaths` +> covering: empty paths → falls back to full verify, single src file with +> matching test → only that test runs, source file with no matching test +> → falls back to full suite. + +**Acceptance criteria:** +- Chunk-scoped verification runs only mapped tests when mapping exists. +- Lint and bandit only scan modified files. +- 3 new tests pass. + +--- + +### Step 7: Stop silent truncation in `chunk_spec_with_llm` [Medium] + +**Context:** `chunker.py:212` truncates the spec body to 5000 chars +(`spec_content[:5000]`). For a 12k-line spec like 27, the LLM only sees the +first ~10% and silently produces an incomplete chunking. There is no +warning logged. + +**Prompt for Claude Code:** +> In `src/codelicious/chunker.py`, replace the bare `[:5000]` slice with +> a windowing strategy: if `len(spec_content) > 5000`, log a WARNING +> ("Spec too large for single LLM pass: %d chars, splitting into N +> windows of ~4500 chars with 500-char overlap") and call the LLM once +> per window, merging the resulting chunk lists with deduplication on +> normalized title. Cap total windows at 10 to prevent runaway. If +> windowing produces > `_MAX_CHUNKS_PER_SPEC` items, truncate to the cap +> and log a WARNING. Add tests in `tests/test_chunker.py` covering: short +> spec (no windowing), exactly-5000-char spec (no windowing), 12k-char +> spec (3 windows expected), 60k-char spec (capped at 10 windows + warning). + +**Acceptance criteria:** +- Long specs no longer silently lose content. +- Window count and overlap configurable via module-level constants. +- 4 new tests pass. + +--- + +### Step 8: Drop `run_build_cycle` from the engine abstract surface [Medium] + +**Context:** `engines/base.py:135` still declares `run_build_cycle` as +`@abc.abstractmethod`. Spec 27 §6.1 says "Merge `executor.py` into +`orchestrator.py`" — the same simplification mandates removing the +legacy entry from the engine interface so engines can't accidentally +re-grow a 6-phase loop. + +**Prompt for Claude Code:** +> Open `src/codelicious/engines/base.py`. Remove the abstract +> `run_build_cycle` method, the `BuildResult` dataclass, and the legacy +> docstring sections (the docstring should now describe only the +> chunk-level interface). In `engines/claude_engine.py` and +> `engines/huggingface_engine.py`, remove the `run_build_cycle` +> implementations. Verify that nothing in `cli.py` or `orchestrator.py` +> still calls `run_build_cycle` (Step 2 should have already eliminated +> this; if any remains, route it through `V2Orchestrator.run()`). Update +> `tests/test_engine_base.py` to reflect the new minimal abstract +> surface — assert that subclasses without `execute_chunk` / +> `verify_chunk` / `fix_chunk` raise TypeError on instantiation. + +**Acceptance criteria:** +- Engine ABC has exactly three abstract methods. +- `run_build_cycle` and `BuildResult` are gone. +- `pytest tests/test_engine_base.py` passes. + +--- + +### Step 9: Surface Claude CLI flags from spec 27 §3.2 at the chunk call site [Medium] + +**Context:** Spec 27 §3.2 specifies the exact Claude CLI invocation: +`claude -p "{prompt}" --output-format stream-json --max-turns 50 +--allowedTools "Edit,Write,Bash(git status:*),Bash(pytest:*),Bash(ruff:*), +Read,Glob,Grep"`. `claude_engine.py:execute_chunk` builds a `_ChunkConfig` +with only `model`, `effort`, `max_turns`, `agent_timeout_s`, `dry_run` — +the allowedTools allow-list and stream-json output format are not visible +at the chunk site (they live in `agent_runner` defaults, but spec 27 wants +chunk-level configurability so future engines can tighten the tool list). + +**Prompt for Claude Code:** +> Extend `_ChunkConfig` in `engines/claude_engine.py:execute_chunk` to +> include `allowed_tools: list[str]` and `output_format: str` attributes, +> defaulting to the exact list from spec 27 §3.2 and `"stream-json"`. +> Open `src/codelicious/agent_runner.py`, find where the `claude` argv is +> assembled (search for `--max-turns` or `--allowedTools`), and make those +> values read from `config.allowed_tools` and `config.output_format` when +> set, falling back to current defaults. Add a unit test in +> `tests/test_agent_runner.py` (or `test_engine_claude_chunk.py`) that +> patches `subprocess.Popen` and asserts the argv contains the expected +> `--allowedTools` string and `--output-format stream-json`. + +**Acceptance criteria:** +- Chunk-level config controls allow-list and output format. +- Default allow-list matches spec 27 verbatim. +- Argv-assertion test passes. + +--- + +### Step 10: Enforce per-chunk deadline in `V2Orchestrator.run()` [Medium] + +**Context:** CLAUDE.md: "Always check the build deadline before starting a +new phase". `V2Orchestrator.run()` accepts a `deadline: float = 0.0` +parameter (orchestrator.py:1202) but the per-chunk loop body does not +check it before each chunk. + +**Prompt for Claude Code:** +> In `src/codelicious/orchestrator.py:V2Orchestrator.run()`, locate the +> for-each-chunk loop. Before invoking `engine.execute_chunk(...)`, add: +> `if self.deadline and time.monotonic() >= self.deadline: logger.warning( +> "Deadline reached after %d/%d chunks; stopping early.", ...); +> break`. Pass the *remaining* deadline to the engine via +> `EngineContext(deadline=self.deadline - 0)` so engines can budget their +> own subprocess timeouts. Add a test in +> `tests/test_v2_orchestrator.py::TestDeadlineEnforcement` that uses a +> mock engine; sets `deadline = time.monotonic() - 1` (already expired) +> and asserts zero chunks executed. A second test sets a future deadline +> and asserts all chunks ran. + +**Acceptance criteria:** +- Loop breaks cleanly when deadline elapses. +- Engine receives current deadline. +- 2 new tests pass. + +--- + +### Step 11: Add a verification gate before HF engine returns CHUNK_COMPLETE [Medium] + +**Context:** Spec 27 §3.3 says the HF engine's reflection step should +"review its own changes and fix any issues before signaling +CHUNK_COMPLETE". Today the HF engine accepts the model's +`CHUNK_COMPLETE` token at face value (huggingface_engine.py:188 reflection +step exists, but its output is *advisory*, not gating). + +**Prompt for Claude Code:** +> In `src/codelicious/engines/huggingface_engine.py:execute_chunk`, after +> the reflection step (around line 188–220), add a post-reflection +> verification: if `self.verify_chunk(chunk, repo_path)` returns +> `success=False`, do NOT return `ChunkResult(success=True)`. Instead, set +> the model up for one fix-cycle iteration (call `fix_chunk` with the +> failure message, then re-verify). Cap at 2 fix-cycle attempts to +> prevent infinite loops. If still failing, return +> `ChunkResult(success=False, message="HF engine could not satisfy +> verification after 2 fix attempts")`. Add a test in +> `tests/test_engine_huggingface_chunk.py` that mocks `verify_chunk` to +> fail twice then pass, asserts 2 fix attempts and a final success; a +> second test where verify always fails asserts the final ChunkResult is +> not-success and `retries_used >= 2`. + +**Acceptance criteria:** +- HF engine never returns success=True without passing verification. +- Hard cap of 2 fix iterations. +- 2 new tests pass. + +--- + +### Step 12: Add dedicated test file for `tools/audit_logger.py` [Medium] + +**Context:** `src/codelicious/tools/audit_logger.py` is 256 lines and is +referenced indirectly via `test_registry.py::test_dispatch_calls_audit_ +logger`, but has no `tests/test_audit_logger.py`. Audit logging is a +security-relevant code path (every tool call goes through it) and deserves +direct coverage. + +**Prompt for Claude Code:** +> Create `tests/test_audit_logger.py`. Read +> `src/codelicious/tools/audit_logger.py` to enumerate its public surface. +> Write tests for: file rotation (writing > size threshold spawns a new +> file), redaction (an entry containing a fake API key like +> `sk-test-12345abcdef` does not appear in the rotated file), atomic +> append behavior (use `os.fork` is forbidden — instead use threads: +> 10 threads each appending 100 entries should yield 1000 distinct lines +> with no truncation), and the JSON-line format (each line parses as +> JSON). Use `pytest` fixtures with `tmp_path`. Aim for 8+ test cases. + +**Acceptance criteria:** +- File `tests/test_audit_logger.py` exists with 8+ tests. +- Coverage of `audit_logger.py` rises above 90%. +- All tests pass. + +--- + +### Step 13: Add `chunk_spec_with_llm` test coverage [Medium] + +**Context:** `chunker.py:165` exposes a public function that has been +implemented but has no dedicated test cases (existing +`tests/test_chunker.py` only exercises the deterministic `chunk_spec`). +With Step 7's windowing logic landing, the LLM path needs explicit +coverage. + +**Prompt for Claude Code:** +> In `tests/test_chunker.py`, add a test class +> `TestChunkSpecWithLLM` that uses a mock `llm_client` whose +> `chat_completion` method returns canned JSON. Cover: valid 3-chunk +> response (assert 3 WorkChunks produced with correct titles), JSON +> wrapped in markdown code fences (parser strips them), invalid JSON +> (falls back to deterministic `chunk_spec`), circular dependency (chunk +> A depends_on B, B depends_on A — assert the bad pair is dropped or the +> entire response is rejected and we fall back), path-traversal attempt +> (a chunk's `files: ["../../etc/passwd"]` — assert the file hint is +> sanitized away), and exceeding `_MAX_CHUNKS_PER_SPEC` (mock returns 150 +> chunks — assert truncated to 100 with a warning). Use `caplog` to +> assert warnings. + +**Acceptance criteria:** +- 6 new tests under `TestChunkSpecWithLLM`. +- All pass. +- Coverage of `chunk_spec_with_llm` reaches > 95%. + +--- + +### Step 14: Distinguish ssh-add states more precisely [Low] + +**Context:** `cli.py:_probe_git_credentials` treats `ssh-add` failure as +"not loaded". But `ssh-add -l` returns exit-1 in two distinct cases: no +agent running vs. agent running but empty. The conservative-default +behavior is correct, but the user-facing prompt could be sharper. + +**Prompt for Claude Code:** +> In `src/codelicious/cli.py:_probe_git_credentials`, capture the exact +> exit code from `ssh-add -l`: 0 = keys present, 1 = agent running but +> no keys, 2 = no agent. Return an extra dict key `ssh_agent_state` +> with one of `"keys_loaded"`, `"empty"`, `"no_agent"`, `"unknown"`. In +> `_ensure_git_credentials_unlocked`, branch on this value: when +> `no_agent`, suggest `eval $(ssh-agent)` first; when `empty`, just run +> `ssh-add`. Update existing tests in +> `tests/test_cli.py::TestProbeGitCredentials` to verify the new key, +> and add a new test for the `no_agent` branch (mock subprocess to +> return exit 2). + +**Acceptance criteria:** +- New `ssh_agent_state` key in the probe dict. +- Tests cover all three states. +- User-facing message changes by state. + +--- + +### Step 15: Sandbox: explicit Windows fallback warning when O_NOFOLLOW unavailable [Low] + +**Context:** `sandbox.py:387`: `if os.path.exists(str(resolved)) and +hasattr(os, "O_NOFOLLOW"):`. On platforms lacking `O_NOFOLLOW` (Windows +mainly), the symlink TOCTOU re-check is silently skipped. Codelicious +claims "TOCTOU-safe operations" in CLAUDE.md. + +**Prompt for Claude Code:** +> In `src/codelicious/sandbox.py`, locate the `hasattr(os, "O_NOFOLLOW")` +> guard around line 387. Replace the bare `if not hasattr(...)` skip +> with a startup-time check (in `Sandbox.__init__`): if +> `os.O_NOFOLLOW` is unavailable, log a single WARNING: +> "Sandbox: os.O_NOFOLLOW unavailable on this platform — TOCTOU +> protection on atomic writes is best-effort. Run codelicious on a POSIX +> system for full guarantees." Add a test in `tests/test_sandbox.py` +> that monkey-patches `os` to remove `O_NOFOLLOW` and asserts the +> warning is emitted exactly once per Sandbox instance. + +**Acceptance criteria:** +- Operators see a clear warning on Windows. +- Test verifies single-emission behavior. + +--- + +### Step 16: Align Python classifier list with CI matrix [Low] + +**Context:** `pyproject.toml` lists Python 3.10 through 3.14 as +supported but the project's CI (if any) likely doesn't cover all five. +Claiming 3.14 support is forward-looking but unverified. + +**Prompt for Claude Code:** +> Open `pyproject.toml`. Run `gh workflow list` and inspect the test +> workflow YAML to enumerate the actual Python versions exercised in CI. +> For any version listed in the `classifiers` block but NOT in CI, either +> (a) drop the classifier, or (b) add the version to the CI matrix — pick +> based on whether the project intends to support that version. Default +> to dropping 3.13 and 3.14 from the classifiers if no CI exists, and +> open a follow-up issue titled "Add Python 3.13/3.14 to CI matrix once +> released and stable". + +**Acceptance criteria:** +- `pyproject.toml` classifiers match CI reality. +- Follow-up issue exists if classifiers were trimmed. + +--- + +### Step 17: Remove obsolete multi-phase prompt templates [Low] + +**Context:** Spec 27 §6.2: "Remove multi-phase prompt templates +(SCAFFOLD, ANALYZE, REFLECT, etc.)". `prompts.py` is 286 lines — +inspect to find any template not in the chunk-focused set +(CHUNK_EXECUTE / CHUNK_VERIFY / CHUNK_FIX) and remove it. + +**Prompt for Claude Code:** +> Open `src/codelicious/prompts.py`. List every named template constant. +> Run `grep -rn "" src/ tests/` for each — any template +> with zero non-test references is dead code. Delete it and update the +> module docstring to enumerate only the live templates. Update +> `tests/test_prompts.py` to drop assertions on deleted templates and +> add a smoke test that asserts CHUNK_EXECUTE, CHUNK_VERIFY, CHUNK_FIX, +> and `render()` exist. + +**Acceptance criteria:** +- No dead templates remain. +- Live templates enumerated in module docstring. +- `tests/test_prompts.py` passes. + +--- + +### Step 18: Audit `scaffolder.py` for security-rule self-violations [Low] + +**Context:** `scaffolder.py:148, 291, 335-337` contain string literals +listing forbidden patterns (`eval(`, `shell=True`, `os.system`) used as +prose in scaffolded README content. These are scaffolding output, not +runtime calls — but a future refactor that turns them into f-strings +without care could regress security. Tighten the source to make this +harder. + +**Prompt for Claude Code:** +> In `src/codelicious/scaffolder.py`, replace the three string-literal +> blocks at lines ~148, ~291, ~335 that enumerate forbidden patterns +> with reads from a single module-level frozen tuple +> `_FORBIDDEN_PATTERNS_DOC = ("eval(", "exec(", "shell=True", ...)`. +> Format that tuple into the scaffolded README/CLAUDE.md/security.md +> output via `"\n".join(f"- {p}" for p in _FORBIDDEN_PATTERNS_DOC)`. The +> tuple becomes the single source of truth and any future change must be +> intentional. Add a test that asserts the scaffolded output contains +> every entry from `_FORBIDDEN_PATTERNS_DOC`. + +**Acceptance criteria:** +- Single tuple drives the forbidden-pattern documentation. +- Tests cover the round-trip. + +--- + +## Roll-out Order + +Execute the steps in numerical order. Steps 1–3 are mergeable +independently. Steps 6, 8, 9, 10, 11 should land after Step 2 (legacy +removal) so diffs stay clean. Steps 12–18 are pure additions and can be +parallelized. + +## Acceptance Criteria (overall) + +- [ ] All 18 steps complete and merged. +- [ ] `pytest` passes with coverage ≥ 90% (existing project gate). +- [ ] `ruff check src/ tests/` clean. +- [ ] `bandit -r src/` clean (zero new high/medium findings). +- [ ] `grep -rn "Orchestrator(" src/` returns only `V2Orchestrator`-derived hits. +- [ ] No new runtime dependencies in `pyproject.toml`. +- [ ] Spec 28 and README quote the same LOC default as the code. +- [ ] All retries (LLM, HF tool loop, Claude CLI) include jitter. + +## Out of Scope + +- New engine backends (Anthropic API, OpenAI, Gemini) — already deferred to + spec 27 §8.2. +- Stacked PRs / DAG executor — deferred per spec 28 "Out of Scope". +- Webhook/Slack/Jira triggers — deferred per spec 27 §8.1. +- Replacing the `urllib`-based HTTP client with `httpx` — would break the + zero-runtime-dependency invariant. diff --git a/docs/specs/spec-v30_operational_resilience_and_idempotency_v1.md b/docs/specs/spec-v30_operational_resilience_and_idempotency_v1.md new file mode 100644 index 00000000..b4e75402 --- /dev/null +++ b/docs/specs/spec-v30_operational_resilience_and_idempotency_v1.md @@ -0,0 +1,497 @@ +--- +version: 1.0.0 +status: Complete +created: 2026-05-03 +updated: 2026-05-04 +related_specs: ["27_codelicious_v2_rewrite.md", "28_bite_sized_continuous_prs_v1.md", "spec-v29_gap_closure_v1.md", "18_operational_resilience_v1.md"] +progress: + completed: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] + pending: [] +--- + +# Spec v30 — Operational Resilience & Idempotency Gap Closure (2026-05-03) + +## Background + +Spec v29 closes 18 gaps, mostly around legacy-orchestrator cleanup, jitter, +chunk-scoped verification, and missing tests for individual modules. After +landing v29, a second pass surfaced a different *category* of gap: things +that work fine for a single, lucky, end-to-end run but fall over under +realistic operational conditions — concurrent invocations, partial failures +followed by resumes, oversized specs, rate-limit storms, log-file +contention, and CLI flags that are only validated at one layer. + +These gaps are **orthogonal to v29**: they are not on v29's inventory, they +are not addressed by deleting the legacy `Orchestrator`, and they will +remain after every step of v29 is merged. They are ordered here by impact +on a real user running `codelicious build` against a non-trivial spec on a +shared workstation. + +## Gap Inventory + +| # | Title | Severity | +|---|---|---| +| 1 | No concurrent-run lockfile — two `codelicious build` invocations on the same repo can race on git, sandbox, and `latest.log` | High (correctness) — **DONE** (`_run_lock` advisory `fcntl.flock`; second invocation exits 75) | +| 2 | No idempotent resume — if a run aborts mid-spec (rate-limit, SIGTERM, crash), restarting redoes already-merged chunks and may double-PR | High (correctness) — **DONE** (atomic JSON ledger at `.codelicious/state/.json`; `--no-resume` and `--reset-ledger` flags) | +| 3 | `--endpoint` CLI flag is not validated for HTTPS at the CLI layer; config-layer validation runs only when LLMClient is constructed, after a banner has already been printed | High (security) — **DONE** (`_validate_endpoint_url_strict` runs before banner; rejects http, user-info, empty host with masked output, exits 2) | +| 4 | SIGTERM handler raises `SystemExit(143)` but there is no integration test asserting that an in-flight chunk and its subprocess (Claude CLI) both terminate within the 5 s grace window | High (resilience) — **DONE** (`TestSigtermIntegration` in `test_cli.py` spawns a real Python child holding the run-lock, sends SIGTERM, asserts exit 143 within 8s and lockfile cleanup) | +| 5 | No engine fallback on persistent rate-limit — Claude CLI rate-limit aborts the build (per recent commit `9b68ea84`) instead of failing over to the HuggingFace engine when both are configured | Medium (resilience) — **DONE** (`Orchestrator(engines=[…])` ordered list; rate-limited engine is dropped and the chunk retries on the next one; CLI auto-builds the list when both creds are present) | +| 6 | Chunker has no token-budget awareness — a chunk whose estimated files exceed the engine context window will be dispatched anyway and fail at runtime | Medium (correctness) — **DONE** (`enforce_token_budget` halves over-budget chunks recursively up to depth 3; orchestrator runs it after deterministic chunking) | +| 7 | `latest.log` symlink is updated non-atomically on every run — two simultaneous runs (or a fast restart) leave a dangling or wrong-target symlink | Medium (operational) — **DONE** (`_atomic_symlink_update` uses tmp + `os.replace`; Windows fallback writes `.txt`) | +| 8 | Verifier measures coverage but does not enforce a gate — a chunk whose patch drops coverage below the project floor still returns `success=True` | Medium (correctness) — **DONE** (`resolve_min_coverage` + `_enforce_coverage_floor`; CLI `--min-coverage`, `[tool.codelicious].min_coverage`, default 90) | +| 9 | PR description body lacks chunk metadata (parent spec link, dependency chain, verifier summary) — humans reviewing the PR have no context | Medium (UX) — **DONE** (`_build_pr_body` accepts `chunk_metadata`; renders Chunk Context, Verifier Summary, Audit Log sections; missing fields → `n/a`) | +| 10 | Branch-name collision — two chunks whose titles normalize to the same slug produce the same branch; the second `git checkout -b` fails with a cryptic error | Medium (correctness) — **DONE** (`_disambiguate_branch` probes local + remote, suffix or unix-ts fallback, INFO logged) | +| 11 | `audit_logger` rotation has a thread-safe append path but no `fcntl.flock` cross-process guard; concurrent `codelicious` runs interleave audit lines | Medium (security) — **DONE** (`_cross_process_lock` wraps every append with `fcntl.flock` on `.audit.lock`; Windows `msvcrt.locking` fallback) | +| 12 | No diagnostic dump on abnormal exit — when the build aborts (deadline, rate-limit, exception), there is no single artifact summarizing what completed, what failed, and what to retry | Low (UX) — **DONE** (`_write_postmortem` aggregates ledger counts + log tail + resume hint; called on abort) | + +--- + +## Step-by-step Fix Instructions + +### Step 1: Add a per-repo lockfile to prevent concurrent runs [High] + +**Context:** Nothing today prevents two `codelicious build` invocations +from running against the same repository checkout. Both will try to +create branches, write to `.codelicious/`, append to `audit.log`, and +update the `latest.log` symlink. The git operations alone can corrupt +each other (one process's `git checkout -b` racing the other's +`git commit`). On a shared workstation or CI runner reusing a workspace +this is a realistic foot-gun. + +**Prompt for Claude Code:** +> Read `src/codelicious/cli.py` to find where the build command begins +> executing (look for the `cmd_build` or equivalent function near line +> 800). Before any work starts — after argument parsing but before any +> git, sandbox, or LLM call — acquire an advisory lock at +> `/.codelicious/run.lock` using `fcntl.flock` with +> `fcntl.LOCK_EX | fcntl.LOCK_NB`. If the lock cannot be acquired, +> read the PID from inside the lockfile (we will write it in a moment) +> and print: `Error: another codelicious run is in progress (pid +> ). If you believe this is stale, delete .` and exit with +> code 75 (EX_TEMPFAIL). On successful acquisition, write the current +> PID into the lockfile and register an `atexit` handler that releases +> the lock and removes the file. Wrap this in a context manager +> `_run_lock(repo_root: Path)` so the lifecycle is obvious. On +> non-POSIX platforms where `fcntl` is unavailable, log a single +> WARNING ("concurrent-run protection unavailable on this platform") +> and proceed without locking. Add tests in +> `tests/test_cli.py::TestRunLock` that: (a) acquires the lock and +> asserts a second acquisition in a child process exits with 75, (b) +> confirms the lockfile is removed on normal exit, (c) confirms the +> lockfile is removed on `SystemExit(143)` (SIGTERM path). + +**Acceptance criteria:** +- Second concurrent invocation exits with code 75 and a clear message. +- Lockfile is removed on normal exit, SIGTERM, and uncaught exception. +- 3 new tests pass on POSIX; Windows path logs the warning. + +--- + +### Step 2: Idempotent resume — skip already-merged chunks on restart [High] + +**Context:** A build can abort mid-spec for many reasons (rate-limit +abort per `9b68ea84`, SIGTERM, deadline, network blip). When the user +re-runs `codelicious build` against the same spec, the chunker +re-produces the same chunk list, but `V2Orchestrator.run()` does not +check whether a chunk's expected output already exists in git. Result: +duplicate PRs, wasted LLM budget, and possible merge conflicts when the +re-run touches files the prior run already merged. + +**Prompt for Claude Code:** +> In `src/codelicious/orchestrator.py:V2Orchestrator`, introduce a +> persistent chunk-status ledger at `/.codelicious/state/< +> spec_slug>.json` with the schema `{"chunks": {"": +> {"status": "merged"|"failed"|"in_progress", "pr_url": "...", +> "branch": "...", "completed_at": ""}}}`. On `run()` start, +> load the ledger if present; for every chunk in the plan, if its +> ledger entry is `"merged"`, skip execution and log INFO ("Skipping +> already-merged chunk : "). After each successful +> `engine.execute_chunk` + verify + git push + PR create, update the +> ledger atomically (write to `.tmp` then `os.rename`). Use the chunk's +> stable `id` (already produced by the chunker) as the key — verify +> chunker IDs are deterministic across runs by reading `chunker.py`; +> if not, derive an ID from `sha256(title + sorted(files))[:12]` and +> persist that. Add `--no-resume` and `--reset-ledger` CLI flags in +> `cli.py` that bypass or delete the ledger respectively. Tests in +> `tests/test_v2_orchestrator.py::TestIdempotentResume`: (a) ledger +> with one chunk marked merged → that chunk is skipped, (b) full +> ledger → no chunks executed, (c) `--no-resume` ignores the ledger, +> (d) `--reset-ledger` deletes it before run. + +**Acceptance criteria:** +- A second run after a partial completion executes only the unfinished chunks. +- Ledger writes are atomic (no half-written JSON on crash). +- Tests for skip, full-skip, no-resume, and reset all pass. +- `--no-resume` and `--reset-ledger` are documented in `--help`. + +--- + +### Step 3: Validate `--endpoint` for HTTPS at the CLI layer [High] + +**Context:** `src/codelicious/config.py:136` and +`src/codelicious/llm_client.py:97` both check `scheme != "https"` and +reject non-HTTPS endpoints — but only when `LLMClient` is instantiated, +which happens *after* the CLI has already parsed args and printed the +banner. A user passing `--endpoint http://internal-proxy/...` sees the +banner with their (insecure) endpoint, then a confusing error several +lines later. CLAUDE.md says "All LLM endpoint URLs validated for HTTPS" +— validation should fail fast. + +**Prompt for Claude Code:** +> In `src/codelicious/cli.py`, locate `_parse_args` (around line 600) +> and `_validate_opts` (around line 700). Add a validation block: if +> `opts.get("endpoint")` is set, parse it with `urllib.parse.urlsplit` +> and reject any scheme other than `https`. Reject also if the host is +> empty or if the URL contains user-info (e.g. +> `https://user:pass@host`) — credentials in URLs are a known +> exfiltration vector. On rejection, print +> `Error: --endpoint must be an https:// URL with no embedded +> credentials, got: <sanitized>` (sanitize by masking any user-info +> with `***`) and exit with code 2. Add tests in +> `tests/test_cli.py::TestEndpointValidation` covering: http URL +> rejected, https URL accepted, https URL with user-info rejected, +> empty host rejected, malformed URL rejected. Ensure the same +> validation is *not* duplicated — refactor `config.py:_validate_url` +> to expose a single `validate_endpoint_url(url) -> None` helper that +> both the CLI and `LLMClient.__init__` call. + +**Acceptance criteria:** +- Insecure endpoints rejected before the banner prints. +- Credentials-in-URL rejected with masked output. +- Single helper used by both CLI and LLMClient. +- 5 new tests pass. + +--- + +### Step 4: Integration test for SIGTERM graceful shutdown [High] + +**Context:** `cli.py:37-39` registers a SIGTERM handler that raises +`SystemExit(143)`. `agent_runner.py:46` defines `_SIGTERM_GRACE_S = 5` +and uses it on subprocess timeouts. There is no test that runs a full +build, sends SIGTERM mid-chunk, and asserts: (a) the subprocess +(Claude CLI mock) receives SIGTERM, (b) it gets SIGKILL after 5 s if +unresponsive, (c) the lockfile from Step 1 is released, (d) the ledger +from Step 2 marks the in-flight chunk as `"failed"` not +`"in_progress"`. CLAUDE.md mandates this works; no test proves it +does. + +**Prompt for Claude Code:** +> Create `tests/test_sigterm_integration.py`. The test launches +> `codelicious build` as a subprocess against a fixture spec +> (write a tiny one to `tmp_path/spec.md`). Use a mock engine +> shimmed via env var (e.g. `CODELICIOUS_ENGINE=mock` reads from +> `src/codelicious/engines/mock_engine.py` — create that engine if +> missing, with a configurable per-chunk sleep). Configure the mock +> engine to sleep 30 s in `execute_chunk`. From the parent test, after +> 2 s, send `SIGTERM` to the child PID. Assert: child exits with code +> 143 within 8 s, the lockfile under `.codelicious/run.lock` no longer +> exists, the ledger entry for the in-flight chunk has status +> `"failed"` with reason `"sigterm"`, and stderr contains "Received +> SIGTERM". Also assert that any subprocess spawned by the agent +> runner (use `psutil` if already a dep — otherwise spawn a sentinel +> file-creating subprocess and confirm it was killed) is no longer +> alive. Mark the test `@pytest.mark.skipif(sys.platform == "win32", +> ...)`. + +**Acceptance criteria:** +- Test exists and passes on POSIX. +- Child process exits 143 within 8 seconds of SIGTERM. +- No orphaned subprocesses, lockfile, or `in_progress` ledger entries remain. + +--- + +### Step 5: Engine fallback on persistent Claude rate-limit [Medium] + +**Context:** Commit `9b68ea84` ("abort build on Claude CLI rate-limit +instead of burning through every chunk") is correct given a single +engine, but spec 27 §1 advertises a "dual-engine architecture". When a +user has both `ANTHROPIC_API_KEY` (or Claude CLI auth) *and* +`HF_TOKEN` configured, a Claude rate-limit should fail over to the HF +engine for the remainder of the spec rather than aborting. + +**Prompt for Claude Code:** +> In `src/codelicious/orchestrator.py:V2Orchestrator`, accept a list +> `engines: list[Engine]` (ordered by preference) instead of a single +> `engine`. In the per-chunk loop, on `ClaudeRateLimitError` (or +> equivalent rate-limit signal from any engine), log WARNING +> ("<engine_name> rate-limited; failing over to <next_engine_name> +> for the remainder of this spec"), drop the rate-limited engine +> from the active list, and retry the same chunk on the next engine. +> If the active list becomes empty, abort with the existing rate-limit +> behavior (do not retry the same engine). Update `cli.py` to assemble +> the engine list based on which credentials are present: +> Claude-first if both, HF-only if only HF, Claude-only if only +> Claude. Add tests in `tests/test_v2_orchestrator.py::TestEngineFallback` +> with two mock engines: (a) primary rate-limits on chunk 2 → fallback +> takes over from chunk 2, (b) both rate-limit → abort, (c) primary +> rate-limits on chunk 2, secondary rate-limits on chunk 4 → abort +> at chunk 4. Verify the chunk-2 retry on the fallback engine +> actually executes (don't double-skip). + +**Acceptance criteria:** +- Build continues on the fallback engine after primary rate-limits. +- Rate-limited engine is removed for the rest of the run (no flapping). +- 3 new tests pass. +- `--engine` flag (if present) can still force a single engine. + +--- + +### Step 6: Token-budget-aware chunk sizing [Medium] + +**Context:** `chunker.py` produces chunks based on LOC and commit +caps but does not estimate the prompt token count for each chunk. +A chunk whose `estimated_files` collectively exceed the engine's +context window (e.g., Claude Sonnet 4.6 at 200k tokens, or HF at much +less) is dispatched anyway and fails inside the engine with a +context-overflow error — wasting one full chunk's worth of LLM budget +to discover the problem. + +**Prompt for Claude Code:** +> Add `_estimate_chunk_tokens(chunk: WorkChunk, repo: Path) -> int` to +> `src/codelicious/chunker.py`. Estimate as: 4 chars/token, summing +> over (chunk title + description + chunk-execute prompt template + +> sum of file sizes for `chunk.files`). Add a module-level constant +> `_MAX_CHUNK_TOKENS_PER_ENGINE = {"claude": 150_000, "huggingface": +> 24_000, "default": 24_000}` (leaving 50k headroom on Claude's 200k +> for response). After chunking (deterministic and LLM paths), iterate +> chunks and for any chunk exceeding the budget for the *least +> capable* engine in the active engine list, split it: keep the title, +> halve the file list, mark the second half as depending on the first +> with id-suffix `_b`. Cap recursive splits at depth 3 and log +> WARNING if a chunk still exceeds budget after 3 splits ("Chunk +> <id> still <N> tokens after 3 splits — dispatch anyway, may fail"). +> Tests in `tests/test_chunker.py::TestTokenBudget`: under-budget +> chunk → unchanged, over-budget chunk → split with correct +> dependency, recursively over-budget → 4 sub-chunks max, splits +> preserve total file coverage. + +**Acceptance criteria:** +- No chunk silently exceeds the engine context window. +- Splits preserve dependency ordering and file coverage. +- 4 new tests pass. + +--- + +### Step 7: Atomic update of the `latest.log` symlink [Medium] + +**Context:** Recent commit `258e5e90` adds a `latest.log` symlink. +The current implementation likely uses `os.symlink(target, link)` +which fails if `link` exists, *or* `os.unlink(link); os.symlink(...)` +which is non-atomic. Two simultaneous runs (Step 1 will eventually +prevent this, but a fast restart after crash within the lock-release +window can reproduce it) can leave the symlink dangling or pointing +to a partially-written file. POSIX guarantees atomic symlink update +via `os.symlink(target, tmp); os.rename(tmp, link)`. + +**Prompt for Claude Code:** +> In `src/codelicious/logger.py`, find where `latest.log` is created +> (search for `"latest.log"` or `os.symlink`). Replace the +> non-atomic update with: `tmp = link.with_suffix(f".{os.getpid()}. +> tmp"); os.symlink(target.name, tmp); os.replace(tmp, link)`. Use +> `os.replace` (atomic on POSIX, atomic on NTFS for symlinks since +> Win10). On platforms where `os.symlink` fails due to permissions +> (Windows without developer mode), fall back to writing a plain +> file `latest.log.txt` containing the path to the active log, and +> log INFO once per process. Add a test in `tests/test_logger.py` +> that runs the symlink update twice in quick succession and asserts +> the symlink resolves to the most recent log file with no +> intermediate broken state (use `os.readlink` between updates from +> a thread to detect a window where `readlink` raises). + +**Acceptance criteria:** +- Symlink update is atomic via tmp + rename. +- Windows fallback documented and tested. +- Race-detection test passes. + +--- + +### Step 8: Enforce coverage gate in verifier, not just measure [Medium] + +**Context:** `src/codelicious/verifier.py` runs `pytest --cov` (per +recent specs) but the result feeds into `VerifyResult` only +informationally. A chunk that drops project coverage from 92% to 71% +returns `success=True`. CLAUDE.md and the project itself promise +"coverage ≥ 90%" — this is enforced for human-written code by CI but +not for codelicious-generated code at chunk time, when the violation +is cheapest to fix. + +**Prompt for Claude Code:** +> In `src/codelicious/verifier.py`, locate the coverage parsing +> (search for `--cov` or `coverage`). Add a parameter +> `min_coverage: float = 90.0` to the public `verify()` and +> `verify_paths()` functions (Step 6 of v29 introduces +> `verify_paths`). After parsing the coverage percentage, if it is +> below `min_coverage`, set `result.success = False` and append to +> `result.failures`: f"coverage {pct:.1f}% below floor +> {min_coverage:.1f}%". Read the threshold from +> `pyproject.toml::[tool.codelicious].min_coverage` if present +> (parse with `tomllib`), else use 90.0. Add a `--min-coverage` CLI +> flag in `cli.py` that overrides both. Tests in +> `tests/test_verifier.py::TestCoverageGate`: 95% coverage with +> 90% floor → success, 85% with 90% floor → failure with the +> expected message, override via pyproject → uses pyproject value, +> override via CLI flag → CLI wins. + +**Acceptance criteria:** +- Coverage drops below floor → `success=False`. +- Three-way override precedence: CLI > pyproject > default. +- 4 new tests pass. + +--- + +### Step 9: Enrich PR description with chunk metadata [Medium] + +**Context:** `src/codelicious/git/` produces PRs but their descriptions +do not include: a link to the parent spec file, the chunk's position +in the dependency graph, the verifier summary (lint warnings, test +count, coverage delta), or a link to the audit log slice for this +chunk. A reviewer opens the PR and sees only the commit messages. +This is the single highest-leverage UX improvement for the human +reviewer. + +**Prompt for Claude Code:** +> In `src/codelicious/git/` (locate the PR-create function — likely +> `pr.py` or similar; `grep -rn "gh pr create\|create_pr" src/`), +> extend the PR body template to include a `## Chunk Context` +> section with: spec source path (relative to repo root), chunk id +> and title, list of chunks this one depends on, list of chunks +> that depend on it, and a `## Verifier Summary` section with +> `tests passed: N | lint warnings: N | coverage: P% (Δ +/-X.X +> pp)`. Read these from the `ChunkResult` and `VerifyResult` already +> available at PR-create time. If any field is missing, render +> `n/a` rather than crashing. Add a `## Audit Log` section with the +> path to the rotated audit log file for the run (from Step 1's +> ledger). Update tests in `tests/test_git_pr.py` (or create) to +> assert each section appears in the rendered body for a sample +> ChunkResult. + +**Acceptance criteria:** +- PR body includes Chunk Context, Verifier Summary, Audit Log sections. +- Missing fields render as `n/a`, never raise. +- Tests assert each section's presence. + +--- + +### Step 10: Detect and disambiguate branch-name collisions [Medium] + +**Context:** The branch name is derived from a slug of the chunk +title. Two chunks titled "fix: handle empty input" (e.g., produced by +the LLM in a long spec) normalize to identical slugs. The first +`git checkout -b` succeeds; the second fails with "branch already +exists" and the error surfaces as a generic chunk failure with no +hint that the cause is a name collision. + +**Prompt for Claude Code:** +> In `src/codelicious/git/` find the slug-to-branch-name function +> (likely `_slugify` or `_branch_name_for_chunk`). After computing +> the candidate slug, run `git branch --list <slug>` and +> `git ls-remote --heads origin <slug>`; if either returns a hit, +> append a suffix `-<chunk_id_short>` (first 6 chars of the chunk +> id) to disambiguate. If even *that* collides (extremely unlikely), +> append `-<unix_timestamp>`. Log INFO when disambiguation triggers: +> "Branch <slug> exists; using <slug>-<suffix>". Add tests in +> `tests/test_git_branches.py`: (a) unique slug → unchanged, (b) local +> collision → suffixed with chunk id, (c) remote-only collision → +> suffixed, (d) both collisions of suffixed name → timestamp suffix. + +**Acceptance criteria:** +- Two chunks with identical titles produce distinct branches. +- Disambiguation is logged so the reviewer can match branch to chunk. +- 4 new tests pass. + +--- + +### Step 11: Cross-process file lock on audit-log rotation [Medium] + +**Context:** v29 Step 12 adds a thread-safe append test for +`audit_logger.py` (10 threads × 100 entries). That covers +intra-process safety. But two `codelicious` processes (Step 1 will +prevent same-repo collisions; cross-repo audit logs may still share a +location if `CODELICIOUS_AUDIT_DIR` is set globally) can interleave +audit lines across the rotation boundary, producing a corrupt +JSON-lines file. + +**Prompt for Claude Code:** +> In `src/codelicious/tools/audit_logger.py`, wrap the append + rotate +> critical section with `fcntl.flock(fd, LOCK_EX)` / `LOCK_UN`. Use a +> dedicated lock file at `<audit_dir>/.audit.lock` rather than locking +> the audit log itself (rotation moves the audit log; a lock on a +> moved fd is undefined). On Windows, use `msvcrt.locking` if +> available; otherwise log a one-shot WARNING and proceed without +> cross-process locking (intra-process locks remain). Add a test in +> the v29 Step 12 file `tests/test_audit_logger.py` that uses +> `multiprocessing` to spawn 4 workers each writing 50 entries — +> assert all 200 lines parse as JSON and no line is interleaved +> (`worker_id` field unique per line). Skip on Windows. + +**Acceptance criteria:** +- Cross-process audit-log writes never interleave on POSIX. +- 200 entries from 4 processes all parse as JSON. +- Windows fallback documented and warned about. + +--- + +### Step 12: Diagnostic dump on abnormal exit [Low] + +**Context:** When a build aborts (deadline reached, rate-limit on all +engines, uncaught exception), the operator must `tail` the log and +mentally reconstruct what completed. With the ledger from Step 2 +already capturing per-chunk status, dumping a one-page summary +("postmortem") on abnormal exit is nearly free. + +**Prompt for Claude Code:** +> In `src/codelicious/cli.py`, register an `atexit` handler (also +> invoked on SystemExit and uncaught Exception via `sys.excepthook`) +> that, if the run did not complete normally, writes +> `<repo_root>/.codelicious/postmortem-<timestamp>.md` with: the +> abort reason (deadline / rate-limit / signal / exception), counts +> of merged / failed / skipped / pending chunks (from the Step 2 +> ledger), the last 50 lines of the run log, and a "Resume command" +> footer that prints the exact `codelicious build <spec> +> [--reset-ledger if appropriate]` command needed to continue. On +> normal exit (all chunks merged), do NOT write a postmortem. Print +> the postmortem path to stderr on abort. Add tests in +> `tests/test_cli.py::TestPostmortem`: (a) deadline abort writes +> postmortem with status counts, (b) normal completion writes none, +> (c) postmortem path is printed to stderr. + +**Acceptance criteria:** +- Postmortem appears only on abnormal exit. +- Includes status counts, abort reason, resume command. +- 3 new tests pass. + +--- + +## Roll-out Order + +Steps **1, 2, and 4** form a tightly coupled trio (lockfile, ledger, +SIGTERM test) and should land together or in immediate succession — +each makes the others meaningful. Step **3** is independent and +high-value (security); land it first to remove the foot-gun. Steps +**5 (engine fallback)** and **6 (token budget)** depend on the v29 +legacy-orchestrator removal (v29 Step 2) being complete. Steps +**7–11** are independent and parallelizable. Step **12 (postmortem)** +depends on Step 2's ledger format and should land last. + +**Suggested merge order:** 3 → 1 → 2 → 4 → 7 → 10 → 11 → 8 → 9 → 5 → 6 → 12. + +## Acceptance Criteria (overall) + +- [ ] All 12 steps complete and merged, each as a separate PR. +- [ ] `pytest` passes with coverage ≥ 90%. +- [ ] `ruff check src/ tests/` clean. +- [ ] `bandit -r src/` clean. +- [ ] Two simultaneous `codelicious build` invocations on the same repo: second exits 75. +- [ ] Killing a build with SIGTERM mid-chunk leaves no lockfile, no orphan subprocesses, and a `failed` ledger entry. +- [ ] Re-running `codelicious build` against a partially-completed spec only executes the unfinished chunks. +- [ ] Passing `--endpoint http://...` exits 2 before the banner prints. +- [ ] On Claude rate-limit with HF configured, build continues on HF. +- [ ] PR descriptions include chunk context, verifier summary, and audit-log link. +- [ ] On any abnormal exit, a postmortem markdown appears under `.codelicious/`. + +## Out of Scope + +- Distributed-locking across machines (would require Redis or similar — violates zero-runtime-dependency rule). +- Resuming an aborted *single chunk* mid-flight (the unit of resumption is the chunk; partial chunk work is discarded). A future spec could explore checkpointing within `agent_runner` if needed. +- A web UI for the postmortem / ledger — markdown is sufficient for the headless-CLI value proposition. +- Mocking out git for the SIGTERM integration test — the test should exercise the real git path; if that proves flaky, factor a thin git facade in a follow-up spec. diff --git a/src/codelicious/agent_runner.py b/src/codelicious/agent_runner.py index a30ee4b2..f24a6eb3 100644 --- a/src/codelicious/agent_runner.py +++ b/src/codelicious/agent_runner.py @@ -15,6 +15,7 @@ import logging import pathlib import queue +import re import shutil import subprocess import threading @@ -152,11 +153,15 @@ def _build_agent_command( list[str] Command list suitable for subprocess.Popen. """ + # spec v29 Step 9: chunk-level config can override the default + # ``--output-format`` and ``--allowedTools`` values from spec 27 §3.2. + output_format_attr = getattr(config, "output_format", "") + output_format = output_format_attr if isinstance(output_format_attr, str) and output_format_attr else "stream-json" cmd: list[str] = [ claude_bin, "--print", "--output-format", - "stream-json", + output_format, "--verbose", # bypassPermissions lets the agent edit/write/run shell commands inside # the project working directory without per-action prompts. Codelicious @@ -167,6 +172,15 @@ def _build_agent_command( "bypassPermissions", ] + allowed_tools = getattr(config, "allowed_tools", None) + # Accept only an explicit list/tuple of strings or a pre-joined CSV string; + # ignore everything else (including MagicMock attribute auto-creation). + if isinstance(allowed_tools, (list, tuple)) and allowed_tools: + allowed_tools_str = ",".join(str(t) for t in allowed_tools) + cmd.extend(["--allowedTools", allowed_tools_str]) + elif isinstance(allowed_tools, str) and allowed_tools.strip(): + cmd.extend(["--allowedTools", allowed_tools.strip()]) + model = getattr(config, "model", "") if model: cmd.extend(["--model", model]) @@ -189,6 +203,51 @@ def _build_agent_command( return cmd +# Bounds for parsed Claude rate-limit retry windows. Below 10 s is suspect +# (provider noise); above 1 hour we'd rather fail fast than block a build. +_CLAUDE_RETRY_AFTER_MIN_S: float = 10.0 +_CLAUDE_RETRY_AFTER_MAX_S: float = 3600.0 +_CLAUDE_RETRY_AFTER_DEFAULT_S: float = 60.0 + + +def _parse_claude_reset_seconds(text: str) -> float | None: + """Parse a Claude CLI rate-limit message for a reset window in seconds. + + Recognises three common shapes that appear in Claude CLI output: + + * ``resets in <N> seconds`` / ``reset in <N>s`` + * ``try again in <N> minutes`` / ``try again in <N>m`` + * ``Retry-After: <N>`` (seconds) + + Returns the parsed delay clamped to + ``[_CLAUDE_RETRY_AFTER_MIN_S, _CLAUDE_RETRY_AFTER_MAX_S]`` or ``None`` if + no recognised pattern is present. + """ + if not text: + return None + lowered = text.lower() + + seconds_match = re.search(r"reset[s]?\s+in\s+(\d+)\s*(?:seconds?|secs?|s)\b", lowered) + if seconds_match: + return _clamp_retry_after(float(seconds_match.group(1))) + + minutes_match = re.search(r"try\s+again\s+in\s+(\d+)\s*(?:minutes?|mins?|m)\b", lowered) + if minutes_match: + return _clamp_retry_after(float(minutes_match.group(1)) * 60.0) + + # Retry-After header occasionally surfaces verbatim in stderr; match in + # the original (case-insensitive) text since header names are mixed-case. + header_match = re.search(r"retry-after\s*:\s*(\d+)", text, re.IGNORECASE) + if header_match: + return _clamp_retry_after(float(header_match.group(1))) + + return None + + +def _clamp_retry_after(value: float) -> float: + return max(_CLAUDE_RETRY_AFTER_MIN_S, min(_CLAUDE_RETRY_AFTER_MAX_S, value)) + + def _check_agent_errors( returncode: int, stdout_lines: list[str], @@ -258,9 +317,11 @@ def _check_agent_errors( safe_stderr, ) safe_combined = sanitize_message((stderr_text + stdout_text)[-500:]) + retry_after = _parse_claude_reset_seconds(stderr_text + stdout_text) + retry_after_s = retry_after if retry_after is not None else _CLAUDE_RETRY_AFTER_DEFAULT_S raise ClaudeRateLimitError( f"Claude CLI rate limited (exit code {returncode}): {safe_combined}", - retry_after_s=60.0, + retry_after_s=retry_after_s, ) logger.warning( diff --git a/src/codelicious/chunker.py b/src/codelicious/chunker.py index 8f297bf3..83a61ecb 100644 --- a/src/codelicious/chunker.py +++ b/src/codelicious/chunker.py @@ -22,6 +22,27 @@ _MAX_CHUNKS_PER_SPEC = 100 +# spec v29 Step 7: spec-windowing constants for ``chunk_spec_with_llm``. +# Windowing is triggered when ``len(spec_content) > _LLM_WINDOW_SIZE``. Each +# window covers ``_LLM_WINDOW_SIZE`` characters with ``_LLM_WINDOW_OVERLAP`` +# characters of overlap to keep cross-window references intact. We cap the +# total number of windows so an enormous spec doesn't burn 100 LLM calls. +_LLM_WINDOW_SIZE: int = 5000 +_LLM_WINDOW_OVERLAP: int = 500 +_LLM_MAX_WINDOWS: int = 10 + +# spec v30 Step 6: token budgets per engine. Estimated as ~4 chars/token. The +# Claude budget leaves ~50k tokens of headroom on the 200k context window for +# the response; HuggingFace defaults are far tighter. Resolution order: +# explicit ``engines`` arg > "default". +_MAX_CHUNK_TOKENS_PER_ENGINE: dict[str, int] = { + "claude": 150_000, + "huggingface": 24_000, + "default": 24_000, +} +_TOKEN_CHARS_PER_TOKEN: int = 4 +_MAX_CHUNK_SPLIT_DEPTH: int = 3 + # Matches a markdown checkbox line, capturing the text after the box. _CHECKBOX_LINE_RE = re.compile(r"^\s*-\s*\[\s*\]\s*(.+)", re.MULTILINE) @@ -197,56 +218,77 @@ def chunk_spec_with_llm( logger.warning("Cannot read spec %s for LLM chunking; falling back.", spec_path) return chunk_spec(spec_path, repo_path) - # Build the prompt - prompt = ( - "You are a software architect. Given the following spec, decompose it into " - "independent, commit-sized units of work. Each chunk should:\n" - "- Touch a small number of files\n" - "- Be independently testable\n" - "- Have a clear title (under 72 chars)\n" - "- List which files it likely modifies\n" - "- Specify dependencies on other chunks (by index)\n\n" - f"Respond ONLY with a JSON array. Each element must have:\n" - ' {{"title": "...", "description": "...", "files": ["..."], ' - '"depends_on_indices": [], "validation": "..."}}\n\n' - f"## Spec\n{spec_content[:5000]}\n" - ) - - messages = [ - { - "role": "system", - "content": "You decompose specs into commit-sized work chunks. Respond only with valid JSON.", - }, - {"role": "user", "content": prompt}, - ] - - try: - response = llm_client.chat_completion(messages, tools=[], role="planner") - content = "" - choices = response.get("choices") or [] - if choices and isinstance(choices[0], dict): - msg = choices[0].get("message", {}) - content = msg.get("content", "") if isinstance(msg, dict) else "" - except Exception as e: - logger.warning("LLM chunking failed: %s; falling back to deterministic.", e) - return chunk_spec(spec_path, repo_path) + # spec v29 Step 7: window oversized specs instead of silently truncating. + windows = _split_spec_for_llm(spec_content) + raw_chunks: list[dict] = [] + seen_titles: set[str] = set() + for window_idx, window_text in enumerate(windows): + prompt = ( + "You are a software architect. Given the following spec, decompose it into " + "independent, commit-sized units of work. Each chunk should:\n" + "- Touch a small number of files\n" + "- Be independently testable\n" + "- Have a clear title (under 72 chars)\n" + "- List which files it likely modifies\n" + "- Specify dependencies on other chunks (by index)\n\n" + "Respond ONLY with a JSON array. Each element must have:\n" + ' {"title": "...", "description": "...", "files": ["..."], ' + '"depends_on_indices": [], "validation": "..."}\n\n' + f"## Spec (window {window_idx + 1}/{len(windows)})\n{window_text}\n" + ) - # Parse the JSON response - try: - # Extract JSON array from the response (may be wrapped in markdown code block) - json_str = content.strip() - if json_str.startswith("```"): - # Strip markdown code fences - lines = json_str.splitlines() - lines = [ln for ln in lines if not ln.strip().startswith("```")] - json_str = "\n".join(lines) - - raw_chunks = json.loads(json_str) - if not isinstance(raw_chunks, list): - raise ValueError("LLM response is not a JSON array") - except (json.JSONDecodeError, ValueError) as e: - logger.warning("LLM returned invalid JSON: %s; falling back.", e) - return chunk_spec(spec_path, repo_path) + messages = [ + { + "role": "system", + "content": "You decompose specs into commit-sized work chunks. Respond only with valid JSON.", + }, + {"role": "user", "content": prompt}, + ] + + try: + response = llm_client.chat_completion(messages, tools=[], role="planner") + content = "" + choices = response.get("choices") or [] + if choices and isinstance(choices[0], dict): + msg = choices[0].get("message", {}) + content = msg.get("content", "") if isinstance(msg, dict) else "" + except Exception as e: + logger.warning("LLM chunking failed on window %d: %s; falling back to deterministic.", window_idx + 1, e) + return chunk_spec(spec_path, repo_path) + + try: + json_str = content.strip() + if json_str.startswith("```"): + lines = json_str.splitlines() + lines = [ln for ln in lines if not ln.strip().startswith("```")] + json_str = "\n".join(lines) + + window_raw = json.loads(json_str) + if not isinstance(window_raw, list): + raise ValueError("LLM response is not a JSON array") + except (json.JSONDecodeError, ValueError) as e: + logger.warning("LLM returned invalid JSON on window %d: %s; falling back.", window_idx + 1, e) + return chunk_spec(spec_path, repo_path) + + # Deduplicate on normalised title across windows: the overlap region + # invites the LLM to re-emit the same chunk in two adjacent windows. + for entry in window_raw: + if not isinstance(entry, dict): + continue + title_norm = str(entry.get("title", "")).strip().lower() + if title_norm and title_norm in seen_titles: + continue + seen_titles.add(title_norm) + raw_chunks.append(entry) + + if len(raw_chunks) > _MAX_CHUNKS_PER_SPEC: + logger.warning( + "LLM produced %d chunks across %d windows; truncating to %d.", + len(raw_chunks), + len(windows), + _MAX_CHUNKS_PER_SPEC, + ) + raw_chunks = raw_chunks[:_MAX_CHUNKS_PER_SPEC] # Validate and convert to WorkChunk objects chunks: list[WorkChunk] = [] @@ -291,9 +333,6 @@ def chunk_spec_with_llm( ) ) - if len(chunks) > _MAX_CHUNKS_PER_SPEC: - raise ValueError(f"LLM decomposed spec into {len(chunks)} chunks (max {_MAX_CHUNKS_PER_SPEC}).") - # Validate no circular dependencies if _has_circular_deps(chunks): logger.warning("LLM produced circular dependencies; falling back to deterministic.") @@ -307,6 +346,143 @@ def chunk_spec_with_llm( return chunks +def _estimate_chunk_tokens(chunk: WorkChunk, repo: pathlib.Path) -> int: + """Estimate the prompt token cost of dispatching ``chunk`` to an engine. + + Sums the title, description, a fixed prompt-template overhead, and the + sizes of every existing file in ``chunk.estimated_files``. Divides the + total character count by ``_TOKEN_CHARS_PER_TOKEN``. + """ + chars = len(chunk.title) + len(chunk.description) + chars += 2_000 # rough prompt-template / system-message overhead + for fname in chunk.estimated_files or []: + # Reject path traversal up-front so this helper is safe even when + # called on untrusted chunks (the chunker validates LLM-supplied + # paths but the deterministic chunker uses regex hints). + if ".." in fname or fname.startswith("/"): + continue + try: + path = repo / fname + if path.is_file(): + chars += path.stat().st_size + except (OSError, ValueError): + continue + return max(0, chars // _TOKEN_CHARS_PER_TOKEN) + + +def _resolve_token_budget(engines: list[str] | None) -> int: + """Return the smallest budget across the active engines (worst-case).""" + if not engines: + return _MAX_CHUNK_TOKENS_PER_ENGINE["default"] + budgets = [_MAX_CHUNK_TOKENS_PER_ENGINE.get(name, _MAX_CHUNK_TOKENS_PER_ENGINE["default"]) for name in engines] + return min(budgets) if budgets else _MAX_CHUNK_TOKENS_PER_ENGINE["default"] + + +def _split_chunk_in_half(chunk: WorkChunk, suffix: str) -> tuple[WorkChunk, WorkChunk]: + """Split ``chunk`` by halving its file list. The second half depends on the first.""" + files = list(chunk.estimated_files or []) + mid = max(1, len(files) // 2) + head_files = files[:mid] + tail_files = files[mid:] + head = WorkChunk( + id=chunk.id, + spec_path=chunk.spec_path, + title=chunk.title, + description=chunk.description, + depends_on=list(chunk.depends_on or []), + estimated_files=head_files, + validation=chunk.validation, + ) + tail = WorkChunk( + id=f"{chunk.id}_{suffix}", + spec_path=chunk.spec_path, + title=chunk.title, + description=chunk.description, + depends_on=[head.id], + estimated_files=tail_files, + validation=chunk.validation, + ) + return head, tail + + +def enforce_token_budget( + chunks: list[WorkChunk], + repo: pathlib.Path, + *, + engines: list[str] | None = None, +) -> list[WorkChunk]: + """Recursively split chunks that exceed the active token budget (spec v30 Step 6). + + The recursion is capped at ``_MAX_CHUNK_SPLIT_DEPTH`` levels (i.e. up to 8 + sub-chunks per original chunk). When a chunk still exceeds the budget at + that depth a WARNING is logged and the chunk is dispatched anyway — + failing fast at the engine boundary is preferable to dropping work. + """ + budget = _resolve_token_budget(engines) + out: list[WorkChunk] = [] + # Each entry: (chunk, depth, suffix_seed). suffix_seed cycles ``b → c → ...``. + stack: list[tuple[WorkChunk, int, int]] = [(c, 0, 0) for c in chunks] + suffix_alphabet = "bcdefghij" + while stack: + chunk, depth, seed = stack.pop(0) + tokens = _estimate_chunk_tokens(chunk, repo) + if tokens <= budget: + out.append(chunk) + continue + if depth >= _MAX_CHUNK_SPLIT_DEPTH or len(chunk.estimated_files or []) <= 1: + logger.warning( + "Chunk %s still %d tokens after %d splits — dispatch anyway, may fail.", + chunk.id, + tokens, + depth, + ) + out.append(chunk) + continue + suffix = suffix_alphabet[min(seed, len(suffix_alphabet) - 1)] + head, tail = _split_chunk_in_half(chunk, suffix) + # Push back onto the front so dependent ordering is preserved. + stack.insert(0, (head, depth + 1, seed + 1)) + stack.insert(1, (tail, depth + 1, seed + 1)) + return out + + +def _split_spec_for_llm(spec_content: str) -> list[str]: + """Split a spec body into overlapping windows for ``chunk_spec_with_llm``. + + Returns ``[spec_content]`` when the body is short enough to fit in a + single LLM pass. Otherwise produces sliding windows of length + ``_LLM_WINDOW_SIZE`` with ``_LLM_WINDOW_OVERLAP`` characters of overlap, + capped at ``_LLM_MAX_WINDOWS``. + """ + if len(spec_content) <= _LLM_WINDOW_SIZE: + return [spec_content] + + step = max(_LLM_WINDOW_SIZE - _LLM_WINDOW_OVERLAP, 1) + windows: list[str] = [] + start = 0 + while start < len(spec_content) and len(windows) < _LLM_MAX_WINDOWS: + windows.append(spec_content[start : start + _LLM_WINDOW_SIZE]) + if start + _LLM_WINDOW_SIZE >= len(spec_content): + break + start += step + + if len(windows) >= _LLM_MAX_WINDOWS and (windows[-1] is not None) and start + _LLM_WINDOW_SIZE < len(spec_content): + logger.warning( + "Spec exceeded %d windows of %d chars; truncating remaining content.", + _LLM_MAX_WINDOWS, + _LLM_WINDOW_SIZE, + ) + + logger.warning( + "Spec too large for single LLM pass: %d chars, splitting into %d window(s) of ~%d chars with %d-char overlap.", + len(spec_content), + len(windows), + _LLM_WINDOW_SIZE, + _LLM_WINDOW_OVERLAP, + ) + return windows + + def _has_circular_deps(chunks: list[WorkChunk]) -> bool: """Check for circular dependencies in a list of chunks using DFS.""" ids = {c.id for c in chunks} diff --git a/src/codelicious/cli.py b/src/codelicious/cli.py index 3b268001..0b97e75a 100644 --- a/src/codelicious/cli.py +++ b/src/codelicious/cli.py @@ -1,3 +1,5 @@ +import atexit +import contextlib import dataclasses import logging import os @@ -29,6 +31,242 @@ class PreFlightResult: skipped: bool # True when --skip-auth-check was used +_DEFAULT_PR_COMMIT_CAP = 8 +_DEFAULT_PR_LOC_CAP = 250 + + +def _write_postmortem( + repo_path: Path, + *, + abort_reason: str, + log_path: Path | None = None, + spec_arg: str = "", +) -> Path | None: + """Write a one-page postmortem under ``.codelicious/`` (spec v30 Step 12). + + Aggregates per-chunk status from any ledger files written during the run + (spec v30 Step 2) and the tail of the active log file. Returns the path + to the written file, or ``None`` if writing failed. Never raises. + """ + import datetime as _dt + import json as _json + + ts = _dt.datetime.now(_dt.timezone.utc).strftime("%Y%m%dT%H%M%SZ") + out_dir = repo_path / ".codelicious" + try: + out_dir.mkdir(parents=True, exist_ok=True) + except OSError: + return None + out_path = out_dir / f"postmortem-{ts}.md" + + counts = {"merged": 0, "failed": 0, "in_progress": 0, "other": 0} + failed_titles: list[str] = [] + state_dir = out_dir / "state" + if state_dir.is_dir(): + for ledger_file in sorted(state_dir.glob("*.json")): + try: + data = _json.loads(ledger_file.read_text(encoding="utf-8")) + except (OSError, ValueError): + continue + for entry in (data.get("chunks") or {}).values(): + status = entry.get("status", "other") + counts[status] = counts.get(status, 0) + 1 + if status == "failed": + failed_titles.append(str(entry.get("title", "?"))) + + log_tail = "" + if log_path and log_path.is_file(): + try: + lines = log_path.read_text(encoding="utf-8", errors="replace").splitlines() + log_tail = "\n".join(lines[-50:]) + except OSError: + log_tail = "" + + resume_cmd = f"codelicious build {spec_arg or '.'}" + if abort_reason in ("rate-limit", "exception"): + # A re-run on the same ledger is the right move; don't suggest --reset-ledger. + pass + elif abort_reason == "ledger-corrupt": + resume_cmd += " --reset-ledger" + + body = [ + f"# Codelicious postmortem ({ts})", + "", + f"**Abort reason:** {abort_reason}", + "", + "## Chunk status", + f"- merged: {counts['merged']}", + f"- failed: {counts['failed']}", + f"- in_progress: {counts['in_progress']}", + f"- other: {counts['other']}", + "", + ] + if failed_titles: + body.append("### Failed chunks") + for title in failed_titles[:25]: + body.append(f"- {title}") + body.append("") + if log_tail: + body.append("## Log tail (last 50 lines)") + body.append("```") + body.append(log_tail) + body.append("```") + body.append("") + body.append("## Resume") + body.append(f"```\n{resume_cmd}\n```") + + try: + out_path.write_text("\n".join(body) + "\n", encoding="utf-8") + except OSError: + return None + return out_path + + +def _atomic_symlink_update(link: Path, target_name: str) -> None: + """Atomically point ``link`` at ``target_name`` (spec v30 Step 7). + + Creates ``link.<pid>.tmp`` as a symlink to ``target_name``, then uses + ``os.replace`` to swap it into place — atomic on POSIX and on NTFS for + symlinks since Windows 10. On platforms that disallow symlink creation + (e.g. Windows without developer mode), falls back to writing a plain + file ``<link>.txt`` containing ``target_name`` and logs an INFO once. + """ + tmp = link.with_suffix(link.suffix + f".{os.getpid()}.tmp") + try: + # If a stale tmp from a prior crashed run exists, remove it. + if tmp.exists() or tmp.is_symlink(): + try: + tmp.unlink() + except OSError: + pass + os.symlink(target_name, tmp) + os.replace(tmp, link) + except OSError: + # Symlink creation failed (Windows non-developer mode, or read-only + # parent). Fall back to a sibling text pointer; never raise. + try: + (link.with_suffix(link.suffix + ".txt")).write_text(target_name, encoding="utf-8") + logging.getLogger("codelicious").info( + "latest-log symlink unsupported on this platform; wrote %s.txt instead.", link + ) + except OSError: + pass + + +@contextlib.contextmanager +def _run_lock(repo_root: Path): + """Acquire a per-repo advisory lock to prevent concurrent runs (spec v30 Step 1). + + On POSIX uses ``fcntl.flock`` with ``LOCK_EX | LOCK_NB``; on platforms + without ``fcntl`` (Windows) logs a single warning and proceeds without + locking. Writes the holder PID into the lockfile, cleans up on normal + exit *and* SystemExit (atexit handler). + """ + lock_dir = repo_root / ".codelicious" + lock_dir.mkdir(parents=True, exist_ok=True) + lock_path = lock_dir / "run.lock" + + try: + import fcntl + except ImportError: + logging.getLogger("codelicious").warning("concurrent-run protection unavailable on this platform") + yield None + return + + fd = os.open(str(lock_path), os.O_CREAT | os.O_RDWR, 0o600) + try: + try: + fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB) + except OSError: + # Read existing PID for the diagnostic. + try: + with open(lock_path, encoding="utf-8") as fh: + other_pid = fh.read().strip() or "?" + except OSError: + other_pid = "?" + print( + f"Error: another codelicious run is in progress (pid {other_pid}). " + f"If you believe this is stale, delete {lock_path}.", + file=sys.stderr, + ) + os.close(fd) + sys.exit(75) + + # Write our PID and ensure cleanup on every exit path. + os.ftruncate(fd, 0) + os.write(fd, f"{os.getpid()}\n".encode()) + os.fsync(fd) + + def _release() -> None: + try: + fcntl.flock(fd, fcntl.LOCK_UN) + except OSError: + pass + try: + os.close(fd) + except OSError: + pass + try: + lock_path.unlink() + except OSError: + pass + + atexit.register(_release) + try: + yield lock_path + finally: + _release() + except SystemExit: + raise + + +def _validate_endpoint_url_strict(url: str, *, source: str = "LLM_ENDPOINT") -> None: + """Reject non-HTTPS endpoints and credentials-in-URL at the CLI layer (spec v30 Step 3). + + Raises ``SystemExit(2)`` with a clear, sanitized error before any banner + or downstream call site sees the value. Empty input is accepted (the + feature is disabled). + """ + if not url: + return + + import urllib.parse + + try: + parts = urllib.parse.urlsplit(url) + except Exception: + print(f"Error: --endpoint / {source} is not a parseable URL.", file=sys.stderr) + sys.exit(2) + + # Sanitize for display: mask any user-info before echoing. + sanitized = url + if "@" in (parts.netloc or ""): + userinfo, _, hostpart = parts.netloc.partition("@") + if userinfo: + sanitized_netloc = f"***@{hostpart}" + sanitized = urllib.parse.urlunsplit( + (parts.scheme, sanitized_netloc, parts.path, parts.query, parts.fragment) + ) + + if parts.scheme.lower() != "https": + print( + f"Error: {source} must be an https:// URL with no embedded credentials, got: {sanitized}", + file=sys.stderr, + ) + sys.exit(2) + + if not parts.hostname: + print(f"Error: {source} has no hostname: {sanitized}", file=sys.stderr) + sys.exit(2) + + if parts.username or parts.password: + print( + f"Error: {source} must be an https:// URL with no embedded credentials, got: {sanitized}", + file=sys.stderr, + ) + sys.exit(2) + + def _handle_sigterm(signum: int, frame: object) -> None: """Handle SIGTERM for graceful shutdown in container/orchestrator environments.""" logging.getLogger("codelicious").warning("Received SIGTERM (signal %d), shutting down gracefully", signum) @@ -69,10 +307,11 @@ def _probe_git_credentials(repo_path: Path) -> dict: * ``gpg --list-secret-keys --with-colons`` → whether GPG agent has a usable key Returns a dict with keys: ``transport`` (``"ssh"|"https"|"unknown"``), - ``gpg_signing`` (bool), ``ssh_key_loaded`` (bool), ``gpg_agent_warm`` (bool). - On any failure the conservative value is used so the orchestrator - surfaces an interactive prompt rather than letting an autonomous - cycle hang on an invisible passphrase prompt. + ``gpg_signing`` (bool), ``ssh_key_loaded`` (bool), ``gpg_agent_warm`` (bool), + and ``ssh_agent_state`` (one of ``"keys_loaded" | "empty" | "no_agent" | + "unknown"``). On any failure the conservative value is used so the + orchestrator surfaces an interactive prompt rather than letting an + autonomous cycle hang on an invisible passphrase prompt. """ _T = 15 info: dict = { @@ -80,6 +319,7 @@ def _probe_git_credentials(repo_path: Path) -> dict: "gpg_signing": False, "ssh_key_loaded": False, "gpg_agent_warm": False, + "ssh_agent_state": "unknown", } # ── Remote URL → transport ──────────────────────────────────────── @@ -120,13 +360,28 @@ def _probe_git_credentials(repo_path: Path) -> dict: text=True, timeout=_T, ) - # ssh-add -l: rc 0 → keys loaded; rc 1 → no keys; rc 2 → no agent - info["ssh_key_loaded"] = ssh_result.returncode == 0 and bool(ssh_result.stdout.strip()) + # ssh-add -l: rc 0 → keys loaded; rc 1 → agent running, no keys; + # rc 2 → no agent reachable. Any other rc is treated as "unknown" + # and falls into the conservative not-loaded branch. + if ssh_result.returncode == 0 and ssh_result.stdout.strip(): + info["ssh_agent_state"] = "keys_loaded" + info["ssh_key_loaded"] = True + elif ssh_result.returncode == 1: + info["ssh_agent_state"] = "empty" + info["ssh_key_loaded"] = False + elif ssh_result.returncode == 2: + info["ssh_agent_state"] = "no_agent" + info["ssh_key_loaded"] = False + else: + info["ssh_agent_state"] = "unknown" + info["ssh_key_loaded"] = False except (subprocess.TimeoutExpired, OSError): + info["ssh_agent_state"] = "unknown" info["ssh_key_loaded"] = False else: # Not an SSH push; key state is irrelevant — treat as "ready" info["ssh_key_loaded"] = True + info["ssh_agent_state"] = "keys_loaded" # ── GPG agent probe (only meaningful when signing) ──────────────── if info["gpg_signing"]: @@ -181,9 +436,17 @@ def _ensure_git_credentials_unlocked( # ── SSH key prompt ─────────────────────────────────────────────── if info["transport"] == "ssh" and not info["ssh_key_loaded"]: - print("\n SSH push transport detected, but no key is loaded in the agent.") - print(" Codelicious will run `ssh-add` so you can unlock your key now —") - print(" otherwise the autonomous loop would block on a passphrase prompt.\n") + agent_state = info.get("ssh_agent_state", "unknown") + if agent_state == "no_agent": + print("\n SSH push transport detected, but no ssh-agent is running.") + print(" Start one with `eval $(ssh-agent)` and then `ssh-add` your key,") + print(" otherwise the autonomous loop would block on a passphrase prompt.\n") + else: + # "empty" (agent running, no keys) or "unknown" (probe failure): + # running ssh-add directly is the right move in both cases. + print("\n SSH push transport detected, but no key is loaded in the agent.") + print(" Codelicious will run `ssh-add` so you can unlock your key now —") + print(" otherwise the autonomous loop would block on a passphrase prompt.\n") try: subprocess.run(["ssh-add"], timeout=300) except (subprocess.TimeoutExpired, OSError) as e: @@ -411,13 +674,11 @@ def _attach_file_log_handler(repo_path: Path) -> None: fh.addFilter(SanitizingFilter()) logging.getLogger().addHandler(fh) logging.getLogger().setLevel(logging.DEBUG) + # spec v30 Step 7: update ``latest.log`` atomically via tmp + rename + # so a fast restart (or, pre-Step 1, a concurrent run) cannot leave the + # symlink dangling between unlink and re-create. latest = log_dir / "latest.log" - try: - if latest.is_symlink() or latest.exists(): - latest.unlink() - latest.symlink_to(log_path.name) - except OSError: - pass + _atomic_symlink_update(latest, log_path.name) logging.getLogger("codelicious").info("Logging build to %s", log_path) except OSError as e: logging.getLogger("codelicious").warning("Could not attach file logger: %s", e) @@ -609,12 +870,15 @@ def _parse_args(argv: list[str]) -> dict: "skip_auth_check": False, "dry_run": False, "spec": "", - "max_commits_per_pr": 8, - "max_loc_per_pr": 250, + "max_commits_per_pr": _DEFAULT_PR_COMMIT_CAP, + "max_loc_per_pr": _DEFAULT_PR_LOC_CAP, "platform": "auto", "continuous": False, "cycle_sleep_s": 60, "skip_credential_probe": False, + "no_resume": False, + "reset_ledger": False, + "min_coverage": -1.0, } # Flags that take a value @@ -629,6 +893,7 @@ def _parse_args(argv: list[str]) -> dict: "--max-loc-per-pr": "max_loc_per_pr", "--platform": "platform", "--cycle-sleep-s": "cycle_sleep_s", + "--min-coverage": "min_coverage", } # Integer-valued flags that need int() conversion @@ -646,6 +911,9 @@ def _parse_args(argv: list[str]) -> dict: "--skip-credential-probe": "skip_credential_probe", "--dry-run": "dry_run", "--continuous": "continuous", + # spec v30 Step 2: idempotent-resume controls. + "--no-resume": "no_resume", + "--reset-ledger": "reset_ledger", } i = 0 @@ -693,6 +961,12 @@ def _parse_args(argv: list[str]) -> dict: except ValueError: print(f"Error: {args[i]} requires an integer, got '{value}'") sys.exit(2) + elif key == "min_coverage": + try: + value = float(value) + except ValueError: + print(f"Error: {args[i]} requires a number, got '{value}'") + sys.exit(2) opts[key] = value i += 2 elif not args[i].startswith("-") and not opts["repo_path"]: @@ -742,11 +1016,23 @@ def main(): opts = _parse_args(sys.argv) + # spec v30 Step 3: reject insecure / credential-bearing LLM endpoints + # before any downstream code (or the banner) sees the value. Validates + # the env var the LLM client ultimately reads. + _validate_endpoint_url_strict(os.environ.get("LLM_ENDPOINT", ""), source="LLM_ENDPOINT") + if opts.get("endpoint"): + _validate_endpoint_url_strict(opts["endpoint"], source="--endpoint") + repo_path = Path(opts["repo_path"]).resolve() if not repo_path.is_dir(): logger.error("Repository path %s does not exist or is not a directory.", repo_path) sys.exit(1) + # spec v30 Step 1: per-repo advisory lock — second concurrent invocation + # exits 75 (EX_TEMPFAIL) before any git, sandbox, or LLM call happens. + _run_lock_cm = _run_lock(repo_path) + _run_lock_cm.__enter__() + _attach_file_log_handler(repo_path) logger.info("Starting Codelicious workflow in %s", repo_path) @@ -780,6 +1066,19 @@ def main(): logger.error(str(e)) sys.exit(1) + # spec v30 Step 5: when the user did NOT force a specific engine and both + # Claude and HuggingFace credentials are configured, build a fallback list + # so a Claude rate-limit fails over to HF for the remainder of the spec. + requested_engine = opts.get("engine") or os.environ.get("CODELICIOUS_ENGINE", "auto") + fallback_engines: list = [engine] + if requested_engine in ("auto", ""): + hf_creds = bool(os.environ.get("HF_TOKEN") or os.environ.get("LLM_API_KEY")) + if engine.name == "claude" and hf_creds: + try: + fallback_engines.append(select_engine("huggingface")) + except RuntimeError: + pass + # 2. Initialize Git Orchestration git_manager = GitManager(repo_path) git_manager.verify_git_identity() # spec-27 Phase 0.2 @@ -837,8 +1136,8 @@ def main(): except OSError: pass print() - print(f"[codelicious] Max commits per PR: {opts.get('max_commits_per_pr', 8)}") - print(f"[codelicious] Max LOC per PR: {opts.get('max_loc_per_pr', 250)}") + print(f"[codelicious] Max commits per PR: {opts.get('max_commits_per_pr', _DEFAULT_PR_COMMIT_CAP)}") + print(f"[codelicious] Max LOC per PR: {opts.get('max_loc_per_pr', _DEFAULT_PR_LOC_CAP)}") print(f"[codelicious] Platform: {opts.get('platform', 'auto')}") print() sys.exit(0) @@ -849,14 +1148,28 @@ def main(): # 5. Run the v2 chunk-based orchestration loop (spec-27 Phase 4.1) from codelicious.orchestrator import V2Orchestrator + # spec v30 Step 2: --reset-ledger wipes any persisted chunk state before + # the orchestrator runs so the next build re-executes everything. + if opts.get("reset_ledger"): + ledger_dir = repo_path / ".codelicious" / "state" + if ledger_dir.is_dir(): + for ledger_file in ledger_dir.glob("*.json"): + try: + ledger_file.unlink() + except OSError: + pass + logger.info("Reset chunk-status ledger directory: %s", ledger_dir) + v2_orch = V2Orchestrator( repo_path=repo_path, git_manager=git_manager, engine=engine, - max_commits_per_pr=opts.get("max_commits_per_pr", 8), - max_loc_per_pr=opts.get("max_loc_per_pr", 250), + engines=fallback_engines if len(fallback_engines) > 1 else None, + max_commits_per_pr=opts.get("max_commits_per_pr", _DEFAULT_PR_COMMIT_CAP), + max_loc_per_pr=opts.get("max_loc_per_pr", _DEFAULT_PR_LOC_CAP), model=opts.get("model", ""), progress_callback=_print_progress, + no_resume=opts.get("no_resume", False), ) continuous = opts.get("continuous", False) diff --git a/src/codelicious/engines/__init__.py b/src/codelicious/engines/__init__.py index 2b4350c2..84fce40d 100644 --- a/src/codelicious/engines/__init__.py +++ b/src/codelicious/engines/__init__.py @@ -6,11 +6,11 @@ import os import shutil -from codelicious.engines.base import BuildEngine, BuildResult, ChunkResult, EngineContext +from codelicious.engines.base import BuildEngine, ChunkResult, EngineContext logger = logging.getLogger("codelicious.engines") -__all__ = ["BuildEngine", "BuildResult", "ChunkResult", "EngineContext", "select_engine"] +__all__ = ["BuildEngine", "ChunkResult", "EngineContext", "select_engine"] def select_engine(engine_preference: str = "auto") -> BuildEngine: diff --git a/src/codelicious/engines/base.py b/src/codelicious/engines/base.py index 03b1ac69..a461170c 100644 --- a/src/codelicious/engines/base.py +++ b/src/codelicious/engines/base.py @@ -1,8 +1,10 @@ """Abstract base class and shared dataclasses for codelicious build engines. -Spec-27 Phase 3.1 adds the chunk-level interface (``execute_chunk``, -``verify_chunk``, ``fix_chunk``) alongside the legacy ``run_build_cycle`` -so both old and new orchestration paths work during migration. +Spec-27 Phase 3.1 defines the chunk-level interface: + +* ``execute_chunk`` — implement one work chunk +* ``verify_chunk`` — lint/test/security check a completed chunk +* ``fix_chunk`` — attempt to fix verification failures """ from __future__ import annotations @@ -11,21 +13,6 @@ import dataclasses import pathlib -# --------------------------------------------------------------------------- -# Legacy result (kept for backward compatibility during migration) -# --------------------------------------------------------------------------- - - -@dataclasses.dataclass -class BuildResult: - """Result from a legacy full-cycle build engine run.""" - - success: bool - message: str = "" - session_id: str = "" - elapsed_s: float = 0.0 - - # --------------------------------------------------------------------------- # spec-27 Phase 3.1: Chunk-level dataclasses # --------------------------------------------------------------------------- @@ -69,12 +56,10 @@ class BuildEngine(abc.ABC): """Abstract base for all codelicious build engines. **Chunk-level interface** (spec-27 Phase 3.1): - - ``execute_chunk`` — implement one work chunk - - ``verify_chunk`` — lint/test/security check a completed chunk - - ``fix_chunk`` — attempt to fix verification failures - **Legacy interface** (kept for migration): - - ``run_build_cycle`` — run the full build lifecycle + * ``execute_chunk`` — implement one work chunk + * ``verify_chunk`` — lint/test/security check a completed chunk + * ``fix_chunk`` — attempt to fix verification failures """ @property @@ -126,37 +111,3 @@ def fix_chunk( ``verify_chunk`` call. The engine should try to resolve them and return the updated file list. """ - - # ------------------------------------------------------------------ - # Legacy interface (kept for backward compatibility) - # ------------------------------------------------------------------ - - @abc.abstractmethod - def run_build_cycle( - self, - repo_path: pathlib.Path, - git_manager: object, - cache_manager: object, - spec_filter: str | None = None, - **kwargs, - ) -> BuildResult: - """Run the full build cycle (legacy interface). - - Parameters - ---------- - repo_path: - Path to the target repository. - git_manager: - GitManager instance for branch/commit/PR operations. - cache_manager: - CacheManager instance for state persistence. - spec_filter: - Optional: path to a specific spec to build. - **kwargs: - Engine-specific configuration (model, timeout, etc.) - - Returns - ------- - BuildResult - Outcome of the build cycle. - """ diff --git a/src/codelicious/engines/claude_engine.py b/src/codelicious/engines/claude_engine.py index eb9077c0..f2068f12 100644 --- a/src/codelicious/engines/claude_engine.py +++ b/src/codelicious/engines/claude_engine.py @@ -1,9 +1,8 @@ """Claude Code CLI build engine (spec-27 Phase 3.2). Delegates to the ``claude`` binary in headless mode for chunk execution. -The v2 orchestrator (``V2Orchestrator``) drives the chunk loop — this -engine only implements ``execute_chunk``, ``verify_chunk``, ``fix_chunk``, -and a ``run_build_cycle`` that delegates to ``V2Orchestrator``. +The orchestrator drives the chunk loop — this engine implements +``execute_chunk``, ``verify_chunk``, and ``fix_chunk``. """ from __future__ import annotations @@ -13,10 +12,25 @@ import sys import time -from codelicious.engines.base import BuildEngine, BuildResult, ChunkResult, EngineContext +from codelicious.engines.base import BuildEngine, ChunkResult, EngineContext logger = logging.getLogger("codelicious.engines.claude") +# spec v29 Step 9 / spec 27 §3.2: the default Claude CLI ``--allowedTools`` +# allow-list applied to every chunk. Surfacing this list at the chunk site +# (rather than burying it in ``agent_runner``) lets future engines tighten +# the surface per chunk without forking the runner. +_DEFAULT_ALLOWED_TOOLS: tuple[str, ...] = ( + "Edit", + "Write", + "Bash(git status:*)", + "Bash(pytest:*)", + "Bash(ruff:*)", + "Read", + "Glob", + "Grep", +) + class ClaudeCodeEngine(BuildEngine): """Build engine that uses the Claude Code CLI as its backend.""" @@ -79,6 +93,11 @@ class _ChunkConfig: config.max_turns = 50 config.agent_timeout_s = max(int(context.deadline - time.monotonic()), 60) if context.deadline else 1800 config.dry_run = False + # spec v29 Step 9: surface the spec 27 §3.2 Claude CLI flags at the + # chunk call-site so per-chunk callers can tighten the tool allow-list + # or switch output formats without editing agent_runner. + config.output_format = "stream-json" + config.allowed_tools = list(_DEFAULT_ALLOWED_TOOLS) try: result = run_agent( @@ -162,9 +181,11 @@ def verify_chunk( chunk_id = getattr(chunk, "id", "unknown") try: - from codelicious.verifier import verify + from codelicious.verifier import verify, verify_paths - vresult = verify(repo_path) + # spec v29 Step 6: scope checks to the chunk's files when known. + scoped = list(getattr(chunk, "estimated_files", []) or []) + vresult = verify_paths(repo_path, scoped) if scoped else verify(repo_path) if vresult.all_passed: logger.info("Verification passed for chunk %s.", chunk_id) return ChunkResult(success=True, message="All checks passed.") @@ -252,48 +273,3 @@ class _FixConfig: message=f"Fix attempt for chunk {chunk_id}", retries_used=1, ) - - # ------------------------------------------------------------------ - # Legacy interface — delegates to V2Orchestrator - # ------------------------------------------------------------------ - - def run_build_cycle( - self, - repo_path: pathlib.Path, - git_manager: object, - cache_manager: object, - spec_filter: str | None = None, - **kwargs, - ) -> BuildResult: - """Run the build lifecycle by delegating to V2Orchestrator. - - This method exists for backward compatibility with the ``BuildEngine`` - interface. The ``cli.py`` main entry point now calls ``V2Orchestrator`` - directly, so this path is only used if an external caller invokes the - engine directly. - """ - from codelicious.orchestrator import V2Orchestrator - from codelicious.spec_discovery import discover_incomplete_specs - - start = time.monotonic() - repo_path = pathlib.Path(repo_path).resolve() - agent_timeout_s = kwargs.get("agent_timeout_s", 1800) - push_pr = kwargs.get("push_pr", False) - max_commits_per_pr = kwargs.get("max_commits_per_pr", 50) - - specs = discover_incomplete_specs(repo_path) - if not specs: - return BuildResult(success=True, message="No incomplete specs found.", elapsed_s=time.monotonic() - start) - - orch = V2Orchestrator(repo_path, git_manager, self, max_commits_per_pr=max_commits_per_pr) - result = orch.run( - specs=specs, - deadline=start + agent_timeout_s, - push_pr=push_pr, - ) - - return BuildResult( - success=result.success, - message=result.message, - elapsed_s=result.elapsed_s, - ) diff --git a/src/codelicious/engines/huggingface_engine.py b/src/codelicious/engines/huggingface_engine.py index cae22b4a..eaa9f4c4 100644 --- a/src/codelicious/engines/huggingface_engine.py +++ b/src/codelicious/engines/huggingface_engine.py @@ -13,12 +13,17 @@ import random import time -from codelicious.engines.base import BuildEngine, BuildResult, ChunkResult, EngineContext +from codelicious.engines.base import BuildEngine, ChunkResult, EngineContext from codelicious.errors import LLMRateLimitError from codelicious.loop_controller import MAX_HISTORY_TOKENS, MAX_TOOL_RESULT_BYTES, truncate_history logger = logging.getLogger("codelicious.engines.huggingface") +# spec v29 Step 11: cap on the number of post-reflection fix-cycle attempts +# the HF engine will make before giving up. Prevents infinite loops when the +# model cannot satisfy verification. +_HF_MAX_FIX_ATTEMPTS: int = 2 + def _is_transient(exc: Exception) -> bool: """Classify an exception as transient (retryable) vs fatal.""" @@ -251,6 +256,33 @@ def execute_chunk( except Exception: files = [] + # spec v29 Step 11: gating verification before signalling success. + # The HF model can emit CHUNK_COMPLETE without actually satisfying the + # spec; run our own verifier and, if it fails, give the model up to two + # fix-cycle attempts before giving up. + retries_used = 0 + if completed: + verification = self.verify_chunk(chunk, repo_path) + while not verification.success and retries_used < _HF_MAX_FIX_ATTEMPTS: + retries_used += 1 + logger.info( + "Chunk %s: post-reflection verification failed (attempt %d/%d); invoking fix_chunk.", + chunk_id, + retries_used, + _HF_MAX_FIX_ATTEMPTS, + ) + self.fix_chunk(chunk, repo_path, [verification.message or "verification failed"]) + verification = self.verify_chunk(chunk, repo_path) + if not verification.success: + return ChunkResult( + success=False, + files_modified=files, + message=( + f"HF engine could not satisfy verification after " + f"{_HF_MAX_FIX_ATTEMPTS} fix attempts: {verification.message}" + ), + ) + return ChunkResult( success=completed, files_modified=files, @@ -264,9 +296,11 @@ def verify_chunk( ) -> ChunkResult: """Run verification checks on the repo after a chunk.""" try: - from codelicious.verifier import verify + from codelicious.verifier import verify, verify_paths - vresult = verify(repo_path) + # spec v29 Step 6: scope checks to the chunk's files when known. + scoped = list(getattr(chunk, "estimated_files", []) or []) + vresult = verify_paths(repo_path, scoped) if scoped else verify(repo_path) if vresult.all_passed: return ChunkResult(success=True, message="All checks passed.") @@ -315,48 +349,3 @@ class _FixChunk: message=result.message, retries_used=1, ) - - # ------------------------------------------------------------------ - # Legacy interface — delegates to V2Orchestrator - # ------------------------------------------------------------------ - - def run_build_cycle( - self, - repo_path: pathlib.Path, - git_manager: object, - cache_manager: object, - spec_filter: str | None = None, - **kwargs, - ) -> BuildResult: - """Run the build lifecycle by delegating to V2Orchestrator. - - This method exists for backward compatibility with the ``BuildEngine`` - interface. The ``cli.py`` main entry point now calls ``V2Orchestrator`` - directly, so this path is only used if an external caller invokes the - engine directly. - """ - from codelicious.orchestrator import V2Orchestrator - from codelicious.spec_discovery import discover_incomplete_specs - - start = time.monotonic() - repo_path = pathlib.Path(repo_path).resolve() - agent_timeout_s = kwargs.get("agent_timeout_s", 1800) - push_pr = kwargs.get("push_pr", False) - max_commits_per_pr = kwargs.get("max_commits_per_pr", 50) - - specs = discover_incomplete_specs(repo_path) - if not specs: - return BuildResult(success=True, message="No incomplete specs found.", elapsed_s=time.monotonic() - start) - - orch = V2Orchestrator(repo_path, git_manager, self, max_commits_per_pr=max_commits_per_pr) - result = orch.run( - specs=specs, - deadline=start + agent_timeout_s, - push_pr=push_pr, - ) - - return BuildResult( - success=result.success, - message=result.message, - elapsed_s=result.elapsed_s, - ) diff --git a/src/codelicious/git/git_orchestrator.py b/src/codelicious/git/git_orchestrator.py index f06fbffa..657b1984 100644 --- a/src/codelicious/git/git_orchestrator.py +++ b/src/codelicious/git/git_orchestrator.py @@ -643,6 +643,7 @@ def ensure_draft_pr_exists( part: int = 0, prev_pr_url: str = "", chunk_summaries: list[str] | None = None, + chunk_metadata: dict | None = None, ) -> int | None: """Ensure exactly one PR/MR exists for the current spec (spec-27 Phase 5). @@ -707,7 +708,7 @@ def ensure_draft_pr_exists( title = f"{title} (part {part})" title = title.replace("\n", " ").replace("\r", " ").replace("\x00", "")[:70] - body = self._build_pr_body(spec_id, chunk_summaries, prev_pr_url) + body = self._build_pr_body(spec_id, chunk_summaries, prev_pr_url, chunk_metadata) if platform == "gitlab": return self._create_gitlab_mr(cli_tool, title, body, _TIMEOUT_S) @@ -718,8 +719,16 @@ def _build_pr_body( spec_id: str, chunk_summaries: list[str] | None, prev_pr_url: str, + chunk_metadata: dict | None = None, ) -> str: - """Build the PR/MR body with spec link, chunk summary, and part links.""" + """Build the PR/MR body with spec link, chunk summary, and part links. + + spec v30 Step 9: when ``chunk_metadata`` is provided, render two extra + sections — ``## Chunk Context`` (spec source path, chunk id/title, + dependency graph slice) and ``## Verifier Summary`` (test count, lint + warnings, coverage Δ) — so a human reviewer has full context in one + glance. Missing fields render as ``n/a``. + """ parts = [ "## Summary\n", f"Autonomous implementation by Codelicious (spec-{spec_id}).\n", @@ -729,6 +738,39 @@ def _build_pr_body( for cs in chunk_summaries[:50]: parts.append(f"- {cs}") parts.append("") + + if chunk_metadata is not None: + md = chunk_metadata + parts.append("## Chunk Context") + parts.append(f"- Spec source: `{md.get('spec_path') or 'n/a'}`") + parts.append(f"- Chunk id: `{md.get('chunk_id') or 'n/a'}`") + parts.append(f"- Chunk title: {md.get('chunk_title') or 'n/a'}") + depends_on = md.get("depends_on") or [] + dependents = md.get("dependents") or [] + parts.append(f"- Depends on: {', '.join(depends_on) if depends_on else 'n/a'}") + parts.append(f"- Dependents: {', '.join(dependents) if dependents else 'n/a'}") + parts.append("") + + verifier = md.get("verifier") or {} + parts.append("## Verifier Summary") + tests = verifier.get("tests_passed", "n/a") + warnings = verifier.get("lint_warnings", "n/a") + coverage = verifier.get("coverage_pct", "n/a") + cov_delta = verifier.get("coverage_delta_pp", "n/a") + cov_str = f"{coverage}%" if coverage != "n/a" else "n/a" + delta_str = ( + f" (Δ {'+' if isinstance(cov_delta, (int, float)) and cov_delta >= 0 else ''}{cov_delta} pp)" + if cov_delta != "n/a" + else "" + ) + parts.append(f"tests passed: {tests} | lint warnings: {warnings} | coverage: {cov_str}{delta_str}") + parts.append("") + + audit_log = md.get("audit_log_path") + parts.append("## Audit Log") + parts.append(f"- {audit_log if audit_log else 'n/a'}") + parts.append("") + if prev_pr_url: parts.append(f"**Previous part:** {prev_pr_url}\n") parts.append("This PR updates automatically as new commits are pushed.\n") @@ -1144,17 +1186,72 @@ def revert_chunk_changes(self) -> bool: logger.error("Failed to revert chunk changes: %s", e) return False + def _branch_exists_locally(self, branch: str) -> bool: + """Return True iff ``branch`` is a local ref.""" + try: + result = self._run_cmd(["git", "branch", "--list", branch], check=False) + except RuntimeError: + return False + # `git branch --list <name>` prints the branch (with optional `*` prefix) + # when present, empty output otherwise. + return bool((getattr(result, "stdout", "") or "").strip()) + + def _branch_exists_remotely(self, branch: str) -> bool: + """Return True iff ``branch`` exists on ``origin``. Network failures + treated as "unknown / assume not present" — disambiguation is best-effort.""" + try: + result = self._run_cmd( + ["git", "ls-remote", "--heads", "origin", branch], + check=False, + timeout=15, + ) + except (RuntimeError, TypeError): + return False + return bool((getattr(result, "stdout", "") or "").strip()) + + def _disambiguate_branch(self, candidate: str, *, suffix_hint: str = "") -> str: + """Resolve branch-name collisions (spec v30 Step 10). + + If ``candidate`` exists either locally or on origin, append ``-suffix_hint`` + (or a 6-char hex tag derived from the current time when no hint is given). + If even that suffixed name collides, fall back to ``-<unix_ts>``. Logs INFO + on every disambiguation so reviewers can match branch back to chunk. + """ + local = self._branch_exists_locally(candidate) + remote = self._branch_exists_remotely(candidate) + if not (local or remote): + return candidate + + import secrets as _secrets + import time as _time + + suffix = suffix_hint or _secrets.token_hex(3) + first = f"{candidate}-{suffix}" + if not (self._branch_exists_locally(first) or self._branch_exists_remotely(first)): + logger.info("Branch %s exists; using %s", candidate, first) + return first + + ts = int(_time.time()) + fallback = f"{candidate}-{ts}" + logger.info("Branch %s and %s exist; using %s", candidate, first, fallback) + return fallback + def create_continuation_branch(self, spec_id: str, part: int) -> str: """Create a new branch for the next part of a split PR (spec-27 Phase 2.3). Returns the new branch name. """ - branch_name = f"codelicious/spec-{spec_id}-part-{part}" + # spec v30 Step 10: disambiguate ahead of `git checkout -b` so a stale + # branch from a previous run doesn't get silently re-checked-out. + candidate = f"codelicious/spec-{spec_id}-part-{part}" + branch_name = self._disambiguate_branch(candidate) try: self._run_cmd(["git", "checkout", "-b", branch_name]) logger.info("Created continuation branch: %s", branch_name) except RuntimeError: - # Branch might already exist + # Local checkout -b can still fail if the branch was created since + # our disambiguation probe; checkout the existing one rather than + # crashing the build. self._run_cmd(["git", "checkout", branch_name]) logger.info("Checked out existing continuation branch: %s", branch_name) return branch_name diff --git a/src/codelicious/llm_client.py b/src/codelicious/llm_client.py index b99dc31c..cb694f42 100644 --- a/src/codelicious/llm_client.py +++ b/src/codelicious/llm_client.py @@ -1,9 +1,11 @@ from __future__ import annotations +import email.utils import ipaddress import json import logging import os +import secrets import socket import ssl import time @@ -25,6 +27,46 @@ _DEFAULT_ENDPOINT = "https://router.huggingface.co/sambanova/v1/chat/completions" +# Cap for any provider-supplied Retry-After value, in seconds. A misbehaving +# or malicious provider could otherwise stall codelicious indefinitely. +_RETRY_AFTER_CAP_S = 120.0 + + +def _jittered_backoff(base: float, attempt: int) -> float: + """Compute exponential backoff with multiplicative jitter in [0.5x, 1.5x]. + + Uses ``secrets.SystemRandom`` per project security rules. + """ + return base * (2**attempt) * (0.5 + secrets.SystemRandom().random()) + + +def _parse_retry_after(value: str | None) -> float | None: + """Parse an HTTP Retry-After header value (seconds or HTTP-date). + + Returns the wait in seconds, capped at ``_RETRY_AFTER_CAP_S``, or None if + the header is missing/unparseable/in the past. + """ + if not value: + return None + value = value.strip() + try: + secs = float(value) + if secs >= 0: + return min(secs, _RETRY_AFTER_CAP_S) + except ValueError: + pass + try: + when = email.utils.parsedate_to_datetime(value) + except (TypeError, ValueError): + return None + if when is None: + return None + delta = when.timestamp() - time.time() + if delta <= 0: + return None + return min(delta, _RETRY_AFTER_CAP_S) + + # Known-good endpoint base URLs that bypass DNS resolution checks (S20-P1-1) _ALLOWED_ENDPOINT_BASES: frozenset[str] = frozenset( { @@ -218,9 +260,15 @@ def chat_completion( logger.debug("LLM API error body (status %s): %s", e.code, sanitized_body) if e.code in self._RETRYABLE_HTTP_CODES and attempt < self._MAX_RETRIES: - backoff = self._BACKOFF_BASE_S * (2**attempt) + retry_after = _parse_retry_after(e.headers.get("Retry-After")) if e.code == 429 else None + if retry_after is not None: + # Apply jitter to provider-supplied delay so concurrent runs don't thunder-herd. + backoff = retry_after * (0.5 + secrets.SystemRandom().random()) + logger.info("Honoring Retry-After: %.1fs for HTTP 429 (%s)", retry_after, model) + else: + backoff = _jittered_backoff(self._BACKOFF_BASE_S, attempt) logger.warning( - "Transient HTTP %d from LLM API (%s); retrying in %.0fs (attempt %d/%d).", + "Transient HTTP %d from LLM API (%s); retrying in %.1fs (attempt %d/%d).", e.code, model, backoff, @@ -235,7 +283,7 @@ def chat_completion( raise RuntimeError("LLM API Error (%s): HTTP %s - see debug logs for details" % (model, e.code)) from e except (TimeoutError, urllib.error.URLError, ssl.SSLError, ConnectionResetError, OSError) as e: if attempt < self._MAX_RETRIES: - backoff = self._BACKOFF_BASE_S * (2**attempt) + backoff = _jittered_backoff(self._BACKOFF_BASE_S, attempt) logger.warning( "Transient network error from LLM API (%s): %s; retrying in %.0fs (attempt %d/%d).", model, diff --git a/src/codelicious/loop_controller.py b/src/codelicious/loop_controller.py index b53cb56c..66fa8496 100644 --- a/src/codelicious/loop_controller.py +++ b/src/codelicious/loop_controller.py @@ -197,7 +197,9 @@ def _execute_agentic_iteration(self) -> bool: break except Exception as llm_exc: last_llm_error = llm_exc - wait_s = _LLM_BACKOFF_BASE_S * (2**_attempt) + from codelicious.llm_client import _jittered_backoff + + wait_s = _jittered_backoff(_LLM_BACKOFF_BASE_S, _attempt) logger.warning( "LLM call failed (attempt %d/%d): %s — retrying in %.1fs", _attempt + 1, diff --git a/src/codelicious/orchestrator.py b/src/codelicious/orchestrator.py index 2a2fdbe0..cf26973e 100644 --- a/src/codelicious/orchestrator.py +++ b/src/codelicious/orchestrator.py @@ -1,154 +1,43 @@ -"""Phase-based orchestrator for parallel agent workflows. +"""Chunk-based orchestrator for codelicious v2 (spec-27 Phase 4.1). -This module implements the conflict-free parallelization strategy: +Runs the simplified workflow:: - Phase 1: BUILD — one builder per spec, each in an isolated git worktree - Phase 2: MERGE — deterministic serial merge of worktree branches - Phase 3: REVIEW — N read-only reviewer agents in parallel (security, QA, perf, …) - Phase 4: FIX — single fixer agent applies triaged findings serially + for each spec: + chunk the spec → for each chunk: + execute → verify → fix → commit → push + transition PR to review -The key invariant: **two writers never touch the same working tree at the -same time**. Builders get worktree isolation. Reviewers are read-only. -The fixer is the only writer during Phase 4. +No worktree isolation. Each spec gets a branch. Chunks are executed +serially. One commit per chunk. -Usage from the engine:: +Usage:: - orch = Orchestrator(repo_path, git_manager, config) - result = orch.run(specs=[...], reviewers=["security", "qa", "performance"]) + from codelicious.orchestrator import Orchestrator + + orch = Orchestrator(repo_path, git_manager, engine) + result = orch.run(specs=[Path("docs/specs/01_feature.md")]) """ from __future__ import annotations -import concurrent.futures -import json import logging import pathlib import re import subprocess -import sys -import threading import time from dataclasses import dataclass, field logger = logging.getLogger("codelicious.orchestrator") __all__ = [ - "REVIEWER_PROMPTS", + "ChunkOutcome", "Finding", "Orchestrator", "OrchestratorResult", - "ReviewRole", "V2Orchestrator", ] -# --------------------------------------------------------------------------- -# Reviewer roles and prompts -# --------------------------------------------------------------------------- - - -@dataclass(frozen=True) -class ReviewRole: - """A named reviewer role with a system prompt.""" - - name: str - prompt: str - - -REVIEWER_PROMPTS: dict[str, str] = { - "security": """\ -You are a **security engineer** reviewing {{project_name}}. - -GUARDRAILS: Do NOT modify any files. Read only. - -Perform a thorough security audit: -- OWASP Top 10 vulnerabilities (injection, XSS, SSRF, IDOR, etc.) -- Hardcoded secrets, API keys, tokens in source code -- Unsafe deserialization, eval(), exec(), shell=True -- Path traversal, symlink attacks, TOCTOU races -- Authentication and authorization gaps -- Cryptographic misuse (weak algorithms, hardcoded IVs) - -For every finding: -- Cite the exact file path and line number -- Rate severity: P1 (critical), P2 (important), P3 (minor) -- Describe the attack scenario -- Suggest the fix - -Write ALL findings as JSON to `.codelicious/review_security.json`: -```json -[{"severity": "P1", "file": "src/foo.py", "line": 42, "title": "...", "description": "...", "fix": "..."}] -``` - -""", - "qa": """\ -You are a **QA engineer** reviewing {{project_name}}. - -GUARDRAILS: Do NOT modify any files. Read only. - -Run the full test suite and analyze coverage: -- Identify untested code paths and edge cases -- Check for flaky tests (non-deterministic, order-dependent) -- Verify error handling paths are tested -- Look for missing boundary condition tests -- Check that test assertions are meaningful (not just "assert True") -- Verify mocks match real interfaces - -For every finding, cite file:line and severity (P1/P2/P3). - -Write ALL findings as JSON to `.codelicious/review_qa.json`: -```json -[{"severity": "P2", "file": "tests/test_foo.py", "line": 10, "title": "...", "description": "...", "fix": "..."}] -``` - -""", - "performance": """\ -You are a **performance engineer** reviewing {{project_name}}. - -GUARDRAILS: Do NOT modify any files. Read only. - -Analyze for performance issues: -- O(n^2) or worse algorithms where O(n) would work -- Unnecessary memory allocations or copies -- Missing caching opportunities -- Unbounded data structures (lists that grow without limit) -- Regex catastrophic backtracking (ReDoS) -- Blocking I/O in hot paths -- N+1 query patterns - -For every finding, cite file:line and severity (P1/P2/P3). - -Write ALL findings as JSON to `.codelicious/review_performance.json`: -```json -[{"severity": "P2", "file": "src/foo.py", "line": 99, "title": "...", "description": "...", "fix": "..."}] -``` - -""", - "reliability": """\ -You are a **reliability engineer** reviewing {{project_name}}. - -GUARDRAILS: Do NOT modify any files. Read only. - -Analyze for reliability issues: -- Race conditions and thread safety problems -- Resource leaks (file handles, connections, threads) -- Missing timeouts on I/O operations -- Unhandled exceptions that crash the process -- Missing retry logic for transient failures -- Deadlock potential in concurrent code -- State corruption from partial failures - -For every finding, cite file:line and severity (P1/P2/P3). - -Write ALL findings as JSON to `.codelicious/review_reliability.json`: -```json -[{"severity": "P1", "file": "src/foo.py", "line": 55, "title": "...", "description": "...", "fix": "..."}] -``` - -""", -} - - # --------------------------------------------------------------------------- # Data structures # --------------------------------------------------------------------------- @@ -189,903 +78,12 @@ class OrchestratorResult: cycles_completed: int = 0 -# --------------------------------------------------------------------------- -# Worktree helpers -# --------------------------------------------------------------------------- - -_WORKTREE_TIMEOUT_S: int = 120 # Max seconds for worktree subprocess operations -_MERGE_ABORT_TIMEOUT_S: int = 30 # Max seconds for git merge --abort - - -def _create_worktree(repo_path: pathlib.Path, branch_name: str) -> pathlib.Path: - """Create a git worktree for isolated building. - - Returns the path to the new worktree directory. - """ - # Sanitize branch_name to prevent path traversal (Finding 30) - safe_branch = re.sub(r"[^a-zA-Z0-9_\-/]", "_", branch_name) - safe_branch = safe_branch.replace("..", "_") - worktree_dir = repo_path / ".codelicious" / "worktrees" / safe_branch - worktrees_root = (repo_path / ".codelicious" / "worktrees").resolve() - if not worktree_dir.resolve().is_relative_to(worktrees_root): - raise RuntimeError(f"Worktree path escapes allowed directory: {branch_name}") - worktree_dir.parent.mkdir(parents=True, exist_ok=True) - - # Clean up stale worktree if it exists - if worktree_dir.exists(): - try: - subprocess.run( - ["git", "worktree", "remove", "--force", str(worktree_dir)], - cwd=str(repo_path), - capture_output=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired: - logger.warning("Timed out removing stale worktree %s; proceeding anyway.", worktree_dir) - - # Create the worktree with a new branch - try: - result = subprocess.run( - ["git", "worktree", "add", "-b", safe_branch, str(worktree_dir)], - cwd=str(repo_path), - capture_output=True, - text=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired as exc: - raise RuntimeError(f"Timed out creating worktree for branch {branch_name}") from exc - - if result.returncode != 0: - # Branch might already exist — try without -b - try: - result = subprocess.run( - ["git", "worktree", "add", str(worktree_dir), safe_branch], - cwd=str(repo_path), - capture_output=True, - text=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired as exc: - raise RuntimeError(f"Timed out creating worktree (fallback) for branch {branch_name}") from exc - if result.returncode != 0: - raise RuntimeError(f"Failed to create worktree: {result.stderr}") - - logger.info("Created worktree at %s (branch: %s)", worktree_dir, branch_name) - return worktree_dir - - -def _remove_worktree(repo_path: pathlib.Path, worktree_dir: pathlib.Path) -> None: - """Remove a git worktree.""" - try: - subprocess.run( - ["git", "worktree", "remove", "--force", str(worktree_dir)], - cwd=str(repo_path), - capture_output=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired: - logger.warning("Timed out removing worktree %s; it may need manual cleanup.", worktree_dir) - return - logger.info("Removed worktree: %s", worktree_dir) - - -def _commit_worktree_changes(worktree_dir: pathlib.Path, spec_name: str) -> bool: - """Stage and commit all changes in a worktree. - - The build agent is forbidden from running git commands, so the - orchestrator must commit changes on the agent's behalf before the - worktree is removed. Without this commit, changes would be lost. - - Excludes ``.codelicious/`` from the commit to prevent merge conflicts - when multiple worktrees modify STATE.md or BUILD_COMPLETE. - - Attempts a GPG-signed commit first. Falls back to ``--no-gpg-sign`` - only when GPG-related errors are detected in stderr (e.g. no GPG agent - is available in the worktree environment). - - Returns True if a commit was created, False if the worktree was clean. - """ - # Stage everything EXCEPT .codelicious/ (which causes merge conflicts) - try: - subprocess.run( - ["git", "add", "--all", "--", ".", ":!.codelicious/"], - cwd=str(worktree_dir), - capture_output=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired: - logger.warning("Timed out staging changes in worktree %s.", worktree_dir) - return False - - # Check if there's anything staged - try: - status = subprocess.run( - ["git", "diff", "--cached", "--quiet"], - cwd=str(worktree_dir), - capture_output=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired: - logger.warning("Timed out checking staged diff in worktree %s.", worktree_dir) - return False - - if status.returncode == 0: - logger.debug("Worktree %s has no staged changes — nothing to commit.", worktree_dir) - return False - - # Try a signed commit first (Finding 42: honour GPG signing policy) - try: - result = subprocess.run( - ["git", "commit", "-m", f"codelicious: build {spec_name}"], - cwd=str(worktree_dir), - capture_output=True, - text=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired: - logger.warning("Timed out committing worktree changes for %s.", spec_name) - return False - - if result.returncode != 0: - stderr_lower = result.stderr.lower() - gpg_related = any(kw in stderr_lower for kw in ("gpg", "signing", "sign", "secret key")) - if gpg_related: - logger.warning( - "GPG signing unavailable in worktree (no GPG agent); falling back to unsigned commit. stderr: %s", - result.stderr.strip(), - ) - try: - result = subprocess.run( - ["git", "commit", "--no-gpg-sign", "-m", f"codelicious: build {spec_name}"], - cwd=str(worktree_dir), - capture_output=True, - text=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired: - logger.warning("Timed out committing (unsigned) worktree changes for %s.", spec_name) - return False - if result.returncode != 0: - logger.warning("Failed to commit worktree changes: %s", result.stderr.strip()) - return False - - logger.info("Committed agent changes in worktree for %s", spec_name) - return True - - -def _abort_merge(repo_path: pathlib.Path) -> None: - """Abort an in-progress git merge, with timeout and error handling.""" - try: - abort_result = subprocess.run( - ["git", "merge", "--abort"], - cwd=str(repo_path), - capture_output=True, - text=True, - timeout=_MERGE_ABORT_TIMEOUT_S, - ) - if abort_result.returncode != 0: - logger.critical( - "git merge --abort failed (exit %d): %s", - abort_result.returncode, - abort_result.stderr.strip(), - ) - else: - logger.info("Merge aborted successfully.") - except subprocess.TimeoutExpired: - logger.critical( - "git merge --abort timed out after %ds — repository may be in a dirty state.", - _MERGE_ABORT_TIMEOUT_S, - ) - - -def _merge_worktree_branch(repo_path: pathlib.Path, branch_name: str) -> bool: - """Merge a worktree branch back into the current branch. - - Returns True on success, False on merge conflict or timeout. - """ - try: - result = subprocess.run( - ["git", "merge", "--no-ff", "-m", f"codelicious: merge {branch_name}", branch_name], - cwd=str(repo_path), - capture_output=True, - text=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired: - logger.error("Timed out merging branch %s; attempting abort.", branch_name) - _abort_merge(repo_path) - return False - - if result.returncode != 0: - logger.error("Merge conflict for branch %s: %s", branch_name, result.stderr) - # Abort the merge to leave repo in clean state - _abort_merge(repo_path) - return False - - logger.info("Merged branch %s successfully.", branch_name) - return True - - -def _delete_branch(repo_path: pathlib.Path, branch_name: str) -> None: - """Delete a local branch after merge.""" - try: - result = subprocess.run( - ["git", "branch", "-d", branch_name], - cwd=str(repo_path), - capture_output=True, - text=True, - timeout=_WORKTREE_TIMEOUT_S, - ) - except subprocess.TimeoutExpired: - logger.warning("Timed out deleting branch %s.", branch_name) - return - if result.returncode != 0: - logger.warning("Failed to delete branch %s: %s", branch_name, result.stderr.strip()) - - -# --------------------------------------------------------------------------- -# Review output parsing -# --------------------------------------------------------------------------- - - -def _collect_review_findings(repo_path: pathlib.Path, role: str) -> list[Finding]: - """Read the JSON findings file written by a reviewer agent.""" - review_file = repo_path / ".codelicious" / f"review_{role}.json" - if not review_file.is_file(): - logger.debug("No review file for role %s", role) - return [] - - try: - data = json.loads(review_file.read_text(encoding="utf-8")) - if not isinstance(data, list): - logger.warning("Review file for %s is not a JSON array", role) - return [] - findings = [] - for item in data: - if isinstance(item, dict): - findings.append( - Finding( - role=role, - severity=item.get("severity", "P3"), - file=item.get("file", ""), - line=item.get("line", 0), - title=item.get("title", ""), - description=item.get("description", ""), - fix=item.get("fix", ""), - ) - ) - return findings - except (json.JSONDecodeError, OSError) as e: - logger.warning("Failed to parse review findings for %s: %s", role, e) - return [] - - -def _triage_findings(findings: list[Finding]) -> list[Finding]: - """Sort findings by severity (P1 first) and deduplicate by file+line.""" - severity_order = {"P1": 0, "P2": 1, "P3": 2} - seen: set[tuple[str, int]] = set() - deduped: list[Finding] = [] - for f in sorted(findings, key=lambda f: severity_order.get(f.severity, 9)): - key = (f.file, f.line) - if key not in seen: - seen.add(key) - deduped.append(f) - return deduped - - -# --------------------------------------------------------------------------- -# Fix prompt generation -# --------------------------------------------------------------------------- - -_FIX_PROMPT_TEMPLATE: str = """\ -You are fixing issues in {{project_name}} identified by automated review. - -## CRITICAL: Do NOT run git or gh commands - -The codelicious orchestrator manages all git and GitHub operations. -You MUST NOT run git add, git commit, git push, gh pr create, or any -other git/gh commands. The orchestrator will commit your changes. - -## Triaged Findings (ordered by severity) - -{{findings_text}} - -## Instructions - -Fix the findings above, starting with P1 (most critical). -For each fix: -1. Read the file cited in the finding -2. Apply the suggested fix (or a better one if you see one) -3. Run tests after each fix to ensure no regressions - -When all fixable findings are addressed, run /verify-all. -Then write "DONE" to .codelicious/BUILD_COMPLETE -""" - - -def _render_fix_prompt(project_name: str, findings: list[Finding]) -> str: - """Render the fix prompt with the triaged findings.""" - lines = [] - for i, f in enumerate(findings, 1): - lines.append( - f"{i}. **[{f.severity}]** `{f.file}:{f.line}` — {f.title}\n" - f" {f.description}\n" - f" **Fix:** {f.fix}\n" - f" _(from: {f.role} reviewer)_\n" - ) - findings_text = "\n".join(lines) if lines else "No findings to fix." - - result = _FIX_PROMPT_TEMPLATE - result = result.replace("{{project_name}}", project_name) - result = result.replace("{{findings_text}}", findings_text) - return result - - -# --------------------------------------------------------------------------- -# Orchestrator -# --------------------------------------------------------------------------- - - -class Orchestrator: - """Phase-based orchestrator for parallel agent workflows. - - Usage:: - - orch = Orchestrator(repo_path, git_manager, agent_config) - result = orch.run( - specs=[Path("spec-17.md"), Path("spec-18.md")], - reviewers=["security", "qa", "performance"], - ) - """ - - def __init__( - self, - repo_path: pathlib.Path, - git_manager: object, - config: object, - ) -> None: - self.repo_path = pathlib.Path(repo_path).resolve() - self.git_manager = git_manager - self.config = config - self.project_name = self.repo_path.name - - def _run_agent(self, prompt: str, project_root: pathlib.Path, session_id: str = "") -> object: - """Run a Claude agent, returning the AgentResult.""" - from codelicious.agent_runner import run_agent - - return run_agent( - prompt=prompt, - project_root=project_root, - config=self.config, - tee_to=sys.stdout, - resume_session_id=session_id, - ) - - # ------------------------------------------------------------------ - # Phase 1: BUILD (parallel, isolated worktrees) - # ------------------------------------------------------------------ - - def _build_spec_in_worktree(self, spec_path: pathlib.Path) -> tuple[str, bool]: - """Build a single spec in an isolated git worktree. - - The agent is instructed to build ALL unchecked tasks in the - assigned spec file, not just one. - - Returns (branch_name, success). - """ - from codelicious.git.git_orchestrator import spec_branch_name - from codelicious.prompts import AGENT_BUILD_SPEC, render - - branch_name = spec_branch_name(spec_path.name) - worktree_dir: pathlib.Path | None = None - - try: - worktree_dir = _create_worktree(self.repo_path, branch_name) - - # Resolve spec_path relative to the worktree so the agent - # sees the correct file path in its working directory. - try: - rel = spec_path.relative_to(self.repo_path) - except ValueError: - # spec_path is not under repo_path — use just the filename - # to avoid joining an absolute path (which discards the left operand) - rel = pathlib.Path(spec_path.name) - logger.warning( - "Spec %s is not under repo %s — using filename only.", - spec_path, - self.repo_path, - ) - spec_in_worktree = worktree_dir / rel - - if not spec_in_worktree.is_file(): - logger.warning( - "Spec file %s not found in worktree %s. Agent will search for specs automatically.", - spec_in_worktree, - worktree_dir, - ) - # Fall back to telling the agent to find the spec itself - spec_filter_str = ( - f"File not found at {spec_in_worktree}. Look for a spec file named '{spec_path.name}' in the repo." - ) - else: - spec_filter_str = str(spec_in_worktree) - - build_prompt = render( - AGENT_BUILD_SPEC, - project_name=self.project_name, - spec_filter=spec_filter_str, - ) - - result = self._run_agent(build_prompt, worktree_dir) - - # Don't trust result.success alone — it only reflects subprocess - # exit code (0 = success). Check BUILD_COMPLETE in the worktree - # to verify the agent actually finished building. - from codelicious.prompts import check_build_complete - - agent_done = check_build_complete(worktree_dir) - success = result.success and agent_done - - logger.info( - "Build for %s complete: process_ok=%s, build_complete=%s", - spec_path.name, - result.success, - agent_done, - ) - - # Commit the agent's changes in the worktree so they survive - # worktree removal and can be merged back. Agents are forbidden - # from running git commands — the orchestrator owns all git ops. - commit_ok = _commit_worktree_changes(worktree_dir, spec_path.name) - - # If the build succeeded but we couldn't commit its changes, mark the - # overall result as failed and preserve the worktree so the changes are - # not silently discarded. The caller will see success=False and can - # investigate the worktree directory for manual recovery. - if not commit_ok and success: - logger.error( - "Build for %s succeeded but committing worktree changes failed. " - "Preserving worktree at %s to prevent data loss.", - spec_path.name, - worktree_dir, - ) - success = False - # Signal the finally block to skip removal by clearing worktree_dir - worktree_dir = None - - return branch_name, success - - except Exception as e: - logger.error("Build for %s failed: %s", spec_path.name, e) - return branch_name, False - - finally: - if worktree_dir is not None: - try: - _remove_worktree(self.repo_path, worktree_dir) - except Exception as e: - logger.warning("Failed to clean up worktree %s: %s", worktree_dir, e) - - def _phase_build( - self, - specs: list[pathlib.Path], - max_workers: int, - ) -> list[tuple[str, bool]]: - """Phase 1: Build specs in parallel worktrees. - - Returns list of (branch_name, success) tuples. - """ - from codelicious.git.git_orchestrator import spec_branch_name - - if not specs: - return [] - - workers = min(max_workers, len(specs)) - logger.info( - "PHASE 1 BUILD: %d specs across %d workers", - len(specs), - workers, - ) - - results: list[tuple[str, bool]] = [] - completed_count = 0 - count_lock = threading.Lock() - - def _log_spec_progress(spec: pathlib.Path, branch: str, ok: bool) -> None: - nonlocal completed_count - with count_lock: - completed_count += 1 - count = completed_count - status = "OK" if ok else "FAILED" - logger.info( - " [%d/%d] %s — %s (branch: %s)", - count, - len(specs), - spec.name, - status, - branch, - ) - - if workers <= 1: - # Serial fallback - for spec in specs: - logger.info(" Building spec: %s ...", spec.name) - branch, ok = self._build_spec_in_worktree(spec) - _log_spec_progress(spec, branch, ok) - results.append((branch, ok)) - else: - with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool: - futures = {pool.submit(self._build_spec_in_worktree, spec): spec for spec in specs} - for spec in specs: - logger.info(" Queued spec: %s", spec.name) - try: - for future in concurrent.futures.as_completed(futures): - spec = futures[future] - try: - branch, ok = future.result() - _log_spec_progress(spec, branch, ok) - results.append((branch, ok)) - except Exception as e: - branch = spec_branch_name(spec.name) - with count_lock: - completed_count += 1 - count = completed_count - logger.error( - " [%d/%d] %s — ERROR: %s", - count, - len(specs), - spec.name, - e, - ) - results.append((branch, False)) - except KeyboardInterrupt: - logger.warning("KeyboardInterrupt received — cancelling pending build futures.") - for f in futures: - f.cancel() - pool.shutdown(wait=False, cancel_futures=True) - raise - - return results - - # ------------------------------------------------------------------ - # Phase 2: MERGE (serial, deterministic) - # ------------------------------------------------------------------ - - def _phase_merge(self, build_results: list[tuple[str, bool]]) -> int: - """Phase 2: Merge successful build branches back. - - Returns the number of successfully merged branches. - """ - successful = [(branch, ok) for branch, ok in build_results if ok] - if not successful: - logger.warning("PHASE 2 MERGE: no successful builds to merge.") - return 0 - - logger.info("PHASE 2 MERGE: merging %d branches", len(successful)) - merged = 0 - - for branch_name, _ in successful: - if _merge_worktree_branch(self.repo_path, branch_name): - _delete_branch(self.repo_path, branch_name) - merged += 1 - else: - logger.warning( - "Skipping branch %s due to merge conflict. Manual resolution required.", - branch_name, - ) - - return merged - - # ------------------------------------------------------------------ - # Phase 3: REVIEW (parallel, read-only) - # ------------------------------------------------------------------ - - def _run_reviewer(self, role: str) -> list[Finding]: - """Run a single reviewer agent (read-only) and collect findings.""" - from codelicious.prompts import render - - prompt_template = REVIEWER_PROMPTS.get(role) - if not prompt_template: - logger.warning("Unknown reviewer role: %s", role) - return [] - - prompt = render(prompt_template, project_name=self.project_name) - - try: - self._run_agent(prompt, self.repo_path) - except Exception as e: - logger.warning("Reviewer %s failed: %s", role, e) - - return _collect_review_findings(self.repo_path, role) - - def _phase_review( - self, - roles: list[str], - max_workers: int, - ) -> list[Finding]: - """Phase 3: Run read-only reviewers in parallel. - - Returns all collected findings. - """ - if not roles: - return [] - - workers = min(max_workers, len(roles)) - logger.info( - "PHASE 3 REVIEW: %d reviewers across %d workers: %s", - len(roles), - workers, - roles, - ) - - all_findings: list[Finding] = [] - - if workers <= 1: - for role in roles: - all_findings.extend(self._run_reviewer(role)) - else: - with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool: - futures = {pool.submit(self._run_reviewer, role): role for role in roles} - try: - for future in concurrent.futures.as_completed(futures): - role = futures[future] - try: - all_findings.extend(future.result()) - except Exception as e: - logger.error("Reviewer %s raised: %s", role, e) - except KeyboardInterrupt: - logger.warning("KeyboardInterrupt received — cancelling pending review futures.") - for f in futures: - f.cancel() - pool.shutdown(wait=False, cancel_futures=True) - raise - - triaged = _triage_findings(all_findings) - logger.info( - "PHASE 3 REVIEW: %d total findings, %d after triage (P1: %d, P2: %d, P3: %d)", - len(all_findings), - len(triaged), - sum(1 for f in triaged if f.severity == "P1"), - sum(1 for f in triaged if f.severity == "P2"), - sum(1 for f in triaged if f.severity == "P3"), - ) - return triaged - - # ------------------------------------------------------------------ - # Phase 4: FIX (serial, one agent) - # ------------------------------------------------------------------ - - def _phase_fix(self, findings: list[Finding]) -> bool: - """Phase 4: Apply fixes for triaged findings. - - Only runs if there are P1 or P2 findings. P3s are logged but - not auto-fixed. - - Returns True if the fix agent completed successfully. - """ - actionable = [f for f in findings if f.severity in ("P1", "P2")] - if not actionable: - logger.info("PHASE 4 FIX: no P1/P2 findings to fix.") - return True - - logger.info("PHASE 4 FIX: applying %d P1/P2 findings", len(actionable)) - - from codelicious.prompts import check_build_complete, clear_build_complete - - clear_build_complete(self.repo_path) - fix_prompt = _render_fix_prompt(self.project_name, actionable) - - try: - self._run_agent(fix_prompt, self.repo_path) - except Exception as e: - logger.error("Fix agent failed: %s", e) - return False - - return check_build_complete(self.repo_path) - - # ------------------------------------------------------------------ - # Public API - # ------------------------------------------------------------------ - - def run( - self, - specs: list[pathlib.Path], - reviewers: list[str] | None = None, - max_build_workers: int = 3, - max_review_workers: int = 4, - max_build_cycles: int = 10, - push_pr: bool = False, - max_wall_clock_s: float = 7200, - ) -> OrchestratorResult: - """Run the full orchestrated pipeline. - - Build→merge cycles repeat until all specs are complete (no - unchecked ``- [ ]`` items remain) or the cycle cap is reached. - Review and fix run once at the end, after building is done. - - Parameters - ---------- - specs: - List of spec file paths to build. - reviewers: - List of reviewer role names (e.g. ["security", "qa"]). - Defaults to all available roles. - max_build_workers: - Max concurrent builder agents. - max_review_workers: - Max concurrent reviewer agents. - max_build_cycles: - Max build→merge iterations before giving up. - push_pr: - Whether to push and create/update PR after completion. - max_wall_clock_s: - Hard wall-clock limit in seconds for the entire run (Finding 22). - Defaults to 7200 (2 hours). The build loop is aborted if this - limit is reached before all cycles complete. - """ - from codelicious.prompts import scan_remaining_tasks_for_spec - - if reviewers is None: - reviewers = list(REVIEWER_PROMPTS.keys()) - - # Normalize all spec paths to absolute so comparisons are reliable - specs = [s.resolve() if not s.is_absolute() else s for s in specs] - - start = time.monotonic() - total_builds = 0 - total_merged = 0 - cycles = 0 - - logger.info( - "ORCHESTRATOR: %d specs, %d reviewers, build_workers=%d, review_workers=%d, max_build_cycles=%d", - len(specs), - len(reviewers), - max_build_workers, - max_review_workers, - max_build_cycles, - ) - - # ── BUILD LOOP: repeat build→merge until all specs complete ── - incomplete_specs = list(specs) - consecutive_failures = 0 - - for cycle in range(1, max_build_cycles + 1): - # Wall-clock timeout guard (Finding 22) - elapsed_so_far = time.monotonic() - start - if elapsed_so_far >= max_wall_clock_s: - logger.error( - "Wall-clock timeout reached after %.1fs (limit=%ss). Aborting build loop.", - elapsed_so_far, - max_wall_clock_s, - ) - break - - # Cache scan_remaining_tasks_for_spec results keyed by spec path so - # each spec is queried at most once per cycle (Finding 26). - remaining_cache: dict[pathlib.Path, int] = {s: scan_remaining_tasks_for_spec(s) for s in incomplete_specs} - # Check which specs still have unchecked tasks - still_incomplete = [s for s, n in remaining_cache.items() if n > 0] - if not still_incomplete: - logger.info("All %d specs are complete after %d build cycle(s).", len(specs), cycles) - break - - cycles = cycle - logger.info("") - logger.info( - "══════ Build cycle %d/%d (%d specs remaining) ══════", cycle, max_build_cycles, len(still_incomplete) - ) - - # ── Phase 1: BUILD ──────────────────────────────────── - logger.info("---- BUILD ----") - build_results = self._phase_build(still_incomplete, max_build_workers) - successful = sum(1 for _, ok in build_results if ok) - total_builds += successful - logger.info("Build: %d/%d specs built successfully.", successful, len(still_incomplete)) - - if successful == 0: - consecutive_failures += 1 - logger.warning("No specs built in cycle %d (%d consecutive failures).", cycle, consecutive_failures) - if consecutive_failures >= 3: - logger.error("Aborting: %d consecutive build cycles with zero progress.", consecutive_failures) - break - continue - else: - consecutive_failures = 0 - - # ── Phase 2: MERGE ──────────────────────────────────── - logger.info("---- MERGE ----") - merged = self._phase_merge(build_results) - total_merged += merged - logger.info("Merge: %d branches merged.", merged) - - # Commit merged work and push before next cycle - try: - self.git_manager.commit_verified_changes( - commit_message=f"codelicious: build cycle {cycle} of {self.project_name}" - ) - except Exception as e: - logger.warning("Mid-cycle commit failed: %s", e) - - # Push even if commit_verified_changes found nothing new to - # commit — merge commits need to be pushed too. - push = self.git_manager.push_to_origin() - if not push.success: - logger.warning("Mid-cycle push failed (type=%s): %s", push.error_type, push.message) - - # Update incomplete list for next iteration - incomplete_specs = still_incomplete - - # ── Check final completion status ───────────────────────── - # Cache results to avoid calling scan_remaining_tasks_for_spec twice - # for the same spec (Finding 26). - final_remaining_cache: dict[pathlib.Path, int] = {s: scan_remaining_tasks_for_spec(s) for s in specs} - final_incomplete = [s for s, n in final_remaining_cache.items() if n > 0] - all_complete = len(final_incomplete) == 0 - - if all_complete: - logger.info("All specs complete. Proceeding to review phase.") - else: - remaining_tasks = sum(final_remaining_cache[s] for s in final_incomplete) - logger.warning( - "%d spec(s) still incomplete (%d unchecked tasks). Proceeding to review.", - len(final_incomplete), - remaining_tasks, - ) - - # ── REVIEW (once, after all building is done) ───────────── - logger.info("") - logger.info("---- REVIEW ----") - findings = self._phase_review(reviewers, max_review_workers) - - # ── FIX (once, after review) ────────────────────────────── - logger.info("") - logger.info("---- FIX ----") - fix_ok = self._phase_fix(findings) - - # ── Final commit, push & PR ─────────────────────────────── - try: - self.git_manager.commit_verified_changes( - commit_message=f"codelicious: orchestrated build of {self.project_name}" - ) - except Exception as e: - logger.warning("Post-orchestration commit failed: %s", e) - - # Always push — commit_verified_changes skips push when working - # tree is clean, but merge commits still need to be pushed. - push = self.git_manager.push_to_origin() - if not push.success: - logger.error("Final push failed (type=%s): %s", push.error_type, push.message) - - if push_pr: - # Create/reuse one PR per successfully built spec (spec-22 Phase 4) - for spec in specs: - _m = re.match(r"^(\d+)", spec.stem) - _sid = _m.group(1) if _m else spec.stem - try: - self.git_manager.ensure_draft_pr_exists( - spec_id=_sid, - spec_summary=f"build {self.project_name}", - ) - except Exception as e: - logger.warning("PR creation for spec-%s failed: %s", _sid, e) - - elapsed = time.monotonic() - start - return OrchestratorResult( - success=all_complete and fix_ok, - message=( - f"Orchestrated: {total_builds} builds across {cycles} cycle(s), " - f"{total_merged} merged, {len(final_incomplete)}/{len(specs)} specs still incomplete, " - f"{len(findings)} findings, fix={'OK' if fix_ok else 'FAILED'} " - f"in {elapsed:.1f}s" - ), - findings=findings, - elapsed_s=elapsed, - cycles_completed=cycles, - ) +# Keep a public alias so callers that import ChunkOutcome don't break. +ChunkOutcome = OrchestratorResult # ═══════════════════════════════════════════════════════════════════════ -# spec-27 Phase 4: V2 Orchestrator — chunk-based serial loop +# spec-27 Phase 4: Orchestrator — chunk-based serial loop # ═══════════════════════════════════════════════════════════════════════ @@ -1121,7 +119,7 @@ def _ensure_codelicious_gitignored(repo_path: pathlib.Path) -> None: logger.warning("Could not update .gitignore: %s", e) -class V2Orchestrator: +class Orchestrator: """Chunk-based orchestrator for codelicious v2 (spec-27 Phase 4.1). Runs the simplified workflow:: @@ -1144,14 +142,68 @@ def __init__( max_loc_per_pr: int = 250, model: str = "", progress_callback: object = None, + no_resume: bool = False, + engines: list[object] | None = None, ) -> None: self.repo_path = pathlib.Path(repo_path).resolve() self.git_manager = git_manager self.engine = engine + # spec v30 Step 5: optional engine list for rate-limit fallover. The + # primary ``engine`` argument stays first; additional engines tail it. + # When an engine rate-limits we drop it and continue on the next one. + if engines: + self._engines: list[object] = list(engines) + else: + self._engines = [engine] self.max_commits_per_pr = max_commits_per_pr self.max_loc_per_pr = max_loc_per_pr self.model = model self.progress_callback = progress_callback + # spec v30 Step 2: when True, ignore any persisted chunk-status ledger + # and re-execute every chunk regardless of prior runs. + self.no_resume = no_resume + + # ── spec v30 Step 2: chunk-status ledger helpers ───────────────── + def _ledger_path(self, spec: pathlib.Path) -> pathlib.Path: + slug = re.sub(r"[^A-Za-z0-9_-]+", "_", spec.stem) + return self.repo_path / ".codelicious" / "state" / f"{slug}.json" + + def _load_ledger(self, spec: pathlib.Path) -> dict: + if self.no_resume: + return {"chunks": {}} + path = self._ledger_path(spec) + try: + import json as _json + + data = _json.loads(path.read_text(encoding="utf-8")) + if isinstance(data, dict) and isinstance(data.get("chunks"), dict): + return data + except (OSError, ValueError): + pass + return {"chunks": {}} + + def _save_ledger(self, spec: pathlib.Path, ledger: dict) -> None: + import json as _json + import os as _os + + path = self._ledger_path(spec) + try: + path.parent.mkdir(parents=True, exist_ok=True) + tmp = path.with_suffix(path.suffix + ".tmp") + tmp.write_text(_json.dumps(ledger, indent=2, sort_keys=True), encoding="utf-8") + _os.replace(tmp, path) + except OSError as exc: # nosec B110 + logger.warning("Could not persist chunk ledger %s: %s", path, exc) + + def _ledger_set(self, ledger: dict, chunk_id: str, *, status: str, **fields: object) -> None: + import datetime as _dt + + entry = ledger.setdefault("chunks", {}).get(chunk_id, {}) + entry["status"] = status + entry["updated_at"] = _dt.datetime.now(_dt.timezone.utc).isoformat() + for k, v in fields.items(): + entry[k] = v + ledger["chunks"][chunk_id] = entry def _report( self, @@ -1213,7 +265,7 @@ def run( push_pr: Whether to create/update PRs on GitHub/GitLab. """ - from codelicious.chunker import chunk_spec + from codelicious.chunker import chunk_spec, enforce_token_budget from codelicious.engines.base import EngineContext from codelicious.scaffolder import scaffold_claude_dir from codelicious.spec_discovery import mark_chunk_complete @@ -1247,6 +299,11 @@ def run( logger.error("Failed to chunk spec %s: %s", spec.name, e) continue + # spec v30 Step 6: split any chunk whose estimated token cost + # exceeds the active engine's context window before dispatch. + engine_name = getattr(self.engine, "name", "") or "" + chunks = enforce_token_budget(chunks, self.repo_path, engines=[engine_name] if engine_name else None) + if not chunks: logger.info("Spec %s has no chunks to build.", spec.name) specs_completed += 1 @@ -1275,13 +332,32 @@ def run( previous_chunks: list[str] = [] + # spec v30 Step 2: load chunk-status ledger so a re-run after a + # mid-spec abort skips already-merged chunks. + ledger = self._load_ledger(spec) + # ── Execute each chunk ──────────────────────────────── all_chunks_ok = True spec_chunks_failed = 0 for chunk_idx, chunk in enumerate(chunks, 1): - # Deadline check - if deadline and time.monotonic() > deadline: - logger.warning("Build deadline reached during spec %s, chunk %d.", spec.name, chunk_idx) + ledger_entry = ledger.get("chunks", {}).get(chunk.id, {}) + if ledger_entry.get("status") == "merged": + logger.info( + "Skipping already-merged chunk %s: %s", + chunk.id, + chunk.title, + ) + previous_chunks.append(f"{chunk.id}: {chunk.title}") + continue + # spec v29 Step 10: deadline gate before each chunk so an + # expired budget never starts a fresh execute_chunk call. + if deadline and time.monotonic() >= deadline: + logger.warning( + "Deadline reached after %d/%d chunks in spec %s; stopping early.", + chunk_idx - 1, + total_chunks, + spec.name, + ) all_chunks_ok = False break @@ -1315,7 +391,19 @@ def run( ) # ── Execute ─────────────────────────────────────── + # spec v30 Step 5: try the primary engine, then fail over to + # any remaining engines on a rate-limit signal. result = self.engine.execute_chunk(chunk, self.repo_path, context) + while result.message and "Rate limited" in (result.message or "") and len(self._engines) > 1: + rate_limited = self._engines.pop(0) + next_engine = self._engines[0] + logger.warning( + "%s rate-limited; failing over to %s for the remainder of this spec.", + getattr(rate_limited, "name", "engine"), + getattr(next_engine, "name", "engine"), + ) + self.engine = next_engine + result = self.engine.execute_chunk(chunk, self.repo_path, context) # ── Verify ──────────────────────────────────────── if result.success: @@ -1420,6 +508,14 @@ def run( mark_chunk_complete(spec, chunk.title) previous_chunks.append(f"{chunk.id}: {chunk.title}") total_chunks_completed += 1 + self._ledger_set( + ledger, + chunk.id, + status="merged", + title=chunk.title, + pr_number=pr_number, + ) + self._save_ledger(spec, ledger) self._report( spec_run_idx, total_specs_in_run, @@ -1433,6 +529,8 @@ def run( total_chunks_failed += 1 spec_chunks_failed += 1 all_chunks_ok = False + self._ledger_set(ledger, chunk.id, status="failed", title=chunk.title, reason="no-diff") + self._save_ledger(spec, ledger) self._report( spec_run_idx, total_specs_in_run, @@ -1455,6 +553,14 @@ def run( total_chunks_failed += 1 spec_chunks_failed += 1 all_chunks_ok = False + self._ledger_set( + ledger, + chunk.id, + status="failed", + title=chunk.title, + reason=(result.message or "engine-failure")[:200], + ) + self._save_ledger(spec, ledger) self._report( spec_run_idx, total_specs_in_run, @@ -1497,3 +603,7 @@ def run( elapsed_s=elapsed, cycles_completed=1, ) + + +# Backward-compatibility alias — external imports of V2Orchestrator still work. +V2Orchestrator = Orchestrator diff --git a/src/codelicious/prompts.py b/src/codelicious/prompts.py index 5bc832b9..8c6879e3 100644 --- a/src/codelicious/prompts.py +++ b/src/codelicious/prompts.py @@ -3,6 +3,15 @@ All prompt text used in agent mode lives in this module. No other module contains agent prompt strings. This separation makes prompts auditable, testable, and modifiable without touching orchestration logic. + +Live templates (per spec 27 §6.2 and spec v29 Steps 2 and 17): + +* ``CHUNK_EXECUTE`` — per-chunk execution prompt for the chunk-based loop. +* ``CHUNK_VERIFY`` — per-chunk verification prompt. +* ``CHUNK_FIX`` — per-chunk fix prompt invoked when verification fails. + +Multi-phase templates (SCAFFOLD, ANALYZE, REFLECT, AGENT_BUILD_SPEC, etc.) +called out in spec 27 §6.2 have been removed. """ from __future__ import annotations @@ -11,7 +20,6 @@ import re __all__ = [ - "AGENT_BUILD_SPEC", "CHUNK_EXECUTE", "CHUNK_FIX", "CHUNK_VERIFY", @@ -22,83 +30,6 @@ "scan_remaining_tasks_for_spec", ] -# --------------------------------------------------------------------------- -# Active prompts — Claude Code gets full autonomy -# --------------------------------------------------------------------------- - -AGENT_BUILD_SPEC: str = """\ -You are codelicious, an autonomous build agent for {{project_name}}. - -## Your mission - -Build ALL incomplete tasks from your assigned spec file. You handle -understanding, implementation, and testing. The orchestrator handles -all git operations (branching, committing, pushing, PRs) — you MUST NOT. - -## Your assigned spec file - -{{spec_filter}} - -**IMPORTANT:** Only build tasks from the spec file listed above. Do NOT -look at or build tasks from other spec files. If the spec_filter above -is empty, find the first incomplete spec file in the repo and build from -that one only. - -## CRITICAL: Do NOT run git or gh commands - -The codelicious orchestrator manages all git and GitHub operations -deterministically. You MUST NOT run any of these commands: -- git checkout, git branch, git add, git commit, git push -- gh pr create, gh pr view, gh pr edit -- Any other git or gh CLI commands - -If you run git or gh commands, you will create duplicate branches and -PRs. The orchestrator will commit and push your work automatically. - -### Step 1: Understand the project - -- Read your assigned spec file first. -- Read CLAUDE.md, README, and the project manifest (package.json, - pyproject.toml, Cargo.toml, go.mod, etc.). Learn the tech stack. -- Figure out how to run tests and lint for THIS project. -- If this is your first time in this repo, write what you learned to - CLAUDE.md so future runs are faster. - -### Step 2: Find tasks in your assigned spec - -- Look through your assigned spec file for unchecked `- [ ]` items. -- **If ALL tasks in your spec are `- [x]` (nothing left to do):** - 1. Update .codelicious/STATE.md to reflect completion. - 2. Write "DONE" to `.codelicious/BUILD_COMPLETE` and stop. - -### Step 3: Build each task - -For each unchecked `- [ ]` item in your assigned spec, in order: - -1. Read existing code before modifying. Match existing patterns. -2. Implement the task completely. -3. Run tests and lint. Fix ALL failures: - - Run the test suite - - If failures, read errors carefully, fix the root cause - - Run tests again - - Repeat until green (up to 3 attempts) -4. Mark the task done: change `- [ ]` to `- [x]` in the spec file. -5. Move on to the next unchecked `- [ ]` item. - -### Step 4: When all tasks are done - -- Update .codelicious/STATE.md with current status. -- Write "DONE" to `.codelicious/BUILD_COMPLETE` - -## Rules - -- **Build ALL unchecked tasks** in your assigned spec before stopping. -- Every change MUST pass tests. No broken code. -- Keep docs (README, CLAUDE.md) current if your changes affect them. -- Do NOT run git or gh commands. The orchestrator handles all git ops. -""" - - # --------------------------------------------------------------------------- # spec-27 Phase 6.2: Chunk-focused prompt templates # --------------------------------------------------------------------------- diff --git a/src/codelicious/sandbox.py b/src/codelicious/sandbox.py index b40752ce..7826c877 100644 --- a/src/codelicious/sandbox.py +++ b/src/codelicious/sandbox.py @@ -108,6 +108,16 @@ def __init__( # Merge extra extensions from CODELICIOUS_EXTRA_EXTENSIONS env var self._allowed_extensions: frozenset[str] = self._build_allowed_extensions() + # spec v29 Step 15: explicit warning when O_NOFOLLOW is unavailable + # (Windows). The post-write symlink re-check is silently skipped on + # such platforms, downgrading TOCTOU protection to best-effort. + if not hasattr(os, "O_NOFOLLOW"): + logger.warning( + "Sandbox: os.O_NOFOLLOW unavailable on this platform — TOCTOU " + "protection on atomic writes is best-effort. Run codelicious " + "on a POSIX system for full guarantees." + ) + @staticmethod def _build_allowed_extensions() -> frozenset[str]: """Merge CODELICIOUS_EXTRA_EXTENSIONS into the base allowlist.""" diff --git a/src/codelicious/scaffolder.py b/src/codelicious/scaffolder.py index fb8b89cf..70fdd9b4 100644 --- a/src/codelicious/scaffolder.py +++ b/src/codelicious/scaffolder.py @@ -20,6 +20,29 @@ _SENTINEL_START: str = "<!-- codelicious:start -->" _SENTINEL_END: str = "<!-- codelicious:end -->" +# spec v29 Step 18: single source of truth for the forbidden-pattern blurb +# scaffolded into CLAUDE.md, the verify-all skill, and the security rule +# file. Three string-literal copies previously drifted independently; any +# future change must update this tuple, which makes the intent auditable. +_FORBIDDEN_PATTERNS_DOC: tuple[str, ...] = ( + "eval()", + "exec()", + "shell=True", + "os.system()", + "subprocess.call(..., shell=True)", +) + + +def _forbidden_patterns_bullets() -> str: + """Render ``_FORBIDDEN_PATTERNS_DOC`` as a markdown bullet list.""" + return "\n".join(f"- {pattern}" for pattern in _FORBIDDEN_PATTERNS_DOC) + + +def _forbidden_patterns_inline() -> str: + """Render ``_FORBIDDEN_PATTERNS_DOC`` as a comma-joined inline phrase.""" + return ", ".join(_FORBIDDEN_PATTERNS_DOC) + + _MANAGED_BLOCK: str = f"""{_SENTINEL_START} # codelicious @@ -145,10 +168,11 @@ def scaffold(project_root: pathlib.Path, dry_run: bool = False) -> None: QUALITY: - Every new function needs tests. -- No hardcoded secrets, no eval(), no shell=True. +- No hardcoded secrets. Forbidden patterns: __FORBIDDEN_INLINE__. - Handle errors explicitly — no bare except. - Follow the project's existing patterns exactly. """ +_AGENT_BUILDER = _AGENT_BUILDER.replace("__FORBIDDEN_INLINE__", _forbidden_patterns_inline()) _AGENT_TESTER = """\ --- @@ -288,13 +312,14 @@ def scaffold(project_root: pathlib.Path, dry_run: bool = False) -> None: 2. **Lint**: Run the linter. Fix all violations. 3. **Format**: Run the formatter. Fix any formatting issues. 4. **Security**: Grep for common security anti-patterns: - - eval(, exec(, shell=True, subprocess.call.*shell + - __FORBIDDEN_INLINE__ - Hardcoded secrets: password\\s*=\\s*["'], api_key\\s*=\\s*["'] - SQL injection: f"SELECT, f"INSERT, f"UPDATE, f"DELETE 5. **State**: Update .codelicious/STATE.md with current test count and status. Report a summary of each check: pass/fail, issues found, fixes applied. """ +_SKILL_VERIFY_ALL = _SKILL_VERIFY_ALL.replace("__FORBIDDEN_INLINE__", _forbidden_patterns_inline()) _SKILL_UPDATE_STATE = """\ --- @@ -332,9 +357,7 @@ def scaffold(project_root: pathlib.Path, dry_run: bool = False) -> None: When writing or modifying code, always follow these security rules: ## Never Use -- eval() or exec() with user-controlled input -- shell=True in subprocess calls -- os.system() — use subprocess with shell=False +__FORBIDDEN_BULLETS__ - String formatting in SQL queries — use parameterized queries - pickle.loads() on untrusted data - yaml.load() without SafeLoader @@ -353,6 +376,7 @@ def scaffold(project_root: pathlib.Path, dry_run: bool = False) -> None: - Never log secrets — use masking/redaction - Never commit .env files """ +_RULES_SECURITY = _RULES_SECURITY.replace("__FORBIDDEN_BULLETS__", _forbidden_patterns_bullets()) def _detect_conventions(project_root: pathlib.Path) -> str: diff --git a/src/codelicious/tools/audit_logger.py b/src/codelicious/tools/audit_logger.py index a2d9d5ab..1a62a999 100644 --- a/src/codelicious/tools/audit_logger.py +++ b/src/codelicious/tools/audit_logger.py @@ -1,8 +1,10 @@ from __future__ import annotations +import contextlib import datetime import json import logging +import os import sys import threading from enum import Enum @@ -98,6 +100,18 @@ def __init__(self, repo_path: Path): # Lock that serialises all file writes so concurrent threads cannot # interleave entries (Finding 51). self._write_lock = threading.Lock() + # spec v30 Step 11: cross-process append guard. When two `codelicious` + # processes share an audit dir (e.g. CODELICIOUS_AUDIT_DIR pointed at a + # shared location) we need an OS-level lock so entries don't interleave + # across rotation boundaries. We lock a dedicated file because + # rotation may move the audit log itself. + self._lock_path = self.log_file.parent / ".audit.lock" + try: + self._lock_fd = os.open(str(self._lock_path), os.O_CREAT | os.O_RDWR, 0o600) + except OSError as exc: + console_logger.warning("AuditLogger: cannot open audit lock %s: %s", self._lock_path, exc) + self._lock_fd = None + self._cross_process_lock_warned: bool = False # Keep file handles open for the lifetime of the instance to avoid the # overhead of open/close on every tool call (Finding 18). # buffering=1 enables line-buffered mode so entries are flushed after @@ -117,6 +131,54 @@ def __init__(self, repo_path: Path): console_logger.warning("AuditLogger: cannot open security log %s: %s", self.security_log_file, exc) self._security_fh = None + @contextlib.contextmanager + def _cross_process_lock(self): + """Acquire an exclusive OS lock on ``.audit.lock`` for the critical + section (spec v30 Step 11). + + On Windows, ``fcntl`` is unavailable; falls back to ``msvcrt.locking`` + when present, otherwise logs a one-shot WARNING and proceeds with + intra-process locking only. + """ + if self._lock_fd is None: + yield + return + + try: + import fcntl + + fcntl.flock(self._lock_fd, fcntl.LOCK_EX) + try: + yield + finally: + try: + fcntl.flock(self._lock_fd, fcntl.LOCK_UN) + except OSError: + pass + return + except ImportError: + pass # fall through to msvcrt branch + + try: + import msvcrt + + msvcrt.locking(self._lock_fd, msvcrt.LK_LOCK, 1) + try: + yield + finally: + try: + # Rewind so the unlock targets the same byte we locked. + os.lseek(self._lock_fd, 0, os.SEEK_SET) + msvcrt.locking(self._lock_fd, msvcrt.LK_UNLCK, 1) + except OSError: + pass + return + except ImportError: + if not self._cross_process_lock_warned: + console_logger.warning("AuditLogger: cross-process audit-log locking unavailable on this platform") + self._cross_process_lock_warned = True + yield + def close(self) -> None: """Close the persistent file handles. @@ -133,6 +195,12 @@ def close(self) -> None: self._security_fh.close() except OSError: pass + try: + if getattr(self, "_lock_fd", None) is not None: + os.close(self._lock_fd) + self._lock_fd = None + except OSError: + pass def __del__(self) -> None: """Best-effort cleanup of file handles on garbage collection.""" @@ -155,8 +223,12 @@ def set_current_tool(self, tool_name: str) -> None: def _write_to_file(self, level: str, tag: str, message: str): timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat() try: - with self._write_lock: + # spec v30 Step 11: hold both intra- and inter-process locks so a + # second codelicious process sharing the audit dir cannot interleave + # bytes mid-line. + with self._write_lock, self._cross_process_lock(): self._audit_fh.write(f"[{timestamp}] [{level}] [{tag}] {message}\n") + self._audit_fh.flush() except Exception as e: # Fallback if logging fails, at least print to stdout print(f"FATAL: Audit log write failed: {e}") @@ -187,11 +259,13 @@ def _write_to_security_log( full_message = f"{message} ({context})" log_line = f"{timestamp} [SECURITY] {event.value}: {full_message}\n" - # Write to both logs under a single lock to keep entries atomic + # spec v30 Step 11: dual-write under both intra- and inter-process locks. try: - with self._write_lock: + with self._write_lock, self._cross_process_lock(): self._audit_fh.write(log_line) + self._audit_fh.flush() self._security_fh.write(log_line) + self._security_fh.flush() except Exception as e: print(f"FATAL: Security log write failed: {e}") diff --git a/src/codelicious/verifier.py b/src/codelicious/verifier.py index 03b481e1..5143060d 100644 --- a/src/codelicious/verifier.py +++ b/src/codelicious/verifier.py @@ -1170,6 +1170,199 @@ def check_custom_command( ) +# spec v30 Step 8: project-wide coverage floor used by ``verify`` / +# ``verify_paths`` when no explicit value is supplied. CLI flag overrides +# pyproject value; pyproject value overrides this default. +_DEFAULT_MIN_COVERAGE: float = 90.0 + + +def _read_min_coverage_from_pyproject(repo: pathlib.Path) -> float | None: + """Return ``[tool.codelicious].min_coverage`` from pyproject.toml or None.""" + pyproject = repo / "pyproject.toml" + if not pyproject.is_file(): + return None + try: + import tomllib + except ImportError: # Python <3.11 — codelicious requires 3.10+, may lack tomllib + try: + import tomli as tomllib # type: ignore[no-redef] + except ImportError: + return None + try: + with open(pyproject, "rb") as fh: + data = tomllib.load(fh) + except (OSError, ValueError): + return None + section = data.get("tool", {}).get("codelicious", {}) + val = section.get("min_coverage") + if isinstance(val, (int, float)) and val >= 0: + return float(val) + return None + + +def resolve_min_coverage( + repo: pathlib.Path, + *, + cli_override: float | None = None, +) -> float: + """Resolve the active coverage floor (CLI > pyproject > default).""" + if cli_override is not None and cli_override >= 0: + return float(cli_override) + pyproject_val = _read_min_coverage_from_pyproject(repo) + if pyproject_val is not None: + return pyproject_val + return _DEFAULT_MIN_COVERAGE + + +def _enforce_coverage_floor(result: VerificationResult, min_coverage: float) -> None: + """Demote success → failure when measured coverage is below ``min_coverage``. + + Looks for a ``coverage`` :class:`CheckResult` in ``result.checks``; if its + message embeds a percentage, compare against the floor. If below, mark the + check failed and append the explanatory message. + """ + if min_coverage <= 0: + return + for check in result.checks: + if check.name != "coverage": + continue + # Combine the short message with details so we catch the percentage no + # matter where ``check_coverage`` happened to put it. + haystack = " ".join((check.message or "", check.details or "")) + match = re.search(r"(\d+(?:\.\d+)?)\s*%", haystack) + if not match: + return + pct = float(match.group(1)) + if pct < min_coverage: + check.passed = False + floor_msg = f"coverage {pct:.1f}% below floor {min_coverage:.1f}%" + check.message = (check.message + " | " + floor_msg) if check.message else floor_msg + + +def _map_src_to_tests(repo: pathlib.Path, paths: list[pathlib.Path]) -> list[pathlib.Path]: + """Map ``src/<module>.py`` paths to ``tests/test_<module>.py`` when present. + + Returns the list of resolved test paths that exist on disk. Used by + ``verify_paths`` to scope pytest invocations to the modules a chunk + actually touched. + """ + tests_dir = repo / "tests" + found: list[pathlib.Path] = [] + seen: set[pathlib.Path] = set() + for raw in paths: + p = raw if isinstance(raw, pathlib.Path) else pathlib.Path(str(raw)) + if p.suffix != ".py": + continue + candidate = tests_dir / f"test_{p.stem}.py" + if candidate.is_file() and candidate not in seen: + found.append(candidate) + seen.add(candidate) + return found + + +def verify_paths( + repo: pathlib.Path, + paths: list[pathlib.Path], + *, + full_verify_kwargs: dict | None = None, + min_coverage: float | None = None, +) -> VerificationResult: + """Run lint/security/test checks scoped to ``paths`` (spec v29 Step 6). + + * ``ruff check <paths>`` and ``bandit <paths>`` are invoked with the + caller-supplied paths so we don't re-scan an entire repo per chunk. + * Tests are scoped via ``pytest <inferred-test-paths>`` derived by + mapping each ``src/<module>.py`` to ``tests/test_<module>.py`` when + that file exists. + * When ``paths`` is empty, or no test mapping can be inferred, the + function falls back to the project-wide :func:`verify`. + """ + if not paths: + kwargs = dict(full_verify_kwargs or {}) + return verify(repo, **kwargs) + + py_paths = [pathlib.Path(p) for p in paths if str(p).endswith(".py")] + test_paths = _map_src_to_tests(repo, py_paths) + + if not test_paths: + # Without a test mapping the chunk-scope test run is meaningless; + # fall back to the full verifier so we still catch regressions. + kwargs = dict(full_verify_kwargs or {}) + return verify(repo, **kwargs) + + checks: list[CheckResult] = [] + + # ── ruff (python only) ──────────────────────────────────────────── + if shutil.which("ruff"): + ruff_targets = [str(p) for p in py_paths if (repo / p).exists() or p.is_absolute()] + if ruff_targets: + try: + proc = _run_with_pgroup_kill( + ["ruff", "check", *ruff_targets], + cwd=str(repo), + capture_output=True, + text=True, + timeout=_LINT_TIMEOUT_S, + ) + checks.append( + CheckResult( + name="lint", + passed=proc.returncode == 0, + message="ruff clean" if proc.returncode == 0 else _truncate(proc.stdout or proc.stderr), + ) + ) + except subprocess.TimeoutExpired: + checks.append(CheckResult(name="lint", passed=False, message="ruff timed out")) + + # ── bandit (python only) ───────────────────────────────────────── + if shutil.which("bandit"): + bandit_targets = [str(p) for p in py_paths] + if bandit_targets: + try: + proc = _run_with_pgroup_kill( + ["bandit", "-q", *bandit_targets], + cwd=str(repo), + capture_output=True, + text=True, + timeout=_LINT_TIMEOUT_S, + ) + # bandit returns 1 when issues are found; treat any non-zero + # exit as a failure but pass through bandit's own message. + checks.append( + CheckResult( + name="security", + passed=proc.returncode == 0, + message="bandit clean" if proc.returncode == 0 else _truncate(proc.stdout or proc.stderr), + ) + ) + except subprocess.TimeoutExpired: + checks.append(CheckResult(name="security", passed=False, message="bandit timed out")) + + # ── pytest scoped to mapped tests ──────────────────────────────── + try: + proc = _run_with_pgroup_kill( + ["python", "-m", "pytest", "-q", "--no-cov", *[str(p) for p in test_paths]], + cwd=str(repo), + capture_output=True, + text=True, + timeout=_TEST_TIMEOUT_S, + ) + checks.append( + CheckResult( + name="tests", + passed=proc.returncode == 0, + message="tests passed" if proc.returncode == 0 else _truncate(proc.stdout or proc.stderr), + ) + ) + except subprocess.TimeoutExpired: + checks.append(CheckResult(name="tests", passed=False, message="pytest timed out")) + + result = VerificationResult(checks=checks) + floor = resolve_min_coverage(repo, cli_override=min_coverage) + _enforce_coverage_floor(result, floor) + return result + + def verify( project_dir: pathlib.Path, timeout: int = 120, @@ -1281,12 +1474,21 @@ def verify( check.message, ) - # Log summary - passed_count = sum(1 for c in checks if c.passed) - failed_count = len(checks) - passed_count + result = VerificationResult(checks=checks) + + # spec v30 Step 8: enforce the resolved coverage floor regardless of how + # the underlying ``check_coverage`` reported it. ``coverage_threshold==0`` + # means coverage was not measured this run; skip the gate. + if coverage_threshold > 0: + floor = resolve_min_coverage(project_dir, cli_override=float(coverage_threshold)) + _enforce_coverage_floor(result, floor) + + # Log summary (after floor enforcement so counts reflect any demotion) + passed_count = sum(1 for c in result.checks if c.passed) + failed_count = len(result.checks) - passed_count logger.info("Verification complete: %d passed, %d failed", passed_count, failed_count) - return VerificationResult(checks=checks) + return result def _pytest_cov_available() -> bool: diff --git a/tests/test_agent_runner.py b/tests/test_agent_runner.py index 9fc3458f..28435c20 100644 --- a/tests/test_agent_runner.py +++ b/tests/test_agent_runner.py @@ -11,6 +11,9 @@ import pytest from codelicious.agent_runner import ( + _CLAUDE_RETRY_AFTER_DEFAULT_S, + _CLAUDE_RETRY_AFTER_MAX_S, + _CLAUDE_RETRY_AFTER_MIN_S, _MAX_PROMPT_LENGTH, _POLL_INTERVAL_S, FORBIDDEN_CLI_FLAGS, @@ -19,6 +22,7 @@ _check_agent_errors, _enforce_timeout, _parse_agent_output, + _parse_claude_reset_seconds, _process_stream_event, _sanitize_prompt, _validate_command_flags, @@ -263,6 +267,53 @@ def test_sanitized_prompt_passed_to_subprocess( assert actual_prompt == "-- --dangerous-flag" +class TestSpec27Flags: + """spec v29 Step 9: chunk-level config can drive --allowedTools / --output-format.""" + + def test_allowed_tools_list_renders_csv(self, tmp_path: pathlib.Path) -> None: + import types + + config = types.SimpleNamespace( + model="", + effort="", + max_turns=0, + allowed_tools=["Edit", "Write", "Bash(pytest:*)"], + ) + cmd = _build_agent_command("test", tmp_path, config, "claude") + assert "--allowedTools" in cmd + idx = cmd.index("--allowedTools") + assert cmd[idx + 1] == "Edit,Write,Bash(pytest:*)" + + def test_output_format_default_is_stream_json(self, tmp_path: pathlib.Path) -> None: + import types + + config = types.SimpleNamespace(model="", effort="", max_turns=0) + cmd = _build_agent_command("test", tmp_path, config, "claude") + idx = cmd.index("--output-format") + assert cmd[idx + 1] == "stream-json" + + def test_output_format_override(self, tmp_path: pathlib.Path) -> None: + import types + + config = types.SimpleNamespace(model="", effort="", max_turns=0, output_format="text") + cmd = _build_agent_command("test", tmp_path, config, "claude") + idx = cmd.index("--output-format") + assert cmd[idx + 1] == "text" + + def test_default_chunk_config_has_spec27_allow_list(self) -> None: + """ClaudeCodeEngine._DEFAULT_ALLOWED_TOOLS matches spec 27 §3.2 verbatim.""" + from codelicious.engines.claude_engine import _DEFAULT_ALLOWED_TOOLS + + assert "Edit" in _DEFAULT_ALLOWED_TOOLS + assert "Write" in _DEFAULT_ALLOWED_TOOLS + assert "Bash(git status:*)" in _DEFAULT_ALLOWED_TOOLS + assert "Bash(pytest:*)" in _DEFAULT_ALLOWED_TOOLS + assert "Bash(ruff:*)" in _DEFAULT_ALLOWED_TOOLS + assert "Read" in _DEFAULT_ALLOWED_TOOLS + assert "Glob" in _DEFAULT_ALLOWED_TOOLS + assert "Grep" in _DEFAULT_ALLOWED_TOOLS + + class TestDangerousFlagNeverPresent: """Tests for S20-P1-3: --dangerously-skip-permissions is permanently removed.""" @@ -409,6 +460,50 @@ def test_returncode_zero_never_raises(self) -> None: assert _check_agent_errors(0, [], ["auth failed somehow\n"]) is None +class TestParseClaudeResetSeconds: + """Spec v29 Step 5: parse Claude rate-limit reset windows from CLI output.""" + + def test_no_match_returns_none(self) -> None: + assert _parse_claude_reset_seconds("rate limit exceeded") is None + + def test_empty_returns_none(self) -> None: + assert _parse_claude_reset_seconds("") is None + + def test_resets_in_seconds(self) -> None: + assert _parse_claude_reset_seconds("Limit resets in 30 seconds") == 30.0 + + def test_resets_in_short_form(self) -> None: + assert _parse_claude_reset_seconds("resets in 45s") == 45.0 + + def test_try_again_in_minutes(self) -> None: + assert _parse_claude_reset_seconds("Please try again in 5 minutes.") == 300.0 + + def test_try_again_short_form(self) -> None: + assert _parse_claude_reset_seconds("try again in 2m") == 120.0 + + def test_retry_after_header(self) -> None: + assert _parse_claude_reset_seconds("Retry-After: 90") == 90.0 + + def test_value_below_floor_clamped(self) -> None: + # 5s is below the 10s floor; should clamp up. + assert _parse_claude_reset_seconds("resets in 5 seconds") == _CLAUDE_RETRY_AFTER_MIN_S + + def test_value_above_cap_clamped(self) -> None: + assert _parse_claude_reset_seconds("try again in 999 minutes") == _CLAUDE_RETRY_AFTER_MAX_S + + def test_check_agent_errors_uses_parsed_value(self) -> None: + """The rate-limit branch should pick up the parsed reset window.""" + with pytest.raises(ClaudeRateLimitError) as exc_info: + _check_agent_errors(1, [], ["rate limit hit; resets in 30 seconds\n"]) + assert exc_info.value.retry_after_s == 30.0 + + def test_check_agent_errors_falls_back_to_default(self) -> None: + """When no reset window is parseable, default retry_after_s applies.""" + with pytest.raises(ClaudeRateLimitError) as exc_info: + _check_agent_errors(1, [], ["rate limit exceeded\n"]) + assert exc_info.value.retry_after_s == _CLAUDE_RETRY_AFTER_DEFAULT_S + + class TestParseAgentOutput: """Unit tests for _parse_agent_output (Finding 46).""" diff --git a/tests/test_audit_logger.py b/tests/test_audit_logger.py new file mode 100644 index 00000000..fc6c2d36 --- /dev/null +++ b/tests/test_audit_logger.py @@ -0,0 +1,202 @@ +"""Tests for ``codelicious.tools.audit_logger`` (spec v29 Step 12). + +Covers initialization, intent/outcome routing, security-event dual-write to +``audit.log`` and ``security.log``, thread-safe appends under contention, +context-manager lifecycle, and the SecurityEvent enum surface. +""" + +from __future__ import annotations + +import threading +from pathlib import Path + +import pytest + +from codelicious.tools.audit_logger import AuditLogger, SecurityEvent + + +@pytest.fixture +def audit(tmp_path: Path) -> AuditLogger: + logger = AuditLogger(tmp_path) + yield logger + logger.close() + + +class TestAuditLoggerInit: + def test_creates_log_files(self, tmp_path: Path) -> None: + AuditLogger(tmp_path).close() + assert (tmp_path / ".codelicious" / "audit.log").exists() + assert (tmp_path / ".codelicious" / "security.log").exists() + + def test_init_creates_codelicious_dir(self, tmp_path: Path) -> None: + target = tmp_path / "nested" / "repo" + target.mkdir(parents=True) + AuditLogger(target).close() + assert (target / ".codelicious").is_dir() + + +class TestToolDispatchLogging: + def test_log_tool_intent_writes_dispatch_line(self, audit: AuditLogger, tmp_path: Path) -> None: + audit.log_tool_intent("read_file", {"path": "src/main.py"}) + body = (tmp_path / ".codelicious" / "audit.log").read_text() + assert "TOOL_DISPATCH" in body + assert "read_file" in body + assert "src/main.py" in body + + def test_log_tool_outcome_success(self, audit: AuditLogger, tmp_path: Path) -> None: + audit.log_tool_outcome("read_file", {"success": True, "stdout": "hello world"}) + body = (tmp_path / ".codelicious" / "audit.log").read_text() + assert "TOOL_SUCCESS" in body + assert "INFO" in body + + def test_log_tool_outcome_failure(self, audit: AuditLogger, tmp_path: Path) -> None: + audit.log_tool_outcome("write_file", {"success": False, "stderr": "permission denied"}) + body = (tmp_path / ".codelicious" / "audit.log").read_text() + assert "TOOL_FAILED" in body + assert "ERROR" in body + assert "permission denied" in body + + +class TestSecurityEventLogging: + def test_writes_to_both_audit_and_security(self, audit: AuditLogger, tmp_path: Path) -> None: + audit.log_security_event(SecurityEvent.PATH_TRAVERSAL_BLOCKED, "blocked ../etc/passwd") + audit_body = (tmp_path / ".codelicious" / "audit.log").read_text() + security_body = (tmp_path / ".codelicious" / "security.log").read_text() + assert "PATH_TRAVERSAL_BLOCKED" in audit_body + assert "PATH_TRAVERSAL_BLOCKED" in security_body + assert "blocked ../etc/passwd" in security_body + + def test_iteration_and_tool_context_in_security_line(self, audit: AuditLogger, tmp_path: Path) -> None: + audit.set_iteration(7) + audit.set_current_tool("write_file") + audit.log_security_event(SecurityEvent.EXTENSION_BLOCKED, "blocked .exe") + body = (tmp_path / ".codelicious" / "security.log").read_text() + assert "iteration 7" in body + assert "write_file" in body + + def test_explicit_overrides_take_precedence(self, audit: AuditLogger, tmp_path: Path) -> None: + audit.set_iteration(1) + audit.set_current_tool("read_file") + audit.log_security_event( + SecurityEvent.COMMAND_DENIED, + "denied rm -rf /", + iteration=99, + tool="run_command", + ) + body = (tmp_path / ".codelicious" / "security.log").read_text() + assert "iteration 99" in body + assert "run_command" in body + + def test_log_sandbox_violation_with_explicit_event_routes_to_security_log( + self, audit: AuditLogger, tmp_path: Path + ) -> None: + audit.log_sandbox_violation("denied symlink escape", event_type=SecurityEvent.SYMLINK_ESCAPE_BLOCKED) + body = (tmp_path / ".codelicious" / "security.log").read_text() + assert "SYMLINK_ESCAPE_BLOCKED" in body + + def test_log_sandbox_violation_without_event_falls_back_to_audit(self, audit: AuditLogger, tmp_path: Path) -> None: + audit.log_sandbox_violation("legacy untyped violation") + audit_body = (tmp_path / ".codelicious" / "audit.log").read_text() + security_body = (tmp_path / ".codelicious" / "security.log").read_text() + assert "SANDBOX TRAP" in audit_body + # Untyped violations are not security-event-typed, so they don't + # land in security.log. + assert "legacy untyped violation" not in security_body + + +class TestSecurityEventEnum: + def test_known_members_are_strings(self) -> None: + assert SecurityEvent.COMMAND_DENIED.value == "COMMAND_DENIED" + assert SecurityEvent.METACHAR_BLOCKED.value == "METACHAR_BLOCKED" + + def test_str_inheritance(self) -> None: + # SecurityEvent inherits from str → comparable to plain strings. + assert SecurityEvent.PATH_TRAVERSAL_BLOCKED == "PATH_TRAVERSAL_BLOCKED" + + +class TestThreadSafeAppend: + def test_concurrent_appends_preserve_all_lines(self, tmp_path: Path) -> None: + """10 threads × 100 entries each produce 1000 distinct lines, none truncated.""" + logger = AuditLogger(tmp_path) + try: + entries_per_thread = 100 + thread_count = 10 + barrier = threading.Barrier(thread_count) + + def writer(worker_id: int) -> None: + barrier.wait() + for i in range(entries_per_thread): + logger.log_tool_intent("noop", {"worker": worker_id, "i": i}) + + threads = [threading.Thread(target=writer, args=(w,)) for w in range(thread_count)] + for t in threads: + t.start() + for t in threads: + t.join(timeout=15) + finally: + logger.close() + + lines = (tmp_path / ".codelicious" / "audit.log").read_text().splitlines() + assert len(lines) == thread_count * entries_per_thread + # No line should be truncated mid-write — every line ends with '...' + # because log_tool_intent appends ellipsis-free Intent: prefix; verify + # presence of the dispatch tag instead. + for line in lines: + assert "TOOL_DISPATCH" in line + + +def _mp_writer(repo_path_str: str, worker_id: int, count: int) -> None: # pragma: no cover + """Multiprocessing target — must be module-level for pickling.""" + from codelicious.tools.audit_logger import AuditLogger + + logger = AuditLogger(Path(repo_path_str)) + try: + for i in range(count): + logger.log_tool_intent("noop", {"worker": worker_id, "i": i}) + finally: + logger.close() + + +class TestCrossProcessAppend: + """spec v30 Step 11: ``fcntl.flock`` keeps audit lines from interleaving across processes.""" + + def test_concurrent_processes_do_not_interleave(self, tmp_path: Path) -> None: + import multiprocessing + import sys as _sys + + if _sys.platform == "win32": + import pytest as _pytest + + _pytest.skip("POSIX-only test; Windows uses msvcrt fallback or no locking") + + workers = 4 + per_worker = 50 + # Use spawn to keep the child interpreter clean of pytest state. + ctx = multiprocessing.get_context("spawn") + procs = [ctx.Process(target=_mp_writer, args=(str(tmp_path), w, per_worker)) for w in range(workers)] + for p in procs: + p.start() + for p in procs: + p.join(timeout=20) + assert p.exitcode == 0, f"worker {p.pid} exited with {p.exitcode}" + + lines = (tmp_path / ".codelicious" / "audit.log").read_text().splitlines() + assert len(lines) == workers * per_worker + for line in lines: + assert "TOOL_DISPATCH" in line, f"interleaved line: {line!r}" + + +class TestLifecycle: + def test_context_manager_closes_handles(self, tmp_path: Path) -> None: + with AuditLogger(tmp_path) as logger: + logger.log_tool_intent("noop", {}) + assert logger._audit_fh is not None + assert not logger._audit_fh.closed + assert logger._audit_fh.closed + assert logger._security_fh.closed + + def test_double_close_is_safe(self, tmp_path: Path) -> None: + logger = AuditLogger(tmp_path) + logger.close() + # Should not raise. + logger.close() diff --git a/tests/test_chunker.py b/tests/test_chunker.py index 094b1c1e..f6c1c559 100644 --- a/tests/test_chunker.py +++ b/tests/test_chunker.py @@ -329,3 +329,157 @@ def test_markdown_code_fence_stripped(self, tmp_path: pathlib.Path) -> None: chunks = chunk_spec_with_llm(spec, tmp_path, llm) assert len(chunks) == 1 assert chunks[0].title == "Task" + + def test_too_many_chunks_truncated_with_warning( + self, tmp_path: pathlib.Path, caplog: pytest.LogCaptureFixture + ) -> None: + """LLM producing > _MAX_CHUNKS_PER_SPEC is truncated to the cap with a WARNING (spec v29 Step 7).""" + from codelicious.chunker import _MAX_CHUNKS_PER_SPEC, chunk_spec_with_llm + + spec = self._write_spec(tmp_path, "# Spec\n\nMassive.\n") + items = ",".join( + f'{{"title": "T{i}", "description": "d", "files": [], "depends_on_indices": [], "validation": ""}}' + for i in range(_MAX_CHUNKS_PER_SPEC + 1) + ) + llm = self._mock_llm(f"[{items}]") + + with caplog.at_level("WARNING", logger="codelicious.chunker"): + chunks = chunk_spec_with_llm(spec, tmp_path, llm) + assert len(chunks) == _MAX_CHUNKS_PER_SPEC + assert any("truncating" in r.message.lower() for r in caplog.records) + + def test_non_array_json_falls_back(self, tmp_path: pathlib.Path) -> None: + """JSON object (not array) at top level falls back to deterministic.""" + spec = self._write_spec(tmp_path, "# Spec\n\n## P1\n\n- [ ] Single task\n") + llm = self._mock_llm('{"title": "wrong shape"}') + + from codelicious.chunker import chunk_spec_with_llm + + chunks = chunk_spec_with_llm(spec, tmp_path, llm) + assert len(chunks) == 1 + assert "Single task" in chunks[0].title + + def test_short_spec_uses_single_window(self, tmp_path: pathlib.Path) -> None: + """Short specs make exactly one LLM call (spec v29 Step 7).""" + spec = self._write_spec(tmp_path, "# Short\n\n## P1\n\n- [ ] X\n") + llm = self._mock_llm( + '[{"title": "X", "description": "d", "files": [], "depends_on_indices": [], "validation": ""}]' + ) + + from codelicious.chunker import chunk_spec_with_llm + + chunk_spec_with_llm(spec, tmp_path, llm) + assert llm.chat_completion.call_count == 1 + + def test_oversized_spec_makes_multiple_windowed_calls( + self, tmp_path: pathlib.Path, caplog: pytest.LogCaptureFixture + ) -> None: + """A spec well over _LLM_WINDOW_SIZE drives multiple LLM calls (spec v29 Step 7).""" + from codelicious.chunker import _LLM_WINDOW_SIZE, chunk_spec_with_llm + + big = "x " * (_LLM_WINDOW_SIZE * 3) + spec = self._write_spec(tmp_path, big) + llm = self._mock_llm( + '[{"title": "T", "description": "d", "files": [], "depends_on_indices": [], "validation": ""}]' + ) + + with caplog.at_level("WARNING", logger="codelicious.chunker"): + chunks = chunk_spec_with_llm(spec, tmp_path, llm) + assert llm.chat_completion.call_count >= 2 + # Title is the same across windows; dedup keeps a single chunk. + assert len(chunks) == 1 + assert any("splitting into" in r.message for r in caplog.records) + + def test_split_spec_helper_caps_window_count(self) -> None: + """``_split_spec_for_llm`` never produces more than _LLM_MAX_WINDOWS windows.""" + from codelicious.chunker import _LLM_MAX_WINDOWS, _LLM_WINDOW_SIZE, _split_spec_for_llm + + huge = "y " * (_LLM_WINDOW_SIZE * 50) + windows = _split_spec_for_llm(huge) + assert len(windows) == _LLM_MAX_WINDOWS + + def test_self_referential_dep_dropped(self, tmp_path: pathlib.Path) -> None: + """A chunk depending on itself is silently dropped (i != idx guard).""" + spec = self._write_spec(tmp_path, "# Spec\n\nWork.\n") + llm = self._mock_llm( + '[{"title": "Solo", "description": "d", "files": [], "depends_on_indices": [0], "validation": ""}]' + ) + + from codelicious.chunker import chunk_spec_with_llm + + chunks = chunk_spec_with_llm(spec, tmp_path, llm) + assert len(chunks) == 1 + assert chunks[0].depends_on == [] + + +# ───────────────────────────────────────────────────────────────────────── +# spec v30 Step 6: token-budget-aware chunk sizing +# ───────────────────────────────────────────────────────────────────────── + + +class TestTokenBudget: + def _wc(self, files: list[str], chunk_id: str = "spec-1-chunk-01") -> WorkChunk: + from codelicious.chunker import WorkChunk + + return WorkChunk( + id=chunk_id, + spec_path=pathlib.Path("docs/specs/01.md"), + title="t", + description="d", + depends_on=[], + estimated_files=files, + validation="", + ) + + def test_under_budget_unchanged(self, tmp_path: pathlib.Path) -> None: + from codelicious.chunker import enforce_token_budget + + wc = self._wc([]) + out = enforce_token_budget([wc], tmp_path, engines=["claude"]) + assert out == [wc] + + def test_over_budget_chunk_is_split(self, tmp_path: pathlib.Path) -> None: + from codelicious.chunker import enforce_token_budget + + # 4 files, each 30k chars → 120k chars / 4 = 30k tokens. HF budget 24k. + files = [] + for i in range(4): + p = tmp_path / f"big_{i}.py" + p.write_text("x" * 30_000) + files.append(p.name) + wc = self._wc(files) + out = enforce_token_budget([wc], tmp_path, engines=["huggingface"]) + # Splits at least once → two or more chunks; second one depends on first. + assert len(out) >= 2 + assert out[0].id == wc.id + assert wc.id in out[1].depends_on + + def test_recursive_split_caps_at_depth(self, tmp_path: pathlib.Path, caplog: pytest.LogCaptureFixture) -> None: + """Even a runaway chunk produces at most 2**depth+1 sub-chunks.""" + from codelicious.chunker import _MAX_CHUNK_SPLIT_DEPTH, enforce_token_budget + + # One enormous file dominates the budget; can't split below 1 file. + big = tmp_path / "huge.py" + big.write_text("y" * 200_000) + wc = self._wc([big.name]) + with caplog.at_level("WARNING", logger="codelicious.chunker"): + out = enforce_token_budget([wc], tmp_path, engines=["huggingface"]) + # Single-file chunk can't subdivide further; original is dispatched anyway. + assert len(out) == 1 + assert any("dispatch anyway" in r.message for r in caplog.records) + assert _MAX_CHUNK_SPLIT_DEPTH == 3 + + def test_split_preserves_total_file_coverage(self, tmp_path: pathlib.Path) -> None: + from codelicious.chunker import enforce_token_budget + + files = [] + for i in range(4): + p = tmp_path / f"file_{i}.py" + p.write_text("z" * 30_000) + files.append(p.name) + wc = self._wc(files) + out = enforce_token_budget([wc], tmp_path, engines=["huggingface"]) + covered: list[str] = [] + for c in out: + covered.extend(c.estimated_files) + assert sorted(covered) == sorted(files) diff --git a/tests/test_cli.py b/tests/test_cli.py index 74f431eb..12e474cf 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -27,8 +27,8 @@ main, setup_logger, ) -from codelicious.engines.base import BuildResult from codelicious.git.git_orchestrator import GitManager +from codelicious.orchestrator import OrchestratorResult @pytest.fixture @@ -46,12 +46,6 @@ def mock_successful_engine(): engine = mock.MagicMock() engine.name = "mock-engine" - engine.run_build_cycle.return_value = BuildResult( - success=True, - message="Build completed successfully", - session_id="test-123", - elapsed_s=10.5, - ) # v2 orchestrator chunk methods engine.execute_chunk.return_value = ChunkResult(success=True, files_modified=["src/foo.py"], message="done") engine.verify_chunk.return_value = ChunkResult(success=True, message="passed") @@ -66,12 +60,6 @@ def mock_failed_engine(): engine = mock.MagicMock() engine.name = "mock-engine" - engine.run_build_cycle.return_value = BuildResult( - success=False, - message="Build failed: test error", - session_id="test-456", - elapsed_s=5.0, - ) # v2 orchestrator chunk methods engine.execute_chunk.return_value = ChunkResult(success=False, message="failed") engine.verify_chunk.return_value = ChunkResult(success=False, message="failed") @@ -333,7 +321,7 @@ def _skip_dep_validation(self): def test_no_incomplete_specs_exits_zero_without_build( self, mock_repo: Path, mock_successful_engine, mock_git_manager ): - """When discover_incomplete_specs returns [], main() exits 0 without running engine.run_build_cycle.""" + """When discover_incomplete_specs returns [], main() exits 0 without dispatching any chunk.""" # Patch both walk_for_specs (for the banner) and discover_incomplete_specs (for the guard) # to return empty lists, simulating a fully-complete repo. with mock.patch("codelicious.cli.select_engine", return_value=mock_successful_engine): @@ -346,7 +334,7 @@ def test_no_incomplete_specs_exits_zero_without_build( main() assert exc_info.value.code == 0 - mock_successful_engine.run_build_cycle.assert_not_called() + mock_successful_engine.execute_chunk.assert_not_called() class TestPrintBanner: @@ -416,7 +404,7 @@ class TestPrintResult: def test_print_result_success(self, tmp_path: Path): """_print_result prints BUILD COMPLETE for a successful result.""" - result = BuildResult(success=True, message="Done.", session_id="s1", elapsed_s=5.0) + result = OrchestratorResult(success=True, message="Done.", elapsed_s=5.0) captured = io.StringIO() with mock.patch("codelicious.cli.walk_for_specs", return_value=[]): @@ -435,7 +423,7 @@ def test_print_result_success(self, tmp_path: Path): def test_print_result_failure(self, tmp_path: Path): """_print_result prints BUILD FINISHED (with issues) for a failed result.""" - result = BuildResult(success=False, message="Some error.", session_id="s2", elapsed_s=3.0) + result = OrchestratorResult(success=False, message="Some error.", elapsed_s=3.0) captured = io.StringIO() with mock.patch("codelicious.cli.walk_for_specs", return_value=[]), mock.patch("sys.stdout", captured): @@ -453,7 +441,7 @@ def test_print_result_failure(self, tmp_path: Path): def test_print_result_elapsed_time_formatted(self, tmp_path: Path): """_print_result formats elapsed time in minutes and seconds for long runs.""" - result = BuildResult(success=True, message="", session_id="s3", elapsed_s=90.0) + result = OrchestratorResult(success=True, message="", elapsed_s=90.0) captured = io.StringIO() with mock.patch("codelicious.cli.walk_for_specs", return_value=[]), mock.patch("sys.stdout", captured): @@ -845,6 +833,15 @@ def test_max_commits_per_pr_default_8(self): opts = _parse_args(sys.argv) assert opts["max_commits_per_pr"] == 8 + def test_max_loc_per_pr_default_matches_constant(self): + """max_loc_per_pr default tracks the _DEFAULT_PR_LOC_CAP constant (spec v29 Step 1).""" + from codelicious.cli import _DEFAULT_PR_COMMIT_CAP, _DEFAULT_PR_LOC_CAP + + with mock.patch.object(sys, "argv", ["codelicious", "/tmp/repo"]): + opts = _parse_args(sys.argv) + assert opts["max_loc_per_pr"] == _DEFAULT_PR_LOC_CAP == 250 + assert opts["max_commits_per_pr"] == _DEFAULT_PR_COMMIT_CAP == 8 + def test_max_commits_per_pr_over_100_exits(self): """--max-commits-per-pr > 100 exits with code 2.""" with mock.patch.object(sys, "argv", ["codelicious", "/tmp/repo", "--max-commits-per-pr", "101"]): @@ -970,9 +967,10 @@ def fake_run(args, **kw): assert info["transport"] == "ssh" assert info["ssh_key_loaded"] is True + assert info["ssh_agent_state"] == "keys_loaded" def test_ssh_repo_with_no_keys(self, tmp_path: Path) -> None: - """SSH remote with no agent keys → ssh_key_loaded False.""" + """SSH remote with no agent keys → ssh_key_loaded False, state=empty.""" cls = self.__class__ def fake_run(args, **kw): @@ -989,6 +987,46 @@ def fake_run(args, **kw): assert info["transport"] == "ssh" assert info["ssh_key_loaded"] is False + assert info["ssh_agent_state"] == "empty" + + def test_ssh_repo_with_no_agent(self, tmp_path: Path) -> None: + """SSH remote, ssh-add exit 2 → ssh_agent_state=no_agent (spec v29 Step 14).""" + cls = self.__class__ + + def fake_run(args, **kw): + if args[:2] == ["git", "-C"] and "remote.origin.url" in args: + return cls._result(0, "git@github.com:o/r.git\n") + if args[:2] == ["git", "-C"] and "commit.gpgsign" in args: + return cls._result(1, "") + if args[0] == "ssh-add": + return cls._result(2, "", "Could not open a connection to your authentication agent.\n") + raise AssertionError(f"unexpected call {args}") + + with mock.patch("subprocess.run", side_effect=fake_run): + info = _probe_git_credentials(tmp_path) + + assert info["transport"] == "ssh" + assert info["ssh_key_loaded"] is False + assert info["ssh_agent_state"] == "no_agent" + + def test_ssh_repo_probe_unknown_on_oserror(self, tmp_path: Path) -> None: + """OSError on ssh-add → ssh_agent_state=unknown, conservative not-loaded.""" + cls = self.__class__ + + def fake_run(args, **kw): + if args[:2] == ["git", "-C"] and "remote.origin.url" in args: + return cls._result(0, "git@github.com:o/r.git\n") + if args[:2] == ["git", "-C"] and "commit.gpgsign" in args: + return cls._result(1, "") + if args[0] == "ssh-add": + raise OSError("ssh-add binary missing") + raise AssertionError(f"unexpected call {args}") + + with mock.patch("subprocess.run", side_effect=fake_run): + info = _probe_git_credentials(tmp_path) + + assert info["ssh_key_loaded"] is False + assert info["ssh_agent_state"] == "unknown" def test_gpgsign_true_with_warm_agent(self, tmp_path: Path) -> None: """commit.gpgsign=true + secret keys present → gpg_agent_warm True.""" @@ -1392,3 +1430,267 @@ def test_spec_override_builds_single_spec(self, mock_repo: Path, mock_git_manage main() mock_successful_engine.execute_chunk.assert_called() + + +# -- spec v30 Step 4: SIGTERM graceful shutdown ----------------------------- + + +class TestSigtermIntegration: + """spec v30 Step 4: SIGTERM raises SystemExit(143) and releases the run lock.""" + + def test_sigterm_handler_raises_system_exit_143(self) -> None: + """The handler exits with code 143 (128 + SIGTERM=15) per CLAUDE.md.""" + from codelicious.cli import _handle_sigterm + + with pytest.raises(SystemExit) as exc: + _handle_sigterm(15, None) + assert exc.value.code == 143 + + def test_subprocess_exits_within_grace_window(self, tmp_path: Path) -> None: + """A child process holding the run lock terminates cleanly on SIGTERM.""" + import signal as _signal + import subprocess + import sys as _sys + import time as _time + + if _sys.platform == "win32": + pytest.skip("POSIX-only test") + + repo_root = Path(__file__).resolve().parent.parent + script = ( + "import sys, time, signal;" + "sys.path.insert(0, 'src');" + "from codelicious.cli import _handle_sigterm, _run_lock;" + "import pathlib;" + "signal.signal(signal.SIGTERM, _handle_sigterm);" + f"cm = _run_lock(pathlib.Path({str(tmp_path)!r}));" + "cm.__enter__();" + "time.sleep(30)" + ) + proc = subprocess.Popen( + [_sys.executable, "-c", script], + cwd=repo_root, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + ) + + # Give the child time to acquire the lock. + for _ in range(40): + if (tmp_path / ".codelicious" / "run.lock").exists(): + break + _time.sleep(0.05) + + proc.send_signal(_signal.SIGTERM) + try: + _stdout, stderr = proc.communicate(timeout=8) + except subprocess.TimeoutExpired: + proc.kill() + _stdout, stderr = proc.communicate() + pytest.fail(f"Child did not exit within 8 s after SIGTERM. stderr={stderr.decode(errors='replace')[:500]}") + + assert proc.returncode == 143, f"expected 143, got {proc.returncode}: {stderr.decode(errors='replace')[:500]}" + # The atexit hook in _run_lock should have removed the lockfile. + assert not (tmp_path / ".codelicious" / "run.lock").exists() + + +# -- spec v30 Step 12: postmortem dump on abnormal exit --------------------- + + +class TestPostmortem: + """spec v30 Step 12: ``_write_postmortem`` summarises run state on abort.""" + + def test_writes_markdown_with_counts_and_resume(self, tmp_path: Path) -> None: + import json + + from codelicious.cli import _write_postmortem + + state = tmp_path / ".codelicious" / "state" + state.mkdir(parents=True) + (state / "spec-27.json").write_text( + json.dumps( + { + "chunks": { + "spec-27-chunk-01": {"status": "merged", "title": "A"}, + "spec-27-chunk-02": {"status": "failed", "title": "B"}, + "spec-27-chunk-03": {"status": "failed", "title": "C"}, + } + } + ) + ) + + log = tmp_path / "build.log" + log.write_text("\n".join(f"line-{i}" for i in range(75)), encoding="utf-8") + + path = _write_postmortem(tmp_path, abort_reason="rate-limit", log_path=log, spec_arg=".") + assert path is not None + body = path.read_text() + assert "Abort reason:** rate-limit" in body + assert "merged: 1" in body + assert "failed: 2" in body + assert "Failed chunks" in body + assert "- B" in body + assert "Log tail" in body + assert "line-74" in body + # Resume hint is present and uses the no-reset form for rate-limit. + assert "codelicious build ." in body + assert "--reset-ledger" not in body + + def test_handles_missing_state_dir(self, tmp_path: Path) -> None: + from codelicious.cli import _write_postmortem + + path = _write_postmortem(tmp_path, abort_reason="exception", log_path=None) + assert path is not None + body = path.read_text() + # Empty state → all counts zero, but the file still writes cleanly. + assert "merged: 0" in body + assert "failed: 0" in body + + def test_corrupt_ledger_suggests_reset(self, tmp_path: Path) -> None: + from codelicious.cli import _write_postmortem + + path = _write_postmortem(tmp_path, abort_reason="ledger-corrupt", log_path=None) + body = path.read_text() + assert "--reset-ledger" in body + + +# -- spec v30 Step 7: atomic latest.log symlink update ---------------------- + + +class TestAtomicSymlinkUpdate: + """spec v30 Step 7: ``_atomic_symlink_update`` swaps via tmp + os.replace.""" + + def test_creates_link_when_missing(self, tmp_path: Path) -> None: + from codelicious.cli import _atomic_symlink_update + + target = tmp_path / "build-1.log" + target.write_text("hi") + link = tmp_path / "latest.log" + _atomic_symlink_update(link, target.name) + assert link.is_symlink() + assert link.resolve() == target.resolve() + + def test_repeated_update_replaces_target(self, tmp_path: Path) -> None: + from codelicious.cli import _atomic_symlink_update + + a = tmp_path / "build-1.log" + b = tmp_path / "build-2.log" + a.write_text("1") + b.write_text("2") + link = tmp_path / "latest.log" + _atomic_symlink_update(link, a.name) + _atomic_symlink_update(link, b.name) + assert link.is_symlink() + assert link.resolve() == b.resolve() + + def test_no_dangling_tmp_after_update(self, tmp_path: Path) -> None: + from codelicious.cli import _atomic_symlink_update + + target = tmp_path / "build-1.log" + target.write_text("hi") + link = tmp_path / "latest.log" + _atomic_symlink_update(link, target.name) + # No leftover *.tmp siblings. + leftovers = list(tmp_path.glob("latest.log.*.tmp")) + assert leftovers == [] + + +# -- spec v30 Step 1: per-repo run lockfile --------------------------------- + + +class TestRunLock: + """spec v30 Step 1: ``_run_lock`` prevents concurrent codelicious runs on a repo.""" + + def test_acquires_and_releases(self, tmp_path: Path) -> None: + import os as _os + + from codelicious.cli import _run_lock + + with _run_lock(tmp_path) as lock_path: + assert lock_path is not None + assert lock_path.exists() + assert lock_path.read_text().strip() == str(_os.getpid()) + + # After exit the lock file is removed. + assert not (tmp_path / ".codelicious" / "run.lock").exists() + + def test_second_acquire_in_subprocess_exits_75(self, tmp_path: Path) -> None: + """A second invocation while the first holds the lock exits 75 (EX_TEMPFAIL).""" + import subprocess + import sys + + from codelicious.cli import _run_lock + + # Hold the lock and try to grab it again from a child Python process. + with _run_lock(tmp_path): + child = subprocess.run( + [ + sys.executable, + "-c", + ( + "import sys; sys.path.insert(0, 'src'); " + "from codelicious.cli import _run_lock; " + f"cm = _run_lock(__import__('pathlib').Path({str(tmp_path)!r})); cm.__enter__()" + ), + ], + capture_output=True, + text=True, + cwd=Path(__file__).resolve().parent.parent, + timeout=10, + ) + assert child.returncode == 75 + assert "another codelicious run is in progress" in child.stderr + + +# -- spec v30 Step 3: --endpoint / LLM_ENDPOINT validation ------------------- + + +class TestEndpointValidation: + """spec v30 Step 3: insecure or credential-bearing endpoints rejected at CLI layer.""" + + def test_empty_url_accepted(self) -> None: + from codelicious.cli import _validate_endpoint_url_strict + + # Should not raise / sys.exit + _validate_endpoint_url_strict("") + + def test_https_url_accepted(self) -> None: + from codelicious.cli import _validate_endpoint_url_strict + + _validate_endpoint_url_strict("https://api.example.com/v1/chat") + + def test_http_url_rejected(self, capsys: pytest.CaptureFixture[str]) -> None: + from codelicious.cli import _validate_endpoint_url_strict + + with pytest.raises(SystemExit) as exc: + _validate_endpoint_url_strict("http://internal-proxy/v1") + assert exc.value.code == 2 + err = capsys.readouterr().err + assert "https://" in err + assert "http://internal-proxy/v1" in err + + def test_user_info_rejected_and_masked(self, capsys: pytest.CaptureFixture[str]) -> None: + from codelicious.cli import _validate_endpoint_url_strict + + with pytest.raises(SystemExit) as exc: + _validate_endpoint_url_strict("https://user:secret@host/v1") + assert exc.value.code == 2 + err = capsys.readouterr().err + assert "***@host" in err + assert "secret" not in err + + def test_empty_host_rejected(self, capsys: pytest.CaptureFixture[str]) -> None: + from codelicious.cli import _validate_endpoint_url_strict + + with pytest.raises(SystemExit) as exc: + _validate_endpoint_url_strict("https:///path-only") + assert exc.value.code == 2 + assert "no hostname" in capsys.readouterr().err + + def test_malformed_url_rejected(self, capsys: pytest.CaptureFixture[str]) -> None: + from codelicious.cli import _validate_endpoint_url_strict + + with pytest.raises(SystemExit) as exc: + # a scheme-less garbage string parses but has no scheme → http rejected branch + _validate_endpoint_url_strict("not a url at all") + assert exc.value.code == 2 + assert "https://" in capsys.readouterr().err diff --git a/tests/test_engine_base.py b/tests/test_engine_base.py index eb84a528..9b5d0ddf 100644 --- a/tests/test_engine_base.py +++ b/tests/test_engine_base.py @@ -1,9 +1,9 @@ -"""Tests for the BuildEngine abstract base class and BuildResult dataclass. +"""Tests for the BuildEngine abstract base class. Covers: - BuildEngine cannot be directly instantiated (abstract class enforcement) -- Concrete subclasses must implement all abstract members -- BuildResult field creation and default values +- Concrete subclasses must implement all three abstract chunk methods +- ChunkResult and EngineContext dataclass field creation and defaults - select_engine factory function """ @@ -14,53 +14,10 @@ import pytest -from codelicious.engines.base import BuildEngine, BuildResult, ChunkResult, EngineContext +from codelicious.engines.base import BuildEngine, ChunkResult, EngineContext from codelicious.engines.claude_engine import ClaudeCodeEngine from codelicious.engines.huggingface_engine import HuggingFaceEngine -# --------------------------------------------------------------------------- -# BuildResult tests -# --------------------------------------------------------------------------- - - -class TestBuildResult: - """Tests for the BuildResult dataclass.""" - - def test_build_result_creation(self) -> None: - """BuildResult stores all provided field values correctly.""" - result = BuildResult( - success=True, - message="All specs complete.", - session_id="abc-123", - elapsed_s=42.5, - ) - - assert result.success is True - assert result.message == "All specs complete." - assert result.session_id == "abc-123" - assert result.elapsed_s == 42.5 - - def test_build_result_defaults(self) -> None: - """BuildResult has correct default field values when only success is provided.""" - result = BuildResult(success=False) - - assert result.success is False - assert result.message == "" - assert result.session_id == "" - assert result.elapsed_s == 0.0 - - def test_build_result_success_true(self) -> None: - """BuildResult with success=True is truthy for the success field.""" - result = BuildResult(success=True) - assert result.success is True - - def test_build_result_success_false(self) -> None: - """BuildResult with success=False reflects a failed build.""" - result = BuildResult(success=False, message="Exhausted iteration limit.") - assert result.success is False - assert result.message == "Exhausted iteration limit." - - # --------------------------------------------------------------------------- # BuildEngine abstract class enforcement tests # --------------------------------------------------------------------------- @@ -74,17 +31,55 @@ def test_base_engine_cannot_be_instantiated(self) -> None: with pytest.raises(TypeError): BuildEngine() # type: ignore[abstract] - def test_subclass_must_implement_all_abstract(self) -> None: - """A subclass that omits run_build_cycle cannot be instantiated.""" + def test_subclass_must_implement_execute_chunk(self) -> None: + """A subclass that omits execute_chunk cannot be instantiated.""" class PartialEngine(BuildEngine): - """Implements name but not run_build_cycle.""" + """Implements name, verify_chunk, fix_chunk but not execute_chunk.""" + + @property + def name(self) -> str: + return "Partial" + + def verify_chunk(self, chunk, repo_path): + return ChunkResult(success=True) + + def fix_chunk(self, chunk, repo_path, failures): + return ChunkResult(success=True) + + with pytest.raises(TypeError): + PartialEngine() # type: ignore[abstract] + + def test_subclass_must_implement_verify_chunk(self) -> None: + """A subclass that omits verify_chunk cannot be instantiated.""" + + class PartialEngine(BuildEngine): + @property + def name(self) -> str: + return "Partial" + + def execute_chunk(self, chunk, repo_path, context): + return ChunkResult(success=True) + + def fix_chunk(self, chunk, repo_path, failures): + return ChunkResult(success=True) + + with pytest.raises(TypeError): + PartialEngine() # type: ignore[abstract] + + def test_subclass_must_implement_fix_chunk(self) -> None: + """A subclass that omits fix_chunk cannot be instantiated.""" + class PartialEngine(BuildEngine): @property def name(self) -> str: return "Partial" - # Intentionally omits run_build_cycle + def execute_chunk(self, chunk, repo_path, context): + return ChunkResult(success=True) + + def verify_chunk(self, chunk, repo_path): + return ChunkResult(success=True) with pytest.raises(TypeError): PartialEngine() # type: ignore[abstract] @@ -93,12 +88,16 @@ def test_subclass_missing_name_property_cannot_be_instantiated(self) -> None: """A subclass that omits the name property cannot be instantiated.""" class NoNameEngine(BuildEngine): - """Implements run_build_cycle but not name.""" + """Implements chunk methods but not name.""" - def run_build_cycle(self, repo_path, git_manager, cache_manager, spec_filter=None, **kwargs): - return BuildResult(success=True) + def execute_chunk(self, chunk, repo_path, context): + return ChunkResult(success=True) - # Intentionally omits name property + def verify_chunk(self, chunk, repo_path): + return ChunkResult(success=True) + + def fix_chunk(self, chunk, repo_path, failures): + return ChunkResult(success=True) with pytest.raises(TypeError): NoNameEngine() # type: ignore[abstract] @@ -122,28 +121,12 @@ def verify_chunk(self, chunk, repo_path): def fix_chunk(self, chunk, repo_path, failures): return ChunkResult(success=True, message="fixed") - def run_build_cycle( - self, - repo_path: pathlib.Path, - git_manager: object, - cache_manager: object, - spec_filter: str | None = None, - **kwargs, - ) -> BuildResult: - return BuildResult(success=True, message="Done", elapsed_s=1.0) - engine = ConcreteEngine() assert engine.name == "Concrete Engine" - result = engine.run_build_cycle( - repo_path=pathlib.Path("/tmp"), - git_manager=object(), - cache_manager=object(), - ) - assert isinstance(result, BuildResult) + result = engine.execute_chunk(None, pathlib.Path("/tmp"), EngineContext()) + assert isinstance(result, ChunkResult) assert result.success is True - assert result.message == "Done" - assert result.elapsed_s == 1.0 def test_subclass_name_property_is_accessible(self) -> None: """The name property on a concrete subclass returns the expected string.""" @@ -162,16 +145,12 @@ def verify_chunk(self, chunk, repo_path): def fix_chunk(self, chunk, repo_path, failures): return ChunkResult(success=True) - def run_build_cycle(self, repo_path, git_manager, cache_manager, spec_filter=None, **kwargs): - return BuildResult(success=False) - engine = NamedEngine() assert engine.name == "My Engine" # --------------------------------------------------------------------------- # Engine contract tests — verify both concrete engines implement the ABC -# (merged from test_engine_contract.py, spec-18 Phase 11) # --------------------------------------------------------------------------- @@ -200,26 +179,48 @@ def test_hf_engine_has_name(self) -> None: assert isinstance(engine.name, str) assert len(engine.name) > 0 - def test_claude_engine_has_run_build_cycle(self) -> None: - """ClaudeCodeEngine must expose a callable run_build_cycle method.""" + def test_claude_engine_has_execute_chunk(self) -> None: + """ClaudeCodeEngine must expose a callable execute_chunk method.""" engine = ClaudeCodeEngine() - assert hasattr(engine, "run_build_cycle") - assert callable(engine.run_build_cycle) + assert hasattr(engine, "execute_chunk") + assert callable(engine.execute_chunk) - def test_hf_engine_has_run_build_cycle(self) -> None: - """HuggingFaceEngine must expose a callable run_build_cycle method.""" + def test_hf_engine_has_execute_chunk(self) -> None: + """HuggingFaceEngine must expose a callable execute_chunk method.""" engine = HuggingFaceEngine() - assert hasattr(engine, "run_build_cycle") - assert callable(engine.run_build_cycle) + assert hasattr(engine, "execute_chunk") + assert callable(engine.execute_chunk) + def test_claude_engine_has_verify_chunk(self) -> None: + """ClaudeCodeEngine must expose a callable verify_chunk method.""" + engine = ClaudeCodeEngine() + assert hasattr(engine, "verify_chunk") + assert callable(engine.verify_chunk) + + def test_hf_engine_has_verify_chunk(self) -> None: + """HuggingFaceEngine must expose a callable verify_chunk method.""" + engine = HuggingFaceEngine() + assert hasattr(engine, "verify_chunk") + assert callable(engine.verify_chunk) + + def test_claude_engine_has_fix_chunk(self) -> None: + """ClaudeCodeEngine must expose a callable fix_chunk method.""" + engine = ClaudeCodeEngine() + assert hasattr(engine, "fix_chunk") + assert callable(engine.fix_chunk) -class TestBuildResultMessageType: - """Targeted type-check for BuildResult.message (not covered by TestBuildResult).""" + def test_hf_engine_has_fix_chunk(self) -> None: + """HuggingFaceEngine must expose a callable fix_chunk method.""" + engine = HuggingFaceEngine() + assert hasattr(engine, "fix_chunk") + assert callable(engine.fix_chunk) - def test_build_result_message_is_str(self) -> None: - """BuildResult.message must be an instance of str.""" - result = BuildResult(success=False, message="failed") - assert isinstance(result.message, str) + def test_engines_do_not_have_run_build_cycle(self) -> None: + """Neither engine should expose run_build_cycle (removed in spec v29 Step 8).""" + for cls in (ClaudeCodeEngine, HuggingFaceEngine): + assert not hasattr(cls, "run_build_cycle"), ( + f"{cls.__name__} still has run_build_cycle — legacy method was not removed" + ) # --------------------------------------------------------------------------- @@ -298,9 +299,6 @@ def verify_chunk(self, chunk, repo_path): def fix_chunk(self, chunk, repo_path, failures): return ChunkResult(success=True) - def run_build_cycle(self, repo_path, git_manager, cache_manager, spec_filter=None, **kwargs): - return BuildResult(success=True) - with pytest.raises(TypeError): NoChunkEngine() # type: ignore[abstract] diff --git a/tests/test_engine_claude_chunk.py b/tests/test_engine_claude_chunk.py index d31c47b4..03a83e67 100644 --- a/tests/test_engine_claude_chunk.py +++ b/tests/test_engine_claude_chunk.py @@ -251,47 +251,3 @@ def test_fix_chunk_git_diff_exception(self, tmp_path: pathlib.Path) -> None: assert result.success is True assert result.files_modified == [] assert result.retries_used == 1 - - -class TestClaudeRunBuildCycle: - """Coverage for run_build_cycle legacy interface (lines 273-293).""" - - def test_run_build_cycle_no_specs(self, tmp_path: pathlib.Path) -> None: - """Lines 280-282: discover_incomplete_specs returns [] → immediate success.""" - engine = ClaudeCodeEngine() - git_manager = mock.MagicMock() - cache_manager = mock.MagicMock() - - with mock.patch("codelicious.spec_discovery.discover_incomplete_specs", return_value=[]): - result = engine.run_build_cycle(tmp_path, git_manager, cache_manager) - - assert result.success is True - assert "no incomplete" in result.message.lower() - - def test_run_build_cycle_with_specs(self, tmp_path: pathlib.Path) -> None: - """Lines 284-293: discover_incomplete_specs returns specs, delegates to V2Orchestrator.run.""" - engine = ClaudeCodeEngine() - git_manager = mock.MagicMock() - cache_manager = mock.MagicMock() - - fake_spec = tmp_path / "docs" / "specs" / "01_feature.md" - fake_spec.parent.mkdir(parents=True) - fake_spec.write_text("# Spec\n- [ ] task\n") - - from codelicious.orchestrator import OrchestratorResult - - orch_result = OrchestratorResult(success=True, message="all done", elapsed_s=1.5) - mock_orch = mock.MagicMock() - mock_orch.run.return_value = orch_result - - with mock.patch("codelicious.spec_discovery.discover_incomplete_specs", return_value=[fake_spec]): - with mock.patch("codelicious.orchestrator.V2Orchestrator", return_value=mock_orch) as mock_orch_cls: - result = engine.run_build_cycle(tmp_path, git_manager, cache_manager, push_pr=True) - - assert result.success is True - assert result.message == "all done" - mock_orch_cls.assert_called_once() - mock_orch.run.assert_called_once() - # push_pr kwarg should have been forwarded - call_kwargs = mock_orch.run.call_args.kwargs - assert call_kwargs.get("push_pr") is True diff --git a/tests/test_engine_huggingface_chunk.py b/tests/test_engine_huggingface_chunk.py index 6e014307..5a871ecc 100644 --- a/tests/test_engine_huggingface_chunk.py +++ b/tests/test_engine_huggingface_chunk.py @@ -92,6 +92,82 @@ def test_iteration_limit_returns_incomplete(self, tmp_path: pathlib.Path) -> Non assert "incomplete" in result.message.lower() +class TestHFReflectionGate: + """spec v29 Step 11: post-reflection verify must pass before HF returns success=True.""" + + def _setup_complete_signal(self, mock_llm: mock.MagicMock) -> None: + mock_llm.chat_completion.return_value = { + "choices": [{"message": {"role": "assistant", "content": "done. CHUNK_COMPLETE"}}] + } + mock_llm.parse_tool_calls.return_value = [] + mock_llm.parse_content.return_value = "done. CHUNK_COMPLETE" + mock_llm.planner_model = "test" + mock_llm.coder_model = "test" + mock_llm.endpoint_url = "https://test" + + def test_verify_fails_then_passes_after_one_fix(self, tmp_path: pathlib.Path) -> None: + """verify fails once, fix is invoked, second verify passes → success.""" + engine = HuggingFaceEngine() + chunk = FakeChunk() + context = EngineContext(spec_content="# Spec", deadline=0.0) + + mock_llm = mock.MagicMock() + self._setup_complete_signal(mock_llm) + mock_registry = mock.MagicMock() + mock_registry.generate_schema.return_value = [] + diff_mock = mock.MagicMock(returncode=0, stdout="src/a.py\n") + + verify_results = [ + ChunkResult(success=False, message="lint failed"), + ChunkResult(success=True, message="ok"), + ] + fix_result = ChunkResult(success=True, message="patched") + + with mock.patch("codelicious.llm_client.LLMClient", return_value=mock_llm): + with mock.patch("codelicious.tools.registry.ToolRegistry", return_value=mock_registry): + with mock.patch("codelicious.config.load_project_config", return_value={}): + with mock.patch("subprocess.run", return_value=diff_mock): + with mock.patch.object(engine, "verify_chunk", side_effect=verify_results) as m_verify: + with mock.patch.object(engine, "fix_chunk", return_value=fix_result) as m_fix: + result = engine.execute_chunk(chunk, tmp_path, context) + + assert result.success is True + assert m_fix.call_count == 1 + assert m_verify.call_count == 2 + + def test_verify_always_fails_returns_failure_after_two_attempts(self, tmp_path: pathlib.Path) -> None: + """When verify never passes, success=False after exactly 2 fix attempts.""" + engine = HuggingFaceEngine() + chunk = FakeChunk() + context = EngineContext(spec_content="# Spec", deadline=0.0) + + mock_llm = mock.MagicMock() + self._setup_complete_signal(mock_llm) + mock_registry = mock.MagicMock() + mock_registry.generate_schema.return_value = [] + diff_mock = mock.MagicMock(returncode=0, stdout="") + + with mock.patch("codelicious.llm_client.LLMClient", return_value=mock_llm): + with mock.patch("codelicious.tools.registry.ToolRegistry", return_value=mock_registry): + with mock.patch("codelicious.config.load_project_config", return_value={}): + with mock.patch("subprocess.run", return_value=diff_mock): + with mock.patch.object( + engine, + "verify_chunk", + return_value=ChunkResult(success=False, message="still failing"), + ) as m_verify: + with mock.patch.object( + engine, "fix_chunk", return_value=ChunkResult(success=True, message="x") + ) as m_fix: + result = engine.execute_chunk(chunk, tmp_path, context) + + assert result.success is False + assert "fix attempts" in result.message + # Initial verify + 2 retries = 3 verify calls; 2 fix attempts. + assert m_verify.call_count == 3 + assert m_fix.call_count == 2 + + class TestHFVerifyChunk: """HuggingFaceEngine.verify_chunk runs the verifier.""" @@ -555,60 +631,3 @@ def test_verify_import_error_treated_as_pass(self, tmp_path: pathlib.Path) -> No assert result.success is True assert "skipped" in result.message.lower() or "not available" in result.message.lower() - - -# --------------------------------------------------------------------------- -# run_build_cycle -# --------------------------------------------------------------------------- - - -class TestRunBuildCycle: - """run_build_cycle delegates to V2Orchestrator or returns early when no specs.""" - - def test_run_build_cycle_no_specs(self, tmp_path: pathlib.Path) -> None: - """When discover_incomplete_specs returns [], run_build_cycle returns success=True.""" - from codelicious.engines.base import BuildResult - - engine = HuggingFaceEngine() - - with mock.patch("codelicious.spec_discovery.discover_incomplete_specs", return_value=[]): - result = engine.run_build_cycle( - repo_path=tmp_path, - git_manager=mock.MagicMock(), - cache_manager=mock.MagicMock(), - ) - - assert isinstance(result, BuildResult) - assert result.success is True - assert "No incomplete specs" in result.message - - def test_run_build_cycle_delegates_to_v2(self, tmp_path: pathlib.Path) -> None: - """When specs are found, run_build_cycle instantiates V2Orchestrator and calls run().""" - from codelicious.engines.base import BuildResult - - engine = HuggingFaceEngine() - - fake_spec = mock.MagicMock() - fake_orch_result = mock.MagicMock() - fake_orch_result.success = True - fake_orch_result.message = "all done" - fake_orch_result.elapsed_s = 1.5 - - mock_orch_instance = mock.MagicMock() - mock_orch_instance.run.return_value = fake_orch_result - - with mock.patch("codelicious.spec_discovery.discover_incomplete_specs", return_value=[fake_spec]): - with mock.patch( - "codelicious.orchestrator.V2Orchestrator", return_value=mock_orch_instance - ) as mock_orch_cls: - result = engine.run_build_cycle( - repo_path=tmp_path, - git_manager=mock.MagicMock(), - cache_manager=mock.MagicMock(), - ) - - assert isinstance(result, BuildResult) - assert result.success is True - assert result.message == "all done" - mock_orch_cls.assert_called_once() - mock_orch_instance.run.assert_called_once() diff --git a/tests/test_git_orchestrator.py b/tests/test_git_orchestrator.py index 2397b32f..3ae3806c 100644 --- a/tests/test_git_orchestrator.py +++ b/tests/test_git_orchestrator.py @@ -2807,6 +2807,53 @@ def test_creates_new_branch(self, tmp_path: Path) -> None: mock_cmd.assert_any_call(["git", "checkout", "-b", "codelicious/spec-27-part-2"]) +class TestBranchDisambiguation: + """spec v30 Step 10: ``_disambiguate_branch`` resolves local/remote collisions.""" + + def _manager(self, tmp_path: Path) -> GitManager: + (tmp_path / ".git").mkdir() + return GitManager(tmp_path) + + def test_unique_name_unchanged(self, tmp_path: Path) -> None: + m = self._manager(tmp_path) + with ( + mock.patch.object(m, "_branch_exists_locally", return_value=False), + mock.patch.object(m, "_branch_exists_remotely", return_value=False), + ): + assert m._disambiguate_branch("codelicious/spec-27-part-2") == "codelicious/spec-27-part-2" + + def test_local_collision_appends_suffix(self, tmp_path: Path, caplog: pytest.LogCaptureFixture) -> None: + m = self._manager(tmp_path) + with ( + mock.patch.object(m, "_branch_exists_locally", side_effect=[True, False]), + mock.patch.object(m, "_branch_exists_remotely", return_value=False), + caplog.at_level("INFO", logger="codelicious"), + ): + disambiguated = m._disambiguate_branch("codelicious/spec-27-part-2", suffix_hint="abc123") + assert disambiguated == "codelicious/spec-27-part-2-abc123" + assert any("exists; using" in r.message for r in caplog.records) + + def test_remote_only_collision_disambiguated(self, tmp_path: Path) -> None: + m = self._manager(tmp_path) + with ( + mock.patch.object(m, "_branch_exists_locally", return_value=False), + mock.patch.object(m, "_branch_exists_remotely", side_effect=[True, False]), + ): + assert m._disambiguate_branch("c", suffix_hint="zz") == "c-zz" + + def test_double_collision_falls_back_to_timestamp(self, tmp_path: Path) -> None: + m = self._manager(tmp_path) + with ( + mock.patch.object(m, "_branch_exists_locally", side_effect=[True, True]), + mock.patch.object(m, "_branch_exists_remotely", return_value=False), + ): + disambiguated = m._disambiguate_branch("c", suffix_hint="x") + # Falls back to "c-<unix_ts>" — assert the prefix and that the suffix is digits. + assert disambiguated.startswith("c-") + assert disambiguated != "c-x" + assert disambiguated.split("-")[-1].isdigit() + + # --------------------------------------------------------------------------- # New coverage: push_to_origin() — all retries exhausted (transient) # --------------------------------------------------------------------------- @@ -3227,6 +3274,54 @@ def test_chunk_summaries_capped_at_50(self, tmp_path: Path) -> None: assert "chunk-50" not in body +class TestPrBodyChunkMetadata: + """spec v30 Step 9: ``chunk_metadata`` adds Chunk Context / Verifier Summary / Audit Log sections.""" + + def test_full_metadata_renders_all_sections(self, tmp_path: Path) -> None: + manager = GitManager(tmp_path) + meta = { + "spec_path": "docs/specs/27.md", + "chunk_id": "spec-27-chunk-04", + "chunk_title": "Wire orchestrator to engines", + "depends_on": ["spec-27-chunk-03"], + "dependents": ["spec-27-chunk-05"], + "verifier": { + "tests_passed": 1908, + "lint_warnings": 0, + "coverage_pct": 92.5, + "coverage_delta_pp": 0.4, + }, + "audit_log_path": ".codelicious/logs/build-20260504.log", + } + body = manager._build_pr_body(spec_id="27", chunk_summaries=None, prev_pr_url="", chunk_metadata=meta) + + assert "## Chunk Context" in body + assert "spec-27-chunk-04" in body + assert "spec-27-chunk-03" in body + assert "spec-27-chunk-05" in body + assert "## Verifier Summary" in body + assert "tests passed: 1908" in body + assert "92.5%" in body + assert "+0.4 pp" in body + assert "## Audit Log" in body + assert "build-20260504.log" in body + + def test_missing_fields_render_as_n_a(self, tmp_path: Path) -> None: + manager = GitManager(tmp_path) + body = manager._build_pr_body(spec_id="27", chunk_summaries=None, prev_pr_url="", chunk_metadata={}) + assert "## Chunk Context" in body + # Every renderable field falls back to "n/a" rather than raising. + assert body.count("n/a") >= 4 + + def test_no_metadata_omits_chunk_sections(self, tmp_path: Path) -> None: + """When chunk_metadata is None the new sections must NOT appear (back-compat).""" + manager = GitManager(tmp_path) + body = manager._build_pr_body(spec_id="27", chunk_summaries=None, prev_pr_url="", chunk_metadata=None) + assert "Chunk Context" not in body + assert "Verifier Summary" not in body + assert "Audit Log" not in body + + # --------------------------------------------------------------------------- # New coverage: _find_existing_pr_by_branch — timeout and JSON error # --------------------------------------------------------------------------- diff --git a/tests/test_llm_client.py b/tests/test_llm_client.py index 82d6bc5b..7b42e089 100644 --- a/tests/test_llm_client.py +++ b/tests/test_llm_client.py @@ -6,7 +6,7 @@ import ssl import urllib.error from datetime import datetime -from unittest.mock import call, patch +from unittest.mock import patch import pytest @@ -457,15 +457,18 @@ def test_os_error_retries_and_raises(self, client): assert "LLM Connection Error" in str(exc_info.value) def test_network_error_exponential_backoff_intervals(self, client): - """Sleep durations should follow exponential backoff: 1s, 2s, 4s.""" + """Sleep durations follow jittered exponential backoff in [0.5x, 1.5x] of 2**i (spec v29 Step 3).""" with patch("urllib.request.urlopen") as mock_urlopen, patch("time.sleep") as mock_sleep: mock_urlopen.side_effect = urllib.error.URLError("timeout") with pytest.raises(RuntimeError): client.chat_completion([{"role": "user", "content": "test"}]) - expected_sleeps = [call(client._BACKOFF_BASE_S * (2**i)) for i in range(client._MAX_RETRIES)] - assert mock_sleep.call_args_list == expected_sleeps + assert mock_sleep.call_count == client._MAX_RETRIES + for i, call_obj in enumerate(mock_sleep.call_args_list): + base = client._BACKOFF_BASE_S * (2**i) + actual = call_obj.args[0] + assert 0.5 * base <= actual <= 1.5 * base, f"attempt {i}: {actual} not in [{0.5 * base}, {1.5 * base}]" def test_network_error_succeeds_on_retry(self, client): """A transient network error should succeed once the connection recovers.""" diff --git a/tests/test_orchestrator.py b/tests/test_orchestrator.py deleted file mode 100644 index 5e57bf2f..00000000 --- a/tests/test_orchestrator.py +++ /dev/null @@ -1,1306 +0,0 @@ -"""Tests for the Orchestrator build loop, finding triage, and fix prompt.""" - -from __future__ import annotations - -import json -import logging -import pathlib -import subprocess -import threading -from unittest import mock - -import pytest - -from codelicious.git.git_orchestrator import GitManager -from codelicious.orchestrator import ( - REVIEWER_PROMPTS, - Finding, - Orchestrator, - OrchestratorResult, - ReviewRole, - _abort_merge, - _collect_review_findings, - _commit_worktree_changes, - _create_worktree, - _merge_worktree_branch, - _render_fix_prompt, - _triage_findings, -) - -# --------------------------------------------------------------------------- -# Finding triage -# --------------------------------------------------------------------------- - - -class TestTriageFindings: - """Tests for severity-based sorting and deduplication.""" - - def test_sorts_by_severity(self): - findings = [ - Finding(role="qa", severity="P3", file="a.py", line=1, title="minor", description="", fix=""), - Finding(role="sec", severity="P1", file="b.py", line=2, title="critical", description="", fix=""), - Finding(role="perf", severity="P2", file="c.py", line=3, title="medium", description="", fix=""), - ] - result = _triage_findings(findings) - assert [f.severity for f in result] == ["P1", "P2", "P3"] - - def test_deduplicates_by_file_line(self): - findings = [ - Finding(role="qa", severity="P2", file="a.py", line=10, title="first", description="", fix=""), - Finding(role="sec", severity="P1", file="a.py", line=10, title="second", description="", fix=""), - ] - result = _triage_findings(findings) - # P1 sorts first, so the P1 finding wins the dedup - assert len(result) == 1 - assert result[0].severity == "P1" - - def test_empty_list(self): - assert _triage_findings([]) == [] - - def test_unknown_severity_sorts_after_p3(self): - """Findings with non-standard severity values (P0, UNKNOWN) sort after P3 and are preserved.""" - findings = [ - Finding(role="qa", severity="P3", file="a.py", line=1, title="minor", description="", fix=""), - Finding(role="sec", severity="UNKNOWN", file="b.py", line=2, title="mystery", description="", fix=""), - Finding(role="perf", severity="P0", file="c.py", line=3, title="critical-plus", description="", fix=""), - ] - result = _triage_findings(findings) - # All three findings must be preserved (distinct file+line keys) - assert len(result) == 3 - # P3 sorts before both unknowns (which fall into the default bucket, order=9) - assert result[0].severity == "P3" - # Both non-standard severities appear after P3 - unknown_severities = {f.severity for f in result[1:]} - assert unknown_severities == {"UNKNOWN", "P0"} - - -# --------------------------------------------------------------------------- -# Review findings collection -# --------------------------------------------------------------------------- - - -class TestCollectReviewFindings: - """Tests for parsing JSON review files.""" - - def test_reads_valid_json(self, tmp_path: pathlib.Path): - review_file = tmp_path / ".codelicious" / "review_security.json" - review_file.parent.mkdir(parents=True) - review_file.write_text( - json.dumps( - [ - { - "severity": "P1", - "file": "x.py", - "line": 5, - "title": "issue", - "description": "desc", - "fix": "fix", - }, - ] - ) - ) - findings = _collect_review_findings(tmp_path, "security") - assert len(findings) == 1 - assert findings[0].severity == "P1" - assert findings[0].role == "security" - - def test_missing_file_returns_empty(self, tmp_path: pathlib.Path): - assert _collect_review_findings(tmp_path, "nonexistent") == [] - - def test_malformed_json_returns_empty(self, tmp_path: pathlib.Path): - review_file = tmp_path / ".codelicious" / "review_qa.json" - review_file.parent.mkdir(parents=True) - review_file.write_text("not json") - assert _collect_review_findings(tmp_path, "qa") == [] - - def test_non_array_json_returns_empty(self, tmp_path: pathlib.Path): - review_file = tmp_path / ".codelicious" / "review_qa.json" - review_file.parent.mkdir(parents=True) - review_file.write_text(json.dumps({"not": "an array"})) - assert _collect_review_findings(tmp_path, "qa") == [] - - def test_permission_error_returns_empty(self, tmp_path: pathlib.Path): - """When read_text raises PermissionError (OSError subclass), return [].""" - review_file = tmp_path / ".codelicious" / "review_security.json" - review_file.parent.mkdir(parents=True) - review_file.write_text("[]") - - with mock.patch.object(pathlib.Path, "read_text", side_effect=PermissionError("access denied")): - result = _collect_review_findings(tmp_path, "security") - - assert result == [] - - -# --------------------------------------------------------------------------- -# Fix prompt rendering -# --------------------------------------------------------------------------- - - -class TestRenderFixPrompt: - """Tests for the fix prompt template.""" - - def test_includes_no_git_warning(self): - prompt = _render_fix_prompt("myproject", []) - assert "Do NOT run git" in prompt - assert "git add" in prompt and "git commit" in prompt - - def test_includes_findings(self): - findings = [ - Finding(role="sec", severity="P1", file="a.py", line=10, title="XSS", description="bad", fix="escape"), - ] - prompt = _render_fix_prompt("myproject", findings) - assert "XSS" in prompt - assert "a.py:10" in prompt - assert "P1" in prompt - - def test_no_findings(self): - prompt = _render_fix_prompt("myproject", []) - assert "No findings to fix" in prompt - - -# --------------------------------------------------------------------------- -# Orchestrator build loop -# --------------------------------------------------------------------------- - - -class TestOrchestratorRun: - """Tests for the orchestrator's build→merge→review→fix loop.""" - - @pytest.fixture - def mock_git_manager(self): - mgr = mock.MagicMock(spec=GitManager) - mgr.commit_verified_changes.return_value = None - mgr.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") - mgr.ensure_draft_pr_exists.return_value = None - return mgr - - @pytest.fixture - def mock_config(self): - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return C() - - def test_all_specs_already_complete(self, tmp_path: pathlib.Path, mock_git_manager, mock_config): - """When all specs are already complete, the loop exits immediately with no builds.""" - spec = tmp_path / "spec.md" - spec.write_text("- [x] done\n- [x] also done\n") - - orch = Orchestrator(tmp_path, mock_git_manager, mock_config) - - # Mock _phase_build, _phase_review and _phase_fix to avoid running actual agents - with mock.patch.object(orch, "_phase_build", return_value=[]) as mock_build: - with mock.patch.object(orch, "_phase_review", return_value=[]): - with mock.patch.object(orch, "_phase_fix", return_value=True): - result = orch.run(specs=[spec], reviewers=[], max_build_cycles=5) - - assert result.success is True - assert result.cycles_completed == 0 - mock_build.assert_not_called() - - def test_consecutive_failures_abort(self, tmp_path: pathlib.Path, mock_git_manager, mock_config): - """3 consecutive build failures cause the loop to abort.""" - spec = tmp_path / "spec.md" - spec.write_text("- [ ] never built\n") - - orch = Orchestrator(tmp_path, mock_git_manager, mock_config) - - # Mock _phase_build to always fail - with mock.patch.object(orch, "_phase_build", return_value=[("branch", False)]): - with mock.patch.object(orch, "_phase_review", return_value=[]): - with mock.patch.object(orch, "_phase_fix", return_value=True): - with mock.patch( - "codelicious.prompts.scan_remaining_tasks_for_spec", - return_value=1, - ): - result = orch.run(specs=[spec], reviewers=[], max_build_cycles=10) - - assert result.success is False - assert result.cycles_completed == 3 # aborted after 3 - - def test_empty_specs_list(self, tmp_path: pathlib.Path, mock_git_manager, mock_config): - """Empty specs list should complete immediately.""" - orch = Orchestrator(tmp_path, mock_git_manager, mock_config) - - with mock.patch.object(orch, "_phase_review", return_value=[]): - with mock.patch.object(orch, "_phase_fix", return_value=True): - result = orch.run(specs=[], reviewers=[], max_build_cycles=5) - - assert result.success is True - assert result.cycles_completed == 0 - - def test_build_without_build_complete_reports_failure(self, tmp_path: pathlib.Path, mock_git_manager, mock_config): - """Agent exits cleanly but doesn't write BUILD_COMPLETE → build fails.""" - spec = tmp_path / "spec.md" - spec.write_text("- [ ] not built\n") - - orch = Orchestrator(tmp_path, mock_git_manager, mock_config) - - # Mock _build_spec_in_worktree to simulate agent that exits ok but - # never writes BUILD_COMPLETE (the old bug behavior). - # Instead, test the actual logic by mocking _run_agent and worktree ops. - mock_result = mock.MagicMock(success=True) # process exited ok - - with mock.patch.object(orch, "_run_agent", return_value=mock_result): - with mock.patch("codelicious.orchestrator._create_worktree", return_value=tmp_path / "wt"): - with mock.patch("codelicious.orchestrator._remove_worktree"): - with mock.patch("codelicious.orchestrator._commit_worktree_changes", return_value=True): - # Create worktree dir but NO BUILD_COMPLETE file - wt = tmp_path / "wt" - wt.mkdir() - (wt / ".codelicious").mkdir() - # Copy spec into worktree - (wt / "spec.md").write_text("- [ ] not built\n") - - _, success = orch._build_spec_in_worktree(spec) - - # Agent exited ok, but no BUILD_COMPLETE → should be False - assert success is False - - def test_spec_becomes_complete_after_build(self, tmp_path: pathlib.Path, mock_git_manager, mock_config): - """Build loop exits when the spec becomes complete after a build cycle.""" - spec = tmp_path / "spec.md" - spec.write_text("- [ ] build me\n") - - orch = Orchestrator(tmp_path, mock_git_manager, mock_config) - - def fake_build(specs, workers): - # Simulate the agent checking off the task - spec.write_text("- [x] build me\n") - return [("codelicious/spec", True)] - - with mock.patch.object(orch, "_phase_build", side_effect=fake_build): - with mock.patch.object(orch, "_phase_merge", return_value=1): - with mock.patch.object(orch, "_phase_review", return_value=[]): - with mock.patch.object(orch, "_phase_fix", return_value=True): - result = orch.run(specs=[spec], reviewers=[], max_build_cycles=10) - - assert result.success is True - # Build ran once, then the loop detected completion on the next iteration - assert result.cycles_completed == 1 - # Verify push was called (mid-cycle + final = at least 2 calls) - assert mock_git_manager.push_to_origin.call_count >= 2 - - -class TestPhaseBuildConcurrentCounter: - """Tests that _phase_build's completed_count is updated correctly under concurrency.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path): - git_manager = mock.MagicMock(spec=GitManager) - git_manager.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_all_successes_counted(self, tmp_path: pathlib.Path, orch: Orchestrator): - """All successful futures must be counted exactly once each.""" - specs = [tmp_path / f"spec_{i}.md" for i in range(8)] - for s in specs: - s.write_text("") - - # Mock _build_spec_in_worktree to return (branch, True) after a brief pause - # so futures resolve with genuine overlap. - barrier = threading.Barrier(len(specs)) - - def fake_build(spec: pathlib.Path): - barrier.wait(timeout=5) - return (f"codelicious/build-{spec.stem}", True) - - with mock.patch.object(orch, "_build_spec_in_worktree", side_effect=fake_build): - results = orch._phase_build(specs, max_workers=len(specs)) - - assert len(results) == len(specs) - assert all(ok for _, ok in results) - - def test_exception_futures_counted(self, tmp_path: pathlib.Path, orch: Orchestrator): - """Futures that raise must be counted, not silently dropped.""" - specs = [tmp_path / f"spec_{i}.md" for i in range(4)] - for s in specs: - s.write_text("") - - barrier = threading.Barrier(len(specs)) - - def fake_build_raises(spec: pathlib.Path): - barrier.wait(timeout=5) - raise RuntimeError("worker exploded") - - with mock.patch.object(orch, "_build_spec_in_worktree", side_effect=fake_build_raises): - results = orch._phase_build(specs, max_workers=len(specs)) - - # All specs should still produce a result entry (failed) - assert len(results) == len(specs) - assert all(not ok for _, ok in results) - - def test_mixed_success_and_failure_counted(self, tmp_path: pathlib.Path, orch: Orchestrator): - """A mix of successes and exceptions must all be counted exactly once.""" - specs = [tmp_path / f"spec_{i}.md" for i in range(6)] - for s in specs: - s.write_text("") - - barrier = threading.Barrier(len(specs)) - - def fake_build_mixed(spec: pathlib.Path): - barrier.wait(timeout=5) - idx = int(spec.stem.rsplit("_", 1)[-1]) - if idx % 2 == 0: - raise RuntimeError("even spec fails") - return (f"codelicious/build-{spec.stem}", True) - - with mock.patch.object(orch, "_build_spec_in_worktree", side_effect=fake_build_mixed): - results = orch._phase_build(specs, max_workers=len(specs)) - - assert len(results) == len(specs) - successes = [ok for _, ok in results if ok] - failures = [ok for _, ok in results if not ok] - assert len(successes) == 3 - assert len(failures) == 3 - - -# --------------------------------------------------------------------------- -# Finding 7 — _commit_worktree_changes -# --------------------------------------------------------------------------- - - -class TestCommitWorktreeChanges: - """Tests for _commit_worktree_changes error paths.""" - - def test_staging_timeout_returns_false(self, tmp_path: pathlib.Path): - """A timeout while staging all files returns False.""" - with mock.patch( - "codelicious.orchestrator.subprocess.run", - side_effect=subprocess.TimeoutExpired(cmd="git add", timeout=120), - ): - result = _commit_worktree_changes(tmp_path, "spec.md") - assert result is False - - def test_diff_check_timeout_returns_false(self, tmp_path: pathlib.Path): - """A timeout on the diff --cached check returns False.""" - add_ok = mock.MagicMock(returncode=0) - - def _fake_run(cmd, **kwargs): - if "add" in cmd: - return add_ok - # diff --cached - raise subprocess.TimeoutExpired(cmd=cmd, timeout=120) - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=_fake_run): - result = _commit_worktree_changes(tmp_path, "spec.md") - assert result is False - - def test_clean_worktree_returns_false(self, tmp_path: pathlib.Path): - """When diff --cached exits 0 (nothing staged), returns False without committing.""" - add_ok = mock.MagicMock(returncode=0) - diff_clean = mock.MagicMock(returncode=0) # 0 = no staged changes - - def _fake_run(cmd, **kwargs): - if "add" in cmd: - return add_ok - return diff_clean - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=_fake_run): - result = _commit_worktree_changes(tmp_path, "spec.md") - assert result is False - - def test_gpg_failure_falls_back_to_no_gpg_sign(self, tmp_path: pathlib.Path): - """A GPG-related commit failure triggers a --no-gpg-sign retry.""" - add_ok = mock.MagicMock(returncode=0) - diff_dirty = mock.MagicMock(returncode=1) # 1 = staged changes exist - gpg_fail = mock.MagicMock(returncode=1, stderr="error: gpg failed to sign the data") - unsigned_ok = mock.MagicMock(returncode=0) - - calls = iter([add_ok, diff_dirty, gpg_fail, unsigned_ok]) - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=lambda *a, **kw: next(calls)): - result = _commit_worktree_changes(tmp_path, "spec.md") - assert result is True - - def test_unsigned_commit_timeout_returns_false(self, tmp_path: pathlib.Path): - """Timeout on the --no-gpg-sign fallback commit returns False.""" - add_ok = mock.MagicMock(returncode=0) - diff_dirty = mock.MagicMock(returncode=1) - gpg_fail = mock.MagicMock(returncode=1, stderr="gpg signing failed: secret key not available") - - def _fake_run(cmd, **kwargs): - if "add" in cmd: - return add_ok - if "diff" in cmd: - return diff_dirty - if "--no-gpg-sign" in cmd: - raise subprocess.TimeoutExpired(cmd=cmd, timeout=120) - # First commit attempt (no --no-gpg-sign yet) - return gpg_fail - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=_fake_run): - result = _commit_worktree_changes(tmp_path, "spec.md") - assert result is False - - def test_gpg_fallback_succeeds_returns_true(self, tmp_path: pathlib.Path): - """When the first commit fails with a GPG error and the --no-gpg-sign retry succeeds, returns True.""" - add_ok = mock.MagicMock(returncode=0) - diff_dirty = mock.MagicMock(returncode=1) # 1 = staged changes exist - gpg_fail = mock.MagicMock(returncode=1, stderr="gpg: signing failed: secret key not available") - unsigned_ok = mock.MagicMock(returncode=0) - - calls = iter([add_ok, diff_dirty, gpg_fail, unsigned_ok]) - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=lambda *a, **kw: next(calls)): - result = _commit_worktree_changes(tmp_path, "spec.md") - assert result is True - - def test_gpg_fallback_also_fails_returns_false(self, tmp_path: pathlib.Path): - """When both the initial commit and the --no-gpg-sign fallback fail, returns False.""" - add_ok = mock.MagicMock(returncode=0) - diff_dirty = mock.MagicMock(returncode=1) # 1 = staged changes exist - gpg_fail = mock.MagicMock(returncode=1, stderr="gpg: signing failed: secret key not available") - unsigned_fail = mock.MagicMock(returncode=1, stderr="error: commit failed") - - calls = iter([add_ok, diff_dirty, gpg_fail, unsigned_fail]) - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=lambda *a, **kw: next(calls)): - result = _commit_worktree_changes(tmp_path, "spec.md") - assert result is False - - -# --------------------------------------------------------------------------- -# Finding 8 — data-loss guard: commit fails after successful build -# --------------------------------------------------------------------------- - - -class TestDataLossGuard: - """When commit fails after a successful build the worktree must be preserved.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path): - git_manager = mock.MagicMock(spec=GitManager) - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_commit_failure_after_success_returns_false(self, tmp_path: pathlib.Path, orch: Orchestrator): - """If _commit_worktree_changes returns False after a successful build, success is False.""" - spec = tmp_path / "spec.md" - spec.write_text("- [ ] a task\n") - - worktree = tmp_path / "wt" - worktree.mkdir() - (worktree / ".codelicious").mkdir() - (worktree / "spec.md").write_text("- [x] a task\n") - # Write BUILD_COMPLETE so agent_done is True - (worktree / ".codelicious" / "BUILD_COMPLETE").write_text("DONE") - - mock_result = mock.MagicMock(success=True) - - remove_worktree = mock.MagicMock() - - with mock.patch.object(orch, "_run_agent", return_value=mock_result): - with mock.patch("codelicious.orchestrator._create_worktree", return_value=worktree): - with mock.patch("codelicious.orchestrator._remove_worktree", remove_worktree): - with mock.patch("codelicious.orchestrator._commit_worktree_changes", return_value=False): - _, success = orch._build_spec_in_worktree(spec) - - assert success is False - # Worktree must be preserved (not removed) to prevent data loss - remove_worktree.assert_not_called() - - -# --------------------------------------------------------------------------- -# Finding 9 — _create_worktree -# --------------------------------------------------------------------------- - - -class TestCreateWorktree: - """Tests for _create_worktree error and fallback paths.""" - - def test_branch_exists_uses_fallback(self, tmp_path: pathlib.Path): - """When the first worktree add fails (branch exists), the fallback without -b succeeds.""" - fail = mock.MagicMock(returncode=1, stderr="already exists") - success = mock.MagicMock(returncode=0) - - # Call order: optional stale-remove (skipped — dir doesn't exist yet), - # first add (-b), fallback add (no -b) - responses = iter([fail, success]) - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=lambda *a, **kw: next(responses)): - result = _create_worktree(tmp_path, "codelicious/test-branch") - # Should return the expected worktree path without raising - assert result == tmp_path / ".codelicious" / "worktrees" / "codelicious/test-branch" - - def test_first_add_timeout_raises_runtime_error(self, tmp_path: pathlib.Path): - """A timeout on the primary worktree add raises RuntimeError.""" - - def _fake_run(cmd, **kwargs): - if "add" in cmd and "-b" in cmd: - raise subprocess.TimeoutExpired(cmd=cmd, timeout=120) - return mock.MagicMock(returncode=0) - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=_fake_run): - with pytest.raises(RuntimeError, match="Timed out creating worktree"): - _create_worktree(tmp_path, "codelicious/test-branch") - - -# --------------------------------------------------------------------------- -# Finding 10 — _abort_merge -# --------------------------------------------------------------------------- - - -class TestAbortMerge: - """Tests for _abort_merge error and timeout paths.""" - - def test_non_zero_abort_logs_critical(self, tmp_path: pathlib.Path, caplog): - """When git merge --abort returns non-zero, a CRITICAL message is logged.""" - fail = mock.MagicMock(returncode=1, stderr="nothing to abort") - - with mock.patch("codelicious.orchestrator.subprocess.run", return_value=fail): - with caplog.at_level("CRITICAL", logger="codelicious.orchestrator"): - _abort_merge(tmp_path) - - assert any("abort failed" in r.message.lower() for r in caplog.records) - - def test_timeout_logs_critical_dirty_state(self, tmp_path: pathlib.Path, caplog): - """A timeout on git merge --abort logs a CRITICAL warning about dirty state.""" - with ( - mock.patch( - "codelicious.orchestrator.subprocess.run", - side_effect=subprocess.TimeoutExpired(cmd="git merge", timeout=30), - ), - caplog.at_level("CRITICAL", logger="codelicious.orchestrator"), - ): - _abort_merge(tmp_path) - - assert any("dirty state" in r.message.lower() for r in caplog.records) - - -# --------------------------------------------------------------------------- -# Finding 11 — _merge_worktree_branch -# --------------------------------------------------------------------------- - - -class TestMergeWorktreeBranch: - """Tests for _merge_worktree_branch success, conflict and timeout paths.""" - - def test_successful_merge_returns_true(self, tmp_path: pathlib.Path): - """A zero-returncode merge returns True.""" - ok = mock.MagicMock(returncode=0) - with mock.patch("codelicious.orchestrator.subprocess.run", return_value=ok): - result = _merge_worktree_branch(tmp_path, "codelicious/feat") - assert result is True - - def test_merge_conflict_calls_abort_and_returns_false(self, tmp_path: pathlib.Path): - """A non-zero merge result calls _abort_merge and returns False.""" - conflict = mock.MagicMock(returncode=1, stderr="CONFLICT") - - with mock.patch("codelicious.orchestrator.subprocess.run", return_value=conflict): - with mock.patch("codelicious.orchestrator._abort_merge") as mock_abort: - result = _merge_worktree_branch(tmp_path, "codelicious/feat") - - assert result is False - mock_abort.assert_called_once_with(tmp_path) - - def test_timeout_calls_abort_and_returns_false(self, tmp_path: pathlib.Path): - """A timeout on git merge calls _abort_merge and returns False.""" - with ( - mock.patch( - "codelicious.orchestrator.subprocess.run", - side_effect=subprocess.TimeoutExpired(cmd="git merge", timeout=120), - ), - mock.patch("codelicious.orchestrator._abort_merge") as mock_abort, - ): - result = _merge_worktree_branch(tmp_path, "codelicious/feat") - - assert result is False - mock_abort.assert_called_once_with(tmp_path) - - -# --------------------------------------------------------------------------- -# Finding 12 — Orchestrator.run() loop edge cases -# --------------------------------------------------------------------------- - - -class TestOrchestratorRunLoop: - """Tests for loop-abort logic and commit-failure tolerance in run().""" - - @pytest.fixture - def mock_git_manager(self): - mgr = mock.MagicMock(spec=GitManager) - mgr.commit_verified_changes.return_value = None - mgr.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") - mgr.ensure_draft_pr_exists.return_value = None - return mgr - - @pytest.fixture - def mock_config(self): - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return C() - - def test_zero_progress_for_three_cycles_aborts(self, tmp_path: pathlib.Path, mock_git_manager, mock_config): - """_phase_build returning all failures for 3 consecutive cycles aborts the loop.""" - spec = tmp_path / "spec.md" - spec.write_text("- [ ] never built\n") - - orch = Orchestrator(tmp_path, mock_git_manager, mock_config) - - with mock.patch.object(orch, "_phase_build", return_value=[("codelicious/spec", False)]): - with mock.patch.object(orch, "_phase_review", return_value=[]): - with mock.patch.object(orch, "_phase_fix", return_value=True): - with mock.patch( - "codelicious.prompts.scan_remaining_tasks_for_spec", - return_value=1, - ): - result = orch.run(specs=[spec], reviewers=[], max_build_cycles=10) - - assert result.success is False - assert result.cycles_completed == 3 - - def test_commit_raises_does_not_crash_run(self, tmp_path: pathlib.Path, mock_git_manager, mock_config): - """An exception from commit_verified_changes must not propagate out of run().""" - spec = tmp_path / "spec.md" - spec.write_text("- [x] already done\n") - - mock_git_manager.commit_verified_changes.side_effect = RuntimeError("disk full") - - orch = Orchestrator(tmp_path, mock_git_manager, mock_config) - - with mock.patch.object(orch, "_phase_review", return_value=[]): - with mock.patch.object(orch, "_phase_fix", return_value=True): - result = orch.run(specs=[spec], reviewers=[], max_build_cycles=5) - - # run() must still return a valid OrchestratorResult regardless of commit errors - assert isinstance(result, OrchestratorResult) - - -# --------------------------------------------------------------------------- -# Finding 13 — spec-not-in-worktree fallback -# --------------------------------------------------------------------------- - - -class TestSpecNotInWorktreeFallback: - """Tests for the fallback prompt when a spec path is outside the worktree.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path): - git_manager = mock.MagicMock(spec=GitManager) - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_spec_outside_repo_logs_warning_and_uses_fallback(self, tmp_path: pathlib.Path, orch: Orchestrator, caplog): - """A spec path not under repo_path logs a warning and uses the filename as fallback.""" - # Create a spec outside the repo (different tmp directory) - outside_dir = tmp_path / "outside" - outside_dir.mkdir() - spec = outside_dir / "myspec.md" - spec.write_text("- [ ] build task\n") - - # Use a completely different repo_path so spec is definitely outside it - other_repo = tmp_path / "repo" - other_repo.mkdir() - orch.repo_path = other_repo - - worktree = tmp_path / "wt" - worktree.mkdir() - (worktree / ".codelicious").mkdir() - - captured_prompts: list[str] = [] - - def fake_run_agent(prompt, project_root, session_id=""): - captured_prompts.append(prompt) - return mock.MagicMock(success=False) - - with mock.patch.object(orch, "_run_agent", side_effect=fake_run_agent): - with mock.patch("codelicious.orchestrator._create_worktree", return_value=worktree): - with mock.patch("codelicious.orchestrator._remove_worktree"): - with mock.patch("codelicious.orchestrator._commit_worktree_changes", return_value=False): - with caplog.at_level("WARNING", logger="codelicious.orchestrator"): - orch._build_spec_in_worktree(spec) - - warning_messages = [r.message for r in caplog.records if r.levelname == "WARNING"] - assert any("not under repo" in m for m in warning_messages) - # The agent should have been called with the filename-based fallback path - assert len(captured_prompts) == 1 - assert "myspec.md" in captured_prompts[0] - - def test_spec_missing_in_worktree_uses_fallback_prompt(self, tmp_path: pathlib.Path, orch: Orchestrator, caplog): - """When the resolved spec path doesn't exist in the worktree, the agent gets a fallback prompt.""" - spec = tmp_path / "docs" / "spec_missing.md" - spec.parent.mkdir(parents=True, exist_ok=True) - spec.write_text("- [ ] build task\n") - - worktree = tmp_path / "wt" - worktree.mkdir() - (worktree / ".codelicious").mkdir() - # Intentionally do NOT create worktree / "docs" / "spec_missing.md" - - captured_prompts: list[str] = [] - - def fake_run_agent(prompt, project_root, session_id=""): - captured_prompts.append(prompt) - return mock.MagicMock(success=False) - - with mock.patch.object(orch, "_run_agent", side_effect=fake_run_agent): - with mock.patch("codelicious.orchestrator._create_worktree", return_value=worktree): - with mock.patch("codelicious.orchestrator._remove_worktree"): - with mock.patch("codelicious.orchestrator._commit_worktree_changes", return_value=False): - with caplog.at_level("WARNING", logger="codelicious.orchestrator"): - orch._build_spec_in_worktree(spec) - - warning_messages = [r.message for r in caplog.records if r.levelname == "WARNING"] - assert any("not found in worktree" in m for m in warning_messages) - assert len(captured_prompts) == 1 - assert "spec_missing.md" in captured_prompts[0] - - -# --------------------------------------------------------------------------- -# Finding 68 — _phase_build parallel error path -# --------------------------------------------------------------------------- - - -class TestPhaseBuildParallelErrorPath: - """Tests that _phase_build catches exceptions from one worker while - allowing the rest to succeed, and that the failed spec produces a - (branch, False) result.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path) -> Orchestrator: - git_manager = mock.MagicMock() - git_manager.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_one_worker_raises_caught_logged_and_false_returned( - self, tmp_path: pathlib.Path, orch: Orchestrator, caplog - ): - """When one future raises, the exception must be caught, an error - logged, and (branch, False) returned for that spec while the other - spec's success result is preserved.""" - spec_ok = tmp_path / "spec_ok.md" - spec_fail = tmp_path / "spec_fail.md" - spec_ok.write_text("") - spec_fail.write_text("") - - barrier = threading.Barrier(2) - - def fake_build(spec: pathlib.Path) -> tuple[str, bool]: - barrier.wait(timeout=5) - if spec == spec_fail: - raise RuntimeError("worker exploded") - return (f"codelicious/build-{spec.stem}", True) - - with mock.patch.object(orch, "_build_spec_in_worktree", side_effect=fake_build): - with caplog.at_level("ERROR", logger="codelicious.orchestrator"): - results = orch._phase_build([spec_ok, spec_fail], max_workers=2) - - assert len(results) == 2 - # The failing spec must produce (branch, False) - fail_results = [(b, ok) for b, ok in results if not ok] - assert len(fail_results) == 1 - # The failing branch name is derived from spec.stem - assert "spec_fail" in fail_results[0][0] - # The success spec must still produce (branch, True) - ok_results = [(b, ok) for b, ok in results if ok] - assert len(ok_results) == 1 - # An error must have been logged for the exception - error_msgs = [r.message for r in caplog.records if r.levelno >= logging.ERROR] - assert any("worker exploded" in m or "spec_fail" in m for m in error_msgs) - - -# --------------------------------------------------------------------------- -# Finding 69 — _phase_merge -# --------------------------------------------------------------------------- - - -class TestPhaseMerge: - """Tests for _phase_merge success, conflict, and all-failures paths.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path) -> Orchestrator: - # Finding 82: use spec=GitManager so attribute access is validated - git_manager = mock.MagicMock(spec=GitManager) - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_all_failures_returns_zero_merged(self, orch: Orchestrator): - """When every build result is False, _phase_merge returns 0.""" - build_results = [("codelicious/spec-a", False), ("codelicious/spec-b", False)] - result = orch._phase_merge(build_results) - assert result == 0 - - def test_successful_merge_deletes_branch(self, tmp_path: pathlib.Path, orch: Orchestrator): - """A branch that merges successfully must have _delete_branch called on it.""" - build_results = [("codelicious/spec-ok", True)] - - with mock.patch("codelicious.orchestrator._merge_worktree_branch", return_value=True) as mock_merge: - with mock.patch("codelicious.orchestrator._delete_branch") as mock_del: - merged = orch._phase_merge(build_results) - - assert merged == 1 - mock_merge.assert_called_once_with(orch.repo_path, "codelicious/spec-ok") - mock_del.assert_called_once_with(orch.repo_path, "codelicious/spec-ok") - - def test_merge_conflict_logs_warning_and_skips_delete(self, tmp_path: pathlib.Path, orch: Orchestrator, caplog): - """A merge conflict must log a warning and not call _delete_branch.""" - build_results = [("codelicious/spec-conflict", True)] - - with mock.patch("codelicious.orchestrator._merge_worktree_branch", return_value=False): - with mock.patch("codelicious.orchestrator._delete_branch") as mock_del: - with caplog.at_level("WARNING", logger="codelicious.orchestrator"): - merged = orch._phase_merge(build_results) - - assert merged == 0 - mock_del.assert_not_called() - warning_msgs = [r.message for r in caplog.records if r.levelname == "WARNING"] - assert any("conflict" in m.lower() or "merge" in m.lower() for m in warning_msgs) - - -# --------------------------------------------------------------------------- -# Finding 70 — _phase_review parallel path -# --------------------------------------------------------------------------- - - -class TestPhaseReviewParallelPath: - """Tests for _phase_review error handling in the parallel path.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path) -> Orchestrator: - git_manager = mock.MagicMock() - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_one_reviewer_raises_caught_remaining_findings_collected( - self, tmp_path: pathlib.Path, orch: Orchestrator, caplog - ): - """When one reviewer future raises, the exception is caught, an error - logged, and the findings from the remaining reviewer are still returned.""" - good_finding = Finding( - role="qa", - severity="P2", - file="src/foo.py", - line=10, - title="missing test", - description="untested path", - fix="add test", - ) - - def fake_reviewer(role: str) -> list[Finding]: - if role == "security": - raise RuntimeError("security agent crashed") - return [good_finding] - - with mock.patch.object(orch, "_run_reviewer", side_effect=fake_reviewer): - with caplog.at_level("ERROR", logger="codelicious.orchestrator"): - results = orch._phase_review(["security", "qa"], max_workers=2) - - # The QA finding must still be present - assert any(f.role == "qa" for f in results) - # An error must have been logged for the exception - error_msgs = [r.message for r in caplog.records if r.levelno >= logging.ERROR] - assert any("security" in m.lower() or "crashed" in m.lower() for m in error_msgs) - - -# --------------------------------------------------------------------------- -# Finding 6 — push_pr=True code path -# --------------------------------------------------------------------------- - - -class TestPushPrPath: - """Tests for the push_pr=True branch in Orchestrator.run().""" - - @pytest.fixture - def mock_config(self): - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return C() - - def test_push_pr_true_calls_ensure_draft_pr_exists(self, tmp_path: pathlib.Path, mock_config): - """When push_pr=True, ensure_draft_pr_exists() is called with spec_id and spec_summary.""" - from codelicious.git.git_orchestrator import GitManager - - git_manager = mock.MagicMock(spec=GitManager) - git_manager.commit_verified_changes.return_value = None - git_manager.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") - git_manager.ensure_draft_pr_exists.return_value = None - - spec = tmp_path / "16_test_spec.md" - spec.write_text("- [x] already done\n") - - orch = Orchestrator(tmp_path, git_manager, mock_config) - - with mock.patch.object(orch, "_phase_review", return_value=[]): - with mock.patch.object(orch, "_phase_fix", return_value=True): - orch.run(specs=[spec], reviewers=[], max_build_cycles=5, push_pr=True) - - git_manager.ensure_draft_pr_exists.assert_called_once() - call_kwargs = git_manager.ensure_draft_pr_exists.call_args.kwargs - assert call_kwargs["spec_id"] == "16" - assert "spec_summary" in call_kwargs - - def test_push_pr_true_exception_logs_warning_and_run_returns(self, tmp_path: pathlib.Path, mock_config, caplog): - """When ensure_draft_pr_exists() raises, a warning is logged and run() does not crash.""" - from codelicious.git.git_orchestrator import GitManager - - git_manager = mock.MagicMock(spec=GitManager) - git_manager.commit_verified_changes.return_value = None - git_manager.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") - git_manager.ensure_draft_pr_exists.side_effect = RuntimeError("gh CLI not found") - - spec = tmp_path / "22_test_spec.md" - spec.write_text("- [x] already done\n") - - orch = Orchestrator(tmp_path, git_manager, mock_config) - - with mock.patch.object(orch, "_phase_review", return_value=[]): - with mock.patch.object(orch, "_phase_fix", return_value=True): - with caplog.at_level("WARNING", logger="codelicious.orchestrator"): - result = orch.run(specs=[spec], reviewers=[], max_build_cycles=5, push_pr=True) - - # run() must return a valid result despite the PR creation failure - assert isinstance(result, OrchestratorResult) - warning_msgs = [r.message for r in caplog.records if r.levelname == "WARNING"] - assert any("PR creation" in m or "gh CLI not found" in m for m in warning_msgs) - - -# --------------------------------------------------------------------------- -# Finding 71 — _phase_fix -# --------------------------------------------------------------------------- - - -class TestPhaseFix: - """Tests for _phase_fix short-circuit and agent-failure paths.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path) -> Orchestrator: - # Finding 82: use spec=GitManager so attribute access is validated - git_manager = mock.MagicMock(spec=GitManager) - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_only_p3_findings_returns_true_without_calling_agent(self, orch: Orchestrator): - """When all findings are P3, _phase_fix must return True immediately - without invoking the fix agent.""" - p3_findings = [ - Finding(role="qa", severity="P3", file="a.py", line=1, title="minor", description="", fix=""), - Finding(role="qa", severity="P3", file="b.py", line=2, title="also minor", description="", fix=""), - ] - - with mock.patch.object(orch, "_run_agent") as mock_agent: - result = orch._phase_fix(p3_findings) - - assert result is True - mock_agent.assert_not_called() - - def test_p1_finding_agent_raises_returns_false(self, tmp_path: pathlib.Path, orch: Orchestrator): - """When the fix agent raises an exception, _phase_fix must return False.""" - p1_finding = Finding( - role="security", - severity="P1", - file="src/bar.py", - line=42, - title="critical issue", - description="dangerous code", - fix="remove it", - ) - - with mock.patch.object(orch, "_run_agent", side_effect=RuntimeError("agent timed out")): - with mock.patch("codelicious.prompts.check_build_complete", return_value=False): - with mock.patch("codelicious.prompts.clear_build_complete"): - result = orch._phase_fix([p1_finding]) - - assert result is False - - -# --------------------------------------------------------------------------- -# Finding 63 — _create_worktree failure/fallback paths -# --------------------------------------------------------------------------- - - -class TestCreateWorktreeFailurePaths: - """Finding 63: _create_worktree fallback and double-failure paths.""" - - def test_fallback_non_zero_raises_runtime_error(self, tmp_path: pathlib.Path): - """When both primary add (-b) and fallback add (no -b) return non-zero, - RuntimeError is raised.""" - primary_fail = mock.MagicMock(returncode=1, stderr="fatal: cannot create branch") - fallback_fail = mock.MagicMock(returncode=1, stderr="fatal: worktree already exists") - - responses = iter([primary_fail, fallback_fail]) - - with ( - mock.patch( - "codelicious.orchestrator.subprocess.run", - side_effect=lambda *a, **kw: next(responses), - ), - pytest.raises(RuntimeError, match="Failed to create worktree"), - ): - _create_worktree(tmp_path, "codelicious/my-branch") - - def test_fallback_timeout_raises_runtime_error(self, tmp_path: pathlib.Path): - """When the fallback (no -b) worktree add times out, RuntimeError is raised.""" - primary_fail = mock.MagicMock(returncode=1, stderr="already exists") - - def _fake_run(cmd, **kwargs): - if "-b" in cmd: - return primary_fail - # fallback — no -b - raise subprocess.TimeoutExpired(cmd=cmd, timeout=120) - - with mock.patch("codelicious.orchestrator.subprocess.run", side_effect=_fake_run): - with pytest.raises(RuntimeError, match="Timed out creating worktree"): - _create_worktree(tmp_path, "codelicious/my-branch") - - def test_primary_fails_fallback_succeeds_returns_path(self, tmp_path: pathlib.Path): - """When the primary (-b) add fails and the fallback succeeds, the worktree path - is returned without raising.""" - primary_fail = mock.MagicMock(returncode=1, stderr="already exists") - fallback_ok = mock.MagicMock(returncode=0) - - responses = iter([primary_fail, fallback_ok]) - - with mock.patch( - "codelicious.orchestrator.subprocess.run", - side_effect=lambda *a, **kw: next(responses), - ): - result = _create_worktree(tmp_path, "codelicious/my-branch") - - expected = tmp_path / ".codelicious" / "worktrees" / "codelicious/my-branch" - assert result == expected - - -# --------------------------------------------------------------------------- -# Finding 64 — _delete_branch -# --------------------------------------------------------------------------- - - -class TestDeleteBranch: - """Finding 64: _delete_branch happy path, timeout, and non-zero exit.""" - - def test_happy_path_succeeds_silently(self, tmp_path: pathlib.Path): - """When git branch -d returns zero, _delete_branch returns without raising.""" - ok = mock.MagicMock(returncode=0) - with mock.patch("codelicious.orchestrator.subprocess.run", return_value=ok): - # Should not raise - from codelicious.orchestrator import _delete_branch - - _delete_branch(tmp_path, "codelicious/my-branch") - - def test_timeout_logs_warning_and_returns(self, tmp_path: pathlib.Path, caplog): - """A timeout on git branch -d logs a warning and does not raise.""" - from codelicious.orchestrator import _delete_branch - - with ( - mock.patch( - "codelicious.orchestrator.subprocess.run", - side_effect=subprocess.TimeoutExpired(cmd=["git", "branch", "-d"], timeout=120), - ), - caplog.at_level("WARNING", logger="codelicious.orchestrator"), - ): - _delete_branch(tmp_path, "codelicious/timed-out-branch") - - warning_msgs = [r.message for r in caplog.records if r.levelname == "WARNING"] - assert any("timed out" in m.lower() or "timeout" in m.lower() for m in warning_msgs) - - def test_non_zero_exit_logs_warning_and_returns(self, tmp_path: pathlib.Path, caplog): - """A non-zero exit from git branch -d logs a warning and does not raise.""" - from codelicious.orchestrator import _delete_branch - - fail = mock.MagicMock(returncode=1, stderr="error: branch not fully merged") - with mock.patch("codelicious.orchestrator.subprocess.run", return_value=fail): - with caplog.at_level("WARNING", logger="codelicious.orchestrator"): - _delete_branch(tmp_path, "codelicious/un-merged-branch") - - warning_msgs = [r.message for r in caplog.records if r.levelname == "WARNING"] - assert any("failed to delete" in m.lower() or "un-merged-branch" in m.lower() for m in warning_msgs) - - -# --------------------------------------------------------------------------- -# Finding 65 — _phase_build KeyboardInterrupt handling -# --------------------------------------------------------------------------- - - -class TestPhaseBuildKeyboardInterrupt: - """Finding 65: KeyboardInterrupt inside _phase_build shuts the pool down and re-raises.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path) -> Orchestrator: - git_manager = mock.MagicMock() - git_manager.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_keyboard_interrupt_re_raises_and_pool_is_shut_down(self, tmp_path: pathlib.Path, orch: Orchestrator): - """When concurrent.futures.as_completed raises KeyboardInterrupt, - the exception is re-raised after cancelling pending futures.""" - spec_a = tmp_path / "spec_a.md" - spec_b = tmp_path / "spec_b.md" - spec_a.write_text("") - spec_b.write_text("") - - with ( - mock.patch( - "concurrent.futures.as_completed", - side_effect=KeyboardInterrupt, - ), - mock.patch.object(orch, "_build_spec_in_worktree", return_value=("branch", True)), - ): - with pytest.raises(KeyboardInterrupt): - orch._phase_build([spec_a, spec_b], max_workers=2) - - -# --------------------------------------------------------------------------- -# Finding 66 — commit_ok=False, success=True data-loss prevention path -# --------------------------------------------------------------------------- - - -class TestCommitFailureAfterSuccessPreservesWorktree: - """Finding 66: when _commit_worktree_changes returns False and the build - was successful, the worktree is preserved and success becomes False.""" - - @pytest.fixture - def orch(self, tmp_path: pathlib.Path) -> Orchestrator: - git_manager = mock.MagicMock() - - class C: - model = "" - effort = "" - max_turns = 0 - agent_timeout_s = 30 - dry_run = True - - return Orchestrator(tmp_path, git_manager, C()) - - def test_commit_false_success_true_sets_success_false_and_preserves_worktree( - self, tmp_path: pathlib.Path, orch: Orchestrator, caplog - ): - """If _commit_worktree_changes returns False after a successful build, - success is flipped to False and the worktree is NOT removed.""" - spec = tmp_path / "spec.md" - spec.write_text("- [ ] some task\n") - - worktree = tmp_path / "wt" - worktree.mkdir() - (worktree / ".codelicious").mkdir() - (worktree / ".codelicious" / "BUILD_COMPLETE").write_text("DONE") - (worktree / "spec.md").write_text("- [x] some task\n") - - remove_worktree = mock.MagicMock() - - with mock.patch.object(orch, "_run_agent", return_value=mock.MagicMock(success=True)): - with mock.patch("codelicious.orchestrator._create_worktree", return_value=worktree): - with mock.patch("codelicious.orchestrator._remove_worktree", remove_worktree): - with mock.patch("codelicious.orchestrator._commit_worktree_changes", return_value=False): - with caplog.at_level("ERROR", logger="codelicious.orchestrator"): - _, success = orch._build_spec_in_worktree(spec) - - assert success is False - remove_worktree.assert_not_called() - error_msgs = [r.message for r in caplog.records if r.levelno >= logging.ERROR] - assert any("data loss" in m.lower() or "commit" in m.lower() for m in error_msgs) - - -# --------------------------------------------------------------------------- -# spec-21 Phase 14: REVIEWER_PROMPTS and ReviewRole coverage -# --------------------------------------------------------------------------- - - -class TestReviewerPromptsStructure: - """Tests for REVIEWER_PROMPTS dict structure (spec-21 Phase 14).""" - - def test_reviewer_prompts_is_dict_with_string_values(self) -> None: - """REVIEWER_PROMPTS must be a dict[str, str].""" - assert isinstance(REVIEWER_PROMPTS, dict) - assert len(REVIEWER_PROMPTS) > 0 - for key, value in REVIEWER_PROMPTS.items(): - assert isinstance(key, str), f"Key {key!r} is not a string" - assert isinstance(value, str), f"Value for {key!r} is not a string" - - def test_reviewer_prompts_has_security_role(self) -> None: - """REVIEWER_PROMPTS must include a 'security' role.""" - assert "security" in REVIEWER_PROMPTS - - def test_reviewer_prompts_contain_template_vars(self) -> None: - """Each prompt must contain the {{project_name}} template variable.""" - for key, prompt in REVIEWER_PROMPTS.items(): - assert "{{project_name}}" in prompt, f"Role {key!r} missing {{{{project_name}}}}" - - -class TestReviewRoleDataclass: - """Tests for ReviewRole dataclass (spec-21 Phase 14).""" - - def test_review_role_fields(self) -> None: - """ReviewRole must have name and prompt fields.""" - role = ReviewRole(name="security", prompt="Review for vulnerabilities.") - assert role.name == "security" - assert role.prompt == "Review for vulnerabilities." - - def test_review_role_is_frozen(self) -> None: - """ReviewRole must be frozen (immutable).""" - role = ReviewRole(name="test", prompt="test prompt") - with pytest.raises(AttributeError): - role.name = "modified" diff --git a/tests/test_prompts.py b/tests/test_prompts.py index ac41c63b..71b88efd 100644 --- a/tests/test_prompts.py +++ b/tests/test_prompts.py @@ -8,7 +8,9 @@ import pytest from codelicious.prompts import ( - AGENT_BUILD_SPEC, + CHUNK_EXECUTE, + CHUNK_FIX, + CHUNK_VERIFY, check_build_complete, clear_build_complete, render, @@ -124,17 +126,6 @@ def test_unused_kwargs_ignored(self): result = render("Hello {{name}}!", name="world", extra="ignored") assert result == "Hello world!" - def test_spec_filter_substitution(self): - """The critical fix: spec_filter is actually substituted into the prompt.""" - result = render( - AGENT_BUILD_SPEC, - project_name="myproject", - spec_filter="/path/to/spec.md", - ) - assert "/path/to/spec.md" in result - assert "{{spec_filter}}" not in result - assert "{{project_name}}" not in result - def test_partial_kwargs_leaves_unreplaced_tokens_verbatim(self): """render() with only some kwargs replaces provided tokens and leaves others intact.""" template = "Hello {{name}}, your task is {{task}}!" @@ -236,16 +227,8 @@ def test_all_prompt_constants_are_strings(self) -> None: if isinstance(val, str): assert len(val) > 0, f"Prompt constant {name} is empty" - def test_agent_build_spec_contains_template_vars(self) -> None: - """AGENT_BUILD_SPEC must contain {{project_name}} and {{spec_filter}}.""" - from codelicious.prompts import AGENT_BUILD_SPEC - - assert "{{project_name}}" in AGENT_BUILD_SPEC - assert "{{spec_filter}}" in AGENT_BUILD_SPEC - def test_chunk_execute_contains_template_vars(self) -> None: """CHUNK_EXECUTE must contain expected template variables.""" - from codelicious.prompts import CHUNK_EXECUTE assert "{{repo_path}}" in CHUNK_EXECUTE assert "{{chunk_id}}" in CHUNK_EXECUTE @@ -256,14 +239,12 @@ def test_chunk_execute_contains_template_vars(self) -> None: def test_chunk_verify_contains_template_vars(self) -> None: """CHUNK_VERIFY must contain expected template variables.""" - from codelicious.prompts import CHUNK_VERIFY assert "{{repo_path}}" in CHUNK_VERIFY assert "{{chunk_id}}" in CHUNK_VERIFY def test_chunk_fix_contains_template_vars(self) -> None: """CHUNK_FIX must contain expected template variables.""" - from codelicious.prompts import CHUNK_FIX assert "{{repo_path}}" in CHUNK_FIX assert "{{chunk_id}}" in CHUNK_FIX @@ -271,7 +252,7 @@ def test_chunk_fix_contains_template_vars(self) -> None: def test_chunk_templates_renderable(self) -> None: """All chunk templates can be rendered with render().""" - from codelicious.prompts import CHUNK_EXECUTE, CHUNK_FIX, CHUNK_VERIFY, render + from codelicious.prompts import render rendered = render( CHUNK_EXECUTE, diff --git a/tests/test_sandbox.py b/tests/test_sandbox.py index 9397f43f..2578e8ca 100644 --- a/tests/test_sandbox.py +++ b/tests/test_sandbox.py @@ -879,3 +879,35 @@ def batch_writer(thread_id: int) -> None: failed = sum(1 for _, ok in results if not ok) assert succeeded == 10, f"Expected 10 successes, got {succeeded}" assert failed == 10, f"Expected 10 failures, got {failed}" + + +# -- spec v29 Step 15: O_NOFOLLOW unavailable warning -------------------------- + + +class TestONoFollowWarning: + """spec v29 Step 15: Sandbox emits a single WARNING when O_NOFOLLOW is unavailable.""" + + def test_warning_emitted_when_o_nofollow_missing( + self, tmp_path: pathlib.Path, caplog: pytest.LogCaptureFixture + ) -> None: + """A platform without os.O_NOFOLLOW (e.g. Windows) gets a clear warning at init.""" + with unittest.mock.patch("codelicious.sandbox.os") as mock_os: + # Re-expose only the attributes the constructor touches; deliberately + # omit O_NOFOLLOW so hasattr() returns False. + mock_os.path = os.path + mock_os.environ = os.environ + del mock_os.O_NOFOLLOW + with caplog.at_level(logging.WARNING, logger="codelicious.sandbox"): + Sandbox(tmp_path) + + warnings = [r for r in caplog.records if "O_NOFOLLOW unavailable" in r.message] + assert len(warnings) == 1, f"expected exactly one O_NOFOLLOW warning, got {len(warnings)}" + + def test_no_warning_on_posix(self, tmp_path: pathlib.Path, caplog: pytest.LogCaptureFixture) -> None: + """On POSIX (where os.O_NOFOLLOW exists) no warning fires.""" + if not hasattr(os, "O_NOFOLLOW"): + pytest.skip("Test only meaningful on POSIX") + with caplog.at_level(logging.WARNING, logger="codelicious.sandbox"): + Sandbox(tmp_path) + warnings = [r for r in caplog.records if "O_NOFOLLOW unavailable" in r.message] + assert warnings == [] diff --git a/tests/test_scaffolder.py b/tests/test_scaffolder.py index 376a7de8..4b7449e9 100644 --- a/tests/test_scaffolder.py +++ b/tests/test_scaffolder.py @@ -204,3 +204,37 @@ def test_scaffold_propagates_oserror_from_atomic_write(tmp_path: pathlib.Path) - with patch("codelicious.scaffolder.atomic_write_text", side_effect=OSError("disk full")): with pytest.raises(OSError, match="disk full"): scaffold(tmp_path) + + +# -- spec v29 Step 18: forbidden-pattern tuple is single source of truth --- + + +class TestForbiddenPatternsSingleSource: + def test_all_patterns_appear_in_security_rules(self) -> None: + from codelicious.scaffolder import _FORBIDDEN_PATTERNS_DOC, _RULES_SECURITY + + for pattern in _FORBIDDEN_PATTERNS_DOC: + assert pattern in _RULES_SECURITY, f"missing {pattern!r} in _RULES_SECURITY" + + def test_all_patterns_appear_in_verify_all_skill(self) -> None: + from codelicious.scaffolder import _FORBIDDEN_PATTERNS_DOC, _SKILL_VERIFY_ALL + + for pattern in _FORBIDDEN_PATTERNS_DOC: + assert pattern in _SKILL_VERIFY_ALL + + def test_all_patterns_appear_in_builder_agent(self) -> None: + from codelicious.scaffolder import _AGENT_BUILDER, _FORBIDDEN_PATTERNS_DOC + + for pattern in _FORBIDDEN_PATTERNS_DOC: + assert pattern in _AGENT_BUILDER + + def test_no_unsubstituted_sentinels_remain(self) -> None: + from codelicious.scaffolder import ( + _AGENT_BUILDER, + _RULES_SECURITY, + _SKILL_VERIFY_ALL, + ) + + for blob in (_AGENT_BUILDER, _RULES_SECURITY, _SKILL_VERIFY_ALL): + assert "__FORBIDDEN_INLINE__" not in blob + assert "__FORBIDDEN_BULLETS__" not in blob diff --git a/tests/test_v2_orchestrator.py b/tests/test_v2_orchestrator.py index 09f37c47..7e2313f3 100644 --- a/tests/test_v2_orchestrator.py +++ b/tests/test_v2_orchestrator.py @@ -184,6 +184,19 @@ def test_deadline_stops_execution(self, tmp_path: pathlib.Path) -> None: # Chunks should not have been executed engine.execute_chunk.assert_not_called() + def test_future_deadline_runs_all_chunks(self, tmp_path: pathlib.Path) -> None: + """A deadline well in the future does not interrupt chunk execution (spec v29 Step 10).""" + import time + + spec = self._make_spec(tmp_path, "# Feature\n\n## Phase 1\n\n- [ ] Task A\n- [ ] Task B\n") + engine = self._mock_engine(success=True) + git = self._mock_git() + + orch = V2Orchestrator(tmp_path, git, engine) + orch.run(specs=[spec], deadline=time.monotonic() + 3600, push_pr=False) + + assert engine.execute_chunk.call_count == 2 + def test_empty_spec_no_chunks(self, tmp_path: pathlib.Path) -> None: """A spec with no checkboxes and no body produces no chunks.""" spec = self._make_spec(tmp_path, "# Empty Spec\n") @@ -321,3 +334,194 @@ def test_commit_cap_takes_precedence_over_loc_cap(self, tmp_path: pathlib.Path) git.create_continuation_branch.assert_called_once() # LOC check should not have been polled on the chunk that split via commits assert git.get_pr_diff_loc.call_count == 0 + + +class TestIdempotentResume: + """spec v30 Step 2: persistent chunk-status ledger drives skip-on-resume.""" + + def _make_spec(self, tmp_path: pathlib.Path, content: str) -> pathlib.Path: + spec_dir = tmp_path / "docs" / "specs" + spec_dir.mkdir(parents=True, exist_ok=True) + spec = spec_dir / "01_feature.md" + spec.write_text(content, encoding="utf-8") + return spec + + def _mock_engine(self): + engine = mock.MagicMock() + engine.name = "mock-engine" + engine.execute_chunk.return_value = ChunkResult( + success=True, files_modified=[pathlib.Path("src/a.py")], message="done" + ) + engine.verify_chunk.return_value = ChunkResult(success=True, message="passed") + engine.fix_chunk.return_value = ChunkResult(success=True, message="fixed") + return engine + + def _mock_git(self): + git = mock.MagicMock() + git.assert_safe_branch = mock.MagicMock() + git.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") + git.commit_chunk.return_value = mock.MagicMock(success=True, sha="abc1234", message="ok") + git.get_pr_commit_count.return_value = 0 + git.get_pr_diff_loc.return_value = 0 + git.ensure_draft_pr_exists.return_value = 42 + git.revert_chunk_changes.return_value = True + git.repo_path = pathlib.Path("/tmp/repo") + return git + + def test_merged_chunk_skipped(self, tmp_path: pathlib.Path) -> None: + """A chunk previously marked 'merged' in the ledger is not re-executed.""" + import json + + spec = self._make_spec(tmp_path, "# Feature\n\n## Phase 1\n\n- [ ] Task A\n- [ ] Task B\n") + engine = self._mock_engine() + git = self._mock_git() + orch = V2Orchestrator(tmp_path, git, engine) + ledger_path = orch._ledger_path(spec) + ledger_path.parent.mkdir(parents=True, exist_ok=True) + ledger_path.write_text( + json.dumps( + { + "chunks": { + "spec-01-chunk-01": { + "status": "merged", + "title": "Task A", + "updated_at": "2026-05-04T00:00:00Z", + } + } + } + ) + ) + + orch.run(specs=[spec], push_pr=False) + # Only chunk 02 should have been executed. + assert engine.execute_chunk.call_count == 1 + + def test_full_ledger_executes_no_chunks(self, tmp_path: pathlib.Path) -> None: + import json + + spec = self._make_spec(tmp_path, "# Feature\n\n## Phase 1\n\n- [ ] Task A\n") + engine = self._mock_engine() + git = self._mock_git() + orch = V2Orchestrator(tmp_path, git, engine) + ledger_path = orch._ledger_path(spec) + ledger_path.parent.mkdir(parents=True, exist_ok=True) + ledger_path.write_text( + json.dumps( + { + "chunks": { + "spec-01-chunk-01": {"status": "merged", "title": "Task A"}, + } + } + ) + ) + + orch.run(specs=[spec], push_pr=False) + engine.execute_chunk.assert_not_called() + + def test_no_resume_ignores_ledger(self, tmp_path: pathlib.Path) -> None: + import json + + spec = self._make_spec(tmp_path, "# Feature\n\n## Phase 1\n\n- [ ] Task A\n") + engine = self._mock_engine() + git = self._mock_git() + orch = V2Orchestrator(tmp_path, git, engine, no_resume=True) + ledger_path = orch._ledger_path(spec) + ledger_path.parent.mkdir(parents=True, exist_ok=True) + ledger_path.write_text( + json.dumps( + { + "chunks": { + "spec-01-chunk-01": {"status": "merged", "title": "Task A"}, + } + } + ) + ) + + orch.run(specs=[spec], push_pr=False) + engine.execute_chunk.assert_called_once() + + def test_successful_run_writes_merged_status(self, tmp_path: pathlib.Path) -> None: + import json + + spec = self._make_spec(tmp_path, "# Feature\n\n## Phase 1\n\n- [ ] Task A\n") + engine = self._mock_engine() + git = self._mock_git() + orch = V2Orchestrator(tmp_path, git, engine) + + orch.run(specs=[spec], push_pr=False) + ledger = json.loads(orch._ledger_path(spec).read_text()) + merged = [entry for entry in ledger["chunks"].values() if entry["status"] == "merged"] + assert len(merged) == 1 + + +class TestEngineFallback: + """spec v30 Step 5: Claude rate-limit fails over to the next engine in the list.""" + + def _make_spec(self, tmp_path: pathlib.Path, content: str) -> pathlib.Path: + spec_dir = tmp_path / "docs" / "specs" + spec_dir.mkdir(parents=True, exist_ok=True) + spec = spec_dir / "01_feature.md" + spec.write_text(content, encoding="utf-8") + return spec + + def _mock_git(self): + git = mock.MagicMock() + git.assert_safe_branch = mock.MagicMock() + git.push_to_origin.return_value = mock.MagicMock(success=True, error_type=None, message="") + git.commit_chunk.return_value = mock.MagicMock(success=True, sha="abc1234", message="ok") + git.get_pr_commit_count.return_value = 0 + git.get_pr_diff_loc.return_value = 0 + git.ensure_draft_pr_exists.return_value = 42 + git.revert_chunk_changes.return_value = True + git.repo_path = pathlib.Path("/tmp/repo") + return git + + def _engine(self, name: str, *, success: bool = True, rate_limit: bool = False) -> mock.MagicMock: + eng = mock.MagicMock() + eng.name = name + if rate_limit: + eng.execute_chunk.return_value = ChunkResult(success=False, files_modified=[], message="Rate limited") + else: + eng.execute_chunk.return_value = ChunkResult( + success=success, + files_modified=[pathlib.Path("src/a.py")] if success else [], + message="ok" if success else "err", + ) + eng.verify_chunk.return_value = ChunkResult(success=True, message="passed") + eng.fix_chunk.return_value = ChunkResult(success=True, message="fixed") + return eng + + def test_primary_rate_limit_fails_over(self, tmp_path: pathlib.Path) -> None: + spec = self._make_spec(tmp_path, "# Spec\n\n## P1\n\n- [ ] Task A\n") + primary = self._engine("claude", rate_limit=True) + secondary = self._engine("huggingface", success=True) + git = self._mock_git() + + orch = V2Orchestrator(tmp_path, git, primary, engines=[primary, secondary]) + result = orch.run(specs=[spec], push_pr=False) + + assert primary.execute_chunk.call_count == 1 + assert secondary.execute_chunk.call_count == 1 + assert result.success is True + + def test_both_rate_limit_aborts(self, tmp_path: pathlib.Path) -> None: + spec = self._make_spec(tmp_path, "# Spec\n\n## P1\n\n- [ ] Task A\n") + primary = self._engine("claude", rate_limit=True) + secondary = self._engine("huggingface", rate_limit=True) + git = self._mock_git() + + orch = V2Orchestrator(tmp_path, git, primary, engines=[primary, secondary]) + result = orch.run(specs=[spec], push_pr=False) + + assert primary.execute_chunk.call_count == 1 + assert secondary.execute_chunk.call_count == 1 + assert result.success is False + + def test_no_engines_arg_keeps_legacy_single_engine_behavior(self, tmp_path: pathlib.Path) -> None: + spec = self._make_spec(tmp_path, "# Spec\n\n## P1\n\n- [ ] Task A\n") + eng = self._engine("claude", success=True) + git = self._mock_git() + + orch = V2Orchestrator(tmp_path, git, eng) # no engines kwarg + orch.run(specs=[spec], push_pr=False) + assert eng.execute_chunk.call_count == 1 diff --git a/tests/test_verifier.py b/tests/test_verifier.py index 089252ee..e9e1a1c6 100644 --- a/tests/test_verifier.py +++ b/tests/test_verifier.py @@ -1771,3 +1771,108 @@ def test_coverage_timeout_used_in_subprocess(self, tmp_path: pathlib.Path) -> No assert mock_run.called call_kwargs = mock_run.call_args assert call_kwargs.kwargs.get("timeout") == 42 or call_kwargs[1].get("timeout") == 42 + + +# ───────────────────────────────────────────────────────────────────────── +# spec v29 Step 6: chunk-scoped verify_paths +# ───────────────────────────────────────────────────────────────────────── + + +class TestVerifyPaths: + def test_empty_paths_falls_back_to_full_verify(self, tmp_path: pathlib.Path) -> None: + from codelicious.verifier import verify_paths + + with patch("codelicious.verifier.verify") as mock_verify: + mock_verify.return_value = "fallback" + assert verify_paths(tmp_path, []) == "fallback" + mock_verify.assert_called_once_with(tmp_path) + + def test_no_test_mapping_falls_back(self, tmp_path: pathlib.Path) -> None: + """Source file with no matching test_<module>.py falls back to full verify.""" + from codelicious.verifier import verify_paths + + (tmp_path / "src").mkdir() + src = tmp_path / "src" / "lonely.py" + src.write_text("x = 1\n") + + with patch("codelicious.verifier.verify") as mock_verify: + mock_verify.return_value = "fallback" + assert verify_paths(tmp_path, [src]) == "fallback" + mock_verify.assert_called_once() + + def test_mapped_test_file_resolved(self, tmp_path: pathlib.Path) -> None: + """src/<module>.py with a matching tests/test_<module>.py is mapped.""" + from codelicious.verifier import _map_src_to_tests + + (tmp_path / "src").mkdir() + (tmp_path / "tests").mkdir() + src = tmp_path / "src" / "math_utils.py" + src.write_text("def add(a, b):\n return a + b\n") + test = tmp_path / "tests" / "test_math_utils.py" + test.write_text("def test_x():\n assert True\n") + + mapped = _map_src_to_tests(tmp_path, [src]) + assert mapped == [test] + + def test_map_skips_non_python_paths(self, tmp_path: pathlib.Path) -> None: + from codelicious.verifier import _map_src_to_tests + + (tmp_path / "tests").mkdir() + assert _map_src_to_tests(tmp_path, [pathlib.Path("README.md")]) == [] + + +# ───────────────────────────────────────────────────────────────────────── +# spec v30 Step 8: coverage-floor enforcement +# ───────────────────────────────────────────────────────────────────────── + + +class TestCoverageGate: + def test_default_floor_is_90(self, tmp_path: pathlib.Path) -> None: + from codelicious.verifier import resolve_min_coverage + + assert resolve_min_coverage(tmp_path) == 90.0 + + def test_pyproject_value_used_when_present(self, tmp_path: pathlib.Path) -> None: + from codelicious.verifier import resolve_min_coverage + + (tmp_path / "pyproject.toml").write_text( + "[tool.codelicious]\nmin_coverage = 75\n", + encoding="utf-8", + ) + assert resolve_min_coverage(tmp_path) == 75.0 + + def test_cli_override_wins(self, tmp_path: pathlib.Path) -> None: + from codelicious.verifier import resolve_min_coverage + + (tmp_path / "pyproject.toml").write_text( + "[tool.codelicious]\nmin_coverage = 75\n", + encoding="utf-8", + ) + assert resolve_min_coverage(tmp_path, cli_override=99.0) == 99.0 + + def test_below_floor_demotes_to_failure(self, tmp_path: pathlib.Path) -> None: + from codelicious.verifier import ( + CheckResult, + VerificationResult, + _enforce_coverage_floor, + ) + + cov = CheckResult(name="coverage", passed=True, message="Coverage 71.5%", details="") + result = VerificationResult(checks=[cov]) + _enforce_coverage_floor(result, 90.0) + assert cov.passed is False + assert "71.5%" in cov.message + assert "90.0%" in cov.message + + def test_above_floor_unchanged(self, tmp_path: pathlib.Path) -> None: + from codelicious.verifier import ( + CheckResult, + VerificationResult, + _enforce_coverage_floor, + ) + + cov = CheckResult(name="coverage", passed=True, message="Coverage 95%", details="") + result = VerificationResult(checks=[cov]) + _enforce_coverage_floor(result, 90.0) + assert cov.passed is True + assert cov.message == "Coverage 95%"