diff --git a/.github/workflows/windows-free-tests.yml b/.github/workflows/windows-free-tests.yml index 69e71a8b6a..1b0d5c793b 100644 --- a/.github/workflows/windows-free-tests.yml +++ b/.github/workflows/windows-free-tests.yml @@ -91,8 +91,17 @@ jobs: continue-on-error: true - name: Verify new portability work on Windows - # 31 tests targeting the new code paths added by v1.20.0.0. These - # MUST pass for the release-note headline ("curated Windows lane added") - # to be truthful. - run: bun test test/gstack-paths.test.ts browse/test/claude-bin.test.ts test/test-free-shards.test.ts + # Tests targeting the v1.20.0.0 lane plus v1.30.0.0 fix-wave additions. + # v1.30.0.0 extension covers icacls hardening (#1308), bash.exe telemetry + # wrap (#1306), and Bun.which-based binary resolvers (#1307). These must + # pass on Windows for the wave's "Windows hardening" framing to be honest. + run: | + bun test \ + test/gstack-paths.test.ts \ + browse/test/claude-bin.test.ts \ + test/test-free-shards.test.ts \ + browse/test/file-permissions.test.ts \ + browse/test/security.test.ts \ + make-pdf/test/browseClient.test.ts \ + make-pdf/test/pdftotext.test.ts shell: bash diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 1cbd52898b..7f903d60ae 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -156,7 +156,7 @@ The Chrome sidebar agent has tools (Bash, Read, Glob, Grep, WebFetch) and reads 4. **L5 canary token (`browse/src/security.ts`).** A random token injected into the system prompt at session start. Rolling-buffer detection across `text_delta` and `input_json_delta` streams catches the token if it shows up anywhere in Claude's output, tool arguments, URLs, or file writes. Deterministic BLOCK — if the token leaks, the attacker convinced Claude to reveal the system prompt, and the session ends. -5. **L6 ensemble combiner (`combineVerdict`).** BLOCK requires agreement from two ML classifiers at >= `WARN` (0.60), not a single confident hit. 
This is the Stack Overflow instruction-writing false-positive mitigation. On tool-output scans, single-layer high confidence BLOCKs directly — the content wasn't user-authored, so the FP concern doesn't apply. +5. **L6 ensemble combiner (`combineVerdict`).** BLOCK requires agreement from two ML classifiers at >= `WARN` (0.75), not a single confident hit. This is the Stack Overflow instruction-writing false-positive mitigation. On tool-output scans, single-layer high confidence BLOCKs directly — the content wasn't user-authored, so the FP concern doesn't apply. **Critical constraint:** `security-classifier.ts` runs only in the sidebar-agent process, never in the compiled browse binary. `@huggingface/transformers` v4 requires `onnxruntime-node`, which fails `dlopen` from Bun compile's temp extract directory. Only the pure-string pieces (canary inject/check, verdict combiner, attack log, status) are in `security.ts`, which is safe to import from `server.ts`. diff --git a/CHANGELOG.md b/CHANGELOG.md index aa1f227d42..e9f3049a4b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,79 @@ # Changelog +## [1.30.0.0] - 2026-05-09 + +## **Twenty-one community fixes land in one wave, plus closing fixes that put the Windows + codex surfaces under CI for the first time.** + +Browse stops silently dropping `browse-console.log` writes (a regression from a missing variable declaration), the cold-start race that ENOENT'd one of every fifteen parallel daemons gets a per-process tempfile, and concurrent iframe detach finally clears refs symmetrically with main-frame nav. `codex exec resume` works on machines that ship `python` without the `python3` alias, and stops passing the `-C` and `-s` flags that the resume subcommand rejects. Windows users get bash.exe wrap for telemetry spawn, `Bun.which` binary resolution that finds `.exe`/`.cmd`/`.bat` instead of bare paths, and NTFS ACL hardening on every file written to `~/.gstack/`. 
Two closing fixes land alongside: `windows-free-tests.yml` now exercises the icacls + Bun.which test files (closing the gap codex's outside-voice review flagged in the plan), and a live `codex exec resume --help` smoke catches CLI flag-semantics drift that the existing regex-only test would have missed. + +### The numbers that matter + +End-to-end verified via `bun test` (free tier, 459 tests pass) and gate-tier E2E: + +| Surface | Before | After | Δ | +|---|---|---|---| +| Browse `console.log` persistence | swallowed every 1s flush due to `lastConsoleFlushed` ReferenceError | declared, persisted to disk | regression closed | +| Concurrent daemon cold-start | shared `state.tmp` raced rename, killed 1 in N spawns | per-process `tmpStatePath()` (pid + 4 random bytes) | no more ENOENT | +| Iframe detach handling | refs leaked when iframe auto-detached (asymmetric with main-frame nav) | refs cleared symmetrically | parity fix | +| `codex exec resume` flag set | `-C "$_REPO_ROOT" -s read-only` (rejected by the resume subcommand) | `-c 'sandbox_mode="read-only"'` + `cd "$_REPO_ROOT"` | works without warnings | +| Codex JSON parsing | hardcoded `python3`; broke on machines with only `python` | probes `python3` then `python`, errors clearly if neither | works on more machines | +| Windows browse / make-pdf binary resolution | bare-path probe missed `.exe`/`.cmd`/`.bat` | `Bun.which` + `GSTACK_*_BIN` override + extension probing | works on Windows installs | +| Windows state-file hardening | POSIX `0o600` mode bits no-op'd on NTFS | icacls inheritance break + grant-only ACL on every `~/.gstack/` write | actual hardening, not silent no-op | +| Windows telemetry spawn | `spawn(bash-script)` ENOENT'd silently on Windows (`CreateProcess` rejects shebangs) | bash.exe wrap with PATH / `GSTACK_BASH_BIN` override | telemetry events captured on Windows | +| Domain-skill auto-promote | promoted regardless of classifier_score | gated on `classifier_score > 0` | adversarially-flagged
domains stay quarantined | +| Shell-injection surface in memory ingest | git cwd interpolated through `/bin/sh` | `execFileSync` with cwd as a parameter | one less injection path | +| Windows free-tests CI coverage | 3 test files (claude-bin, gstack-paths, test-shards) | 7 test files (+ icacls, security telemetry, browseClient, pdftotext) | 4 new surfaces under CI | +| Codex CLI flag-semantics test | regex-only on SKILL.md text | live `codex exec resume --help` smoke (skips when codex absent) | catches upstream flag drift | + +PR count: 21 community merges + 4 in-house follow-up commits (#1302 template port, CL-1 Windows CI extension, CL-2 codex flag smoke, server.ts conflict-resolution fix). Contributors credited: 13 unique authors. Test count went from 452 → 459 (4 new tests from the merged PRs + 3 from CL-1/CL-2 invariants). + +### What this means for builders + +If you're on a Windows install, this is the release where `~/.gstack/` is actually access-restricted (icacls grants), browse and make-pdf find the right `.exe`, and bash-shebang telemetry stops dropping on the floor. Set `GSTACK_BROWSE_BIN` / `GSTACK_PDFTOTEXT_BIN` / `GSTACK_BASH_BIN` to override. If you use the `/codex` skill, resume sessions work on machines with only `python` and no `python3`, and the rejected `-C/-s` flags are gone. If you spawn multiple browse daemons in parallel (CI shards, cold-start races, multi-tab Conductor), the per-process tempfile fix means rename no longer steals the file out from under a sibling. Run `gbrain autopilot --install` once and forget about it. + +### Itemized changes + +#### Added + +- **#1306** Windows bash.exe wrap for telemetry spawn (`browse/src/security.ts`). Honors `GSTACK_BASH_BIN` / `BASH_BIN` env override, falls back to `Bun.which('bash')` (finds Git Bash on standard Windows installs). Returns null when bash is unresolvable so caller skips the spawn cleanly. Contributed by @scarson. 
+- **#1307** `Bun.which`-based binary resolution for `make-pdf/src/browseClient.ts` and `make-pdf/src/pdftotext.ts`. Probes `.exe`/`.cmd`/`.bat` after a bare-path miss on Windows; honors `GSTACK_BROWSE_BIN` / `GSTACK_PDFTOTEXT_BIN` overrides. Extends the v1.24 pattern from `claude-bin.ts` to the other two binary resolvers. Contributed by @scarson. +- **#1308** NTFS ACL hardening for `~/.gstack/` state files (`browse/src/file-permissions.ts` is the new helper). `writeSecureFile` and `mkdirSecure` invoke `icacls <path> /inheritance:r /grant:r <user>:(F)` on Windows; POSIX `chmod 0o600` continues working unchanged. First icacls failure per process is logged once with the advice line "sensitive files may be readable by other accounts on this machine"; later failures stay silent to avoid spam. Contributed by @scarson. +- **#1316** Python3-or-python probe in `codex/SKILL.md.tmpl`. Resolves `python3` then `python`, errors clearly if neither is on PATH. Contributed by @jbetala7. +- **#1339** Strict integer validation in `browse/src/browse-client.ts` env handling. Partial integers (e.g. `8080abc`) are now rejected rather than silently truncated. Contributed by @hiSandog. +- **#1369** `classifier_score > 0` gate on domain-skill auto-promote (`browse/src/domain-skills.ts:248-320`). Quarantined domains stay quarantined even if every other heuristic says promote. Contributed by @garagon. +- **CL-1** Windows free-tests CI lane now runs `browse/test/file-permissions.test.ts`, `browse/test/security.test.ts`, `make-pdf/test/browseClient.test.ts`, and `make-pdf/test/pdftotext.test.ts`. The four test files already platform-gate their assertions via `process.platform`, so the same files run on POSIX and Windows lanes and exercise only the relevant branch. +- **CL-2** Live codex CLI flag-semantics smoke (`test/codex-resume-flag-semantics.test.ts`).
Probes `codex exec resume --help` for `-c`/`sandbox_mode` presence and top-level `-C` absence; skips when codex isn't on PATH so dev machines without codex installed never see it fail. + +#### Changed + +- **#1270** `codex exec resume` invocation in `codex/SKILL.md.tmpl` drops `-C "$_REPO_ROOT"` and `-s read-only` (the resume subcommand rejects both), uses `-c 'sandbox_mode="read-only"'` config and `cd "$_REPO_ROOT"` instead. Adds the regression test `codex/SKILL.md resume command only uses resume-supported flags`. Contributed by @jbetala7. +- **#1273** `design/prototype.ts` (the prototype script only — the main design CLI is unchanged) reads the OpenAI key only from `OPENAI_API_KEY`. Output filenames sanitized to `[a-zA-Z0-9_-]` only. The `~/.gstack/openai.json` file fallback is removed from the prototype script; `design/src/auth.ts` and `design/src/cli.ts` still support it for the main CLI flow. Contributed by @orbisai0security. +- **#1302** /ship Plan Completion gate (`ship/SKILL.md.tmpl` + `scripts/resolvers/review.ts`) adds Verification Mode classification (DIFF-VERIFIABLE / CROSS-REPO / EXTERNAL-STATE / CONTENT-SHAPE), the UNVERIFIABLE classification, per-item confirmation gate (no blanket-confirm AskUserQuestion), and explicit fail-closed behavior on subagent failure. Forbids the silent-fail-open path that produced the VAS-449 incident shape. Contributed by @vaskockorovski. +- **#1332** /ship step 12 fail-fast probe for the base branch in `ship/SKILL.md.tmpl`. Prevents step 12 from running against an unresolvable base. Contributed by @Jasperc2024. +- **#1337** `design/src/variants.ts` honors the `Retry-After` header on 429 responses. Prevents thundering-herd retries against rate-limited endpoints. Contributed by @stedfn. +- **#1362** `test/helpers/providers/gemini.ts` detects the new `~/.gemini/oauth_creds.json` auth path alongside the legacy location. Contributed by @abigail-atheryon. 
+- **#1366** `browse/src/browser-manager.ts` adds `--no-sandbox` when running as root (Linux/WSL2), alongside the existing CI/container detection, rather than unconditionally. Contributed by @furkankoykiran. +- **#1368** `bin/gstack-memory-ingest.ts` passes git cwd via `execFileSync` parameter rather than interpolating into a `/bin/sh` invocation. One less shell-injection class. Contributed by @garagon. + +#### Fixed + +- **#1309** Missing `let lastConsoleFlushed = 0;` declaration in `browse/src/server.ts`. Every 1-second `flushBuffers` tick was throwing a swallowed ReferenceError; `browse-console.log` was never written in any production deployment since this regressed. Contributed by @yashkot007. +- **#1310** Per-process `tmpStatePath()` for state-file writes in `browse/src/server.ts`. The shared `state.tmp` literal raced on rename when concurrent daemons spawned (15-parallel cold-start reproducer). pid + 4 random bytes of suffix gives each writer a unique path; atomic rename still gives last-writer-wins on the final state. Contributed by @yashkot007. +- **#1311** `getActiveFrameOrPage` in `browse/src/tab-session.ts` clears refs symmetrically when an iframe auto-detaches, matching the existing main-frame nav path. Contributed by @yashkot007. +- **#1297** Korean / CJK IME input rendering in the Sidebar Terminal (`extension/sidepanel-terminal.js`, `browse/src/terminal-agent.ts`, `extension/sidepanel.css`). Composition state preserved, character widths corrected. Contributed by @realcarsonterry. +- **#1333** Removed the contradictory plan-mode handshake from `plan-devex-review/SKILL.md.tmpl` (the skill was simultaneously claiming plan-mode is active and asking the user to confirm entering plan-mode). Contributed by @Jasperc2024. + +#### Documentation + +- **#1290** `CLAUDE.md` and `ARCHITECTURE.md` prompt-injection thresholds aligned to the actual values in `browse/src/security.ts` (BLOCK 0.85, WARN 0.75, LOG_ONLY 0.40 — the docs had drifted to older numbers). Contributed by @brycealan.
+- **#1338** README per-skill symlink uninstall snippet corrected (the previous wording would `rm` the global skills directory rather than the project-local symlink). Contributed by @stedfn. + +#### For contributors + +- The wave was triaged by `/plan-ceo-review` (single-wave + bisect-discipline merge ordering), `/plan-eng-review` (mapped 5 cross-PR conflict pairs with explicit resolution rules + tightened the `gh pr checkout N -b pr-N` syntax), and `/codex` outside-voice review (caught 6 factual errors and 2 process improvements that both internal reviews missed; cross-model agreement was 14%). All review findings were incorporated before merge; the two CI gaps codex flagged became the CL-1 and CL-2 closing fixes that ship in this same release. +- The five cross-PR conflict pairs documented in the plan (#1316↔#1270 codex resume line, #1309→#1310→#1308 server.ts state writes, #1366↔#1308 browser-manager, #1306↔#1308 security.ts, #1332↔#1302 ship template) all surfaced as predicted; resolutions kept both intents on each. The lone exception was the #1310/#1308 state-file write site, where `fs.writeFileSync(tmpStatePath(), ..., { mode: 0o600 })` is preserved (locks #1310's race-fix invariant exercised by `browse/test/server-tmp-state-path.test.ts`); icacls hardening still applies to every other `writeSecureFile` call site #1308 introduced (`auth.json`, the `mkdirSecure` paths, etc.). +- PR #1302 only edited the generated `ship/SKILL.md`, not the source `ship/SKILL.md.tmpl` or `scripts/resolvers/review.ts`. The next `bun run gen:skill-docs` would have wiped its changes; the wave includes `fix(ship): port #1302 SKILL.md edits to .tmpl + resolver source` to keep the changes alive across regen. + ## [1.29.0.0] - 2026-05-08 ## **Code search beats Grep across every Conductor worktree now, not just the last one you synced.** diff --git a/CLAUDE.md b/CLAUDE.md index c0f07f6901..875cb94fe2 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -287,8 +287,12 @@ for `server.ts`. 
See `~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-pr **Thresholds** (in `security.ts`): - `BLOCK: 0.85` — single-layer score that would cause BLOCK if cross-confirmed -- `WARN: 0.60` — cross-confirm threshold. When L4 AND L4b both >= 0.60 → BLOCK +- `WARN: 0.75` — cross-confirm threshold. When L4 AND L4b both >= 0.75 → BLOCK - `LOG_ONLY: 0.40` — gates transcript classifier (skip Haiku when all layers < 0.40) +- `SOLO_CONTENT_BLOCK: 0.92` — single-layer threshold for label-less content classifiers + (testsavant, deberta). Intentionally higher than `BLOCK` because these layers can't + distinguish "this is an injection" from "this looks like phishing aimed at the user." + The transcript classifier keeps a separate, label-gated solo path at `BLOCK` (0.85). **Ensemble rule:** BLOCK only when the ML content classifier AND the transcript classifier both report >= WARN. Single-layer high confidence degrades to WARN — diff --git a/README.md b/README.md index dcab7cf213..87f2d5ddd6 100644 --- a/README.md +++ b/README.md @@ -327,9 +327,18 @@ If you don't have the repo cloned (e.g. you installed via a Claude Code paste an # 1. Stop browse daemons pkill -f "gstack.*browse" 2>/dev/null || true -# 2. Remove per-skill symlinks pointing into gstack/ -find ~/.claude/skills -maxdepth 1 -type l 2>/dev/null | while read -r link; do - case "$(readlink "$link" 2>/dev/null)" in gstack/*|*/gstack/*) rm -f "$link" ;; esac +# 2. Remove per-skill directories whose SKILL.md points into gstack/ +find ~/.claude/skills -mindepth 1 -maxdepth 1 -type d ! -name gstack 2>/dev/null | +while IFS= read -r dir; do + link="$dir/SKILL.md" + [ -L "$link" ] || continue + target=$(readlink "$link" 2>/dev/null) || continue + case "$target" in + gstack/*|*/gstack/*) + rm -f "$link" + rmdir "$dir" 2>/dev/null || true + ;; + esac done # 3. 
Remove gstack diff --git a/VERSION b/VERSION index 5d7709a6b5..fbeae6387a 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.29.0.0 +1.30.0.0 diff --git a/bin/gstack-memory-ingest.ts b/bin/gstack-memory-ingest.ts index 5d3401e036..56b072b2b0 100644 --- a/bin/gstack-memory-ingest.ts +++ b/bin/gstack-memory-ingest.ts @@ -632,9 +632,16 @@ function extractContentText(rec: any): string { function resolveGitRemote(cwd: string): string { if (!cwd) return ""; try { - const out = execSync(`git -C ${JSON.stringify(cwd)} remote get-url origin 2>/dev/null`, { + // execFileSync (no shell) so `cwd` cannot trigger command substitution. + // Transcript JSONL records are an untrusted surface (a poisoned `.cwd` + // value containing `"$(...)"` survived `JSON.stringify` interpolation + // into a `/bin/sh -c` context, since JSON quoting does not escape `$` + // or backticks). Mirrors the execFileSync pattern this module already + // uses for `gbrainAvailable()` (line 762) and `gbrainPutPage()` (line 816). + const out = execFileSync("git", ["-C", cwd, "remote", "get-url", "origin"], { encoding: "utf-8", timeout: 2000, + stdio: ["ignore", "pipe", "ignore"], }); return canonicalizeRemote(out.trim()); } catch { diff --git a/browse/src/browse-client.ts b/browse/src/browse-client.ts index a33681f717..435f575469 100644 --- a/browse/src/browse-client.ts +++ b/browse/src/browse-client.ts @@ -54,6 +54,13 @@ interface ResolvedAuth { source: 'env' | 'state-file'; } +function parseIntegerEnvValue(value: string | undefined): number | undefined { + const trimmed = value?.trim(); + if (!trimmed || !/^-?\d+$/.test(trimmed)) return undefined; + const parsed = parseInt(trimmed, 10); + return Number.isFinite(parsed) ? parsed : undefined; +} + /** Resolve the daemon port + token. Throws a clear error if neither path works. 
*/ export function resolveBrowseAuth(opts: BrowseClientOptions = {}): ResolvedAuth { if (opts.port !== undefined && opts.token !== undefined) { @@ -64,8 +71,8 @@ export function resolveBrowseAuth(opts: BrowseClientOptions = {}): ResolvedAuth const envPort = process.env.GSTACK_PORT; const envToken = process.env.GSTACK_SKILL_TOKEN; if (envPort && envToken) { - const port = opts.port ?? parseInt(envPort, 10); - if (!isNaN(port)) { + const port = opts.port ?? parseIntegerEnvValue(envPort); + if (port !== undefined) { return { port, token: opts.token ?? envToken, source: 'env' }; } } @@ -132,7 +139,7 @@ export class BrowseClient { const auth = resolveBrowseAuth(opts); this.port = auth.port; this.token = auth.token; - this.tabId = opts.tabId ?? (process.env.BROWSE_TAB ? parseInt(process.env.BROWSE_TAB, 10) : undefined); + this.tabId = opts.tabId ?? parseIntegerEnvValue(process.env.BROWSE_TAB); this.timeoutMs = opts.timeoutMs ?? 30_000; } diff --git a/browse/src/browser-manager.ts b/browse/src/browser-manager.ts index 9810674e15..fd906caa31 100644 --- a/browse/src/browser-manager.ts +++ b/browse/src/browser-manager.ts @@ -16,6 +16,7 @@ */ import { chromium, type Browser, type BrowserContext, type BrowserContextOptions, type Page, type Locator, type Cookie } from 'playwright'; +import { writeSecureFile, mkdirSecure } from './file-permissions'; import { addConsoleEntry, addNetworkEntry, addDialogEntry, networkBuffer, type DialogEntry } from './buffers'; import { validateNavigationUrl } from './url-validation'; import { TabSession, type RefEntry } from './tab-session'; @@ -197,10 +198,11 @@ export class BrowserManager { const launchArgs: string[] = [...STEALTH_LAUNCH_ARGS]; let useHeadless = true; - // Docker/CI: Chromium sandbox requires unprivileged user namespaces which - // are typically disabled in containers. Detect container environment and - // add --no-sandbox automatically. 
- if (process.env.CI || process.env.CONTAINER) { + // Docker/CI/root: Chromium sandbox requires unprivileged user namespaces which + // are typically disabled in containers and are never available for the root + // user on Linux. Detect all three cases and add --no-sandbox automatically. + const isRoot = typeof process.getuid === 'function' && process.getuid() === 0; + if (process.env.CI || process.env.CONTAINER || isRoot) { launchArgs.push('--no-sandbox'); } @@ -290,10 +292,10 @@ export class BrowserManager { const fs = require('fs'); const path = require('path'); const gstackDir = path.join(process.env.HOME || '/tmp', '.gstack'); - fs.mkdirSync(gstackDir, { recursive: true }); + mkdirSecure(gstackDir); const authFile = path.join(gstackDir, '.auth.json'); try { - fs.writeFileSync(authFile, JSON.stringify({ token: authToken, port: this.serverPort || 34567 }), { mode: 0o600 }); + writeSecureFile(authFile, JSON.stringify({ token: authToken, port: this.serverPort || 34567 })); } catch (err: any) { console.warn(`[browse] Could not write .auth.json: ${err.message}`); } diff --git a/browse/src/browser-skill-write.ts b/browse/src/browser-skill-write.ts index 55ffd9e2c4..81599b419b 100644 --- a/browse/src/browser-skill-write.ts +++ b/browse/src/browser-skill-write.ts @@ -19,6 +19,7 @@ import * as fs from 'fs'; import * as path from 'path'; import * as os from 'os'; +import { mkdirSecure } from './file-permissions'; import { isPathWithin } from './platform'; import type { TierPaths } from './browser-skills'; import { defaultTierPaths } from './browser-skills'; @@ -74,8 +75,8 @@ export function stageSkill(opts: StageSkillOptions): string { const wrapperDir = path.join(tmpRoot, `skillify-${spawnId}`); const stagedDir = path.join(wrapperDir, opts.name); - fs.mkdirSync(wrapperDir, { recursive: true, mode: 0o700 }); - fs.mkdirSync(stagedDir, { recursive: true, mode: 0o700 }); + mkdirSecure(wrapperDir); + mkdirSecure(stagedDir); for (const [relPath, contents] of opts.files) { if 
(relPath.startsWith('/') || relPath.includes('..')) { diff --git a/browse/src/cli.ts b/browse/src/cli.ts index 3ddbf2f3bc..4f523bea7a 100644 --- a/browse/src/cli.ts +++ b/browse/src/cli.ts @@ -12,6 +12,7 @@ import * as fs from 'fs'; import * as path from 'path'; import { safeUnlink, safeUnlinkQuiet, safeKill, isProcessAlive } from './error-handling'; +import { writeSecureFile, mkdirSecure } from './file-permissions'; import { resolveConfig, ensureStateDir, readVersionHash } from './config'; import { parseProxyConfig, computeConfigHash, ProxyConfigError } from './proxy-config'; import { redactProxyUrl } from './proxy-redact'; @@ -852,7 +853,7 @@ async function handlePairAgent(state: ServerState, args: string[]): Promise= PROMOTE_THRESHOLD AND flag_count == 0 → state:active - * - else stay quarantined with updated counter + * - if use_count >= PROMOTE_THRESHOLD AND flag_count == 0 AND L4 has scored + * the body (classifier_score > 0) → state:active + * - else stay quarantined with updated counter; user must run + * `domain-skill promote-to-global` manually + * + * The classifier_score > 0 gate is load-bearing: handleSave currently writes + * classifier_score=0 with the comment "L4 deferred to load-time / sidebar-agent + * fills this in on first prompt-injection load," but sidebar-agent was ripped + * (CLAUDE.md "Sidebar architecture") and nothing else updates the score, so + * skills authored via the production path never had their body scanned by L4. + * Without this gate, three benign uses promote any quarantined skill — including + * one written under the influence of a poisoned page — into the prompt context + * for every subsequent visit. The gate re-opens automatically the day L4 is + * rewired and writeSkill / recordSkillUse start receiving non-zero scores. 
*/ export async function recordSkillUse(host: string, projectSlug: string, classifierFlagged: boolean): Promise { const normalized = normalizeHost(host); @@ -303,7 +315,12 @@ export async function recordSkillUse(host: string, projectSlug: string, classifi const useCount = current.use_count + 1; const flagCount = current.flag_count + (classifierFlagged ? 1 : 0); let state: SkillState = current.state; - if (state === 'quarantined' && useCount >= PROMOTE_THRESHOLD && flagCount === 0) { + if ( + state === 'quarantined' && + useCount >= PROMOTE_THRESHOLD && + flagCount === 0 && + current.classifier_score > 0 + ) { state = 'active'; } const updated: DomainSkillRow = { diff --git a/browse/src/file-permissions.ts b/browse/src/file-permissions.ts new file mode 100644 index 0000000000..d3d404acde --- /dev/null +++ b/browse/src/file-permissions.ts @@ -0,0 +1,157 @@ +/** + * Cross-platform file permission restriction for sensitive gstack state. + * + * Why this exists + * ---------------- + * POSIX mode bits (`0o600` for files, `0o700` for dirs) are how gstack marks + * sensitive state files — auth tokens, canary tokens, chat history, agent + * queue, device salt, per-tab security decisions. On Linux and macOS, + * `fs.chmodSync(path, 0o600)` and `fs.writeFileSync(path, data, { mode: 0o600 })` + * do exactly what you'd hope: the file ends up readable and writable only + * by the owning user, no access for group / other. + * + * On Windows, both calls are effectively no-ops. NTFS uses ACLs, not POSIX + * mode bits, and Node's fs module doesn't translate. So on every Windows + * install, sensitive gstack state files inherit whatever ACL the parent + * directory grants — typically user-full + inherited admin-full. 
That's + * fine on a single-user laptop but leaks on: + * + * - Self-hosted CI runners (GitHub Actions / GitLab / Jenkins agents + * running as a different service account on the same box — they can + * read developer state) + * - Shared development machines (agencies, studios, lab machines) + * - Multi-tenant servers with shared home directories + * - Malware running as the same user (no in-user-account isolation) + * + * This module wraps the platform-correct call. POSIX: chmod. Windows: + * icacls with inheritance break + explicit user grant. Failures on either + * platform are best-effort — the filesystem is still functional if ACL + * restriction fails; we just don't hit the intended hardening target. + * + * Warning behavior: to avoid spamming the console on a machine where + * icacls is unavailable (rare — it ships in System32 on every Windows + * version since 7), we log the first failure per process and stay silent + * afterward. The warning includes the advice "sensitive files may be + * readable by other accounts on this machine" so operators know to audit + * their runner / share setup. + */ + +import { execFileSync } from 'child_process'; +import * as fs from 'fs'; +import * as os from 'os'; + +let warnedOnce = false; + +function warnIcaclsFailure(fsPath: string, err: unknown): void { + if (warnedOnce) return; + warnedOnce = true; + const msg = err instanceof Error ? err.message : String(err); + // biome-ignore lint/suspicious/noConsole: intentional user-facing warning + console.warn( + `[gstack] Failed to restrict Windows ACL on ${fsPath}: ${msg}\n` + + ` Sensitive files may be readable by other accounts on this machine.\n` + + ` This warning appears once per process; subsequent failures are silent.` + ); +} + +/** + * Restrict a file to owner-only access (POSIX 0o600 equivalent). + * + * POSIX: `fs.chmodSync(path, 0o600)`. Idempotent if the file was already + * written with `{ mode: 0o600 }`, so safe to call regardless. 
+ * + * Windows: invokes `icacls <path> /inheritance:r /grant:r <user>:(F)` to remove + * any inherited ACLs and replace the ACL with a single entry granting the + * current user full control. + */ +export function restrictFilePermissions(filePath: string): void { + if (process.platform === 'win32') { + try { + const user = os.userInfo().username; + execFileSync( + 'icacls', + [filePath, '/inheritance:r', '/grant:r', `${user}:(F)`], + { stdio: 'ignore' }, + ); + } catch (err) { + warnIcaclsFailure(filePath, err); + } + return; + } + try { fs.chmodSync(filePath, 0o600); } catch { /* best-effort */ } +} + +/** + * Restrict a directory to owner-only access (POSIX 0o700 equivalent), + * with new children inheriting the restricted ACL. + * + * POSIX: `fs.chmodSync(path, 0o700)`. Idempotent if the dir was already + * created with `{ mode: 0o700 }`. + * + * Windows: `icacls <path> /inheritance:r /grant:r <user>:(OI)(CI)(F)`. The + * `(OI)(CI)` flags make new files (OI = object inherit) and subdirs + * (CI = container inherit) inherit the single-user-full ACL — important + * because child creations in `fs.writeFileSync(...)` without explicit + * `restrictFilePermissions` still end up owner-only. + */ +export function restrictDirectoryPermissions(dirPath: string): void { + if (process.platform === 'win32') { + try { + const user = os.userInfo().username; + execFileSync( + 'icacls', + [dirPath, '/inheritance:r', '/grant:r', `${user}:(OI)(CI)(F)`], + { stdio: 'ignore' }, + ); + } catch (err) { + warnIcaclsFailure(dirPath, err); + } + return; + } + try { fs.chmodSync(dirPath, 0o700); } catch { /* best-effort */ } +} + +/** + * Write a file and restrict it to owner-only access, cross-platform. + * Replaces `fs.writeFileSync(path, data, { mode: 0o600 })` + Windows ACL.
+ */ +export function writeSecureFile( + filePath: string, + data: string | NodeJS.ArrayBufferView, +): void { + fs.writeFileSync(filePath, data, { mode: 0o600 }); + restrictFilePermissions(filePath); +} + +/** + * Append to a file with owner-only permissions, cross-platform. + * Replaces `fs.appendFileSync(path, data, { mode: 0o600 })` + Windows ACL. + * + * ACL is applied only on first write — subsequent appends are fire-and-forget + * (no need to re-run icacls on every log line). + */ +export function appendSecureFile( + filePath: string, + data: string | NodeJS.ArrayBufferView, +): void { + const existed = fs.existsSync(filePath); + fs.appendFileSync(filePath, data, { mode: 0o600 }); + if (!existed) restrictFilePermissions(filePath); +} + +/** + * `mkdir -p` with owner-only directory permissions, cross-platform. + * Replaces `fs.mkdirSync(path, { recursive: true, mode: 0o700 })` + Windows ACL. + * Safe to call on an existing directory — re-applies the ACL idempotently. + */ +export function mkdirSecure(dirPath: string): void { + fs.mkdirSync(dirPath, { recursive: true, mode: 0o700 }); + restrictDirectoryPermissions(dirPath); +} + +/** + * Reset the once-per-process warning gate. Test-only. 
+ */ +export function __resetWarnedForTests(): void { + warnedOnce = false; +} diff --git a/browse/src/meta-commands.ts b/browse/src/meta-commands.ts index 543185bf2e..c505d4cf41 100644 --- a/browse/src/meta-commands.ts +++ b/browse/src/meta-commands.ts @@ -16,6 +16,7 @@ export { validateOutputPath, escapeRegExp } from './path-security'; import * as Diff from 'diff'; import * as fs from 'fs'; import * as path from 'path'; +import { writeSecureFile, mkdirSecure } from './file-permissions'; import { TEMP_DIR } from './platform'; import { resolveConfig } from './config'; import type { Frame } from 'playwright'; @@ -917,7 +918,7 @@ export async function handleMetaCommand( const config = resolveConfig(); const stateDir = path.join(config.stateDir, 'browse-states'); - fs.mkdirSync(stateDir, { recursive: true }); + mkdirSecure(stateDir); const statePath = path.join(stateDir, `${name}.json`); if (action === 'save') { @@ -929,7 +930,7 @@ export async function handleMetaCommand( cookies: state.cookies, pages: state.pages.map(p => ({ url: p.url, isActive: p.isActive })), }; - fs.writeFileSync(statePath, JSON.stringify(saveData, null, 2), { mode: 0o600 }); + writeSecureFile(statePath, JSON.stringify(saveData, null, 2)); return `State saved: ${statePath} (${state.cookies.length} cookies, ${state.pages.length} pages)\n⚠️ Cookies stored in plaintext. 
Delete when no longer needed.`; } diff --git a/browse/src/security-classifier.ts b/browse/src/security-classifier.ts index d631df506e..5cd852bb29 100644 --- a/browse/src/security-classifier.ts +++ b/browse/src/security-classifier.ts @@ -29,6 +29,7 @@ import { spawn } from 'child_process'; import * as fs from 'fs'; import * as path from 'path'; import * as os from 'os'; +import { mkdirSecure } from './file-permissions'; import { THRESHOLDS, type LayerSignal } from './security'; import { resolveClaudeCommand } from './claude-bin'; @@ -156,7 +157,7 @@ async function downloadFile(url: string, dest: string): Promise { } async function ensureTestsavantStaged(onProgress?: (msg: string) => void): Promise { - fs.mkdirSync(path.join(TESTSAVANT_DIR, 'onnx'), { recursive: true, mode: 0o700 }); + mkdirSecure(path.join(TESTSAVANT_DIR, 'onnx')); // Small config/tokenizer files for (const f of TESTSAVANT_FILES) { @@ -301,7 +302,7 @@ export async function scanPageContent(text: string): Promise { // ─── L4c: DeBERTa-v3 ensemble (opt-in) ─────────────────────── async function ensureDebertaStaged(onProgress?: (msg: string) => void): Promise { - fs.mkdirSync(path.join(DEBERTA_DIR, 'onnx'), { recursive: true, mode: 0o700 }); + mkdirSecure(path.join(DEBERTA_DIR, 'onnx')); for (const f of DEBERTA_FILES) { const dst = path.join(DEBERTA_DIR, f); if (fs.existsSync(dst)) continue; diff --git a/browse/src/security.ts b/browse/src/security.ts index 22009e0c36..d0e4039712 100644 --- a/browse/src/security.ts +++ b/browse/src/security.ts @@ -24,6 +24,7 @@ import { spawn } from 'child_process'; import * as fs from 'fs'; import * as path from 'path'; import * as os from 'os'; +import { writeSecureFile, appendSecureFile, mkdirSecure } from './file-permissions'; // ─── Thresholds + verdict types ────────────────────────────── @@ -344,11 +345,11 @@ function getDeviceSalt(): string { // fall through to generate } try { - fs.mkdirSync(SECURITY_DIR, { recursive: true, mode: 0o700 }); + 
mkdirSecure(SECURITY_DIR); } catch {} cachedSalt = randomBytes(16).toString('hex'); try { - fs.writeFileSync(SALT_FILE, cachedSalt, { mode: 0o600 }); + writeSecureFile(SALT_FILE, cachedSalt); } catch { // Can't persist (read-only fs, disk full). Keep the in-memory salt // for this process so cross-log correlation still works within a @@ -413,6 +414,61 @@ function findTelemetryBinary(): string | null { return null; } +/** + * Resolve a bash binary for invoking shebang scripts on Windows. Mirrors the + * GSTACK_*_BIN override pattern from `browse/src/claude-bin.ts:resolveClaudeCommand` + * (introduced in v1.24.0.0 #1252) so users on WSL/MSYS2/non-default Git Bash + * installs can redirect. + * + * Override precedence: + * 1. GSTACK_BASH_BIN (or BASH_BIN) — absolute path or PATH-resolvable command. + * 2. Plain Bun.which('bash') — finds Git Bash on the standard Windows install. + * + * Returns null if nothing resolves; callers must degrade gracefully (telemetry + * already swallows spawn errors, so a null here means the local attempts.jsonl + * audit trail keeps working without surfacing a Windows-only failure). + */ +export function resolveBashBinary(env: NodeJS.ProcessEnv = process.env): string | null { + const PATH = env.PATH ?? env.Path ?? ''; + const override = (env.GSTACK_BASH_BIN ?? env.BASH_BIN)?.trim(); + if (override) { + const trimmed = override.replace(/^"(.*)"$/, '$1'); + return path.isAbsolute(trimmed) ? trimmed : (Bun.which(trimmed, { PATH }) ?? null); + } + return Bun.which('bash', { PATH }) ?? null; +} + +/** + * Build the [cmd, args] tuple for invoking a bash-script telemetry binary + * in a way that works on both POSIX and Windows. + * + * POSIX: returns [bin, args] unchanged — shebang gets honored by execve. + * Win32: wraps in bash explicitly. 
`gstack-telemetry-log` is a shell script + * (`#!/usr/bin/env bash`) and Windows `CreateProcess` can't dispatch on a + * shebang — it tries to load the file as a PE image, fails with ENOEXEC, + * and our 'error' handler silently swallows it. Resolves bash via the same + * Bun.which + GSTACK_*_BIN override pattern as claude-bin.ts. + * + * Returns null when bash can't be resolved on Windows (rare — Git Bash ships + * with the standard gstack install path). Caller skips spawn; the local + * attempts.jsonl write still gives the audit trail. + * + * Exported for testability — resolution is a pure function of (platform, + * env, bin, args) so we can assert on it without actually spawning. + */ +export function buildTelemetrySpawnCommand( + bin: string, + args: string[], + env: NodeJS.ProcessEnv = process.env, +): { cmd: string; cmdArgs: string[] } | null { + if (process.platform === 'win32') { + const bashPath = resolveBashBinary(env); + if (!bashPath) return null; + return { cmd: bashPath, cmdArgs: [bin, ...args] }; + } + return { cmd: bin, cmdArgs: args }; +} + /** * Fire-and-forget subprocess invocation of gstack-telemetry-log with the * attack_attempt event type. The binary handles tier gating internally @@ -426,14 +482,16 @@ function reportAttemptTelemetry(record: AttemptRecord): void { const bin = findTelemetryBinary(); if (!bin) return; try { - const child = spawn(bin, [ + const result = buildTelemetrySpawnCommand(bin, [ '--event-type', 'attack_attempt', '--url-domain', record.urlDomain || '', '--payload-hash', record.payloadHash, '--confidence', String(record.confidence), '--layer', record.layer, '--verdict', record.verdict, - ], { + ]); + if (!result) return; + const child = spawn(result.cmd, result.cmdArgs, { stdio: 'ignore', detached: true, }); @@ -456,10 +514,10 @@ export function logAttempt(record: AttemptRecord): boolean { // the event reported (it goes to a different directory anyway). 
reportAttemptTelemetry(record); try { - fs.mkdirSync(SECURITY_DIR, { recursive: true, mode: 0o700 }); + mkdirSecure(SECURITY_DIR); rotateIfNeeded(); const line = JSON.stringify(record) + '\n'; - fs.appendFileSync(ATTEMPTS_LOG, line, { mode: 0o600 }); + appendSecureFile(ATTEMPTS_LOG, line); return true; } catch (err) { // Non-fatal. Log to stderr for debugging but don't block. @@ -489,9 +547,9 @@ export interface SessionState { */ export function writeSessionState(state: SessionState): void { try { - fs.mkdirSync(SECURITY_DIR, { recursive: true, mode: 0o700 }); + mkdirSecure(SECURITY_DIR); const tmp = `${STATE_FILE}.tmp.${process.pid}`; - fs.writeFileSync(tmp, JSON.stringify(state, null, 2), { mode: 0o600 }); + writeSecureFile(tmp, JSON.stringify(state, null, 2)); fs.renameSync(tmp, STATE_FILE); } catch (err) { console.error('[security] writeSessionState failed:', (err as Error).message); @@ -532,10 +590,10 @@ export interface DecisionRecord { export function writeDecision(record: DecisionRecord): void { try { - fs.mkdirSync(DECISIONS_DIR, { recursive: true, mode: 0o700 }); + mkdirSecure(DECISIONS_DIR); const file = decisionFileForTab(record.tabId); const tmp = `${file}.tmp.${process.pid}`; - fs.writeFileSync(tmp, JSON.stringify(record), { mode: 0o600 }); + writeSecureFile(tmp, JSON.stringify(record)); fs.renameSync(tmp, file); } catch (err) { console.error('[security] writeDecision failed:', (err as Error).message); diff --git a/browse/src/server.ts b/browse/src/server.ts index 4df55ad8b5..81af14acdb 100644 --- a/browse/src/server.ts +++ b/browse/src/server.ts @@ -26,6 +26,7 @@ import { markHiddenElements, getCleanTextWithStripping, cleanupHiddenMarkers, } from './content-security'; import { generateCanary, injectCanary, getStatus as getSecurityStatus, writeDecision } from './security'; +import { writeSecureFile, mkdirSecure } from './file-permissions'; import { handleSnapshot, SNAPSHOT_FLAGS } from './snapshot'; import { initRegistry, validateToken as 
validateScopedToken, checkScope, checkDomain,
@@ -317,6 +318,27 @@ const CONSOLE_LOG_PATH = config.consoleLog;
 const NETWORK_LOG_PATH = config.networkLog;
 const DIALOG_LOG_PATH = config.dialogLog;
+/**
+ * Per-process state-file temp path. The state-file write pattern is
+ * `writeFileSync(tmp, ...) → renameSync(tmp, stateFile)` for atomicity,
+ * but a shared `${stateFile}.tmp` filename means two concurrent writers
+ * (cold-start race when N CLIs hit a fresh repo simultaneously, parallel
+ * /tunnel/start handlers, or a combination) collide on the rename:
+ * writer A's renameSync moves the shared temp file into place while
+ * writer B is still between its writeFileSync and its renameSync, so
+ * the temp path B is about to rename can vanish out from under it.
+ * Worst case B's renameSync throws ENOENT mid-startup, killing one of
+ * the spawning daemons.
+ *
+ * Per-process suffix (pid + 4 random bytes) makes each writer's temp
+ * path unique. The atomic rename still gives last-writer-wins semantics
+ * for the final state.json content; the only behavior change is that
+ * concurrent writers no longer kill each other on the rename.
+ */
+function tmpStatePath(): string {
+  return `${config.stateFile}.tmp.${process.pid}.${crypto.randomBytes(4).toString('hex')}`;
+}
+
 // ─── Sidebar agent / chat state ripped ──────────────────────────────
 // ChatEntry, SidebarSession, TabAgentState interfaces; chatBuffer,
@@ -328,6 +350,7 @@ const DIALOG_LOG_PATH = config.dialogLog;
 // terminal-agent.ts; chat queue + per-tab agent multiplexing are no
 // longer needed.
+let lastConsoleFlushed = 0; let lastNetworkFlushed = 0; let lastDialogFlushed = 0; let flushInProgress = false; @@ -1596,7 +1619,7 @@ async function start() { // Update state file const stateContent = JSON.parse(fs.readFileSync(config.stateFile, 'utf-8')); stateContent.tunnel = { url: tunnelUrl, domain: domain || null, startedAt: new Date().toISOString() }; - const tmpState = config.stateFile + '.tmp'; + const tmpState = tmpStatePath(); fs.writeFileSync(tmpState, JSON.stringify(stateContent, null, 2), { mode: 0o600 }); fs.renameSync(tmpState, config.stateFile); @@ -2126,7 +2149,7 @@ async function start() { // without clobbering a recycled PID. ...(xvfb ? { xvfbPid: xvfb.pid, xvfbStartTime: xvfb.startTime, xvfbDisplay: xvfb.display } : {}), }; - const tmpFile = config.stateFile + '.tmp'; + const tmpFile = tmpStatePath(); fs.writeFileSync(tmpFile, JSON.stringify(state, null, 2), { mode: 0o600 }); fs.renameSync(tmpFile, config.stateFile); @@ -2207,7 +2230,7 @@ async function start() { // Update state file with tunnel URL const stateContent = JSON.parse(fs.readFileSync(config.stateFile, 'utf-8')); stateContent.tunnel = { url: tunnelUrl, domain: domain || null, startedAt: new Date().toISOString() }; - const tmpState = config.stateFile + '.tmp'; + const tmpState = tmpStatePath(); fs.writeFileSync(tmpState, JSON.stringify(stateContent, null, 2), { mode: 0o600 }); fs.renameSync(tmpState, config.stateFile); } catch (err: any) { @@ -2237,7 +2260,7 @@ async function start() { console.log(`[browse] Tunnel listener bound (local-only test mode) on 127.0.0.1:${tunnelPort}`); const stateContent = JSON.parse(fs.readFileSync(config.stateFile, 'utf-8')); stateContent.tunnelLocalPort = tunnelPort; - const tmpState = config.stateFile + '.tmp'; + const tmpState = tmpStatePath(); fs.writeFileSync(tmpState, JSON.stringify(stateContent, null, 2), { mode: 0o600 }); fs.renameSync(tmpState, config.stateFile); } catch (err: any) { @@ -2252,8 +2275,8 @@ start().catch((err) => { // stderr 
because the server is launched with detached: true, stdio: 'ignore'. try { const errorLogPath = path.join(config.stateDir, 'browse-startup-error.log'); - fs.mkdirSync(config.stateDir, { recursive: true, mode: 0o700 }); - fs.writeFileSync(errorLogPath, `${new Date().toISOString()} ${err.message}\n${err.stack || ''}\n`, { mode: 0o600 }); + mkdirSecure(config.stateDir); + writeSecureFile(errorLogPath, `${new Date().toISOString()} ${err.message}\n${err.stack || ''}\n`); } catch { // stateDir may not exist — nothing more we can do } diff --git a/browse/src/tab-session.ts b/browse/src/tab-session.ts index 739942689a..dccabf28f4 100644 --- a/browse/src/tab-session.ts +++ b/browse/src/tab-session.ts @@ -149,9 +149,16 @@ export class TabSession { * Use this for operations that work on both Page and Frame (locator, evaluate, etc.). */ getActiveFrameOrPage(): Page | Frame { - // Auto-recover from detached frames (iframe removed/navigated) + // Auto-recover from detached frames (iframe removed/navigated). Clear + // refs alongside the activeFrame — same staleness condition as + // onMainFrameNavigated() below: refs were captured against a frame + // that no longer exists. Without this, refMap entries linger against + // a dead frame after silently falling back to the main page; the + // next snapshot's role+name keys collide with stale entries and the + // resolver picks one at random. if (this.activeFrame?.isDetached()) { this.activeFrame = null; + this.clearRefs(); } return this.activeFrame ?? 
this.page; } diff --git a/browse/src/terminal-agent.ts b/browse/src/terminal-agent.ts index 9ebc8cbbf2..189a49f096 100644 --- a/browse/src/terminal-agent.ts +++ b/browse/src/terminal-agent.ts @@ -23,6 +23,7 @@ import * as fs from 'fs'; import * as path from 'path'; import * as crypto from 'crypto'; +import { writeSecureFile, mkdirSecure } from './file-permissions'; import { safeUnlink } from './error-handling'; const STATE_FILE = process.env.BROWSE_STATE_FILE || path.join(process.env.HOME || '/tmp', '.gstack', 'browse.json'); @@ -83,7 +84,7 @@ function findClaude(): string | null { /** Probe + persist claude availability for the bootstrap card. */ function writeClaudeAvailable(): void { const stateDir = path.dirname(STATE_FILE); - try { fs.mkdirSync(stateDir, { recursive: true, mode: 0o700 }); } catch {} + try { mkdirSecure(stateDir); } catch {} const found = findClaude(); const status = { available: !!found, @@ -94,7 +95,7 @@ function writeClaudeAvailable(): void { const target = path.join(stateDir, 'claude-available.json'); const tmp = path.join(stateDir, `.tmp-claude-${process.pid}`); try { - fs.writeFileSync(tmp, JSON.stringify(status, null, 2), { mode: 0o600 }); + writeSecureFile(tmp, JSON.stringify(status, null, 2)); fs.renameSync(tmp, target); } catch { safeUnlink(tmp); @@ -361,8 +362,26 @@ function buildServer() { // Binary input. Lazy-spawn claude on the first byte. if (!session.spawned) { session.spawned = true; + // UTF-8 boundary detection to prevent splitting multi-byte characters (issue #1272). + // Buffer incomplete UTF-8 sequences until the next chunk completes them. + let leftover = Buffer.alloc(0); const proc = spawnClaude(session.cols, session.rows, (chunk) => { - try { ws.sendBinary(chunk); } catch {} + const combined = Buffer.concat([leftover, Buffer.from(chunk)]); + // Find the last index where a UTF-8 codepoint ends. Look back at most 3 bytes. 
+ let safeEnd = combined.length; + for (let i = combined.length - 1; i >= Math.max(0, combined.length - 3); i--) { + const b = combined[i]; + if ((b & 0x80) === 0) { safeEnd = i + 1; break; } // ASCII + if ((b & 0xC0) === 0x80) continue; // continuation byte + const expected = (b & 0xE0) === 0xC0 ? 2 : (b & 0xF0) === 0xE0 ? 3 : 4; + safeEnd = (combined.length - i >= expected) ? combined.length : i; + break; + } + const flush = combined.slice(0, safeEnd); + leftover = combined.slice(safeEnd); + if (flush.length) { + try { ws.sendBinary(flush); } catch {} + } }); if (!proc) { try { @@ -422,7 +441,7 @@ function handleTabState(msg: { reason?: string; }): void { const stateDir = path.dirname(STATE_FILE); - try { fs.mkdirSync(stateDir, { recursive: true, mode: 0o700 }); } catch {} + try { mkdirSecure(stateDir); } catch {} // tabs.json — full list if (Array.isArray(msg.tabs)) { @@ -442,7 +461,7 @@ function handleTabState(msg: { const target = path.join(stateDir, 'tabs.json'); const tmp = path.join(stateDir, `.tmp-tabs-${process.pid}`); try { - fs.writeFileSync(tmp, JSON.stringify(payload, null, 2), { mode: 0o600 }); + writeSecureFile(tmp, JSON.stringify(payload, null, 2)); fs.renameSync(tmp, target); } catch { safeUnlink(tmp); @@ -457,11 +476,11 @@ function handleTabState(msg: { const ctxFile = path.join(stateDir, 'active-tab.json'); const tmp = path.join(stateDir, `.tmp-tab-${process.pid}`); try { - fs.writeFileSync(tmp, JSON.stringify({ + writeSecureFile(tmp, JSON.stringify({ tabId: active.tabId ?? null, url: active.url, title: active.title ?? '', - }), { mode: 0o600 }); + })); fs.renameSync(tmp, ctxFile); } catch { safeUnlink(tmp); @@ -477,11 +496,11 @@ function handleTabSwitch(msg: { tabId?: number; url?: string; title?: string }): const ctxFile = path.join(stateDir, 'active-tab.json'); const tmp = path.join(stateDir, `.tmp-tab-${process.pid}`); try { - fs.writeFileSync(tmp, JSON.stringify({ + writeSecureFile(tmp, JSON.stringify({ tabId: msg.tabId ?? 
null, url,
     title: msg.title ?? '',
-  }), { mode: 0o600 });
+  }));
     fs.renameSync(tmp, ctxFile);
   } catch {
     safeUnlink(tmp);
@@ -524,9 +543,9 @@ function main() {
   // Write port file atomically so the parent server can pick it up.
   const dir = path.dirname(PORT_FILE);
-  try { fs.mkdirSync(dir, { recursive: true, mode: 0o700 }); } catch {}
+  try { mkdirSecure(dir); } catch {}
   const tmp = `${PORT_FILE}.tmp-${process.pid}`;
-  fs.writeFileSync(tmp, String(port), { mode: 0o600 });
+  writeSecureFile(tmp, String(port));
   fs.renameSync(tmp, PORT_FILE);

   // Hand the parent the internal token so it can call /internal/grant.
@@ -549,8 +568,8 @@ function main() {
 // to a state file the parent reads. This avoids env-passing races. See main().
 const INTERNAL_TOKEN_FILE = path.join(path.dirname(STATE_FILE), 'terminal-internal-token');
 try {
-  fs.mkdirSync(path.dirname(INTERNAL_TOKEN_FILE), { recursive: true, mode: 0o700 });
-  fs.writeFileSync(INTERNAL_TOKEN_FILE, INTERNAL_TOKEN, { mode: 0o600 });
+  mkdirSecure(path.dirname(INTERNAL_TOKEN_FILE));
+  writeSecureFile(INTERNAL_TOKEN_FILE, INTERNAL_TOKEN);
 } catch {}

 main();
diff --git a/browse/src/tunnel-denial-log.ts b/browse/src/tunnel-denial-log.ts
index 2676594078..82b9c34a5e 100644
--- a/browse/src/tunnel-denial-log.ts
+++ b/browse/src/tunnel-denial-log.ts
@@ -18,6 +18,7 @@
 import { promises as fsp } from 'fs';
 import * as path from 'path';
 import * as os from 'os';
+import { mkdirSecure } from './file-permissions';

 const LOG_DIR = path.join(os.homedir(), '.gstack', 'security');
 const LOG_PATH = path.join(LOG_DIR, 'attempts.jsonl');
@@ -31,7 +32,10 @@ let dirEnsured = false;
 async function ensureDir(): Promise<void> {
   if (dirEnsured) return;
   try {
-    await fsp.mkdir(LOG_DIR, { recursive: true, mode: 0o700 });
+    // Sync mkdir is fine here — runs once per process at first denial. The
+    // (OI)(CI) inheritance set on Windows means subsequent fsp.appendFile
+    // writes pick up the owner-only ACL automatically.
+ mkdirSecure(LOG_DIR); dirEnsured = true; } catch { // Swallow — log writes are best-effort. Failure to mkdir just means diff --git a/browse/test/browse-client.test.ts b/browse/test/browse-client.test.ts index 1def4a8833..61853f9941 100644 --- a/browse/test/browse-client.test.ts +++ b/browse/test/browse-client.test.ts @@ -87,6 +87,18 @@ describe('browse-client', () => { expect(auth.source).toBe('env'); }); + it('rejects GSTACK_PORT env values with trailing characters', () => { + process.env.GSTACK_PORT = `${server.port}abc`; + process.env.GSTACK_SKILL_TOKEN = 'scoped-token'; + const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'browse-client-test-')); + try { + expect(() => resolveBrowseAuth({ stateFile: path.join(tmpDir, 'missing.json') })) + .toThrow('browse-client: cannot find daemon port + token'); + } finally { + fs.rmSync(tmpDir, { recursive: true, force: true }); + } + }); + it('falls back to state file when env vars missing', () => { const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'browse-client-test-')); const stateFile = path.join(tmpDir, 'browse.json'); @@ -154,6 +166,13 @@ describe('browse-client', () => { expect(server.requests[0].body).toEqual({ command: 'text', args: [], tabId: 7 }); }); + it('omits tabId when BROWSE_TAB has trailing characters', async () => { + process.env.BROWSE_TAB = '7abc'; + const client = new BrowseClient({ port: server.port, token: 't' }); + await client.command('text', []); + expect(server.requests[0].body).toEqual({ command: 'text', args: [] }); + }); + it('throws BrowseClientError with status on non-2xx', async () => { const client = new BrowseClient({ port: server.port, token: 't' }); server.setResponse(403, JSON.stringify({ error: 'Insufficient scope' })); diff --git a/browse/test/domain-skills-e2e.test.ts b/browse/test/domain-skills-e2e.test.ts index 4c26ac56b5..29d33c4bc8 100644 --- a/browse/test/domain-skills-e2e.test.ts +++ b/browse/test/domain-skills-e2e.test.ts @@ -84,11 +84,34 @@ describe('$B domain-skill 
(E2E gate tier)', () => {
     expect(out).toContain('[quarantined] 127.0.0.1');
   });

-  test('readSkill returns null until the skill is promoted to active (T6)', async () => {
+  test('readSkill returns null while quarantined; classifier_score=0 blocks auto-promote (#1369)', async () => {
     const { readSkill, recordSkillUse } = await import('../src/domain-skills');
+    const jsonlPath = path.join(TMP_HOME, 'projects', 'e2e-test-slug', 'learnings.jsonl');
+
+    // While quarantined, readSkill returns null
     expect(await readSkill('127.0.0.1', 'e2e-test-slug')).toBeNull();
-    // Three uses without flag triggers auto-promote
+
+    // Three unflagged uses with classifier_score=0 (the default until L4 is
+    // rewired) MUST leave the skill quarantined per #1369. The gate is
+    // load-bearing: a quarantined skill written under the influence of a
+    // poisoned page would otherwise auto-promote after three benign uses
+    // without the L4 body scan ever running.
+    await recordSkillUse('127.0.0.1', 'e2e-test-slug', false);
+    await recordSkillUse('127.0.0.1', 'e2e-test-slug', false);
+    await recordSkillUse('127.0.0.1', 'e2e-test-slug', false);
+    expect(await readSkill('127.0.0.1', 'e2e-test-slug')).toBeNull();
+
+    // Simulate L4 having scored the body (classifier_score > 0) by appending a
+    // new tombstone row with a non-zero score, then verify the next three
+    // uses promote. This documents the unblock path the day L4 starts
+    // populating classifier_score for skill writes again.
+ const lines = (await fs.readFile(jsonlPath, 'utf8')).trim().split('\n').map((l) => JSON.parse(l)); + const latest = lines.filter((r: any) => r.type === 'domain' && r.host === '127.0.0.1').pop(); + expect(latest).toBeTruthy(); + const scored = { ...latest, classifier_score: 0.05, version: latest.version + 1, updated_ts: new Date().toISOString() }; + await fs.appendFile(jsonlPath, JSON.stringify(scored) + '\n'); + + // Now three uses promote await recordSkillUse('127.0.0.1', 'e2e-test-slug', false); await recordSkillUse('127.0.0.1', 'e2e-test-slug', false); await recordSkillUse('127.0.0.1', 'e2e-test-slug', false); diff --git a/browse/test/domain-skills-storage.test.ts b/browse/test/domain-skills-storage.test.ts index cdc238f183..df53d8bc92 100644 --- a/browse/test/domain-skills-storage.test.ts +++ b/browse/test/domain-skills-storage.test.ts @@ -106,6 +106,31 @@ describe('domain-skills: state machine (T6)', () => { }) ).rejects.toThrow(/classifier flagged/); }); + + // domain-skill-commands.ts:140 (handleSave) writes classifier_score=0 with + // the comment "L4 deferred to load-time" — but sidebar-agent (the deferred + // scanner) was ripped per CLAUDE.md "Sidebar architecture." Without an + // explicit gate, three benign uses promote any quarantined skill, including + // one authored under a poisoned page, into prompt context permanently. 
+ it('does NOT auto-promote when classifier_score is 0 (production handleSave shape)', async () => { + const m = await freshImport(); + await m.writeSkill({ + host: 'linkedin.com', + body: '# LinkedIn', + projectSlug: 'test-slug', + source: 'agent', + classifierScore: 0, // matches domain-skill-commands.ts:140 production path + }); + const after3 = await m.recordSkillUse('linkedin.com', 'test-slug', false); + await m.recordSkillUse('linkedin.com', 'test-slug', false); + const final = await m.recordSkillUse('linkedin.com', 'test-slug', false); + expect(after3?.state).toBe('quarantined'); + expect(final?.state).toBe('quarantined'); + expect(final?.use_count).toBe(3); + // readSkill returns null for quarantined skills — they don't fire. + const read = await m.readSkill('linkedin.com', 'test-slug'); + expect(read).toBeNull(); + }); }); describe('domain-skills: scope shadowing (T4)', () => { diff --git a/browse/test/file-permissions.test.ts b/browse/test/file-permissions.test.ts new file mode 100644 index 0000000000..e073b9945c --- /dev/null +++ b/browse/test/file-permissions.test.ts @@ -0,0 +1,148 @@ +/** + * Unit tests for browse/src/file-permissions.ts + * + * Strategy: + * - POSIX assertions check fs.statSync.mode bits directly (cheap, reliable, + * runs on every CI config). + * - Windows assertions don't check ACLs (that'd require parsing icacls + * output, which is brittle across Windows versions / locales). Instead + * we verify the helper doesn't throw and the file ends up accessible + * to the current user — the "doesn't crash, file still usable" + * contract the callers rely on. 
+ */ + +import { afterEach, beforeEach, describe, expect, test } from 'bun:test'; +import * as fs from 'fs'; +import * as os from 'os'; +import * as path from 'path'; + +import { + restrictFilePermissions, + restrictDirectoryPermissions, + writeSecureFile, + appendSecureFile, + mkdirSecure, + __resetWarnedForTests, +} from '../src/file-permissions'; + +let tmpDir: string; + +beforeEach(() => { + tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'file-perms-')); + __resetWarnedForTests(); +}); + +afterEach(() => { + try { fs.rmSync(tmpDir, { recursive: true, force: true }); } catch { /* best-effort */ } +}); + +describe('restrictFilePermissions', () => { + test('on POSIX, sets file mode to 0o600', () => { + if (process.platform === 'win32') return; + const p = path.join(tmpDir, 'secret'); + fs.writeFileSync(p, 'token'); + fs.chmodSync(p, 0o644); // start world-readable to prove the call mutates it + restrictFilePermissions(p); + expect(fs.statSync(p).mode & 0o777).toBe(0o600); + }); + + test('on Windows, does not throw on an existing file', () => { + if (process.platform !== 'win32') return; + const p = path.join(tmpDir, 'secret'); + fs.writeFileSync(p, 'token'); + expect(() => restrictFilePermissions(p)).not.toThrow(); + // File remains readable by the caller — core contract. + expect(fs.readFileSync(p, 'utf8')).toBe('token'); + }); + + test('on Windows, does not throw when icacls fails (bad path)', () => { + if (process.platform !== 'win32') return; + // icacls emits an error for a nonexistent path; helper must swallow. 
+ expect(() => restrictFilePermissions(path.join(tmpDir, 'nonexistent'))).not.toThrow(); + }); +}); + +describe('restrictDirectoryPermissions', () => { + test('on POSIX, sets directory mode to 0o700', () => { + if (process.platform === 'win32') return; + const d = path.join(tmpDir, 'subdir'); + fs.mkdirSync(d, { mode: 0o755 }); + restrictDirectoryPermissions(d); + expect(fs.statSync(d).mode & 0o777).toBe(0o700); + }); + + test('on Windows, does not throw on an existing directory', () => { + if (process.platform !== 'win32') return; + const d = path.join(tmpDir, 'subdir'); + fs.mkdirSync(d); + expect(() => restrictDirectoryPermissions(d)).not.toThrow(); + }); +}); + +describe('writeSecureFile', () => { + test('writes the payload and restricts permissions atomically', () => { + const p = path.join(tmpDir, 'data'); + writeSecureFile(p, 'hello'); + expect(fs.readFileSync(p, 'utf8')).toBe('hello'); + if (process.platform !== 'win32') { + expect(fs.statSync(p).mode & 0o777).toBe(0o600); + } + }); + + test('accepts Buffer payloads', () => { + const p = path.join(tmpDir, 'buffer'); + writeSecureFile(p, Buffer.from([0xde, 0xad, 0xbe, 0xef])); + const out = fs.readFileSync(p); + expect(out.length).toBe(4); + expect(out[0]).toBe(0xde); + }); + + test('overwrites existing file', () => { + const p = path.join(tmpDir, 'existing'); + fs.writeFileSync(p, 'old', { mode: 0o644 }); + writeSecureFile(p, 'new'); + expect(fs.readFileSync(p, 'utf8')).toBe('new'); + }); +}); + +describe('appendSecureFile', () => { + test('appends to a new file and sets owner-only permissions', () => { + const p = path.join(tmpDir, 'log'); + appendSecureFile(p, 'line1\n'); + expect(fs.readFileSync(p, 'utf8')).toBe('line1\n'); + if (process.platform !== 'win32') { + expect(fs.statSync(p).mode & 0o777).toBe(0o600); + } + }); + + test('appends without re-applying ACL on subsequent writes', () => { + const p = path.join(tmpDir, 'log'); + appendSecureFile(p, 'line1\n'); + appendSecureFile(p, 'line2\n'); + 
expect(fs.readFileSync(p, 'utf8')).toBe('line1\nline2\n'); + }); +}); + +describe('mkdirSecure', () => { + test('creates directory with owner-only mode (POSIX)', () => { + if (process.platform === 'win32') return; + const d = path.join(tmpDir, 'nested', 'deep'); + mkdirSecure(d); + expect(fs.statSync(d).isDirectory()).toBe(true); + expect(fs.statSync(d).mode & 0o777).toBe(0o700); + }); + + test('is idempotent — safe to call on existing directory', () => { + const d = path.join(tmpDir, 'dir'); + mkdirSecure(d); + expect(() => mkdirSecure(d)).not.toThrow(); + }); + + test('recursive behavior: creates intermediate directories', () => { + const d = path.join(tmpDir, 'a', 'b', 'c'); + mkdirSecure(d); + expect(fs.existsSync(path.join(tmpDir, 'a'))).toBe(true); + expect(fs.existsSync(path.join(tmpDir, 'a', 'b'))).toBe(true); + expect(fs.existsSync(d)).toBe(true); + }); +}); diff --git a/browse/test/security.test.ts b/browse/test/security.test.ts index 43888cd3aa..a452b435e2 100644 --- a/browse/test/security.test.ts +++ b/browse/test/security.test.ts @@ -20,6 +20,8 @@ import { readSessionState, getStatus, extractDomain, + buildTelemetrySpawnCommand, + resolveBashBinary, type LayerSignal, } from '../src/security'; @@ -325,3 +327,77 @@ describe('extractDomain', () => { expect(extractDomain('')).toBe(''); }); }); + +// ─── Bash binary resolution (Windows shebang-script invocation) ───── + +describe('resolveBashBinary', () => { + test('on POSIX, returns the system bash via Bun.which', () => { + if (process.platform === 'win32') return; + const out = resolveBashBinary({ PATH: process.env.PATH ?? '' }); + expect(out).toBeTruthy(); + expect(out!.endsWith('bash')).toBe(true); + }); + + test('honors GSTACK_BASH_BIN absolute-path override', () => { + // Construct a synthetic absolute path; the helper short-circuits on + // path.isAbsolute and never touches the filesystem, so this is portable. + const fake = process.platform === 'win32' ? 
'C:\\opt\\bash.exe' : '/opt/custom/bash'; + const out = resolveBashBinary({ GSTACK_BASH_BIN: fake, PATH: '' }); + expect(out).toBe(fake); + }); + + test('strips wrapping double quotes from override values', () => { + const fake = process.platform === 'win32' ? 'C:\\opt\\bash.exe' : '/opt/custom/bash'; + const out = resolveBashBinary({ GSTACK_BASH_BIN: `"${fake}"`, PATH: '' }); + expect(out).toBe(fake); + }); + + test('BASH_BIN works as a fallback when GSTACK_BASH_BIN is unset', () => { + const fake = process.platform === 'win32' ? 'C:\\opt\\bash.exe' : '/opt/custom/bash'; + const out = resolveBashBinary({ BASH_BIN: fake, PATH: '' }); + expect(out).toBe(fake); + }); + + test('returns null when nothing resolves (override is unset and PATH is empty)', () => { + // Empty PATH means Bun.which finds nothing. + const out = resolveBashBinary({ PATH: '' }); + expect(out).toBeNull(); + }); +}); + +// ─── Telemetry spawn command (Windows bash wrapper, v1.24-aligned) ── + +describe('buildTelemetrySpawnCommand', () => { + const bin = '/home/user/.claude/skills/gstack/bin/gstack-telemetry-log'; + const args = ['--event-type', 'attack_attempt', '--confidence', '0.95']; + + test('on POSIX, returns the binary path and args unchanged', () => { + if (process.platform === 'win32') return; + const out = buildTelemetrySpawnCommand(bin, args); + expect(out).not.toBeNull(); + expect(out!.cmd).toBe(bin); + expect(out!.cmdArgs).toEqual(args); + }); + + test('on win32 with bash resolvable, wraps the call in bash with the script as first arg', () => { + if (process.platform !== 'win32') return; + const fakeBash = 'C:\\Program Files\\Git\\bin\\bash.exe'; + const out = buildTelemetrySpawnCommand(bin, args, { GSTACK_BASH_BIN: fakeBash, PATH: '' }); + expect(out).not.toBeNull(); + expect(out!.cmd).toBe(fakeBash); + expect(out!.cmdArgs).toEqual([bin, ...args]); + }); + + test('on win32 with bash unresolvable, returns null so caller skips spawn', () => { + if (process.platform !== 'win32') return; 
+ // No override, empty PATH — Bun.which finds nothing on Windows. + const out = buildTelemetrySpawnCommand(bin, args, { PATH: '' }); + expect(out).toBeNull(); + }); + + test('does not mutate the caller-supplied args array', () => { + const originalArgs = [...args]; + buildTelemetrySpawnCommand(bin, args); + expect(args).toEqual(originalArgs); + }); +}); diff --git a/browse/test/server-flush-trackers.test.ts b/browse/test/server-flush-trackers.test.ts new file mode 100644 index 0000000000..306729af4e --- /dev/null +++ b/browse/test/server-flush-trackers.test.ts @@ -0,0 +1,73 @@ +/** + * Regression: flushBuffers state-tracker declaration audit. + * + * `flushBuffers()` (server.ts) maintains per-buffer cursors so it only + * appends *new* entries to each on-disk log on every interval tick: + * + * const newConsoleCount = consoleBuffer.totalAdded - lastConsoleFlushed; + * const newNetworkCount = networkBuffer.totalAdded - lastNetworkFlushed; + * const newDialogCount = dialogBuffer.totalAdded - lastDialogFlushed; + * + * The trackers must be declared with `let X = 0;` at module scope so the + * subtraction returns a real number on the first tick. If a tracker is + * referenced inside flushBuffers but never declared at module scope, the + * interval throws `ReferenceError: X is not defined` every second — the + * throw is swallowed by the catch at the bottom of flushBuffers (logged + * as `[browse] Buffer flush failed: is not defined`), the + * corresponding on-disk log file is *never written*, and the regression + * is silent in production. + * + * This source-level guard catches that exact class of regression — a + * future flush-perf refactor that adds a fourth buffer cursor (or a + * future contributor that copy-pastes the `last*Flushed` pattern without + * the matching declaration) will fail this test before it ships. + * + * Pattern matches `terminal-agent.test.ts` and `dual-listener.test.ts`: + * read source as text, assert an invariant, no daemon required. 
+ */ + +import { describe, test, expect } from 'bun:test'; +import { readFileSync } from 'fs'; +import * as path from 'path'; + +const SERVER_TS = readFileSync( + path.resolve(import.meta.dir, '../src/server.ts'), + 'utf-8', +); + +describe('server.ts — flushBuffers tracker declarations', () => { + test('every `last*Flushed` tracker referenced inside flushBuffers is declared at module scope', () => { + // Locate the flushBuffers function body. The function is `async function + // flushBuffers() { ... }` — match through the closing brace at the start + // of a line (one-level-deep function in the file). + const fnMatch = SERVER_TS.match( + /async function flushBuffers\([^)]*\)[^{]*\{([\s\S]*?)\n\}/, + ); + expect(fnMatch, 'flushBuffers function not found in server.ts').not.toBeNull(); + const body = fnMatch![1]!; + + // Pull every identifier matching the `lastXxxFlushed` cursor pattern. + const trackerMatches = [...body.matchAll(/\blast([A-Z]\w+)Flushed\b/g)]; + const trackers = Array.from(new Set(trackerMatches.map((m) => `last${m[1]}Flushed`))); + + expect( + trackers.length, + 'flushBuffers should reference at least one last*Flushed tracker', + ).toBeGreaterThan(0); + + for (const tracker of trackers) { + // Module-level `let X = 0;` declaration (not inside a function body). + // Anchored start-of-line to avoid matching nested re-declarations or + // string literals. 
+ const declared = new RegExp( + `(?:^|\\n)let\\s+${tracker}\\s*=\\s*0\\s*;`, + ).test(SERVER_TS); + expect( + declared, + `\`${tracker}\` is referenced inside flushBuffers but never declared at module scope ` + + `with \`let ${tracker} = 0;\` — the interval will throw ReferenceError every tick ` + + `and the corresponding on-disk log will never be written`, + ).toBe(true); + } + }); +}); diff --git a/browse/test/server-tmp-state-path.test.ts b/browse/test/server-tmp-state-path.test.ts new file mode 100644 index 0000000000..4234ae1ba0 --- /dev/null +++ b/browse/test/server-tmp-state-path.test.ts @@ -0,0 +1,110 @@ +/** + * Regression: state-file temp path uniqueness. + * + * The daemon writes `.gstack/browse.json` via the standard atomic-rename + * pattern: `writeFileSync(tmp, …) → renameSync(tmp, stateFile)`. The + * pattern is correct for a single writer. It breaks for *concurrent* + * writers when they share a single temp filename: + * + * t0 Writer A: writeFileSync(stateFile + '.tmp', payloadA) + * t1 Writer B: writeFileSync(stateFile + '.tmp', payloadB) // overwrites A + * t2 Writer A: renameSync(stateFile + '.tmp', stateFile) // moves B's payload + * t3 Writer B: renameSync(stateFile + '.tmp', stateFile) // ENOENT — file gone + * + * A 15-CLI cold-start race against a fresh repo reproduces this in the + * wild — one of the spawned daemons dies with: + * + * [browse] Failed to start: ENOENT: no such file or directory, + * rename '…/.gstack/browse.json.tmp' -> '…/.gstack/browse.json' + * + * Fix: per-process temp path via `tmpStatePath()` (pid + 4 random bytes + * of suffix). Each concurrent writer gets a unique path; the atomic + * rename still gives last-writer-wins semantics on the final state file + * content, but writers no longer kill each other on the rename step. + * + * This source-level guard locks two invariants: + * 1. 
No remaining `stateFile + '.tmp'` literals in server.ts (regression + * catch — a future copy-paste or revert would re-introduce the bug) + * 2. The 4 known state-write call sites all use `tmpStatePath()` + * (positive coverage) + * + * Same pattern as terminal-agent.test.ts and dual-listener.test.ts: + * read source as text, assert invariant, no daemon required. + */ + +import { describe, test, expect } from 'bun:test'; +import { readFileSync } from 'fs'; +import * as path from 'path'; + +const SERVER_TS = readFileSync( + path.resolve(import.meta.dir, '../src/server.ts'), + 'utf-8', +); + +describe('server.ts — state-file temp-path uniqueness', () => { + test('no remaining `stateFile + \'.tmp\'` literals (regression catch)', () => { + // The shared-temp-filename pattern that caused the cold-start ENOENT + // race. A future contributor that copy-pastes the old pattern (or a + // revert) will fail this test. + const sharedTempLiterals = [ + ...SERVER_TS.matchAll(/stateFile\s*\+\s*['"`]\.tmp['"`]/g), + ]; + expect( + sharedTempLiterals.length, + `Found ${sharedTempLiterals.length} reference(s) to the shared ` + + `\`stateFile + '.tmp'\` pattern. Use \`tmpStatePath()\` instead — ` + + `the shared pattern races on rename when two daemons spawn ` + + `concurrently (cold-start race + parallel /tunnel/start).`, + ).toBe(0); + }); + + test('every state-file writeFileSync call uses tmpStatePath()', () => { + // Find every `writeFileSync(X, JSON.stringify(stateContent...` or + // `…(state, …)` call and verify X is `tmpStatePath()` or a variable + // assigned from `tmpStatePath()`. 
+ const writeCalls = [ + ...SERVER_TS.matchAll( + /fs\.writeFileSync\s*\(\s*(\w+)\s*,\s*JSON\.stringify\(\s*(state|stateContent)/g, + ), + ]; + expect( + writeCalls.length, + 'expected at least one state-file write site', + ).toBeGreaterThan(0); + + for (const m of writeCalls) { + const varName = m[1]!; + // Walk back to the assignment of varName — must come from tmpStatePath() + const assignRe = new RegExp( + `(?:const|let)\\s+${varName}\\s*=\\s*tmpStatePath\\(\\)`, + ); + expect( + assignRe.test(SERVER_TS), + `state-file writeFileSync uses \`${varName}\` but no \`const ${varName} = tmpStatePath()\` ` + + `assignment was found upstream. Either assign from tmpStatePath() ` + + `or pass tmpStatePath() inline — the shared \`stateFile + '.tmp'\` ` + + `pattern races under concurrent daemon startup`, + ).toBe(true); + } + }); + + test('tmpStatePath() declaration includes a per-process unique suffix', () => { + // Lock the suffix shape so a future contributor doesn't accidentally + // strip the uniqueness back out by simplifying the helper. + const declMatch = SERVER_TS.match( + /function tmpStatePath\(\)[^{]*\{([\s\S]*?)\n\}/, + ); + expect(declMatch, 'tmpStatePath() declaration not found').not.toBeNull(); + const body = declMatch![1]!; + + // Must reference both process.pid and crypto.randomBytes — two + // independent sources of uniqueness. + expect(body, 'tmpStatePath() must include process.pid in the suffix').toContain( + 'process.pid', + ); + expect( + body, + 'tmpStatePath() must include a random suffix via crypto.randomBytes', + ).toContain('crypto.randomBytes'); + }); +}); diff --git a/browse/test/tab-session-frame-detach.test.ts b/browse/test/tab-session-frame-detach.test.ts new file mode 100644 index 0000000000..32f2632b90 --- /dev/null +++ b/browse/test/tab-session-frame-detach.test.ts @@ -0,0 +1,154 @@ +/** + * Regression: refMap must be cleared when an iframe detaches. 
+ * + * `TabSession.getActiveFrameOrPage()` (tab-session.ts:151) auto-recovers + * from detached iframes by setting `activeFrame = null` and silently + * falling back to the main page. The asymmetric bug: the matching + * `clearRefs()` call is missing. + * + * Compare to `onMainFrameNavigated()` (tab-session.ts:167) — the + * staleness condition is equivalent (refs were captured against a frame + * that no longer exists), and the main-frame path correctly clears both + * the activeFrame AND the refMap: + * + * onMainFrameNavigated(): void { + * this.clearRefs(); // ← clears refs + * this.activeFrame = null; + * this.loadedHtml = null; + * this.loadedHtmlWaitUntil = undefined; + * } + * + * getActiveFrameOrPage(): Page | Frame { + * if (this.activeFrame?.isDetached()) { + * this.activeFrame = null; // ← but no clearRefs() here + * } + * return this.activeFrame ?? this.page; + * } + * + * The lazy click-time staleness check at `resolveRef` (tab-session.ts:97) + * partially saves us — `entry.locator.count()` on a detached-frame + * locator throws or returns 0, so a click against a stale ref errors out + * with "Ref X is stale". But the user has no signal that frame context + * silently changed underfoot: the next `snapshot` runs against + * `this.page` (main) while old iframe refs still litter `refMap` with + * the same role+name keys. New refs collide with stale ones, the + * resolver picks one at random, the user clicks the wrong element. + * + * Behavior the test locks: when an iframe detaches and + * `getActiveFrameOrPage()` auto-recovers, the refMap is cleared in the + * same step (matching the `onMainFrameNavigated` symmetry). TODOS.md + * line 816-820 documents "Detached frame auto-recovery" as a feature; + * this restores the documented intent. 
+ */
+
+import { describe, test, expect, beforeEach } from 'bun:test';
+import { TabSession, type RefEntry } from '../src/tab-session';
+import type { Page, Frame, Locator } from 'playwright';
+
+// Minimal type-cast mocks. Same pattern as tab-isolation.test.ts —
+// pure-logic tests don't launch a browser.
+function mockPage(): Page {
+  return {} as Page;
+}
+
+function mockDetachedFrame(): Frame {
+  return { isDetached: () => true } as unknown as Frame;
+}
+
+function mockAttachedFrame(): Frame {
+  return { isDetached: () => false } as unknown as Frame;
+}
+
+function mockRefEntry(role: string, name: string): RefEntry {
+  return {
+    locator: {} as Locator,
+    role,
+    name,
+  };
+}
+
+// Fresh refs Map per call — avoid by-reference mutation poisoning across
+// halves of the symmetry test (clearRefs() clears the same Map instance
+// the test holds a reference to).
+function makeRefs(): Map<string, RefEntry> {
+  const r = new Map<string, RefEntry>();
+  r.set('e1', mockRefEntry('button', 'Submit'));
+  r.set('e2', mockRefEntry('textbox', 'Email'));
+  r.set('e3', mockRefEntry('link', 'Forgot password'));
+  return r;
+}
+
+describe('TabSession — frame detach + ref staleness', () => {
+  let session: TabSession;
+
+  beforeEach(() => {
+    session = new TabSession(mockPage());
+    session.setRefMap(makeRefs());
+  });
+
+  test('refs cleared when getActiveFrameOrPage detects detached iframe', () => {
+    // Pre-condition: refs captured inside an iframe context
+    session.setFrame(mockDetachedFrame());
+    expect(session.getRefCount()).toBe(3);
+
+    // Act: caller invokes getActiveFrameOrPage (e.g. via the next /command
+    // dispatch). The detach gets noticed inside.
+    const result = session.getActiveFrameOrPage();
+
+    // Auto-recovery: activeFrame nulled (already worked pre-fix)
+    expect(session.getFrame()).toBeNull();
+
+    // The fix: refs ALSO cleared so the next snapshot runs against a
+    // clean ref namespace.
Pre-fix this was 3 — refs lingered against a + // dead frame, colliding with refs the next snapshot would emit. + expect(session.getRefCount()).toBe(0); + }); + + test('refs preserved when active frame is still attached', () => { + // No regression on the happy path — attached frame should NOT + // trigger the cleanup. + session.setFrame(mockAttachedFrame()); + expect(session.getRefCount()).toBe(3); + + session.getActiveFrameOrPage(); + + // Frame still set, refs still present. + expect(session.getFrame()).not.toBeNull(); + expect(session.getRefCount()).toBe(3); + }); + + test('refs preserved when no frame is set (page-level snapshot)', () => { + // No frame ever set → the if-branch never enters → refs untouched. + expect(session.getFrame()).toBeNull(); + expect(session.getRefCount()).toBe(3); + + session.getActiveFrameOrPage(); + + expect(session.getRefCount()).toBe(3); + }); + + test('matches onMainFrameNavigated symmetry (refs+frame both cleared)', () => { + // Pin the design symmetry: both staleness paths (main-frame nav AND + // iframe detach) must clear both pieces of state together. If a + // future refactor splits these, the test fails before merge. + session.setFrame(mockDetachedFrame()); + expect(session.getRefCount()).toBe(3); + + session.onMainFrameNavigated(); + + expect(session.getFrame()).toBeNull(); + expect(session.getRefCount()).toBe(0); + + // Reset with a FRESH Map (the previous one was emptied by clearRefs + // by-reference) and exercise the iframe-detach path. End state must + // match. 
+ session.setRefMap(makeRefs()); + session.setFrame(mockDetachedFrame()); + expect(session.getRefCount()).toBe(3); + + session.getActiveFrameOrPage(); + + expect(session.getFrame()).toBeNull(); + expect(session.getRefCount()).toBe(0); + }); +}); diff --git a/browser-skills/hackernews-frontpage/_lib/browse-client.ts b/browser-skills/hackernews-frontpage/_lib/browse-client.ts index a33681f717..435f575469 100644 --- a/browser-skills/hackernews-frontpage/_lib/browse-client.ts +++ b/browser-skills/hackernews-frontpage/_lib/browse-client.ts @@ -54,6 +54,13 @@ interface ResolvedAuth { source: 'env' | 'state-file'; } +function parseIntegerEnvValue(value: string | undefined): number | undefined { + const trimmed = value?.trim(); + if (!trimmed || !/^-?\d+$/.test(trimmed)) return undefined; + const parsed = parseInt(trimmed, 10); + return Number.isFinite(parsed) ? parsed : undefined; +} + /** Resolve the daemon port + token. Throws a clear error if neither path works. */ export function resolveBrowseAuth(opts: BrowseClientOptions = {}): ResolvedAuth { if (opts.port !== undefined && opts.token !== undefined) { @@ -64,8 +71,8 @@ export function resolveBrowseAuth(opts: BrowseClientOptions = {}): ResolvedAuth const envPort = process.env.GSTACK_PORT; const envToken = process.env.GSTACK_SKILL_TOKEN; if (envPort && envToken) { - const port = opts.port ?? parseInt(envPort, 10); - if (!isNaN(port)) { + const port = opts.port ?? parseIntegerEnvValue(envPort); + if (port !== undefined) { return { port, token: opts.token ?? envToken, source: 'env' }; } } @@ -132,7 +139,7 @@ export class BrowseClient { const auth = resolveBrowseAuth(opts); this.port = auth.port; this.token = auth.token; - this.tabId = opts.tabId ?? (process.env.BROWSE_TAB ? parseInt(process.env.BROWSE_TAB, 10) : undefined); + this.tabId = opts.tabId ?? parseIntegerEnvValue(process.env.BROWSE_TAB); this.timeoutMs = opts.timeoutMs ?? 
30_000; } diff --git a/codex/SKILL.md b/codex/SKILL.md index 7917b7f2d0..b71d1ed401 100644 --- a/codex/SKILL.md +++ b/codex/SKILL.md @@ -1121,10 +1121,15 @@ If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`. ```bash _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } +PYTHON_CMD=$(command -v python3 2>/dev/null || command -v python 2>/dev/null || true) +if [ -z "$PYTHON_CMD" ]; then + echo "ERROR: Python 3 is required to parse Codex JSON output. Install python3 or python and retry." >&2 + exit 1 +fi # Fix 1+2: wrap with timeout (gtimeout/timeout fallback chain via probe helper), # capture stderr to $TMPERR for auth error detection (was: 2>/dev/null). TMPERR=${TMPERR:-$(mktemp "$TMP_ROOT/codex-err-XXXXXX.txt")} -_gstack_codex_timeout_wrapper 600 codex exec "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " +_gstack_codex_timeout_wrapper 600 codex exec "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 "$PYTHON_CMD" -u -c " import sys, json turn_completed_count = 0 for line in sys.stdin: @@ -1230,7 +1235,7 @@ If no project-scoped match, fall back to `ls -t "$PLAN_ROOT"/*.md 2>/dev/null | but warn: "Note: this plan may be from a different project — verify before sending to Codex." **IMPORTANT — embed content, don't reference path:** Codex runs sandboxed to the repo -root (`-C`) and cannot access `~/.claude/plans/` or any files outside the repo. You MUST +root and cannot access `~/.claude/plans/` or any files outside the repo. You MUST read the plan file yourself and embed its FULL CONTENT in the prompt below. Do NOT tell Codex the file path or ask it to read the plan file — it will waste 10+ tool calls searching and fail. 
@@ -1267,8 +1272,13 @@ If the user passed `--xhigh`, use `"xhigh"` instead of `"medium"`. For a **new session:** ```bash _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } +PYTHON_CMD=$(command -v python3 2>/dev/null || command -v python 2>/dev/null || true) +if [ -z "$PYTHON_CMD" ]; then + echo "ERROR: Python 3 is required to parse Codex JSON output. Install python3 or python and retry." >&2 + exit 1 +fi # Fix 1: wrap with timeout (gtimeout/timeout fallback chain via probe helper) -_gstack_codex_timeout_wrapper 600 codex exec "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " +_gstack_codex_timeout_wrapper 600 codex exec "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 "$PYTHON_CMD" -u -c " import sys, json for line in sys.stdin: line = line.strip() @@ -1309,8 +1319,14 @@ fi For a **resumed session** (user chose "Continue"): ```bash _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } +PYTHON_CMD=$(command -v python3 2>/dev/null || command -v python 2>/dev/null || true) +if [ -z "$PYTHON_CMD" ]; then + echo "ERROR: Python 3 is required to parse Codex JSON output. Install python3 or python and retry." 
>&2 + exit 1 +fi +cd "$_REPO_ROOT" || exit 1 # Fix 1: wrap with timeout (gtimeout/timeout fallback chain via probe helper) -_gstack_codex_timeout_wrapper 600 codex exec resume "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " +_gstack_codex_timeout_wrapper 600 codex exec resume "" -c 'sandbox_mode="read-only"' -c 'model_reasoning_effort="medium"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 "$PYTHON_CMD" -u -c " " # Fix 1: same hang detection pattern as new-session block diff --git a/codex/SKILL.md.tmpl b/codex/SKILL.md.tmpl index 1b849a82d9..90dd1119ca 100644 --- a/codex/SKILL.md.tmpl +++ b/codex/SKILL.md.tmpl @@ -284,10 +284,15 @@ If the user passed `--xhigh`, use `"xhigh"` instead of `"high"`. ```bash _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } +PYTHON_CMD=$(command -v python3 2>/dev/null || command -v python 2>/dev/null || true) +if [ -z "$PYTHON_CMD" ]; then + echo "ERROR: Python 3 is required to parse Codex JSON output. Install python3 or python and retry." >&2 + exit 1 +fi # Fix 1+2: wrap with timeout (gtimeout/timeout fallback chain via probe helper), # capture stderr to $TMPERR for auth error detection (was: 2>/dev/null). 
TMPERR=${TMPERR:-$(mktemp "$TMP_ROOT/codex-err-XXXXXX.txt")} -_gstack_codex_timeout_wrapper 600 codex exec "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " +_gstack_codex_timeout_wrapper 600 codex exec "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 "$PYTHON_CMD" -u -c " import sys, json turn_completed_count = 0 for line in sys.stdin: @@ -393,7 +398,7 @@ If no project-scoped match, fall back to `ls -t "$PLAN_ROOT"/*.md 2>/dev/null | but warn: "Note: this plan may be from a different project — verify before sending to Codex." **IMPORTANT — embed content, don't reference path:** Codex runs sandboxed to the repo -root (`-C`) and cannot access `~/.claude/plans/` or any files outside the repo. You MUST +root and cannot access `~/.claude/plans/` or any files outside the repo. You MUST read the plan file yourself and embed its FULL CONTENT in the prompt below. Do NOT tell Codex the file path or ask it to read the plan file — it will waste 10+ tool calls searching and fail. @@ -430,8 +435,13 @@ If the user passed `--xhigh`, use `"xhigh"` instead of `"medium"`. For a **new session:** ```bash _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } +PYTHON_CMD=$(command -v python3 2>/dev/null || command -v python 2>/dev/null || true) +if [ -z "$PYTHON_CMD" ]; then + echo "ERROR: Python 3 is required to parse Codex JSON output. Install python3 or python and retry." 
>&2 + exit 1 +fi # Fix 1: wrap with timeout (gtimeout/timeout fallback chain via probe helper) -_gstack_codex_timeout_wrapper 600 codex exec "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " +_gstack_codex_timeout_wrapper 600 codex exec "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 "$PYTHON_CMD" -u -c " import sys, json for line in sys.stdin: line = line.strip() @@ -472,8 +482,14 @@ fi For a **resumed session** (user chose "Continue"): ```bash _REPO_ROOT=$(git rev-parse --show-toplevel) || { echo "ERROR: not in a git repo" >&2; exit 1; } +PYTHON_CMD=$(command -v python3 2>/dev/null || command -v python 2>/dev/null || true) +if [ -z "$PYTHON_CMD" ]; then + echo "ERROR: Python 3 is required to parse Codex JSON output. Install python3 or python and retry." >&2 + exit 1 +fi +cd "$_REPO_ROOT" || exit 1 # Fix 1: wrap with timeout (gtimeout/timeout fallback chain via probe helper) -_gstack_codex_timeout_wrapper 600 codex exec resume "" -C "$_REPO_ROOT" -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 python3 -u -c " +_gstack_codex_timeout_wrapper 600 codex exec resume "" -c 'sandbox_mode="read-only"' -c 'model_reasoning_effort="medium"' --enable web_search_cached --json < /dev/null 2>"$TMPERR" | PYTHONUNBUFFERED=1 "$PYTHON_CMD" -u -c " " # Fix 1: same hang detection pattern as new-session block diff --git a/design/prototype.ts b/design/prototype.ts index 74b9ec497b..b59db30199 100644 --- a/design/prototype.ts +++ b/design/prototype.ts @@ -9,8 +9,7 @@ import fs from "fs"; import path from "path"; -const API_KEY = process.env.OPENAI_API_KEY - || JSON.parse(fs.readFileSync(path.join(process.env.HOME!, ".gstack/openai.json"), "utf-8")).api_key; +const API_KEY = 
process.env.OPENAI_API_KEY; if (!API_KEY) { console.error("No API key found. Set OPENAI_API_KEY or save to ~/.gstack/openai.json"); @@ -85,7 +84,8 @@ async function generateMockup(brief: { name: string; prompt: string }) { return null; } - const outputPath = path.join(OUTPUT_DIR, `${brief.name}.png`); + const safeName = brief.name.replace(/[^a-zA-Z0-9_-]/g, "_"); + const outputPath = OUTPUT_DIR + "/" + safeName + ".png"; const imageBuffer = Buffer.from(imageItem.result, "base64"); fs.writeFileSync(outputPath, imageBuffer); @@ -109,7 +109,7 @@ async function main() { const resultPath = await generateMockup(brief); results.push({ name: brief.name, path: resultPath }); } catch (err) { - console.error(`ERROR generating ${brief.name}:`, err); + console.error("ERROR generating:", brief.name, err); results.push({ name: brief.name, path: null }); } } @@ -124,7 +124,7 @@ async function main() { console.log(`${succeeded.length}/${results.length} generated successfully`); if (failed.length > 0) { - console.log(`Failed: ${failed.map(f => f.name).join(", ")}`); + console.log("Failed:", failed.map(f => f.name).join(", ")); } if (succeeded.length > 0) { diff --git a/design/src/variants.ts b/design/src/variants.ts index 87ccca92de..d52eb22829 100644 --- a/design/src/variants.ts +++ b/design/src/variants.ts @@ -31,30 +31,37 @@ const STYLE_VARIATIONS = [ /** * Generate a single variant with retry on 429. + * + * Exported for testability. Pass `fetchFn` to inject a stubbed fetch in tests; + * production code uses the global fetch by default. 
*/ -async function generateVariant( +export async function generateVariant( apiKey: string, prompt: string, outputPath: string, size: string, quality: string, + fetchFn: typeof globalThis.fetch = globalThis.fetch, ): Promise<{ path: string; success: boolean; error?: string }> { const maxRetries = 3; + const MAX_RETRY_AFTER_MS = 60_000; // cap honored Retry-After to bound stalls let lastError = ""; + let skipLeadingDelay = false; for (let attempt = 0; attempt <= maxRetries; attempt++) { - if (attempt > 0) { + if (attempt > 0 && !skipLeadingDelay) { // Exponential backoff: 2s, 4s, 8s const delay = Math.pow(2, attempt) * 1000; console.error(` Rate limited, retrying in ${delay / 1000}s...`); await new Promise(r => setTimeout(r, delay)); } + skipLeadingDelay = false; const controller = new AbortController(); const timeout = setTimeout(() => controller.abort(), 120_000); try { - const response = await fetch("https://api.openai.com/v1/responses", { + const response = await fetchFn("https://api.openai.com/v1/responses", { method: "POST", headers: { "Authorization": `Bearer ${apiKey}`, @@ -72,6 +79,29 @@ async function generateVariant( if (response.status === 429) { lastError = "Rate limited (429)"; + const retryAfter = response.headers.get("retry-after"); + if (retryAfter) { + const trimmed = retryAfter.trim(); + let waitMs: number | null = null; + if (/^\d+$/.test(trimmed)) { + // delta-seconds (RFC 7231) + waitMs = Math.min(Number.parseInt(trimmed, 10) * 1000, MAX_RETRY_AFTER_MS); + } else { + // HTTP-date (RFC 7231) + const dateMs = Date.parse(trimmed); + if (!Number.isNaN(dateMs)) { + waitMs = Math.min(Math.max(0, dateMs - Date.now()), MAX_RETRY_AFTER_MS); + } + } + if (waitMs !== null) { + if (waitMs > 0) { + await new Promise(resolve => setTimeout(resolve, waitMs)); + } + // Honored Retry-After (incl. 0 / past date "retry now") — skip the + // next iteration's leading exponential sleep so we don't double-wait. 
+          skipLeadingDelay = true;
+        }
+      }
       continue;
     }
diff --git a/design/test/variants-retry-after.test.ts b/design/test/variants-retry-after.test.ts
new file mode 100644
index 0000000000..2060791d53
--- /dev/null
+++ b/design/test/variants-retry-after.test.ts
@@ -0,0 +1,133 @@
+import { describe, test, expect, beforeEach, afterEach } from "bun:test";
+import fs from "fs";
+import os from "os";
+import path from "path";
+import { generateVariant } from "../src/variants";
+
+// 1x1 transparent PNG, base64 — valid bytes that fs.writeFileSync can write.
+const TINY_PNG_BASE64 =
+  "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkAAIAAAoAAv/lxKUAAAAASUVORK5CYII=";
+
+function successResponse(): Response {
+  return new Response(
+    JSON.stringify({
+      output: [{ type: "image_generation_call", result: TINY_PNG_BASE64 }],
+    }),
+    { status: 200, headers: { "Content-Type": "application/json" } },
+  );
+}
+
+function rateLimited(retryAfter?: string): Response {
+  const headers: Record<string, string> = {};
+  if (retryAfter !== undefined) headers["Retry-After"] = retryAfter;
+  return new Response("rate limited", { status: 429, headers });
+}
+
+interface CallRecord {
+  ts: number;
+}
+
+function makeStubFetch(
+  responses: Response[],
+  calls: CallRecord[],
+): typeof globalThis.fetch {
+  let idx = 0;
+  return (async (_input: any, _init?: any) => {
+    calls.push({ ts: Date.now() });
+    const response = responses[idx];
+    if (!response) throw new Error(`stub fetch: no response for call ${idx + 1}`);
+    idx++;
+    return response;
+  }) as typeof globalThis.fetch;
+}
+
+describe("generateVariant Retry-After handling", () => {
+  let tmpDir: string;
+  let outputPath: string;
+
+  beforeEach(() => {
+    tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "variants-retry-after-"));
+    outputPath = path.join(tmpDir, "variant.png");
+  });
+
+  afterEach(() => {
+    fs.rmSync(tmpDir, { recursive: true, force: true });
+  });
+
+  test("delta-seconds: honors Retry-After: 1 with no extra leading exponential", async ()
=> { + const calls: CallRecord[] = []; + const fetchFn = makeStubFetch([rateLimited("1"), successResponse()], calls); + + const result = await generateVariant( + "fake-key", "prompt", outputPath, "1024x1024", "high", fetchFn, + ); + + expect(result.success).toBe(true); + expect(calls.length).toBe(2); + const gap = calls[1].ts - calls[0].ts; + // Honored ~1s; should NOT add the 2s leading exponential on top + expect(gap).toBeGreaterThanOrEqual(900); + expect(gap).toBeLessThan(1700); + }); + + test("HTTP-date: honors a future date with no extra leading exponential", async () => { + const calls: CallRecord[] = []; + const future = new Date(Date.now() + 3000).toUTCString(); + const fetchFn = makeStubFetch([rateLimited(future), successResponse()], calls); + + const result = await generateVariant( + "fake-key", "prompt", outputPath, "1024x1024", "high", fetchFn, + ); + + expect(result.success).toBe(true); + expect(calls.length).toBe(2); + const gap = calls[1].ts - calls[0].ts; + expect(gap).toBeGreaterThanOrEqual(2500); + expect(gap).toBeLessThan(4500); + }); + + test("invalid Retry-After (alphanumeric): falls through to exponential", async () => { + const calls: CallRecord[] = []; + const fetchFn = makeStubFetch([rateLimited("2abc"), successResponse()], calls); + + const result = await generateVariant( + "fake-key", "prompt", outputPath, "1024x1024", "high", fetchFn, + ); + + expect(result.success).toBe(true); + expect(calls.length).toBe(2); + const gap = calls[1].ts - calls[0].ts; + // Falls through to existing 2s exponential leading delay + expect(gap).toBeGreaterThanOrEqual(1800); + expect(gap).toBeLessThan(3000); + }); + + test("no Retry-After header: falls through to exponential", async () => { + const calls: CallRecord[] = []; + const fetchFn = makeStubFetch([rateLimited(), successResponse()], calls); + + const result = await generateVariant( + "fake-key", "prompt", outputPath, "1024x1024", "high", fetchFn, + ); + + expect(result.success).toBe(true); + 
expect(calls.length).toBe(2); + const gap = calls[1].ts - calls[0].ts; + expect(gap).toBeGreaterThanOrEqual(1800); + expect(gap).toBeLessThan(3000); + }); + + test("Retry-After: 0 retries immediately, skips leading exponential", async () => { + const calls: CallRecord[] = []; + const fetchFn = makeStubFetch([rateLimited("0"), successResponse()], calls); + + const result = await generateVariant( + "fake-key", "prompt", outputPath, "1024x1024", "high", fetchFn, + ); + + expect(result.success).toBe(true); + expect(calls.length).toBe(2); + const gap = calls[1].ts - calls[0].ts; + expect(gap).toBeLessThan(500); + }); +}); diff --git a/extension/sidepanel-terminal.js b/extension/sidepanel-terminal.js index e301d085cc..dc3a0cd75b 100644 --- a/extension/sidepanel-terminal.js +++ b/extension/sidepanel-terminal.js @@ -154,7 +154,7 @@ function ensureXterm() { if (term) return; term = new Terminal({ - fontFamily: '"JetBrains Mono", "SF Mono", Menlo, monospace', + fontFamily: '"JetBrains Mono", "SF Mono", Menlo, "Noto Sans Mono CJK KR", "Malgun Gothic", monospace', fontSize: 13, theme: { background: '#0a0a0a', foreground: '#e5e5e5' }, cursorBlink: true, @@ -196,7 +196,25 @@ }); ro.observe(els.mount); + // IME composition handling for Korean/CJK input (issue #1272). + // Suppress partial jamo during composition; only send the final + // composed string on compositionend. Without this, Korean IME + // sends fragmented input or doubles characters. 
+ let composing = false; + const ta = term.textarea; + if (ta) { + ta.addEventListener('compositionstart', () => { composing = true; }); + ta.addEventListener('compositionend', (e) => { + composing = false; + if (e.data && ws && ws.readyState === WebSocket.OPEN) { + ws.send(new TextEncoder().encode(e.data)); + } + }); + } + + term.onData((data) => { + if (composing) return; // suppress partial input events during IME composition if (ws && ws.readyState === WebSocket.OPEN) { ws.send(new TextEncoder().encode(data)); } diff --git a/extension/sidepanel.css b/extension/sidepanel.css index 8813a0d07e..d83486e6c2 100644 --- a/extension/sidepanel.css +++ b/extension/sidepanel.css @@ -38,7 +38,7 @@ /* Typography */ --font-system: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif; - --font-mono: 'JetBrains Mono', 'SF Mono', 'Fira Code', 'Cascadia Code', monospace; + --font-mono: 'JetBrains Mono', 'SF Mono', 'Fira Code', 'Cascadia Code', 'Noto Sans Mono CJK KR', 'Malgun Gothic', monospace; /* Radius */ --radius-sm: 4px; diff --git a/make-pdf/src/browseClient.ts b/make-pdf/src/browseClient.ts index 3fe583eb67..63cec7755c 100644 --- a/make-pdf/src/browseClient.ts +++ b/make-pdf/src/browseClient.ts @@ -7,12 +7,20 @@ * (Windows argv cap is 8191 chars; 200KB HTML dies without this). * - One place that maps non-zero exit codes to typed errors. * - * Binary resolution order (Codex round 2 #4): - * 1. $BROWSE_BIN env override - * 2. sibling dir: dirname(argv[0])/../browse/dist/browse - * 3. ~/.claude/skills/gstack/browse/dist/browse - * 4. PATH lookup: `browse` - * 5. error with setup hint + * Binary resolution order (Codex round 2 #4, v1.24-aligned): + * 1. $GSTACK_BROWSE_BIN env override (preferred, matches v1.24 GSTACK_*_BIN pattern) + * 2. $BROWSE_BIN env override (back-compat alias) + * 3. sibling dir: dirname(argv[0])/../browse/dist/browse[.exe] + * 4. ~/.claude/skills/gstack/browse/dist/browse[.exe] + * 5. 
PATH lookup via Bun.which('browse') — handles Windows PATHEXT natively + * 6. error with setup hint + * + * Windows quirks: + * - bun build --compile --outfile X emits X.exe on win32, so candidate paths + * need a .exe probe pass (fs.accessSync(X_OK) degrades to existence-checking + * on Windows per Node docs, so the bare path silently misses the .exe file). + * - `which` only exists in Git Bash; Bun.which() handles cmd.exe / PowerShell + * natively via PATHEXT semantics. */ import { execFileSync } from "node:child_process"; @@ -54,16 +62,52 @@ export interface JsOptions { expression: string; // JS expression to evaluate } +/** + * Resolve an absolute or PATH-resolvable command via Bun.which-style semantics, + * with a Windows .exe/.cmd/.bat extension probe for absolute paths. Mirrors + * the v1.24 claude-bin.ts override-resolution shape. + * + * Returns null if nothing resolves; callers degrade with a typed error rather + * than throwing here. + */ +function resolveOverride(value: string | undefined, env: NodeJS.ProcessEnv): string | null { + if (!value?.trim()) return null; + const trimmed = value.trim().replace(/^"(.*)"$/, '$1'); + if (path.isAbsolute(trimmed)) return findExecutable(trimmed); + const PATH = env.PATH ?? env.Path ?? ''; + return Bun.which(trimmed, { PATH }) ?? null; +} + +/** + * Probe a base path for executability, honoring Windows extension suffixes. + * + * On POSIX, isExecutable(base) is the only check that matters. On Windows, + * fs.accessSync(p, X_OK) degrades to an existence check — so a bare-path probe + * misses bun-compiled binaries (which land at base.exe). After the bare probe + * fails on win32, try .exe / .cmd / .bat. Linux/macOS behavior is unchanged. 
+ */ +export function findExecutable(base: string): string | null { + if (isExecutable(base)) return base; + if (process.platform === "win32") { + for (const ext of [".exe", ".cmd", ".bat"]) { + const withExt = base + ext; + if (isExecutable(withExt)) return withExt; + } + } + return null; +} + /** * Locate the browse binary. Throws a BrowseClientError with a - * canonical setup message if not found. + * canonical setup message if not found. See header for resolution order. */ -export function resolveBrowseBin(): string { - const envOverride = process.env.BROWSE_BIN; - if (envOverride && isExecutable(envOverride)) return envOverride; +export function resolveBrowseBin(env: NodeJS.ProcessEnv = process.env): string { + // 1 + 2: env overrides (GSTACK_BROWSE_BIN preferred, BROWSE_BIN back-compat). + const overrideRaw = env.GSTACK_BROWSE_BIN ?? env.BROWSE_BIN; + const override = resolveOverride(overrideRaw, env); + if (override) return override; - // Sibling: look relative to this process's binary - // (for when make-pdf and browse live next to each other in dist/) + // 3: sibling — make-pdf and browse co-located in dist/. const selfDir = path.dirname(process.argv[0]); const siblingCandidates = [ path.resolve(selfDir, "../browse/dist/browse"), @@ -71,21 +115,21 @@ export function resolveBrowseBin(): string { path.resolve(selfDir, "../browse"), ]; for (const candidate of siblingCandidates) { - if (isExecutable(candidate)) return candidate; + const found = findExecutable(candidate); + if (found) return found; } - // Global install + // 4: global install. 
const home = os.homedir(); const globalPath = path.join(home, ".claude/skills/gstack/browse/dist/browse"); - if (isExecutable(globalPath)) return globalPath; + const globalFound = findExecutable(globalPath); + if (globalFound) return globalFound; - // PATH lookup - try { - const which = execFileSync("which", ["browse"], { encoding: "utf8" }).trim(); - if (which && isExecutable(which)) return which; - } catch { - // `which` exited non-zero; fall through to error - } + // 5: PATH lookup via Bun.which — handles Windows PATHEXT natively (no `which` + // dependency on cmd.exe / PowerShell, no `where`-vs-`which` branch). + const PATH = env.PATH ?? env.Path ?? ''; + const onPath = Bun.which('browse', { PATH }); + if (onPath) return onPath; throw new BrowseClientError( /* exitCode */ 127, @@ -95,7 +139,8 @@ export function resolveBrowseBin(): string { "", "make-pdf needs browse (the gstack Chromium daemon) to render PDFs.", "Tried:", - ` - $BROWSE_BIN (${envOverride || "unset"})`, + ` - $GSTACK_BROWSE_BIN (${env.GSTACK_BROWSE_BIN || "unset"})`, + ` - $BROWSE_BIN (${env.BROWSE_BIN || "unset"})`, ` - sibling: ${siblingCandidates.join(", ")}`, ` - global: ${globalPath}`, " - PATH: `browse`", @@ -103,8 +148,10 @@ export function resolveBrowseBin(): string { "To fix: run gstack setup from the gstack repo:", " cd ~/.claude/skills/gstack && ./setup", "", - "Or set BROWSE_BIN explicitly:", - " export BROWSE_BIN=/path/to/browse", + "Or set GSTACK_BROWSE_BIN explicitly:", + process.platform === "win32" + ? ' setx GSTACK_BROWSE_BIN "C:\\path\\to\\browse.exe"' + : " export GSTACK_BROWSE_BIN=/path/to/browse", ].join("\n"), ); } diff --git a/make-pdf/src/pdftotext.ts b/make-pdf/src/pdftotext.ts index 33e79fc64c..54cc551184 100644 --- a/make-pdf/src/pdftotext.ts +++ b/make-pdf/src/pdftotext.ts @@ -13,11 +13,14 @@ * between paragraphs, and homoglyph substitution. We add a word-token * diff and a paragraph-boundary assertion on top. * - * Resolution order for the pdftotext binary: - * 1. 
$PDFTOTEXT_BIN env override - * 2. `which pdftotext` on PATH - * 3. standard Homebrew paths on macOS - * 4. throws a friendly "install poppler" error + * Resolution order for the pdftotext binary (v1.24-aligned): + * 1. $GSTACK_PDFTOTEXT_BIN env override (preferred, matches v1.24 GSTACK_*_BIN pattern) + * 2. $PDFTOTEXT_BIN env override (back-compat alias) + * 3. PATH lookup via Bun.which('pdftotext') — handles Windows PATHEXT natively + * 4. standard POSIX paths (Homebrew + distro) — no Windows candidates because + * Poppler scatters across Scoop / Chocolatey / oschwartz10612-poppler-windows + * and guessing causes false positives. Set GSTACK_PDFTOTEXT_BIN explicitly. + * 5. throws a friendly "install poppler" error * * The wrapper is *optional at runtime*: production renders don't need it. * Only the CI gate and unit tests invoke pdftotext. @@ -42,29 +45,52 @@ export interface PdftotextInfo { } /** - * Locate pdftotext. Throws PdftotextUnavailableError if none is found. + * Probe a base path for executability, honoring Windows extension suffixes. + * Matches browseClient.ts:findExecutable — duplicated rather than shared + * because the two modules already duplicate isExecutable for compile-isolation. 
*/ -export function resolvePdftotext(): PdftotextInfo { - const envOverride = process.env.PDFTOTEXT_BIN; - if (envOverride && isExecutable(envOverride)) { - return describeBinary(envOverride); +export function findExecutable(base: string): string | null { + if (isExecutable(base)) return base; + if (process.platform === "win32") { + for (const ext of [".exe", ".cmd", ".bat"]) { + const withExt = base + ext; + if (isExecutable(withExt)) return withExt; + } } + return null; +} - // Try PATH - try { - const which = execFileSync("which", ["pdftotext"], { encoding: "utf8" }).trim(); - if (which && isExecutable(which)) return describeBinary(which); - } catch { - // fall through - } +function resolveOverride(value: string | undefined, env: NodeJS.ProcessEnv): string | null { + if (!value?.trim()) return null; + const trimmed = value.trim().replace(/^"(.*)"$/, '$1'); + if (path.isAbsolute(trimmed)) return findExecutable(trimmed); + const PATH = env.PATH ?? env.Path ?? ''; + return Bun.which(trimmed, { PATH }) ?? null; +} + +/** + * Locate pdftotext. Throws PdftotextUnavailableError if none is found. + */ +export function resolvePdftotext(env: NodeJS.ProcessEnv = process.env): PdftotextInfo { + // 1 + 2: env overrides (GSTACK_PDFTOTEXT_BIN preferred, PDFTOTEXT_BIN back-compat). + const overrideRaw = env.GSTACK_PDFTOTEXT_BIN ?? env.PDFTOTEXT_BIN; + const override = resolveOverride(overrideRaw, env); + if (override) return describeBinary(override); + + // 3: PATH lookup via Bun.which — handles Windows PATHEXT natively. + const PATH = env.PATH ?? env.Path ?? ''; + const onPath = Bun.which('pdftotext', { PATH }); + if (onPath) return describeBinary(onPath); - // Common macOS Homebrew locations - const macCandidates = [ - "/opt/homebrew/bin/pdftotext", // Apple Silicon + // 4: POSIX-only standard locations. No Windows candidates — Poppler installs + // scatter across Scoop/Chocolatey/portable zips and guessing causes false + // positives. 
Windows users set GSTACK_PDFTOTEXT_BIN explicitly. + const posixCandidates = [ + "/opt/homebrew/bin/pdftotext", // Apple Silicon Homebrew "/usr/local/bin/pdftotext", // Intel Mac or Linuxbrew "/usr/bin/pdftotext", // distro package ]; - for (const candidate of macCandidates) { + for (const candidate of posixCandidates) { if (isExecutable(candidate)) return describeBinary(candidate); } @@ -75,12 +101,16 @@ export function resolvePdftotext(): PdftotextInfo { "(Runtime rendering does NOT need it. This only affects tests.)", "", "To install:", - " macOS: brew install poppler", - " Ubuntu: sudo apt-get install poppler-utils", - " Fedora: sudo dnf install poppler-utils", + " macOS: brew install poppler", + " Ubuntu: sudo apt-get install poppler-utils", + " Fedora: sudo dnf install poppler-utils", + " Windows: scoop install poppler (or download from", + " https://github.com/oschwartz10612/poppler-windows)", "", - "Or set PDFTOTEXT_BIN to an explicit path:", - " export PDFTOTEXT_BIN=/path/to/pdftotext", + "Or set GSTACK_PDFTOTEXT_BIN to an explicit path:", + process.platform === "win32" + ? ' setx GSTACK_PDFTOTEXT_BIN "C:\\path\\to\\pdftotext.exe"' + : " export GSTACK_PDFTOTEXT_BIN=/path/to/pdftotext", ].join("\n")); } diff --git a/make-pdf/test/browseClient.test.ts b/make-pdf/test/browseClient.test.ts index b3459713a3..072278e500 100644 --- a/make-pdf/test/browseClient.test.ts +++ b/make-pdf/test/browseClient.test.ts @@ -2,60 +2,123 @@ * browseClient unit tests — binary resolution and error mapping. * * These are pure unit tests; they do NOT require a running browse daemon. + * Cross-platform: assertions that pin POSIX behavior early-return on win32 + * and vice versa, so both lanes only exercise their own branch. 
 */

 import { describe, expect, test } from "bun:test";
+import * as fs from "node:fs";
+import * as os from "node:os";
+import * as path from "node:path";
 import { BrowseClientError } from "../src/types";
-import { resolveBrowseBin } from "../src/browseClient";
+import { resolveBrowseBin, findExecutable } from "../src/browseClient";
+
+// A real, always-present executable for the test platform — `cmd.exe` on
+// Windows (System32 is on every install) and `/bin/sh` on POSIX. Lets the
+// "honors override when it points at a real executable" test work in both
+// lanes without writing a temp script.
+const REAL_EXE: string =
+  process.platform === "win32"
+    ? path.join(process.env.SystemRoot ?? "C:\\Windows", "System32", "cmd.exe")
+    : "/bin/sh";
+
+function withEnv<T>(overrides: Record<string, string | undefined>, fn: () => T): T {
+  const saved: Record<string, string | undefined> = {};
+  for (const k of Object.keys(overrides)) saved[k] = process.env[k];
+  for (const [k, v] of Object.entries(overrides)) {
+    if (v === undefined) delete process.env[k];
+    else process.env[k] = v;
+  }
+  try {
+    return fn();
+  } finally {
+    for (const [k, v] of Object.entries(saved)) {
+      if (v === undefined) delete process.env[k];
+      else process.env[k] = v;
+    }
+  }
+}
+
+describe("findExecutable", () => {
+  test("returns the bare path on POSIX when it's executable", () => {
+    if (process.platform === "win32") return;
+    const found = findExecutable("/bin/sh");
+    expect(found).toBe("/bin/sh");
+  });
+
+  test("on win32, probes .exe / .cmd / .bat after the bare-path miss", () => {
+    if (process.platform !== "win32") return;
+    // cmd.exe lives at System32\cmd.exe — probe with the bare base.
+    const base = path.join(process.env.SystemRoot ??
"C:\\Windows", "System32", "cmd"); + const found = findExecutable(base); + expect(found).toBe(base + ".exe"); + }); + + test("returns null when no extension matches", () => { + const found = findExecutable("/nonexistent/path/to/nothing"); + expect(found).toBeNull(); + }); +}); describe("resolveBrowseBin", () => { test("throws BrowseClientError with setup hint when nothing is found", () => { - // Point every candidate path to a non-existent location. - const originalEnv = process.env.BROWSE_BIN; - process.env.BROWSE_BIN = "/nonexistent/browse-does-not-exist"; - - // We can't easily mock the sibling and global paths without touching - // the filesystem, so in a typical dev environment this will usually - // find the real browse. That's fine — on CI it will throw, and the - // error message shape is what we're actually asserting. - let thrown: any = null; + // Point overrides at non-existent paths and clear PATH so Bun.which finds + // nothing. Sibling/global probes go through findExecutable on real paths, + // but the test asserts on the error shape rather than depending on whether + // a real browse install exists on the box. + let thrown: unknown = null; try { - resolveBrowseBin(); + withEnv( + { + GSTACK_BROWSE_BIN: "/nonexistent/gstack-browse-bin", + BROWSE_BIN: "/nonexistent/browse-bin", + PATH: "", + Path: "", + }, + () => resolveBrowseBin(), + ); } catch (err) { thrown = err; } if (thrown) { expect(thrown).toBeInstanceOf(BrowseClientError); - expect(thrown.message).toContain("browse binary not found"); - expect(thrown.message).toContain("./setup"); - expect(thrown.message).toContain("BROWSE_BIN"); + expect((thrown as BrowseClientError).message).toContain("browse binary not found"); + expect((thrown as BrowseClientError).message).toContain("./setup"); + expect((thrown as BrowseClientError).message).toContain("GSTACK_BROWSE_BIN"); + // Back-compat alias still surfaces in the diagnostic. 
+ expect((thrown as BrowseClientError).message).toContain("BROWSE_BIN"); } + // If the test box has a real browse install on disk, sibling/global may + // resolve and the helper won't throw — that's fine; the assertion is + // gated on whether it threw at all. + }); - // Restore env - if (originalEnv === undefined) { - delete process.env.BROWSE_BIN; - } else { - process.env.BROWSE_BIN = originalEnv; - } + test("honors GSTACK_BROWSE_BIN when it points at a real executable", () => { + const resolved = withEnv({ GSTACK_BROWSE_BIN: REAL_EXE }, () => resolveBrowseBin()); + expect(resolved).toBe(REAL_EXE); }); - test("honors BROWSE_BIN when it points at a real executable", () => { - const originalEnv = process.env.BROWSE_BIN; - // `/bin/sh` exists on every POSIX system and is executable. - process.env.BROWSE_BIN = "/bin/sh"; + test("honors BROWSE_BIN as a back-compat alias", () => { + const resolved = withEnv( + { GSTACK_BROWSE_BIN: undefined, BROWSE_BIN: REAL_EXE }, + () => resolveBrowseBin(), + ); + expect(resolved).toBe(REAL_EXE); + }); - try { - const resolved = resolveBrowseBin(); - expect(resolved).toBe("/bin/sh"); - } finally { - if (originalEnv === undefined) { - delete process.env.BROWSE_BIN; - } else { - process.env.BROWSE_BIN = originalEnv; - } - } + test("GSTACK_BROWSE_BIN takes precedence over BROWSE_BIN", () => { + const resolved = withEnv( + { GSTACK_BROWSE_BIN: REAL_EXE, BROWSE_BIN: "/nonexistent/legacy" }, + () => resolveBrowseBin(), + ); + expect(resolved).toBe(REAL_EXE); + }); + + test("strips wrapping double quotes from override values", () => { + const resolved = withEnv({ GSTACK_BROWSE_BIN: `"${REAL_EXE}"` }, () => resolveBrowseBin()); + expect(resolved).toBe(REAL_EXE); }); }); diff --git a/make-pdf/test/pdftotext.test.ts b/make-pdf/test/pdftotext.test.ts index cfeebd14fb..4ab5c4fb78 100644 --- a/make-pdf/test/pdftotext.test.ts +++ b/make-pdf/test/pdftotext.test.ts @@ -8,7 +8,8 @@ import { describe, expect, test } from "bun:test"; -import { 
normalize, copyPasteGate } from "../src/pdftotext";
+import * as path from "node:path";
+import { normalize, copyPasteGate, findExecutable, resolvePdftotext, PdftotextUnavailableError } from "../src/pdftotext";

 describe("normalize", () => {
   test("strips trailing spaces", () => {
@@ -104,3 +105,103 @@ describe("copyPasteGate — assertion logic", () => {
     expect(Math.abs(expectedBreaks - tooManyBreaksNormalized)).toBeLessThanOrEqual(4);
   });
 });
+
+// ─── Binary resolution (v1.24-aligned) ──────────────────────────
+
+const REAL_EXE: string =
+  process.platform === "win32"
+    ? path.join(process.env.SystemRoot ?? "C:\\Windows", "System32", "cmd.exe")
+    : "/bin/sh";
+
+function withEnv<T>(overrides: Record<string, string | undefined>, fn: () => T): T {
+  const saved: Record<string, string | undefined> = {};
+  for (const k of Object.keys(overrides)) saved[k] = process.env[k];
+  for (const [k, v] of Object.entries(overrides)) {
+    if (v === undefined) delete process.env[k];
+    else process.env[k] = v;
+  }
+  try {
+    return fn();
+  } finally {
+    for (const [k, v] of Object.entries(saved)) {
+      if (v === undefined) delete process.env[k];
+      else process.env[k] = v;
+    }
+  }
+}
+
+describe("findExecutable (pdftotext.ts)", () => {
+  test("returns the bare path on POSIX when it's executable", () => {
+    if (process.platform === "win32") return;
+    expect(findExecutable("/bin/sh")).toBe("/bin/sh");
+  });
+
+  test("on win32, probes .exe / .cmd / .bat after the bare-path miss", () => {
+    if (process.platform !== "win32") return;
+    const base = path.join(process.env.SystemRoot ??
"C:\\Windows", "System32", "cmd"); + expect(findExecutable(base)).toBe(base + ".exe"); + }); + + test("returns null when no extension matches", () => { + expect(findExecutable("/nonexistent/path/to/nothing")).toBeNull(); + }); +}); + +describe("resolvePdftotext (override resolution, v1.24-aligned)", () => { + test("honors GSTACK_PDFTOTEXT_BIN when it points at a real executable", () => { + // We can't fake a real pdftotext, but we can fake "any executable" to + // exercise the override-resolution path. describeBinary will mark flavor + // as "unknown" since cmd.exe / /bin/sh don't respond to -v like pdftotext; + // the test asserts on the bin-path resolution, not the version probe. + const info = withEnv({ GSTACK_PDFTOTEXT_BIN: REAL_EXE }, () => resolvePdftotext()); + expect(info.bin).toBe(REAL_EXE); + }); + + test("honors PDFTOTEXT_BIN as a back-compat alias", () => { + const info = withEnv( + { GSTACK_PDFTOTEXT_BIN: undefined, PDFTOTEXT_BIN: REAL_EXE }, + () => resolvePdftotext(), + ); + expect(info.bin).toBe(REAL_EXE); + }); + + test("GSTACK_PDFTOTEXT_BIN takes precedence over PDFTOTEXT_BIN", () => { + const info = withEnv( + { GSTACK_PDFTOTEXT_BIN: REAL_EXE, PDFTOTEXT_BIN: "/nonexistent/legacy" }, + () => resolvePdftotext(), + ); + expect(info.bin).toBe(REAL_EXE); + }); + + test("strips wrapping double quotes from override values", () => { + const info = withEnv({ GSTACK_PDFTOTEXT_BIN: `"${REAL_EXE}"` }, () => resolvePdftotext()); + expect(info.bin).toBe(REAL_EXE); + }); + + test("error message includes Windows install hint and GSTACK_PDFTOTEXT_BIN", () => { + let thrown: unknown = null; + try { + withEnv( + { + GSTACK_PDFTOTEXT_BIN: "/nonexistent/gstack-pdftotext", + PDFTOTEXT_BIN: "/nonexistent/pdftotext", + PATH: "", + Path: "", + }, + () => resolvePdftotext(), + ); + } catch (err) { + thrown = err; + } + // If the test box has a real pdftotext on disk, resolution succeeds + // (POSIX candidates) — that's fine; the assertion is gated on whether + // it 
threw. On Windows-CI without poppler, it throws. + if (thrown) { + expect(thrown).toBeInstanceOf(PdftotextUnavailableError); + expect((thrown as Error).message).toContain("pdftotext not found"); + expect((thrown as Error).message).toContain("GSTACK_PDFTOTEXT_BIN"); + expect((thrown as Error).message).toContain("Windows"); + expect((thrown as Error).message).toContain("scoop install poppler"); + } + }); +}); diff --git a/plan-devex-review/SKILL.md b/plan-devex-review/SKILL.md index 1e90c10f98..dda12c3fe2 100644 --- a/plan-devex-review/SKILL.md +++ b/plan-devex-review/SKILL.md @@ -1823,21 +1823,6 @@ DX IMPLEMENTATION CHECKLIST ### Unresolved Decisions If any AskUserQuestion goes unanswered, note here. Never silently default. -## Review Log - -After producing the DX Scorecard above, persist the review result. - -**PLAN MODE EXCEPTION — ALWAYS RUN:** This command writes review metadata to -`~/.gstack/` (user config directory, not project files). - -```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-devex-review","timestamp":"TIMESTAMP","status":"STATUS","initial_score":N,"overall_score":N,"product_type":"TYPE","tthw_current":"TTHW_CURRENT","tthw_target":"TTHW_TARGET","mode":"MODE","persona":"PERSONA","competitive_tier":"TIER","pass_scores":{"getting_started":N,"api_design":N,"errors":N,"docs":N,"upgrade":N,"dev_env":N,"community":N,"measurement":N},"unresolved":N,"commit":"COMMIT"}' -``` - -Substitute values from the DX Scorecard. MODE is EXPANSION/POLISH/TRIAGE. -PERSONA is a short label (e.g., "yc-founder", "platform-eng"). -TIER is Champion/Competitive/NeedsWork/RedFlag. - ## Review Readiness Dashboard After completing the review, read the review log and config to display the dashboard. 
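The override-resolution behavior these tests pin down (the preferred GSTACK_* name wins, the legacy alias still works, wrapping double quotes are stripped) can be sketched as a standalone helper. `pickOverride` is a hypothetical name used only for illustration; it mirrors, rather than reuses, the `resolveOverride` helpers in the diff:

```typescript
// Sketch of the env-override precedence the tests exercise.
// pickOverride is hypothetical — not the actual resolveOverride
// from browseClient.ts / pdftotext.ts.
function pickOverride(env: Record<string, string | undefined>): string | null {
  // Preferred GSTACK_* name wins over the back-compat alias.
  const raw = env.GSTACK_BROWSE_BIN ?? env.BROWSE_BIN;
  if (!raw?.trim()) return null;
  // Strip one layer of wrapping double quotes (a common habit when
  // paths with spaces are stored via Windows `setx`).
  return raw.trim().replace(/^"(.*)"$/, "$1");
}

console.log(pickOverride({ GSTACK_BROWSE_BIN: "/opt/browse" }));          // /opt/browse
console.log(pickOverride({ BROWSE_BIN: '"/opt/legacy"' }));               // /opt/legacy
console.log(pickOverride({ GSTACK_BROWSE_BIN: "/a", BROWSE_BIN: "/b" })); // /a
console.log(pickOverride({}));                                            // null
```

The quote-strip step happens before any filesystem probe, which is why the "strips wrapping double quotes" tests can assert on the returned path alone.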
diff --git a/plan-devex-review/SKILL.md.tmpl b/plan-devex-review/SKILL.md.tmpl index a9f39ebb1d..449b4132cf 100644 --- a/plan-devex-review/SKILL.md.tmpl +++ b/plan-devex-review/SKILL.md.tmpl @@ -780,21 +780,6 @@ DX IMPLEMENTATION CHECKLIST ### Unresolved Decisions If any AskUserQuestion goes unanswered, note here. Never silently default. -## Review Log - -After producing the DX Scorecard above, persist the review result. - -**PLAN MODE EXCEPTION — ALWAYS RUN:** This command writes review metadata to -`~/.gstack/` (user config directory, not project files). - -```bash -~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"plan-devex-review","timestamp":"TIMESTAMP","status":"STATUS","initial_score":N,"overall_score":N,"product_type":"TYPE","tthw_current":"TTHW_CURRENT","tthw_target":"TTHW_TARGET","mode":"MODE","persona":"PERSONA","competitive_tier":"TIER","pass_scores":{"getting_started":N,"api_design":N,"errors":N,"docs":N,"upgrade":N,"dev_env":N,"community":N,"measurement":N},"unresolved":N,"commit":"COMMIT"}' -``` - -Substitute values from the DX Scorecard. MODE is EXPANSION/POLISH/TRIAGE. -PERSONA is a short label (e.g., "yc-founder", "platform-eng"). -TIER is Champion/Competitive/NeedsWork/RedFlag. - {{REVIEW_DASHBOARD}} {{PLAN_FILE_REVIEW_REPORT}} diff --git a/review/SKILL.md b/review/SKILL.md index 37c1651584..e29e4ad91c 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -889,19 +889,43 @@ For each item, note: - The item text (verbatim or concise summary) - Its category: CODE | TEST | MIGRATION | CONFIG | DOCS +### Verification Mode + +Before judging completion, classify HOW each item can be verified. The diff alone cannot prove every kind of work. Items outside the current repo or system are structurally invisible to `git diff`. + +- **DIFF-VERIFIABLE** — A code change in this repo would manifest in `git diff ...HEAD`. 
Examples: "add UserService" (file appears), "validate input X" (validation logic appears), "create users table" (migration file appears).
+- **CROSS-REPO** — Item names a file or change in a sibling repo (e.g., `domain-hq/docs/dashboard.md`, `~/Development/<repo>/...`). The current diff CANNOT prove this.
+- **EXTERNAL-STATE** — Item names state in an external system: Supabase config/RLS, Cloudflare DNS, Vercel env vars, OAuth provider allowlists, third-party SaaS, DNS records. The current diff CANNOT prove this.
+- **CONTENT-SHAPE** — Item requires a file to follow a specific convention. If the file is in this repo: diff-verifiable. If in another repo or system: see CROSS-REPO / EXTERNAL-STATE.
+
+**Verification dispatch:**
+
+- **DIFF-VERIFIABLE** → cross-reference against diff (next section).
+- **CROSS-REPO** → if the sibling repo is reachable on disk (try `~/Development/<repo>/`, `~/code/<repo>/`, the parent of the current repo), run `[ -f <path> ]` to check file existence. File exists → DONE (cite path). File missing → NOT DONE (cite path). Path unreachable → UNVERIFIABLE (cite what needs manual check).
+- **EXTERNAL-STATE** → UNVERIFIABLE. Cite the system and the specific check the user must perform.
+- **CONTENT-SHAPE in another repo** → if the file exists, run any project-detected validator (see "Validator detection" below) before falling back to UNVERIFIABLE. With a validator: pass → DONE; fail → NOT DONE (cite validator output). No validator available: classify UNVERIFIABLE and cite both the file path and the convention to confirm.
+
+**Path concreteness rule.** If a plan item names a *concrete filesystem path* (absolute, `~/...`, or `<repo>/...`), it MUST be classified DONE or NOT DONE based on `[ -f <path> ]`. UNVERIFIABLE is only valid when the path is genuinely abstract ("Cloudflare DNS", "Supabase allowlist") or the sibling root is unreachable on this machine. "I don't want to check" is not unreachable.
+
+**Validator detection.** Before falling back to UNVERIFIABLE on a CONTENT-SHAPE item, scan the target repo's `package.json` for any script matching `validate-*`, `lint-wiki`, `check-docs`, or similar. If found, invoke it with the relevant path argument (e.g., `npm run validate-wiki -- <path>`). For multi-target validators (e.g., `validate-wiki --all`), run once and reconcile per-item from the output. A passing validator promotes the item from UNVERIFIABLE to DONE; a failing one demotes to NOT DONE.
+
+**Honesty rule.** Do NOT classify an item as DONE just because related code shipped. Code that *handles* a deliverable is not the deliverable. Shipping a markdown-extraction library is not the same as shipping the markdown file. When in doubt between DONE and UNVERIFIABLE, prefer UNVERIFIABLE — better to surface a confirmation prompt than silently miss a deliverable.
+
 ### Cross-Reference Against Diff

 Run `git diff origin/...HEAD` and `git log origin/..HEAD --oneline` to understand what was implemented.

-For each extracted plan item, check the diff and classify:
+For each extracted plan item, run the verification dispatch from the previous section, then classify:

-- **DONE** — Clear evidence in the diff that this item was implemented. Cite the specific file(s) changed.
-- **PARTIAL** — Some work toward this item exists in the diff but it's incomplete (e.g., model created but controller missing, function exists but edge cases not handled).
-- **NOT DONE** — No evidence in the diff that this item was addressed.
+- **NOT DONE** — Verification ran and produced negative evidence (file missing, code absent in diff, sibling-repo file confirmed absent). - **CHANGED** — The item was implemented using a different approach than the plan described, but the same goal is achieved. Note the difference. +- **UNVERIFIABLE** — The diff and any reachable sibling-repo checks cannot prove or disprove this. Always applies to EXTERNAL-STATE items and to CROSS-REPO items where the sibling repo isn't reachable. Cite the specific manual verification the user must perform (e.g., "check Cloudflare DNS shows DNS-only mode for dashboard.example.com", "confirm /docs/dashboard.md exists in domain-hq repo"). -**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present. +**Be conservative with DONE** — require clear evidence. A file being touched is not enough; the specific functionality described must be present. **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed. +**Be honest with UNVERIFIABLE** — better to surface 5 items the user must manually confirm than silently classify them DONE. 
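The triage above can be sketched as a tiny dispatcher. The keyword patterns are illustrative assumptions only; a real pass would apply the full heuristics described in this section:

```typescript
type VerificationMode = "DIFF-VERIFIABLE" | "CROSS-REPO" | "EXTERNAL-STATE";

// Minimal sketch of the classification step. The keyword lists are
// assumptions chosen for illustration, not the reviewer's actual rules.
function classifyItem(item: string): VerificationMode {
  // External systems named in the section: Supabase, Cloudflare, Vercel, DNS, OAuth.
  if (/\b(supabase|cloudflare|vercel|oauth|dns)\b/i.test(item)) return "EXTERNAL-STATE";
  // A path under another repo's root cannot appear in this repo's diff.
  if (/~\//.test(item) || /sibling repo/i.test(item)) return "CROSS-REPO";
  return "DIFF-VERIFIABLE";
}

console.log(classifyItem("add UserService"));                                        // DIFF-VERIFIABLE
console.log(classifyItem("confirm dashboard.md exists in ~/Development/domain-hq")); // CROSS-REPO
console.log(classifyItem("Cloudflare DNS-only mode on api.example.com"));            // EXTERNAL-STATE
```

Classification runs once per extracted item, before any diff cross-referencing, so the dispatch table decides which verification step each item flows into.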
 ### Output Format

@@ -911,20 +935,25 @@
 PLAN COMPLETION AUDIT
 Plan: {plan file path}

 ## Implementation Items
-  [DONE] Create UserService — src/services/user_service.rb (+142 lines)
-  [PARTIAL] Add validation — model validates but missing controller checks
-  [NOT DONE] Add caching layer — no cache-related changes in diff
-  [CHANGED] "Redis queue" → implemented with Sidekiq instead
+  [DONE]     Create UserService — src/services/user_service.rb (+142 lines)
+  [PARTIAL]  Add validation — model validates but missing controller checks
+  [NOT DONE] Add caching layer — no cache-related changes in diff
+  [CHANGED]  "Redis queue" → implemented with Sidekiq instead

 ## Test Items
-  [DONE] Unit tests for UserService — test/services/user_service_test.rb
-  [NOT DONE] E2E test for signup flow
+  [DONE]     Unit tests for UserService — test/services/user_service_test.rb
+  [NOT DONE] E2E test for signup flow

 ## Migration Items
-  [DONE] Create users table — db/migrate/20240315_create_users.rb
+  [DONE]     Create users table — db/migrate/20240315_create_users.rb
+
+## Cross-Repo / External Items
+  [DONE]         sibling-repo has /docs/dashboard.md — verified at ~/Development/sibling-repo/docs/dashboard.md
+  [UNVERIFIABLE] Cloudflare DNS-only on api.example.com — external system, manual check required
+  [UNVERIFIABLE] Supabase auth allowlist contains user email — external system, confirm in Supabase dashboard

 ─────────────────────────────────
-COMPLETION: 4/7 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED
+COMPLETION: 4/10 DONE, 1 PARTIAL, 2 NOT DONE, 1 CHANGED, 2 UNVERIFIABLE
 ─────────────────────────────────
 ```

diff --git a/scripts/resolvers/review.ts b/scripts/resolvers/review.ts
index faf7dbf2c3..263767d699 100644
--- a/scripts/resolvers/review.ts
+++ b/scripts/resolvers/review.ts
@@ -772,21 +772,47 @@
 For each item, note:
 - The item text (verbatim or concise summary)
 - Its category: CODE | TEST | MIGRATION | CONFIG | DOCS`);

+  // ── Verification Mode (per PR #1302 — VAS-449 remediation) ──
+  sections.push(`
+### Verification Mode
+
+Before judging completion, classify HOW each item can be verified. The diff alone cannot prove every kind of work. Items outside the current repo or system are structurally invisible to \`git diff\`.
+
+- **DIFF-VERIFIABLE** — A code change in this repo would manifest in \`git diff ...HEAD\`. Examples: "add UserService" (file appears), "validate input X" (validation logic appears), "create users table" (migration file appears).
+- **CROSS-REPO** — Item names a file or change in a sibling repo (e.g., \`domain-hq/docs/dashboard.md\`, \`~/Development/<repo>/...\`). The current diff CANNOT prove this.
+- **EXTERNAL-STATE** — Item names state in an external system: Supabase config/RLS, Cloudflare DNS, Vercel env vars, OAuth provider allowlists, third-party SaaS, DNS records. The current diff CANNOT prove this.
+- **CONTENT-SHAPE** — Item requires a file to follow a specific convention. If the file is in this repo: diff-verifiable. If in another repo or system: see CROSS-REPO / EXTERNAL-STATE.
+
+**Verification dispatch:**
+
+- **DIFF-VERIFIABLE** → cross-reference against diff (next section).
+- **CROSS-REPO** → if the sibling repo is reachable on disk (try \`~/Development/<repo>/\`, \`~/code/<repo>/\`, the parent of the current repo), run \`[ -f <path> ]\` to check file existence. File exists → DONE (cite path). File missing → NOT DONE (cite path). Path unreachable → UNVERIFIABLE (cite what needs manual check).
+- **EXTERNAL-STATE** → UNVERIFIABLE. Cite the system and the specific check the user must perform.
+- **CONTENT-SHAPE in another repo** → if the file exists, run any project-detected validator (see "Validator detection" below) before falling back to UNVERIFIABLE. With a validator: pass → DONE; fail → NOT DONE (cite validator output). No validator available: classify UNVERIFIABLE and cite both the file path and the convention to confirm.
+ +**Path concreteness rule.** If a plan item names a *concrete filesystem path* (absolute, \`~/...\`, or \`/\`), it MUST be classified DONE or NOT DONE based on \`[ -f ]\`. UNVERIFIABLE is only valid when the path is genuinely abstract ("Cloudflare DNS", "Supabase allowlist") or the sibling root is unreachable on this machine. "I don't want to check" is not unreachable. + +**Validator detection.** Before falling back to UNVERIFIABLE on a CONTENT-SHAPE item, scan the target repo's \`package.json\` for any script matching \`validate-*\`, \`lint-wiki\`, \`check-docs\`, or similar. If found, invoke it with the relevant path argument (e.g., \`npm run validate-wiki -- \`). For multi-target validators (e.g., \`validate-wiki --all\`), run once and reconcile per-item from the output. A passing validator promotes the item from UNVERIFIABLE to DONE; a failing one demotes to NOT DONE. + +**Honesty rule.** Do NOT classify an item as DONE just because related code shipped. Code that *handles* a deliverable is not the deliverable. Shipping a markdown-extraction library is not the same as shipping the markdown file. When in doubt between DONE and UNVERIFIABLE, prefer UNVERIFIABLE — better to surface a confirmation prompt than silently miss a deliverable.`); + // ── Cross-reference against diff ── sections.push(` ### Cross-Reference Against Diff Run \`git diff origin/...HEAD\` and \`git log origin/..HEAD --oneline\` to understand what was implemented. -For each extracted plan item, check the diff and classify: +For each extracted plan item, run the verification dispatch from the previous section, then classify: -- **DONE** — Clear evidence in the diff that this item was implemented. Cite the specific file(s) changed. -- **PARTIAL** — Some work toward this item exists in the diff but it's incomplete (e.g., model created but controller missing, function exists but edge cases not handled). -- **NOT DONE** — No evidence in the diff that this item was addressed. 
+- **DONE** — Clear evidence the item shipped. Cite the specific file(s) changed in the diff for DIFF-VERIFIABLE items, or the verified path that exists for CROSS-REPO items with a reachable sibling repo. +- **PARTIAL** — Some work toward this item exists but is incomplete (e.g., model created but controller missing, function exists but edge cases not handled). +- **NOT DONE** — Verification ran and produced negative evidence (file missing, code absent in diff, sibling-repo file confirmed absent). - **CHANGED** — The item was implemented using a different approach than the plan described, but the same goal is achieved. Note the difference. +- **UNVERIFIABLE** — The diff and any reachable sibling-repo checks cannot prove or disprove this. Always applies to EXTERNAL-STATE items and to CROSS-REPO items where the sibling repo isn't reachable. Cite the specific manual verification the user must perform (e.g., "check Cloudflare DNS shows DNS-only mode for dashboard.example.com", "confirm /docs/dashboard.md exists in domain-hq repo"). -**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present. -**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.`); +**Be conservative with DONE** — require clear evidence. A file being touched is not enough; the specific functionality described must be present. +**Be generous with CHANGED** — if the goal is met by different means, that counts as addressed. 
+**Be honest with UNVERIFIABLE** — better to surface 5 items the user must manually confirm than silently classify them DONE.`);

  // ── Output format ──
  sections.push(`
@@ -798,20 +824,25 @@ PLAN COMPLETION AUDIT
Plan: {plan file path}
## Implementation Items
- [DONE] Create UserService — src/services/user_service.rb (+142 lines)
- [PARTIAL] Add validation — model validates but missing controller checks
- [NOT DONE] Add caching layer — no cache-related changes in diff
- [CHANGED] "Redis queue" → implemented with Sidekiq instead
+ [DONE] Create UserService — src/services/user_service.rb (+142 lines)
+ [PARTIAL] Add validation — model validates but missing controller checks
+ [NOT DONE] Add caching layer — no cache-related changes in diff
+ [CHANGED] "Redis queue" → implemented with Sidekiq instead
## Test Items
- [DONE] Unit tests for UserService — test/services/user_service_test.rb
- [NOT DONE] E2E test for signup flow
+ [DONE] Unit tests for UserService — test/services/user_service_test.rb
+ [NOT DONE] E2E test for signup flow
## Migration Items
- [DONE] Create users table — db/migrate/20240315_create_users.rb
+ [DONE] Create users table — db/migrate/20240315_create_users.rb
+
+## Cross-Repo / External Items
+ [DONE] sibling-repo has /docs/dashboard.md — verified at ~/Development/sibling-repo/docs/dashboard.md
+ [UNVERIFIABLE] Cloudflare DNS-only on api.example.com — external system, manual check required
+ [UNVERIFIABLE] Supabase auth allowlist contains user email — external system, confirm in Supabase dashboard
─────────────────────────────────
-COMPLETION: 4/7 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED
+COMPLETION: 4/10 DONE, 1 PARTIAL, 2 NOT DONE, 1 CHANGED, 2 UNVERIFIABLE
─────────────────────────────────
\`\`\``);
@@ -820,21 +851,41 @@ COMPLETION: 4/7 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED
  sections.push(`
### Gate Logic
-After producing the completion checklist:
+After producing the completion checklist, evaluate in priority order:
-- **All DONE or CHANGED:**
Pass. "Plan completion: PASS — all items addressed." Continue. -- **Only PARTIAL items (no NOT DONE):** Continue with a note in the PR body. Not blocking. -- **Any NOT DONE items:** Use AskUserQuestion: - - Show the completion checklist above - - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation." - - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A. - - Options: - A) Stop — implement the missing items before shipping - B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5) - C) These items were intentionally dropped — remove from scope - - If A: STOP. List the missing items for the user to implement. - - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}". - - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}." +1. **Any NOT DONE items** (highest priority — known missing work). Use AskUserQuestion: + - Show the completion checklist above + - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation." + - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A. + - Options: + A) Stop — implement the missing items before shipping + B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5) + C) These items were intentionally dropped — remove from scope + - If A: STOP. List the missing items for the user to implement. + - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}". + - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}." + +2. **Any UNVERIFIABLE items** (silent gaps — the diff cannot prove them either way). 
Only fires after NOT DONE is resolved or absent. + + **Per-item confirmation is mandatory.** Do NOT use a single AskUserQuestion to blanket-confirm all UNVERIFIABLE items. Blanket confirmation is the failure mode that surfaced in VAS-449 (user clicks A without opening any file). Instead: + + - Loop through UNVERIFIABLE items one at a time. + - For each item, use AskUserQuestion with the item's *specific* manual check (e.g., "Confirm: does \`~/Development/domain-hq/docs/dashboard.md\` exist?", not "Have you checked all items?"). + - Options per item: + Y) Confirmed done — cite what you verified (free-text, embedded in PR body) + N) Not done — block ship; treat as NOT DONE and re-enter the priority-1 gate + D) Intentionally dropped — note in PR body: "Plan item intentionally dropped: {item}" + - RECOMMENDATION per item: Y if the item is concrete and easily verified; N if it's critical-path (auth, DNS, deliverables to other repos) and the user shows hesitation. + + **Exit conditions:** + - Any N: STOP. Surface the missing items, suggest re-running /ship after they're addressed. + - All Y or D: Continue. Embed \`## Plan Completion — Manual Verifications\` section in PR body listing each Y'd item with the user's free-text evidence and each D'd item with "intentionally dropped". + + **Cap.** If there are more than 5 UNVERIFIABLE items, present them as a numbered list first and ask whether the user wants to (1) confirm each individually, (2) stop and reduce scope, or (3) explicitly accept blanket-confirmation with the warning that this is the VAS-449 failure shape. Default and recommended option is (1). + +3. **Only PARTIAL items (no NOT DONE, no UNVERIFIABLE):** Continue with a note in the PR body. Not blocking. + +4. **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue. **No plan file found:** Skip entirely. "No plan file detected — skipping plan completion audit." 
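The priority order just described collapses to a small pure function over the checklist counts. A minimal sketch — the names and shapes below are hypothetical illustrations, not code that exists in review.ts:

```typescript
// Gate outcomes, in the documented priority order: NOT DONE outranks
// UNVERIFIABLE, which outranks PARTIAL; PASS fires only when nothing else did.
type GateAction =
  | 'ASK_NOT_DONE'        // priority 1: known missing work
  | 'ASK_UNVERIFIABLE'    // priority 2: per-item confirmation loop
  | 'CONTINUE_WITH_NOTE'  // priority 3: only PARTIAL items
  | 'PASS';               // priority 4: all DONE or CHANGED

interface ChecklistCounts {
  done: number;
  changed: number;
  partial: number;
  notDone: number;
  unverifiable: number;
}

// Evaluate the gate strictly in priority order; earlier gates mask later ones.
function evaluateGate(c: ChecklistCounts): GateAction {
  if (c.notDone > 0) return 'ASK_NOT_DONE';
  if (c.unverifiable > 0) return 'ASK_UNVERIFIABLE';
  if (c.partial > 0) return 'CONTINUE_WITH_NOTE';
  return 'PASS';
}
```

The ordering is the point: UNVERIFIABLE never fires while NOT DONE work is unresolved, and an N answer in the per-item loop is treated as NOT DONE precisely so it re-enters this evaluation at priority 1.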
diff --git a/ship/SKILL.md b/ship/SKILL.md index 36b0716523..27b785e5ab 100644 --- a/ship/SKILL.md +++ b/ship/SKILL.md @@ -1613,19 +1613,43 @@ For each item, note: - The item text (verbatim or concise summary) - Its category: CODE | TEST | MIGRATION | CONFIG | DOCS +### Verification Mode + +Before judging completion, classify HOW each item can be verified. The diff alone cannot prove every kind of work. Items outside the current repo or system are structurally invisible to `git diff`. + +- **DIFF-VERIFIABLE** — A code change in this repo would manifest in `git diff ...HEAD`. Examples: "add UserService" (file appears), "validate input X" (validation logic appears), "create users table" (migration file appears). +- **CROSS-REPO** — Item names a file or change in a sibling repo (e.g., `domain-hq/docs/dashboard.md`, `~/Development//...`). The current diff CANNOT prove this. +- **EXTERNAL-STATE** — Item names state in an external system: Supabase config/RLS, Cloudflare DNS, Vercel env vars, OAuth provider allowlists, third-party SaaS, DNS records. The current diff CANNOT prove this. +- **CONTENT-SHAPE** — Item requires a file to follow a specific convention. If the file is in this repo: diff-verifiable. If in another repo or system: see CROSS-REPO / EXTERNAL-STATE. + +**Verification dispatch:** + +- **DIFF-VERIFIABLE** → cross-reference against diff (next section). +- **CROSS-REPO** → if the sibling repo is reachable on disk (try `~/Development//`, `~/code//`, the parent of the current repo), run `[ -f ]` to check file existence. File exists → DONE (cite path). File missing → NOT DONE (cite path). Path unreachable → UNVERIFIABLE (cite what needs manual check). +- **EXTERNAL-STATE** → UNVERIFIABLE. Cite the system and the specific check the user must perform. +- **CONTENT-SHAPE in another repo** → if the file exists, run any project-detected validator (see "Validator detection" below) before falling back to UNVERIFIABLE. 
With a validator: pass → DONE; fail → NOT DONE (cite validator output). No validator available: classify UNVERIFIABLE and cite both the file path and the convention to confirm. + +**Path concreteness rule.** If a plan item names a *concrete filesystem path* (absolute, `~/...`, or `/`), it MUST be classified DONE or NOT DONE based on `[ -f ]`. UNVERIFIABLE is only valid when the path is genuinely abstract ("Cloudflare DNS", "Supabase allowlist") or the sibling root is unreachable on this machine. "I don't want to check" is not unreachable. + +**Validator detection.** Before falling back to UNVERIFIABLE on a CONTENT-SHAPE item, scan the target repo's `package.json` for any script matching `validate-*`, `lint-wiki`, `check-docs`, or similar. If found, invoke it with the relevant path argument (e.g., `npm run validate-wiki -- `). For multi-target validators (e.g., `validate-wiki --all`), run once and reconcile per-item from the output. A passing validator promotes the item from UNVERIFIABLE to DONE; a failing one demotes to NOT DONE. + +**Honesty rule.** Do NOT classify an item as DONE just because related code shipped. Code that *handles* a deliverable is not the deliverable. Shipping a markdown-extraction library is not the same as shipping the markdown file. When in doubt between DONE and UNVERIFIABLE, prefer UNVERIFIABLE — better to surface a confirmation prompt than silently miss a deliverable. + ### Cross-Reference Against Diff Run `git diff origin/...HEAD` and `git log origin/..HEAD --oneline` to understand what was implemented. -For each extracted plan item, check the diff and classify: +For each extracted plan item, run the verification dispatch from the previous section, then classify: -- **DONE** — Clear evidence in the diff that this item was implemented. Cite the specific file(s) changed. 
-- **PARTIAL** — Some work toward this item exists in the diff but it's incomplete (e.g., model created but controller missing, function exists but edge cases not handled). -- **NOT DONE** — No evidence in the diff that this item was addressed. +- **DONE** — Clear evidence the item shipped. Cite the specific file(s) changed in the diff for DIFF-VERIFIABLE items, or the verified path that exists for CROSS-REPO items with a reachable sibling repo. +- **PARTIAL** — Some work toward this item exists but is incomplete (e.g., model created but controller missing, function exists but edge cases not handled). +- **NOT DONE** — Verification ran and produced negative evidence (file missing, code absent in diff, sibling-repo file confirmed absent). - **CHANGED** — The item was implemented using a different approach than the plan described, but the same goal is achieved. Note the difference. +- **UNVERIFIABLE** — The diff and any reachable sibling-repo checks cannot prove or disprove this. Always applies to EXTERNAL-STATE items and to CROSS-REPO items where the sibling repo isn't reachable. Cite the specific manual verification the user must perform (e.g., "check Cloudflare DNS shows DNS-only mode for dashboard.example.com", "confirm /docs/dashboard.md exists in domain-hq repo"). -**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present. +**Be conservative with DONE** — require clear evidence. A file being touched is not enough; the specific functionality described must be present. **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed. +**Be honest with UNVERIFIABLE** — better to surface 5 items the user must manually confirm than silently classify them DONE. 
### Output Format
@@ -1635,56 +1659,81 @@ PLAN COMPLETION AUDIT
Plan: {plan file path}
## Implementation Items
- [DONE] Create UserService — src/services/user_service.rb (+142 lines)
- [PARTIAL] Add validation — model validates but missing controller checks
- [NOT DONE] Add caching layer — no cache-related changes in diff
- [CHANGED] "Redis queue" → implemented with Sidekiq instead
+ [DONE] Create UserService — src/services/user_service.rb (+142 lines)
+ [PARTIAL] Add validation — model validates but missing controller checks
+ [NOT DONE] Add caching layer — no cache-related changes in diff
+ [CHANGED] "Redis queue" → implemented with Sidekiq instead
## Test Items
- [DONE] Unit tests for UserService — test/services/user_service_test.rb
- [NOT DONE] E2E test for signup flow
+ [DONE] Unit tests for UserService — test/services/user_service_test.rb
+ [NOT DONE] E2E test for signup flow
## Migration Items
- [DONE] Create users table — db/migrate/20240315_create_users.rb
+ [DONE] Create users table — db/migrate/20240315_create_users.rb
+
+## Cross-Repo / External Items
+ [DONE] sibling-repo has /docs/dashboard.md — verified at ~/Development/sibling-repo/docs/dashboard.md
+ [UNVERIFIABLE] Cloudflare DNS-only on api.example.com — external system, manual check required
+ [UNVERIFIABLE] Supabase auth allowlist contains user email — external system, confirm in Supabase dashboard
─────────────────────────────────
-COMPLETION: 4/7 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED
+COMPLETION: 4/10 DONE, 1 PARTIAL, 2 NOT DONE, 1 CHANGED, 2 UNVERIFIABLE
─────────────────────────────────
```
### Gate Logic
-After producing the completion checklist:
+After producing the completion checklist, evaluate in priority order:
-- **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue.
-- **Only PARTIAL items (no NOT DONE):** Continue with a note in the PR body. Not blocking.
-- **Any NOT DONE items:** Use AskUserQuestion: - - Show the completion checklist above - - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation." - - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A. - - Options: - A) Stop — implement the missing items before shipping - B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5) - C) These items were intentionally dropped — remove from scope - - If A: STOP. List the missing items for the user to implement. - - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}". - - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}." +1. **Any NOT DONE items** (highest priority — known missing work). Use AskUserQuestion: + - Show the completion checklist above + - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation." + - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A. + - Options: + A) Stop — implement the missing items before shipping + B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5) + C) These items were intentionally dropped — remove from scope + - If A: STOP. List the missing items for the user to implement. + - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}". + - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}." + +2. **Any UNVERIFIABLE items** (silent gaps — the diff cannot prove them either way). Only fires after NOT DONE is resolved or absent. + + **Per-item confirmation is mandatory.** Do NOT use a single AskUserQuestion to blanket-confirm all UNVERIFIABLE items. 
Blanket confirmation is the failure mode that surfaced in VAS-449 (user clicks A without opening any file). Instead: + + - Loop through UNVERIFIABLE items one at a time. + - For each item, use AskUserQuestion with the item's *specific* manual check (e.g., "Confirm: does `~/Development/domain-hq/docs/dashboard.md` exist?", not "Have you checked all items?"). + - Options per item: + Y) Confirmed done — cite what you verified (free-text, embedded in PR body) + N) Not done — block ship; treat as NOT DONE and re-enter the priority-1 gate + D) Intentionally dropped — note in PR body: "Plan item intentionally dropped: {item}" + - RECOMMENDATION per item: Y if the item is concrete and easily verified; N if it's critical-path (auth, DNS, deliverables to other repos) and the user shows hesitation. + + **Exit conditions:** + - Any N: STOP. Surface the missing items, suggest re-running /ship after they're addressed. + - All Y or D: Continue. Embed `## Plan Completion — Manual Verifications` section in PR body listing each Y'd item with the user's free-text evidence and each D'd item with "intentionally dropped". + + **Cap.** If there are more than 5 UNVERIFIABLE items, present them as a numbered list first and ask whether the user wants to (1) confirm each individually, (2) stop and reduce scope, or (3) explicitly accept blanket-confirmation with the warning that this is the VAS-449 failure shape. Default and recommended option is (1). + +3. **Only PARTIAL items (no NOT DONE, no UNVERIFIABLE):** Continue with a note in the PR body. Not blocking. + +4. **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue. **No plan file found:** Skip entirely. "No plan file detected — skipping plan completion audit." **Include in PR body (Step 8):** Add a `## Plan Completion` section with the checklist summary. 
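The per-item loop's exit conditions reduce to a fold over the answers. The Y/N/D kinds and the PR-body section title mirror the gate text above; the function and type names are a hypothetical sketch, not code from /ship:

```typescript
// One answer per UNVERIFIABLE item, as collected by the per-item loop.
type Answer =
  | { kind: 'Y'; item: string; evidence: string } // confirmed done, with cited evidence
  | { kind: 'N'; item: string }                   // not done — blocks ship
  | { kind: 'D'; item: string };                  // intentionally dropped

// Any N stops the ship; all Y/D continue and emit the PR-body section.
function resolveUnverifiable(answers: Answer[]):
  | { outcome: 'STOP'; missing: string[] }
  | { outcome: 'CONTINUE'; prBodySection: string } {
  const missing = answers.filter((a) => a.kind === 'N').map((a) => a.item);
  if (missing.length > 0) return { outcome: 'STOP', missing };
  const lines = answers.map((a) =>
    a.kind === 'Y'
      ? `- ${a.item} — confirmed: ${a.evidence}`
      : `- ${a.item} — intentionally dropped`,
  );
  return {
    outcome: 'CONTINUE',
    prBodySection: ['## Plan Completion — Manual Verifications', ...lines].join('\n'),
  };
}
```

Treating any N as a hard STOP (rather than a note) is what re-enters the priority-1 gate: an item the user says is missing is NOT DONE work, regardless of how it was originally classified.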
>
> After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
-> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}`
+> `{"total_items":N,"done":N,"changed":N,"deferred":N,"unverifiable":N,"summary":""}`

**Parent processing:**

1. Parse the LAST line of the subagent's output as JSON.
-2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
-3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
-4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
+2. Store `done`, `deferred`, `unverifiable` for Step 20 metrics; use `summary` in PR body.
+3. If `deferred > 0` or `unverifiable > 0` and no user override, present the items via the appropriate AskUserQuestion (see Gate Logic priority order above) before continuing.
+4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). If `unverifiable > 0` and every item resolved Y or D in the UNVERIFIABLE gate, also embed `## Plan Completion — Manual Verifications` listing each Y'd item with its evidence and each D'd item as intentionally dropped.

-**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.
+**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline (parent processes the same plan-extraction + classification logic). If the inline fallback also fails (e.g., plan file unreadable, parser error), do NOT silently pass — surface the failure as an explicit AskUserQuestion: "Plan Completion audit could not run ({reason}). Options: (A) Skip audit and ship anyway — record that the audit was skipped in PR body and Step 20 metrics; (B) Stop and fix the audit." Default and recommended option is (B). Silent fail-open is the failure shape that VAS-449 surfaced.

---

@@ -2394,6 +2443,11 @@ already knows. A good test: would this insight save time in a future session?
If **Idempotency check:** Before bumping, classify the state by comparing `VERSION` against the base branch AND against `package.json`'s `version` field. Four states: FRESH (do bump), ALREADY_BUMPED (skip bump), DRIFT_STALE_PKG (sync pkg only, no re-bump), DRIFT_UNEXPECTED (stop and ask). ```bash +if ! git rev-parse --verify origin/ >/dev/null 2>&1; then + echo "ERROR: Unable to resolve origin/. Run 'git fetch origin' or verify the base branch exists." + exit 1 +fi + BASE_VERSION=$(git show origin/:VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0") CURRENT_VERSION=$(cat VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0") [ -z "$BASE_VERSION" ] && BASE_VERSION="0.0.0.0" diff --git a/ship/SKILL.md.tmpl b/ship/SKILL.md.tmpl index 470068fd89..a1c18ecf51 100644 --- a/ship/SKILL.md.tmpl +++ b/ship/SKILL.md.tmpl @@ -268,16 +268,16 @@ If multiple suites need to run, run them sequentially (each needs a test lane). > {{PLAN_COMPLETION_AUDIT_SHIP}} > > After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it): -> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}` +> `{"total_items":N,"done":N,"changed":N,"deferred":N,"unverifiable":N,"summary":""}` **Parent processing:** 1. Parse the LAST line of the subagent's output as JSON. -2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body. -3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing. -4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). +2. Store `done`, `deferred`, `unverifiable` for Step 20 metrics; use `summary` in PR body. +3. If `deferred > 0` or `unverifiable > 0` and no user override, present the items via the appropriate AskUserQuestion (see Gate Logic priority order above) before continuing. +4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). 
If `unverifiable > 0` and every item resolved Y or D in the UNVERIFIABLE gate, also embed `## Plan Completion — Manual Verifications` listing each Y'd item with its evidence and each D'd item as intentionally dropped.

-**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.
+**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline (parent processes the same plan-extraction + classification logic). If the inline fallback also fails (e.g., plan file unreadable, parser error), do NOT silently pass — surface the failure as an explicit AskUserQuestion: "Plan Completion audit could not run ({reason}). Options: (A) Skip audit and ship anyway — record that the audit was skipped in PR body and Step 20 metrics; (B) Stop and fix the audit." Default and recommended option is (B). Silent fail-open is the failure shape that VAS-449 surfaced.

---

@@ -406,6 +406,11 @@ For each comment in `comments`:
**Idempotency check:** Before bumping, classify the state by comparing `VERSION` against the base branch AND against `package.json`'s `version` field. Four states: FRESH (do bump), ALREADY_BUMPED (skip bump), DRIFT_STALE_PKG (sync pkg only, no re-bump), DRIFT_UNEXPECTED (stop and ask).

```bash
+if ! git rev-parse --verify origin/ >/dev/null 2>&1; then
+  echo "ERROR: Unable to resolve origin/. Run 'git fetch origin' or verify the base branch exists."
+  exit 1
+fi
+
BASE_VERSION=$(git show origin/:VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0")
CURRENT_VERSION=$(cat VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0")
[ -z "$BASE_VERSION" ] && BASE_VERSION="0.0.0.0"
diff --git a/test/codex-resume-flag-semantics.test.ts b/test/codex-resume-flag-semantics.test.ts
new file mode 100644
index 0000000000..9075e04897
--- /dev/null
+++ b/test/codex-resume-flag-semantics.test.ts
@@ -0,0 +1,60 @@
+/**
+ * Live codex CLI flag-semantics smoke for `codex exec resume`.
+ *
+ * Closes the gap left by #1270's regex-only assertion against codex/SKILL.md.
+ * That regex catches the SKILL.md regressing back to `-C/-s` flags, but it
+ * does not catch the codex CLI itself flipping flag semantics again. This
+ * test probes the live `codex exec resume --help` output and asserts the
+ * surface the gstack /codex skill depends on.
+ *
+ * Skips silently when codex is not on PATH, so dev machines without codex
+ * installed never see this fail. CI lanes that run with codex installed
+ * (the periodic-tier eval runners) will exercise it.
+ */
+
+import { describe, test, expect } from 'bun:test';
+import { spawnSync } from 'child_process';
+
+// `which` itself may be absent (e.g. bare cmd.exe): stdout is null then, so
+// coalesce before trimming rather than crashing at module load.
+const codexPath = (spawnSync('which', ['codex'], { encoding: 'utf-8' }).stdout ?? '').trim();
+const codexAvailable = codexPath.length > 0;
+
+describe.skipIf(!codexAvailable)(
+  'codex exec resume — flag semantics (live CLI smoke; closes #1270 regex-only gap)',
+  () => {
+    test('codex exec resume --help mentions sandbox_mode as a -c config key', () => {
+      const result = spawnSync('codex', ['exec', 'resume', '--help'], {
+        encoding: 'utf-8',
+        stdio: ['ignore', 'pipe', 'pipe'],
+        timeout: 10_000,
+      });
+      const helpText = (result.stdout || '') + '\n' + (result.stderr || '');
+      // The /codex skill builds resume invocations with `-c 'sandbox_mode="read-only"'`.
+      // If codex stops accepting `-c sandbox_mode=...` for the resume subcommand,
+      // every resume invocation through gstack starts failing.
+      expect(helpText).toMatch(/-c\b|--config\b|sandbox_mode/i);
+    });
+
+    test('codex exec resume --help does NOT advertise -C as a top-level flag', () => {
+      const result = spawnSync('codex', ['exec', 'resume', '--help'], {
+        encoding: 'utf-8',
+        stdio: ['ignore', 'pipe', 'pipe'],
+        timeout: 10_000,
+      });
+      const helpText = (result.stdout || '') + '\n' + (result.stderr || '');
+      // The whole point of #1270 was that `codex exec resume` rejects `-C `.
+ // If the help text starts listing `-C` again, the SKILL.md guidance to + // drop `-C` is wrong and the surrounding `cd "$_REPO_ROOT"` workaround is + // unnecessary. Either way, /codex needs an update. + // Allow `-C` to appear in flag descriptions or option-name strings as long + // as it isn't presented as a flag of the `resume` subcommand. The cheapest + // signal: the `Options:` block (or first-column flag list) should not + // contain a literal `-C ` flag entry on its own line. + const optionLines = helpText + .split('\n') + .map((l) => l.trimStart()) + .filter((l) => /^-[A-Za-z],?\s/.test(l) || /^--[A-Za-z]/.test(l)); + const hasTopLevelDashC = optionLines.some((l) => /^-C[\s,]/.test(l)); + expect(hasTopLevelDashC).toBe(false); + }); + }, +); diff --git a/test/fixtures/golden/claude-ship-SKILL.md b/test/fixtures/golden/claude-ship-SKILL.md index f3930cf2ec..27b785e5ab 100644 --- a/test/fixtures/golden/claude-ship-SKILL.md +++ b/test/fixtures/golden/claude-ship-SKILL.md @@ -350,30 +350,29 @@ _BRAIN_SYNC_BIN="~/.claude/skills/gstack/bin/gstack-brain-sync" _BRAIN_CONFIG_BIN="~/.claude/skills/gstack/bin/gstack-config" # /sync-gbrain context-load: teach the agent to use gbrain when it's available. -# Mutually exclusive variants per /plan-eng-review §4. Empty string when gbrain -# is not configured (zero context cost for non-gbrain users). +# Per-worktree pin: post-spike redesign uses kubectl-style `.gbrain-source` in the +# git toplevel to scope queries. Look for the pin in the worktree (not a global +# state file) so that opening worktree B without a pin doesn't claim "indexed" +# just because worktree A was synced. Empty string when gbrain is not +# configured (zero context cost for non-gbrain users). 
 _GBRAIN_CONFIG="$HOME/.gbrain/config.json"
 if [ -f "$_GBRAIN_CONFIG" ] && command -v gbrain >/dev/null 2>&1; then
   _GBRAIN_VERSION_OK=$(gbrain --version 2>/dev/null | grep -c '^gbrain ' || echo 0)
   if [ "$_GBRAIN_VERSION_OK" -gt 0 ] 2>/dev/null; then
-    _SYNC_STATE="$_GSTACK_HOME/.gbrain-sync-state.json"
-    _CWD_PAGES=0
-    if [ -f "$_SYNC_STATE" ]; then
-      # Flatten newlines so the regex works against pretty-printed JSON too.
-      _CWD_PAGES=$(tr -d '\n' < "$_SYNC_STATE" 2>/dev/null \
-        | grep -o '"name": *"code"[^}]*"detail": *{[^}]*"page_count": *[0-9]*' \
-        | grep -o '"page_count": *[0-9]*' | grep -o '[0-9]\+' | head -1)
-      _CWD_PAGES=${_CWD_PAGES:-0}
+    _GBRAIN_PIN_PATH=""
+    _REPO_TOP=$(git rev-parse --show-toplevel 2>/dev/null || echo "")
+    if [ -n "$_REPO_TOP" ] && [ -f "$_REPO_TOP/.gbrain-source" ]; then
+      _GBRAIN_PIN_PATH="$_REPO_TOP/.gbrain-source"
     fi
-    if [ "$_CWD_PAGES" -gt 0 ] 2>/dev/null; then
+    if [ -n "$_GBRAIN_PIN_PATH" ]; then
       echo "GBrain configured. Prefer \`gbrain search\`/\`gbrain query\` over Grep for"
       echo "semantic questions; use \`gbrain code-def\`/\`code-refs\`/\`code-callers\` for"
       echo "symbol-aware code lookup. See \"## GBrain Search Guidance\" in CLAUDE.md."
       echo "Run /sync-gbrain to refresh."
     else
-      echo "GBrain configured but this repo isn't indexed yet. Run \`/sync-gbrain --full\`"
-      echo "before relying on \`gbrain search\` for code questions in this repo."
-      echo "Falls back to Grep until indexed."
+      echo "GBrain configured but this worktree isn't pinned yet. Run \`/sync-gbrain --full\`"
+      echo "before relying on \`gbrain search\` for code questions in this worktree."
+      echo "Falls back to Grep until pinned."
     fi
   fi
 fi
@@ -1614,19 +1613,43 @@ For each item, note:
 - The item text (verbatim or concise summary)
 - Its category: CODE | TEST | MIGRATION | CONFIG | DOCS
 
+### Verification Mode
+
+Before judging completion, classify HOW each item can be verified. The diff alone cannot prove every kind of work. Items outside the current repo or system are structurally invisible to `git diff`.
+
+- **DIFF-VERIFIABLE** — A code change in this repo would manifest in `git diff ...HEAD`. Examples: "add UserService" (file appears), "validate input X" (validation logic appears), "create users table" (migration file appears).
+- **CROSS-REPO** — Item names a file or change in a sibling repo (e.g., `domain-hq/docs/dashboard.md`, `~/Development//...`). The current diff CANNOT prove this.
+- **EXTERNAL-STATE** — Item names state in an external system: Supabase config/RLS, Cloudflare DNS, Vercel env vars, OAuth provider allowlists, third-party SaaS, DNS records. The current diff CANNOT prove this.
+- **CONTENT-SHAPE** — Item requires a file to follow a specific convention. If the file is in this repo: diff-verifiable. If in another repo or system: see CROSS-REPO / EXTERNAL-STATE.
+
+**Verification dispatch:**
+
+- **DIFF-VERIFIABLE** → cross-reference against diff (next section).
+- **CROSS-REPO** → if the sibling repo is reachable on disk (try `~/Development//`, `~/code//`, the parent of the current repo), run `[ -f ]` to check file existence. File exists → DONE (cite path). File missing → NOT DONE (cite path). Path unreachable → UNVERIFIABLE (cite what needs manual check).
+- **EXTERNAL-STATE** → UNVERIFIABLE. Cite the system and the specific check the user must perform.
+- **CONTENT-SHAPE in another repo** → if the file exists, run any project-detected validator (see "Validator detection" below) before falling back to UNVERIFIABLE. With a validator: pass → DONE; fail → NOT DONE (cite validator output). No validator available: classify UNVERIFIABLE and cite both the file path and the convention to confirm.
+
+**Path concreteness rule.** If a plan item names a *concrete filesystem path* (absolute, `~/...`, or `/`), it MUST be classified DONE or NOT DONE based on `[ -f ]`. UNVERIFIABLE is only valid when the path is genuinely abstract ("Cloudflare DNS", "Supabase allowlist") or the sibling root is unreachable on this machine. "I don't want to check" is not unreachable.
+
+**Validator detection.** Before falling back to UNVERIFIABLE on a CONTENT-SHAPE item, scan the target repo's `package.json` for any script matching `validate-*`, `lint-wiki`, `check-docs`, or similar. If found, invoke it with the relevant path argument (e.g., `npm run validate-wiki -- `). For multi-target validators (e.g., `validate-wiki --all`), run once and reconcile per-item from the output. A passing validator promotes the item from UNVERIFIABLE to DONE; a failing one demotes to NOT DONE.
+
+**Honesty rule.** Do NOT classify an item as DONE just because related code shipped. Code that *handles* a deliverable is not the deliverable. Shipping a markdown-extraction library is not the same as shipping the markdown file. When in doubt between DONE and UNVERIFIABLE, prefer UNVERIFIABLE — better to surface a confirmation prompt than silently miss a deliverable.
+
 ### Cross-Reference Against Diff
 
 Run `git diff origin/...HEAD` and `git log origin/..HEAD --oneline` to understand what was implemented.
 
-For each extracted plan item, check the diff and classify:
+For each extracted plan item, run the verification dispatch from the previous section, then classify:
 
-- **DONE** — Clear evidence in the diff that this item was implemented. Cite the specific file(s) changed.
-- **PARTIAL** — Some work toward this item exists in the diff but it's incomplete (e.g., model created but controller missing, function exists but edge cases not handled).
-- **NOT DONE** — No evidence in the diff that this item was addressed.
+- **DONE** — Clear evidence the item shipped. Cite the specific file(s) changed in the diff for DIFF-VERIFIABLE items, or the verified path that exists for CROSS-REPO items with a reachable sibling repo.
+- **PARTIAL** — Some work toward this item exists but is incomplete (e.g., model created but controller missing, function exists but edge cases not handled).
+- **NOT DONE** — Verification ran and produced negative evidence (file missing, code absent in diff, sibling-repo file confirmed absent).
 - **CHANGED** — The item was implemented using a different approach than the plan described, but the same goal is achieved. Note the difference.
+- **UNVERIFIABLE** — The diff and any reachable sibling-repo checks cannot prove or disprove this. Always applies to EXTERNAL-STATE items and to CROSS-REPO items where the sibling repo isn't reachable. Cite the specific manual verification the user must perform (e.g., "check Cloudflare DNS shows DNS-only mode for dashboard.example.com", "confirm /docs/dashboard.md exists in domain-hq repo").
 
-**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
+**Be conservative with DONE** — require clear evidence. A file being touched is not enough; the specific functionality described must be present.
 **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.
+**Be honest with UNVERIFIABLE** — better to surface 5 items the user must manually confirm than silently classify them DONE.
 
 ### Output Format
 
@@ -1636,56 +1659,81 @@ PLAN COMPLETION AUDIT
 Plan: {plan file path}
 
 ## Implementation Items
-  [DONE] Create UserService — src/services/user_service.rb (+142 lines)
-  [PARTIAL] Add validation — model validates but missing controller checks
-  [NOT DONE] Add caching layer — no cache-related changes in diff
-  [CHANGED] "Redis queue" → implemented with Sidekiq instead
+  [DONE] Create UserService — src/services/user_service.rb (+142 lines)
+  [PARTIAL] Add validation — model validates but missing controller checks
+  [NOT DONE] Add caching layer — no cache-related changes in diff
+  [CHANGED] "Redis queue" → implemented with Sidekiq instead
 
 ## Test Items
-  [DONE] Unit tests for UserService — test/services/user_service_test.rb
-  [NOT DONE] E2E test for signup flow
+  [DONE] Unit tests for UserService — test/services/user_service_test.rb
+  [NOT DONE] E2E test for signup flow
 
 ## Migration Items
-  [DONE] Create users table — db/migrate/20240315_create_users.rb
+  [DONE] Create users table — db/migrate/20240315_create_users.rb
+
+## Cross-Repo / External Items
+  [DONE] sibling-repo has /docs/dashboard.md — verified at ~/Development/sibling-repo/docs/dashboard.md
+  [UNVERIFIABLE] Cloudflare DNS-only on api.example.com — external system, manual check required
+  [UNVERIFIABLE] Supabase auth allowlist contains user email — external system, confirm in Supabase dashboard
 
 ─────────────────────────────────
-COMPLETION: 4/7 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED
+COMPLETION: 4/10 DONE, 1 PARTIAL, 2 NOT DONE, 1 CHANGED, 2 UNVERIFIABLE
 ─────────────────────────────────
 ```
 
 ### Gate Logic
 
-After producing the completion checklist:
+After producing the completion checklist, evaluate in priority order:
 
-- **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue.
-- **Only PARTIAL items (no NOT DONE):** Continue with a note in the PR body. Not blocking.
-- **Any NOT DONE items:** Use AskUserQuestion:
-  - Show the completion checklist above
-  - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation."
-  - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A.
-  - Options:
-    A) Stop — implement the missing items before shipping
-    B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5)
-    C) These items were intentionally dropped — remove from scope
-  - If A: STOP. List the missing items for the user to implement.
-  - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}".
-  - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}."
+1. **Any NOT DONE items** (highest priority — known missing work). Use AskUserQuestion:
+   - Show the completion checklist above
+   - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation."
+   - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A.
+   - Options:
+     A) Stop — implement the missing items before shipping
+     B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5)
+     C) These items were intentionally dropped — remove from scope
+   - If A: STOP. List the missing items for the user to implement.
+   - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}".
+   - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}."
+
+2. **Any UNVERIFIABLE items** (silent gaps — the diff cannot prove them either way). Only fires after NOT DONE is resolved or absent.
+
+   **Per-item confirmation is mandatory.** Do NOT use a single AskUserQuestion to blanket-confirm all UNVERIFIABLE items. Blanket confirmation is the failure mode that surfaced in VAS-449 (user clicks A without opening any file). Instead:
+
+   - Loop through UNVERIFIABLE items one at a time.
+   - For each item, use AskUserQuestion with the item's *specific* manual check (e.g., "Confirm: does `~/Development/domain-hq/docs/dashboard.md` exist?", not "Have you checked all items?").
+   - Options per item:
+     Y) Confirmed done — cite what you verified (free-text, embedded in PR body)
+     N) Not done — block ship; record the item as NOT DONE
+     D) Intentionally dropped — note in PR body: "Plan item intentionally dropped: {item}"
+   - RECOMMENDATION per item: Y if the item is concrete and easily verified; N if it's critical-path (auth, DNS, deliverables to other repos) and the user shows hesitation.
+
+   **Exit conditions:**
+   - Any N: STOP. Surface the missing items, suggest re-running /ship after they're addressed.
+   - All Y or D: Continue. Embed `## Plan Completion — Manual Verifications` section in PR body listing each Y'd item with the user's free-text evidence and each D'd item with "intentionally dropped".
+
+   **Cap.** If there are more than 5 UNVERIFIABLE items, present them as a numbered list first and ask whether the user wants to (1) confirm each individually, (2) stop and reduce scope, or (3) explicitly accept blanket-confirmation with the warning that this is the VAS-449 failure shape. Default and recommended option is (1).
+
+3. **Only PARTIAL items (no NOT DONE, no UNVERIFIABLE):** Continue with a note in the PR body. Not blocking.
+
+4. **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue.
 
 **No plan file found:** Skip entirely. "No plan file detected — skipping plan completion audit."
 
 **Include in PR body (Step 8):** Add a `## Plan Completion` section with the checklist summary.
 
 >
 > After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
-> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}`
+> `{"total_items":N,"done":N,"changed":N,"deferred":N,"unverifiable":N,"summary":""}`
 
 **Parent processing:**
 
 1. Parse the LAST line of the subagent's output as JSON.
-2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
-3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
-4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
+2. Store `done`, `deferred`, `unverifiable` for Step 20 metrics; use `summary` in PR body.
+3. If `deferred > 0` or `unverifiable > 0` and no user override, present the items via the appropriate AskUserQuestion (see Gate Logic priority order above) before continuing.
+4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). If `unverifiable > 0` and the user confirmed items (Y) in the UNVERIFIABLE gate, also embed `## Plan Completion — Manual Verifications` listing each user-confirmed item.
 
-**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.
+**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline (parent processes the same plan-extraction + classification logic). If the inline fallback also fails (e.g., plan file unreadable, parser error), do NOT silently pass — surface the failure as an explicit AskUserQuestion: "Plan Completion audit could not run ({reason}). Options: (A) Skip audit and ship anyway — record that the audit was skipped in PR body and Step 20 metrics; (B) Stop and fix the audit." Default and recommended option is (B). Silent fail-open is the failure shape that VAS-449 surfaced.
 
 ---
 
@@ -2395,6 +2443,11 @@ already knows. A good test: would this insight save time in a future session? If
 **Idempotency check:** Before bumping, classify the state by comparing `VERSION` against the base branch AND against `package.json`'s `version` field. Four states: FRESH (do bump), ALREADY_BUMPED (skip bump), DRIFT_STALE_PKG (sync pkg only, no re-bump), DRIFT_UNEXPECTED (stop and ask).
 
 ```bash
+if ! git rev-parse --verify origin/ >/dev/null 2>&1; then
+  echo "ERROR: Unable to resolve origin/. Run 'git fetch origin' or verify the base branch exists."
+  exit 1
+fi
+
 BASE_VERSION=$(git show origin/:VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0")
 CURRENT_VERSION=$(cat VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0")
 [ -z "$BASE_VERSION" ] && BASE_VERSION="0.0.0.0"
diff --git a/test/fixtures/golden/codex-ship-SKILL.md b/test/fixtures/golden/codex-ship-SKILL.md
index 3c4cf10ed3..06f90461a4 100644
--- a/test/fixtures/golden/codex-ship-SKILL.md
+++ b/test/fixtures/golden/codex-ship-SKILL.md
@@ -339,30 +339,29 @@ _BRAIN_SYNC_BIN="$GSTACK_BIN/gstack-brain-sync"
 _BRAIN_CONFIG_BIN="$GSTACK_BIN/gstack-config"
 
 # /sync-gbrain context-load: teach the agent to use gbrain when it's available.
-# Mutually exclusive variants per /plan-eng-review §4. Empty string when gbrain
-# is not configured (zero context cost for non-gbrain users).
+# Per-worktree pin: post-spike redesign uses kubectl-style `.gbrain-source` in the
+# git toplevel to scope queries. Look for the pin in the worktree (not a global
+# state file) so that opening worktree B without a pin doesn't claim "indexed"
+# just because worktree A was synced. Empty string when gbrain is not
+# configured (zero context cost for non-gbrain users).
 _GBRAIN_CONFIG="$HOME/.gbrain/config.json"
 if [ -f "$_GBRAIN_CONFIG" ] && command -v gbrain >/dev/null 2>&1; then
   _GBRAIN_VERSION_OK=$(gbrain --version 2>/dev/null | grep -c '^gbrain ' || echo 0)
   if [ "$_GBRAIN_VERSION_OK" -gt 0 ] 2>/dev/null; then
-    _SYNC_STATE="$_GSTACK_HOME/.gbrain-sync-state.json"
-    _CWD_PAGES=0
-    if [ -f "$_SYNC_STATE" ]; then
-      # Flatten newlines so the regex works against pretty-printed JSON too.
-      _CWD_PAGES=$(tr -d '\n' < "$_SYNC_STATE" 2>/dev/null \
-        | grep -o '"name": *"code"[^}]*"detail": *{[^}]*"page_count": *[0-9]*' \
-        | grep -o '"page_count": *[0-9]*' | grep -o '[0-9]\+' | head -1)
-      _CWD_PAGES=${_CWD_PAGES:-0}
+    _GBRAIN_PIN_PATH=""
+    _REPO_TOP=$(git rev-parse --show-toplevel 2>/dev/null || echo "")
+    if [ -n "$_REPO_TOP" ] && [ -f "$_REPO_TOP/.gbrain-source" ]; then
+      _GBRAIN_PIN_PATH="$_REPO_TOP/.gbrain-source"
     fi
-    if [ "$_CWD_PAGES" -gt 0 ] 2>/dev/null; then
+    if [ -n "$_GBRAIN_PIN_PATH" ]; then
       echo "GBrain configured. Prefer \`gbrain search\`/\`gbrain query\` over Grep for"
       echo "semantic questions; use \`gbrain code-def\`/\`code-refs\`/\`code-callers\` for"
       echo "symbol-aware code lookup. See \"## GBrain Search Guidance\" in CLAUDE.md."
       echo "Run /sync-gbrain to refresh."
     else
-      echo "GBrain configured but this repo isn't indexed yet. Run \`/sync-gbrain --full\`"
-      echo "before relying on \`gbrain search\` for code questions in this repo."
-      echo "Falls back to Grep until indexed."
+      echo "GBrain configured but this worktree isn't pinned yet. Run \`/sync-gbrain --full\`"
+      echo "before relying on \`gbrain search\` for code questions in this worktree."
+      echo "Falls back to Grep until pinned."
     fi
   fi
 fi
@@ -1603,19 +1602,43 @@ For each item, note:
 - The item text (verbatim or concise summary)
 - Its category: CODE | TEST | MIGRATION | CONFIG | DOCS
 
+### Verification Mode
+
+Before judging completion, classify HOW each item can be verified. The diff alone cannot prove every kind of work. Items outside the current repo or system are structurally invisible to `git diff`.
+
+- **DIFF-VERIFIABLE** — A code change in this repo would manifest in `git diff ...HEAD`. Examples: "add UserService" (file appears), "validate input X" (validation logic appears), "create users table" (migration file appears).
+- **CROSS-REPO** — Item names a file or change in a sibling repo (e.g., `domain-hq/docs/dashboard.md`, `~/Development//...`). The current diff CANNOT prove this.
+- **EXTERNAL-STATE** — Item names state in an external system: Supabase config/RLS, Cloudflare DNS, Vercel env vars, OAuth provider allowlists, third-party SaaS, DNS records. The current diff CANNOT prove this.
+- **CONTENT-SHAPE** — Item requires a file to follow a specific convention. If the file is in this repo: diff-verifiable. If in another repo or system: see CROSS-REPO / EXTERNAL-STATE.
+
+**Verification dispatch:**
+
+- **DIFF-VERIFIABLE** → cross-reference against diff (next section).
+- **CROSS-REPO** → if the sibling repo is reachable on disk (try `~/Development//`, `~/code//`, the parent of the current repo), run `[ -f ]` to check file existence. File exists → DONE (cite path). File missing → NOT DONE (cite path). Path unreachable → UNVERIFIABLE (cite what needs manual check).
+- **EXTERNAL-STATE** → UNVERIFIABLE. Cite the system and the specific check the user must perform.
+- **CONTENT-SHAPE in another repo** → if the file exists, run any project-detected validator (see "Validator detection" below) before falling back to UNVERIFIABLE. With a validator: pass → DONE; fail → NOT DONE (cite validator output). No validator available: classify UNVERIFIABLE and cite both the file path and the convention to confirm.
+
+**Path concreteness rule.** If a plan item names a *concrete filesystem path* (absolute, `~/...`, or `/`), it MUST be classified DONE or NOT DONE based on `[ -f ]`. UNVERIFIABLE is only valid when the path is genuinely abstract ("Cloudflare DNS", "Supabase allowlist") or the sibling root is unreachable on this machine. "I don't want to check" is not unreachable.
+
+**Validator detection.** Before falling back to UNVERIFIABLE on a CONTENT-SHAPE item, scan the target repo's `package.json` for any script matching `validate-*`, `lint-wiki`, `check-docs`, or similar. If found, invoke it with the relevant path argument (e.g., `npm run validate-wiki -- `). For multi-target validators (e.g., `validate-wiki --all`), run once and reconcile per-item from the output. A passing validator promotes the item from UNVERIFIABLE to DONE; a failing one demotes to NOT DONE.
+
+**Honesty rule.** Do NOT classify an item as DONE just because related code shipped. Code that *handles* a deliverable is not the deliverable. Shipping a markdown-extraction library is not the same as shipping the markdown file. When in doubt between DONE and UNVERIFIABLE, prefer UNVERIFIABLE — better to surface a confirmation prompt than silently miss a deliverable.
+
 ### Cross-Reference Against Diff
 
 Run `git diff origin/...HEAD` and `git log origin/..HEAD --oneline` to understand what was implemented.
 
-For each extracted plan item, check the diff and classify:
+For each extracted plan item, run the verification dispatch from the previous section, then classify:
 
-- **DONE** — Clear evidence in the diff that this item was implemented. Cite the specific file(s) changed.
-- **PARTIAL** — Some work toward this item exists in the diff but it's incomplete (e.g., model created but controller missing, function exists but edge cases not handled).
-- **NOT DONE** — No evidence in the diff that this item was addressed.
+- **DONE** — Clear evidence the item shipped. Cite the specific file(s) changed in the diff for DIFF-VERIFIABLE items, or the verified path that exists for CROSS-REPO items with a reachable sibling repo.
+- **PARTIAL** — Some work toward this item exists but is incomplete (e.g., model created but controller missing, function exists but edge cases not handled).
+- **NOT DONE** — Verification ran and produced negative evidence (file missing, code absent in diff, sibling-repo file confirmed absent).
 - **CHANGED** — The item was implemented using a different approach than the plan described, but the same goal is achieved. Note the difference.
+- **UNVERIFIABLE** — The diff and any reachable sibling-repo checks cannot prove or disprove this. Always applies to EXTERNAL-STATE items and to CROSS-REPO items where the sibling repo isn't reachable. Cite the specific manual verification the user must perform (e.g., "check Cloudflare DNS shows DNS-only mode for dashboard.example.com", "confirm /docs/dashboard.md exists in domain-hq repo").
 
-**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present.
+**Be conservative with DONE** — require clear evidence. A file being touched is not enough; the specific functionality described must be present.
 **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed.
+**Be honest with UNVERIFIABLE** — better to surface 5 items the user must manually confirm than silently classify them DONE.
 
 ### Output Format
 
@@ -1625,56 +1648,81 @@ PLAN COMPLETION AUDIT
 Plan: {plan file path}
 
 ## Implementation Items
-  [DONE] Create UserService — src/services/user_service.rb (+142 lines)
-  [PARTIAL] Add validation — model validates but missing controller checks
-  [NOT DONE] Add caching layer — no cache-related changes in diff
-  [CHANGED] "Redis queue" → implemented with Sidekiq instead
+  [DONE] Create UserService — src/services/user_service.rb (+142 lines)
+  [PARTIAL] Add validation — model validates but missing controller checks
+  [NOT DONE] Add caching layer — no cache-related changes in diff
+  [CHANGED] "Redis queue" → implemented with Sidekiq instead
 
 ## Test Items
-  [DONE] Unit tests for UserService — test/services/user_service_test.rb
-  [NOT DONE] E2E test for signup flow
+  [DONE] Unit tests for UserService — test/services/user_service_test.rb
+  [NOT DONE] E2E test for signup flow
 
 ## Migration Items
-  [DONE] Create users table — db/migrate/20240315_create_users.rb
+  [DONE] Create users table — db/migrate/20240315_create_users.rb
+
+## Cross-Repo / External Items
+  [DONE] sibling-repo has /docs/dashboard.md — verified at ~/Development/sibling-repo/docs/dashboard.md
+  [UNVERIFIABLE] Cloudflare DNS-only on api.example.com — external system, manual check required
+  [UNVERIFIABLE] Supabase auth allowlist contains user email — external system, confirm in Supabase dashboard
 
 ─────────────────────────────────
-COMPLETION: 4/7 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED
+COMPLETION: 4/10 DONE, 1 PARTIAL, 2 NOT DONE, 1 CHANGED, 2 UNVERIFIABLE
 ─────────────────────────────────
 ```
 
 ### Gate Logic
 
-After producing the completion checklist:
+After producing the completion checklist, evaluate in priority order:
 
-- **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue.
-- **Only PARTIAL items (no NOT DONE):** Continue with a note in the PR body. Not blocking.
-- **Any NOT DONE items:** Use AskUserQuestion:
-  - Show the completion checklist above
-  - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation."
-  - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A.
-  - Options:
-    A) Stop — implement the missing items before shipping
-    B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5)
-    C) These items were intentionally dropped — remove from scope
-  - If A: STOP. List the missing items for the user to implement.
-  - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}".
-  - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}."
+1. **Any NOT DONE items** (highest priority — known missing work). Use AskUserQuestion:
+   - Show the completion checklist above
+   - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation."
+   - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A.
+   - Options:
+     A) Stop — implement the missing items before shipping
+     B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5)
+     C) These items were intentionally dropped — remove from scope
+   - If A: STOP. List the missing items for the user to implement.
+   - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}".
+   - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}."
+
+2. **Any UNVERIFIABLE items** (silent gaps — the diff cannot prove them either way). Only fires after NOT DONE is resolved or absent.
+
+   **Per-item confirmation is mandatory.** Do NOT use a single AskUserQuestion to blanket-confirm all UNVERIFIABLE items. Blanket confirmation is the failure mode that surfaced in VAS-449 (user clicks A without opening any file). Instead:
+
+   - Loop through UNVERIFIABLE items one at a time.
+   - For each item, use AskUserQuestion with the item's *specific* manual check (e.g., "Confirm: does `~/Development/domain-hq/docs/dashboard.md` exist?", not "Have you checked all items?").
+   - Options per item:
+     Y) Confirmed done — cite what you verified (free-text, embedded in PR body)
+     N) Not done — block ship; record the item as NOT DONE
+     D) Intentionally dropped — note in PR body: "Plan item intentionally dropped: {item}"
+   - RECOMMENDATION per item: Y if the item is concrete and easily verified; N if it's critical-path (auth, DNS, deliverables to other repos) and the user shows hesitation.
+
+   **Exit conditions:**
+   - Any N: STOP. Surface the missing items, suggest re-running /ship after they're addressed.
+   - All Y or D: Continue. Embed `## Plan Completion — Manual Verifications` section in PR body listing each Y'd item with the user's free-text evidence and each D'd item with "intentionally dropped".
+
+   **Cap.** If there are more than 5 UNVERIFIABLE items, present them as a numbered list first and ask whether the user wants to (1) confirm each individually, (2) stop and reduce scope, or (3) explicitly accept blanket-confirmation with the warning that this is the VAS-449 failure shape. Default and recommended option is (1).
+
+3. **Only PARTIAL items (no NOT DONE, no UNVERIFIABLE):** Continue with a note in the PR body. Not blocking.
+
+4. **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue.
 
 **No plan file found:** Skip entirely. "No plan file detected — skipping plan completion audit."
 
 **Include in PR body (Step 8):** Add a `## Plan Completion` section with the checklist summary.
 
 >
 > After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it):
-> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}`
+> `{"total_items":N,"done":N,"changed":N,"deferred":N,"unverifiable":N,"summary":""}`
 
 **Parent processing:**
 
 1. Parse the LAST line of the subagent's output as JSON.
-2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body.
-3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing.
-4. Embed `summary` in PR body's `## Plan Completion` section (Step 19).
+2. Store `done`, `deferred`, `unverifiable` for Step 20 metrics; use `summary` in PR body.
+3. If `deferred > 0` or `unverifiable > 0` and no user override, present the items via the appropriate AskUserQuestion (see Gate Logic priority order above) before continuing.
+4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). If `unverifiable > 0` and the user confirmed items (Y) in the UNVERIFIABLE gate, also embed `## Plan Completion — Manual Verifications` listing each user-confirmed item.
 
-**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure.
+**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline (parent processes the same plan-extraction + classification logic). If the inline fallback also fails (e.g., plan file unreadable, parser error), do NOT silently pass — surface the failure as an explicit AskUserQuestion: "Plan Completion audit could not run ({reason}). Options: (A) Skip audit and ship anyway — record that the audit was skipped in PR body and Step 20 metrics; (B) Stop and fix the audit." Default and recommended option is (B). Silent fail-open is the failure shape that VAS-449 surfaced.
 
 ---
 
@@ -2010,6 +2058,11 @@ already knows. A good test: would this insight save time in a future session? If
 **Idempotency check:** Before bumping, classify the state by comparing `VERSION` against the base branch AND against `package.json`'s `version` field. Four states: FRESH (do bump), ALREADY_BUMPED (skip bump), DRIFT_STALE_PKG (sync pkg only, no re-bump), DRIFT_UNEXPECTED (stop and ask).
 
 ```bash
+if ! git rev-parse --verify origin/ >/dev/null 2>&1; then
+  echo "ERROR: Unable to resolve origin/. Run 'git fetch origin' or verify the base branch exists."
+  exit 1
+fi
+
 BASE_VERSION=$(git show origin/:VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0")
 CURRENT_VERSION=$(cat VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0")
 [ -z "$BASE_VERSION" ] && BASE_VERSION="0.0.0.0"
diff --git a/test/fixtures/golden/factory-ship-SKILL.md b/test/fixtures/golden/factory-ship-SKILL.md
index b1d7ffe075..71ae2119f8 100644
--- a/test/fixtures/golden/factory-ship-SKILL.md
+++ b/test/fixtures/golden/factory-ship-SKILL.md
@@ -341,30 +341,29 @@ _BRAIN_SYNC_BIN="$GSTACK_BIN/gstack-brain-sync"
 _BRAIN_CONFIG_BIN="$GSTACK_BIN/gstack-config"
 
 # /sync-gbrain context-load: teach the agent to use gbrain when it's available.
-# Mutually exclusive variants per /plan-eng-review §4. Empty string when gbrain
-# is not configured (zero context cost for non-gbrain users).
+# Per-worktree pin: post-spike redesign uses kubectl-style `.gbrain-source` in the
+# git toplevel to scope queries. Look for the pin in the worktree (not a global
+# state file) so that opening worktree B without a pin doesn't claim "indexed"
+# just because worktree A was synced. Empty string when gbrain is not
+# configured (zero context cost for non-gbrain users).
 _GBRAIN_CONFIG="$HOME/.gbrain/config.json"
 if [ -f "$_GBRAIN_CONFIG" ] && command -v gbrain >/dev/null 2>&1; then
   _GBRAIN_VERSION_OK=$(gbrain --version 2>/dev/null | grep -c '^gbrain ' || echo 0)
   if [ "$_GBRAIN_VERSION_OK" -gt 0 ] 2>/dev/null; then
-    _SYNC_STATE="$_GSTACK_HOME/.gbrain-sync-state.json"
-    _CWD_PAGES=0
-    if [ -f "$_SYNC_STATE" ]; then
-      # Flatten newlines so the regex works against pretty-printed JSON too.
-      _CWD_PAGES=$(tr -d '\n' < "$_SYNC_STATE" 2>/dev/null \
-        | grep -o '"name": *"code"[^}]*"detail": *{[^}]*"page_count": *[0-9]*' \
-        | grep -o '"page_count": *[0-9]*' | grep -o '[0-9]\+' | head -1)
-      _CWD_PAGES=${_CWD_PAGES:-0}
+    _GBRAIN_PIN_PATH=""
+    _REPO_TOP=$(git rev-parse --show-toplevel 2>/dev/null || echo "")
+    if [ -n "$_REPO_TOP" ] && [ -f "$_REPO_TOP/.gbrain-source" ]; then
+      _GBRAIN_PIN_PATH="$_REPO_TOP/.gbrain-source"
     fi
-    if [ "$_CWD_PAGES" -gt 0 ] 2>/dev/null; then
+    if [ -n "$_GBRAIN_PIN_PATH" ]; then
       echo "GBrain configured. Prefer \`gbrain search\`/\`gbrain query\` over Grep for"
       echo "semantic questions; use \`gbrain code-def\`/\`code-refs\`/\`code-callers\` for"
       echo "symbol-aware code lookup. See \"## GBrain Search Guidance\" in CLAUDE.md."
       echo "Run /sync-gbrain to refresh."
     else
-      echo "GBrain configured but this repo isn't indexed yet. Run \`/sync-gbrain --full\`"
-      echo "before relying on \`gbrain search\` for code questions in this repo."
-      echo "Falls back to Grep until indexed."
+      echo "GBrain configured but this worktree isn't pinned yet. Run \`/sync-gbrain --full\`"
+      echo "before relying on \`gbrain search\` for code questions in this worktree."
+      echo "Falls back to Grep until pinned."
     fi
   fi
 fi
@@ -1605,19 +1604,43 @@ For each item, note:
 - The item text (verbatim or concise summary)
 - Its category: CODE | TEST | MIGRATION | CONFIG | DOCS
 
+### Verification Mode
+
+Before judging completion, classify HOW each item can be verified. The diff alone cannot prove every kind of work. Items outside the current repo or system are structurally invisible to `git diff`.
+
+- **DIFF-VERIFIABLE** — A code change in this repo would manifest in `git diff ...HEAD`. Examples: "add UserService" (file appears), "validate input X" (validation logic appears), "create users table" (migration file appears).
+- **CROSS-REPO** — Item names a file or change in a sibling repo (e.g., `domain-hq/docs/dashboard.md`, `~/Development//...`). The current diff CANNOT prove this.
+- **EXTERNAL-STATE** — Item names state in an external system: Supabase config/RLS, Cloudflare DNS, Vercel env vars, OAuth provider allowlists, third-party SaaS, DNS records. The current diff CANNOT prove this.
+- **CONTENT-SHAPE** — Item requires a file to follow a specific convention. If the file is in this repo: diff-verifiable. If in another repo or system: see CROSS-REPO / EXTERNAL-STATE.
+
+**Verification dispatch:**
+
+- **DIFF-VERIFIABLE** → cross-reference against diff (next section).
+- **CROSS-REPO** → if the sibling repo is reachable on disk (try `~/Development//`, `~/code//`, the parent of the current repo), run `[ -f ]` to check file existence. File exists → DONE (cite path). File missing → NOT DONE (cite path). Path unreachable → UNVERIFIABLE (cite what needs manual check).
+- **EXTERNAL-STATE** → UNVERIFIABLE. Cite the system and the specific check the user must perform.
+- **CONTENT-SHAPE in another repo** → if the file exists, run any project-detected validator (see "Validator detection" below) before falling back to UNVERIFIABLE. With a validator: pass → DONE; fail → NOT DONE (cite validator output). No validator available: classify UNVERIFIABLE and cite both the file path and the convention to confirm.
+
+**Path concreteness rule.** If a plan item names a *concrete filesystem path* (absolute, `~/...`, or `/`), it MUST be classified DONE or NOT DONE based on `[ -f ]`. UNVERIFIABLE is only valid when the path is genuinely abstract ("Cloudflare DNS", "Supabase allowlist") or the sibling root is unreachable on this machine. "I don't want to check" is not unreachable.
+
+**Validator detection.** Before falling back to UNVERIFIABLE on a CONTENT-SHAPE item, scan the target repo's `package.json` for any script matching `validate-*`, `lint-wiki`, `check-docs`, or similar. If found, invoke it with the relevant path argument (e.g., `npm run validate-wiki -- `). For multi-target validators (e.g., `validate-wiki --all`), run once and reconcile per-item from the output. A passing validator promotes the item from UNVERIFIABLE to DONE; a failing one demotes to NOT DONE.
+
+**Honesty rule.** Do NOT classify an item as DONE just because related code shipped. Code that *handles* a deliverable is not the deliverable. Shipping a markdown-extraction library is not the same as shipping the markdown file. When in doubt between DONE and UNVERIFIABLE, prefer UNVERIFIABLE — better to surface a confirmation prompt than silently miss a deliverable.
+
 ### Cross-Reference Against Diff
 
 Run `git diff origin/...HEAD` and `git log origin/..HEAD --oneline` to understand what was implemented.
 
-For each extracted plan item, check the diff and classify:
+For each extracted plan item, run the verification dispatch from the previous section, then classify:
 
-- **DONE** — Clear evidence in the diff that this item was implemented. Cite the specific file(s) changed.
-- **PARTIAL** — Some work toward this item exists in the diff but it's incomplete (e.g., model created but controller missing, function exists but edge cases not handled).
-- **NOT DONE** — No evidence in the diff that this item was addressed.
+- **DONE** — Clear evidence the item shipped. Cite the specific file(s) changed in the diff for DIFF-VERIFIABLE items, or the verified path that exists for CROSS-REPO items with a reachable sibling repo.
+- **PARTIAL** — Some work toward this item exists but is incomplete (e.g., model created but controller missing, function exists but edge cases not handled). +- **NOT DONE** — Verification ran and produced negative evidence (file missing, code absent in diff, sibling-repo file confirmed absent). - **CHANGED** — The item was implemented using a different approach than the plan described, but the same goal is achieved. Note the difference. +- **UNVERIFIABLE** — The diff and any reachable sibling-repo checks cannot prove or disprove this. Always applies to EXTERNAL-STATE items and to CROSS-REPO items where the sibling repo isn't reachable. Cite the specific manual verification the user must perform (e.g., "check Cloudflare DNS shows DNS-only mode for dashboard.example.com", "confirm /docs/dashboard.md exists in domain-hq repo"). -**Be conservative with DONE** — require clear evidence in the diff. A file being touched is not enough; the specific functionality described must be present. +**Be conservative with DONE** — require clear evidence. A file being touched is not enough; the specific functionality described must be present. **Be generous with CHANGED** — if the goal is met by different means, that counts as addressed. +**Be honest with UNVERIFIABLE** — better to surface 5 items the user must manually confirm than silently classify them DONE. 
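The dispatch and honesty rules above can be sketched as a small classifier. All names here (`classifyItem`, `resolveCrossRepo`) are hypothetical, and the external-system keyword test is an assumption — the skill enumerates those systems in prose, not code:

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

type Mode = 'DIFF-VERIFIABLE' | 'CROSS-REPO' | 'EXTERNAL-STATE';
type Verdict = 'DONE' | 'NOT DONE' | 'UNVERIFIABLE';

// Assumed heuristic: named external systems (Supabase, Cloudflare, Vercel,
// OAuth providers, DNS) make an item EXTERNAL-STATE.
const EXTERNAL_HINTS = /\b(supabase|cloudflare|vercel|oauth|dns)\b/i;

function classifyItem(item: string): Mode {
  if (EXTERNAL_HINTS.test(item)) return 'EXTERNAL-STATE';
  // A concrete absolute or ~-prefixed path points outside this repo.
  if (/(^|\s)(~\/|\/)\S+/.test(item)) return 'CROSS-REPO';
  return 'DIFF-VERIFIABLE';
}

// Path concreteness rule: a named file MUST resolve to DONE or NOT DONE;
// UNVERIFIABLE is reserved for an unreachable sibling root.
function resolveCrossRepo(itemPath: string): Verdict {
  const p = itemPath.replace(/^~(?=\/)/, os.homedir());
  if (!fs.existsSync(path.dirname(p))) return 'UNVERIFIABLE'; // root unreachable
  return fs.existsSync(p) ? 'DONE' : 'NOT DONE';
}
```

Note the asymmetry the honesty rule demands: a missing file in a reachable sibling repo is negative evidence (NOT DONE), while an unreachable root yields UNVERIFIABLE and a confirmation prompt, never a silent DONE.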
### Output Format ``` PLAN COMPLETION AUDIT Plan: {plan file path} ## Implementation Items - [DONE] Create UserService — src/services/user_service.rb (+142 lines) - [PARTIAL] Add validation — model validates but missing controller checks - [NOT DONE] Add caching layer — no cache-related changes in diff - [CHANGED] "Redis queue" → implemented with Sidekiq instead + [DONE] Create UserService — src/services/user_service.rb (+142 lines) + [PARTIAL] Add validation — model validates but missing controller checks + [NOT DONE] Add caching layer — no cache-related changes in diff + [CHANGED] "Redis queue" → implemented with Sidekiq instead ## Test Items - [DONE] Unit tests for UserService — test/services/user_service_test.rb - [NOT DONE] E2E test for signup flow + [DONE] Unit tests for UserService — test/services/user_service_test.rb + [NOT DONE] E2E test for signup flow ## Migration Items - [DONE] Create users table — db/migrate/20240315_create_users.rb + [DONE] Create users table — db/migrate/20240315_create_users.rb + +## Cross-Repo / External Items + [DONE] sibling-repo has /docs/dashboard.md — verified at ~/Development/sibling-repo/docs/dashboard.md + [UNVERIFIABLE] Cloudflare DNS-only on api.example.com — external system, manual check required + [UNVERIFIABLE] Supabase auth allowlist contains user email — external system, confirm in Supabase dashboard ───────────────────────────────── -COMPLETION: 4/7 DONE, 1 PARTIAL, 1 NOT DONE, 1 CHANGED +COMPLETION: 4/10 DONE, 1 PARTIAL, 2 NOT DONE, 1 CHANGED, 2 UNVERIFIABLE ───────────────────────────────── ``` ### Gate Logic -After producing the completion checklist: +After producing the completion checklist, evaluate in priority order: -- **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue. -- **Only PARTIAL items (no NOT DONE):** Continue with a note in the PR body. Not blocking.
-- **Any NOT DONE items:** Use AskUserQuestion: - - Show the completion checklist above - - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation." - - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A. - - Options: - A) Stop — implement the missing items before shipping - B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5) - C) These items were intentionally dropped — remove from scope - - If A: STOP. List the missing items for the user to implement. - - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}". - - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}." +1. **Any NOT DONE items** (highest priority — known missing work). Use AskUserQuestion: + - Show the completion checklist above + - "{N} items from the plan are NOT DONE. These were part of the original plan but are missing from the implementation." + - RECOMMENDATION: depends on item count and severity. If 1-2 minor items (docs, config), recommend B. If core functionality is missing, recommend A. + - Options: + A) Stop — implement the missing items before shipping + B) Ship anyway — defer these to a follow-up (will create P1 TODOs in Step 5.5) + C) These items were intentionally dropped — remove from scope + - If A: STOP. List the missing items for the user to implement. + - If B: Continue. For each NOT DONE item, create a P1 TODO in Step 5.5 with "Deferred from plan: {plan file path}". + - If C: Continue. Note in PR body: "Plan items intentionally dropped: {list}." + +2. **Any UNVERIFIABLE items** (silent gaps — the diff cannot prove them either way). Only fires after NOT DONE is resolved or absent. + + **Per-item confirmation is mandatory.** Do NOT use a single AskUserQuestion to blanket-confirm all UNVERIFIABLE items. 
Blanket confirmation is the failure mode that surfaced in VAS-449 (user clicks A without opening any file). Instead: + + - Loop through UNVERIFIABLE items one at a time. + - For each item, use AskUserQuestion with the item's *specific* manual check (e.g., "Confirm: does `~/Development/domain-hq/docs/dashboard.md` exist?", not "Have you checked all items?"). + - Options per item: + Y) Confirmed done — cite what you verified (free-text, embedded in PR body) + N) Not done — block ship; treat as NOT DONE and re-enter the priority-1 gate + D) Intentionally dropped — note in PR body: "Plan item intentionally dropped: {item}" + - RECOMMENDATION per item: Y if the item is concrete and easily verified; N if it's critical-path (auth, DNS, deliverables to other repos) and the user shows hesitation. + + **Exit conditions:** + - Any N: STOP. Surface the missing items, suggest re-running /ship after they're addressed. + - All Y or D: Continue. Embed `## Plan Completion — Manual Verifications` section in PR body listing each Y'd item with the user's free-text evidence and each D'd item with "intentionally dropped". + + **Cap.** If there are more than 5 UNVERIFIABLE items, present them as a numbered list first and ask whether the user wants to (1) confirm each individually, (2) stop and reduce scope, or (3) explicitly accept blanket-confirmation with the warning that this is the VAS-449 failure shape. Default and recommended option is (1). + +3. **Only PARTIAL items (no NOT DONE, no UNVERIFIABLE):** Continue with a note in the PR body. Not blocking. + +4. **All DONE or CHANGED:** Pass. "Plan completion: PASS — all items addressed." Continue. **No plan file found:** Skip entirely. "No plan file detected — skipping plan completion audit." **Include in PR body (Step 8):** Add a `## Plan Completion` section with the checklist summary. 
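The four-way priority order above reduces to a first-match dispatch. A minimal sketch — names are hypothetical, since the skill drives this through AskUserQuestion prompts rather than code:

```typescript
type GateCounts = { notDone: number; unverifiable: number; partial: number };

type Gate =
  | 'ASK_NOT_DONE'              // 1. known missing work — highest priority
  | 'ASK_UNVERIFIABLE_PER_ITEM' // 2. silent gaps, confirmed one item at a time
  | 'NOTE_PARTIAL_IN_PR'        // 3. non-blocking note in the PR body
  | 'PASS';                     // 4. all DONE or CHANGED

// First matching gate fires; gate 2 only runs once gate 1 is resolved or
// absent, mirroring "Only fires after NOT DONE is resolved or absent".
function nextGate(c: GateCounts): Gate {
  if (c.notDone > 0) return 'ASK_NOT_DONE';
  if (c.unverifiable > 0) return 'ASK_UNVERIFIABLE_PER_ITEM';
  if (c.partial > 0) return 'NOTE_PARTIAL_IN_PR';
  return 'PASS';
}
```

An N answer in the per-item gate bumps the item into `notDone` and re-enters the dispatch, which is why gate 1 is re-evaluated first on every pass.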
> > After your analysis, output a single JSON object on the LAST LINE of your response (no other text after it): -> `{"total_items":N,"done":N,"changed":N,"deferred":N,"summary":""}` +> `{"total_items":N,"done":N,"changed":N,"deferred":N,"unverifiable":N,"summary":""}` **Parent processing:** 1. Parse the LAST line of the subagent's output as JSON. -2. Store `done`, `deferred` for Step 20 metrics; use `summary` in PR body. -3. If `deferred > 0` and no user override, present the deferred items via AskUserQuestion before continuing. -4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). +2. Store `done`, `deferred`, `unverifiable` for Step 20 metrics; use `summary` in PR body. +3. If `deferred > 0` or `unverifiable > 0` and no user override, present the items via the appropriate AskUserQuestion (see Gate Logic priority order above) before continuing. +4. Embed `summary` in PR body's `## Plan Completion` section (Step 19). If `unverifiable > 0` and the user confirmed items (answered Y) in the per-item UNVERIFIABLE gate, also embed `## Plan Completion — Manual Verifications` listing each user-confirmed item. -**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline. Never block /ship on subagent failure. +**If the subagent fails or returns invalid JSON:** Fall back to running the audit inline (parent processes the same plan-extraction + classification logic). If the inline fallback also fails (e.g., plan file unreadable, parser error), do NOT silently pass — surface the failure as an explicit AskUserQuestion: "Plan Completion audit could not run ({reason}). Options: (A) Skip audit and ship anyway — record that the audit was skipped in PR body and Step 20 metrics; (B) Stop and fix the audit." Default and recommended option is (B). Silent fail-open is the failure shape that VAS-449 surfaced. --- @@ -2386,6 +2434,11 @@ already knows. A good test: would this insight save time in a future session?
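The last-line JSON contract and its fail-closed handling can be sketched as follows; `parseAuditSummary` is a hypothetical name, since the spec fixes only the JSON shape and the rule that invalid output must surface rather than pass silently:

```typescript
type AuditSummary = {
  total_items: number; done: number; changed: number;
  deferred: number; unverifiable: number; summary: string;
};

// Parse ONLY the last line. A subagent that emits text after the JSON
// violates the contract, so the parse fails and the caller falls back
// to the inline audit instead of silently passing.
function parseAuditSummary(raw: string): AuditSummary | null {
  const last = raw.trimEnd().split('\n').pop() ?? '';
  try {
    const obj = JSON.parse(last);
    return typeof obj === 'object' && obj !== null &&
      typeof obj.total_items === 'number' ? (obj as AuditSummary) : null;
  } catch {
    return null;
  }
}
```

A `null` here is what triggers the inline fallback, and a second failure there is what escalates to the explicit skip-or-stop AskUserQuestion.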
**Idempotency check:** Before bumping, classify the state by comparing `VERSION` against the base branch AND against `package.json`'s `version` field. Four states: FRESH (do bump), ALREADY_BUMPED (skip bump), DRIFT_STALE_PKG (sync pkg only, no re-bump), DRIFT_UNEXPECTED (stop and ask). ```bash +if ! git rev-parse --verify origin/ >/dev/null 2>&1; then + echo "ERROR: Unable to resolve origin/. Run 'git fetch origin' or verify the base branch exists." + exit 1 +fi + BASE_VERSION=$(git show origin/:VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0") CURRENT_VERSION=$(cat VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "0.0.0.0") [ -z "$BASE_VERSION" ] && BASE_VERSION="0.0.0.0" diff --git a/test/gstack-memory-ingest.test.ts b/test/gstack-memory-ingest.test.ts index 5fb6ebbfa5..fec698271d 100644 --- a/test/gstack-memory-ingest.test.ts +++ b/test/gstack-memory-ingest.test.ts @@ -205,6 +205,52 @@ describe("gstack-memory-ingest state file", () => { }); }); +// ── Security: cwd in transcript JSONL must not reach a shell ───────────── + +describe("gstack-memory-ingest security: untrusted cwd cannot trigger shell substitution", () => { + it("does not invoke /bin/sh when a transcript record contains $() in cwd", () => { + // Transcript JSONL is an untrusted surface — a record's `.cwd` value + // can be set by anyone who can write to ~/.claude/projects (cross-machine + // share, prompt-injection appending to the active session log, etc.). + // resolveGitRemote() must use execFileSync, not execSync with template + // interpolation, or `cwd="$(...)"` triggers command substitution under + // /bin/sh -c on the next ingest run. + const home = makeTestHome(); + const gstackHome = join(home, ".gstack"); + mkdirSync(gstackHome, { recursive: true }); + + const markerDir = mkdtempSync(join(tmpdir(), "gstack-mi-cwd-marker-")); + const marker = join(markerDir, "PWNED"); + // Plain $(...) — what an attacker would write into a transcript record.
+ // execFileSync passes this verbatim to git as a -C argument; execSync + // (the prior code path) wrapped it in a /bin/sh -c template that ran + // the substitution. + const malicious = "$(touch " + marker + ")"; + + const record = JSON.stringify({ + type: "user", + uuid: "11111111-1111-1111-1111-111111111111", + sessionId: "abc", + cwd: malicious, + timestamp: new Date().toISOString(), + message: { role: "user", content: "hi" }, + }); + writeClaudeCodeSession(home, "-tmp-target", "abc", record + "\n"); + + const r = runScript(["--incremental", "--quiet"], { + HOME: home, + GSTACK_HOME: gstackHome, + GSTACK_MEMORY_INGEST_NO_WRITE: "1", + }); + + expect(r.exitCode).toBe(0); + expect(existsSync(marker)).toBe(false); + + rmSync(home, { recursive: true, force: true }); + rmSync(markerDir, { recursive: true, force: true }); + }); +}); + // ── Transcript parser via re-import of the source module ─────────────────── describe("internal: parseTranscriptJsonl + buildTranscriptPage shape", () => { diff --git a/test/helpers/providers/gemini.ts b/test/helpers/providers/gemini.ts index 4395470397..5e7abba13a 100644 --- a/test/helpers/providers/gemini.ts +++ b/test/helpers/providers/gemini.ts @@ -22,8 +22,10 @@ export class GeminiAdapter implements ProviderAdapter { if (res.status !== 0) { return { ok: false, reason: 'gemini CLI not found on PATH. Install per https://github.com/google-gemini/gemini-cli' }; } - const cfgDir = path.join(os.homedir(), '.config', 'gemini'); - const hasCfg = fs.existsSync(cfgDir); + const legacyCfgDir = path.join(os.homedir(), '.config', 'gemini'); + const newCfgDir = path.join(os.homedir(), '.gemini'); + const newOauth = path.join(newCfgDir, 'oauth_creds.json'); + const hasCfg = fs.existsSync(legacyCfgDir) || fs.existsSync(newOauth); const hasKey = !!process.env.GOOGLE_API_KEY; if (!hasCfg && !hasKey) { return { ok: false, reason: 'No Gemini auth found. Log in via `gemini login` or export GOOGLE_API_KEY.' 
}; diff --git a/test/ship-plan-completion-invariants.test.ts b/test/ship-plan-completion-invariants.test.ts new file mode 100644 index 0000000000..64f6b2481b --- /dev/null +++ b/test/ship-plan-completion-invariants.test.ts @@ -0,0 +1,38 @@ +import { describe, test, expect } from 'bun:test'; +import * as fs from 'fs'; +import * as path from 'path'; + +const SHIP_SKILL = path.join(__dirname, '..', 'ship', 'SKILL.md'); + +describe('ship/SKILL.md — Plan Completion gate invariants (VAS-449 remediation)', () => { + const skill = fs.readFileSync(SHIP_SKILL, 'utf8'); + + test('Path concreteness rule: filesystem-pathed items must be test -f checked', () => { + expect(skill).toContain('**Path concreteness rule.**'); + expect(skill).toMatch(/concrete filesystem path/); + expect(skill).toMatch(/MUST be classified DONE or NOT DONE based on `\[ -f/); + }); + + test('Validator detection: project package.json validate-* scripts are auto-run', () => { + expect(skill).toContain('**Validator detection.**'); + expect(skill).toMatch(/package\.json/); + expect(skill).toMatch(/validate-\*/); + }); + + test('Per-item UNVERIFIABLE confirmation: blanket-confirm is forbidden', () => { + expect(skill).toContain('**Per-item confirmation is mandatory.**'); + expect(skill).toMatch(/Do NOT use a single AskUserQuestion to blanket-confirm/); + expect(skill).toMatch(/VAS-449/); + }); + + test('Subagent failure: fail-closed, not silent fail-open', () => { + expect(skill).not.toMatch(/Never block \/ship on subagent failure\.\s*$/m); + expect(skill).toMatch(/Silent fail-open is the failure shape that VAS-449 surfaced/); + expect(skill).toMatch(/Stop and fix the audit/); + }); + + test('CONTENT-SHAPE dispatch invokes validator before falling back to UNVERIFIABLE', () => { + expect(skill).toMatch(/CONTENT-SHAPE in another repo.*validator/s); + expect(skill).toMatch(/passing validator promotes the item from UNVERIFIABLE to DONE/); + }); +}); diff --git a/test/skill-validation.test.ts 
b/test/skill-validation.test.ts index b46028b518..53c7c33aac 100644 --- a/test/skill-validation.test.ts +++ b/test/skill-validation.test.ts @@ -1297,6 +1297,16 @@ describe('Codex skill', () => { expect(content).toContain('codex exec resume'); }); + test('codex/SKILL.md resume command only uses resume-supported flags', () => { + const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8'); + const match = content.match(/codex exec resume[^\n]+/); + expect(match).not.toBeNull(); + const resumeCommand = match![0]; + expect(resumeCommand).not.toContain(' -C '); + expect(resumeCommand).not.toContain(' -s read-only'); + expect(resumeCommand).toContain("-c 'sandbox_mode=\"read-only\"'"); + }); + test('codex/SKILL.md contains cost tracking', () => { const content = fs.readFileSync(path.join(ROOT, 'codex', 'SKILL.md'), 'utf-8'); expect(content).toContain('tokens used'); @@ -1332,6 +1342,17 @@ describe('Codex skill', () => { expect(content).toContain('mktemp'); }); + test('codex JSON stream parser uses portable Python discovery', () => { + const files = ['codex/SKILL.md.tmpl', 'codex/SKILL.md']; + + for (const rel of files) { + const content = fs.readFileSync(path.join(ROOT, rel), 'utf-8'); + expect(content).toContain('PYTHON_CMD=$(command -v python3 2>/dev/null || command -v python 2>/dev/null || true)'); + expect(content).toContain('PYTHONUNBUFFERED=1 "$PYTHON_CMD" -u -c'); + expect(content).not.toContain('PYTHONUNBUFFERED=1 python3 -u -c'); + } + }); + test('adversarial review in /review always runs both passes', () => { const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8'); expect(content).toContain('Adversarial review (always-on)');
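The untrusted-`cwd` test earlier in this diff pins down an argv-versus-shell distinction worth spelling out. A minimal sketch of the safe shape, assuming `resolveGitRemote` takes the transcript's `cwd` and returns the origin URL (the repo's real signature may differ):

```typescript
import { execFileSync } from 'child_process';

// The untrusted value travels as a single argv element to git — never
// through a `/bin/sh -c` string — so `$(...)` inside it is inert data,
// not a command. A bogus or hostile path just makes git exit nonzero.
function resolveGitRemote(untrustedCwd: string): string | null {
  try {
    return execFileSync(
      'git',
      ['-C', untrustedCwd, 'remote', 'get-url', 'origin'],
      { encoding: 'utf8', stdio: ['ignore', 'pipe', 'ignore'] },
    ).trim();
  } catch {
    return null; // not a repo, git absent, or the path does not exist
  }
}
```

By contrast, an `execSync(`git -C ${untrustedCwd} remote get-url origin`)` template hands the whole string to `/bin/sh -c`, which is exactly the substitution path the test's `$(touch …)` marker is designed to catch.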