diff --git a/CHANGELOG.md b/CHANGELOG.md index 4593d3f1..be473577 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,28 @@ # Changelog +## 0.6.1 — 2026-03-17 — Boil the Lake + +Every gstack skill now follows the **Completeness Principle**: always recommend the +full implementation when AI makes the marginal cost near-zero. No more "Choose B +because it's 90% of the value" when option A is 70 lines more code. + +Read the philosophy: https://garryslist.org/posts/boil-the-ocean + +- **Completeness scoring**: every AskUserQuestion option now shows a completeness + score (1-10), biasing toward the complete solution +- **Dual time estimates**: effort estimates show both human-team and CC+gstack time + (e.g., "human: ~2 weeks / CC: ~1 hour") with a task-type compression reference table +- **Anti-pattern examples**: concrete "don't do this" gallery in the preamble so the + principle isn't abstract +- **First-time onboarding**: new users see a one-time introduction linking to the + essay, with option to open in browser +- **Review completeness gaps**: `/review` now flags shortcut implementations where the + complete version costs <30 min CC time +- **Lake Score**: CEO and Eng review completion summaries show how many recommendations + chose the complete option vs shortcuts +- **CEO + Eng review dual-time**: temporal interrogation, effort estimates, and delight + opportunities all show both human and CC time scales + ## 0.6.0.1 — 2026-03-17 - **`/gstack-upgrade` now catches stale vendored copies automatically.** If your global gstack is up to date but the vendored copy in your project is behind, `/gstack-upgrade` detects the mismatch and syncs it. No more manually asking "did we vendor it?" — it just tells you and offers to update. @@ -63,7 +86,7 @@ ## 0.5.2 — 2026-03-17 - **Your design consultant now takes creative risks.** `/design-consultation` doesn't just propose a safe, coherent system — it explicitly breaks down SAFE CHOICES (category baseline) vs. RISKS (where your product stands out). You pick which rules to break. Every risk comes with a rationale for why it works and what it costs. -- **See the competition before you choose.** When you opt into research, the agent browses competitor sites with screenshots and accessibility tree analysis — not just web search results. You see what the landscape looks like before making design decisions. +- **See the landscape before you choose.** When you opt into research, the agent browses real sites in your space with screenshots and accessibility tree analysis — not just web search results. You see what's out there before making design decisions. - **Preview pages that look like your product.** The preview page now renders realistic product mockups — dashboards with sidebar nav and data tables, marketing pages with hero sections, settings pages with forms — not just font swatches and color palettes. ## 0.5.1 — 2026-03-17 diff --git a/CLAUDE.md b/CLAUDE.md index 27523c7b..34868b0a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -121,6 +121,23 @@ CHANGELOG.md is **for users**, not contributors. Write it like product release n - No jargon: say "every question now tells you which project and branch you're in" not "AskUserQuestion format standardized across skill templates via preamble resolver." 
+## AI effort compression
+
+When estimating or discussing effort, always show both human-team and CC+gstack time:
+
+| Task type | Human team | CC+gstack | Compression |
+|-----------|-----------|-----------|-------------|
+| Boilerplate / scaffolding | 2 days | 15 min | ~100x |
+| Test writing | 1 day | 15 min | ~50x |
+| Feature implementation | 1 week | 30 min | ~30x |
+| Bug fix + regression test | 4 hours | 15 min | ~20x |
+| Architecture / design | 2 days | 4 hours | ~5x |
+| Research / exploration | 1 day | 3 hours | ~3x |
+
+Completeness is cheap. Don't recommend shortcuts when the complete implementation
+is a "lake" (achievable), not an "ocean" (multi-quarter migration). See the
+Completeness Principle in the skill preamble for the full philosophy.
+
 ## Local plans
 
 Contributors can store long-range vision docs and design documents in `~/.gstack-dev/plans/`.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 4af2e889..2fce6375 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -54,6 +54,10 @@ the issue, fix it, and open a PR.
 This is the best way to contribute: fix gstack while doing your real work, in the
 project where you actually felt the pain.
 
+### Session awareness
+
+When you have 3+ gstack sessions open simultaneously, every question tells you which project, which branch, and what's happening. No more staring at a question thinking "wait, which window is this?" The format is consistent across all 13 skills.
+
 ## Working on gstack inside the gstack repo
 
 When you're editing gstack skills and want to test them by actually using gstack
diff --git a/README.md b/README.md
index fd4b684f..c0a5f0b6 100644
--- a/README.md
+++ b/README.md
@@ -1,136 +1,136 @@
 # gstack
 
-**gstack turns Claude Code from one generic assistant into a team of specialists you can summon on demand.**
+Hi, I'm [Garry Tan](https://x.com/garrytan). I'm President & CEO of [Y Combinator](https://www.ycombinator.com/), where I've worked with thousands of startups including Coinbase, Instacart, and Rippling when the founders were just one or two people in a garage — companies now worth tens of billions of dollars. Before YC, I designed the Palantir logo and was one of the first eng manager/PM/designers there. I cofounded Posterous, a blog platform we sold to Twitter. I built Bookface, YC's internal social network, back in 2013. I've been building products as a designer, PM, and eng manager for a long time.
 
-Thirteen opinionated workflow skills for [Claude Code](https://docs.anthropic.com/en/docs/claude-code). Plan review, design review, code review, one-command shipping, browser automation, QA testing, engineering retrospectives, and post-ship documentation — all as slash commands.
+And right now I am in the middle of something that feels like a new era entirely.
 
-### Without gstack
+In the last 60 days I have written **over 600,000 lines of production code** — 35% tests — and I am doing **10,000 to 20,000 usable lines of code per day**, part-time, while doing all my duties as CEO of YC. That is not a typo. My last `/retro` (developer stats from the last 7 days) across 3 projects: **140,751 lines added, 362 commits, ~115k net LOC**. The models are getting dramatically better every week. We are at the dawn of something real — one person shipping at a scale that used to require a team of twenty.
-- The agent takes your request literally — it never asks if you're building the right thing -- It will implement exactly what you said, even when the real product is something bigger -- "Review my PR" gives inconsistent depth every time -- "Ship this" turns into a long back-and-forth about what to do -- The agent can write code but can't see your app — it's half blind -- You still do QA by hand: open browser, click around, check pages, squint at layouts +**2026 — 1,237 contributions and counting:** -### With gstack +![GitHub contributions 2026 — 1,237 contributions, massive acceleration in Jan-Mar](docs/images/github-2026.png) -| Skill | Mode | What it does | -|-------|------|--------------| -| `/plan-ceo-review` | Founder / CEO | Rethink the problem. Find the 10-star product hiding inside the request. | -| `/plan-eng-review` | Eng manager / tech lead | Lock in architecture, data flow, diagrams, edge cases, and tests. | -| `/plan-design-review` | Senior product designer | Designer's eye audit. 80-item checklist, letter grades, AI Slop detection, DESIGN.md inference. Report only — never touches code. | -| `/design-consultation` | Design consultant | Build a complete design system from scratch. Browses competitors to get in the ballpark, proposes safe choices AND creative risks, generates realistic product mockups, and writes DESIGN.md. | -| `/review` | Paranoid staff engineer | Find the bugs that pass CI but blow up in production. Triages Greptile review comments. | -| `/ship` | Release engineer | Sync main, run tests, resolve Greptile reviews, push, open PR. For a ready branch, not for deciding what to build. | -| `/browse` | QA engineer | Give the agent eyes. It logs in, clicks through your app, takes screenshots, catches breakage. Full QA pass in 60 seconds. | -| `/qa` | QA + fix engineer | Test app, find bugs, fix them with atomic commits, re-verify. Before/after health scores and ship-readiness summary. Three tiers: Quick, Standard, Exhaustive. | -| `/qa-only` | QA reporter | Report-only QA testing. Same methodology as /qa but never fixes anything. Use when you want a pure bug report without code changes. | -| `/qa-design-review` | Designer + frontend engineer | Same design audit as /plan-design-review, then fixes what it finds. Atomic `style(design):` commits, before/after screenshots, CSS-safe self-regulation. | -| `/setup-browser-cookies` | Session manager | Import cookies from your real browser (Comet, Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages without logging in manually. | -| `/retro` | Engineering manager | Team-aware retro: your deep-dive + per-person praise and growth opportunities for every contributor. | -| `/document-release` | Technical writer | Update README, ARCHITECTURE, CONTRIBUTING, and project docs to match what you just shipped. | +**2013 — when I built Bookface at YC (772 contributions):** -## Demo: one feature, five modes +![GitHub contributions 2013 — 772 contributions building Bookface at YC](docs/images/github-2013.png) -I start almost every feature in plan mode. I describe what I want to build, then use `/plan-ceo-review` to pressure-test whether I am even building the right thing. Only after the product direction is locked do I shift into engineering, review, ship, and test. +Same person. Different era. The difference is the tooling. -``` -You: [enter plan mode, describe the feature] - I want to add seller photo upload to the listing app. 
Sellers - should be able to upload a photo of their item and we create - the listing from it. - -You: /plan-ceo-review - -Claude: "Photo upload" is not the feature. The real job is helping sellers - create listings that actually sell. Here's what the 10-star version - looks like: auto-identify the product from the photo, pull specs and - pricing comps from the web, draft the title and description, suggest - the best hero image... +**gstack is how I do it.** It is my open source software factory. It turns Claude Code into a virtual engineering team you actually manage — a CEO who rethinks the product, an eng manager who locks the architecture, a designer who catches AI slop, a paranoid reviewer who finds production bugs, a QA lead who opens a real browser and clicks through your app, and a release engineer who ships the PR. Thirteen specialists, all as slash commands, all Markdown, **all free, MIT license, available right now.** -You: /plan-eng-review +I am learning how to get to the edge of what agentic systems can do as of March 2026, and this is my live experiment. I am sharing it because I want the whole world on this journey with me. -Claude: [Architecture diagram, state machine for upload → classify → enrich - → draft pipeline, async job boundaries, failure modes, test matrix] +Fork it. Improve it. Make it yours. Don't player hate, appreciate. -You: [exit plan mode, implement the plan] +--- -You: /review +## See it work -Claude: Race condition: two tabs can overwrite cover-photo selection. - Trust boundary: pulling web data into draft generation creates - a prompt injection vector. Missing: orphan cleanup for failed uploads. +``` +You: I want to add photo upload for sellers. +You: /plan-ceo-review +Claude: "Photo upload" is not the feature. The real job is helping + sellers create listings that actually sell. What if we + auto-identify the product, pull specs and comps from the + web, and draft the listing automatically? That's 10 stars. + "Upload a photo" is 3 stars. Which are we building? + [8 expansion proposals, you cherry-pick 5, defer 3 to backlog] -You: (fix the issues) +You: /plan-design-review +Claude: Design Score: B | AI Slop Score: C + "Upload flow looks like a default Bootstrap form." + [80-item audit, infers your design system, exports DESIGN.md] + [flags 3 AI slop patterns: gradient hero, icon grid, uniform radius] -You: /ship +You: /plan-eng-review +Claude: ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐ + │ Upload │───▶│ Classify │───▶│ Enrich │───▶│ Draft │ + │ (sync) │ │ (async) │ │ (async) │ │ (async) │ + └─────────┘ └──────────┘ └──────────┘ └─────────┘ + [ASCII diagrams for every data flow, state machine, error path] + [14-case test matrix, 6 failure modes mapped, 3 security concerns] -Claude: [Syncs main, runs tests, pushes branch, opens PR — 6 tool calls, done] +You: Approve plan. Exit plan mode. + [Claude writes 2,400 lines across 11 files — models, services, + controllers, views, migrations, and tests. ~8 minutes.] -You: /qa +You: /review +Claude: [AUTO-FIXED] Orphan S3 cleanup on failed upload + [AUTO-FIXED] Missing index on listings.status + [ASK] Race condition on hero image selection → You: yes + [traces every new enum value through all switch statements] + 3 issues — 2 auto-fixed, 1 fixed. -Claude: Analyzing branch diff... 8 files changed, 3 routes affected. - [Tests /listings/new, /listings/:id, /api/listings against localhost:3000] - All 3 routes working. Upload + enrichment flow passes end to end. 
+You: /qa https://staging.myapp.com +Claude: [opens real browser, logs in, uploads photos, clicks through flows] + Upload → classify → enrich → draft: end to end ✓ + Mobile: ✓ | Slow connection: ✓ | Bad image: ✓ + [finds bug: preview doesn't clear on second upload — fixes it] + Regression test generated. -You: /setup-browser-cookies staging.myapp.com +You: /ship +Claude: Tests: 42 → 51 (+9 new) + Coverage: 14/14 code paths (100%) + PR: github.com/you/app/pull/42 +``` -Claude: Imported 8 cookies for staging.myapp.com from Chrome. +One feature. Seven commands. The agent reframed the product, ran an 80-item design audit, drew the architecture, wrote 2,400 lines of code, found a race condition I would have missed, auto-fixed two issues, opened a real browser to QA test, found and fixed a bug I didn't know about, wrote 9 tests, and generated a regression test. That is not a copilot. That is a team. -You: /qa https://staging.myapp.com --quick +--- -Claude: [Smoke test: homepage + 5 pages, 30 seconds] - Health Score: 91/100. No critical issues. 1 medium: mobile nav overlap. +## The team -You: /plan-design-review https://staging.myapp.com +| Skill | Your specialist | What they do | +|-------|----------------|--------------| +| `/plan-ceo-review` | **CEO / Founder** | Rethink the problem. Find the 10-star product hiding inside the request. Four modes: Expansion, Selective Expansion, Hold Scope, Reduction. | +| `/plan-eng-review` | **Eng Manager** | Lock in architecture, data flow, diagrams, edge cases, and tests. Forces hidden assumptions into the open. | +| `/plan-design-review` | **Senior Designer** | 80-item design audit with letter grades. AI Slop detection. Infers your design system. Report only — never touches code. | +| `/design-consultation` | **Design Partner** | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. | +| `/review` | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. | +| `/ship` | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. | +| `/browse` | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. | +| `/qa` | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. | +| `/qa-only` | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. | +| `/qa-design-review` | **Designer Who Codes** | Same audit as /plan-design-review, then fixes what it finds. Atomic commits, before/after screenshots. | +| `/setup-browser-cookies` | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. | +| `/retro` | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. | +| `/document-release` | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. | -Claude: Design Score: B | AI Slop Score: C +**[Deep dives with examples and philosophy for every skill →](docs/skills.md)** - "The site communicates competence but not confidence." - "I notice the spacing between sections is uniform — nothing feels important." 
- "The first 3 things my eye goes to are: the gradient hero, a 3-column - feature grid with icons in circles, and a cookie-cutter CTA." - "If I had to describe this in one word: template." +--- - Top issues: generic typography (Inter everywhere), AI slop patterns - (3-column icon grid, purple gradient), flat heading scale. +## What's new and why it matters - [Full report with letter grades, 12 findings, inferred design system] - Want me to save this as your DESIGN.md? +**Design is at the heart.** `/design-consultation` doesn't just pick fonts. It researches what's out there in your space, proposes safe choices AND creative risks, generates realistic mockups of your actual product, and writes `DESIGN.md` — and then `/qa-design-review` and `/plan-eng-review` read what you chose. Design decisions flow through the whole system. -You: /qa-design-review +**`/qa` was a massive unlock.** It let me go from 6 to 12 parallel workers. Claude Code saying *"I SEE THE ISSUE"* and then actually fixing it, generating a regression test, and verifying the fix — that changed how I work. The agent has eyes now. -Claude: [Runs the same audit, then fixes 8 design issues] - style(design): FINDING-001 — replace Inter with system font stack - style(design): FINDING-003 — add heading scale contrast - style(design): FINDING-007 — break 3-column grid into asymmetric layout - Design Score: B → A- | AI Slop Score: C → A -``` +**Smart review routing.** Just like at a well-run startup: CEO doesn't have to look at infra bug fixes, design review isn't needed for backend changes. gstack tracks what reviews are run, figures out what's appropriate, and just does the smart thing. The Review Readiness Dashboard tells you where you stand before you ship. -## Who this is for +**Test everything.** `/ship` bootstraps test frameworks from scratch if your project doesn't have one. Every `/ship` run produces a coverage audit. Every `/qa` bug fix generates a regression test. 100% test coverage is the goal — tests make vibe coding safe instead of yolo coding. -You already use Claude Code heavily and want consistent, high-rigor workflows instead of one mushy generic mode. You want to tell the model what kind of brain to use right now — founder taste, engineering rigor, paranoid review, or fast execution. +**`/document-release` is the engineer you never had.** It reads every doc file in your project, cross-references the diff, and updates everything that drifted. README, ARCHITECTURE, CONTRIBUTING, CLAUDE.md, TODOS — all kept current automatically. -This is not a prompt pack for beginners. It is an operating system for people who ship. +--- -## How to fly: 10 sessions at once +## 10 sessions at once -gstack is powerful with one Claude Code session. It is transformative with ten. +gstack is powerful with one session. It is transformative with ten. -[Conductor](https://conductor.build) runs multiple Claude Code sessions in parallel — each in its own isolated workspace. That means you can have one session running `/qa` on staging, another doing `/review` on a PR, a third implementing a feature, and seven more working on other branches. All at the same time. +[Conductor](https://conductor.build) runs multiple Claude Code sessions in parallel — each in its own isolated workspace. One session running `/qa` on staging, another doing `/review` on a PR, a third implementing a feature, and seven more on other branches. All at the same time. 
-Each workspace gets its own isolated browser instance automatically — separate Chromium process, cookies, tabs, and logs stored in `.gstack/` inside each project root. No port collisions, no shared state, no configuration needed. `/browse` and `/qa` sessions never interfere with each other, even across ten parallel workspaces. +One person, ten parallel agents, each with the right cognitive mode. That is a different way of building software. -This is the setup I use. One person, ten parallel agents, each with the right cognitive mode for its task. That is not incremental improvement. That is a different way of building software. +--- -## Install +## Install — takes 30 seconds -**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+. `/browse` compiles a native binary — works on macOS and Linux (x64 and arm64). +**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+ ### Step 1: Install on your machine -Open Claude Code and paste this. Claude will do the rest. +Open Claude Code and paste this. Claude does the rest. > Install gstack: run `git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup` then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /qa-design-review, /setup-browser-cookies, /retro, /document-release. Then ask the user if they also want to add gstack to the current project so teammates get it. @@ -138,25 +138,14 @@ Open Claude Code and paste this. Claude will do the rest. > Add gstack to this project: run `cp -Rf ~/.claude/skills/gstack .claude/skills/gstack && rm -rf .claude/skills/gstack/.git && cd .claude/skills/gstack && ./setup` then add a "gstack" section to this project's CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, lists the available skills: /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /browse, /qa, /qa-only, /qa-design-review, /setup-browser-cookies, /retro, /document-release, and tells Claude that if gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills. -Real files get committed to your repo (not a submodule), so `git clone` just works. The binary and node\_modules are gitignored — teammates just need to run `cd .claude/skills/gstack && ./setup` once to build (or `/browse` handles it automatically on first use). - -### What gets installed - -- Skill files (Markdown prompts) in `~/.claude/skills/gstack/` (or `.claude/skills/gstack/` for project installs) -- Symlinks at `~/.claude/skills/browse`, `~/.claude/skills/qa`, `~/.claude/skills/review`, etc. pointing into the gstack directory -- Browser binary at `browse/dist/browse` (~58MB, gitignored) -- `node_modules/` (gitignored) -- `/retro` saves JSON snapshots to `.context/retros/` in your project for trend tracking - -Everything lives inside `.claude/`. Nothing touches your PATH or runs in the background. +Real files get committed to your repo (not a submodule), so `git clone` just works. Everything lives inside `.claude/`. 
Nothing touches your PATH or runs in the background. --- ``` +----------------------------------------------------------------------------+ | | -| Are you a great software engineer who loves to write 10K LOC/day | -| and land 10 PRs a day like Garry? | +| Are you a great software engineer who wants to ship 10K+ LOC/day? | | | | Come work at YC: ycombinator.com/software | | | @@ -169,694 +158,36 @@ Everything lives inside `.claude/`. Nothing touches your PATH or runs in the bac --- -## How I use these skills +## Come ride the wave -Created by [Garry Tan](https://x.com/garrytan), President & CEO of [Y Combinator](https://www.ycombinator.com/). +This is **free, MIT licensed, open source, available now.** No premium tier. No waitlist. No strings. -I built gstack because I do not want AI coding tools stuck in one mushy mode. +I open sourced how I do development and I am actively upgrading my own software factory here. You can fork it and make it your own. That's the whole point. I want everyone on this journey. -Planning is not review. Review is not shipping. Founder taste is not engineering rigor. If you blur all of that together, you usually get a mediocre blend of all four. +The models are getting better fast. The people who figure out how to work with them now — really work with them, not just dabble — are going to have a massive advantage. This is that window. Let's go. -I want explicit gears. - -These skills let me tell the model what kind of brain I want right now. I can switch cognitive modes on demand — founder, eng manager, paranoid reviewer, release machine. That is the unlock. +**[github.com/garrytan/gstack](https://github.com/garrytan/gstack)** — MIT License --- -## `/plan-ceo-review` - -This is my **founder mode**. - -This is where I want the model to think with taste, ambition, user empathy, and a long time horizon. I do not want it taking the request literally. I want it asking a more important question first: - -**What is this product actually for?** - -I think of this as **Brian Chesky mode**. - -The point is not to implement the obvious ticket. The point is to rethink the problem from the user's point of view and find the version that feels inevitable, delightful, and maybe even a little magical. - -### Example - -Say I am building a Craigslist-style listing app and I say: - -> "Let sellers upload a photo for their item." - -A weak assistant will add a file picker and save an image. - -That is not the real product. - -In `/plan-ceo-review`, I want the model to ask whether "photo upload" is even the feature. Maybe the real feature is helping someone create a listing that actually sells. - -If that is the real job, the whole plan changes. - -Now the model should ask: - -* Can we identify the product from the photo? -* Can we infer the SKU or model number? -* Can we search the web and draft the title and description automatically? -* Can we pull specs, category, and pricing comps? -* Can we suggest which photo will convert best as the hero image? -* Can we detect when the uploaded photo is ugly, dark, cluttered, or low-trust? -* Can we make the experience feel premium instead of like a dead form from 2007? - -That is what `/plan-ceo-review` does for me. - -It does not just ask, "how do I add this feature?" -It asks, **"what is the 10-star product hiding inside this request?"** - -That is a very different kind of power. - ---- - -## `/plan-eng-review` - -This is my **eng manager mode**. - -Once the product direction is right, I want a different kind of intelligence entirely. 
I do not want more sprawling ideation. I do not want more "wouldn't it be cool if." I want the model to become my best technical lead. - -This mode should nail: - -* architecture -* system boundaries -* data flow -* state transitions -* failure modes -* edge cases -* trust boundaries -* test coverage - -And one surprisingly big unlock for me: **diagrams**. - -LLMs get way more complete when you force them to draw the system. Sequence diagrams, state diagrams, component diagrams, data-flow diagrams, even test matrices. Diagrams force hidden assumptions into the open. They make hand-wavy planning much harder. - -So `/plan-eng-review` is where I want the model to build the technical spine that can carry the product vision. - -### Example - -Take the same listing app example. - -Let's say `/plan-ceo-review` already did its job. We decided the real feature is not just photo upload. It is a smart listing flow that: - -* uploads photos -* identifies the product -* enriches the listing from the web -* drafts a strong title and description -* suggests the best hero image - -Now `/plan-eng-review` takes over. - -Now I want the model to answer questions like: - -* What is the architecture for upload, classification, enrichment, and draft generation? -* Which steps happen synchronously, and which go to background jobs? -* Where are the boundaries between app server, object storage, vision model, search/enrichment APIs, and the listing database? -* What happens if upload succeeds but enrichment fails? -* What happens if product identification is low-confidence? -* How do retries work? -* How do we prevent duplicate jobs? -* What gets persisted when, and what can be safely recomputed? - -And this is where I want diagrams — architecture diagrams, state models, data-flow diagrams, test matrices. Diagrams force hidden assumptions into the open. They make hand-wavy planning much harder. - -That is `/plan-eng-review`. - -Not "make the idea smaller." -**Make the idea buildable.** - ---- - -## `/plan-design-review` - -This is my **senior designer mode**. - -Most developers cannot tell whether their site looks AI-generated. I could not, until I started paying attention. There is a growing class of sites that are functional but soulless — they work fine but scream "an AI built this and nobody with taste looked at it." Purple gradients, 3-column icon grids, uniform bubbly border-radius on everything, centered text on every section, decorative blobs floating in the background. The ChatGPT aesthetic. - -`/plan-design-review` gives the agent a designer's eye. - -It opens your site and reacts to it the way a Stripe or Linear designer would — immediately, viscerally, with opinions. The first output is a structured gut reaction: what the site communicates at a glance, what the eye is drawn to, and a one-word verdict. That is the most valuable part. Everything after is supporting evidence. - -Then it extracts your actual design system by reading the CSS: how many fonts are you really using? What is your heading scale? How many colors? Is your spacing on a grid or arbitrary? It structures this as an **Inferred Design System** and offers to save it as a `DESIGN.md` — so now you have a design system you never had before, derived from what you actually shipped. - -The audit covers 80 items across 10 categories. Typography, spacing, hierarchy, color, responsive design, interaction states, motion, content quality, performance feel, and — the headline feature — **AI Slop Detection**. 
Ten specific anti-patterns that mark a site as machine-generated. - -Every finding gets a letter grade (A through F) per category. The report opens with two headline scores: **Design Score** and **AI Slop Score**. "Your site got a B on design but a D on AI Slop" tells you exactly where to focus. - -### Example - -``` -You: /plan-design-review https://myapp.com - -Claude: Design Score: C | AI Slop Score: D - - "The site communicates generic SaaS energy." - "I notice the hero section uses a blue-to-purple gradient - with a 3-column feature grid below it — the single most - recognizable AI-generated layout pattern." - "The first 3 things my eye goes to are: the gradient, the - symmetrical icon grid, and a 'Get Started' button that - looks like every other 'Get Started' button." - "If I had to describe this in one word: template." - - Inferred Design System: - - Fonts: Inter (body), Inter (headings) — 1 font family - - Colors: 14 unique non-gray colors (flag: >12) - - Heading scale: 36/24/18/16 — flat, no drama between sizes - - Spacing: mostly 16px and 24px — no rhythm - - Top 5 Design Improvements: - 1. Break the 3-column icon grid (High, AI Slop) - 2. Add heading scale contrast — at least 2:1 ratio (High, Typography) - 3. Replace uniform border-radius with intentional hierarchy (Medium, Spacing) - 4. Kill the gradient hero — use a real image or bold typography (High, AI Slop) - 5. Add a second font for headings — Inter-only reads as generic (Medium, Typography) - - [Full report saved to .gstack/design-reports/] - Want me to save this inferred design system as your DESIGN.md? -``` - -This is report only — it never touches your code. Use `/qa-design-review` when you want it to fix what it finds. - ---- - -## `/design-consultation` - -This is my **design partner mode**. - -`/plan-design-review` audits a site that already exists. `/design-consultation` is for when you have nothing yet — no design system, no font choices, no color palette. You are starting from zero and you want a senior designer to sit down with you and build the whole visual identity together. - -It is a conversation, not a form. The agent asks about your product, your users, and your audience. It thinks about what your product needs to communicate — trust, speed, craft, warmth, whatever fits — and works backward from that to concrete choices. Then it proposes a complete, coherent design system: aesthetic direction, typography (3+ fonts with specific roles), color palette with hex values, spacing scale, layout approach, and motion strategy. Every recommendation comes with a rationale. Every choice reinforces every other choice. - -But coherence is table stakes. Every dev tool dashboard looks the same — clean sans-serif, muted grays, a blue accent. They are all coherent. They are all forgettable. The difference between a product that looks "nice" and one that people actually recognize is the **deliberate creative risks**: the unexpected serif for headings, the bold accent nobody else in your category uses, the tighter spacing that makes your data feel authoritative instead of airy. - -That is what `/design-consultation` is really about. It does not just propose a safe system. It proposes safe choices AND risks — and tells you which is which. "Here are the choices that keep you literate in your category. And here is where I think you should break from convention, and why." You pick which risks to take. The agent checks that the whole system still coheres either way. 
- -If you want, the agent will browse your competitors' actual sites — take screenshots, analyze their fonts and colors and spacing — so you can see the landscape before you make choices. This is not about copying. It is about getting in the ballpark so you know what the conventions are, and then deciding which ones are worth breaking. - -After you agree on the system, it generates an interactive HTML preview page — not just swatches and font samples, but realistic product pages. If you are building a dashboard, you see a dashboard with a sidebar, data tables, and stat cards. If you are building a marketing site, you see a hero section with real copy and a CTA. Everything rendered in your design system, with your product name, in light and dark mode. You see what your product could feel like before a single line of production code is written. - -Then it writes `DESIGN.md` to your repo root — your project's design source of truth — and updates `CLAUDE.md` so every future Claude Code session respects the system. From that point on, `/qa-design-review` can audit against it, and any agent working on your frontend knows the rules. - -``` -You: /design-consultation - -Claude: From the README, this looks like a developer analytics dashboard - for engineering teams. Sound right? Tell me about the feeling - you want — and would you like me to browse some competitors - first so we can see the landscape? - -You: Yes, it's for eng managers. Show me what's out there. - -Claude: [browses Linear, Vercel, Datadog, Posthog — takes screenshots] - - Here's the landscape. They all converge on clean sans-serifs - and muted blue-gray palettes. Linear stands out with dark-first - and tight spacing. The rest feel interchangeable. - - Here's my proposal — and where I'd play it safe vs. take risks: - - AESTHETIC: Industrial/Utilitarian — data-first, every pixel earns its place - TYPOGRAPHY: Geist (body + UI), Instrument Serif (display), JetBrains Mono (data) - COLOR: Restrained — slate neutrals, single teal accent (#0D9488) - SPACING: 4px base, compact density - LAYOUT: Grid-disciplined, 12-column - MOTION: Minimal-functional — only transitions that aid comprehension - - SAFE CHOICES (match category expectations): - Geist for body — your users already know this font from Vercel - Slate neutrals — warm grays feel right for data-heavy UIs - Grid-disciplined layout — dashboards need predictable alignment - - RISKS (where you'd stand out): - Instrument Serif for display headings — nobody in dev tools - uses a serif. It creates instant hierarchy contrast and - signals "we care about craft" in a space full of sans-serifs. - Single teal accent instead of blue — every competitor uses blue. - Teal is close enough to feel trustworthy, different enough - to be recognizable. - 4px compact density — tighter than Linear, says "we respect - your screen real estate." - - The safe choices keep you literate. The risks are where your - product gets its own face. Which risks appeal to you? - -You: Love the serif risk. Let's see it. - -Claude: Preview opened in your browser. - [/tmp/design-consultation-preview-1710000000.html] - - The page shows a realistic dashboard: sidebar nav, header with - avatar, stat cards, data table — all in your design system. - Instrument Serif at hero scale, Geist in body text, JetBrains - Mono in the data table. Teal accent on buttons and active states. - Light and dark mode toggle in the corner. - - How does it feel? Want to adjust anything before I write DESIGN.md? - -You: Ship it. 
- -Claude: Wrote DESIGN.md (typography, color, spacing, layout, motion). - Updated CLAUDE.md with design system guidelines. - All future sessions will respect these choices. -``` - ---- - -## `/qa-design-review` - -This is my **designer who codes mode**. - -`/plan-design-review` tells you what is wrong. `/qa-design-review` fixes it. - -It runs the same 80-item audit, then enters a fix loop: for each design finding, it locates the source file, makes the minimal CSS/styling change, commits with `style(design): FINDING-NNN`, re-navigates to verify, and takes before/after screenshots. One commit per fix, fully bisectable. - -The self-regulation heuristic is tuned for design work — CSS-only changes get a free pass (they are inherently safe and reversible), but changes to component JSX/TSX files count against the risk budget. Hard cap at 30 fixes. If the risk score exceeds 20%, it stops and asks. - -``` -You: /qa-design-review https://myapp.com - -Claude: [Runs full design audit — same output as /plan-design-review] - Design Score: C | AI Slop Score: D - 12 findings (4 high, 5 medium, 3 polish) - - Fixing 9 design issues... - - style(design): FINDING-001 — replace 3-column icon grid with asymmetric layout - style(design): FINDING-002 — add heading scale 48/32/24/18/16 - style(design): FINDING-003 — remove gradient hero, use bold typography - style(design): FINDING-004 — add second font for headings - style(design): FINDING-005 — vary border-radius by element role - style(design): FINDING-006 — left-align body text, reserve center for headings - style(design): FINDING-007 — add hover/focus states to all interactive elements - style(design): FINDING-008 — add prefers-reduced-motion media query - style(design): FINDING-009 — set max content width to 680px for body text - - Final audit: - Design Score: C → B+ | AI Slop Score: D → A - 9 fixes applied (8 verified, 1 best-effort). 3 deferred. - - [Report with before/after screenshots saved to .gstack/design-reports/] -``` - -Nine commits, each touching one concern. The AI Slop score went from D to A because the three most recognizable patterns (gradient hero, 3-column grid, uniform radius) are gone. The design score improved two grades because the typography now has a scale, the spacing has hierarchy, and interactive elements have proper states. - ---- - -## `/review` - -This is my **paranoid staff engineer mode**. - -Passing tests do not mean the branch is safe. - -`/review` exists because there is a whole class of bugs that can survive CI and still punch you in the face in production. This mode is not about dreaming bigger. It is not about making the plan prettier. It is about asking: - -**What can still break?** - -This is a structural audit, not a style nitpick pass. I want the model to look for things like: - -* N+1 queries -* stale reads -* race conditions -* bad trust boundaries -* missing indexes -* escaping bugs -* broken invariants -* bad retry logic -* tests that pass while missing the real failure mode - -### Example - -Suppose the smart listing flow is implemented and the tests are green. - -`/review` should still ask: - -* Did I introduce an N+1 query when rendering listing photos or draft suggestions? -* Am I trusting client-provided file metadata instead of validating the actual file? -* Can two tabs race and overwrite cover-photo selection or item details? -* Do failed uploads leave orphaned files in storage forever? -* Can the "exactly one hero image" rule break under concurrency? 
-* If enrichment APIs partially fail, do I degrade gracefully or save garbage? -* Did I accidentally create a prompt injection or trust-boundary problem by pulling web data into draft generation? - -That is the point of `/review`. - -I do not want flattery here. -I want the model imagining the production incident before it happens. - ---- - -## `/ship` - -This is my **release machine mode**. - -Once I have decided what to build, nailed the technical plan, and run a serious review, I do not want more talking. I want execution. - -`/ship` is for the final mile. It is for a ready branch, not for deciding what to build. - -This is where the model should stop behaving like a brainstorm partner and start behaving like a disciplined release engineer: sync with main, run the right tests, make sure the branch state is sane, update changelog or versioning if the repo expects it, push, and create or update the PR. - -Momentum matters here. - -A lot of branches die when the interesting work is done and only the boring release work is left. Humans procrastinate that part. AI should not. - -### Example - -Suppose the smart listing flow is finished. - -The product thinking is done. -The architecture is done. -The review pass is done. -Now the branch just needs to get landed. +## Docs -That is what `/ship` is for. - -It takes care of the repetitive release hygiene so I do not bleed energy on: - -* syncing with main -* rerunning tests -* checking for weird branch state -* updating changelog/version metadata -* pushing the branch -* opening or updating the PR - -At this point I do not want more ideation. -I want the plane landed. - ---- - -## Greptile integration - -[Greptile](https://greptile.com) is a YC company that reviews your PRs automatically. It catches real bugs — race conditions, security issues, things that pass CI and blow up in production. It has genuinely saved my ass more than once. I love these guys. - -**Setup:** Install Greptile on your GitHub repo at [greptile.com](https://greptile.com) — it takes about 30 seconds. Once it's reviewing your PRs, gstack picks up its comments automatically. No additional configuration. - -The problem with any automated reviewer is triage. Greptile is good, but not every comment is a real issue. Some are false positives. Some flag things you already fixed three commits ago. Without a triage layer, the comments pile up and you start ignoring them — which defeats the purpose. - -gstack solves this. `/review` and `/ship` are now Greptile-aware. They read Greptile's comments, classify each one, and take action: - -- **Valid issues** get added to the critical findings and fixed before shipping -- **Already-fixed issues** get an auto-reply acknowledging the catch -- **False positives** get pushed back — you confirm, and a reply goes out explaining why it's wrong - -The result is a two-layer review: Greptile catches things asynchronously on the PR, then `/review` and `/ship` triage those findings as part of the normal workflow. Nothing falls through the cracks. - -It also learns. Every false positive you confirm gets saved to `~/.gstack/greptile-history.md`. Future runs auto-skip known FP patterns for your codebase. And `/retro` tracks Greptile's batting average over time — so you can see whether the signal-to-noise ratio is improving. - -### Example - -``` -You: /ship - -Claude: [syncs main, runs tests, pre-landing review...] 
- - Greptile found 3 comments on this PR: - - [VALID] app/services/payment_service.rb:47 — Race condition: - concurrent charges can double-debit. Recommend DB-level advisory - lock around the charge block. - → https://github.com/you/app/pull/42#discussion_r123 - - [ALREADY FIXED] app/models/user.rb:12 — Missing null check - on email before downcase. - → Already fixed in commit abc1234. Auto-replying. - - [FALSE POSITIVE] lib/auth.rb:88 — "Token comparison should - use constant-time comparison." We already use - ActiveSupport::SecurityUtils.secure_compare here. - → https://github.com/you/app/pull/42#discussion_r125 - - Fixing the race condition in payment_service.rb... - Applied advisory lock. Committed: fix: address Greptile - review — advisory lock on concurrent charges - - The auth token comment is a false positive — we're already - using secure_compare. - - Options for the false positive: - A) Reply to Greptile explaining (recommended) - B) Fix it anyway - C) Ignore - -You: A - -Claude: Replied to Greptile. Re-running tests after the fix... - All tests pass. Continuing to version bump. - - [creates PR with Greptile Review section in body] - PR: https://github.com/you/app/pull/42 -``` - -Three Greptile comments. One real fix. One auto-acknowledged. One false positive pushed back with a reply. Total extra time: about 30 seconds. And the PR body has the full audit trail. - ---- - -## `/browse` - -This is my **QA engineer mode**. - -`/browse` is the skill that closes the loop. Before it, the agent could think and code but was still half blind. It had to guess about UI state, auth flows, redirects, console errors, empty states, and broken layouts. Now it can just go look. - -That is a real step-change. The full cycle becomes: plan, code, run the app, inspect the UI, reproduce the bug, verify the fix, ship. QA stops being "go click around manually" and becomes part of the same agentic loop. It gives the agent eyes. Now it can do the boring, high-context QA work itself: click through the app, catch breakage, verify the fix, and keep going. That feels like having a real operator in the machine. - -It is a compiled binary that talks to a persistent Chromium daemon — built on [Playwright](https://playwright.dev/) by Microsoft. First call starts the browser (~3s). Every call after that: ~100-200ms. The browser stays running between commands, so cookies, tabs, and localStorage carry over. - -After I push a branch, I tell Claude to go check staging. It does the entire QA pass for me: - -``` -You: /browse staging.myapp.com — log in, test the signup flow, and check - every page I changed in this branch - -Claude: [18 tool calls, ~60 seconds] - - > browse goto https://staging.myapp.com/signup - > browse snapshot -i - > browse fill @e2 "test@example.com" - > browse fill @e3 "password123" - > browse click @e5 (Submit) - > browse screenshot /tmp/signup.png - > Read /tmp/signup.png - - Signup works. Redirected to onboarding. Now checking changed pages. - - > browse goto https://staging.myapp.com/dashboard - > browse screenshot /tmp/dashboard.png - > Read /tmp/dashboard.png - > browse console - - Dashboard loads. No console errors. Charts render with sample data. - - > browse goto https://staging.myapp.com/settings/billing - > browse screenshot /tmp/billing.png - > Read /tmp/billing.png - - Billing page loads. Plan selector shows Free/Pro/Enterprise. - Stripe elements render correctly. 
- - > browse goto https://staging.myapp.com/api/health - > browse text - - API health endpoint returns {"status":"ok","version":"2.4.1"}. - - All 4 pages load correctly. No console errors. No broken layouts. - Signup → onboarding → dashboard flow works end to end. -``` - -18 tool calls, about a minute. It signed up a test user, navigated every page I changed, took screenshots, read them, checked the console for errors, and verified the API. That is a full QA pass. I did not open a browser. - -It works for reading documentation, verifying deployments, filling out forms, taking screenshots, comparing pages across environments, and anything else where Claude needs eyes on a live URL. - -**Security note:** `/browse` runs a persistent Chromium session. Cookies, localStorage, and session state carry over between commands. Do not use it against sensitive production environments unless you intend to — it is a real browser with real state. The session auto-shuts down after 30 minutes of idle time. - -For the full command reference, technical internals, and architecture details, see [BROWSER.md](BROWSER.md). - ---- - -## `/qa` - -This is my **QA lead mode**. - -`/browse` gives the agent eyes. `/qa` gives it a testing methodology. - -The most common use case: you're on a feature branch, you just finished coding, and you want to verify everything works. Just say `/qa` — it reads your git diff, identifies which pages and routes your changes affect, spins up the browser, and tests each one. No URL required. No manual test plan. It figures out what to test from the code you changed. - -``` -You: /qa - -Claude: Analyzing branch diff against main... - 12 files changed: 3 controllers, 2 views, 4 services, 3 tests - - Affected routes: /listings/new, /listings/:id, /api/listings - Detected app running on localhost:3000. - - [Tests each affected page — navigates, fills forms, clicks buttons, - screenshots, checks console errors] - - QA Report: 3 routes tested, all working. - - /listings/new: upload + enrichment flow works end to end - - /listings/:id: detail page renders correctly - - /api/listings: returns 200 with expected shape - No console errors. No regressions on adjacent pages. -``` - -Four modes: - -- **Diff-aware** (automatic on feature branches) — reads `git diff main`, identifies affected pages, tests them specifically. The fastest path from "I just wrote code" to "it works." -- **Full** — systematic exploration of the entire app. 5-15 minutes depending on app size. Documents 5-10 well-evidenced issues. -- **Quick** (`--quick`) — 30-second smoke test. Homepage + top 5 nav targets. Loads? Console errors? Broken links? -- **Regression** (`--regression baseline.json`) — run full mode, then diff against a previous baseline. Which issues are fixed? Which are new? What's the score delta? - -``` -You: /qa https://staging.myapp.com - -Claude: [Explores 12 pages, fills 3 forms, tests 2 flows] - - QA Report: staging.myapp.com — Health Score: 72/100 - - Top 3 Issues: - 1. CRITICAL: Checkout form submits with empty required fields - 2. HIGH: Mobile nav menu doesn't close after selecting an item - 3. MEDIUM: Dashboard chart overlaps sidebar below 1024px - - [Full report with screenshots saved to .gstack/qa-reports/] -``` - -Reports and screenshots accumulate in `.gstack/qa-reports/` so you can track quality over time and compare runs. - -**Testing authenticated pages:** Use `/setup-browser-cookies` first to import your real browser sessions, then `/qa` can test pages behind login. 
- ---- - -## `/setup-browser-cookies` - -This is my **session manager mode**. - -Before `/qa` or `/browse` can test authenticated pages, they need cookies. Instead of manually logging in through the headless browser every time, `/setup-browser-cookies` imports your real sessions directly from your daily browser. - -It auto-detects installed Chromium browsers (Comet, Chrome, Arc, Brave, Edge), decrypts cookies via the macOS Keychain, and loads them into the Playwright session. An interactive picker UI lets you choose exactly which domains to import — no cookie values are ever displayed. - -``` -You: /setup-browser-cookies - -Claude: Cookie picker opened — select the domains you want to import - in your browser, then tell me when you're done. - - [You pick github.com, myapp.com in the browser UI] - -You: done - -Claude: Imported 2 domains (47 cookies). Session is ready. -``` - -Or skip the UI entirely: - -``` -You: /setup-browser-cookies github.com - -Claude: Imported 12 cookies for github.com from Comet. -``` - -First import per browser triggers a macOS Keychain prompt — click "Allow" or "Always Allow." - ---- - -## `/retro` - -This is my **engineering manager mode**. - -At the end of the week I want to know what actually happened. Not vibes — data. `/retro` analyzes commit history, work patterns, and shipping velocity and writes a candid retrospective. - -It is team-aware. It identifies who is running the command, gives you the deepest treatment on your own work, then breaks down every contributor with specific praise and growth opportunities — the kind of feedback you would actually give in a 1:1. It computes metrics like commits, LOC, test ratio, PR sizes, and fix ratio. It detects coding sessions from commit timestamps, finds hotspot files, tracks shipping streaks, and identifies the biggest ship of the week. - -``` -You: /retro - -Claude: Week of Mar 1: 47 commits (3 contributors), 3.2k LOC, 38% tests, 12 PRs, peak: 10pm | Streak: 47d - - ## Your Week - 32 commits, +2.4k LOC, 41% tests. Peak hours: 9-11pm. - Biggest ship: cookie import system (browser decryption + picker UI). - What you did well: shipped a complete feature with encryption, UI, and - 18 unit tests in one focused push... - - ## Team Breakdown - - ### Alice - 12 commits focused on app/services/. Every PR under 200 LOC — disciplined. - Opportunity: test ratio at 12% — worth investing before payment gets more complex. - - ### Bob - 3 commits — fixed the N+1 query on dashboard. Small but high-impact. - Opportunity: only 1 active day this week — check if blocked on anything. - - [Top 3 team wins, 3 things to improve, 3 habits for next week] -``` - -It saves a JSON snapshot to `.context/retros/` so the next run can show trends. Run `/retro compare` to see this week vs last week side by side. - ---- - -## `/document-release` - -This is my **technical writer mode**. - -After `/ship` creates the PR but before it merges, `/document-release` reads every documentation file in the project and cross-references it against the diff. It updates file paths, command lists, project structure trees, and anything else that drifted. Risky or subjective changes get surfaced as questions — everything else is handled automatically. - -``` -You: /document-release - -Claude: Analyzing 21 files changed across 3 commits. Found 8 documentation files. 
- - README.md: updated skill count from 9 to 10, added new skill to table - CLAUDE.md: added new directory to project structure - CONTRIBUTING.md: current — no changes needed - TODOS.md: marked 2 items complete, added 1 new item - - All docs updated and committed. PR body updated with doc diff. -``` - -It also polishes CHANGELOG voice (without ever overwriting entries), cleans up completed TODOS, checks cross-doc consistency, and asks about VERSION bumps only when appropriate. - ---- +| Doc | What it covers | +|-----|---------------| +| [Skill Deep Dives](docs/skills.md) | Philosophy, examples, and workflow for every skill (includes Greptile integration) | +| [Architecture](ARCHITECTURE.md) | Design decisions and system internals | +| [Browser Reference](BROWSER.md) | Full command reference for `/browse` | +| [Contributing](CONTRIBUTING.md) | Dev setup, testing, contributor mode, and dev mode | +| [Changelog](CHANGELOG.md) | What's new in every version | ## Troubleshooting -**Skill not showing up in Claude Code?** -Run `cd ~/.claude/skills/gstack && ./setup` (or `cd .claude/skills/gstack && ./setup` for project installs). This rebuilds symlinks so Claude can discover the skills. - -**`/browse` fails or binary not found?** -Run `cd ~/.claude/skills/gstack && bun install && bun run build`. This compiles the browser binary. Requires Bun v1.0+. - -**Project copy is stale?** -Run `/gstack-upgrade` — it updates both the global install and any vendored project copy automatically. - -**`bun` not installed?** -Install it: `curl -fsSL https://bun.sh/install | bash` - -## Upgrading - -Run `/gstack-upgrade` in Claude Code. It detects your install type (global or vendored), upgrades, syncs any project copies, and shows what's new. +**Skill not showing up?** `cd ~/.claude/skills/gstack && ./setup` -Or set `auto_upgrade: true` in `~/.gstack/config.yaml` to upgrade automatically whenever a new version is available. - -## Uninstalling - -Paste this into Claude Code: - -> Uninstall gstack: remove the skill symlinks by running `for s in browse plan-ceo-review plan-eng-review plan-design-review design-consultation review ship retro qa qa-only qa-design-review setup-browser-cookies document-release; do rm -f ~/.claude/skills/$s; done` then run `rm -rf ~/.claude/skills/gstack` and remove the gstack section from CLAUDE.md. If this project also has gstack at .claude/skills/gstack, remove it by running `for s in browse plan-ceo-review plan-eng-review plan-design-review review ship retro qa qa-only qa-design-review setup-browser-cookies document-release; do rm -f .claude/skills/$s; done && rm -rf .claude/skills/gstack` and remove the gstack section from the project CLAUDE.md too. - -## Development - -See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, testing, and dev mode. See [ARCHITECTURE.md](ARCHITECTURE.md) for design decisions and system internals. See [BROWSER.md](BROWSER.md) for the browse command reference. - -### Testing - -```bash -bun test # free static tests (<5s) -EVALS=1 bun run test:evals # full E2E + LLM evals (~$4, ~20min) -bun run eval:watch # live dashboard during E2E runs -``` +**`/browse` fails?** `cd ~/.claude/skills/gstack && bun install && bun run build` -E2E tests stream real-time progress, write machine-readable diagnostics, and persist partial results that survive kills. See CONTRIBUTING.md for the full eval infrastructure. +**Stale install?** Run `/gstack-upgrade` — or set `auto_upgrade: true` in `~/.gstack/config.yaml` ## License -MIT +MIT. Free forever. Go build something. 
diff --git a/SKILL.md b/SKILL.md index c0d6e267..72bdb682 100644 --- a/SKILL.md +++ b/SKILL.md @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. 
An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/TODOS.md b/TODOS.md index a0801d85..8616f906 100644 --- a/TODOS.md +++ b/TODOS.md @@ -386,6 +386,16 @@ **Priority:** P2 **Depends on:** None +### Cross-platform URL open helper + +**What:** `gstack-open-url` helper script — detect platform, use `open` (macOS) or `xdg-open` (Linux). + +**Why:** The first-time Completeness Principle intro uses macOS `open` to launch the essay. If gstack ever supports Linux, this silently fails. + +**Effort:** S (human: ~30 min / CC: ~2 min) +**Priority:** P4 +**Depends on:** Nothing + ### CDP-based DOM mutation detection for ref staleness **What:** Use Chrome DevTools Protocol `DOM.documentUpdated` / MutationObserver events to proactively invalidate stale refs when the DOM changes, without requiring an explicit `snapshot` call. @@ -458,6 +468,20 @@ Shipped as `/design-consultation` on garrytan/design branch. Renamed from `/setu **Priority:** P2 **Depends on:** Ship Confidence Dashboard (shipped) +## Completeness + +### Completeness metrics dashboard + +**What:** Track how often Claude chooses the complete option vs shortcut across gstack sessions. Aggregate into a dashboard showing completeness trend over time. + +**Why:** Without measurement, we can't know if the Completeness Principle is working. Could surface patterns (e.g., certain skills still bias toward shortcuts). + +**Context:** Would require logging choices (e.g., append to a JSONL file when AskUserQuestion resolves), parsing them, and displaying trends. Similar pattern to eval persistence. 
+ +**Effort:** M (human) / S (CC) +**Priority:** P3 +**Depends on:** Boil the Lake shipped (v0.6.1) + ## Completed ### Phase 1: Foundations (v0.2.0) diff --git a/VERSION b/VERSION index 758efdb4..ee6cdce3 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.6.0.1 +0.6.1 diff --git a/browse/SKILL.md b/browse/SKILL.md index 33accc09..50198d5f 100644 --- a/browse/SKILL.md +++ b/browse/SKILL.md @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. 
"Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/design-consultation/SKILL.md b/design-consultation/SKILL.md index f04e2213..8fd9cb4b 100644 --- a/design-consultation/SKILL.md +++ b/design-consultation/SKILL.md @@ -2,7 +2,7 @@ name: design-consultation version: 1.0.0 description: | - Design consultation: understands your product, researches competitors, proposes a + Design consultation: understands your product, researches the landscape, proposes a complete design system (aesthetic, typography, color, layout, spacing, motion), and generates font+color preview pages. Creates DESIGN.md as your project's design source of truth. For existing sites, use /plan-design-review to infer the system instead. @@ -31,22 +31,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. 
Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) 
+- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -158,7 +197,7 @@ Ask the user a single question that covers everything you need to know. Pre-fill 3. "Want me to research what top products in your space are doing for design, or should I work from my design knowledge?" 4. **Explicitly say:** "At any point you can just drop into chat and we'll talk through anything — this isn't a rigid form, it's a conversation." -If the README or brainstorm gives you enough context, pre-fill and confirm: *"From what I can see, this is [X] for [Y] in the [Z] space. Sound right? And would you like me to research competitors, or should I work from what I know?"* +If the README or brainstorm gives you enough context, pre-fill and confirm: *"From what I can see, this is [X] for [Y] in the [Z] space. Sound right? And would you like me to research what's out there in this space, or should I work from what I know?"* --- @@ -166,7 +205,7 @@ If the README or brainstorm gives you enough context, pre-fill and confirm: *"Fr If the user wants competitive research: -**Step 1: Identify competitors via WebSearch** +**Step 1: Identify what's out there via WebSearch** Use WebSearch to find 5-10 products in their space. Search for: - "[product category] website design" @@ -175,17 +214,17 @@ Use WebSearch to find 5-10 products in their space. Search for: **Step 2: Visual research via browse (if available)** -If the browse binary is available (`$B` is set), visit the top 3-5 competitor sites and capture visual evidence: +If the browse binary is available (`$B` is set), visit the top 3-5 sites in the space and capture visual evidence: ```bash -$B goto "https://competitor-site.com" -$B screenshot "/tmp/design-research-competitor-name.png" +$B goto "https://example-site.com" +$B screenshot "/tmp/design-research-site-name.png" $B snapshot ``` -For each competitor, analyze: fonts actually used, color palette, layout approach, spacing density, aesthetic direction. The screenshot gives you the feel; the snapshot gives you structural data. +For each site, analyze: fonts actually used, color palette, layout approach, spacing density, aesthetic direction. The screenshot gives you the feel; the snapshot gives you structural data. -If a competitor site blocks the headless browser or requires login, skip it and note why. +If a site blocks the headless browser or requires login, skip it and note why. If browse is not available, rely on WebSearch results and your built-in design knowledge — this is fine. @@ -194,7 +233,7 @@ If browse is not available, rely on WebSearch results and your built-in design k The goal of research is NOT to copy. It is to get in the ballpark — to understand the visual language users in this category already expect. This gives you the baseline. The interesting design work starts after you have the baseline: deciding where to follow conventions (so the product feels literate) and where to break from them (so the product is memorable). Summarize conversationally: -> "I looked at [competitors]. Here's the landscape: they converge on [patterns]. Most of them feel [observation — e.g., interchangeable, polished but generic, etc.]. 
The opportunity to stand out is [gap]. Here's where I'd play it safe and where I'd take a risk..." +> "I looked at what's out there. Here's the landscape: they converge on [patterns]. Most of them feel [observation — e.g., interchangeable, polished but generic, etc.]. The opportunity to stand out is [gap]. Here's where I'd play it safe and where I'd take a risk..." **Graceful degradation:** - Browse available → screenshots + snapshots + WebSearch (richest research) diff --git a/design-consultation/SKILL.md.tmpl b/design-consultation/SKILL.md.tmpl index 0ffb121c..141a9190 100644 --- a/design-consultation/SKILL.md.tmpl +++ b/design-consultation/SKILL.md.tmpl @@ -2,7 +2,7 @@ name: design-consultation version: 1.0.0 description: | - Design consultation: understands your product, researches competitors, proposes a + Design consultation: understands your product, researches the landscape, proposes a complete design system (aesthetic, typography, color, layout, spacing, motion), and generates font+color preview pages. Creates DESIGN.md as your project's design source of truth. For existing sites, use /plan-design-review to infer the system instead. @@ -76,7 +76,7 @@ Ask the user a single question that covers everything you need to know. Pre-fill 3. "Want me to research what top products in your space are doing for design, or should I work from my design knowledge?" 4. **Explicitly say:** "At any point you can just drop into chat and we'll talk through anything — this isn't a rigid form, it's a conversation." -If the README or brainstorm gives you enough context, pre-fill and confirm: *"From what I can see, this is [X] for [Y] in the [Z] space. Sound right? And would you like me to research competitors, or should I work from what I know?"* +If the README or brainstorm gives you enough context, pre-fill and confirm: *"From what I can see, this is [X] for [Y] in the [Z] space. Sound right? And would you like me to research what's out there in this space, or should I work from what I know?"* --- @@ -84,7 +84,7 @@ If the README or brainstorm gives you enough context, pre-fill and confirm: *"Fr If the user wants competitive research: -**Step 1: Identify competitors via WebSearch** +**Step 1: Identify what's out there via WebSearch** Use WebSearch to find 5-10 products in their space. Search for: - "[product category] website design" @@ -93,17 +93,17 @@ Use WebSearch to find 5-10 products in their space. Search for: **Step 2: Visual research via browse (if available)** -If the browse binary is available (`$B` is set), visit the top 3-5 competitor sites and capture visual evidence: +If the browse binary is available (`$B` is set), visit the top 3-5 sites in the space and capture visual evidence: ```bash -$B goto "https://competitor-site.com" -$B screenshot "/tmp/design-research-competitor-name.png" +$B goto "https://example-site.com" +$B screenshot "/tmp/design-research-site-name.png" $B snapshot ``` -For each competitor, analyze: fonts actually used, color palette, layout approach, spacing density, aesthetic direction. The screenshot gives you the feel; the snapshot gives you structural data. +For each site, analyze: fonts actually used, color palette, layout approach, spacing density, aesthetic direction. The screenshot gives you the feel; the snapshot gives you structural data. -If a competitor site blocks the headless browser or requires login, skip it and note why. +If a site blocks the headless browser or requires login, skip it and note why. 
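+
+When you're covering several sites in one pass, the same three commands compose into a loop. A minimal sketch: the URLs are placeholders, the slug derivation is improvised, and it assumes `$B goto` exits nonzero when a page fails to load:
+
+```bash
+# Sketch: batch the visual research pass. $B is the browse binary from the preamble.
+for url in "https://site-one.example" "https://site-two.example" "https://site-three.example"; do
+  slug=$(printf '%s' "$url" | sed -E 's|^https?://||; s|[^A-Za-z0-9]+|-|g')
+  $B goto "$url" || continue            # blocked or login-walled: skip it and note why
+  $B screenshot "/tmp/design-research-$slug.png"
+  $B snapshot                           # structural data to pair with the screenshot
+done
+```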
If browse is not available, rely on WebSearch results and your built-in design knowledge — this is fine. @@ -112,7 +112,7 @@ If browse is not available, rely on WebSearch results and your built-in design k The goal of research is NOT to copy. It is to get in the ballpark — to understand the visual language users in this category already expect. This gives you the baseline. The interesting design work starts after you have the baseline: deciding where to follow conventions (so the product feels literate) and where to break from them (so the product is memorable). Summarize conversationally: -> "I looked at [competitors]. Here's the landscape: they converge on [patterns]. Most of them feel [observation — e.g., interchangeable, polished but generic, etc.]. The opportunity to stand out is [gap]. Here's where I'd play it safe and where I'd take a risk..." +> "I looked at what's out there. Here's the landscape: they converge on [patterns]. Most of them feel [observation — e.g., interchangeable, polished but generic, etc.]. The opportunity to stand out is [gap]. Here's where I'd play it safe and where I'd take a risk..." **Graceful degradation:** - Browse available → screenshots + snapshots + WebSearch (richest research) diff --git a/docs/images/github-2013.png b/docs/images/github-2013.png new file mode 100644 index 00000000..186324ba Binary files /dev/null and b/docs/images/github-2013.png differ diff --git a/docs/images/github-2026.png b/docs/images/github-2026.png new file mode 100644 index 00000000..b908c8a4 Binary files /dev/null and b/docs/images/github-2026.png differ diff --git a/docs/skills.md b/docs/skills.md new file mode 100644 index 00000000..16045860 --- /dev/null +++ b/docs/skills.md @@ -0,0 +1,649 @@ +# Skill Deep Dives + +Detailed guides for every gstack skill — philosophy, workflow, and examples. + +| Skill | Your specialist | What they do | +|-------|----------------|--------------| +| [`/plan-ceo-review`](#plan-ceo-review) | **CEO / Founder** | Rethink the problem. Find the 10-star product hiding inside the request. Four modes: Expansion, Selective Expansion, Hold Scope, Reduction. | +| [`/plan-eng-review`](#plan-eng-review) | **Eng Manager** | Lock in architecture, data flow, diagrams, edge cases, and tests. Forces hidden assumptions into the open. | +| [`/plan-design-review`](#plan-design-review) | **Senior Designer** | 80-item design audit with letter grades. AI Slop detection. Infers your design system. Report only — never touches code. | +| [`/design-consultation`](#design-consultation) | **Design Partner** | Build a complete design system from scratch. Knows the landscape, proposes creative risks, generates realistic product mockups. Design at the heart of all other phases. | +| [`/review`](#review) | **Staff Engineer** | Find the bugs that pass CI but blow up in production. Auto-fixes the obvious ones. Flags completeness gaps. | +| [`/ship`](#ship) | **Release Engineer** | Sync main, run tests, audit coverage, push, open PR. Bootstraps test frameworks if you don't have one. One command. | +| [`/browse`](#browse) | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. | +| [`/qa`](#qa) | **QA Lead** | Test your app, find bugs, fix them with atomic commits, re-verify. Auto-generates regression tests for every fix. | +| [`/qa-only`](#qa) | **QA Reporter** | Same methodology as /qa but report only. Use when you want a pure bug report without code changes. 
| +| [`/qa-design-review`](#qa-design-review) | **Designer Who Codes** | Same audit as /plan-design-review, then fixes what it finds. Atomic commits, before/after screenshots. | +| [`/setup-browser-cookies`](#setup-browser-cookies) | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. | +| [`/retro`](#retro) | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. | +| [`/document-release`](#document-release) | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. | + +--- + +## `/plan-ceo-review` + +This is my **founder mode**. + +This is where I want the model to think with taste, ambition, user empathy, and a long time horizon. I do not want it taking the request literally. I want it asking a more important question first: + +**What is this product actually for?** + +I think of this as **Brian Chesky mode**. + +The point is not to implement the obvious ticket. The point is to rethink the problem from the user's point of view and find the version that feels inevitable, delightful, and maybe even a little magical. + +### Example + +Say I am building a Craigslist-style listing app and I say: + +> "Let sellers upload a photo for their item." + +A weak assistant will add a file picker and save an image. + +That is not the real product. + +In `/plan-ceo-review`, I want the model to ask whether "photo upload" is even the feature. Maybe the real feature is helping someone create a listing that actually sells. + +If that is the real job, the whole plan changes. + +Now the model should ask: + +* Can we identify the product from the photo? +* Can we infer the SKU or model number? +* Can we search the web and draft the title and description automatically? +* Can we pull specs, category, and pricing comps? +* Can we suggest which photo will convert best as the hero image? +* Can we detect when the uploaded photo is ugly, dark, cluttered, or low-trust? +* Can we make the experience feel premium instead of like a dead form from 2007? + +That is what `/plan-ceo-review` does for me. + +It does not just ask, "how do I add this feature?" +It asks, **"what is the 10-star product hiding inside this request?"** + +### Four modes + +- **SCOPE EXPANSION** — dream big. The agent proposes the ambitious version. Every expansion is presented as an individual decision you opt into. Recommends enthusiastically. +- **SELECTIVE EXPANSION** — hold your current scope as the baseline, but see what else is possible. The agent surfaces opportunities one by one with neutral recommendations — you cherry-pick the ones worth doing. +- **HOLD SCOPE** — maximum rigor on the existing plan. No expansions surfaced. +- **SCOPE REDUCTION** — find the minimum viable version. Cut everything else. + +Visions and decisions are persisted to `~/.gstack/projects/` so they survive beyond the conversation. Exceptional visions can be promoted to `docs/designs/` in your repo for the team. + +--- + +## `/plan-eng-review` + +This is my **eng manager mode**. + +Once the product direction is right, I want a different kind of intelligence entirely. I do not want more sprawling ideation. I do not want more "wouldn't it be cool if." I want the model to become my best technical lead. 
+
+This mode should nail:
+
+* architecture
+* system boundaries
+* data flow
+* state transitions
+* failure modes
+* edge cases
+* trust boundaries
+* test coverage
+
+And one surprisingly big unlock for me: **diagrams**.
+
+LLM plans get way more complete when you force the model to draw the system. Sequence diagrams, state diagrams, component diagrams, data-flow diagrams, even test matrices. Diagrams force hidden assumptions into the open. They make hand-wavy planning much harder.
+
+So `/plan-eng-review` is where I want the model to build the technical spine that can carry the product vision.
+
+### Example
+
+Take the same listing app example.
+
+Let's say `/plan-ceo-review` already did its job. We decided the real feature is not just photo upload. It is a smart listing flow that:
+
+* uploads photos
+* identifies the product
+* enriches the listing from the web
+* drafts a strong title and description
+* suggests the best hero image
+
+Now `/plan-eng-review` takes over.
+
+Now I want the model to answer questions like:
+
+* What is the architecture for upload, classification, enrichment, and draft generation?
+* Which steps happen synchronously, and which go to background jobs?
+* Where are the boundaries between app server, object storage, vision model, search/enrichment APIs, and the listing database?
+* What happens if upload succeeds but enrichment fails?
+* What happens if product identification is low-confidence?
+* How do retries work?
+* How do we prevent duplicate jobs?
+* What gets persisted when, and what can be safely recomputed?
+
+And this is where I want diagrams — architecture diagrams, state models, data-flow diagrams, test matrices. The same forcing function, now aimed at a concrete plan.
+
+That is `/plan-eng-review`.
+
+Not "make the idea smaller."
+**Make the idea buildable.**
+
+### Review Readiness Dashboard
+
+Every review (CEO, Eng, Design) logs its result. At the end of each review, you see a dashboard:
+
+```
++=====================================================================+
+|                     REVIEW READINESS DASHBOARD                      |
++=====================================================================+
+| Review          | Runs | Last Run            | Status    | Required |
+|-----------------|------|---------------------|-----------|----------|
+| Eng Review      | 1    | 2026-03-16 15:00    | CLEAR     | YES      |
+| CEO Review      | 1    | 2026-03-16 14:30    | CLEAR     | no       |
+| Design Review   | 0    | —                   | —         | no       |
++---------------------------------------------------------------------+
+| VERDICT: CLEARED — Eng Review passed                                |
++=====================================================================+
+```
+
+Eng Review is the only required gate (disable with `gstack-config set skip_eng_review true`). CEO and Design are informational — recommended for product and UI changes respectively.
+
+### Plan-to-QA flow
+
+When `/plan-eng-review` finishes the test review section, it writes a test plan artifact to `~/.gstack/projects/`. When you later run `/qa`, it picks up that test plan automatically — your engineering review feeds directly into QA testing with no manual copy-paste.
+
+---
+
+## `/plan-design-review`
+
+This is my **senior designer mode**.
+
+Most developers cannot tell whether their site looks AI-generated. I could not, until I started paying attention. There is a growing class of sites that are functional but soulless — they work fine but scream "an AI built this and nobody with taste looked at it."
Purple gradients, 3-column icon grids, uniform bubbly border-radius on everything, centered text on every section, decorative blobs floating in the background. The ChatGPT aesthetic. + +`/plan-design-review` gives the agent a designer's eye. + +It opens your site and reacts to it the way a Stripe or Linear designer would — immediately, viscerally, with opinions. The first output is a structured gut reaction: what the site communicates at a glance, what the eye is drawn to, and a one-word verdict. That is the most valuable part. Everything after is supporting evidence. + +Then it extracts your actual design system by reading the CSS: how many fonts are you really using? What is your heading scale? How many colors? Is your spacing on a grid or arbitrary? It structures this as an **Inferred Design System** and offers to save it as a `DESIGN.md` — so now you have a design system you never had before, derived from what you actually shipped. + +The audit covers 80 items across 10 categories. Typography, spacing, hierarchy, color, responsive design, interaction states, motion, content quality, performance feel, and — the headline feature — **AI Slop Detection**. Ten specific anti-patterns that mark a site as machine-generated. + +Every finding gets a letter grade (A through F) per category. The report opens with two headline scores: **Design Score** and **AI Slop Score**. "Your site got a B on design but a D on AI Slop" tells you exactly where to focus. + +### Example + +``` +You: /plan-design-review https://myapp.com + +Claude: Design Score: C | AI Slop Score: D + + "The site communicates generic SaaS energy." + "I notice the hero section uses a blue-to-purple gradient + with a 3-column feature grid below it — the single most + recognizable AI-generated layout pattern." + "The first 3 things my eye goes to are: the gradient, the + symmetrical icon grid, and a 'Get Started' button that + looks like every other 'Get Started' button." + "If I had to describe this in one word: template." + + Inferred Design System: + - Fonts: Inter (body), Inter (headings) — 1 font family + - Colors: 14 unique non-gray colors (flag: >12) + - Heading scale: 36/24/18/16 — flat, no drama between sizes + - Spacing: mostly 16px and 24px — no rhythm + + Top 5 Design Improvements: + 1. Break the 3-column icon grid (High, AI Slop) + 2. Add heading scale contrast — at least 2:1 ratio (High, Typography) + 3. Replace uniform border-radius with intentional hierarchy (Medium, Spacing) + 4. Kill the gradient hero — use a real image or bold typography (High, AI Slop) + 5. Add a second font for headings — Inter-only reads as generic (Medium, Typography) + + [Full report saved to .gstack/design-reports/] + Want me to save this inferred design system as your DESIGN.md? +``` + +This is report only — it never touches your code. Use `/qa-design-review` when you want it to fix what it finds. + +--- + +## `/design-consultation` + +This is my **design partner mode**. + +`/plan-design-review` audits a site that already exists. `/design-consultation` is for when you have nothing yet — no design system, no font choices, no color palette. You are starting from zero and you want a senior designer to sit down with you and build the whole visual identity together. + +It is a conversation, not a form. The agent asks about your product, your users, and your audience. It thinks about what your product needs to communicate — trust, speed, craft, warmth, whatever fits — and works backward from that to concrete choices. 
Then it proposes a complete, coherent design system: aesthetic direction, typography (3+ fonts with specific roles), color palette with hex values, spacing scale, layout approach, and motion strategy. Every recommendation comes with a rationale. Every choice reinforces every other choice. + +But coherence is table stakes. Every dev tool dashboard looks the same — clean sans-serif, muted grays, a blue accent. They are all coherent. They are all forgettable. The difference between a product that looks "nice" and one that people actually recognize is the **deliberate creative risks**: the unexpected serif for headings, the bold accent nobody else in your category uses, the tighter spacing that makes your data feel authoritative instead of airy. + +That is what `/design-consultation` is really about. It does not just propose a safe system. It proposes safe choices AND risks — and tells you which is which. "Here are the choices that keep you literate in your category. And here is where I think you should break from convention, and why." You pick which risks to take. The agent checks that the whole system still coheres either way. + +If you want, the agent will research what's already out there in your space — take screenshots of real sites, analyze their fonts and colors and spacing — so you can see the landscape before you make choices. This is not about copying. It is about getting in the ballpark so you know what the conventions are, and then deciding which ones are worth breaking. + +After you agree on the system, it generates an interactive HTML preview page — not just swatches and font samples, but realistic product pages. If you are building a dashboard, you see a dashboard with a sidebar, data tables, and stat cards. If you are building a marketing site, you see a hero section with real copy and a CTA. Everything rendered in your design system, with your product name, in light and dark mode. You see what your product could feel like before a single line of production code is written. + +Then it writes `DESIGN.md` to your repo root — your project's design source of truth — and updates `CLAUDE.md` so every future Claude Code session respects the system. From that point on, `/qa-design-review` can audit against it, and any agent working on your frontend knows the rules. + +### Example + +``` +You: /design-consultation + +Claude: From the README, this looks like a developer analytics dashboard + for engineering teams. Sound right? Tell me about the feeling + you want — and would you like me to look at what's out there + first so we can see the landscape? + +You: Yes, it's for eng managers. Show me what's out there. + +Claude: [browses Linear, Vercel, Datadog, Posthog — takes screenshots] + + Here's the landscape. They all converge on clean sans-serifs + and muted blue-gray palettes. Linear stands out with dark-first + and tight spacing. The rest feel interchangeable. + + Here's my proposal — and where I'd play it safe vs. 
take risks: + + AESTHETIC: Industrial/Utilitarian — data-first, every pixel earns its place + TYPOGRAPHY: Geist (body + UI), Instrument Serif (display), JetBrains Mono (data) + COLOR: Restrained — slate neutrals, single teal accent (#0D9488) + SPACING: 4px base, compact density + LAYOUT: Grid-disciplined, 12-column + MOTION: Minimal-functional — only transitions that aid comprehension + + SAFE CHOICES (match category expectations): + Geist for body — your users already know this font from Vercel + Slate neutrals — warm grays feel right for data-heavy UIs + Grid-disciplined layout — dashboards need predictable alignment + + RISKS (where you'd stand out): + Instrument Serif for display headings — nobody in dev tools + uses a serif. It creates instant hierarchy contrast and + signals "we care about craft" in a space full of sans-serifs. + Single teal accent instead of blue — everyone else in this space uses blue. + Teal is close enough to feel trustworthy, different enough + to be recognizable. + 4px compact density — tighter than Linear, says "we respect + your screen real estate." + + The safe choices keep you literate. The risks are where your + product gets its own face. Which risks appeal to you? + +You: Love the serif risk. Let's see it. + +Claude: Preview opened in your browser. + [/tmp/design-consultation-preview-1710000000.html] + + The page shows a realistic dashboard: sidebar nav, header with + avatar, stat cards, data table — all in your design system. + Instrument Serif at hero scale, Geist in body text, JetBrains + Mono in the data table. Teal accent on buttons and active states. + Light and dark mode toggle in the corner. + + How does it feel? Want to adjust anything before I write DESIGN.md? + +You: Ship it. + +Claude: Wrote DESIGN.md (typography, color, spacing, layout, motion). + Updated CLAUDE.md with design system guidelines. + All future sessions will respect these choices. +``` + +--- + +## `/qa-design-review` + +This is my **designer who codes mode**. + +`/plan-design-review` tells you what is wrong. `/qa-design-review` fixes it. + +It runs the same 80-item audit, then enters a fix loop: for each design finding, it locates the source file, makes the minimal CSS/styling change, commits with `style(design): FINDING-NNN`, re-navigates to verify, and takes before/after screenshots. One commit per fix, fully bisectable. + +The self-regulation heuristic is tuned for design work — CSS-only changes get a free pass (they are inherently safe and reversible), but changes to component JSX/TSX files count against the risk budget. Hard cap at 30 fixes. If the risk score exceeds 20%, it stops and asks. + +### Example + +``` +You: /qa-design-review https://myapp.com + +Claude: [Runs full design audit — same output as /plan-design-review] + Design Score: C | AI Slop Score: D + 12 findings (4 high, 5 medium, 3 polish) + + Fixing 9 design issues... 
+ + style(design): FINDING-001 — replace 3-column icon grid with asymmetric layout + style(design): FINDING-002 — add heading scale 48/32/24/18/16 + style(design): FINDING-003 — remove gradient hero, use bold typography + style(design): FINDING-004 — add second font for headings + style(design): FINDING-005 — vary border-radius by element role + style(design): FINDING-006 — left-align body text, reserve center for headings + style(design): FINDING-007 — add hover/focus states to all interactive elements + style(design): FINDING-008 — add prefers-reduced-motion media query + style(design): FINDING-009 — set max content width to 680px for body text + + Final audit: + Design Score: C → B+ | AI Slop Score: D → A + 9 fixes applied (8 verified, 1 best-effort). 3 deferred. + + [Report with before/after screenshots saved to .gstack/design-reports/] +``` + +Nine commits, each touching one concern. The AI Slop score went from D to A because the three most recognizable patterns (gradient hero, 3-column grid, uniform radius) are gone. + +--- + +## `/review` + +This is my **paranoid staff engineer mode**. + +Passing tests do not mean the branch is safe. + +`/review` exists because there is a whole class of bugs that can survive CI and still punch you in the face in production. This mode is not about dreaming bigger. It is not about making the plan prettier. It is about asking: + +**What can still break?** + +This is a structural audit, not a style nitpick pass. I want the model to look for things like: + +* N+1 queries +* stale reads +* race conditions +* bad trust boundaries +* missing indexes +* escaping bugs +* broken invariants +* bad retry logic +* tests that pass while missing the real failure mode +* forgotten enum handlers — add a new status or type constant, and `/review` traces it through every switch statement and allowlist in your codebase, not just the files you changed + +### Fix-First + +Findings get action, not just listed. Obvious mechanical fixes (dead code, stale comments, N+1 queries) are applied automatically — you see `[AUTO-FIXED] file:line Problem → what was done` for each one. Genuinely ambiguous issues (security, race conditions, design decisions) get surfaced for your call. + +### Completeness gaps + +`/review` now flags shortcut implementations where the complete version costs less than 30 minutes of CC time. If you chose the 80% solution and the 100% solution is a lake, not an ocean, the review will call it out. + +### Example + +Suppose the smart listing flow is implemented and the tests are green. + +`/review` should still ask: + +* Did I introduce an N+1 query when rendering listing photos or draft suggestions? +* Am I trusting client-provided file metadata instead of validating the actual file? +* Can two tabs race and overwrite cover-photo selection or item details? +* Do failed uploads leave orphaned files in storage forever? +* Can the "exactly one hero image" rule break under concurrency? +* If enrichment APIs partially fail, do I degrade gracefully or save garbage? +* Did I accidentally create a prompt injection or trust-boundary problem by pulling web data into draft generation? + +That is the point of `/review`. + +I do not want flattery here. +I want the model imagining the production incident before it happens. + +--- + +## `/ship` + +This is my **release machine mode**. + +Once I have decided what to build, nailed the technical plan, and run a serious review, I do not want more talking. I want execution. + +`/ship` is for the final mile. 
It is for a ready branch, not for deciding what to build. + +This is where the model should stop behaving like a brainstorm partner and start behaving like a disciplined release engineer: sync with main, run the right tests, make sure the branch state is sane, update changelog or versioning if the repo expects it, push, and create or update the PR. + +### Test bootstrap + +If your project doesn't have a test framework, `/ship` sets one up — detects your runtime, researches the best framework, installs it, writes 3-5 real tests for your actual code, sets up CI/CD (GitHub Actions), and creates TESTING.md. 100% test coverage is the goal — tests make vibe coding safe instead of yolo coding. + +### Coverage audit + +Every `/ship` run builds a code path map from your diff, searches for corresponding tests, and produces an ASCII coverage diagram with quality stars. Gaps get tests auto-generated. Your PR body shows the coverage: `Tests: 42 → 47 (+5 new)`. + +### Review gate + +`/ship` checks the [Review Readiness Dashboard](#review-readiness-dashboard) before creating the PR. If the Eng Review is missing, it asks — but won't block you. Decisions are saved per-branch so you're never re-asked. + +A lot of branches die when the interesting work is done and only the boring release work is left. Humans procrastinate that part. AI should not. + +--- + +## `/browse` + +This is my **QA engineer mode**. + +`/browse` is the skill that closes the loop. Before it, the agent could think and code but was still half blind. It had to guess about UI state, auth flows, redirects, console errors, empty states, and broken layouts. Now it can just go look. + +It is a compiled binary that talks to a persistent Chromium daemon — built on [Playwright](https://playwright.dev/) by Microsoft. First call starts the browser (~3s). Every call after that: ~100-200ms. The browser stays running between commands, so cookies, tabs, and localStorage carry over. + +### Example + +``` +You: /browse staging.myapp.com — log in, test the signup flow, and check + every page I changed in this branch + +Claude: [18 tool calls, ~60 seconds] + + > browse goto https://staging.myapp.com/signup + > browse snapshot -i + > browse fill @e2 "test@example.com" + > browse fill @e3 "password123" + > browse click @e5 (Submit) + > browse screenshot /tmp/signup.png + > Read /tmp/signup.png + + Signup works. Redirected to onboarding. Now checking changed pages. + + > browse goto https://staging.myapp.com/dashboard + > browse screenshot /tmp/dashboard.png + > Read /tmp/dashboard.png + > browse console + + Dashboard loads. No console errors. Charts render with sample data. + + All 4 pages load correctly. No console errors. No broken layouts. + Signup → onboarding → dashboard flow works end to end. +``` + +18 tool calls, about a minute. Full QA pass. No browser opened. + +**Security note:** `/browse` runs a persistent Chromium session. Cookies, localStorage, and session state carry over between commands. Do not use it against sensitive production environments unless you intend to — it is a real browser with real state. The session auto-shuts down after 30 minutes of idle time. + +For the full command reference, see [BROWSER.md](../BROWSER.md). + +--- + +## `/qa` + +This is my **QA lead mode**. + +`/browse` gives the agent eyes. `/qa` gives it a testing methodology. + +The most common use case: you're on a feature branch, you just finished coding, and you want to verify everything works. 
Just say `/qa` — it reads your git diff, identifies which pages and routes your changes affect, spins up the browser, and tests each one. No URL required. No manual test plan. + +Four modes: + +- **Diff-aware** (automatic on feature branches) — reads `git diff main`, identifies affected pages, tests them specifically +- **Full** — systematic exploration of the entire app. 5-15 minutes. Documents 5-10 well-evidenced issues. +- **Quick** (`--quick`) — 30-second smoke test. Homepage + top 5 nav targets. +- **Regression** (`--regression baseline.json`) — run full mode, then diff against a previous baseline. + +### Automatic regression tests + +When `/qa` fixes a bug and verifies it, it automatically generates a regression test that catches the exact scenario that broke. Tests include full attribution tracing back to the QA report. + +### Example + +``` +You: /qa https://staging.myapp.com + +Claude: [Explores 12 pages, fills 3 forms, tests 2 flows] + + QA Report: staging.myapp.com — Health Score: 72/100 + + Top 3 Issues: + 1. CRITICAL: Checkout form submits with empty required fields + 2. HIGH: Mobile nav menu doesn't close after selecting an item + 3. MEDIUM: Dashboard chart overlaps sidebar below 1024px + + [Full report with screenshots saved to .gstack/qa-reports/] +``` + +**Testing authenticated pages:** Use `/setup-browser-cookies` first to import your real browser sessions, then `/qa` can test pages behind login. + +--- + +## `/setup-browser-cookies` + +This is my **session manager mode**. + +Before `/qa` or `/browse` can test authenticated pages, they need cookies. Instead of manually logging in through the headless browser every time, `/setup-browser-cookies` imports your real sessions directly from your daily browser. + +It auto-detects installed Chromium browsers (Comet, Chrome, Arc, Brave, Edge), decrypts cookies via the macOS Keychain, and loads them into the Playwright session. An interactive picker UI lets you choose exactly which domains to import — no cookie values are ever displayed. + +``` +You: /setup-browser-cookies + +Claude: Cookie picker opened — select the domains you want to import + in your browser, then tell me when you're done. + + [You pick github.com, myapp.com in the browser UI] + +You: done + +Claude: Imported 2 domains (47 cookies). Session is ready. +``` + +Or skip the UI entirely: + +``` +You: /setup-browser-cookies github.com + +Claude: Imported 12 cookies for github.com from Comet. +``` + +--- + +## `/retro` + +This is my **engineering manager mode**. + +At the end of the week I want to know what actually happened. Not vibes — data. `/retro` analyzes commit history, work patterns, and shipping velocity and writes a candid retrospective. + +It is team-aware. It identifies who is running the command, gives you the deepest treatment on your own work, then breaks down every contributor with specific praise and growth opportunities. It computes metrics like commits, LOC, test ratio, PR sizes, and fix ratio. It detects coding sessions from commit timestamps, finds hotspot files, tracks shipping streaks, and identifies the biggest ship of the week. + +It also tracks test health: total test files, tests added this period, regression test commits, and trend deltas. If test ratio drops below 20%, it flags it as a growth area. + +### Example + +``` +You: /retro + +Claude: Week of Mar 1: 47 commits (3 contributors), 3.2k LOC, 38% tests, 12 PRs, peak: 10pm | Streak: 47d + + ## Your Week + 32 commits, +2.4k LOC, 41% tests. Peak hours: 9-11pm. 
+ Biggest ship: cookie import system (browser decryption + picker UI). + What you did well: shipped a complete feature with encryption, UI, and + 18 unit tests in one focused push... + + ## Team Breakdown + + ### Alice + 12 commits focused on app/services/. Every PR under 200 LOC — disciplined. + Opportunity: test ratio at 12% — worth investing before payment gets more complex. + + ### Bob + 3 commits — fixed the N+1 query on dashboard. Small but high-impact. + Opportunity: only 1 active day this week — check if blocked on anything. + + [Top 3 team wins, 3 things to improve, 3 habits for next week] +``` + +It saves a JSON snapshot to `.context/retros/` so the next run can show trends. + +--- + +## `/document-release` + +This is my **technical writer mode**. + +After `/ship` creates the PR but before it merges, `/document-release` reads every documentation file in the project and cross-references it against the diff. It updates file paths, command lists, project structure trees, and anything else that drifted. Risky or subjective changes get surfaced as questions — everything else is handled automatically. + +``` +You: /document-release + +Claude: Analyzing 21 files changed across 3 commits. Found 8 documentation files. + + README.md: updated skill count from 9 to 10, added new skill to table + CLAUDE.md: added new directory to project structure + CONTRIBUTING.md: current — no changes needed + TODOS.md: marked 2 items complete, added 1 new item + + All docs updated and committed. PR body updated with doc diff. +``` + +It also polishes CHANGELOG voice (without ever overwriting entries), cleans up completed TODOS, checks cross-doc consistency, and asks about VERSION bumps only when appropriate. + +--- + +## Greptile integration + +[Greptile](https://greptile.com) is a YC company that reviews your PRs automatically. It catches real bugs — race conditions, security issues, things that pass CI and blow up in production. It has genuinely saved my ass more than once. I love these guys. + +### Setup + +Install Greptile on your GitHub repo at [greptile.com](https://greptile.com) — it takes about 30 seconds. Once it's reviewing your PRs, gstack picks up its comments automatically. No additional configuration. + +### How it works + +The problem with any automated reviewer is triage. Greptile is good, but not every comment is a real issue. Some are false positives. Some flag things you already fixed three commits ago. Without a triage layer, the comments pile up and you start ignoring them — which defeats the purpose. + +gstack solves this. `/review` and `/ship` are now Greptile-aware. They read Greptile's comments, classify each one, and take action: + +- **Valid issues** get added to the critical findings and fixed before shipping +- **Already-fixed issues** get an auto-reply acknowledging the catch +- **False positives** get pushed back — you confirm, and a reply goes out explaining why it's wrong + +The result is a two-layer review: Greptile catches things asynchronously on the PR, then `/review` and `/ship` triage those findings as part of the normal workflow. Nothing falls through the cracks. + +### Learning from history + +Every false positive you confirm gets saved to `~/.gstack/greptile-history.md`. Future runs auto-skip known FP patterns for your codebase. And `/retro` tracks Greptile's batting average over time — so you can see whether the signal-to-noise ratio is improving. + +### Example + +``` +You: /ship + +Claude: [syncs main, runs tests, pre-landing review...] 
+ + Greptile found 3 comments on this PR: + + [VALID] app/services/payment_service.rb:47 — Race condition: + concurrent charges can double-debit. Recommend DB-level advisory + lock around the charge block. + + [ALREADY FIXED] app/models/user.rb:12 — Missing null check + on email before downcase. + → Already fixed in commit abc1234. Auto-replying. + + [FALSE POSITIVE] lib/auth.rb:88 — "Token comparison should + use constant-time comparison." We already use + ActiveSupport::SecurityUtils.secure_compare here. + + Fixing the race condition in payment_service.rb... + + The auth token comment is a false positive. + Options: A) Reply to Greptile explaining B) Fix anyway C) Ignore + +You: A + +Claude: Replied to Greptile. All tests pass. + PR: github.com/you/app/pull/42 +``` + +Three Greptile comments. One real fix. One auto-acknowledged. One false positive pushed back with a reply. Total extra time: about 30 seconds. diff --git a/document-release/SKILL.md b/document-release/SKILL.md index 694d64da..bc8ff4ca 100644 --- a/document-release/SKILL.md +++ b/document-release/SKILL.md @@ -29,22 +29,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... 
C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index 7f48cbe7..d7953a92 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. 
+Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. 
+
+**Anti-patterns — DON'T do this:**
+- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.)
+- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.)
+- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.)
+- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.")
+
 ## Contributor Mode
 
 If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better.
@@ -109,6 +148,7 @@ But your posture depends on what the user needs:
 * SELECTIVE EXPANSION: You are a rigorous reviewer who also has taste. Hold the current scope as your baseline — make it bulletproof. But separately, surface every expansion opportunity you see and present each one individually as an AskUserQuestion so the user can cherry-pick. Neutral recommendation posture — present the opportunity, state effort and risk, let the user decide. Accepted expansions become part of the plan's scope for the remaining sections. Rejected ones go to "NOT in scope."
 * HOLD SCOPE: You are a rigorous reviewer. The plan's scope is accepted. Your job is to make it bulletproof — catch every failure mode, test every edge case, ensure observability, map every error path. Do not silently reduce OR expand.
 * SCOPE REDUCTION: You are a surgeon. Find the minimum viable version that achieves the core outcome. Cut everything else. Be ruthless.
+* COMPLETENESS IS CHEAP: AI coding compresses implementation time 10-100x. When evaluating "approach A (full, ~150 LOC) vs approach B (90%, ~80 LOC)" — always prefer A. The 70-line delta costs seconds with CC. "Ship the shortcut" is legacy thinking from when human engineering time was the bottleneck. Boil the lake.
 
 Critical rule: In ALL modes, the user is 100% in control. Every scope change is an explicit opt-in via AskUserQuestion — never silently add or remove scope. Once the user selects a mode, COMMIT to it. Do not silently drift toward a different mode. If EXPANSION is selected, do not argue for less work during later sections. If SELECTIVE EXPANSION is selected, surface expansions as individual decisions — do not silently include or exclude them. If REDUCTION is selected, do not sneak scope back in. Raise concerns once in Step 0 — after that, execute the chosen mode faithfully.
 
 Do NOT make any code changes. Do NOT start implementation. Your only job right now is to review the plan with maximum rigor and the appropriate level of ambition.
@@ -269,6 +309,11 @@ Think ahead to implementation: What decisions will need to be made during implem
 HOUR 4-5 (integration): What will surprise them?
 HOUR 6+ (polish/tests): What will they wish they'd planned for?
 ```
+NOTE: These represent human-team implementation hours. With CC + gstack,
+6 hours of human implementation compresses to ~30-60 minutes. The decisions
+are identical — the implementation speed is ~6-12x faster. Always present
+both scales when discussing effort.
+
 Surface these as questions for the user NOW, not as "figure it out later."
 
 ### 0F. Mode Selection
@@ -548,7 +593,7 @@ For each TODO, describe:
 * **Pros:** What you gain by doing this work.
 * **Cons:** Cost, complexity, or risks of doing it.
 * **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start.
-* **Effort estimate:** S/M/L/XL
+* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L
 * **Priority:** P1/P2/P3
 * **Depends on / blocked by:** Any prerequisites or ordering constraints.
@@ -598,6 +643,7 @@
 List every ASCII diagram in files this plan touches. Still accurate?
   | TODOS.md updates | ___ items proposed |
   | Scope proposals | ___ proposed, ___ accepted (EXP + SEL) |
   | CEO plan | written / skipped (HOLD/REDUCTION) |
+  | Lake Score | X/Y recommendations chose complete option |
   | Diagrams produced | ___ (list types) |
   | Stale diagrams found | ___ |
   | Unresolved decisions | ___ (listed below) |
diff --git a/plan-ceo-review/SKILL.md.tmpl b/plan-ceo-review/SKILL.md.tmpl
index 3165dc5b..8695dd8b 100644
--- a/plan-ceo-review/SKILL.md.tmpl
+++ b/plan-ceo-review/SKILL.md.tmpl
@@ -27,6 +27,7 @@ But your posture depends on what the user needs:
 * SELECTIVE EXPANSION: You are a rigorous reviewer who also has taste. Hold the current scope as your baseline — make it bulletproof. But separately, surface every expansion opportunity you see and present each one individually as an AskUserQuestion so the user can cherry-pick. Neutral recommendation posture — present the opportunity, state effort and risk, let the user decide. Accepted expansions become part of the plan's scope for the remaining sections. Rejected ones go to "NOT in scope."
 * HOLD SCOPE: You are a rigorous reviewer. The plan's scope is accepted. Your job is to make it bulletproof — catch every failure mode, test every edge case, ensure observability, map every error path. Do not silently reduce OR expand.
 * SCOPE REDUCTION: You are a surgeon. Find the minimum viable version that achieves the core outcome. Cut everything else. Be ruthless.
+* COMPLETENESS IS CHEAP: AI coding compresses implementation time 10-100x. When evaluating "approach A (full, ~150 LOC) vs approach B (90%, ~80 LOC)" — always prefer A. The 70-line delta costs seconds with CC. "Ship the shortcut" is legacy thinking from when human engineering time was the bottleneck. Boil the lake.
 
 Critical rule: In ALL modes, the user is 100% in control. Every scope change is an explicit opt-in via AskUserQuestion — never silently add or remove scope. Once the user selects a mode, COMMIT to it. Do not silently drift toward a different mode. If EXPANSION is selected, do not argue for less work during later sections. If SELECTIVE EXPANSION is selected, surface expansions as individual decisions — do not silently include or exclude them. If REDUCTION is selected, do not sneak scope back in. Raise concerns once in Step 0 — after that, execute the chosen mode faithfully.
 
 Do NOT make any code changes. Do NOT start implementation. Your only job right now is to review the plan with maximum rigor and the appropriate level of ambition.
@@ -187,6 +188,11 @@ Think ahead to implementation: What decisions will need to be made during implem
 HOUR 4-5 (integration): What will surprise them?
 HOUR 6+ (polish/tests): What will they wish they'd planned for?
 ```
+NOTE: These represent human-team implementation hours. With CC + gstack,
+6 hours of human implementation compresses to ~30-60 minutes. The decisions
+are identical — the implementation speed is ~6-12x faster. Always present
+both scales when discussing effort.
+
 Surface these as questions for the user NOW, not as "figure it out later."
 
 ### 0F. Mode Selection
@@ -466,7 +472,7 @@ For each TODO, describe:
 * **Pros:** What you gain by doing this work.
 * **Cons:** Cost, complexity, or risks of doing it.
* **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start. -* **Effort estimate:** S/M/L/XL +* **Effort estimate:** S/M/L/XL (human team) → with CC+gstack: S→S, M→S, L→M, XL→L * **Priority:** P1/P2/P3 * **Depends on / blocked by:** Any prerequisites or ordering constraints. @@ -516,6 +522,7 @@ List every ASCII diagram in files this plan touches. Still accurate? | TODOS.md updates | ___ items proposed | | Scope proposals | ___ proposed, ___ accepted (EXP + SEL) | | CEO plan | written / skipped (HOLD/REDUCTION) | + | Lake Score | X/Y recommendations chose complete option | | Diagrams produced | ___ (list types) | | Stale diagrams found | ___ | | Unresolved decisions | ___ (listed below) | diff --git a/plan-design-review/SKILL.md b/plan-design-review/SKILL.md index df5b43e8..f0b2fddd 100644 --- a/plan-design-review/SKILL.md +++ b/plan-design-review/SKILL.md @@ -29,22 +29,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. 
If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index d4e9a83e..d2292af6 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. 
Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) 
+- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. @@ -110,6 +149,8 @@ Before reviewing anything, answer these questions: 3. **Complexity check:** If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts. 4. **TODOS cross-reference:** Read `TODOS.md` if it exists. Are any deferred items blocking this plan? Can any deferred items be bundled into this PR without expanding scope? Does this plan create new work that should be captured as a TODO? +5. **Completeness check:** Is the plan doing the complete version or a shortcut? With AI-assisted coding, the cost of completeness (100% test coverage, full edge case handling, complete error paths) is 10-100x cheaper than with a human team. If the plan proposes a shortcut that saves human-hours but only saves minutes with CC+gstack, recommend the complete version. Boil the lake. + If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. @@ -196,7 +237,7 @@ Follow the AskUserQuestion format from the Preamble above. Additional rules for * **One issue = one AskUserQuestion call.** Never combine multiple issues into one question. * Describe the problem concretely, with file and line references. * Present 2-3 options, including "do nothing" where that's reasonable. -* For each option, specify in one line: effort, risk, and maintenance burden. +* For each option, specify in one line: effort (human: ~X / CC: ~Y), risk, and maintenance burden. If the complete option is only marginally more effort than the shortcut with CC, recommend the complete option. * **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference (DRY, explicit > clever, minimal diff, etc.). * Label with issue NUMBER + option LETTER (e.g., "3A", "3B"). * **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs. @@ -246,6 +287,7 @@ At the end of the review, fill in and display this summary so the user can see a - What already exists: written - TODOS.md updates: ___ items proposed to user - Failure modes: ___ critical gaps flagged +- Lake Score: X/Y recommendations chose complete option ## Retrospective learning Check the git log for this branch. If there are prior commits suggesting a previous review cycle (e.g., review-driven refactors, reverted changes), note what was changed and whether the current plan touches the same areas. 
Be more aggressive reviewing areas that were previously problematic. diff --git a/plan-eng-review/SKILL.md.tmpl b/plan-eng-review/SKILL.md.tmpl index 3433cb75..bf033606 100644 --- a/plan-eng-review/SKILL.md.tmpl +++ b/plan-eng-review/SKILL.md.tmpl @@ -45,6 +45,8 @@ Before reviewing anything, answer these questions: 3. **Complexity check:** If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts. 4. **TODOS cross-reference:** Read `TODOS.md` if it exists. Are any deferred items blocking this plan? Can any deferred items be bundled into this PR without expanding scope? Does this plan create new work that should be captured as a TODO? +5. **Completeness check:** Is the plan doing the complete version or a shortcut? With AI-assisted coding, the cost of completeness (100% test coverage, full edge case handling, complete error paths) is 10-100x cheaper than with a human team. If the plan proposes a shortcut that saves human-hours but only saves minutes with CC+gstack, recommend the complete version. Boil the lake. + If the complexity check triggers (8+ files or 2+ new classes/services), proactively recommend scope reduction via AskUserQuestion — explain what's overbuilt, propose a minimal version that achieves the core goal, and ask whether to reduce or proceed as-is. If the complexity check does not trigger, present your Step 0 findings and proceed directly to Section 1. Always work through the full interactive review: one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. @@ -131,7 +133,7 @@ Follow the AskUserQuestion format from the Preamble above. Additional rules for * **One issue = one AskUserQuestion call.** Never combine multiple issues into one question. * Describe the problem concretely, with file and line references. * Present 2-3 options, including "do nothing" where that's reasonable. -* For each option, specify in one line: effort, risk, and maintenance burden. +* For each option, specify in one line: effort (human: ~X / CC: ~Y), risk, and maintenance burden. If the complete option is only marginally more effort than the shortcut with CC, recommend the complete option. * **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference (DRY, explicit > clever, minimal diff, etc.). * Label with issue NUMBER + option LETTER (e.g., "3A", "3B"). * **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs. @@ -181,6 +183,7 @@ At the end of the review, fill in and display this summary so the user can see a - What already exists: written - TODOS.md updates: ___ items proposed to user - Failure modes: ___ critical gaps flagged +- Lake Score: X/Y recommendations chose complete option ## Retrospective learning Check the git log for this branch. If there are prior commits suggesting a previous review cycle (e.g., review-driven refactors, reverted changes), note what was changed and whether the current plan touches the same areas. Be more aggressive reviewing areas that were previously problematic. 
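
The dual-scale effort line required in review options ("effort (human: ~X / CC: ~Y)") can be derived mechanically from the compression reference table. Here is a minimal sketch, assuming the table's ratios as rough order-of-magnitude values; the task-type keys and the `dualEstimate` helper are hypothetical illustrations, not part of gstack:

```typescript
// Derive a CC+gstack estimate from a human-team estimate using the
// compression reference table. Ratios are rough order-of-magnitude values.
const COMPRESSION: Record<string, number> = {
  boilerplate: 100, // scaffolding, generated code
  tests: 50,
  feature: 30,
  bugfix: 20,
  architecture: 5,
  research: 3,
};

// Format an effort estimate that always shows both scales.
function dualEstimate(taskType: string, humanHours: number): string {
  const ratio = COMPRESSION[taskType] ?? 10; // conservative default for unknown types
  const ccMinutes = Math.max(5, Math.round((humanHours * 60) / ratio));
  return `(human: ~${humanHours}h / CC: ~${ccMinutes} min)`;
}

console.log(dualEstimate("feature", 40)); // "(human: ~40h / CC: ~80 min)"
```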
diff --git a/qa-design-review/SKILL.md b/qa-design-review/SKILL.md index 7044c560..1d6200c5 100644 --- a/qa-design-review/SKILL.md +++ b/qa-design-review/SKILL.md @@ -31,22 +31,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. 
An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/qa-only/SKILL.md b/qa-only/SKILL.md index 4fa0cf04..594979b9 100644 --- a/qa-only/SKILL.md +++ b/qa-only/SKILL.md @@ -27,22 +27,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. 
**Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. 
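
Put together, the AskUserQuestion format above produces questions like the following. The project, numbers, and options are invented purely for illustration:

```
Project: myapp, branch: feature/csv-export, task: add report exports.

We need to decide how exports handle very large reports. Think of it like
printing a 10,000-page book: we can stream it page by page, or build the
whole book in memory and hope it fits.

RECOMMENDATION: Choose A because streaming handles every report size and
only costs minutes more with CC.

A) Stream rows to the file (Completeness: 9/10) (human: ~2 days / CC: ~20 min)
B) Build in memory, cap at 10k rows (Completeness: 5/10) (human: ~1 day / CC: ~15 min)
C) Do nothing, keep the current 1k-row cap (Completeness: 3/10)
```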
diff --git a/qa/SKILL.md b/qa/SKILL.md index c01514cf..10e50715 100644 --- a/qa/SKILL.md +++ b/qa/SKILL.md @@ -33,22 +33,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. 
An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/retro/SKILL.md b/retro/SKILL.md index e7cd3d2c..71eab988 100644 --- a/retro/SKILL.md +++ b/retro/SKILL.md @@ -27,22 +27,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. 
**Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. 
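
The "Lake Score" line that review summaries report ("X/Y recommendations chose complete option") is a simple tally. A minimal sketch, assuming a completeness score of 8 or above counts as the complete option (the `Decision` shape is hypothetical; gstack does not prescribe one):

```typescript
// Tally how many accepted recommendations chose the complete option.
// Assumption: completeness >= 8 counts as "complete", matching the
// calibration note ("if both options are 8+, pick the higher").
interface Decision {
  completeness: number; // the 1-10 score of the option the user chose
}

function lakeScore(decisions: Decision[]): string {
  const complete = decisions.filter((d) => d.completeness >= 8).length;
  return `Lake Score: ${complete}/${decisions.length} recommendations chose complete option`;
}

console.log(lakeScore([{ completeness: 9 }, { completeness: 5 }, { completeness: 8 }]));
// "Lake Score: 2/3 recommendations chose complete option"
```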
diff --git a/review/SKILL.md b/review/SKILL.md
index 3ad683d6..186978ef 100644
--- a/review/SKILL.md
+++ b/review/SKILL.md
@@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
 _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
 _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
 echo "BRANCH: $_BRANCH"
+_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
+echo "LAKE_INTRO: $_LAKE_SEEN"
 ```

 If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue.

+If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
+Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete
+thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean"
+Then offer to open the essay in their default browser:
+
+```bash
+open https://garryslist.org/posts/boil-the-ocean
+touch ~/.gstack/.completeness-intro-seen
+```
+
+Only run `open` if the user says yes. Always run `touch` to mark the intro as seen. This only happens once.
+
 ## AskUserQuestion Format

 **ALWAYS follow this structure for every AskUserQuestion call:**

 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called.
-3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]`
-4. **Options:** Lettered options: `A) ... B) ... C) ...`
+3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers the happy path but skips some edges, 3 = shortcut that defers significant work. If both options score 8+, pick the higher; if an option scores ≤5, flag it.
+4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)`.

 Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex.

 Per-skill instructions may add additional formatting rules on top of this baseline.

+## Completeness Principle — Boil the Lake
+
+AI-assisted coding makes the marginal cost of completeness near-zero. When you present options:
+
+- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more.
+- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope.
+- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference:
+
+| Task type | Human team | CC+gstack | Compression |
+|-----------|-----------|-----------|-------------|
+| Boilerplate / scaffolding | 2 days | 15 min | ~100x |
+| Test writing | 1 day | 15 min | ~50x |
+| Feature implementation | 1 week | 30 min | ~30x |
+| Bug fix + regression test | 4 hours | 15 min | ~20x |
+| Architecture / design | 2 days | 4 hours | ~5x |
+| Research / exploration | 1 day | 3 hours | ~3x |
+
+- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds.
+
+**Anti-patterns — DON'T do this:**
+- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.)
+- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.)
+- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.)
+- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.")
+
 ## Contributor Mode

 If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better.
diff --git a/review/checklist.md b/review/checklist.md
index e2246050..282c9944 100644
--- a/review/checklist.md
+++ b/review/checklist.md
@@ -84,6 +84,12 @@ To do this: use Grep to find all references to the sibling values (e.g., grep fo
 - `.expects(:something).never` missing when a code path should explicitly NOT call an external service
 - Security enforcement features (blocking, rate limiting, auth) without integration tests verifying the enforcement path works end-to-end

+#### Completeness Gaps
+- Shortcut implementations where the complete version would cost <30 minutes CC time (e.g., partial enum handling, incomplete error paths, missing edge cases that are straightforward to add)
+- Options presented with only human-team effort estimates — should show both human and CC+gstack time
+- Test coverage gaps where adding the missing tests is a "lake" not an "ocean" (e.g., missing negative-path tests, missing edge case tests that mirror happy-path structure)
+- Features implemented at 80-90% when 100% is achievable with modest additional code
+
 #### Crypto & Entropy
 - Truncation of data instead of hashing (last N chars instead of SHA-256) — less entropy, easier collisions
 - `rand()` / `Random.rand` for security-sensitive values — use `SecureRandom` instead
@@ -113,6 +119,7 @@ CRITICAL (highest severity):
 INFORMATIONAL (lower severity):
  ├─ LLM Output Trust Boundary
  ├─ Dead Code & Consistency
  └─ Enum & Value Completeness
  ├─ LLM Prompt Issues
  ├─ Test Gaps
+ ├─ Completeness Gaps
  ├─ Crypto & Entropy
  ├─ Time Window Safety
  ├─ Type Coercion at Boundaries
$_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" \`\`\` If output shows \`UPGRADE_AVAILABLE \`: read \`~/.claude/skills/gstack/gstack-upgrade/SKILL.md\` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If \`JUST_UPGRADED \`: tell user "Running gstack v{to} (just updated!)" and continue. +If \`LAKE_INTRO\` is \`no\`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +\`\`\`bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +\`\`\` + +Only run \`open\` if the user says yes. Always run \`touch\` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the \`_BRANCH\` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** \`RECOMMENDATION: Choose [X] because [one-line reason]\` -4. **Options:** Lettered options: \`A) ... B) ... C) ...\` +3. **Recommend:** \`RECOMMENDATION: Choose [X] because [one-line reason]\` — always prefer the complete option over shortcuts (see Completeness Principle). Include \`Completeness: X/10\` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: \`A) ... B) ... C) ...\` — when an option involves effort, show both scales: \`(human: ~X / CC: ~Y)\` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. 
The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If \`_CONTRIB\` is \`true\`: you are in **contributor mode**. You're a gstack user who also helps make it better. diff --git a/setup-browser-cookies/SKILL.md b/setup-browser-cookies/SKILL.md index faebe520..da64dec4 100644 --- a/setup-browser-cookies/SKILL.md +++ b/setup-browser-cookies/SKILL.md @@ -25,22 +25,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") echo "BRANCH: $_BRANCH" +_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") +echo "LAKE_INTRO: $_LAKE_SEEN" ``` If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. +If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. +Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete +thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean" +Then offer to open the essay in their default browser: + +```bash +open https://garryslist.org/posts/boil-the-ocean +touch ~/.gstack/.completeness-intro-seen +``` + +Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once. + ## AskUserQuestion Format **ALWAYS follow this structure for every AskUserQuestion call:** 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences) 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called. -3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` -4. **Options:** Lettered options: `A) ... B) ... C) ...` +3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). 
Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers happy path but skips some edges, 3 = shortcut that defers significant work. If both options are 8+, pick the higher; if one is ≤5, flag it. +4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)` Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex. Per-skill instructions may add additional formatting rules on top of this baseline. +## Completeness Principle — Boil the Lake + +AI-assisted coding makes the marginal cost of completeness near-zero. When you present options: + +- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more. +- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope. +- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference: + +| Task type | Human team | CC+gstack | Compression | +|-----------|-----------|-----------|-------------| +| Boilerplate / scaffolding | 2 days | 15 min | ~100x | +| Test writing | 1 day | 15 min | ~50x | +| Feature implementation | 1 week | 30 min | ~30x | +| Bug fix + regression test | 4 hours | 15 min | ~20x | +| Architecture / design | 2 days | 4 hours | ~5x | +| Research / exploration | 1 day | 3 hours | ~3x | + +- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds. + +**Anti-patterns — DON'T do this:** +- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.) +- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.) +- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.) +- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.") + ## Contributor Mode If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better. 
diff --git a/ship/SKILL.md b/ship/SKILL.md
index 32582088..e2b524d9 100644
--- a/ship/SKILL.md
+++ b/ship/SKILL.md
@@ -28,22 +28,61 @@ find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
 _CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true)
 _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
 echo "BRANCH: $_BRANCH"
+_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
+echo "LAKE_INTRO: $_LAKE_SEEN"
 ```

 If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue.

+If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
+Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete
+thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean"
+Then offer to open the essay in their default browser:
+
+```bash
+open https://garryslist.org/posts/boil-the-ocean
+touch ~/.gstack/.completeness-intro-seen
+```
+
+Only run `open` if the user says yes. Always run `touch` to mark the intro as seen. This only happens once.
+
 ## AskUserQuestion Format

 **ALWAYS follow this structure for every AskUserQuestion call:**

 1. **Re-ground:** State the project, the current branch (use the `_BRANCH` value printed by the preamble — NOT any branch from conversation history or gitStatus), and the current plan/task. (1-2 sentences)
 2. **Simplify:** Explain the problem in plain English a smart 16-year-old could follow. No raw function names, no internal jargon, no implementation details. Use concrete examples and analogies. Say what it DOES, not what it's called.
-3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]`
-4. **Options:** Lettered options: `A) ... B) ... C) ...`
+3. **Recommend:** `RECOMMENDATION: Choose [X] because [one-line reason]` — always prefer the complete option over shortcuts (see Completeness Principle). Include `Completeness: X/10` for each option. Calibration: 10 = complete implementation (all edge cases, full coverage), 7 = covers the happy path but skips some edges, 3 = shortcut that defers significant work. If both options score 8+, pick the higher; if an option scores ≤5, flag it.
+4. **Options:** Lettered options: `A) ... B) ... C) ...` — when an option involves effort, show both scales: `(human: ~X / CC: ~Y)`.

 Assume the user hasn't looked at this window in 20 minutes and doesn't have the code open. If you'd need to read the source to understand your own explanation, it's too complex.

 Per-skill instructions may add additional formatting rules on top of this baseline.

+## Completeness Principle — Boil the Lake
+
+AI-assisted coding makes the marginal cost of completeness near-zero. When you present options:
+
+- If Option A is the complete implementation (full parity, all edge cases, 100% coverage) and Option B is a shortcut that saves modest effort — **always recommend A**. The delta between 80 lines and 150 lines is meaningless with CC+gstack. "Good enough" is the wrong instinct when "complete" costs minutes more.
+- **Lake vs. ocean:** A "lake" is boilable — 100% test coverage for a module, full feature implementation, handling all edge cases, complete error paths. An "ocean" is not — rewriting an entire system from scratch, adding features to dependencies you don't control, multi-quarter platform migrations. Recommend boiling lakes. Flag oceans as out of scope.
+- **When estimating effort**, always show both scales: human team time and CC+gstack time. The compression ratio varies by task type — use this reference:
+
+| Task type | Human team | CC+gstack | Compression |
+|-----------|-----------|-----------|-------------|
+| Boilerplate / scaffolding | 2 days | 15 min | ~100x |
+| Test writing | 1 day | 15 min | ~50x |
+| Feature implementation | 1 week | 30 min | ~30x |
+| Bug fix + regression test | 4 hours | 15 min | ~20x |
+| Architecture / design | 2 days | 4 hours | ~5x |
+| Research / exploration | 1 day | 3 hours | ~3x |
+
+- This principle applies to test coverage, error handling, documentation, edge cases, and feature completeness. Don't skip the last 10% to "save time" — with AI, that 10% costs seconds.
+
+**Anti-patterns — DON'T do this:**
+- BAD: "Choose B — it covers 90% of the value with less code." (If A is only 70 lines more, choose A.)
+- BAD: "We can skip edge case handling to save time." (Edge case handling costs minutes with CC.)
+- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.)
+- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.")
+
 ## Contributor Mode

 If `_CONTRIB` is `true`: you are in **contributor mode**. You're a gstack user who also helps make it better.
diff --git a/test/skill-e2e.test.ts b/test/skill-e2e.test.ts
index 2ea56da9..021e41da 100644
--- a/test/skill-e2e.test.ts
+++ b/test/skill-e2e.test.ts
@@ -1886,7 +1886,7 @@ Write DESIGN.md and CLAUDE.md (or update it) in the working directory.`,

 This is a civic tech data platform called CivicPulse. Read the README.md.

-DO research competitors before proposing — search for civic tech and government data platform designs. Skip the font preview page. Skip any AskUserQuestion calls — this is non-interactive.
+DO research what's out there before proposing — search for civic tech and government data platform designs. Skip the font preview page. Skip any AskUserQuestion calls — this is non-interactive.

 Write DESIGN.md to the working directory.`,
   workingDirectory: designDir,
diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts
index 54e03a4d..81d97d31 100644
--- a/test/skill-validation.test.ts
+++ b/test/skill-validation.test.ts
@@ -617,6 +617,43 @@ describe('Enum & Value Completeness in review checklist', () => {
   });
 });

+// --- Completeness Principle spot-check ---
+
+describe('Completeness Principle in generated SKILL.md files', () => {
+  const skillsWithPreamble = [
+    'SKILL.md', 'browse/SKILL.md', 'qa/SKILL.md',
+    'qa-only/SKILL.md',
+    'setup-browser-cookies/SKILL.md',
+    'ship/SKILL.md', 'review/SKILL.md',
+    'plan-ceo-review/SKILL.md', 'plan-eng-review/SKILL.md',
+    'retro/SKILL.md',
+    'plan-design-review/SKILL.md',
+    'qa-design-review/SKILL.md',
+    'design-consultation/SKILL.md',
+    'document-release/SKILL.md',
+  ];
+
+  for (const skill of skillsWithPreamble) {
+    test(`${skill} contains Completeness Principle section`, () => {
+      const content = fs.readFileSync(path.join(ROOT, skill), 'utf-8');
+      expect(content).toContain('Completeness Principle');
+      expect(content).toContain('Boil the Lake');
+    });
+  }
+
+  test('Completeness Principle includes compression table', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    expect(content).toContain('CC+gstack');
+    expect(content).toContain('Compression');
+  });
+
+  test('Completeness Principle includes anti-patterns', () => {
+    const content = fs.readFileSync(path.join(ROOT, 'SKILL.md'), 'utf-8');
+    expect(content).toContain('BAD:');
+    expect(content).toContain('Anti-patterns');
+  });
+});
+
 // --- Part 7: Planted-bug fixture validation (A4) ---

 describe('Planted-bug fixture validation', () => {
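For a quick manual mirror of the spot-check above, a shell loop from the repo root can flag any generated skill that lost the section (a sketch only; the canonical check is the test, and the file list simply repeats `skillsWithPreamble`):

```bash
# Spot-check that every preamble-bearing SKILL.md carries the Completeness Principle.
for skill in SKILL.md browse/SKILL.md qa/SKILL.md qa-only/SKILL.md \
             setup-browser-cookies/SKILL.md ship/SKILL.md review/SKILL.md \
             plan-ceo-review/SKILL.md plan-eng-review/SKILL.md retro/SKILL.md \
             plan-design-review/SKILL.md qa-design-review/SKILL.md \
             design-consultation/SKILL.md document-release/SKILL.md; do
  grep -q 'Completeness Principle' "$skill" && grep -q 'Boil the Lake' "$skill" \
    || echo "MISSING: $skill"
done
```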