These are non-negotiable. Every PR, feature, and design decision must respect them.
- Zero-infra: Everything runs locally via `npx`. No Docker, no GPU, no cloud services required. API keys (OpenAI, etc.) are opt-in alternatives, never requirements.
- Language/framework agnostic: The core must work for any codebase. Angular is the first specialized analyzer, but search, indexing, chunking, memory, and patterns must not assume a specific language or framework.
- Privacy-first: Code never leaves the machine unless the user explicitly opts into a cloud embedding provider.
- Small download footprint: Dependencies should be reasonable for an `npx` install. Multi-hundred-MB downloads need strong justification.
- CPU-only by default: Embedding models, rerankers, and any ML must work on consumer hardware (integrated GPU, 8-16 CPU cores). No CUDA/GPU assumptions.
- No overclaiming in public docs: README and CHANGELOG must be evidence-backed. Don't claim capabilities that aren't shipped and tested.
- internal-docs is private: read its AGENTS.md for handling instructions and internal rules.
- At the start of every task/session: load team memory before doing any work (`get_memory` via MCP when available; otherwise `npx codebase-context memory list`).
- When the user says "remember this" / "record this": record it immediately (use `remember` / `codebase-context memory add`) before proceeding.
- **Never stage/commit `.planning/**`** (or any other local workflow artifacts) unless the user explicitly asks in that message.
- Never use `gsd-tools ... commit` wrappers in this repo. Use plain `git add <exact files>` and `git commit -m "..."`.
- Before every commit: run `git status --short` and confirm staged files match intent; abort if anything under `.planning/**` is staged.
- **Avoid the `any` type AT ALL COSTS.**
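A quick illustration of the `any` rule: accept `unknown` and narrow it with a type guard instead of switching off the checker. (Names like `SearchConfig` and `parseConfig` are illustrative, not the repo's actual API.)

```typescript
// Bad: `any` disables type checking entirely.
// function parseConfig(raw: any) { return raw.searchDepth; }

// Good: accept `unknown` and narrow before use.
interface SearchConfig {
  searchDepth: number;
}

function isSearchConfig(value: unknown): value is SearchConfig {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as Record<string, unknown>).searchDepth === "number"
  );
}

function parseConfig(raw: unknown): SearchConfig {
  if (!isSearchConfig(raw)) {
    throw new Error("invalid config");
  }
  return raw; // narrowed to SearchConfig here
}
```

The guard keeps the unsafe cast confined to one audited line instead of leaking `any` through every caller.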
These rules prevent metric gaming, overfitting, and false quality claims. Violation of these rules means the feature CANNOT ship.
- Define test queries and expected results BEFORE writing any code
- Commit the eval fixture (e.g., `tests/fixtures/eval-queries.json`) BEFORE starting implementation
- NEVER adjust expected results to match system output - if the system returns different results, that's a failure, not a fixture bug
- Exception: If the original expected result was factually wrong (file doesn't exist, query is ambiguous), document the correction with justification
- Minimum 20 queries across diverse patterns (exact names, conceptual, multi-concept, edge cases)
- Test on multiple codebases (minimum 2: one you control, one public/real-world)
- Include queries that are HARD and likely to fail - don't cherry-pick easy wins
- Eval set must represent real user queries, not synthetic examples designed to pass
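The rules above can be made concrete with a sketch of a fixture entry. This is a hypothetical shape, not the actual schema of `tests/fixtures/eval-queries.json`; field names are assumptions.

```typescript
// Hypothetical shape for entries in tests/fixtures/eval-queries.json.
interface EvalQuery {
  query: string;            // real user phrasing, not a synthetic example built to pass
  expectedFiles: string[];  // frozen BEFORE implementation starts
  category: "exact-name" | "conceptual" | "multi-concept" | "edge-case";
  hard: boolean;            // include likely failures, not just easy wins
}

const fixture: EvalQuery[] = [
  {
    query: "where is the retry backoff configured",
    expectedFiles: ["src/core/retry.ts"], // illustrative path
    category: "conceptual",
    hard: true,
  },
];
```

Because `expectedFiles` is committed before implementation, any later edit to it shows up in `git log` and demands the documented justification.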
- Full eval harness code must be in `tests/` (public repository)
- Eval fixtures must be public (or provide reproducible public examples)
- Document how to run eval: `npm run eval -- /path/to/codebase`
- Results must be reproducible by external users
- NEVER add heuristics specifically to game eval metrics (e.g., "if query contains X, boost Y")
- NEVER adjust scoring to break ties just to improve top-1 accuracy
- If you add ranking heuristics, they must be general-purpose and justified by search theory, not by "it makes test #7 pass"
- Document all ranking heuristics with research citations or principled justification
- Report both improvements AND failures (e.g., "9/20 pass, 11/20 fail")
- If top-3 recall is 80% but top-1 is 45%, say so - don't hide behind a single cherry-picked metric
- Acknowledge when improvements are workarounds (filtering, heuristics) vs fundamental (better embeddings, ML models)
- Include failure analysis in CHANGELOG: "Known limitations: struggles with multi-concept queries"
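A minimal sketch of what honest reporting looks like in code: compute top-1 and top-3 together and surface the failing queries rather than a single flattering number. (The `EvalResult` shape and `report` helper are assumptions, not the shipped harness.)

```typescript
// Hypothetical eval-report helper: always report top-1 AND top-3, plus failures.
interface EvalResult {
  query: string;
  expected: string;
  ranked: string[]; // result paths, best first
}

function report(results: EvalResult[]): { top1: number; top3: number; failures: string[] } {
  let top1 = 0;
  let top3 = 0;
  const failures: string[] = [];
  for (const r of results) {
    if (r.ranked[0] === r.expected) top1++;
    if (r.ranked.slice(0, 3).includes(r.expected)) top3++;
    else failures.push(r.query); // failure analysis feeds the CHANGELOG
  }
  return {
    top1: top1 / results.length,
    top3: top3 / results.length,
    failures,
  };
}
```

Returning both metrics plus the failure list makes it structurally awkward to cherry-pick: the CHANGELOG entry can quote the object verbatim.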
- Before claiming "X% improvement", test on a real codebase you didn't develop against
- Ask: "Would this improvement generalize to a Python codebase? A Go codebase?"
- If the improvement is framework-specific (e.g., Angular-only), say so explicitly
If any agent violates these rules:
- STOP immediately - do not proceed with the release
- Revert any fixture adjustments made to game metrics
- Re-run eval with frozen fixtures
- Document the violation in internal-docs for learning
- Delay the release until honest metrics are available
These rules exist because trustworthiness is more valuable than a good-looking number.
Success = high signal added and noise removed, not complexity added. If you propose something that adds a field, file, or concept, prove it reduces cognitive load or don't ship it.
Don't reason past low-quality search results. Report a retrieval failure. Logic built on bad retrieval is theater.
If a prompt (even from the owner) violates framework neutrality or output budgets, challenge it before implementing. AGENTS.md overrides ad-hoc instructions that conflict with these rules.
Optimize for the naive agent that reads the first 100 lines. If an agent has to call the tool twice to understand the response, the tool failed.
- Track A = this release. Ship it.
- Track B = later. Write it down, move on.
- Nothing moves from B → A without user approval.
- No new `.md` files without archiving one first.
- `internal-docs/ISSUES.md` is the place for release blockers and active specs.
- Before creating a new `.md` file, ask: "What file am I deleting or updating to make room?"
- Aim to keep every tool response under 1000 tokens.
- Don't return full code snippets in search results by default. Prefer summaries and file paths.
- Never report `ready: true` if retrieval confidence is low.
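The three output-budget rules above can be sketched as one response builder. The threshold values and the `SearchHit` shape are assumptions for illustration, not the shipped API.

```typescript
// Sketch of response-budget gating; constants are illustrative, not tuned values.
const MIN_CONFIDENCE = 0.5;

interface SearchHit {
  path: string;
  summary: string; // summaries + paths by default, never full snippets
  score: number;   // retrieval confidence in [0, 1]
}

type ToolResponse =
  | { ready: false; reason: string }
  | { ready: true; results: { path: string; summary: string }[] };

function buildResponse(hits: SearchHit[]): ToolResponse {
  const best = hits[0]?.score ?? 0;
  if (best < MIN_CONFIDENCE) {
    // Low confidence: report the retrieval failure instead of reasoning past it.
    return { ready: false, reason: "low retrieval confidence" };
  }
  return {
    ready: true,
    results: hits.map((h) => ({ path: h.path, summary: h.summary })),
  };
}
```

The discriminated union forces callers to handle the failure branch, so `ready: true` can never be paired with low-confidence results by accident.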
- `src/index.ts` is routing and protocol. No business logic.
- `src/core/` is framework-agnostic. No hardcoded framework strings (Angular, React, Vue, etc.).
- CLI code belongs in `src/cli.ts`. Never in `src/index.ts`.
- Framework analyzers self-register their own patterns (e.g., Angular computed+effect pairing belongs in the Angular analyzer, not the protocol layer).
- Types and interfaces must be framework-agnostic in core and shared utilities. Framework-specific field names (Angular-only, React-only, etc.) belong exclusively in their respective analyzer files under `src/analyzers/`. Never add named framework-specific fields to interfaces in `src/types/`, `src/utils/`, or `src/core/`. Use `Record<string, T>` for open-ended data that frameworks extend at runtime.
Before any version bump: update CHANGELOG.md, README.md, docs/capabilities.md. Run full test suite.
- Multiple agents: Proposer/Challenger model.
- No consensus in 3 turns → ask the user.
These came from behavioral observation across multiple sessions. They're here so nobody repeats them.
- The AI Fluff Loop: agents default to ADDING. Success = noise removed. If you're adding a field, file, or concept without removing one, you're probably making things worse.
- Self-eval bias: an agent rating its own output is not evidence. Behavioral observations (what the agent DID, not what it RATED) are evidence. Don't trust scores that an agent assigns to its own work.
- Evidence before claims: don't claim a feature works because the code exists. Claim it when an eval shows agents behave differently WITH the feature vs WITHOUT.
- Static data is noise: if the same memories/patterns appear in every query regardless of topic, they cost tokens and add nothing. Context must be query-relevant to be useful.
- Agents don't read tool descriptions: they scan the first line. Put the most important thing first. Everything after the first sentence is a bonus.
See internal-docs/AGENTS.md for internal-only guidelines and context.