diff --git a/docs-site/public/files/review_code_skill.md b/docs-site/public/files/review_code_skill.md new file mode 100644 index 00000000..f06760ac --- /dev/null +++ b/docs-site/public/files/review_code_skill.md @@ -0,0 +1,301 @@ +--- +name: review-code +description: Review code for quality, best practices, and project standards. Use when reviewing PRs, checking code quality, or validating implementation against project guidelines. +--- + +# Code Review Guidelines + +## General Principles + +- Write correct, best practice code adhering to DRY principles +- Prioritize readability over performance +- Implement all requested functionality completely +- Avoid TODOs, placeholders, or incomplete sections +- Include all required imports +- Acknowledge limitations when appropriate + +## Comments Policy + +- New code must not contain stupid comments that add no value +- Do not remove useful existing comments +- No historical comments like `# removed X from here` - if it's removed, it's gone +- Only add comments where logic isn't self-evident +- Don't add docstrings, comments, or type annotations to code you didn't change + +## Documentation: Current State Only + +Code, comments, and documentation should describe **how things work now**, not migration history: + +```python +# BAD: Historical context that doesn't help understanding +# We migrated from the old system where we used X, now we use Y +# Previously this was in module Z but we moved it here + +# GOOD: Just describe current behavior +# Validates user input before processing +``` + +If migration context is truly needed (e.g., for deprecation warnings), keep it minimal and actionable. + +## Formatting Rules + +**CRITICAL**: Never fix formatting issues by rewriting code manually. + +- Python: Run `uv run ruff check --fix` +- TypeScript/NextJS: Run `pnpm run lint:fix` + +## Avoid Over-Engineering + +- Only make changes directly requested or clearly necessary +- Don't add features, refactor code, or make "improvements" beyond what was asked +- A bug fix doesn't need surrounding code cleaned up +- A simple feature doesn't need extra configurability +- Don't add error handling for scenarios that can't happen +- Trust internal code and framework guarantees +- Only validate at system boundaries (user input, external APIs) +- Don't create helpers/utilities/abstractions for one-time operations +- Don't design for hypothetical future requirements +- Three similar lines of code is better than a premature abstraction + +## No Overly Defensive Coding + +Don't add defensive checks "just in case". Trust the types and explore them first: + +```python +# BAD: Defensive chains when types guarantee the data exists +value = thing.get("key", {}).get("nested", {}).get("value") if thing else None + +# GOOD: Trust the types, access directly +value = thing["key"]["nested"]["value"] + +# BAD: Checking for None when type says it's not optional +if user and user.email: + send_email(user.email) + +# GOOD: Type says User has email, just use it +send_email(user.email) +``` + +Before adding defensive code: +1. Check the types - is the field actually optional? +2. Trace where the data comes from - is it validated upstream? +3. If truly uncertain, add proper type narrowing, not silent fallbacks + +## No Backwards Compatibility (Unless Explicitly Requested) + +**We don't do backwards compatibility by default.** Keep implementations clean: + +- No "keeping this for backwards compatibility" code paths +- No deprecated parameter shims +- No re-exporting moved types/functions +- No renaming unused `_vars` to preserve signatures +- No `// removed` comments for deleted code +- If something is unused, delete it completely +- If an API changes, update all callers + +Only add backwards compatibility if the user explicitly asks for it (e.g., public API with external consumers). + +--- + +# Python Guidelines + +## Standards + +- Python 3.13 with modern type hints and generics +- Package management: `uv` +- Always use builtin typehints: `dict[str, int]` not `Dict` +- Prefer Pydantic models over bare dicts and dataclasses +- Prefer async functionality in asynchronous contexts +- Prefer functional and declarative patterns over objects + +## Type Safety + +- Type casting (`cast()`, `# type: ignore`) is an absolute last resort +- Avoid `Any` - find the correct type or use proper generics +- If you need a cast, first try to fix the underlying type issue + +## Exception Formatting + +When formatting exceptions, use `!r` (repr) not `!s` (str): + +```python +# Good +logger.error(f"Failed to process: {e!r}") + +# Bad +logger.error(f"Failed to process: {e}") +``` + +`!r` provides more debugging info (exception type, full details). + +## Import Organization + +- All imports must be at the top of the module +- Inline/local imports are only allowed as a **last resort** for circular imports +- If you have circular imports, first try to refactor modules to break the cycle + +## Testing with pytest + +```bash +uv run pytest +``` + +Principles: +1. Test one thing at a time, no huge test cases +2. No multiple asserts testing different things +3. Use `pytest-*` libs (pytest-mock), not `unittest` +4. Work on a single test at a time +5. Use pytest fixtures or factories +6. Minimize mocking unless absolutely necessary - refactor code for testability + +--- + +# TypeScript/NextJS Guidelines + +## Standards + +- Modern App router (Next.js 15.2+) +- Functional components with React hooks +- Package management: `pnpm` + +**Note:** A Next.js 16 documentation MCP server is available (`nextjs-docs`) which can be used to validate best practices and API usage against official docs. + +## useEffect Anti-Pattern + +**Avoid excessive `useEffect`** - prefer derived state/useMemo. + +Using useEffect for anything but synchronizing with third-party APIs (remote subscriptions, etc.) is wrong. ESPECIALLY for syncing local React state. + +Instead use: +- `useMemo` for derived values +- `key=` prop to reset component state +- Derived state computed during render +- Lifting state up + +Reference: https://react.dev/learn/you-might-not-need-an-effect + +## useMemo Guidelines + +Only use `useMemo` if caching saves meaningful compute. Never `useMemo` for simple one-or-two liners without function calls. + +## Server Actions + +- Implement in dedicated `actions.ts` files +- Prefer server actions over HTTP endpoints +- Abstract logic into lib files, call from server actions + +## Testing + +- Unit/Integration: `pnpm run test` (vitest + react-testing-library) +- E2E: `pnpm run cypress:run` +- Focus on behavior, not implementation details +- Avoid overtesting and very slow tests + +--- + +# Debugging Approach + +1. **Reason first**: Think about the problem and errors before implementing a solution +2. **Avoid workarounds**: When first approach doesn't work, don't add hacks +3. **Gather evidence**: Add logs, use debugging tools to understand what's happening +4. **Minimal reproduction**: Consider implementing a minimal example demonstrating the issue +5. **Test-driven debugging**: Write a test and iterate against that + +--- + +# Code Patterns + +## Functional Over Object-Oriented + +Prefer functional and declarative patterns over objects: +- Easier to test +- Easier to reason about +- Break long code blocks into nicely named functions + +## Data/Rendering Separation (React) + +Separate data preparation logic from TSX rendering: +- Instead of returning same TSX blocks with different data in multiple conditions +- Prepare data in conditions, render once at the end + +```typescript +// Good +const displayData = condition ? dataA : dataB; +return ; + +// Bad +if (condition) { + return ; +} else { + return ; +} +``` + +--- + +# Generated Files + +Never directly edit generated files: +- `@cohort/portal/src/lib/engine_api/generated` - regenerate with `pnpm run openapi-ts` +- `portal/src/lib/db/types.gen.ts` - database types + +--- + +# Tracing Review + +When reviewing code that involves tracing or should have tracing, consider: + +## Opportunities for Tracing + +- **New async workflows**: Should key functions have `@observe()` decorators? +- **Service boundaries**: Is context being propagated correctly? +- **Debugging value**: Would traces help debug this code in production? + +## Trace Size Concerns + +- **`capture_input=True` on large data**: If a function processes lists/dicts of user data (artifacts, tables), capturing full I/O can create 20MB+ traces +- **Manual `update_span(output=...)`**: Is large data being added to spans? +- **Generation spans are OK**: LLM input/output capture is expected and useful + +```python +# Review for: Is artifacts potentially large? +@observe(span_type="tool", capture_input=True) # Could be problematic +async def process_artifacts(artifacts: list[dict]): + ... +``` + +## Context Propagation + +- **Celery tasks**: Context should be serialized in task args (we have harness for this) +- **HTTP calls**: HTTPX auto-propagates (should work automatically) +- **Custom queues/workers**: Need manual `serialize_context()`/`deserialize_context()` + +## Short-Running Scripts + +- Does the script call `force_flush()` before exit? +- Without it, traces may not be exported + +See `use-langfuse/using-tracing.md` for full tracing guidelines. + +--- + +--- + +# Review Checklist + +- [ ] No manual formatting fixes (use linters) +- [ ] No unnecessary useEffect +- [ ] No useless comments, no historical "removed X" comments +- [ ] No TODOs or placeholders +- [ ] Exceptions use `{e!r}` not `{e}` +- [ ] No type casts or Any without good reason +- [ ] Imports at top of module (no inline imports unless circular) +- [ ] No over-engineering +- [ ] No backwards-compatibility hacks +- [ ] Tests focus on behavior, not implementation +- [ ] Generated files not manually edited +- [ ] Type hints use modern Python syntax +- [ ] Functional patterns preferred +- [ ] Tracing: No huge data in `capture_input`/`capture_output` for non-LLM spans +- [ ] Tracing: Context propagation across service boundaries +- [ ] Large files (>100KB): Check `.gitattributes` has LFS auto-tracking; verify with `git lfs ls-files` if unsure diff --git a/docs-site/public/images/claude-review-skill.png b/docs-site/public/images/claude-review-skill.png new file mode 100644 index 00000000..2a437c73 --- /dev/null +++ b/docs-site/public/images/claude-review-skill.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:879bdec24820ba9f94ab2c0a6a124f178d8d1007745d241093006844af6ba713 +size 110240 diff --git a/docs-site/src/app/blog/[slug]/page.tsx b/docs-site/src/app/blog/[slug]/page.tsx index 3e101d38..44084bf7 100644 --- a/docs-site/src/app/blog/[slug]/page.tsx +++ b/docs-site/src/app/blog/[slug]/page.tsx @@ -55,6 +55,13 @@ export default async function BlogPostPage({ params }: PageProps) { {post.authors.join(", ")} )} + {post.heroImage && ( + {post.title} + )} diff --git a/docs-site/src/app/globals.css b/docs-site/src/app/globals.css index f34b275c..fc7779b4 100644 --- a/docs-site/src/app/globals.css +++ b/docs-site/src/app/globals.css @@ -717,6 +717,14 @@ noscript + .installation-tabs .tab-content:first-child { margin-right: 0.75rem; } +.blog-post-hero { + width: 100%; + max-height: 420px; + object-fit: contain; + border-radius: 12px; + margin-bottom: 2rem; +} + .blog-listing-subtitle { font-size: 1rem; color: var(--muted); diff --git a/docs-site/src/blog/claude-review-skill.mdx b/docs-site/src/blog/claude-review-skill.mdx new file mode 100644 index 00000000..f8bd5749 --- /dev/null +++ b/docs-site/src/blog/claude-review-skill.mdx @@ -0,0 +1,168 @@ +--- +title: "What to Put in a Claude Code Skill for Reviewing Your Team's Code" +subtitle: "Every tutorial shows you how to set up AI code review. None of them show you what to actually put in the review prompt. We're publishing ours." +date: 2026-02-25 +description: "We publish the 300-line Claude Code review skill we use in production, covering five key rules: trusting your types, deferring to linters, avoiding over-engineering, skipping backwards compatibility, and encoding domain knowledge." +authors: + - "Christoph Sträter" +tags: + - "Claude Code" + - "AI Code Review" + - "Developer Tools" + - "Skills" +heroImage: /docs/images/claude-review-skill.png +--- + +# What to Put in a Claude Code Skill for Reviewing Your Team's Code + +*Every tutorial shows you how to set up AI code review. None of them show you what to actually put in the review prompt. We're publishing ours [HERE](/docs/files/review_code_skill.md).* + +--- + +Code review is one of those things that scales terribly. Your team grows from four to twelve, PRs pile up, and suddenly your senior engineers are spending half their day leaving the same comments: "run the linter instead of reformatting manually," or "don't add a helper function for something we do once." The knowledge exists, it's just trapped in people's heads, repeated one PR at a time. + +[Claude Code](https://docs.anthropic.com/en/docs/claude-code/overview) and other automated review tools change this equation. Anthropic's coding agent can review pull requests: you `@claude` mention it in a GitHub PR, or wire it into CI via [`claude-code-action`](https://github.com/anthropics/claude-code-action). It brings genuine understanding of code, not just pattern matching. The [setup guides](https://docs.anthropic.com/en/docs/claude-code/github-actions) are excellent and you can be up and running in ten minutes. + +But here's what we learned the hard way: **the setup isn't the hard part. The prompt is.** + +Anthropic's own [prompting best practices](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices) acknowledge that Claude has known tendencies: over-engineering solutions, creating unnecessary abstractions, adding flexibility that wasn't requested. They recommend adding explicit guidance to counter this. Good advice. But what does that guidance actually look like for a real team? That's the part nobody publishes. + +And that's exactly the problem. A few weeks in, we kept seeing the same failure modes: unnecessary safety checks on typed fields, formatting fixes that diverged from our linter, backwards-compatibility shims nobody asked for. Every comment technically reasonable. None of them matching how we actually write code. + +So we built a [`review-code` skill](https://github.com/futuresearch/delphos/pull/3762), a `SKILL.md` that teaches Claude our actual standards. In Claude Code, **Skills** are markdown playbooks that the agent reads before acting. Think of it as the onboarding doc you wish every reviewer actually read, except this one reads it every time, without skimming. + +Here are five things we encoded that made the biggest difference. + +--- + +## 1. No Defensive Coding: Trust Your Types + +Left to its own defaults, Claude adds defensive checks everywhere. Got a dictionary with known keys? Claude wraps it in `.get()` chains with empty-dict fallbacks. Got a user object with a required email field? Claude adds `if user and user.email:` before using it. + +This looks careful. It's actually harmful. If your types say a field exists, a defensive check hides real bugs under silent `None` values. When something does break, you get a confusing failure three layers downstream instead of an immediate error at the source. + +Our rule in the skill: + +```python +# Bad: checking for None when the type says it's not optional +if user and user.email: + send_email(user.email) + +# Good: the type says User has email, just use it +send_email(user.email) +``` + +Before adding defensive code, the skill tells Claude to check whether the field is actually optional in the type definition, trace where the data comes from to see if it's validated upstream, and only then, if truly uncertain, add proper type narrowing instead of a silent fallback. + +The result: Claude now flags unnecessary defensive code in PRs instead of adding more of it. + +--- + +## 2. Linters, Not Rewrites + +This one is marked critical in our skill for a reason: + +``` +CRITICAL: Never fix formatting issues by rewriting code manually. + +- Python: Run `uv run ruff check --fix` +- TypeScript/NextJS: Run `pnpm run lint:fix` +``` + +Without this rule, Claude will reformat code while reviewing it: reordering imports, adjusting spacing, wrapping long lines. The result is a review comment that's technically about a real issue but buried in a diff full of whitespace changes. + +It gets worse. Manual reformatting often diverges from what your linter would actually produce. You merge the suggestion, run the linter anyway, and now you have a follow-up commit just to fix the formatting that was "fixed." + +The rule is simple: if there's a formatting issue, say "run the linter." Don't rewrite the code. Review comments stay focused on logic and semantics, not aesthetics that a tool handles better. + +--- + +## 3. No Over-Engineering: Do What Was Asked, Nothing More + +Anthropic themselves call this out. Their [prompting best practices](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices) note that Claude "has a tendency to overengineer by creating extra files, adding unnecessary abstractions, or building in flexibility that wasn't requested" and suggest adding guidance to keep solutions minimal. They even provide a sample prompt. + +We found the sample prompt wasn't enough. Claude would still ask to fix a bug and come back with the fix _plus_ a new helper function, an extra config option, error handling for three scenarios that can't happen, and the surrounding code "cleaned up." Each change defensible on its own. Together they triple the review surface for a one-line fix. + +So our skill gets specific: + +``` +- Only make changes directly requested or clearly necessary +- A bug fix doesn't need surrounding code cleaned up +- A simple feature doesn't need extra configurability +- Don't add error handling for scenarios that can't happen +- Trust internal code and framework guarantees +- Only validate at system boundaries (user input, external APIs) +- Don't create helpers/utilities/abstractions for one-time operations +- Three similar lines of code is better than a premature abstraction +``` + +That last line does a lot of work. LLMs have a strong pull toward DRY code. If they see three blocks that look similar, they'll extract a function. But "similar" doesn't mean "same," and the abstraction often obscures what each callsite actually needs. We'd rather have three clear, slightly repetitive blocks than a `_process_thing(mode="variant_a")` helper that nobody can read without jumping to the definition. + +The "only validate at system boundaries" rule matters too. Without it, Claude adds input validation deep inside internal functions, checking that arguments aren't `None`, that lists aren't empty, that strings match expected formats, even when the caller already guarantees all of this. You end up with validation logic scattered across every layer instead of concentrated where data enters the system. + +--- + +## 4. No Backwards Compatibility (Unless You're Shipping a Public API) + +By default, Claude is conservative. It keeps old code around, adds shims, re-exports moved types "just in case something depends on it." For a library with external consumers, that's the right instinct. For a team that owns every callsite, it's just clutter. + +Our skill spells it out: + +``` +We don't do backwards compatibility by default. Keep implementations clean: + +- No "keeping this for backwards compatibility" code paths +- No deprecated parameter shims +- No re-exporting moved types/functions +- No renaming unused _vars to preserve signatures +- No `// removed` comments for deleted code +- If something is unused, delete it completely +- If an API changes, update all callers +``` + +The list of specific patterns matters. Without it, Claude will add `# kept for backwards compatibility` comments, rename removed parameters to `_old_param`, or keep parallel code paths "for safety." With the explicit list, it flags those patterns instead. + +The carve-out is important too: "Only add backwards compatibility if the user explicitly asks for it (e.g., public API with external consumers)." Claude can tell the difference between internal refactoring and a public SDK, but it needs to know which situation it's in. + +--- + +## 5. Encode Your Domain Knowledge (Ours: Tracing) + +Generic review prompts can't know about your observability stack. Our skill has a whole section on [Langfuse](https://langfuse.com/) tracing that's caught real production problems. + +The biggest one: trace size. A decorator like `@observe(capture_input=True)` on a function that processes user data can produce 20MB+ traces. Langfuse doesn't reject them, but they make the UI unusable and cause export timeouts. The skill flags any `@observe` with `capture_input=True` on functions that handle large list or dict data. + +It also checks for context propagation across service boundaries: that Celery tasks serialize tracing context in their args, and that short-running scripts call `force_flush()` before exit so traces actually get exported. + +None of this would come from a generic review prompt. All of it has caught real bugs. Whatever your team's equivalent is (your ORM conventions, your feature flag patterns, your deployment constraints), that's the stuff worth encoding. + +--- + +## The Other 200 Lines + +The rest of the skill covers things like: comments policy (no historical comments like `# removed X from here`, no docstrings on code you didn't change), Python specifics (modern 3.13 type hints, Pydantic over bare dicts, `cast()` only as an absolute last resort), React patterns (no `useEffect` for syncing local state, separate data preparation from JSX rendering), and testing guidelines (test behavior not implementation, minimize mocking, one assertion per test). + +None of these are controversial on their own. The value is having them written down in a place Claude actually reads, so you don't have to repeat yourself across PRs. + +The skill ends with a checklist. We found this matters more than you'd expect. Without it, Claude tends to write long-form commentary on the one or two things that catch its attention and ignore everything else. A checklist forces it to at least consider each item: + +``` +- [ ] No manual formatting fixes (use linters) +- [ ] No unnecessary useEffect +- [ ] No useless comments, no historical "removed X" comments +... +``` + +--- + +## What We Didn't Do: Auto-Review + +One more thing worth sharing. We disabled automatic review on every PR and switched to interactive-only `@claude` mentions. + +Auto-review on routine PRs produced too much noise. A one-line config change doesn't need a 12-point review; it needs a human to glance at it for 30 seconds. When every PR gets reviewed automatically, developers start ignoring the comments, which defeats the purpose entirely. + +We now trigger the skill manually on PRs that benefit from it: complex diffs, unfamiliar parts of the codebase, security-adjacent changes. The signal-to-noise ratio is much higher. + +--- + +The full skill is published [HERE](/docs/files/review_code_skill.md). Feel free to copy it and adapt it to your team's needs. diff --git a/docs-site/src/utils/blog.ts b/docs-site/src/utils/blog.ts index c986d691..0c40bb9f 100644 --- a/docs-site/src/utils/blog.ts +++ b/docs-site/src/utils/blog.ts @@ -12,6 +12,7 @@ export interface BlogPostMeta { date: Date; authors: string[]; tags: string[]; + heroImage?: string; } export interface BlogPost extends BlogPostMeta { @@ -26,6 +27,7 @@ function extractMeta(data: Record, slug: string): Omit