# Code Review Guidelines for AI Companies

**Version:** 1.0 | **Last Updated:** February 2026 | **Audience:** AI/ML Engineering Teams

## Table of Contents
- Executive Summary
- The Problem We're Solving
- Tiered Review System
- Review Process by Code Category
- Handling Experimental Branches
- AI-Assisted Code Review
- Vibe Coding Guidelines
- Review SLAs and Metrics
- Tools and Automation
- Implementation Roadmap
## Executive Summary

This document establishes code review guidelines specifically designed for AI companies with:
- Long-running training experiments (days/weeks)
- High experimentation velocity requirements
- AI-assisted development ("vibe coding") practices
- The need to balance speed with code quality
Key Principles:
- **Not all code needs the same level of review:** tiered review system
- **Experimentation should not be blocked:** Ship/Show/Ask model
- **AI-generated code requires a different review focus:** C.L.E.A.R. framework
- **Automate the automatable:** reserve human attention for high-value review
## The Problem We're Solving

- Review process is a bottleneck (~41% of code is now AI-assisted)
- Team members run experiments on non-reviewed branches
- PRs are created after experiments complete, making review difficult
- Long-running training means code diverges significantly before review
- AI coding tools have reached mass adoption (84% developer adoption)
- Code generation is faster than ever; review capacity hasn't scaled
- Monthly GitHub pushes exceed 82M, merged PRs hit 43M
- Review throughput, not implementation speed, determines delivery velocity
## Tiered Review System

┌─────────────────────────────────────────────────────────────┐
│ TIERED REVIEW SYSTEM │
├─────────────────────────────────────────────────────────────┤
│ SHIP (Auto-merge) │ SHOW (Async) │ ASK (Full Review) │
│ Config, docs, │ Standard │ Architecture, │
│ experiments │ features │ security, core │
└─────────────────────────────────────────────────────────────┘
Not all changes carry equal risk. Apply review effort proportionally.
| Tier | Review Type | SLA | When to Use |
|---|---|---|---|
| SHIP | Auto-merge | Immediate | Config changes, documentation, experiment configs |
| SHOW | Async notification | 24h acknowledgment | Standard features, non-critical bug fixes |
| ASK | Full synchronous review | 4h response, 24h completion | Architecture changes, security, production pipelines |
### SHIP Tier: Auto-merge

Criteria for auto-merge:
- All automated tests pass
- Linting and type checks pass
- Security scans clean
- Change is in allowed category (see below)
Allowed Categories:
✓ Documentation updates (README, comments, docs/)
✓ Experiment configuration files (configs/, experiments/)
✓ Development environment configs (.vscode/, .devcontainer/)
✓ Dependency version bumps (with lockfile)
✓ Generated code (if generation is verified)
✓ Experiment notebook outputs (.ipynb outputs)
Implementation:

```yaml
# .github/auto-merge.yml
auto_merge:
  paths:
    - "docs/**"
    - "experiments/configs/**"
    - "*.md"
    - ".vscode/**"
  conditions:
    - ci_passed
    - security_scan_clean
    - no_production_code_changes
```

### SHOW Tier: Async Notification

Process:
- Developer merges after CI passes
- Team is notified asynchronously
- Reviewers have 24h to raise concerns
- If concerns raised, address in follow-up PR
Criteria:
- Standard feature development
- Bug fixes (non-security)
- Refactoring within existing patterns
- Test additions/modifications
Notification Template:

```markdown
## [SHOW] Feature: {title}
**What:** Brief description
**Why:** Business/technical justification
**Risk:** Low/Medium
**Rollback:** How to revert if issues found
Changes: {link to diff}
```

### ASK Tier: Full Synchronous Review

Must be used for:
- Production ML pipeline changes
- Training infrastructure modifications
- Security-sensitive code
- API contracts and interfaces
- Database schema changes
- Authentication/authorization
- Cost-impacting changes (compute, storage)
- New external dependencies
Review Requirements:
- Minimum 1 senior engineer approval
- Security review for sensitive changes
- Architecture review for structural changes
- All comments resolved before merge
## Review Process by Code Category

| Category | Review Tier | Special Considerations |
|---|---|---|
| Experiment Scripts | SHIP/SHOW | Reproducibility check, resource usage |
| Training Configs | SHIP | Validate against schema |
| Model Architecture | ASK | Performance implications, memory usage |
| Data Pipeline | ASK | Data integrity, privacy compliance |
| Evaluation Metrics | SHOW | Correctness verification |
| Inference Code | ASK | Latency, error handling |
| Production Pipeline | ASK | Full review + staging test |
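The "validate against schema" check for SHIP-tier training configs can be automated in CI. A minimal sketch in plain Python; the required fields and types here are hypothetical, and a real setup might prefer `jsonschema` or `pydantic`:

```python
# Minimal training-config validator (illustrative; field names are hypothetical).
REQUIRED_FIELDS = {
    "learning_rate": float,
    "batch_size": int,
    "max_steps": int,
    "seed": int,
}

def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable schema violations (empty list = valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in config:
            errors.append(f"missing required field: {field}")
        elif not isinstance(config[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(config[field]).__name__}"
            )
    # Value-level checks only run once all fields are present and typed correctly.
    if not errors and config["learning_rate"] <= 0:
        errors.append("learning_rate must be positive")
    return errors
```

A validator like this lets configs auto-merge safely: CI rejects the PR when `validate_config` returns any errors.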
## Experiment Review Checklist
### Reproducibility
- [ ] Random seeds are set and documented
- [ ] Environment/dependencies are pinned
- [ ] Data version is tracked
- [ ] Hyperparameters are logged
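The seed item above can be a single helper called at the top of every training script. A stdlib-only sketch; framework-specific calls appear as comments only:

```python
import os
import random

def set_seed(seed: int) -> None:
    """Pin stdlib randomness and record the seed for logs/child processes."""
    random.seed(seed)
    # Recorded for child processes and logs; has no effect on the
    # already-running interpreter's hashing.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # A real training script would also pin framework RNGs, e.g.:
    #   numpy.random.seed(seed); torch.manual_seed(seed)
    #   torch.use_deterministic_algorithms(True)
```

Calling `set_seed(42)` twice yields identical `random.random()` sequences, which is exactly what a reproducibility reviewer should be able to verify.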
### Resource Management
- [ ] GPU memory usage is reasonable
- [ ] Checkpointing is implemented for long runs
- [ ] Resource cleanup on failure
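Checkpointing for long runs bounds the cost of pre-emption. A minimal stdlib sketch of the save/resume pattern; the path and state layout are hypothetical, and real training code would serialize model/optimizer state with its framework's own tools:

```python
import json
import os

CKPT_PATH = "checkpoint.json"  # hypothetical; long runs should use durable storage

def save_checkpoint(step: int, state: dict, path: str = CKPT_PATH) -> None:
    """Atomically persist training progress so a pre-empted run can resume."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint

def load_checkpoint(path: str = CKPT_PATH) -> tuple[int, dict]:
    """Resume from the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]
```

The write-to-temp-then-rename step is the part reviewers should look for: a crash mid-write must not corrupt the only checkpoint.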
### Logging & Tracking
- [ ] Metrics are logged to experiment tracker
- [ ] Model artifacts are versioned
- [ ] Training curves are captured
### Code Quality (Lighter Standard)
- [ ] Code runs without errors
- [ ] Key logic is understandable
- [ ] No hardcoded credentials

## Production Review Checklist
### Functionality
- [ ] Meets requirements
- [ ] Edge cases handled
- [ ] Error handling is comprehensive
### Performance
- [ ] No obvious performance issues
- [ ] Memory usage is acceptable
- [ ] Scales appropriately
### Security
- [ ] Input validation present
- [ ] No credential exposure
- [ ] OWASP top 10 considered
### Maintainability
- [ ] Code is readable
- [ ] Tests are adequate
- [ ] Documentation updated
### ML-Specific
- [ ] Model versioning in place
- [ ] Fallback behavior defined
- [ ] Monitoring hooks present

## Handling Experimental Branches

Training experiments often take days or weeks. Blocking experimentation on review creates:
- Developers working on unreviewed branches for extended periods
- Large, difficult-to-review PRs after experiment completion
- Code divergence and merge conflicts
┌──────────────────────────────────────────────────────────────┐
│ EXPERIMENT BRANCH WORKFLOW │
├──────────────────────────────────────────────────────────────┤
│ │
│ 1. CREATE: exp/{username}/{experiment-name} │
│ └── Register in experiment tracker │
│ │
│ 2. DEVELOP: Work freely, commit often │
│ └── Auto-sync with main daily (merge main → exp) │
│ │
│ 3. CHECKPOINT: Weekly mini-reviews (optional) │
│ └── Lightweight async review of key changes │
│ │
│ 4. COMPLETE: When experiment concludes │
│ └── Create PR with experiment results │
│ └── Review focuses on: what to keep vs discard │
│ │
│ 5. PRODUCTIONIZE: If successful │
│ └── Full ASK review for production-bound code │
│ └── Refactor experiment code to production standards │
│ │
└──────────────────────────────────────────────────────────────┘
```yaml
# Branch protection for experiment branches
experiment_branches:
  pattern: "exp/**"
  rules:
    - allow_force_push: true
    - require_ci: false  # Allow broken experiments
    - auto_delete_after: 90_days
    - require_experiment_registration: true
```

Experiment registration:

```yaml
# experiments/registry/{experiment-id}.yaml
experiment:
  id: exp-2026-001
  name: "Improved attention mechanism"
  owner: "@username"
  branch: "exp/username/attention-v2"
  hypothesis: "New attention reduces training time by 20%"
  start_date: 2026-02-01
  expected_duration: "2 weeks"
  resources:
    gpus: 8
    estimated_cost: "$5000"
  success_criteria:
    - "Training time < 80% of baseline"
    - "Accuracy >= baseline"
  status: in_progress
```

Post-experiment PR template:

## Experiment Results: {experiment-name}
### Hypothesis
{What we were testing}
### Results
- **Outcome:** Success / Partial / Failed
- **Key Metrics:** {metrics vs baseline}
- **Artifacts:** {links to models, logs}
### Code to Keep
- {file1}: {reason}
- {file2}: {reason}
### Code to Discard
- {file3}: {reason - experimental only}
### Production Path
- [ ] Refactor needed: {yes/no}
- [ ] Estimated effort: {t-shirt size}
- [ ] Dependencies: {any blockers}
### Learnings
{What we learned, even if the experiment failed}

## AI-Assisted Code Review

> "The big bottleneck in shipping software is the code review process, especially as you grow your team." — Harjot Gill, CodeRabbit founder
AI tools have shifted the bottleneck:
- **Before:** writing code was slow; review was manageable
- **Now:** writing code is fast; review capacity is the constraint
┌─────────────────────────────────────────────────────────────┐
│ AI + HUMAN REVIEW WORKFLOW │
├─────────────────────────────────────────────────────────────┤
│ │
│ PR Created │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ AI Review │ ← Automated (immediate) │
│ │ - Style │ │
│ │ - Bugs │ │
│ │ - Security │ │
│ │ - Tests │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Human Review │ ← Focused on high-value areas │
│ │ - Architecture │ │
│ │ - Design │ │
│ │ - Business │ │
│ │ - Intent │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ Merge │
│ │
└─────────────────────────────────────────────────────────────┘
What AI review handles well:

| Category | AI Capability | Tools |
|---|---|---|
| Style/Formatting | Excellent | Prettier, ESLint, Black |
| Type Errors | Excellent | TypeScript, mypy, Pyright |
| Common Bugs | Good | CodeRabbit, Qodo, SonarQube |
| Security Patterns | Good | Snyk, Semgrep, GitGuardian |
| Test Coverage | Good | Codecov, Coverage.py |
| Documentation | Moderate | AI doc generators |
Where AI review still falls short:

| Category | Why AI Struggles |
|---|---|
| Architectural Fit | Requires system-wide context |
| Business Logic | Needs domain knowledge |
| Performance Trade-offs | Context-dependent decisions |
| User Experience | Subjective, requires empathy |
| Security (Advanced) | Novel attack vectors |
| Code Intent | "Does this solve the right problem?" |
Recommended tools:

| Tool | Best For | Integration |
|---|---|---|
| CodeRabbit | Comprehensive AI review | GitHub, GitLab |
| Qodo (formerly CodiumAI) | Test generation, review | VS Code, GitHub |
| GitHub Copilot | In-editor suggestions | GitHub native |
| Graphite | Stacked PRs, review flow | GitHub |
| Aikido | Security-focused review | CI/CD |
"Vibe coding" refers to a coding approach using LLMs where programmers generate working code via natural language descriptions rather than writing it manually. A key part is accepting AI-generated code without fully understanding it. — Andrej Karpathy, 2025
This creates distinct review challenges when developers haven't written the code line by line:
- Comprehension gap: Reviewer didn't participate in prompt engineering
- Pattern unfamiliarity: AI generates different patterns than humans
- Bulk generation: Larger volumes create review fatigue
- False confidence: Well-formatted code creates false sense of security
### The C.L.E.A.R. Framework

C - Context
├── Review the original prompt used
├── Understand the requirements
└── Check generation history/iterations
L - Logic
├── Verify the logic is correct
├── Check edge cases
└── Validate error handling
E - Environment
├── Does it fit existing architecture?
├── Consistent with codebase patterns?
└── Dependencies appropriate?
A - Assurance
├── Test coverage adequate?
├── Security implications?
└── Performance acceptable?
R - Readability
├── Can team maintain this?
├── Documentation sufficient?
└── Future developers can understand?
## Vibe Coding Review Checklist
### Context (Before Looking at Code)
- [ ] What prompt generated this code?
- [ ] What were the requirements?
- [ ] Were there multiple iterations?
### Critical Security Review
- [ ] Input validation present (AI often forgets)
- [ ] Authorization checks in place
- [ ] No SQL injection vulnerabilities
- [ ] No hardcoded secrets
- [ ] Proper error messages (no info leakage)
### AI-Specific Red Flags
- [ ] Overly complex solutions for simple problems
- [ ] Unnecessary abstractions
- [ ] Copy-paste patterns (subtle bugs)
- [ ] Placeholder code left in
- [ ] TODO comments not addressed
- [ ] Hallucinated APIs or libraries
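Hallucinated imports are cheap to catch mechanically before a human ever reads the diff. A sketch (the function name is ours) that parses a changed file and flags top-level imports that don't resolve in the current environment:

```python
import ast
import importlib.util

def find_unresolvable_imports(source: str) -> list[str]:
    """Flag imports that don't resolve in the current environment --
    a cheap first check for hallucinated libraries in AI-generated code."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # skip relative imports (level > 0)
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # check the top-level package only
            if importlib.util.find_spec(root) is None:
                missing.append(name)
    return missing
```

Run against the CI environment that will execute the code, this catches packages the model invented; it cannot catch hallucinated functions inside real packages, which still need human or test coverage.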
### Integration Verification
- [ ] Works with existing code
- [ ] Follows team conventions
- [ ] Tests actually test the right things
- [ ] Documentation matches behavior

**Recommendation:** require AI assistance disclosure in PRs:

```markdown
## AI Assistance Disclosure
- [ ] No AI assistance used
- [ ] AI assisted (Copilot/Claude/Cursor)
  - Tools used: ___________
  - Extent: Light suggestions / Significant generation / Majority AI-generated
  - All generated code reviewed and understood: Yes / No
```

### AI-Generated Code Debt

Definition: Technical debt specifically arising from accepting AI-generated code without full understanding.
Symptoms:
- Code that "works" but no one understands why
- Inconsistent patterns across codebase
- Over-engineered solutions
- Missing edge case handling
- Subtle bugs that surface later
Prevention:
- **Test after every generation cycle:** don't trust "looks correct"
- **Require understanding:** if you can't explain it, refactor it
- **Regular refactoring sprints:** consolidate AI-generated patterns
- **Documentation requirements:** AI should document, a human should verify
## Review SLAs and Metrics

| Tier | First Response | Completion | Escalation |
|---|---|---|---|
| SHIP | Automated | Immediate | N/A |
| SHOW | 24 hours | 48 hours | After 48h → Slack ping |
| ASK | 4 hours | 24 hours | After 24h → Manager notified |
Metrics to track:

```yaml
review_metrics:
  speed:
    - time_to_first_review: "Hours from PR open to first comment"
    - time_to_merge: "Hours from PR open to merge"
    - review_cycles: "Number of review rounds"
  quality:
    - post_merge_issues: "Bugs found after merge"
    - revert_rate: "PRs reverted within 7 days"
    - review_coverage: "% of lines with review comments"
  load:
    - reviews_per_engineer_per_week: "Review workload distribution"
    - pr_size_distribution: "Lines changed histogram"
    - pending_review_age: "Hours PRs wait for review"
```

| Metric | Target | Warning | Critical |
|---|---|---|---|
| Time to first review | < 4h | 4-8h | > 8h |
| Time to merge (SHOW) | < 48h | 48-72h | > 72h |
| Time to merge (ASK) | < 24h | 24-48h | > 48h |
| Review cycles | < 3 | 3-5 | > 5 |
| Post-merge issues | < 5% | 5-10% | > 10% |
| PR size (lines) | < 400 | 400-800 | > 800 |
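The PR-size thresholds can double as a CI gate. A trivial sketch using the cutoffs from the table (the function name is ours); total lines changed can be computed from `git diff --shortstat`:

```python
def pr_size_status(lines_changed: int) -> str:
    """Bucket a PR's total lines changed against the targets in the table."""
    if lines_changed < 400:
        return "ok"
    if lines_changed <= 800:
        return "warning"   # consider splitting into stacked PRs
    return "critical"      # split before requesting review
```

A "critical" result can block the review-request step rather than the merge, nudging authors to split before reviewers are assigned.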
Reviewer assignment can be load-balanced automatically. A runnable version of the heuristic (attribute names are illustrative):

```python
# Review-assignment heuristic. `get_available_reviewers` is site-specific;
# `recent_response_time` is assumed normalized so faster responders score higher.
def assign_reviewer(pr):
    candidates = get_available_reviewers(pr.affected_areas)
    for reviewer in candidates:
        reviewer.score = (
            reviewer.expertise_match * 0.4
            + (1 / (1 + reviewer.pending_reviews)) * 0.3  # lighter queue scores higher
            + reviewer.recent_response_time * 0.2
            + reviewer.timezone_overlap * 0.1
        )
    # Highest combined score wins; None when nobody is available.
    return max(candidates, key=lambda r: r.score, default=None)
```

## Tools and Automation

┌─────────────────────────────────────────────────────────────┐
│ CODE REVIEW TOOL STACK │
├─────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Automated Checks (CI/CD) │
│ ├── Linting: ESLint, Pylint, Ruff │
│ ├── Typing: TypeScript, mypy, Pyright │
│ ├── Testing: pytest, Jest (with coverage) │
│ ├── Security: Snyk, Semgrep, GitGuardian │
│ └── Style: Prettier, Black, isort │
│ │
│ Layer 2: AI-Assisted Review │
│ ├── CodeRabbit: Comprehensive AI review │
│ ├── Qodo: Test generation + review │
│ └── SonarQube: Code quality analysis │
│ │
│ Layer 3: Workflow Optimization │
│ ├── Graphite: Stacked PRs, review queue │
│ ├── LinearB: Engineering metrics │
│ └── Sleuth: DORA metrics, deployment tracking │
│ │
│ Layer 4: Human Review │
│ ├── GitHub/GitLab: Native review interface │
│ ├── Review Board: Complex review workflows │
│ └── Slack/Teams: Review notifications │
│ │
└─────────────────────────────────────────────────────────────┘
```yaml
# .github/workflows/review-automation.yml
name: Automated Review Checks
on: [pull_request]

jobs:
  tier-classification:
    runs-on: ubuntu-latest
    outputs:
      tier: ${{ steps.classify.outputs.tier }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so we can diff against main
      - id: classify
        run: |
          # Classify PR tier based on files changed
          git fetch origin main
          if git diff --name-only origin/main | grep -qE '^(src/|lib/)'; then
            echo "tier=ASK" >> "$GITHUB_OUTPUT"
          elif git diff --name-only origin/main | grep -qE '^(docs/|experiments/)'; then
            echo "tier=SHIP" >> "$GITHUB_OUTPUT"
          else
            echo "tier=SHOW" >> "$GITHUB_OUTPUT"
          fi

  automated-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint
        run: npm run lint
      - name: Type Check
        run: npm run typecheck
      - name: Unit Tests
        run: npm test -- --coverage
      - name: Security Scan
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

  ai-review:
    runs-on: ubuntu-latest
    steps:
      - name: CodeRabbit Review
        uses: coderabbit-ai/coderabbit-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

  auto-merge:
    needs: [tier-classification, automated-checks]
    if: needs.tier-classification.outputs.tier == 'SHIP'
    runs-on: ubuntu-latest
    steps:
      - name: Auto-merge SHIP tier
        uses: pascalgn/automerge-action@v0.15.6
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          MERGE_METHOD: squash
```

Stacked PRs with Graphite:

```bash
# Install Graphite CLI
npm install -g @withgraphite/graphite-cli

# Create a stack of dependent PRs
gt branch create feature/step-1
# ... make changes ...
gt commit -m "Step 1: Add data loading"
gt submit  # Creates PR for step 1

gt branch create feature/step-2
# ... make changes ...
gt commit -m "Step 2: Add preprocessing"
gt submit  # Creates PR for step 2, stacked on step 1

# When step 1 is merged, step 2 automatically rebases
```

## Implementation Roadmap

### Week 1
- [ ] Establish tier classification criteria
- [ ] Configure branch protection rules
- [ ] Set up experiment branch naming convention
- [ ] Create PR templates for each tier
### Week 2
- [ ] Implement automated CI checks
- [ ] Configure auto-merge for SHIP tier
- [ ] Set up review SLA tracking
- [ ] Create review assignment automation

### Week 3
- [ ] Integrate AI review tool (CodeRabbit/Qodo)
- [ ] Configure AI review scope and rules
- [ ] Train team on AI review output interpretation
- [ ] Establish AI disclosure policy
### Week 4
- [ ] Implement C.L.E.A.R. framework checklists
- [ ] Create vibe coding review guidelines
- [ ] Set up AI-generated code debt tracking
- [ ] Configure security-focused AI scanning

### Week 5
- [ ] Implement experiment branch protocol
- [ ] Set up experiment registration system
- [ ] Create post-experiment PR template
- [ ] Configure stacked PR workflow
### Week 6
- [ ] Deploy metrics dashboard
- [ ] Establish review load balancing
- [ ] Create escalation procedures
- [ ] Document and communicate changes

### Monthly Review
- [ ] Analyze review metrics
- [ ] Gather team feedback
- [ ] Adjust tier thresholds
- [ ] Update automation rules
- [ ] Refine AI review configuration

┌─────────────────────────────────────────────────────────────┐
│ QUICK REFERENCE │
├─────────────────────────────────────────────────────────────┤
│ │
│ WHICH TIER? │
│ ─────────── │
│ SHIP: Docs, configs, experiments → Auto-merge │
│ SHOW: Standard features → Async review, 24h │
│ ASK: Production, security, arch → Full review, 4h │
│ │
│ PR SIZE LIMITS │
│ ────────────── │
│ Ideal: < 400 lines │
│ Maximum: 800 lines (split if larger) │
│ Review pace: 200-400 lines/hour │
│ │
│ AI-GENERATED CODE │
│ ───────────────── │
│ 1. Disclose AI usage │
│ 2. Apply C.L.E.A.R. framework │
│ 3. Extra scrutiny on security │
│ 4. Verify you understand it │
│ │
│ EXPERIMENT BRANCHES │
│ ─────────────────── │
│ Branch: exp/{username}/{name} │
│ Register in experiment tracker │
│ Sync with main daily │
│ Full review only for production code │
│ │
│ SLAs │
│ ──── │
│ SHIP: Immediate (automated) │
│ SHOW: First response 24h, merge 48h │
│ ASK: First response 4h, merge 24h │
│ │
└─────────────────────────────────────────────────────────────┘
## Resources

- Softr - 8 Vibe Coding Best Practices
- CodeRabbit - Code Review Best Practices for Vibe Coding
- Google Cloud - Vibe Coding Explained
- Qodo - Code Review Best Practices 2026
- Graphite - AI Code Review Implementation
- The New Stack - AI That Codes Faster Than Humans Can Review
- Async Squad - Code Review Bottleneck in AI Era
- Better Software - From Bottleneck to Superpower
- Ship, Show, Ask Pattern
- SE4ML - Code Review for Machine Learning
- Google ML - Managing ML Experiments
- Addy Osmani - LLM Coding Workflow 2026
**Document Owner:** Engineering Team Lead | **Review Cycle:** Quarterly | **Next Review:** May 2026