240 changes: 240 additions & 0 deletions docs/cloudflare-livegrep-search-rfc.md
# RFC: DiffKit Repo Search MVP on Cloudflare + Livegrep

- Status: Draft
- Date: April 23, 2026
- Owner: DiffKit

## 1. Summary

This RFC proposes a smaller and cheaper search architecture for DiffKit:

- Cloudflare as the control plane and API surface.
- Livegrep as the code search engine.
- Limited initial repository scope (not global internet-scale).

The design optimizes for fast delivery, low operational complexity, and a clean upgrade path.

## 2. Goals

- Ship a working multi-repo code search MVP quickly.
- Keep monthly cost predictable and low.
- Integrate cleanly with existing Cloudflare-backed app infra.
- Handle not-yet-indexed repositories gracefully.

## 3. Non-goals (MVP)

- Indexing all public GitHub repositories.
- Building a custom search engine in v1.
- Full diff-aware semantic retrieval in the first iteration.

## 4. Proposed Architecture

### Control plane (Cloudflare)

- Workers: public search API and repo onboarding API.
- D1: metadata state for repositories, jobs, and index status.
- Queues: async job pipeline (`repo_sync`, `index_build`).
- Cron Triggers: periodic sync scheduling.
- R2: store index build manifests, logs, and backups.
- Optional: Cloudflare Access/JWT for internal admin endpoints.
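
The bindings above can be sketched as a `wrangler.toml` fragment. Everything here is illustrative; binding names, queue names, and the cron cadence are assumptions, not decisions this RFC makes.

```toml
# Hypothetical control-plane Worker config; all names are placeholders.
name = "diffkit-search-api"
main = "src/index.ts"

[[d1_databases]]
binding = "DB"                         # repo registry, jobs, index status
database_name = "search-metadata"

[[queues.producers]]
binding = "JOBS"                       # repo_sync / index_build pipeline
queue = "repo-sync"

[[queues.consumers]]
queue = "repo-sync"
max_batch_size = 10

[[r2_buckets]]
binding = "MANIFESTS"                  # build manifests, logs, backups
bucket_name = "search-index-manifests"

[triggers]
crons = ["*/15 * * * *"]               # periodic sync scheduling
```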

### Data plane (Livegrep)

Livegrep's indexing and search are CPU- and disk-heavy and need persistent local state, so for the MVP it runs as a small dedicated service outside the Workers runtime:

- 1 index/search node (or 2 for HA later).
- Local SSD for bare clones + active index.
- Internal endpoint consumed by Worker API layer.

Notes:

- Keep Cloudflare as the product-facing layer.
- Keep Livegrep private behind network policy and only callable from Worker/API gateway.
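
One way to keep the Livegrep endpoint private while still letting Workers call it is request signing with a shared secret. The sketch below is a hypothetical scheme, not something this RFC specifies; it uses `node:crypto` for brevity, whereas an actual Worker would use WebCrypto (`crypto.subtle`).

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical Worker -> Livegrep auth: the Worker signs each request
// with a shared secret; the proxy in front of Livegrep recomputes the
// signature and compares in constant time.
export function signRequest(secret: string, method: string, path: string, ts: number): string {
  return createHmac("sha256", secret).update(`${method}\n${path}\n${ts}`).digest("hex");
}

export function verifySignature(secret: string, method: string, path: string, ts: number, sig: string): boolean {
  const expected = signRequest(secret, method, path, ts);
  return (
    sig.length === expected.length &&
    timingSafeEqual(Buffer.from(sig, "hex"), Buffer.from(expected, "hex"))
  );
}
```

The timestamp bounds replay; the verifier would additionally reject requests whose `ts` is too far from its own clock.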

Comment on lines +41 to +53

⚠️ Potential issue | 🟠 Major

Specify authentication mechanism between Worker and Livegrep endpoint.

The RFC mentions keeping Livegrep private with network policy but doesn't specify how Workers will authenticate to the Livegrep endpoint. Consider documenting the authentication approach (e.g., shared secrets, mTLS, API keys).

## 5. Repository Scope Strategy

Start small:

- Tier A: DiffKit org repositories.
- Tier B: user-added repositories.
- Tier C: curated public repositories (manual allowlist).

Explicitly do not crawl all of GitHub in the MVP.
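
Sections 5 and 6 use two different tierings: source tiers (A/B/C) and sync tiers (hot/warm/cold). One plausible default mapping, sketched here purely as an assumption:

```typescript
type SourceTier = "A" | "B" | "C";
type SyncTier = "hot" | "warm" | "cold";

// Assumed defaults: org repos are hot, user-added warm, curated cold.
// Recently pushed repos are promoted one step; thresholds are placeholders.
export function defaultSyncTier(source: SourceTier, pushedWithinDays: number): SyncTier {
  const base: Record<SourceTier, SyncTier> = { A: "hot", B: "warm", C: "cold" };
  if (pushedWithinDays <= 1 && base[source] === "warm") return "hot";
  if (pushedWithinDays <= 1 && base[source] === "cold") return "warm";
  return base[source];
}
```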

Comment on lines +54 to +63

⚠️ Potential issue | 🟡 Minor

Clarify relationship between repository tiers (A/B/C) and sync tiers (hot/warm/cold).

Section 5 defines tiers A, B, C based on repository source, while Section 6 (line 103) uses hot/warm/cold based on sync cadence. The relationship between these two classification systems is unclear. Consider clarifying how a repo's tier (A/B/C) maps to its sync cadence (hot/warm/cold).

## 6. Indexing Model

- Clone mode: `--mirror` bare repos.
- Branch policy: default branch only.
- Sync cadence:
- hot repos: every 15 minutes
- warm repos: every 1-3 hours
- cold repos: daily
- Index updates:
- batch rebuild every hour, or
- event-triggered rebuild after N repo updates
- Publish model: atomic index swap only when build succeeds.

Comment on lines +64 to +76

🛠️ Refactor suggestion | 🟠 Major

Detail the atomic index swap mechanism with Livegrep.

Line 75 mentions "atomic index swap" but doesn't specify how this will be implemented with Livegrep. Consider documenting:

  • Whether Livegrep natively supports atomic index switching
  • The specific mechanism (symlink swap, dual-index approach, etc.)
  • Rollback strategy if the new index has issues
  • How concurrent queries are handled during the swap

## 7. Query and Status Flow

1. Client calls Worker search endpoint.
2. Worker checks D1 for repo status and routing metadata.
3. Worker queries Livegrep backend.
4. Worker normalizes and returns results + status.

If the repo is not indexed:

- Return `NOT_INDEXED` status in response.
- Enqueue high-priority bootstrap job.
- Return optional ETA bucket (`<10m`, `10-30m`, `>30m`).

This avoids empty-result ambiguity and improves UX.
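
The `NOT_INDEXED` path can be sketched as a small pure helper. The ETA thresholds below are illustrative assumptions, not numbers from this RFC:

```typescript
type EtaBucket = "<10m" | "10-30m" | ">30m";

// Rough ETA bucket derived from bootstrap queue depth; the cutoffs
// (5 and 20 jobs) are placeholders for whatever the queue can drain.
export function etaBucket(queueDepth: number): EtaBucket {
  if (queueDepth < 5) return "<10m";
  if (queueDepth < 20) return "10-30m";
  return ">30m";
}

// Response body the Worker returns while the bootstrap job is enqueued.
export function notIndexedBody(repo: string, queueDepth: number) {
  return {
    results: [],
    repo,
    repo_status: "NOT_INDEXED",
    partial: false,
    eta: etaBucket(queueDepth),
  };
}
```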

Comment on lines +77 to +91

⚠️ Potential issue | 🟡 Minor

Define "routing metadata" referenced in query flow.

Line 80 mentions checking D1 for "routing metadata" but this field is not included in the D1 schema (Section 8). Consider either adding these fields to the schema or clarifying what routing metadata means in this context.

## 8. D1 Schema (MVP)

Recommended tables:

- `search_repo_registry`
- `id`
- `provider` (`github`)
- `owner`
- `name`
- `default_branch`
- `is_enabled`
- `tier` (`hot|warm|cold`)
- `last_seen_head_sha`
- `last_indexed_head_sha`
- `last_synced_at`
- `last_indexed_at`
- `status` (`ready|syncing|indexing|not_indexed|failed`)

- `search_jobs`
- `id`
- `repo_id`
- `job_type` (`sync|index`)
- `priority` (`interactive|normal|backfill`)
- `status` (`queued|running|done|failed`)
- `attempt`
- `error`
- `created_at`
- `updated_at`

- `search_index_builds`
- `id`
- `build_version`
- `repo_count`
- `started_at`
- `finished_at`
- `status`
- `manifest_r2_key`

Comment on lines +92 to +129

🛠️ Refactor suggestion | 🟠 Major

Enhance D1 schema with constraints, indexes, and relationships.

The proposed schema is missing several database design elements that are essential for data integrity and performance:

Missing constraints:

  • No primary key constraints specified
  • No foreign key constraint for search_jobs.repo_id → search_repo_registry.id
  • No unique constraint on search_repo_registry(provider, owner, name)
  • No enum constraints for status fields

Missing indexes:

  • search_repo_registry: index on (provider, owner, name) for lookups
  • search_repo_registry: index on tier and status for job scheduling
  • search_jobs: index on (repo_id, status) for status queries
  • search_jobs: index on created_at for queue processing

Missing fields:

  • search_repo_registry should have created_at and updated_at for auditing
  • Consider adding size_bytes to track repository size for cost controls

Unclear design:

  • search_index_builds appears to track global builds, but the indexing model (line 73-74) mentions both batch and per-repo event-triggered rebuilds. Clarify if this table tracks only global builds or also per-repo builds.
📊 Proposed schema enhancements

```sql
-- Example constraint additions:
ALTER TABLE search_repo_registry ADD CONSTRAINT pk_search_repo_registry PRIMARY KEY (id);
ALTER TABLE search_repo_registry ADD CONSTRAINT uq_repo UNIQUE (provider, owner, name);
ALTER TABLE search_jobs ADD CONSTRAINT fk_repo FOREIGN KEY (repo_id) REFERENCES search_repo_registry(id);
CREATE INDEX idx_repo_status ON search_repo_registry(status, tier);
CREATE INDEX idx_job_status ON search_jobs(repo_id, status);
```

## 9. API Contract (MVP)

`GET /api/search?q=&repo=&path=&lang=&page=`

Response:

- `results`
- `repo_status`
- `partial` (boolean)
- `trace_id`

`POST /api/search/repos`

Body:

- `provider`
- `owner`
- `name`

Behavior:

- validates access/policy
- inserts or updates `search_repo_registry`
- enqueues bootstrap sync+index job

`GET /api/search/repos/:id/status`

Response:

- current lifecycle state
- last indexed commit
- staleness seconds
- latest error (if any)
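
For reference, the contract above could be written down as TypeScript shapes. The field names follow this section; the result fields and the single-repo status union are assumptions about details the RFC leaves open:

```typescript
// Hypothetical shapes for the MVP search API; everything not named in
// Section 9 (result fields, status values per Section 8) is assumed.
type RepoStatus = "ready" | "syncing" | "indexing" | "not_indexed" | "failed";

interface SearchResult {
  repo: string;
  path: string;
  line: number;
  preview: string;
}

interface SearchResponse {
  results: SearchResult[];
  repo_status: RepoStatus;
  partial: boolean;
  trace_id: string;
}

// Minimal runtime guard a client might apply before trusting a payload.
function isSearchResponse(v: unknown): v is SearchResponse {
  const o = v as SearchResponse;
  return (
    !!o &&
    Array.isArray(o.results) &&
    typeof o.partial === "boolean" &&
    typeof o.trace_id === "string"
  );
}
```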

Comment on lines +130 to +163

🛠️ Refactor suggestion | 🟠 Major

Add pagination controls and complete API response schemas.

The API contract has several gaps:

Missing from search endpoint (line 132):

  • limit or per_page parameter to control result count
  • Maximum allowed page size to prevent excessive resource usage

Missing from repo onboarding endpoint (lines 141-153):

  • Response schema (success/error structure)
  • Status codes for different scenarios (created vs. already exists)
  • Validation error format

Missing globally:

  • API versioning strategy (e.g., /v1/api/search)
  • Error response schema with consistent structure
  • Rate limit headers in responses
🔧 Proposed additions
`GET /api/search?q=&repo=&path=&lang=&page=&limit=`

Additional parameters:
- `limit`: max results per page (default: 50, max: 100)

`POST /api/search/repos`

Response (201 Created):
- `id`: repository ID
- `status`: current status
- `job_id`: bootstrap job ID

Response (400 Bad Request):
- `error`: validation error message
- `field`: problematic field

## 10. Operations and Reliability

Minimum SLOs for MVP:

- Search API p95 latency: < 400ms for indexed repos.
- Freshness:
- hot repos < 30 minutes
- warm repos < 6 hours
- cold repos < 24 hours

Must-have alerts:

- queue depth high for > 15m
- indexing failed repeatedly for same repo
- stale hot repos above threshold
- Livegrep node unreachable

Runbooks:

- full reindex
- single-repo reindex
- promote previous known-good index

## 11. Cost Controls

- Strict repo allowlist in MVP.
- Default branch only.
- Exclude binaries and oversized files.
- Per-tenant repo caps.
- Rate-limit expensive regex queries.
- Auto-downgrade inactive repos from hot to warm/cold.
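
The hot-to-warm/cold auto-downgrade can be a simple idle-time policy run by a cron-triggered Worker. The thresholds here are placeholders, not numbers from this RFC:

```typescript
type Tier = "hot" | "warm" | "cold";

// Illustrative demotion policy: hot repos idle for a week drop to warm,
// warm repos idle for a month drop to cold. Promotion back up would be
// driven by push events (see the sync cadence in Section 6).
export function demote(tier: Tier, daysIdle: number): Tier {
  if (tier === "hot" && daysIdle > 7) return "warm";
  if (tier === "warm" && daysIdle > 30) return "cold";
  return tier;
}
```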

## 12. Security

- Worker enforces authN/authZ before search.
- Private repo access validated at onboarding and query time.
- Livegrep endpoint not exposed publicly.
- Secrets stored in Cloudflare secrets.
- Audit all admin/reindex actions in D1 logs.

Comment on lines +196 to +203

⚠️ Potential issue | 🟠 Major

Detail private repository access handling.

Line 199 mentions validating private repo access but doesn't specify the implementation. Consider documenting:

  • How GitHub tokens/OAuth grants are stored, encrypted, and rotated
  • Token refresh mechanism for long-lived access
  • What happens when a user's access to a private repo is revoked (immediate index removal? grace period?)
  • How to handle repository deletion or archival on GitHub
  • How to prevent one tenant from querying another tenant's private repos in a shared index

These details are critical for securing private repository data.

## 13. Implementation Plan

Phase 1 (Week 1):

- D1 tables + migrations.
- Worker endpoints for repo onboarding and status.
- Queue producers/consumers skeleton.

Phase 2 (Week 2):

- Livegrep node deployment.
- Mirror sync worker.
- Hourly index build and atomic swap.

Phase 3 (Week 3):

- Result normalization + UI integration.
- Retry/backoff, alerting, and runbooks.
- Cost guardrails and tenant limits.

## 14. Upgrade Path

When limits are reached:

- split indexing and serving nodes
- introduce shard partitioning
- add diff-specific index pipeline
- migrate to Zoekt/hybrid engine if required

The API contract above should remain stable during these upgrades.

## 15. Open Questions

- Should user-added repos be indexed immediately (`interactive` priority) by default?
- What maximum repository size should the MVP allow?
- Do we need branch selection in the MVP, or should it stay strictly default-branch-only?
- What retention policy should apply to old index manifests in R2?