From 18f8c9a83ba9c3c44576c9c4b385962b7a8145aa Mon Sep 17 00:00:00 2001
From: "tembo[bot]" <208362400+tembo[bot]@users.noreply.github.com>
Date: Thu, 23 Apr 2026 02:46:03 +0000
Subject: [PATCH] docs: add RFC for Cloudflare + Livegrep DiffKit repo search MVP architecture

Co-authored-by: Alan <45767683+stylessh@users.noreply.github.com>
---
 docs/cloudflare-livegrep-search-rfc.md | 240 +++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
 create mode 100644 docs/cloudflare-livegrep-search-rfc.md

diff --git a/docs/cloudflare-livegrep-search-rfc.md b/docs/cloudflare-livegrep-search-rfc.md
new file mode 100644
index 0000000..b254fa6
--- /dev/null
+++ b/docs/cloudflare-livegrep-search-rfc.md
@@ -0,0 +1,240 @@
+# RFC: DiffKit Repo Search MVP on Cloudflare + Livegrep
+
+- Status: Draft
+- Date: April 23, 2026
+- Owner: DiffKit
+
+## 1. Summary
+
+This RFC proposes a smaller and cheaper search architecture for DiffKit:
+
+- Cloudflare as the control plane and API surface.
+- Livegrep as the code search engine.
+- Limited initial repository scope (not global internet-scale).
+
+The design optimizes for fast delivery, low operational complexity, and a clean upgrade path.
+
+## 2. Goals
+
+- Ship a working multi-repo code search MVP quickly.
+- Keep monthly cost predictable and low.
+- Integrate cleanly with existing Cloudflare-backed app infra.
+- Handle not-yet-indexed repositories gracefully.
+
+## 3. Non-goals (MVP)
+
+- Indexing all public GitHub repositories.
+- Building a custom search engine in v1.
+- Full diff-aware semantic retrieval in the first iteration.
+
+## 4. Proposed Architecture
+
+## Control plane (Cloudflare)
+
+- Workers: public search API and repo onboarding API.
+- D1: metadata state for repositories, jobs, and index status.
+- Queues: async job pipeline (`repo_sync`, `index_build`).
+- Cron Triggers: periodic sync scheduling.
+- R2: store index build manifests, logs, and backups.
+- Optional: Cloudflare Access/JWT for internal admin endpoints.
+
+## Data plane (Livegrep)
+
+Livegrep requires persistent CPU and disk-heavy indexing/search. For MVP, run this as a small dedicated service outside Workers runtime:
+
+- 1 index/search node (or 2 for HA later).
+- Local SSD for bare clones + active index.
+- Internal endpoint consumed by Worker API layer.
+
+Notes:
+
+- Keep Cloudflare as the product-facing layer.
+- Keep Livegrep private behind network policy and only callable from Worker/API gateway.
+
+## 5. Repository Scope Strategy
+
+Start small:
+
+- Tier A: DiffKit org repositories.
+- Tier B: user-added repositories.
+- Tier C: curated public repositories (manual allowlist).
+
+Explicitly do not crawl all GitHub in MVP.
+
+## 6. Indexing Model
+
+- Clone mode: `--mirror` bare repos.
+- Branch policy: default branch only.
+- Sync cadence:
+  - hot repos: every 15 minutes
+  - warm repos: every 1-3 hours
+  - cold repos: daily
+- Index updates:
+  - batch rebuild every hour, or
+  - event-triggered rebuild after N repo updates
+- Publish model: atomic index swap only when build succeeds.
+
+## 7. Query and Status Flow
+
+1. Client calls Worker search endpoint.
+2. Worker checks D1 for repo status and routing metadata.
+3. Worker queries Livegrep backend.
+4. Worker normalizes and returns results + status.
+
+If repo is not indexed:
+
+- Return `NOT_INDEXED` status in response.
+- Enqueue high-priority bootstrap job.
+- Return optional ETA bucket (`<10m`, `10-30m`, `>30m`).
+
+This avoids empty-result ambiguity and improves UX.
+
+## 8. D1 Schema (MVP)
+
+Recommended tables:
+
+- `search_repo_registry`
+  - `id`
+  - `provider` (`github`)
+  - `owner`
+  - `name`
+  - `default_branch`
+  - `is_enabled`
+  - `tier` (`hot|warm|cold`)
+  - `last_seen_head_sha`
+  - `last_indexed_head_sha`
+  - `last_synced_at`
+  - `last_indexed_at`
+  - `status` (`ready|syncing|indexing|not_indexed|failed`)
+
+- `search_jobs`
+  - `id`
+  - `repo_id`
+  - `job_type` (`sync|index`)
+  - `priority` (`interactive|normal|backfill`)
+  - `status` (`queued|running|done|failed`)
+  - `attempt`
+  - `error`
+  - `created_at`
+  - `updated_at`
+
+- `search_index_builds`
+  - `id`
+  - `build_version`
+  - `repo_count`
+  - `started_at`
+  - `finished_at`
+  - `status`
+  - `manifest_r2_key`
+
+## 9. API Contract (MVP)
+
+`GET /api/search?q=&repo=&path=&lang=&page=`
+
+Response:
+
+- `results`
+- `repo_status`
+- `partial` (boolean)
+- `trace_id`
+
+`POST /api/search/repos`
+
+Body:
+
+- `provider`
+- `owner`
+- `name`
+
+Behavior:
+
+- validates access/policy
+- inserts or updates `search_repo_registry`
+- enqueues bootstrap sync+index job
+
+`GET /api/search/repos/:id/status`
+
+Response:
+
+- current lifecycle state
+- last indexed commit
+- staleness seconds
+- latest error (if any)
+
+## 10. Operations and Reliability
+
+Minimum SLOs for MVP:
+
+- Search API p95 latency: < 400ms for indexed repos.
+- Freshness:
+  - hot repos < 30 minutes
+  - warm repos < 6 hours
+  - cold repos < 24 hours
+
+Must-have alerts:
+
+- queue depth high for > 15m
+- indexing failed repeatedly for same repo
+- stale hot repos above threshold
+- Livegrep node unreachable
+
+Runbooks:
+
+- full reindex
+- single-repo reindex
+- promote previous known-good index
+
+## 11. Cost Controls
+
+- Strict repo allowlist in MVP.
+- Default branch only.
+- Exclude binaries and oversized files.
+- Per-tenant repo caps.
+- Rate-limit expensive regex queries.
+- Auto-downgrade inactive repos from hot to warm/cold.
+
+## 12. Security
+
+- Worker enforces authN/authZ before search.
+- Private repo access validated at onboarding and query time.
+- Livegrep endpoint not exposed publicly.
+- Secrets stored in Cloudflare secrets.
+- Audit all admin/reindex actions in D1 logs.
+
+## 13. Implementation Plan
+
+Phase 1 (Week 1):
+
+- D1 tables + migrations.
+- Worker endpoints for repo onboarding and status.
+- Queue producers/consumers skeleton.
+
+Phase 2 (Week 2):
+
+- Livegrep node deployment.
+- Mirror sync worker.
+- Hourly index build and atomic swap.
+
+Phase 3 (Week 3):
+
+- Result normalization + UI integration.
+- Retry/backoff, alerting, and runbooks.
+- Cost guardrails and tenant limits.
+
+## 14. Upgrade Path
+
+When limits are reached:
+
+- split indexing and serving nodes
+- introduce shard partitioning
+- add diff-specific index pipeline
+- migrate to Zoekt/hybrid engine if required
+
+The API contract above should remain stable during these upgrades.
+
+## 15. Open Questions
+
+- Should user-added repos be immediate index (`interactive`) by default?
+- What max repository size should MVP allow?
+- Do we need branch selection in MVP or keep default-branch-only strictly?
+- What retention policy should apply to old index manifests in R2?
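
Illustrative note on section 7 of the RFC: a minimal TypeScript sketch of the not-indexed branch of the query and status flow, offered as one possible reading rather than the planned implementation. The names `RepoRow`, `BootstrapJob`, `SearchPlan`, `etaBucket`, and `planSearch` are hypothetical, and so is the `READY` status value. The sketch also assumes that a repo is searchable once `last_indexed_head_sha` is set, that the "high-priority" bootstrap job maps to the `interactive` priority, and that bootstrap is one `sync` job plus one `index` job. The estimated build time is an input because the RFC does not say how the ETA is derived.

```typescript
// Lifecycle states from the `status` column of `search_repo_registry`.
type RepoStatus = "ready" | "syncing" | "indexing" | "not_indexed" | "failed";
// ETA buckets named in section 7 of the RFC.
type EtaBucket = "<10m" | "10-30m" | ">30m";

// Hypothetical projection of a `search_repo_registry` row.
interface RepoRow {
  id: string;
  status: RepoStatus;
  lastIndexedHeadSha: string | null;
}

// Hypothetical queue message; field values mirror `search_jobs`.
interface BootstrapJob {
  repoId: string;
  jobType: "sync" | "index";
  priority: "interactive" | "normal" | "backfill";
}

// What the Worker should do for one search request.
interface SearchPlan {
  queryBackend: boolean; // true: forward the query to Livegrep
  repoStatus: "READY" | "NOT_INDEXED"; // "READY" is an assumed value
  eta?: EtaBucket;
  enqueue: BootstrapJob[];
}

// Map an estimated build time in minutes onto the RFC's ETA buckets.
function etaBucket(estimatedMinutes: number): EtaBucket {
  if (estimatedMinutes < 10) return "<10m";
  if (estimatedMinutes <= 30) return "10-30m";
  return ">30m";
}

// Decide between searching and reporting NOT_INDEXED with a bootstrap.
function planSearch(repo: RepoRow, estimatedMinutes: number): SearchPlan {
  // Assumption: any previously published index can still serve queries.
  if (repo.lastIndexedHeadSha !== null) {
    return { queryBackend: true, repoStatus: "READY", enqueue: [] };
  }
  // Avoid duplicate jobs when a sync or index build is already running.
  const jobInFlight = repo.status === "syncing" || repo.status === "indexing";
  const enqueue: BootstrapJob[] = jobInFlight
    ? []
    : [
        { repoId: repo.id, jobType: "sync", priority: "interactive" },
        { repoId: repo.id, jobType: "index", priority: "interactive" },
      ];
  return {
    queryBackend: false,
    repoStatus: "NOT_INDEXED",
    eta: etaBucket(estimatedMinutes),
    enqueue,
  };
}
```

For a row with `status: "not_indexed"` and no indexed SHA, `planSearch(row, 20)` skips the backend call, reports `NOT_INDEXED` with the `10-30m` bucket, and returns two `interactive` jobs for the Worker to send to the queue. Keeping the decision in a pure function means it can be unit-tested without D1, Queues, or a Livegrep node.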