Add search architecture RFC #174
# RFC: DiffKit Repo Search MVP on Cloudflare + Livegrep

- Status: Draft
- Date: April 23, 2026
- Owner: DiffKit

## 1. Summary

This RFC proposes a smaller and cheaper search architecture for DiffKit:

- Cloudflare as the control plane and API surface.
- Livegrep as the code search engine.
- Limited initial repository scope (not global internet-scale).

The design optimizes for fast delivery, low operational complexity, and a clean upgrade path.
## 2. Goals

- Ship a working multi-repo code search MVP quickly.
- Keep monthly cost predictable and low.
- Integrate cleanly with existing Cloudflare-backed app infra.
- Handle not-yet-indexed repositories gracefully.

## 3. Non-goals (MVP)

- Indexing all public GitHub repositories.
- Building a custom search engine in v1.
- Full diff-aware semantic retrieval in the first iteration.
## 4. Proposed Architecture

### Control plane (Cloudflare)

- Workers: public search API and repo onboarding API.
- D1: metadata state for repositories, jobs, and index status.
- Queues: async job pipeline (`repo_sync`, `index_build`).
- Cron Triggers: periodic sync scheduling.
- R2: store index build manifests, logs, and backups.
- Optional: Cloudflare Access/JWT for internal admin endpoints.

### Data plane (Livegrep)

Livegrep requires persistent CPU and disk-heavy indexing/search. For the MVP, run it as a small dedicated service outside the Workers runtime:

- 1 index/search node (or 2 for HA later).
- Local SSD for bare clones + the active index.
- Internal endpoint consumed by the Worker API layer.

Notes:

- Keep Cloudflare as the product-facing layer.
- Keep Livegrep private behind network policy, callable only from the Worker/API gateway.
## 5. Repository Scope Strategy

Start small:

- Tier A: DiffKit org repositories.
- Tier B: user-added repositories.
- Tier C: curated public repositories (manual allowlist).

Explicitly do not crawl all of GitHub in the MVP.
> **Review comment (lines +54 to +63, Contributor):** Clarify the relationship between repository tiers (A/B/C) and sync tiers (hot/warm/cold). Section 5 defines tiers A, B, and C based on repository source, while Section 6 uses hot/warm/cold based on sync cadence. The relationship between these two classification systems is unclear; consider clarifying how a repo's tier (A/B/C) maps to its sync cadence (hot/warm/cold).
## 6. Indexing Model

- Clone mode: `--mirror` bare repos.
- Branch policy: default branch only.
- Sync cadence:
  - hot repos: every 15 minutes
  - warm repos: every 1-3 hours
  - cold repos: daily
- Index updates:
  - batch rebuild every hour, or
  - event-triggered rebuild after N repo updates
- Publish model: atomic index swap only when a build succeeds.
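The sync cadences above can be sketched as a small scheduler helper that a Cron Trigger could consult. The function names and the 60-minute default for warm repos (the lower bound of the 1-3 hour range) are illustrative assumptions, not part of the RFC:

```typescript
// Map a repo's sync tier to a polling cadence in minutes.
type SyncTier = "hot" | "warm" | "cold";

function syncCadenceMinutes(tier: SyncTier): number {
  switch (tier) {
    case "hot":
      return 15;        // every 15 minutes
    case "warm":
      return 60;        // lower bound of the 1-3 hour range (assumption)
    case "cold":
      return 24 * 60;   // daily
  }
}

// A cron-triggered scheduler could use this to decide which repos are due.
function isDue(tier: SyncTier, lastSyncedAt: Date, now: Date): boolean {
  const elapsedMin = (now.getTime() - lastSyncedAt.getTime()) / 60_000;
  return elapsedMin >= syncCadenceMinutes(tier);
}
```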
> **Review comment (lines +64 to +76, Contributor, 🛠️ Refactor suggestion, 🟠 Major):** Detail the atomic index swap mechanism with Livegrep. The "Publish model" bullet mentions an atomic index swap but doesn't specify how this will be implemented with Livegrep; consider documenting the mechanism.
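One common way to implement an atomic index swap, offered here as an assumption rather than the RFC's chosen mechanism, is a symlink flip: build the new Livegrep index into a versioned directory, then atomically repoint a `current` symlink using `rename(2)`, which replaces the old link in a single step on POSIX filesystems. The path names below are illustrative:

```typescript
import { symlinkSync, renameSync } from "node:fs";
import { join } from "node:path";

// Publish a finished index build by flipping the `current` symlink.
// Readers following `root/current` never observe a partially built index.
function publishIndex(root: string, buildVersion: string): void {
  const target = join(root, `index-${buildVersion}`); // already-built index dir
  const tmpLink = join(root, "current.tmp");
  const liveLink = join(root, "current");
  symlinkSync(target, tmpLink);   // stage the new pointer
  renameSync(tmpLink, liveLink);  // atomic flip on POSIX filesystems
}
```

Keeping the previous `index-*` directory around also makes the "promote previous known-good index" runbook in Section 10 a one-line rollback.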
## 7. Query and Status Flow

1. Client calls the Worker search endpoint.
2. Worker checks D1 for repo status and routing metadata.
3. Worker queries the Livegrep backend.
4. Worker normalizes and returns results + status.

If a repo is not indexed:

- Return `NOT_INDEXED` status in the response.
- Enqueue a high-priority bootstrap job.
- Return an optional ETA bucket (`<10m`, `10-30m`, `>30m`).

This avoids empty-result ambiguity and improves UX.
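The not-indexed path can be sketched as follows. The bucket boundaries match the RFC; the function names, the response field names beyond those in Section 9, and the boundary assignment at exactly 10 and 30 minutes are assumptions:

```typescript
// Bucket a bootstrap-time estimate into the coarse ETA labels from the RFC.
type EtaBucket = "<10m" | "10-30m" | ">30m";

function etaBucket(estimatedMinutes: number): EtaBucket {
  if (estimatedMinutes < 10) return "<10m";
  if (estimatedMinutes <= 30) return "10-30m";
  return ">30m";
}

// Shape of the response the Worker might return for an un-indexed repo.
function notIndexedResponse(repo: string, estimatedMinutes: number) {
  return {
    results: [] as unknown[],
    repo_status: "NOT_INDEXED",
    partial: false,
    eta: etaBucket(estimatedMinutes),
    repo,
  };
}
```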
> **Review comment (lines +77 to +91, Contributor):** Define the "routing metadata" referenced in the query flow. Step 2 mentions checking D1 for "routing metadata", but no such field appears in the D1 schema (Section 8). Consider either adding these fields to the schema or clarifying what routing metadata means in this context.
## 8. D1 Schema (MVP)

Recommended tables:

- `search_repo_registry`
  - `id`
  - `provider` (`github`)
  - `owner`
  - `name`
  - `default_branch`
  - `is_enabled`
  - `tier` (`hot|warm|cold`)
  - `last_seen_head_sha`
  - `last_indexed_head_sha`
  - `last_synced_at`
  - `last_indexed_at`
  - `status` (`ready|syncing|indexing|not_indexed|failed`)

- `search_jobs`
  - `id`
  - `repo_id`
  - `job_type` (`sync|index`)
  - `priority` (`interactive|normal|backfill`)
  - `status` (`queued|running|done|failed`)
  - `attempt`
  - `error`
  - `created_at`
  - `updated_at`

- `search_index_builds`
  - `id`
  - `build_version`
  - `repo_count`
  - `started_at`
  - `finished_at`
  - `status`
  - `manifest_r2_key`
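As an illustration of the `status` lifecycle named in `search_repo_registry`, here is one possible transition guard. The state names come from the schema; the allowed-transition map itself is an assumption about how the job pipeline would move repos between states:

```typescript
// Repo lifecycle states from `search_repo_registry.status`.
type RepoStatus = "ready" | "syncing" | "indexing" | "not_indexed" | "failed";

// Assumed transition map: which target states each state may move to.
const allowed: Record<RepoStatus, RepoStatus[]> = {
  not_indexed: ["syncing"],
  syncing: ["indexing", "failed"],
  indexing: ["ready", "failed"],
  ready: ["syncing"],   // periodic re-sync
  failed: ["syncing"],  // retry after backoff
};

function canTransition(from: RepoStatus, to: RepoStatus): boolean {
  return allowed[from].includes(to);
}
```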
> **Review comment (lines +92 to +129, Contributor, 🛠️ Refactor suggestion, 🟠 Major):** Enhance the D1 schema with constraints, indexes, and relationships. The proposed schema is missing several database design elements that are essential for data integrity and performance: constraints, indexes, and some fields, and parts of the design are unclear. Example constraint additions:
>
> ```sql
> ALTER TABLE search_repo_registry ADD CONSTRAINT pk_search_repo_registry PRIMARY KEY (id);
> ALTER TABLE search_repo_registry ADD CONSTRAINT uq_repo UNIQUE (provider, owner, name);
> ALTER TABLE search_jobs ADD CONSTRAINT fk_repo FOREIGN KEY (repo_id) REFERENCES search_repo_registry(id);
> CREATE INDEX idx_repo_status ON search_repo_registry(status, tier);
> CREATE INDEX idx_job_status ON search_jobs(repo_id, status);
> ```
## 9. API Contract (MVP)

`GET /api/search?q=&repo=&path=&lang=&page=`

Response:

- `results`
- `repo_status`
- `partial` (boolean)
- `trace_id`

`POST /api/search/repos`

Body:

- `provider`
- `owner`
- `name`

Behavior:

- validates access/policy
- inserts or updates `search_repo_registry`
- enqueues a bootstrap sync+index job

`GET /api/search/repos/:id/status`

Response:

- current lifecycle state
- last indexed commit
- staleness seconds
- latest error (if any)
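The response shapes above could be typed roughly as follows. Only the top-level field names come from the RFC; the per-result fields and all other names are assumptions:

```typescript
// Assumed shape of one search hit returned by the Worker.
interface SearchResult {
  repo: string;
  path: string;
  line: number;
  snippet: string;
}

// `GET /api/search` response, per the fields listed in Section 9.
interface SearchResponse {
  results: SearchResult[];
  repo_status: Record<string, string>; // per-repo lifecycle state (assumed shape)
  partial: boolean;                    // true if some repos were skipped
  trace_id: string;
}

// `GET /api/search/repos/:id/status` response.
interface RepoStatusResponse {
  status: "ready" | "syncing" | "indexing" | "not_indexed" | "failed";
  last_indexed_head_sha: string | null;
  staleness_seconds: number;
  latest_error: string | null;
}
```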
> **Review comment (lines +130 to +163, Contributor, 🛠️ Refactor suggestion, 🟠 Major):** Add pagination controls and complete the API response schemas. The search endpoint is missing a `limit` parameter, the repo onboarding endpoint has no documented response schemas, and error responses are not specified. Proposed additions:
>
> ```
> GET /api/search?q=&repo=&path=&lang=&page=&limit=
>
> Additional parameters:
> - `limit`: max results per page (default: 50, max: 100)
>
> POST /api/search/repos
>
> Response (201 Created):
> - `id`: repository ID
> - `status`: current status
> - `job_id`: bootstrap job ID
>
> Response (400 Bad Request):
> - `error`: validation error message
> - `field`: problematic field
> ```
## 10. Operations and Reliability

Minimum SLOs for MVP:

- Search API p95 latency: < 400 ms for indexed repos.
- Freshness:
  - hot repos < 30 minutes
  - warm repos < 6 hours
  - cold repos < 24 hours

Must-have alerts:

- queue depth high for > 15 minutes
- indexing failed repeatedly for the same repo
- stale hot repos above threshold
- Livegrep node unreachable

Runbooks:

- full reindex
- single-repo reindex
- promote previous known-good index
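The freshness SLOs and the stale-repo alert can be combined into one check. The thresholds mirror the RFC; the function and constant names are assumptions:

```typescript
// Freshness SLO thresholds per tier, in seconds, from Section 10.
type Tier = "hot" | "warm" | "cold";

const freshnessSloSeconds: Record<Tier, number> = {
  hot: 30 * 60,        // < 30 minutes
  warm: 6 * 60 * 60,   // < 6 hours
  cold: 24 * 60 * 60,  // < 24 hours
};

// True when a repo has exceeded its tier's freshness SLO and should alert.
function isStale(tier: Tier, lastIndexedAt: Date, now: Date): boolean {
  const ageSeconds = (now.getTime() - lastIndexedAt.getTime()) / 1000;
  return ageSeconds > freshnessSloSeconds[tier];
}
```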
## 11. Cost Controls

- Strict repo allowlist in MVP.
- Default branch only.
- Exclude binaries and oversized files.
- Per-tenant repo caps.
- Rate-limit expensive regex queries.
- Auto-downgrade inactive repos from hot to warm/cold.
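The "exclude binaries and oversized files" rule might look like the filter below at index-build time. The 1 MiB cap and the extension list are illustrative assumptions; the RFC does not fix either:

```typescript
// Assumed size cap and binary-extension denylist for the indexer.
const MAX_FILE_BYTES = 1 * 1024 * 1024; // 1 MiB, tunable
const BINARY_EXTENSIONS = new Set([
  ".png", ".jpg", ".gif", ".zip", ".pdf", ".so", ".bin",
]);

// Decide whether a file should be fed to the index build.
function shouldIndexFile(path: string, sizeBytes: number): boolean {
  if (sizeBytes > MAX_FILE_BYTES) return false;
  const dot = path.lastIndexOf(".");
  const ext = dot >= 0 ? path.slice(dot).toLowerCase() : "";
  return !BINARY_EXTENSIONS.has(ext);
}
```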
## 12. Security

- Worker enforces authN/authZ before search.
- Private repo access validated at onboarding and query time.
- Livegrep endpoint not exposed publicly.
- Secrets stored in Cloudflare secrets.
- Audit all admin/reindex actions in D1 logs.
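One possible approach to keeping the Livegrep endpoint private, beyond network policy, is a shared secret checked by a thin proxy in front of Livegrep. This is an assumption, not the RFC's chosen mechanism, and the header name is made up for illustration:

```typescript
// Validate a shared-secret header on internal Worker -> Livegrep calls.
function isAuthorizedBackendCall(
  headers: Map<string, string>,
  sharedSecret: string,
): boolean {
  const presented = headers.get("x-internal-auth") ?? "";
  if (presented.length !== sharedSecret.length) return false;
  // Constant-time comparison to avoid leaking a prefix via timing.
  let diff = 0;
  for (let i = 0; i < sharedSecret.length; i++) {
    diff |= presented.charCodeAt(i) ^ sharedSecret.charCodeAt(i);
  }
  return diff === 0;
}
```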
> **Review comment (lines +196 to +203, Contributor):** Detail private repository access handling. The security section mentions validating private repo access but doesn't specify the implementation. These details are critical for securing private repository data.
## 13. Implementation Plan

Phase 1 (Week 1):

- D1 tables + migrations.
- Worker endpoints for repo onboarding and status.
- Queue producers/consumers skeleton.

Phase 2 (Week 2):

- Livegrep node deployment.
- Mirror sync worker.
- Hourly index build and atomic swap.

Phase 3 (Week 3):

- Result normalization + UI integration.
- Retry/backoff, alerting, and runbooks.
- Cost guardrails and tenant limits.
## 14. Upgrade Path

When limits are reached:

- split indexing and serving nodes
- introduce shard partitioning
- add a diff-specific index pipeline
- migrate to Zoekt/hybrid engine if required

The API contract above should remain stable during these upgrades.
## 15. Open Questions

- Should user-added repos be indexed immediately (`interactive` priority) by default?
- What maximum repository size should the MVP allow?
- Do we need branch selection in the MVP, or keep strictly default-branch-only?
- What retention policy should apply to old index manifests in R2?
> **Review comment:** Specify the authentication mechanism between the Worker and the Livegrep endpoint. The RFC mentions keeping Livegrep private with network policy but doesn't specify how Workers will authenticate to the Livegrep endpoint. Consider documenting the authentication approach (e.g., shared secrets, mTLS, API keys).