Add search architecture RFC #174
📝 Walkthrough

Adds a new RFC document detailing the proposed MVP architecture for implementing repository search functionality on Cloudflare using Livegrep, including system components, indexing strategies, API contracts, and implementation phasing.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 7
🧹 Nitpick comments (7)
docs/cloudflare-livegrep-search-rfc.md (7)
**79-83:** Consider rewording to avoid repetition. Steps 2–4 all begin with "Worker"; varying the sentence structure would improve readability.
✍️ Alternative phrasing
1. Client calls Worker search endpoint.
2. Worker checks D1 for repo status and routing metadata.
3. Livegrep backend is queried.
4. Results are normalized and returned with status.
**164-186:** Consider adding metrics and observability requirements. While SLOs and alerts are well-defined, consider documenting key metrics to track for operational visibility:
- Query volume and query latency distribution
- Index build duration and success rate
- Total indexed repository count and size
- Queue depth over time
- Cost metrics (Worker invocations, D1 operations, R2 storage)
These metrics will help validate that the system meets SLOs and support capacity planning.
**204-223:** Consider adding a testing and deployment strategy to the implementation plan. The 3-week timeline is ambitious. Consider adding:
- Unit and integration testing approach for each phase
- Staging environment deployment before production
- Beta user group for early validation
- Rollback criteria and procedure for each phase
- Infrastructure provisioning timeline (Livegrep node setup may have lead time)
This will help derisk the implementation and ensure quality.
**1-240:** Consider adding a disaster recovery and backup strategy section. While R2 backups are mentioned briefly (line 38), the RFC would benefit from a dedicated section covering:
- RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets
- Backup frequency for Livegrep indexes and D1 database
- Restoration procedure and testing schedule
- Geographic redundancy strategy (if any)
- What happens if the Livegrep node is lost completely
This is particularly important given that Livegrep is described as a single node with local SSD storage (line 46).
**235-240:** Consider additional open questions for MVP scope. The listed questions are valuable. Additional questions to consider:
- How should repositories with Git submodules be handled (index submodules or skip)?
- What is the strategy for forks of already-indexed repositories (reuse index or create separate)?
- How to handle extremely large monorepos (>1M files or >100GB)?
- Should archived/read-only repositories be indexed with a different cadence?
- What is the minimum GitHub API rate limit required for the planned sync cadence?
**4:** Minor style: Consider adding a comma after the year in the date. Some style guides recommend a comma after the year in month-day-year format.
📝 Suggested style improvement
Suggested change: `- Date: April 23, 2026` → `- Date: April 23, 2026,`
**187-195:** Quantify cost control thresholds for MVP implementation. Several cost controls are mentioned without specific values. Consider defining:
- Maximum file size to index (e.g., skip files > 1MB)
- Binary file detection approach (extension list, magic bytes, etc.)
- Per-tenant repository limit (e.g., 100 repos per tenant)
- Inactivity threshold for tier downgrade (e.g., no queries for 7 days)
- Regex query timeout/complexity limits
These concrete thresholds will guide implementation and help estimate costs more accurately.
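The thresholds suggested above could be captured as config constants plus a cheap binary check. A minimal TypeScript sketch — all names and values here are illustrative assumptions, not part of the RFC:

```typescript
// Illustrative cost-control constants; names and values are suggestions only.
export const MAX_INDEX_FILE_SIZE = 1 * 1024 * 1024; // skip files larger than 1 MiB
export const PER_TENANT_REPO_LIMIT = 100;
export const INACTIVITY_DOWNTIER_DAYS = 7;
export const REGEX_QUERY_TIMEOUT_MS = 2_000;

// Cheap binary detection: a NUL byte in the first 8 KiB
// (the same heuristic Git uses for binary-file detection).
export function looksBinary(firstChunk: Uint8Array): boolean {
  return firstChunk.subarray(0, 8192).includes(0);
}

export function shouldIndexFile(sizeBytes: number, firstChunk: Uint8Array): boolean {
  return sizeBytes <= MAX_INDEX_FILE_SIZE && !looksBinary(firstChunk);
}
```

An extension blacklist could be layered on top of the NUL-byte sniff to skip obvious binaries without reading file contents at all.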
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: ed82d5c4-3586-4048-9d02-949125d0c5c7
📒 Files selected for processing (1)
docs/cloudflare-livegrep-search-rfc.md
> ## Data plane (Livegrep)
>
> Livegrep requires persistent CPU and disk-heavy indexing/search. For MVP, run this as a small dedicated service outside Workers runtime:
>
> - 1 index/search node (or 2 for HA later).
> - Local SSD for bare clones + active index.
> - Internal endpoint consumed by Worker API layer.
>
> Notes:
>
> - Keep Cloudflare as the product-facing layer.
> - Keep Livegrep private behind network policy and only callable from Worker/API gateway.
Specify authentication mechanism between Worker and Livegrep endpoint.
The RFC mentions keeping Livegrep private with network policy but doesn't specify how Workers will authenticate to the Livegrep endpoint. Consider documenting the authentication approach (e.g., shared secrets, mTLS, API keys).
> ## 5. Repository Scope Strategy
>
> Start small:
>
> - Tier A: DiffKit org repositories.
> - Tier B: user-added repositories.
> - Tier C: curated public repositories (manual allowlist).
>
> Explicitly do not crawl all GitHub in MVP.
Clarify relationship between repository tiers (A/B/C) and sync tiers (hot/warm/cold).
Section 5 defines tiers A, B, C based on repository source, while Section 6 (line 103) uses hot/warm/cold based on sync cadence. The relationship between these two classification systems is unclear. Consider clarifying how a repo's tier (A/B/C) maps to its sync cadence (hot/warm/cold).
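One way to make the A/B/C → hot/warm/cold relationship concrete is a default mapping plus promotion/demotion rules. A sketch of that idea — the mapping, thresholds, and names below are assumptions for illustration, not something the RFC specifies:

```typescript
type SourceTier = "A" | "B" | "C"; // A: org repos, B: user-added, C: curated public
type SyncTier = "hot" | "warm" | "cold";

// Assumed default mapping: A -> hot, B -> warm, C -> cold.
const DEFAULT_SYNC_TIER: Record<SourceTier, SyncTier> = { A: "hot", B: "warm", C: "cold" };

// Example rules: recent queries promote warm repos one step,
// long inactivity demotes any repo to cold, manual flags win.
export function syncTierFor(
  source: SourceTier,
  daysSinceLastQuery: number,
  manualOverride?: SyncTier,
): SyncTier {
  if (manualOverride) return manualOverride;
  if (daysSinceLastQuery > 7) return "cold"; // inactivity downgrade
  const base = DEFAULT_SYNC_TIER[source];
  if (daysSinceLastQuery <= 1 && base === "warm") return "hot"; // activity promotion
  return base;
}
```

For example, a user-added (Tier B) repo queried today would sync on the hot cadence, while a curated public repo untouched for a month stays cold.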
> ## 6. Indexing Model
>
> - Clone mode: `--mirror` bare repos.
> - Branch policy: default branch only.
> - Sync cadence:
>   - hot repos: every 15 minutes
>   - warm repos: every 1-3 hours
>   - cold repos: daily
> - Index updates:
>   - batch rebuild every hour, or
>   - event-triggered rebuild after N repo updates
> - Publish model: atomic index swap only when build succeeds.
🛠️ Refactor suggestion | 🟠 Major
Detail the atomic index swap mechanism with Livegrep.
Line 75 mentions "atomic index swap" but doesn't specify how this will be implemented with Livegrep. Consider documenting:
- Whether Livegrep natively supports atomic index switching
- The specific mechanism (symlink swap, dual-index approach, etc.)
- Rollback strategy if the new index has issues
- How concurrent queries are handled during the swap
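If Livegrep does not provide native index hot-swapping, the dual-directory/symlink pattern this comment asks for might look like the following sketch (Node.js on a POSIX filesystem assumed; paths and function names are illustrative):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Publish a freshly built index directory by atomically repointing a
// `current` symlink. POSIX rename() replaces the destination in a single
// step, so readers see either the old or the new index, never a partial one.
export function publishIndex(newIndexDir: string, currentLink: string): void {
  const tmpLink = currentLink + ".tmp";
  fs.rmSync(tmpLink, { force: true }); // clear any stale temp link
  fs.symlinkSync(newIndexDir, tmpLink); // stage the new pointer
  fs.renameSync(tmpLink, currentLink); // atomic swap
}

// Rollback is the same operation pointed at the previous build,
// which is why keeping the last good index directory staged matters.
export function rollbackIndex(previousIndexDir: string, currentLink: string): void {
  publishIndex(previousIndexDir, currentLink);
}
```

Queries in flight keep reading the old directory via their already-resolved paths; new queries follow the updated link. A health check after the swap decides whether `rollbackIndex` is invoked.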
> ## 7. Query and Status Flow
>
> 1. Client calls Worker search endpoint.
> 2. Worker checks D1 for repo status and routing metadata.
> 3. Worker queries Livegrep backend.
> 4. Worker normalizes and returns results + status.
>
> If repo is not indexed:
>
> - Return `NOT_INDEXED` status in response.
> - Enqueue high-priority bootstrap job.
> - Return optional ETA bucket (`<10m`, `10-30m`, `>30m`).
>
> This avoids empty-result ambiguity and improves UX.
Define "routing metadata" referenced in query flow.
Line 80 mentions checking D1 for "routing metadata" but this field is not included in the D1 schema (Section 8). Consider either adding these fields to the schema or clarifying what routing metadata means in this context.
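The `NOT_INDEXED` path in the flow above could be sketched as follows; the bucket boundaries come from the RFC's `<10m`, `10-30m`, `>30m` buckets, while the function and field names are illustrative assumptions:

```typescript
type EtaBucket = "<10m" | "10-30m" | ">30m";

export function etaBucket(estimatedMinutes: number): EtaBucket {
  if (estimatedMinutes < 10) return "<10m";
  if (estimatedMinutes <= 30) return "10-30m";
  return ">30m";
}

// Worker-side response when a repo has no index yet: report status
// explicitly instead of returning an ambiguous empty result set.
export function notIndexedResponse(repoId: string, estimatedMinutes: number) {
  return {
    results: [] as unknown[],
    repo_status: "NOT_INDEXED" as const,
    partial: false,
    eta: etaBucket(estimatedMinutes),
    // a bootstrap job would also be enqueued here (D1 insert + queue push)
    enqueued_repo: repoId,
  };
}
```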
> ## 8. D1 Schema (MVP)
>
> Recommended tables:
>
> - `search_repo_registry`
>   - `id`
>   - `provider` (`github`)
>   - `owner`
>   - `name`
>   - `default_branch`
>   - `is_enabled`
>   - `tier` (`hot|warm|cold`)
>   - `last_seen_head_sha`
>   - `last_indexed_head_sha`
>   - `last_synced_at`
>   - `last_indexed_at`
>   - `status` (`ready|syncing|indexing|not_indexed|failed`)
> - `search_jobs`
>   - `id`
>   - `repo_id`
>   - `job_type` (`sync|index`)
>   - `priority` (`interactive|normal|backfill`)
>   - `status` (`queued|running|done|failed`)
>   - `attempt`
>   - `error`
>   - `created_at`
>   - `updated_at`
> - `search_index_builds`
>   - `id`
>   - `build_version`
>   - `repo_count`
>   - `started_at`
>   - `finished_at`
>   - `status`
>   - `manifest_r2_key`
🛠️ Refactor suggestion | 🟠 Major
Enhance D1 schema with constraints, indexes, and relationships.
The proposed schema is missing several database design elements that are essential for data integrity and performance:

Missing constraints:
- No primary key constraints specified
- No foreign key constraint for `search_jobs.repo_id` → `search_repo_registry.id`
- No unique constraint on `search_repo_registry(provider, owner, name)`
- No enum constraints for status fields

Missing indexes:
- `search_repo_registry`: index on `(provider, owner, name)` for lookups
- `search_repo_registry`: index on `tier` and `status` for job scheduling
- `search_jobs`: index on `(repo_id, status)` for status queries
- `search_jobs`: index on `created_at` for queue processing

Missing fields:
- `search_repo_registry` should have `created_at` and `updated_at` for auditing
- Consider adding `size_bytes` to track repository size for cost controls

Unclear design:
- `search_index_builds` appears to track global builds, but the indexing model (line 73-74) mentions both batch and per-repo event-triggered rebuilds. Clarify if this table tracks only global builds or also per-repo builds.
📊 Proposed schema enhancements

Note that D1 is SQLite-based, and SQLite does not support `ALTER TABLE ... ADD CONSTRAINT`, so the constraints must be declared when the tables are created:

```sql
-- Constraints declared inline at CREATE (SQLite/D1 syntax):
CREATE TABLE search_repo_registry (
  id INTEGER PRIMARY KEY,
  provider TEXT NOT NULL, owner TEXT NOT NULL, name TEXT NOT NULL,
  status TEXT CHECK (status IN ('ready','syncing','indexing','not_indexed','failed')),
  tier TEXT CHECK (tier IN ('hot','warm','cold')),
  UNIQUE (provider, owner, name)   -- ...plus the remaining Section 8 columns
);
-- search_jobs: repo_id INTEGER NOT NULL REFERENCES search_repo_registry(id)
CREATE INDEX idx_repo_status ON search_repo_registry(status, tier);
CREATE INDEX idx_job_status ON search_jobs(repo_id, status);
```
🪛 LanguageTool
[uncategorized] ~98-~98: The official name of this software platform is spelled with a capital “H”.
Context: ...repo_registry -id -provider (github) - owner -name -default_b...
(GITHUB)
> ## 9. API Contract (MVP)
>
> `GET /api/search?q=&repo=&path=&lang=&page=`
>
> Response:
>
> - `results`
> - `repo_status`
> - `partial` (boolean)
> - `trace_id`
>
> `POST /api/search/repos`
>
> Body:
>
> - `provider`
> - `owner`
> - `name`
>
> Behavior:
>
> - validates access/policy
> - inserts or updates `search_repo_registry`
> - enqueues bootstrap sync+index job
>
> `GET /api/search/repos/:id/status`
>
> Response:
>
> - current lifecycle state
> - last indexed commit
> - staleness seconds
> - latest error (if any)
🛠️ Refactor suggestion | 🟠 Major
Add pagination controls and complete API response schemas.
The API contract has several gaps:

Missing from search endpoint (line 132):
- `limit` or `per_page` parameter to control result count
- Maximum allowed page size to prevent excessive resource usage

Missing from repo onboarding endpoint (lines 141-153):
- Response schema (success/error structure)
- Status codes for different scenarios (created vs. already exists)
- Validation error format

Missing globally:
- API versioning strategy (e.g., `/v1/api/search`)
- Error response schema with consistent structure
- Rate limit headers in responses
🔧 Proposed additions

`GET /api/search?q=&repo=&path=&lang=&page=&limit=`

Additional parameters:
- `limit`: max results per page (default: 50, max: 100)

`POST /api/search/repos`

Response (201 Created):
- `id`: repository ID
- `status`: current status
- `job_id`: bootstrap job ID

Response (400 Bad Request):
- `error`: validation error message
- `field`: problematic field
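The suggested `limit` handling and error envelope could be enforced in the Worker roughly like this; the 50/100 defaults come from the suggestion above, and everything else is an illustrative assumption:

```typescript
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 100;

// Clamp a user-supplied `limit` query parameter to safe bounds;
// anything non-numeric or out of range falls back to the default.
export function clampLimit(raw: string | null): number {
  const n = raw === null ? NaN : Number(raw);
  if (!Number.isInteger(n) || n < 1) return DEFAULT_LIMIT;
  return Math.min(n, MAX_LIMIT);
}

// A consistent error envelope all endpoints could share.
export interface ApiError {
  error: string;
  field?: string;
}

export function validationError(field: string, message: string): ApiError {
  return { error: message, field };
}
```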
> ## 12. Security
>
> - Worker enforces authN/authZ before search.
> - Private repo access validated at onboarding and query time.
> - Livegrep endpoint not exposed publicly.
> - Secrets stored in Cloudflare secrets.
> - Audit all admin/reindex actions in D1 logs.
Detail private repository access handling.
Line 199 mentions validating private repo access but doesn't specify the implementation. Consider documenting:
- How GitHub tokens/OAuth grants are stored, encrypted, and rotated
- Token refresh mechanism for long-lived access
- What happens when a user's access to a private repo is revoked (immediate index removal? grace period?)
- How to handle repository deletion or archival on GitHub
- How to prevent one tenant from querying another tenant's private repos in a shared index
These details are critical for securing private repository data.
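A per-tenant authorization filter of the kind this comment asks for might look like the following in the Worker layer; the access-map shape and all names are hypothetical:

```typescript
// Hypothetical access map: tenant id -> set of repo ids it may query.
type AccessMap = Map<string, Set<string>>;

// Filter a requested repo list down to repos the tenant may see;
// the Worker would run this before forwarding the query to Livegrep,
// so a shared index never answers for repos outside the tenant's grant.
export function authorizedRepos(
  access: AccessMap,
  tenantId: string,
  requested: string[],
): string[] {
  const allowed = access.get(tenantId);
  if (!allowed) return [];
  return requested.filter((r) => allowed.has(r));
}
```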
Summary
Adds RFC for DiffKit's multi-repo code search MVP using Cloudflare + Livegrep. Proposes a lightweight architecture with Cloudflare Workers as control plane, Livegrep for indexing/search, and D1 for metadata. Includes repository tiering strategy, indexing model, API contracts, operations SLOs, and phased implementation plan.
Changes
- Adds `docs/cloudflare-livegrep-search-rfc.md` with complete architecture proposal.