Add search architecture RFC #174

Closed
tembo[bot] wants to merge 1 commit into main from tembo/fast-github-repo-search-without-api

Conversation


tembo[bot] commented Apr 23, 2026

Summary

Adds RFC for DiffKit's multi-repo code search MVP using Cloudflare + Livegrep. Proposes a lightweight architecture with Cloudflare Workers as control plane, Livegrep for indexing/search, and D1 for metadata. Includes repository tiering strategy, indexing model, API contracts, operations SLOs, and phased implementation plan.

Changes

  • Added docs/cloudflare-livegrep-search-rfc.md with complete architecture proposal
  • Defines control plane (Workers, D1, Queues, R2) and data plane (Livegrep) separation
  • Specifies D1 schema for repo registry, jobs, and index builds
  • Documents API endpoints for search, repo onboarding, and status checks
  • Includes cost controls, security measures, and upgrade path for scaling

Test Plan

  • Review RFC for technical feasibility
  • Validate cost projections with Cloudflare pricing
  • Confirm Livegrep integration approach with team

Screenshots

Before                         | After
No search architecture defined | RFC-driven implementation plan ready

Want tembo to make any changes? Add a review or comment with @tembo and I'll get back to work!


Summary by CodeRabbit

  • Documentation
    • Added RFC documentation detailing the proposed architecture and implementation roadmap for repository search functionality, including API contracts, operational considerations, and security requirements.

…tecture

Co-authored-by: Alan <45767683+stylessh@users.noreply.github.com>
tembo[bot] added the `tembo` label (Pull request created by Tembo) on Apr 23, 2026

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

Adds a new RFC document detailing the proposed MVP architecture for implementing repository search functionality on Cloudflare using Livegrep, including system components, indexing strategies, API contracts, and implementation phasing.

Changes

Cohort / File(s) Summary
RFC Documentation
docs/cloudflare-livegrep-search-rfc.md
New RFC specifying MVP architecture for DiffKit repository search with Cloudflare and Livegrep. Defines control-plane components (Workers, D1, Queues, Cron, R2), data-plane deployment, repository scoping tiers, indexing model, query/status flows, database schema, API contracts, SLOs, cost controls, security requirements, and 3-phase implementation plan.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add search architecture RFC' is clear and directly describes the main change—adding an RFC document for search architecture.
Description check ✅ Passed The description follows the template structure with Summary, Changes, Test Plan, and Screenshots sections all populated with relevant details.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (7)
docs/cloudflare-livegrep-search-rfc.md (7)

79-83: Consider rewording to avoid repetition.

Steps 2-4 all begin with "Worker." Consider varying the sentence structure for better readability.

✍️ Alternative phrasing
1. Client calls Worker search endpoint.
2. Worker checks D1 for repo status and routing metadata.
3. Livegrep backend is queried.
4. Results are normalized and returned with status.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 79 - 83, The three
consecutive steps starting with "Worker" are repetitive; revise steps 2–4 to
vary sentence structure so readability improves (e.g., keep step 2 as "Worker
checks D1 for repo status and routing metadata," change step 3 to "Livegrep
backend is queried" and step 4 to "Results are normalized and returned with
status")—apply this phrasing or a similar non-repetitive rewrite for the list in
docs/cloudflare-livegrep-search-rfc.md.

164-186: Consider adding metrics and observability requirements.

While SLOs and alerts are well-defined, consider documenting key metrics to track for operational visibility:

  • Query volume and query latency distribution
  • Index build duration and success rate
  • Total indexed repository count and size
  • Queue depth over time
  • Cost metrics (Worker invocations, D1 operations, R2 storage)

These metrics will help validate that the system meets SLOs and support capacity planning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 164 - 186, Update the
"10. Operations and Reliability" section to include a new "Metrics and
Observability" subsection that lists and defines the key metrics to track (e.g.,
query volume and latency distribution, index build duration and success rate,
total indexed repo count and size, queue depth over time, cost metrics like
Worker invocations/D1/R2 usage) and map those metrics to existing SLOs and
alerts (Search API p95, freshness tiers, queue depth, indexing failures,
Livegrep node reachability) so they feed dashboards and runbooks (full reindex,
single-repo reindex, promote previous index) for capacity planning and incident
response.

204-223: Consider adding testing and deployment strategy to implementation plan.

The 3-week timeline is ambitious. Consider adding:

  • Unit and integration testing approach for each phase
  • Staging environment deployment before production
  • Beta user group for early validation
  • Rollback criteria and procedure for each phase
  • Infrastructure provisioning timeline (Livegrep node setup may have lead time)

This will help derisk the implementation and ensure quality.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 204 - 223, Update the
"Implementation Plan" section to include a testing and deployment strategy by
adding explicit entries under Phase 1, Phase 2, and Phase 3: for each phase
mention unit and integration test scope, staging environment deployment and
validation steps, a beta user rollout plan, rollback criteria/procedures, and
infrastructure provisioning lead times (e.g., Livegrep node setup); also add a
short checklist for CI/CD gating and monitoring/alerting readiness so reviewers
can verify de-risking tasks alongside the existing D1 tables, worker endpoints,
mirror sync, and result normalization items.

1-240: Consider adding disaster recovery and backup strategy section.

While R2 backups are mentioned briefly (line 38), the RFC would benefit from a dedicated section covering:

  • RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets
  • Backup frequency for Livegrep indexes and D1 database
  • Restoration procedure and testing schedule
  • Geographic redundancy strategy (if any)
  • What happens if the Livegrep node is lost completely

This is particularly important given that Livegrep is described as a single node with local SSD storage (line 46).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 1 - 240, Add a dedicated
"Disaster Recovery & Backups" section that specifies RTO/RPO targets, backup
frequency and retention for Livegrep indexes (from local SSD) and D1 metadata,
the use of R2 manifests as durable backups, an automated restore procedure and
runbook for full Livegrep node loss (including steps to rehydrate indexes from
R2, rebuild from mirrors, and promote a standby node), geographic redundancy
strategy (cross-region R2 copies or multi-AZ Livegrep replicas) and a regular
restoration testing cadence (e.g., quarterly drills) plus alerting/SLAs;
reference Livegrep node, R2 manifest_r2_key, and D1 tables (search_index_builds,
search_repo_registry) in the section so operators know exactly which artifacts
to back up and how to recover them.

235-240: Consider additional open questions for MVP scope.

The listed questions are valuable. Additional questions to consider:

  • How should repositories with Git submodules be handled (index submodules or skip)?
  • What is the strategy for forks of already-indexed repositories (reuse index or create separate)?
  • How to handle extremely large monorepos (>1M files or >100GB)?
  • Should archived/read-only repositories be indexed with a different cadence?
  • What is the minimum GitHub API rate limit required for the planned sync cadence?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 235 - 240, Update the
"## 15. Open Questions" section to include the suggested MVP-scope questions:
add bullets addressing handling of Git submodules (index vs skip), strategy for
forks of already-indexed repos (reuse index vs separate), policy for extremely
large monorepos (>1M files or >100GB), indexing cadence for archived/read-only
repositories, and the minimum GitHub API rate limit required for the planned
sync cadence; place these as additional list items under the existing Open
Questions block so they are visible alongside the original questions.

4-4: Minor style: Consider adding comma after year in date.

Some style guides recommend a comma after the year in month-day-year format.

📝 Suggested style improvement
-- Date: April 23, 2026
+- Date: April 23, 2026,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` at line 4, Update the Date header
line that currently reads "Date: April 23, 2026" to include a trailing comma
after the year (e.g., "Date: April 23, 2026,") so it follows month-day-year
punctuation conventions; locate the "Date:" metadata/header in the document (the
line starting with "Date:") and add the comma.

187-195: Quantify cost control thresholds for MVP implementation.

Several cost controls are mentioned without specific values. Consider defining:

  • Maximum file size to index (e.g., skip files > 1MB)
  • Binary file detection approach (extension list, magic bytes, etc.)
  • Per-tenant repository limit (e.g., 100 repos per tenant)
  • Inactivity threshold for tier downgrade (e.g., no queries for 7 days)
  • Regex query timeout/complexity limits

These concrete thresholds will guide implementation and help estimate costs more accurately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 187 - 195, Update the
"## 11. Cost Controls" section in docs/cloudflare-livegrep-search-rfc.md to
include concrete thresholds: specify a MAX_INDEX_FILE_SIZE (e.g., 1MB) and state
the binary detection approach (e.g., extension blacklist + magic-byte sniffing),
set a PER_TENANT_REPO_LIMIT (e.g., 100 repos), define an
INACTIVITY_DOWNTIER_DAYS threshold (e.g., 7 days) for auto-downgrade, and add a
REGEX_QUERY_TIMEOUT/complexity limit (e.g., X seconds or complexity metric); use
these symbolic names (MAX_INDEX_FILE_SIZE, BINARY_DETECTION_METHOD,
PER_TENANT_REPO_LIMIT, INACTIVITY_DOWNTIER_DAYS, REGEX_QUERY_TIMEOUT) so
implementers can map them to config constants and cost estimates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/cloudflare-livegrep-search-rfc.md`:
- Around line 77-91: Clarify what "routing metadata" means in the query flow by
either adding explicit fields to the D1 schema in Section 8 (e.g.,
repo_routing_host, region, livegrep_backend_id or routing_hint) or by updating
the text in "7. Query and Status Flow" to reference the existing D1 fields that
convey routing (e.g., indexed status, backend id, region) and remove the
ambiguous term; update the Worker step "Worker checks D1 for repo status and
routing metadata" to explicitly list the exact D1 fields the Worker reads (or
add those new fields to the D1 schema) so the schema and flow remain consistent.
- Around line 92-129: The D1 schema lacks key constraints, indexes, timestamps,
and clarity about the build table; fix by adding primary keys (pk on
search_repo_registry.id, search_jobs.id, search_index_builds.id), a unique
constraint on search_repo_registry(provider, owner, name), a foreign key
search_jobs.repo_id → search_repo_registry.id, enum/checked constraints for
status/job_type/priority/tier fields, add created_at/updated_at (and optional
size_bytes) to search_repo_registry for auditing/cost, and create indexes:
search_repo_registry(provider, owner, name), search_repo_registry(status, tier),
search_jobs(repo_id, status), and search_jobs(created_at) for queue processing;
also clarify whether search_index_builds is global-only or per-repo and, if
per-repo, add repo_id FK and related indexes to search_index_builds to reflect
that model.
- Around line 41-53: The RFC omits how Workers authenticate to the Livegrep
endpoint; update the "Data plane (Livegrep)" section to specify a concrete
authentication mechanism (e.g., mutual TLS with service certificates, signed
short-lived JWTs from the Worker API layer, or scoped API keys) and detail how
it is provisioned, rotated, and enforced by the Livegrep service and network
policy; reference the Worker -> "Internal endpoint consumed by Worker API layer"
flow and document the expected headers/certificates, validation steps on the
Livegrep side, and any necessary configuration for the Worker/API gateway to
obtain or present credentials.
- Around line 64-76: Update the "Indexing Model" section to explicitly describe
the atomic index swap implementation for Livegrep: state whether Livegrep
supports native atomic switching or not, and if not, document using a dual-index
directory approach (build new index in a temp path and perform an atomic
rename/symlink swap to the live index), include a clear rollback strategy (keep
previous index staged and revert rename/symlink on failure or health-check), and
explain concurrent query handling during swap (use atomic filesystem rename or
symlink update so readers continue hitting the old index until the instant swap,
add health-checks and graceful failover to the previous index if the new one
fails). Ensure these details are placed under "Indexing Model" and reference
"atomic index swap", "Livegrep", and the dual-index/symlink rename approach so
reviewers can find and verify the change.
- Around line 54-63: Clarify how repository source tiers (Tier A, Tier B, Tier
C) map to sync cadence tiers (hot, warm, cold) by updating Section 5 and the
sync-cadence text in Section 6: state a default mapping (e.g., Tier A -> hot,
Tier B -> warm, Tier C -> cold), note any exceptions (manual overrides, org
policies, high-activity repos) and describe rules that promote/demote repos
between hot/warm/cold (e.g., activity thresholds, manual flags), and add one
short example per mapping to make the behavior explicit.
- Around line 196-203: Update the "## 12. Security" section to explicitly
document private repository access handling: describe how GitHub tokens/OAuth
grants are stored (encrypted at rest using Cloudflare Secrets / KMS), rotated
(rotation policy and tooling), and refreshed (token refresh flow or short-lived
tokens + refresh tokens), specify behavior on access revocation or repo
deletion/archival (immediate index removal vs. grace period and the
reindex/cleanup workflow), and detail tenant isolation controls to prevent
cross-tenant queries (per-tenant ACLs, index namespacing, and auth checks
performed in the Worker before search); reference the existing bullet "Private
repo access validated at onboarding and query time" so reviewers can find where
to add these specifics.
- Around line 130-163: Update the API contract for the search and repo endpoints
to include pagination and standardized response/error schemas: for GET
/api/search (refer to the GET /api/search signature) add a limit/per_page
parameter with defaults and a documented max page size, include limit in the
response metadata (e.g., per_page, page, total, partial), and document
rate-limit headers; for POST /api/search/repos (refer to POST /api/search/repos)
add explicit response schemas for success (201 Created with id, status, job_id)
and for validation errors (400 with error, field), and specify status codes for
"created" vs "already exists"; for GET /api/search/repos/:id/status (refer to
that route) define the response shape (lifecycle state, last_indexed_commit,
staleness_seconds, latest_error) and error shapes; finally add a global API
versioning pattern (e.g., /v1/) and a consistent error response schema used by
all endpoints.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ed82d5c4-3586-4048-9d02-949125d0c5c7

📥 Commits

Reviewing files that changed from the base of the PR and between 6c2b8f0 and 18f8c9a.

📒 Files selected for processing (1)
  • docs/cloudflare-livegrep-search-rfc.md

Comment on lines +41 to +53
## Data plane (Livegrep)

Livegrep requires persistent CPU and disk-heavy indexing/search. For MVP, run this as a small dedicated service outside Workers runtime:

- 1 index/search node (or 2 for HA later).
- Local SSD for bare clones + active index.
- Internal endpoint consumed by Worker API layer.

Notes:

- Keep Cloudflare as the product-facing layer.
- Keep Livegrep private behind network policy and only callable from Worker/API gateway.


⚠️ Potential issue | 🟠 Major

Specify authentication mechanism between Worker and Livegrep endpoint.

The RFC mentions keeping Livegrep private with network policy but doesn't specify how Workers will authenticate to the Livegrep endpoint. Consider documenting the authentication approach (e.g., shared secrets, mTLS, API keys).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 41 - 53, The RFC omits
how Workers authenticate to the Livegrep endpoint; update the "Data plane
(Livegrep)" section to specify a concrete authentication mechanism (e.g., mutual
TLS with service certificates, signed short-lived JWTs from the Worker API
layer, or scoped API keys) and detail how it is provisioned, rotated, and
enforced by the Livegrep service and network policy; reference the Worker ->
"Internal endpoint consumed by Worker API layer" flow and document the expected
headers/certificates, validation steps on the Livegrep side, and any necessary
configuration for the Worker/API gateway to obtain or present credentials.
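One possible shape for the missing auth mechanism is a scoped API key held as a Worker secret and sent in a custom header; mTLS or short-lived signed JWTs would fit equally well. In this sketch the binding names, header name, and URL path are all illustrative assumptions, not decisions from the RFC.

```typescript
// Hypothetical Worker env bindings; the key would be provisioned via
// `wrangler secret put` and rotated out of band.
interface Env {
  LIVEGREP_URL: string;      // internal endpoint, reachable only from the Worker
  LIVEGREP_API_KEY: string;  // scoped API key, validated by the Livegrep front end
}

// Build the outbound request to the private Livegrep endpoint.
function buildLivegrepRequest(env: Env, q: string): { url: string; headers: Record<string, string> } {
  return {
    url: `${env.LIVEGREP_URL}/api/v1/search?q=${encodeURIComponent(q)}`,
    headers: { "X-Search-Api-Key": env.LIVEGREP_API_KEY },
  };
}

// Usage in the Worker:
//   const { url, headers } = buildLivegrepRequest(env, q);
//   const res = await fetch(url, { headers }); // Livegrep side answers 401/403 on bad keys
```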

Comment on lines +54 to +63
## 5. Repository Scope Strategy

Start small:

- Tier A: DiffKit org repositories.
- Tier B: user-added repositories.
- Tier C: curated public repositories (manual allowlist).

Explicitly do not crawl all GitHub in MVP.


⚠️ Potential issue | 🟡 Minor

Clarify relationship between repository tiers (A/B/C) and sync tiers (hot/warm/cold).

Section 5 defines tiers A, B, C based on repository source, while Section 6 (line 103) uses hot/warm/cold based on sync cadence. The relationship between these two classification systems is unclear. Consider clarifying how a repo's tier (A/B/C) maps to its sync cadence (hot/warm/cold).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 54 - 63, Clarify how
repository source tiers (Tier A, Tier B, Tier C) map to sync cadence tiers (hot,
warm, cold) by updating Section 5 and the sync-cadence text in Section 6: state
a default mapping (e.g., Tier A -> hot, Tier B -> warm, Tier C -> cold), note
any exceptions (manual overrides, org policies, high-activity repos) and
describe rules that promote/demote repos between hot/warm/cold (e.g., activity
thresholds, manual flags), and add one short example per mapping to make the
behavior explicit.
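The default mapping plus a promotion rule could be as small as the sketch below. The mapping (A→hot, B→warm, C→cold) follows the example in the comment; the activity threshold is an illustrative assumption.

```typescript
type SourceTier = "A" | "B" | "C";       // org repos / user-added / curated public
type SyncTier = "hot" | "warm" | "cold"; // every 15 min / every 1-3 h / daily

// Default mapping suggested in the review; repos start here.
const DEFAULT_CADENCE: Record<SourceTier, SyncTier> = { A: "hot", B: "warm", C: "cold" };

// Promote high-activity repos one cadence level; threshold is a placeholder.
function effectiveSyncTier(source: SourceTier, queriesLast24h: number): SyncTier {
  const base = DEFAULT_CADENCE[source];
  if (queriesLast24h >= 50 && base === "cold") return "warm";
  if (queriesLast24h >= 50 && base === "warm") return "hot";
  return base;
}
```

Demotion would be the mirror image, driven by the inactivity threshold from the cost-controls comment.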

Comment on lines +64 to +76
## 6. Indexing Model

- Clone mode: `--mirror` bare repos.
- Branch policy: default branch only.
- Sync cadence:
- hot repos: every 15 minutes
- warm repos: every 1-3 hours
- cold repos: daily
- Index updates:
- batch rebuild every hour, or
- event-triggered rebuild after N repo updates
- Publish model: atomic index swap only when build succeeds.


🛠️ Refactor suggestion | 🟠 Major

Detail the atomic index swap mechanism with Livegrep.

Line 75 mentions "atomic index swap" but doesn't specify how this will be implemented with Livegrep. Consider documenting:

  • Whether Livegrep natively supports atomic index switching
  • The specific mechanism (symlink swap, dual-index approach, etc.)
  • Rollback strategy if the new index has issues
  • How concurrent queries are handled during the swap
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 64 - 76, Update the
"Indexing Model" section to explicitly describe the atomic index swap
implementation for Livegrep: state whether Livegrep supports native atomic
switching or not, and if not, document using a dual-index directory approach
(build new index in a temp path and perform an atomic rename/symlink swap to the
live index), include a clear rollback strategy (keep previous index staged and
revert rename/symlink on failure or health-check), and explain concurrent query
handling during swap (use atomic filesystem rename or symlink update so readers
continue hitting the old index until the instant swap, add health-checks and
graceful failover to the previous index if the new one fails). Ensure these
details are placed under "Indexing Model" and reference "atomic index swap",
"Livegrep", and the dual-index/symlink rename approach so reviewers can find and
verify the change.

Comment on lines +77 to +91
## 7. Query and Status Flow

1. Client calls Worker search endpoint.
2. Worker checks D1 for repo status and routing metadata.
3. Worker queries Livegrep backend.
4. Worker normalizes and returns results + status.

If repo is not indexed:

- Return `NOT_INDEXED` status in response.
- Enqueue high-priority bootstrap job.
- Return optional ETA bucket (`<10m`, `10-30m`, `>30m`).

This avoids empty-result ambiguity and improves UX.


⚠️ Potential issue | 🟡 Minor

Define "routing metadata" referenced in query flow.

Line 80 mentions checking D1 for "routing metadata" but this field is not included in the D1 schema (Section 8). Consider either adding these fields to the schema or clarifying what routing metadata means in this context.

🧰 Tools
🪛 LanguageTool

[style] ~82-~82: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... 3. Worker queries Livegrep backend. 4. Worker normalizes and returns results + status...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/cloudflare-livegrep-search-rfc.md` around lines 77 - 91, Clarify what
"routing metadata" means in the query flow by either adding explicit fields to
the D1 schema in Section 8 (e.g., repo_routing_host, region, livegrep_backend_id
or routing_hint) or by updating the text in "7. Query and Status Flow" to
reference the existing D1 fields that convey routing (e.g., indexed status,
backend id, region) and remove the ambiguous term; update the Worker step
"Worker checks D1 for repo status and routing metadata" to explicitly list the
exact D1 fields the Worker reads (or add those new fields to the D1 schema) so
the schema and flow remain consistent.
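The four-step flow plus the NOT_INDEXED branch can be sketched as a Worker-style handler. The D1 columns read in step 2 (`status`, `backend_id`) combine the Section 8 schema with the routing fields this comment proposes; the dependency interface and helper names are hypothetical.

```typescript
type RepoStatus = "ready" | "syncing" | "indexing" | "not_indexed" | "failed";

interface RepoRow { status: RepoStatus; backend_id: string }

// Injected dependencies so the flow is testable; in a real Worker these wrap
// D1, the Livegrep endpoint, and a Queues producer.
interface SearchDeps {
  lookupRepo(repo: string): RepoRow | undefined;            // step 2: D1 read
  searchLivegrep(backendId: string, q: string): unknown[];  // step 3: data plane
  enqueueBootstrap(repo: string): void;                     // bootstrap job producer
}

function handleSearch(deps: SearchDeps, repo: string, q: string) {
  const row = deps.lookupRepo(repo);
  if (!row || row.status === "not_indexed") {
    deps.enqueueBootstrap(repo);  // high-priority bootstrap, per the RFC
    return { results: [], repo_status: "NOT_INDEXED", eta_bucket: "<10m" };
  }
  const hits = deps.searchLivegrep(row.backend_id, q);              // step 3
  return { results: hits, repo_status: row.status, partial: false }; // step 4
}
```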

Comment on lines +92 to +129
## 8. D1 Schema (MVP)

Recommended tables:

- `search_repo_registry`
- `id`
- `provider` (`github`)
- `owner`
- `name`
- `default_branch`
- `is_enabled`
- `tier` (`hot|warm|cold`)
- `last_seen_head_sha`
- `last_indexed_head_sha`
- `last_synced_at`
- `last_indexed_at`
- `status` (`ready|syncing|indexing|not_indexed|failed`)

- `search_jobs`
- `id`
- `repo_id`
- `job_type` (`sync|index`)
- `priority` (`interactive|normal|backfill`)
- `status` (`queued|running|done|failed`)
- `attempt`
- `error`
- `created_at`
- `updated_at`

- `search_index_builds`
- `id`
- `build_version`
- `repo_count`
- `started_at`
- `finished_at`
- `status`
- `manifest_r2_key`


🛠️ Refactor suggestion | 🟠 Major

Enhance D1 schema with constraints, indexes, and relationships.

The proposed schema is missing several database design elements that are essential for data integrity and performance:

Missing constraints:

  • No primary key constraints specified
  • No foreign key constraint for search_jobs.repo_id → search_repo_registry.id
  • No unique constraint on search_repo_registry(provider, owner, name)
  • No enum constraints for status fields

Missing indexes:

  • search_repo_registry: index on (provider, owner, name) for lookups
  • search_repo_registry: index on tier and status for job scheduling
  • search_jobs: index on (repo_id, status) for status queries
  • search_jobs: index on created_at for queue processing

Missing fields:

  • search_repo_registry should have created_at and updated_at for auditing
  • Consider adding size_bytes to track repository size for cost controls

Unclear design:

  • search_index_builds appears to track global builds, but the indexing model (line 73-74) mentions both batch and per-repo event-triggered rebuilds. Clarify if this table tracks only global builds or also per-repo builds.
📊 Proposed schema enhancements
-- Note: D1 is SQLite, whose ALTER TABLE cannot add constraints after the fact,
-- so declare them at CREATE TABLE time instead:
CREATE TABLE search_repo_registry (
  id INTEGER PRIMARY KEY,
  provider TEXT NOT NULL, owner TEXT NOT NULL, name TEXT NOT NULL,
  tier TEXT NOT NULL CHECK (tier IN ('hot','warm','cold')),
  status TEXT NOT NULL CHECK (status IN ('ready','syncing','indexing','not_indexed','failed')),
  -- ...remaining Section 8 columns; search_jobs.repo_id likewise gets
  -- REFERENCES search_repo_registry(id) inline...
  UNIQUE (provider, owner, name)
);
CREATE INDEX idx_repo_status ON search_repo_registry(status, tier);
CREATE INDEX idx_job_status ON search_jobs(repo_id, status);

Comment on lines +130 to +163
## 9. API Contract (MVP)

`GET /api/search?q=&repo=&path=&lang=&page=`

Response:

- `results`
- `repo_status`
- `partial` (boolean)
- `trace_id`

`POST /api/search/repos`

Body:

- `provider`
- `owner`
- `name`

Behavior:

- validates access/policy
- inserts or updates `search_repo_registry`
- enqueues bootstrap sync+index job
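
The upsert-then-enqueue behavior above could be sketched in a Worker handler roughly as follows. The binding names (`DB`, `JOBS`), the conflict action, and a `UNIQUE (provider, owner, name)` constraint on the registry are assumptions for illustration, not part of this RFC:

```typescript
// Sketch only: assumes a D1 binding `DB`, a Queues binding `JOBS`, and a
// UNIQUE(provider, owner, name) constraint on search_repo_registry.
type Env = {
  DB: { prepare(sql: string): { bind(...a: unknown[]): { run(): Promise<unknown> } } };
  JOBS: { send(msg: unknown): Promise<void> };
};

export async function onboardRepo(
  env: Env,
  body: { provider: string; owner: string; name: string },
): Promise<void> {
  // Insert or update the registry row so re-onboarding is idempotent.
  await env.DB.prepare(
    `INSERT INTO search_repo_registry (provider, owner, name, status)
     VALUES (?1, ?2, ?3, 'not_indexed')
     ON CONFLICT (provider, owner, name) DO UPDATE SET status = 'syncing'`,
  )
    .bind(body.provider, body.owner, body.name)
    .run();

  // Enqueue the bootstrap sync+index job.
  await env.JOBS.send({ job_type: "sync", priority: "interactive", repo: body });
}
```

Access/policy validation would run before this function is called, so a rejected onboarding never touches the registry.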

`GET /api/search/repos/:id/status`

Response:

- current lifecycle state
- last indexed commit
- staleness seconds
- latest error (if any)
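
One way to derive the staleness field from `last_indexed_at` (the helper name is hypothetical):

```typescript
// Age of the index in whole seconds, clamped at zero for clock skew.
export function stalenessSeconds(lastIndexedAt: string, now: Date = new Date()): number {
  const age = (now.getTime() - new Date(lastIndexedAt).getTime()) / 1000;
  return Math.max(0, Math.floor(age));
}
```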

🛠️ Refactor suggestion | 🟠 Major

Add pagination controls and complete API response schemas.

The API contract has several gaps:

Missing from search endpoint (line 132):

  • limit or per_page parameter to control result count
  • Maximum allowed page size to prevent excessive resource usage

Missing from repo onboarding endpoint (lines 141-153):

  • Response schema (success/error structure)
  • Status codes for different scenarios (created vs. already exists)
  • Validation error format

Missing globally:

  • API versioning strategy (e.g., /v1/api/search)
  • Error response schema with consistent structure
  • Rate limit headers in responses
🔧 Proposed additions
`GET /api/search?q=&repo=&path=&lang=&page=&limit=`

Additional parameters:
- `limit`: max results per page (default: 50, max: 100)

`POST /api/search/repos`

Response (201 Created):
- `id`: repository ID
- `status`: current status
- `job_id`: bootstrap job ID

Response (400 Bad Request):
- `error`: validation error message
- `field`: problematic field
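
The clamping the suggestion above calls for could look like this in the Worker (function name hypothetical; default 50 and max 100 taken from the proposed parameters):

```typescript
// Parse `page` and `limit` from the query string; limit defaults to 50, capped at 100.
export function parsePagination(rawUrl: string): { page: number; limit: number } {
  const params = new URL(rawUrl).searchParams;
  const page = Math.max(1, Number(params.get("page")) || 1);
  const limit = Math.min(100, Math.max(1, Number(params.get("limit")) || 50));
  return { page, limit };
}
```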

Comment on lines +196 to +203
## 12. Security

- Worker enforces authN/authZ before search.
- Private repo access validated at onboarding and query time.
- Livegrep endpoint not exposed publicly.
- Secrets stored in Cloudflare secrets.
- Audit all admin/reindex actions in D1 logs.


⚠️ Potential issue | 🟠 Major

Detail private repository access handling.

Line 199 mentions validating private repo access but doesn't specify the implementation. Consider documenting:

  • How GitHub tokens/OAuth grants are stored, encrypted, and rotated
  • Token refresh mechanism for long-lived access
  • What happens when a user's access to a private repo is revoked (immediate index removal? grace period?)
  • How to handle repository deletion or archival on GitHub
  • How to prevent one tenant from querying another tenant's private repos in a shared index

These details are critical for securing private repository data.
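
For the shared-index isolation point, one possible shape for the query-time check, as a sketch only: assume the Worker resolves a per-tenant set of authorized repo slugs before the query reaches Livegrep.

```typescript
// Restrict a search to repos the tenant may read; applied in the Worker
// before the query is forwarded to the (non-public) Livegrep backend.
export function authorizedRepoFilter(requested: string[], allowed: Set<string>): string[] {
  // No explicit repo filter means "everything this tenant can see".
  if (requested.length === 0) return [...allowed];
  return requested.filter((slug) => allowed.has(slug));
}
```

Repos outside the tenant's set are silently dropped rather than erroring, so a query cannot be used to probe for the existence of another tenant's private repos.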


@tembo tembo Bot closed this Apr 23, 2026

Labels

tembo Pull request created by Tembo
