[Proposal] Generic scheduled scraping + corpus groups (alternative landing path for #1305) #1444
JSv4 wants to merge 1 commit
… yet) This proposal extracts the genuinely missing primitives from PR #1305 (scheduled scraping, multi-corpus retrieval) into reusable OC-native infrastructure, in two sequential phases. Phase A is the scraping app; Phase B is the corpus-group / multi-corpus tool concept. No implementation in this PR -- design doc only, intended to anchor discussion with the #1305 contributor before any code lands.
Code Review: PR #1444 (Design Doc: Scheduled Scraping + Corpus Groups)
This is a well-constructed architecture proposal. The overlap analysis is honest and accurate, the migration story for PR #1305 is respectful of the contributor's work, and the two-phase split is the right call: Phase A is self-contained and Phase B can safely wait. Notes below are ordered by weight.
Issues worth resolving before implementation
1. Storing cron strings as
The beat-sync signal should also handle
2.
This means adding a new corpus to a
3. Dedup scope is per-source, not global
4. The free-form
Design questions worth answering before Phase A lands
5. When
6. Max-failure threshold in
The test spec says "partial failure does not abort batch," which is correct for resilience. The runner design should also specify whether there is a failure-rate threshold above which the batch is aborted and
7. Beat schedule reload without worker restart
8. Phase B Conversation binding
The open question is flagged but it has architectural weight. Option (b)
Small nits
What's good
Overall this proposal is ready for maintainer and contributor discussion. Resolving items 1–4 above in the doc before implementation starts would prevent the most likely implementation-time surprises.
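To make the failure-threshold question in item 6 concrete, here is a minimal sketch of a batch runner that tolerates partial failure but aborts once the observed failure rate crosses a limit. Nothing here comes from the design doc: `run_batch`, `BatchAborted`, `max_failure_rate`, and `min_attempts` are hypothetical names chosen for illustration, and the real runner may want different semantics (for example, marking the source as unhealthy instead of raising).

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


class BatchAborted(Exception):
    """Raised when the failure rate crosses the configured threshold."""


@dataclass
class BatchResult:
    succeeded: list = field(default_factory=list)
    failed: list = field(default_factory=list)  # (item, error message) pairs


def run_batch(
    items: Iterable[str],
    ingest: Callable[[str], None],
    max_failure_rate: float = 0.5,  # hypothetical default
    min_attempts: int = 4,          # avoid aborting on one early failure
) -> BatchResult:
    """Ingest items one by one; a single failure never aborts the batch,
    but a sustained failure rate above the threshold does."""
    result = BatchResult()
    for attempts, item in enumerate(items, start=1):
        try:
            ingest(item)
            result.succeeded.append(item)
        except Exception as exc:  # record the failure and keep going
            result.failed.append((item, str(exc)))
        rate = len(result.failed) / attempts
        if attempts >= min_attempts and rate > max_failure_rate:
            raise BatchAborted(f"{len(result.failed)}/{attempts} items failed")
    return result


def flaky_ingest(url: str) -> None:
    """Stand-in for a real ingestion call; fails for URLs ending in 'bad'."""
    if url.endswith("bad"):
        raise ValueError("boom")
```

With `flaky_ingest`, a batch of four items containing one bad URL completes and reports the single failure, while a batch made up entirely of bad URLs raises `BatchAborted` on the fourth attempt. Whichever policy the doc settles on, stating the threshold and the minimum sample size explicitly keeps the "partial failure does not abort batch" test meaningful.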
Summary
Design doc only — no code, no migrations, no tests in this PR. Intended to anchor discussion before any implementation lands.
This proposal extracts the genuinely missing primitives from #1305 (scheduled scraping, multi-corpus retrieval) into reusable OC-native infrastructure, in two sequential phases:
Phase A: Scheduled scraping. A generic opencontractserver/scraping/ app: BaseScraper + auto-discovery registry, ScrapedSource + ScrapedDocument models, atomic ingestion service (closes a race window), DB-driven Beat schedules, generic management commands, GraphQL surface with permission gating. PR #1305's three scrapers move into this app verbatim as scraping/scrapers/bolivia/{gaceta,tsj,tcp}.py.
Phase B: Corpus Groups + multi-corpus retrieval. A CorpusGroup model bundles N corpora; an async asearch_across_corpora tool searches across them with per-user visibility filtering. Bound to an AgentConfiguration whose system prompt is PR #1305's orchestrator text. The existing ws/agent-chat/?agent_id=X route handles streaming + persistence, so no new transport is needed.
Why this approach
PR #1305 is well-built: three working scrapers, defensive parsing with httpx.MockTransport testability, eleven thoughtful specialist personas, and a working orchestrator pattern. The architectural concern is overlap, not implementation quality:
… Corpus.corpus_agent_instructions) and are auto-injected by CoreCorpusAgentFactory.
… UnifiedAgentConsumer over ws/agent-chat/?corpus_id=X).
… Conversation.chat_with_corpus + django-guardian.
What OC genuinely lacks today: scheduled scraping into a Corpus (Phase A) and multi-corpus retrieval (Phase B). Once those exist as generic primitives, PR #1305 collapses into ~20 lines of fixture data, and the same pattern works for any future deployment (Brazilian jurisprudence, EU regulations, internal compliance feeds, etc.) without copy-pasting an app.
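As a rough illustration of the Phase B primitive, the sketch below shows one way a multi-corpus search with per-user visibility filtering could be shaped. Only the name asearch_across_corpora comes from the proposal; the signature, the `Hit` type, `search_one`, and the in-memory `VISIBLE` and `INDEX` tables are assumptions standing in for the real permission layer and vector store, which the design doc would define.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    corpus_id: int
    text: str
    score: float


# Hypothetical stand-ins: which corpora each user may read, and a toy index.
VISIBLE = {"alice": {1, 2}, "bob": {2}}
INDEX = {
    1: [("labor code art. 12", 0.9)],
    2: [("tax ruling 44", 0.7)],
    3: [("sealed memo", 0.99)],
}


async def search_one(corpus_id: int, query: str) -> list:
    """Toy per-corpus search; a real one would rank by the query,
    which this stand-in ignores."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return [Hit(corpus_id, text, score) for text, score in INDEX.get(corpus_id, [])]


async def asearch_across_corpora(user: str, group_corpus_ids: list,
                                 query: str, limit: int = 10) -> list:
    """Search every corpus in the group that the user can see,
    concurrently, and merge the hits by descending score."""
    # Filter BEFORE searching so hidden corpora are never queried.
    allowed = [cid for cid in group_corpus_ids if cid in VISIBLE.get(user, set())]
    per_corpus = await asyncio.gather(*(search_one(cid, query) for cid in allowed))
    merged = [hit for hits in per_corpus for hit in hits]
    return sorted(merged, key=lambda h: h.score, reverse=True)[:limit]
```

Calling `asyncio.run(asearch_across_corpora("alice", [1, 2, 3], "despido"))` returns hits from corpora 1 and 2 only, even though corpus 3 holds the highest-scoring entry. Filtering before the fan-out, rather than after merging, is the design choice worth noting: it keeps content from non-visible corpora out of the result path entirely.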
Migration story for PR #1305
The intent is to credit @jseborga as co-author on the Phase A implementation PR: the three scrapers, dedup approach, persona text, and httpx.MockTransport testing pattern all port over. The full preservation list is in the doc.
PR #1305 stays open as the reference implementation while this proposal is reviewed; once Phase A merges, PR #1305 either closes or rebases into a small fixture PR creating eleven Bolivian corpora + three ScrapedSource rows.
What's in this PR
docs/architecture/proposals/0001-scheduled-scraping-and-corpus-groups.md, the full design doc, including:
Test plan
Generated by Claude Code