Skip to content

[Proposal] Generic scheduled scraping + corpus groups (alternative landing path for #1305)#1444

Draft
JSv4 wants to merge 1 commit into
mainfrom
proposal/scheduled-scraping-corpus-groups
Draft

[Proposal] Generic scheduled scraping + corpus groups (alternative landing path for #1305)#1444
JSv4 wants to merge 1 commit into
mainfrom
proposal/scheduled-scraping-corpus-groups

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented May 2, 2026

Summary

Design doc only — no code, no migrations, no tests in this PR. Intended to anchor discussion before any implementation lands.

This proposal extracts the genuinely missing primitives from #1305 (scheduled scraping, multi-corpus retrieval) into reusable OC-native infrastructure, in two sequential phases:

  • Phase A — Scheduled scraping. A generic opencontractserver/scraping/ app: BaseScraper + auto-discovery registry, ScrapedSource + ScrapedDocument models, atomic ingestion service (closes a race window), DB-driven Beat schedules, generic management commands, GraphQL surface with permission gating. PR Add Bolivian Laws RAG service with multi-agent orchestration #1305's three scrapers move into this app verbatim as scraping/scrapers/bolivia/{gaceta,tsj,tcp}.py.

  • Phase B — Corpus Groups + multi-corpus retrieval. A CorpusGroup model bundles N corpora; an async asearch_across_corpora tool searches across them with per-user visibility filtering. Bound to an AgentConfiguration whose system prompt is PR Add Bolivian Laws RAG service with multi-agent orchestration #1305's orchestrator text. The existing ws/agent-chat/?agent_id=X route handles streaming + persistence — no new transport.

Why this approach

PR #1305 is well-built — three working scrapers, defensive parsing with httpx.MockTransport testability, eleven thoughtful specialist personas, and a working orchestrator pattern. The architectural concern is overlap, not implementation quality:

  • Per-corpus personas already exist (Corpus.corpus_agent_instructions) and are auto-injected by CoreCorpusAgentFactory.
  • Streaming chat with citations already exists (UnifiedAgentConsumer over ws/agent-chat/?corpus_id=X).
  • Conversation persistence + permissioning already exist via Conversation.chat_with_corpus + django-guardian.

What OC genuinely lacks today: scheduled scraping into a Corpus (Phase A) and multi-corpus retrieval (Phase B). Once those exist as generic primitives, PR #1305 collapses into ~20 lines of fixture data, and the same pattern works for any future deployment (Brazilian jurisprudence, EU regulations, internal compliance feeds, etc.) without copy-pasting an app.

Migration story for PR #1305

The intent is to credit @jseborga as co-author on the Phase A implementation PR — the three scrapers, dedup approach, persona text, and httpx.MockTransport testing pattern all port over. The full preservation list is in the doc.

PR #1305 stays open as the reference implementation while this proposal is reviewed; once Phase A merges, PR #1305 either closes or rebases into a small fixture PR creating eleven Bolivian corpora + three ScrapedSource rows.

What's in this PR

  • docs/architecture/proposals/0001-scheduled-scraping-and-corpus-groups.md — full design doc, including:
    • Why-not-merge-as-is analysis with overlap matrix
    • Phase A detailed design (models, registry, services, Celery, management commands, GraphQL, admin, tests)
    • Phase B detailed design (CorpusGroup, multi-corpus tool, GraphQL surface)
    • Per-phase verification steps
    • Migration story for PR Add Bolivian Laws RAG service with multi-agent orchestration #1305 with explicit preservation/removal lists
    • Open questions deferred to implementation

Test plan

  • Maintainers + @jseborga react to the proposal in this PR's review thread
  • Resolve open questions (Phase B Conversation binding, optional Bolivia fixture loader)
  • If approved: open a follow-up implementation PR for Phase A with @jseborga credited as co-author

Generated by Claude Code

… yet)

This proposal extracts the genuinely missing primitives from PR #1305
(scheduled scraping, multi-corpus retrieval) into reusable OC-native
infrastructure, in two sequential phases. Phase A is the scraping app;
Phase B is the corpus-group / multi-corpus tool concept.

No implementation in this PR -- design doc only, intended to anchor
discussion with the #1305 contributor before any code lands.
@claude
Copy link
Copy Markdown

claude Bot commented May 2, 2026

Code Review — PR #1444 (Design Doc: Scheduled Scraping + Corpus Groups)

This is a well-constructed architecture proposal. The overlap analysis is honest and accurate, the migration story for PR #1305 is respectful of the contributor's work, and the two-phase split is the right call — Phase A is self-contained and Phase B can safely wait. Notes below are ordered by weight.


Issues worth resolving before implementation

1. schedule_crontab as a raw CharField is fragile

Storing cron strings as CharField(max_length=64) means invalid expressions only fail at Celery beat-sync time, not at save time. Consider either:

  • A FK to django_celery_beat.CrontabSchedule (the model already exists if django-celery-beat is installed), or
  • A clean() / validator that parses the string with croniter or equivalent before saving.

The beat-sync signal should also handle ValueError gracefully and mark the source enabled=False with a logged error rather than crashing the worker.

2. asearch_across_corpora corpus_ids frozen at config time

An AgentConfiguration row that uses this tool stores its corpus_ids in tool config (frozen at config time)

This means adding a new corpus to a CorpusGroup requires manually updating the AgentConfiguration — the group and the agent drift silently. A cleaner approach: the tool receives a corpus_group_id, resolves visible corpora at query time, and the AgentConfiguration only holds the group reference. This stays consistent with how CorpusGroup.visible_to_user filters at read time anyway.

3. Dedup scope is per-source, not global

unique_together = [("source", "pdf_sha256")] means the same PDF can be imported once per source. If a document appears on both TSJ and Gaceta, it gets two Document rows in the corpus. For a legal corpus where exact duplicates carry different provenance metadata this may be intentional — if so, document it explicitly. If dedup across sources is desired, a global pdf_sha256 unique index on ScrapedDocument (or a separate content-hash dedup layer) should be called out.

4. last_run_status should be a TextChoices enum

The free-form "pending|ok|error" pattern is a known drift risk; after a few iterations someone will write "success" instead of "ok". Back this with a TextChoices class in scraping/constants.py from day one, consistent with the project's no-magic-strings principle.


Design questions worth answering before Phase A lands

5. aiter_entries(since=None) on first run

When since is None, does the scraper return its entire history? Gaceta has decades of gazettes. The runner needs a sensible default backfill window (e.g., SCRAPING_DEFAULT_BACKFILL_DAYS = 30) that admins can override per ScrapedSource via extra_config, otherwise first-run can enqueue thousands of PDF fetches.

6. Max-failure threshold in run_scrape

The test spec says "partial failure does not abort batch," which is correct for resilience. The runner design should also specify whether there is a failure-rate threshold above which the batch is aborted and last_run_status is set to "error" (vs. "ok" with some failures). A run that fails 95 out of 100 entries and returns "ok" would be confusing.

7. Beat schedule reload without worker restart

Beat picks up the schedules at next worker restart

django-celery-beat supports dynamic schedule reloading without a restart (the DatabaseScheduler polls periodically). The post_save signal approach in the proposal can write directly to PeriodicTask and the change takes effect within one poll cycle — no restart needed. Worth clarifying this in the doc so admins don't think they need to bounce the worker every time they create a ScrapedSource.

8. Phase B Conversation binding

The open question is flagged but it has architectural weight. Option (b) — Conversation.chat_with_corpus_group FK — requires a migration and a new branch in UnifiedAgentConsumer. Option (a) binding to the first corpus would break conversation retrieval when corpus order changes. Option (c) ephemeral loses conversation history for a product area where it matters (legal Q&A). Recommend committing to option (b) in this doc so Phase B has a clear scope.


Small nits

  • ScrapedSource.visible_to_user gating on target_corpus visibility means a scraper admin who loses read access to a corpus would lose visibility of their own scraper — consider a separate creator-owns-it path in addition to corpus-based visibility.
  • The ScrapedDocumentsource uses on_delete=CASCADE, which means deleting a ScrapedSource wipes the ingestion history even though the actual Document rows survive (via SET_NULL). Provenance is lost silently. PROTECT or a soft-delete flag on ScrapedSource would be safer.
  • Step 7 in the Order of Work ("delete PR Add Bolivian Laws RAG service with multi-agent orchestration #1305's bolivian_laws/ app … in the same diff as Phase A") is a coordination landmine if PR Add Bolivian Laws RAG service with multi-agent orchestration #1305 is still open at that point. The proposal should explicitly state the expected sequencing: Phase A merges first, then PR Add Bolivian Laws RAG service with multi-agent orchestration #1305 is either closed or rebased, not both in one diff.
  • The PR description footer (_Generated by [Claude Code]_) conflicts with the project's CLAUDE.md rule prohibiting Claude attribution in PR artifacts. Please remove it before the PR moves out of Draft.

What's good

  • The overlap matrix is the right tool for this conversation — concrete and easy to verify against the existing codebase.
  • Closing the SHA dedup race window with select_for_update inside transaction.atomic() is the correct fix and worth calling out explicitly as an improvement over PR Add Bolivian Laws RAG service with multi-agent orchestration #1305.
  • The httpx.MockTransport testing pattern from PR Add Bolivian Laws RAG service with multi-agent orchestration #1305 being preserved and generalized is the right call — it's the most portable part of that PR.
  • DB-driven Beat schedules (no hardcoded entries in settings/base.py) is exactly the right pattern for admin-configurable scraping.
  • Deferring out-of-tree entry-point plugin discovery is a good call for v1 scope.

Overall this proposal is ready for maintainer and contributor discussion. Resolving items 1–4 above in the doc before implementation starts would prevent the most likely implementation-time surprises.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant