Skip to content

One-click collection-intelligence setup#1982

Open
JSv4 wants to merge 9 commits into
mainfrom
feature/corpus-intelligence-setup
Open

One-click collection-intelligence setup#1982
JSv4 wants to merge 9 commits into
mainfrom
feature/corpus-intelligence-setup

Conversation

@JSv4

@JSv4 JSv4 commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Problem

A freshly created corpus lands with an unreadable document index: titles all share the same truncated prefix and Document.description holds whatever the importer wrote (raw metadata dumps like **filing_type:** S-1 **doc_role:** exhibit…), and summary coverage sits at 0%. All the machinery to fix this already exists — the reference-web analyzer, and the Document Description Updater / Document Summary Generator action templates from the Action Library — but nothing composed it at corpus setup, and nothing batch-ran the LLM actions over documents already in the corpus.

Before (documents drawer) After (LLM descriptions)
**filing_type:** S-1 **doc_role:** exhibit **exhibit_number:** 10.(2)(… "This document is the primary S-1 registration statement filed by Fervo Energy Company with the SEC on April 17, 2026…"

What this adds

CorpusIntelligenceSetupService (opencontractserver/corpuses/services/intelligence_setup.py) — one idempotent composite:

  1. Deterministic: installs the reference-enrichment add_document CorpusAction (same row the governance-graph CTA creates) and starts the first weave immediately.
  2. LLM: clones the description + summary templates (bundle pinned in INTELLIGENCE_SETUP_TEMPLATE_NAMES) and batch-runs each over every existing document.

Re-running converges: existing rows are reused, already-run documents are skipped, an in-flight weave isn't duplicated. Per-template failures (e.g. the BATCH_RUN_MAX_DOCS cap) surface in the summary without failing the whole call.

GraphQL: setupCorpusIntelligence mutation + corpusIntelligenceSetupStatus query; createCorpus now returns objId so follow-ups can chain off creation.

Frontend (both entry points):

  • IntelligenceSetupBanner mounted inside IntelligencePanel — appears on the intelligence overview and the insight-panel CAML embed, offers "Set up", reports the queued fan-out in a toast, and hides once the bundle is installed.
  • New Corpus modal gains a default-on "Set up collection intelligence" opt-in that chains the mutation after creation.

Verification

  • test_intelligence_setup.py: service install/idempotence/permission-gating/status + GraphQL smoke (7 tests).
  • IntelligenceSetupBanner.ct.tsx: offer → run → hide, and silent-when-set-up (2 CT tests, doc screenshot).
  • Live-proven on the dev stack: a fresh 2-doc corpus flips not-set-up → fully-set-up in one call (weave started, 4 agent runs queued); the 37-doc Fervo demo corpus's index is now human-readable after the same machinery drained.

Stacked on #1977.

New corpora landed with unreadable document indexes (raw import metadata as
descriptions, 0% summary coverage) because nothing composed the existing
enrichment machinery at corpus setup: the reference-web CTA installed only
the deterministic half, and the LLM action templates (descriptions,
summaries) sat in the Action Library waiting to be manually added and
batch-run.

CorpusIntelligenceSetupService composes the default bundle in one
idempotent call: installs the reference-enrichment add_document action and
starts the first weave, clones the Document Description Updater + Document
Summary Generator templates, and batch-runs each over every document
already present. Exposed as the setupCorpusIntelligence mutation +
corpusIntelligenceSetupStatus query; createCorpus now returns objId so the
New Corpus modal's default-on opt-in can chain setup after creation. An
IntelligenceSetupBanner inside IntelligencePanel offers setup on both the
intelligence overview and the insight-panel CAML embed, and disappears once
the bundle is installed.

Live-proven on the dev stack: status flips not-set-up -> fully-set-up, the
weave starts, and the description/summary agents drain over existing docs
(the Fervo demo corpus index is now human-readable).
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Code Review — PR #1982: One-click collection-intelligence setup

Overall: Solid feature addition. The service architecture is clean, idempotency is handled correctly at every layer, and permission semantics follow project conventions. A few items worth addressing before merge.


What it does (summary)

  • CorpusIntelligenceSetupService — an idempotent composite that installs the reference-enrichment add_document action, starts the first weave, and clones plus batch-runs the description/summary templates over existing documents.
  • GraphQL: setupCorpusIntelligence mutation + corpusIntelligenceSetupStatus query.
  • Frontend: IntelligenceSetupBanner (mounts inside IntelligencePanel, hides once fully set up) and a default-on opt-in in the corpus-creation modal.

Issues to Address

1. Backend createCorpus must return objId — confirm this is covered

frontend/src/graphql/mutations.ts adds objId to the CREATE_CORPUS query, and Corpuses.tsx reads data.data.createCorpus.objId to chain the setup call. The diff does not show a corresponding backend change to corpus_mutations.py's createCorpus resolver returning obj_id. If that field is absent from the backend response, newCorpusId is always undefined and the opt-in setup silently never fires.

Please confirm (or point to the base-branch commit) that createCorpus already emits obj_id before this PR lands.

2. Private model method: corpus._get_active_documents()

# intelligence_setup.py — setup() method
total_active_documents=corpus._get_active_documents().count(),

This calls an underscore-prefixed method, which by convention is private API. If it is renamed or removed, this breaks silently at runtime (no static analysis will catch it). Either expose a public counterpart on Corpus or reach for the service layer's own queryset helpers.

3. No rate limit on the setupCorpusIntelligence mutation

The status query correctly applies READ_LIGHT throttling. The mutation kicks off potentially dozens of agent runs, but carries no @graphql_ratelimit_dynamic decorator. A user with a large corpus could call this in a tight loop and fan out a lot of background work. Worth adding a WRITE_HEAVY (or equivalent) gate consistent with other mutations that trigger Celery work.


Minor Observations

Race condition in _setup_reference_enrichment (acceptable)
The in-flight check and the analysis start are not atomic — another request could slip between the two. Starting a duplicate analysis is recoverable (just wasted work), but a brief comment acknowledging the TOCTOU window is intentional would help future readers.

IntelligenceSetupBanner silently ignores query errors

const { data, refetch } = useQuery(GET_CORPUS_INTELLIGENCE_SETUP_STATUS, ...);
if (!status || status.isFullySetUp) return null;

A network error also satisfies !status, so the banner disappears on transient failures with no feedback. Returning null is fine for a banner, but not destructuring error from useQuery means you cannot distinguish "loading" from "error" if you ever need to log it. Low priority.

Optimistic hide after clicking "Set up"
After click, the banner stays visible with a spinner until refetch() resolves. Since setup is idempotent server-side, hiding optimistically on a successful mutation response (before the refetch) would feel snappier. Not a blocker.

log_action keyword arguments

cls.log_action(
    "Intelligence setup for",
    corpus,
    user,
    reference_started=summary.reference_analysis_started,
    ...
)

BaseService.log_action needs to accept **kwargs for this not to raise TypeError. Quick sanity-check to confirm the base class signature handles these.


What is done well

  • Idempotency at every step: action dedup, batch-run skip-if-already-run, in-flight analysis guard — all verified by tests.
  • Permission semantics are correct: get_or_none (READ) gates visibility; require_permission(UPDATE) gates mutation — consistent with project conventions.
  • IDOR prevention: single ambiguous error message for both "not found" and "no permission".
  • Per-template error isolation: a BATCH_RUN_MAX_DOCS cap or similar failure in one template surfaces in outcome.error without failing the whole call.
  • Soft failure from the modal: intelligence-setup failure emits toast.info, not toast.error, and does not roll back the corpus creation.
  • Tests: service (install / idempotence / permission gating / status), GraphQL smoke, and two CT tests with a doc screenshot. Good coverage for the scope.
  • Changelog fragment present and well-written.

Verdict: Address items 1 through 3 before merge. Everything else is polish.

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

JSv4 added 2 commits June 10, 2026 23:47
…note

- intelligence_setup.py: count active documents through
  CorpusDocumentService.get_corpus_documents (corpus-as-gate; the setup user
  holds UPDATE→READ) instead of reaching into the private
  Corpus._get_active_documents — same set, include_caml=False on both.
- corpus_mutations.py: rate-limit SetupCorpusIntelligence with WRITE_HEAVY,
  matching StartCorpusActionBatchRun (it likewise fans out batch agent runs).
- intelligence_setup.py: document the deliberate non-atomic in-flight check /
  analysis start (duplicate weave is recoverable; the writer is idempotent).

createCorpus already emits obj_id via DRFMutation (base.py:197); no change
needed. log_action accepts **extra, so the kwargs call is safe.
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Code Review: One-click collection-intelligence setup

Overview: This PR adds a well-structured orchestration layer composing existing enrichment machinery (reference-web analyzer, description/summary templates) into a single idempotent setupCorpusIntelligence call. No blockers found, but a few items worth addressing.

Strengths

  • Service layer compliance: CorpusIntelligenceSetupService extends BaseService, uses ServiceResult, defers model imports to prevent circular imports.
  • Idempotency is thorough: action rows deduplicated, already-run documents skipped, in-flight analyses not duplicated. test_setup_is_idempotent validates all three.
  • IDOR prevention is correct: the stranger test returns the same opaque message whether the corpus is missing or the caller lacks permission.
  • Race condition is acknowledged: the comment in _setup_reference_enrichment correctly explains why non-atomic is acceptable.
  • get_corpus_documents corpus-as-gate semantic is the right choice for a pipeline-facing caller.

Issues to address before merge

1. Incomplete TypeScript output type (mutations.ts)

SetupCorpusIntelligenceOutputs.summary is missing referenceActionAlreadyInstalled, which is required=True in the backend CorpusIntelligenceSetupSummaryType. The frontend does not consume it today but the type should mirror the schema.

2. Misleading toast when the batch-run cap is hit (IntelligenceSetupBanner.tsx)

When batch_run_on_corpus hits BATCH_RUN_MAX_DOCS, outcome.queued_count stays 0 but ok=True. The toast is built from sum(t.queuedCount), so when the cap fires on all templates the user sees 'Collection intelligence is set up.' even though nothing was queued. Consider checking payload.summary.templates.some(t => t.error) and showing a toast.warn instead.

Optional cleanup

  • INTELLIGENCE_SETUP_TEMPLATE_NAMES (corpus_actions.py): Declared list[str] but never mutated. tuple[str, ...] is more semantically accurate for a module-level constant.
  • Duplicate mock in the already-set-up CT test: mocks={[statusMock(true), statusMock(true)]} but only one fetch fires. Using two can silently paper over an unexpected extra query.
  • No test for the batch-fail path in _setup_templates: A test patching CorpusActionService.batch_run_on_corpus to return a failure would lock in the error-absorption behaviour.

Summary: Orchestration design, idempotency, permission gating, and service layer usage are all sound. Items 1 and 2 are worth fixing before merge. The rest is optional cleanup.

JSv4 added 3 commits June 11, 2026 00:47
- mutations.ts: add referenceActionAlreadyInstalled to
  SetupCorpusIntelligenceOutputs.summary (interface + query selection) to
  mirror the required backend field.
- IntelligenceSetupBanner: when ok=True but nothing queued AND a template
  carried an error (e.g. BATCH_RUN_MAX_DOCS cap), show toast.warning instead
  of 'Collection intelligence is set up.'
- corpus_actions.py: INTELLIGENCE_SETUP_TEMPLATE_NAMES list -> tuple (module
  constant is never mutated); test compares against list(...) accordingly.
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Code Review: One-click collection-intelligence setup

Overview

Solid orchestration layer that composes existing enrichment machinery into a single idempotent call. The design is clean — service layer owns the logic, GraphQL is a thin adapter, frontend stays simple. The idempotency story is well-reasoned and the IDOR-safe messaging convention is followed correctly throughout.


Issues to Address

1. _setup_templates — unhandled non-IntegrityError exceptions from clone_to_corpus (bug)

intelligence_setup.py:_setup_templates only catches IntegrityError inside the savepoint:

try:
    with transaction.atomic():
        action = template.clone_to_corpus(corpus, creator=user)
except IntegrityError:
    ...

If clone_to_corpus raises anything else (e.g. OperationalError, ValueError), the exception propagates out of the for name in INTELLIGENCE_SETUP_TEMPLATE_NAMES: loop. All remaining templates are aborted and the caller gets a 500 rather than the graceful partial-success response the design promises. The reference action and any already-committed earlier templates are left installed in a half-done state.

Suggested fix — add a broader except after IntegrityError:

except IntegrityError:
    action = CorpusAction.objects.filter(corpus=corpus, source_template=template).first()
    outcome.already_installed = action is not None
except Exception as exc:
    outcome.error = f"Failed to install template: {exc}"
    logger.exception("Intelligence setup: clone failed for %r on corpus %s", name, corpus.pk)
    continue

2. createCorpus.objId — backend availability needs confirmation

mutations.ts adds objId to the CREATE_CORPUS query, and Corpuses.tsx chains trySetupIntelligence off data.data.createCorpus.objId. The diff doesn't show a corresponding backend change to the CreateCorpus graphene mutation. If the field isn't already exposed, it resolves to null, newCorpusId is falsy, and the post-create intelligence setup silently never runs even when the user checked the opt-in.

Please confirm the existing CreateCorpus mutation already returns objId (or note where it was added, e.g. stacked PR #1977).


Minor Issues

3. CT test mock missing referenceActionAlreadyInstalled

IntelligenceSetupBanner.ct.tsx:setupMock — the summary object omits referenceActionAlreadyInstalled, which the SETUP_CORPUS_INTELLIGENCE query requests. Apollo's MockedProvider returns undefined for it. The component doesn't read this field so the test passes, but the mock doesn't match the real server contract.

  summary: {
    referenceAvailable: true,
    referenceActionInstalledNow: true,
+   referenceActionAlreadyInstalled: false,
    referenceAnalysisStarted: true,

4. status query — three DB queries per call

resolve_corpus_intelligence_setup_status issues three queries (reference-action exists, installed names, missing names). Currently mounted once per corpus page load, so this is fine. Worth a brief comment so the cost is visible if polling or corpus-list rendering is considered later.


What's Well Done

  • Idempotency design: every step dedupes — reference action checked before create, IntegrityError savepoint mirrors AddTemplateToCorpus, batch-run skips already-executed docs, in-flight analysis detection. Re-running converges cleanly.
  • Permission model: get_or_none (visibility) + require_permission(UPDATE) + IDOR-safe single error message throughout.
  • Rate limiting: mutation on WRITE_HEAVY, status query on READ_LIGHT.
  • Frontend banner: silent-while-loading, silent-once-set-up, post-create failure surfaces as a toast without tainting the corpus-creation success.
  • Service architecture: logic lives entirely in the service; GraphQL is a thin adapter.
  • Test coverage: service install/idempotence/permission/status + GraphQL smoke + CT banner offer→run→hide. setUpTestData/patch pattern is correct and the seeder mirrors the migration path.
  • Changelog fragment: present and well-structured.

Summary

Two items before merge: the unhandled-exception gap in _setup_templates (real user-visible 500) and confirmation that createCorpus.objId is already exposed server-side (silent setup-skip if not). Items 3–4 are polish.

JSv4 added 3 commits June 11, 2026 01:53
Root cause of the codecov backend-patch miss: test_enrichment_backfill.py and
test_enrichment_writer.py imported config.graphql.schema at MODULE level. Under
--cov instrumentation that builds the graphene schema at collection time and
errors (graphene-django CustomField field-resolution), so the whole file is
dropped and its coverage — including the bootstrap_authority management command
and the enrichment services — never reaches the upload. That is why a
well-tested command showed 0% patch.

Defer the schema import into the _execute helpers (the pattern already used by
test_enrichment_tools.py / test_governance_graph.py), so the build happens at
runtime and the files' coverage is always captured.

Also close the genuinely-uncovered bootstrap_authority branches: unknown
creator, unreadable spec file, and missing 'sections' list (now 100%), plus
relink_corpora_for_keys per-corpus failure isolation and the empty-keys
short-circuit.
Review fixes:
- _setup_templates now contains non-IntegrityError clone failures per
  template (broad except + log + continue) instead of letting them abort the
  loop and return a 500 with earlier templates half-installed — honoring the
  bundle's graceful partial-success contract.
- status(): documented the three-query cost (fine at one-per-page-load;
  revisit if ever polled or rendered per corpus-list row).
- IntelligenceSetupBanner.ct.tsx setupMock now includes
  referenceActionAlreadyInstalled, matching the real SETUP_CORPUS_INTELLIGENCE
  selection set.

Coverage (codecov/patch/Backend + /Frontend both below target):
- Backend: cover intelligence_setup error/edge branches — analyzer not
  registered, in-flight analysis suppresses a duplicate start, failed analysis
  start, inactive template, and the new contained clone-failure path.
- Frontend: cover the banner's error-toast, soft-warning (nothing queued +
  template error), clean set-up, and catch (network error) branches.

Confirmed (no change needed): createCorpus.objId is exposed server-side via
DRFMutation (config/graphql/base.py:197), so the post-create intelligence
setup chain in Corpuses.tsx receives a real id.
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

test

@JSv4 JSv4 changed the base branch from feature/reference-web-ui to main June 11, 2026 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant