PRD: Canonical document model (DB schema + migrations)
Problem Statement
EU Fact Force needs a stable canonical representation of a document in the database so ingestion, parsing, embeddings, search, and future features (claims, graph, trends) can share one notion of "a document." Today ingestion centers on SourceFile (storage + status) and DocumentChunk tied only to files. That does not separate logical document identity (DOI or not, national reports, uploads) from physical file artifacts, and it does not capture where metadata came from or ingestion lineage for debugging and audit.
Without this layer, later work (DOI fetch, upload paths, parsing, chunking) risks inconsistent data and expensive refactors.
Solution
Introduce a canonical Document model as the aggregate root for bibliographic and product-facing metadata, with relational polymorphism for future source types (not Django model inheritance). Keep SourceFile as the model for stored binary assets (e.g. PDF). Add IngestionRun to record each ingestion attempt with full stage and success-kind tracking. Add ParsedArtifact for the single parse output per document, storing the raw Docling output, postprocessed text, extracted metadata snapshot, and parser config. Persist raw provider payloads on IngestionRun so mappings can evolve without losing source truth.
Two entry points are supported: DOI-initiated ingestion (with upfront DOI validation) and direct PDF upload. Both converge on the same parse → metadata reconciliation → chunk pipeline when a PDF is available. DOI-initiated ingestion without a PDF produces a metadata-only document.
Deliver this as database models and migrations (plus minimal admin registration for visibility). Do not fully rewire the ingestion pipeline in this change; that is follow-up work.
Ingestion Flow
```mermaid
flowchart TD
    Start([Start ingestion]) --> Input{Input type?}
    Input -->|DOI| R0[Check DOI]
    R0 -->|DOI unique and valid| R1["IngestionRun: set status=running, stage=acquire"]
    R0 -->|DOI duplicate or invalid| R1prime["STOP: invalid or duplicate DOI"]
    R1 --> D1["Document: set DOI"]
    D1 --> Fetch["Fetch metadata APIs + attempt PDF acquisition"]
    Fetch --> PDF{PDF found?}
    PDF -->|No| Fill["Write normalized metadata on Document"]
    Fill --> Skip["No SourceFile / ParsedArtifact / Chunks since no PDF"]
    Skip --> OKmeta["IngestionRun: success, success_kind=metadata_only, stage=done"]
    PDF -->|Yes| SFdoi["SourceFile: store blob, advance stage to store"]
    Input -->|PDF upload| R2["IngestionRun: status=running, stage=acquire"]
    R2 --> SFup["SourceFile: create from upload"]
    SFup --> Dpdf["Document: link SourceFile"]
    SFdoi --> Parse["ParsedArtifact: store Docling JSON + postprocessed text + metadata_extracted + parser_config"]
    Dpdf --> Parse
    Parse --> Q{For each metadata field: is API value non-empty?}
    Q -->|Yes| UseAPI["Write API value on Document"]
    Q -->|No| Q2{Parsed value non-empty?}
    Q2 -->|Yes| UseParse["Write parsed value on Document"]
    Q2 -->|No| UseNull["Set field NULL on Document"]
    UseAPI --> Audit["If API ≠ parsed: keep snapshot in ParsedArtifact.metadata_extracted"]
    UseParse --> Audit
    UseNull --> Audit
    Audit --> More{More normalized fields?}
    More -->|Yes| Q
    More -->|No| Chunk["DocumentChunks + embeddings (stages chunk → embed)"]
    Chunk --> OKfull["IngestionRun: success, success_kind=full, stage=done"]
    Fetch -.->|error| Fail(["IngestionRun: status=failed"])
    Parse -.->|error| Fail
    Chunk -.->|error| Fail
```
User Stories
As a backend developer, I want a Document table with a required non-empty title, so that every canonical record has a human-readable label even when other metadata is incomplete.
As a backend developer, I want Document to support optional DOI and other external identifiers, so that national reports and uploads without DOIs are first-class.
As a backend developer, I want to create a Document with metadata only before any PDF is available, so that workflows can record bibliographic data as soon as it is known.
As a backend developer, I want SourceFile to represent only the stored file (e.g. S3 key, status), so that file lifecycle stays separate from canonical metadata.
As a backend developer, I want Document to link to at most one SourceFile when a file exists, so that the "metadata-only" and "file attached" states are explicit.
As a backend developer, I want deleting a SourceFile to delete the related Document when that document is tied to that file, so that storage cleanup matches the agreed cascade semantics (see implementation decisions for nullable vs attached cases).
As a backend developer, I want an IngestionRun row created at the very beginning of ingestion (before any fetch or file operation), so that every attempt is recorded even if it fails immediately.
As a backend developer, I want IngestionRun to expose a stage field that reflects the last-reached pipeline stage (acquire | store | parse | chunk | done), so that I can pinpoint exactly where a failed run stopped.
As a backend developer, I want IngestionRun to expose a success_kind field (metadata_only | full) on success, so that I can distinguish runs that produced a full document from those that produced metadata only.
As a backend developer, I want raw provider API responses stored on IngestionRun, so that I can reprocess or audit without re-fetching from external APIs.
As a backend developer, I want DOI format and uniqueness validated before an IngestionRun is created, so that duplicate or malformed DOIs are rejected early with a clear error.
As a backend developer, I want exactly one ParsedArtifact per Document (enforced in schema), so that there is no ambiguity about "which parse is current."
As a backend developer, I want ParsedArtifact to store the raw Docling JSON, postprocessed text, a metadata_extracted snapshot, and the parser_config used, so that parsing is auditable and reproducible.
As a backend developer, I want normalized metadata written to Document using a deterministic priority rule (API value > parsed value > NULL), with divergences preserved in ParsedArtifact.metadata_extracted, so that the canonical record reflects the best available source while the audit trail is complete.
As a backend developer, I want DocumentChunk to require a Document, so that retrieval and future evidence linking are anchored on the canonical document.
Relationship to draft research catalog model
A separate draft shared on Mattermost sketches entities such as ResearchPaper, Author, reference tables for document type and evidence hierarchy, Theme, Keywords, and chunk-level embedding/citation concepts. That draft is not final and describes a broader research-catalog and taxonomy layer than this work.
The codebase uses Document as the canonical entity name instead of ResearchPaper. This is a deliberate choice for clarity and consistency with the ingestion roadmap; it does not change the intended role of the table as the logical "paper" or publication record.
Scope: This PRD remains limited to the ingestion spine (canonical document, stored file, lineage, raw provider payload, single parse artifact, chunks). Normalized catalog concerns—including DocType, HierarchyOfEvidence, Journal, Author, Keywords, and theme assignment—are out of scope here and are expected in follow-up work once stable document and chunk identifiers exist.
Themes: The draft shows a single topic/theme identifier per paper; the product direction is to support many themes per document when applicable (e.g. a many-to-many relationship in a later schema). This PRD does not implement theme tables or links.
Implementation Decisions
Aggregate root: Document is the canonical entity for bibliographic/product metadata; SourceFile remains a physical artifact.
Polymorphism: Use relational modeling (typed fields / related tables later), not multi-table Django inheritance for SourceFile subclasses.
Identifiers: DOI is optional; support additional IDs via a structured field (e.g. JSON) for provider-specific keys (PMID, arXiv, internal ids). DOI format and uniqueness are validated before the IngestionRun is created; invalid or duplicate DOIs are rejected immediately.
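A sketch of the upfront validation. The regex follows the commonly used `10.xxxx/suffix` pattern for modern DOIs; the duplicate check is modeled here as a set lookup, whereas the real flow would query existing Document rows. The helper name `validate_doi` is hypothetical.

```python
import re

# Pattern for modern DOIs: "10." + 4-9 digit registrant code + "/" + suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$", re.IGNORECASE)


def validate_doi(doi: str, existing_dois: set[str]) -> str:
    """Normalize and validate a DOI before any IngestionRun is created.

    Raises ValueError for malformed input, LookupError for duplicates.
    """
    normalized = doi.strip().lower()
    # Strip common resolver prefixes pasted by users.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if normalized.startswith(prefix):
            normalized = normalized[len(prefix):]
            break
    if not DOI_RE.match(normalized):
        raise ValueError(f"Malformed DOI: {doi!r}")
    if normalized in existing_dois:
        raise LookupError(f"Duplicate DOI: {normalized}")
    return normalized
```

In production the duplicate check would be a `Document.objects.filter(doi=...)` query, and the clear per-case exceptions let the caller reject the request before creating any rows.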
Title: Required, non-null and non-blank at the database level.
Partial ingest: Allow Document rows with title and other allowed fields before any SourceFile is given; attachment is optional until the file exists.
Entry points: Two supported paths:
DOI-initiated: validate DOI -> create IngestionRun -> create Document (DOI set) -> fetch metadata APIs + attempt PDF acquisition -> if no PDF: write normalized metadata and close as metadata_only; if PDF found: continue to full pipeline.
PDF upload: create IngestionRun -> create SourceFile from upload -> link Document to SourceFile -> full parse pipeline.
IngestionRun lifecycle: Created at the very start of ingestion (before any fetch or file operation). Fields:
status: running | success | failed
stage: last-reached pipeline stage — acquire | store | parse | chunk | done — used to pinpoint where a failed run stopped
success_kind: metadata_only | full (set on success; null otherwise)
input_type: doi | pdf_upload
input_identifier: DOI string or upload reference
provider: metadata API(s) used, if applicable
raw_provider_payload: verbatim API response (JSON); stored here, not on Document
error_message and error_stage for debugging failed runs
pipeline_version: version tag for reproducibility
Timestamps: created_at, updated_at
ParsedArtifact fields: 1:1 with Document (enforced uniqueness). Stores:
docling_output: raw Docling JSON
postprocessed_text: text after postprocessing
metadata_extracted: structured snapshot of metadata as extracted by the parser — serves as audit record and fallback when API metadata is absent or diverges from parsed values
parser_config: Docling parameters and model versions used, for reproducibility
Re-parsing in the future replaces the same logical row or requires a follow-up PRD.
Metadata reconciliation: When writing normalized metadata to Document, apply a deterministic priority: API value (non-empty) > parsed value (non-empty) > NULL. When the API value and parsed value diverge, the parsed value is preserved in ParsedArtifact.metadata_extracted for audit. The reconciliation priority across multiple API providers (e.g. CrossRef vs PubMed) will be defined with the data-acquisition team in a follow-up.
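The per-field priority rule can be captured in a small pure function. This is a sketch; the helper name `reconcile_field` is hypothetical, and real code would iterate over the normalized field list when writing the Document.

```python
def reconcile_field(api_value, parsed_value):
    """Deterministic priority: non-empty API value > non-empty parsed value > None.

    Returns (canonical_value, divergence), where divergence is the parsed
    value to flag in ParsedArtifact.metadata_extracted when both sources
    are non-empty and disagree, else None.
    """
    def non_empty(value):
        return value is not None and str(value).strip() != ""

    if non_empty(api_value):
        diverges = non_empty(parsed_value) and parsed_value != api_value
        return api_value, (parsed_value if diverges else None)
    if non_empty(parsed_value):
        return parsed_value, None
    return None, None
```

For example, `reconcile_field("API Title", "Parsed Title")` keeps the API value on the Document while flagging the parsed value for the audit snapshot.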
Chunks: DocumentChunk must have a required foreign key to Document; adjust or retain linkage to SourceFile / parse artifact as needed for migration continuity and provenance.
Cascade deletion: Deleting SourceFile deletes the associated Document when the document is defined as dependent on that file—implemented with Django on_delete and nullable FK where metadata-only documents exist without a file.
Constraints: Strict nullability and uniqueness rules in migrations (e.g. partial unique constraint on DOI when non-empty).
Admin: Light registration for new models; richer admin later.
Scope boundary: Models and migrations only for this PRD; ingestion services, views, embedding, and chunking do not need to be fully migrated in this deliverable, but migrations should remain applicable to existing deployments.
Testing Decisions
Good tests assert observable database behavior: constraints (unique DOI when present), required fields (title), cascade behavior, and relationship cardinality (one parse per document), not internal implementation details of helpers.
Test IngestionRun stage and success_kind transitions: assert that a metadata-only run lands on success_kind=metadata_only, stage=done and a full run on success_kind=full, stage=done.
Test metadata reconciliation priority: assert that an API-provided value overwrites a parsed value, that a parsed value is used when the API value is empty, and that ParsedArtifact.metadata_extracted captures the divergence when both are non-empty and differ.
Test DOI validation: assert that a duplicate or malformed DOI is rejected before any IngestionRun row is created.
Modules to test: Model-level behaviour via Django's ORM and migrations (integration-style tests in the existing test suite pattern).
Existing ingestion tests cover models, services, and pipeline runs in the repository's test layout; new tests should follow the same pytest + django_db patterns.
Out of Scope
Rewriting run_pipeline, fetch stubs, or upload flows to use Document end-to-end.
Changing search, embedding, or chunking algorithms or APIs.
API contract changes for the web app.
Multi-source metadata reconciliation priority order (to be defined with the data-acquisition team).
Research catalog tables and links (Author, Keywords, Theme, evidence hierarchy, journal normalization, many-to-many themes)—see Relationship to draft research catalog model.
Further Notes
Align naming and relationships with the internal roadmap document that describes Document, IngestionRun, raw assets, parsed artifacts, and chunks as the ingestion spine.
If a future requirement introduces re-parsing with history, the 1:1 ParsedArtifact rule would need revisiting; current expectation is single parse per document over the tool's lifetime.
The metadata reconciliation priority across multiple API providers is deferred to a follow-up discussion with the data-acquisition team; this PRD establishes the field structure and the API-over-parsed rule only.