ParsedArtifact model #45

@AymanL

Description

Parent PRD

#33

What to build

Introduce the ParsedArtifact model as the single parse output per Document (see PRD §Implementation Decisions: ParsedArtifact fields). The model stores the raw Docling output, the postprocessed text, a snapshot of the extracted metadata for audit, and the parser config used. Scope is limited to the model, its migration, and a minimal admin registration; the pipeline is not rewired in this issue.

Acceptance criteria

  • ParsedArtifact model exists with: OneToOneField → Document (enforced at DB level), docling_output JSONField (raw Docling JSON), postprocessed_text TextField, metadata_extracted JSONField (snapshot of parser-extracted metadata, used for audit and reconciliation), parser_config JSONField (Docling parameters and model versions).
  • Uniqueness constraint enforced at DB level: at most one ParsedArtifact per Document.
  • Migration is generated and applies cleanly.
  • ParsedArtifact is registered in Django admin (minimal).
  • Tests: creating a second ParsedArtifact for the same Document raises an integrity error, and all four data fields (docling_output, postprocessed_text, metadata_extracted, parser_config) are writable and retrievable.

Blocked by

User stories addressed

Reference by number from the parent PRD:

  • User story 12 (one ParsedArtifact per Document, enforced in schema)
  • User story 13 (ParsedArtifact stores docling JSON, postprocessed text, metadata_extracted, parser_config)
  • User story 14 (metadata reconciliation audit trail via metadata_extracted)
