Introduce Document as canonical bibliographic record, replacing FileMetadata #48

Open
AymanL wants to merge 2 commits into main from 44-document-model

AymanL (Collaborator) commented Apr 10, 2026

Summary

  • Replaces FileMetadata (a file-centric metadata bag) with Document, a canonical bibliographic record (title, DOI, external_ids JSON) that exists independently of how many times the paper was fetched
  • SourceFile now carries a nullable FK to Document: it is set after parsing and allows multiple fetches of the same paper to converge on one record
  • DocumentChunk points to Document directly; the redundant SourceFile FK is removed (path is chunk -> document -> source_file)
  • Partial unique constraint on Document.doi (non-empty only) prevents duplicate records while allowing DOI-less imports
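
The partial unique constraint described above can be illustrated at the SQL level. This is a minimal sketch using the standard-library `sqlite3` module, not the migration from this PR: the table layout, the index name `uniq_nonempty_doi`, and the helper `insert_document` are assumptions made for the example.

```python
import sqlite3

# In-memory database; the schema is a simplified, hypothetical Document table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document ("
    "id INTEGER PRIMARY KEY, "
    "title TEXT NOT NULL, "
    "doi TEXT NOT NULL DEFAULT '')"
)
# Partial unique index: uniqueness is enforced only for non-empty DOIs.
conn.execute(
    "CREATE UNIQUE INDEX uniq_nonempty_doi ON document (doi) WHERE doi <> ''"
)


def insert_document(title, doi=""):
    """Insert one row and return its primary key."""
    cur = conn.execute(
        "INSERT INTO document (title, doi) VALUES (?, ?)", (title, doi)
    )
    return cur.lastrowid


# Two DOI-less rows coexist because the index ignores empty DOIs.
insert_document("Report A")
insert_document("Report B")
insert_document("Paper", "10.9999/meta")

# A second row with the same non-empty DOI violates the index.
try:
    insert_document("Paper again", "10.9999/meta")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]
```

Here the duplicate DOI insert fails while both DOI-less rows are kept, which is the behaviour the test plan checks for.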

Test plan

  • test_document_requires_title: DB rejects NULL title
  • test_duplicate_nonempty_doi_rejected: unique constraint fires on duplicate non-empty DOI
  • test_multiple_documents_without_doi_allowed: empty DOI rows are not constrained
  • test_document_created_without_source_file: Document can exist before any file is fetched
  • test_document_chunk_requires_document: chunk FK is NOT NULL
  • test_document_chunk_linked_to_document: document.chunks reverse relation works
  • Run manage.py migrate on a clean DB and verify no errors

AymanL self-assigned this Apr 10, 2026
AymanL linked an issue Apr 10, 2026 that may be closed by this pull request
7 tasks
cgoudet (Collaborator) left a comment

I didn't have time to finish, but I already have a few comments.


     def __str__(self):
-        return f"Metadata for {self.source_file_id}"
+        return self.title[:self._TITLE_DISPLAY_LENGTH]

You added logic for when the chunk is too long to display, but not here. It would be good to add it.
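
One possible way to share that truncation logic between the chunk and the title is a small helper. This is only a sketch under assumptions: `truncate_for_display`, the default limit of 50 characters, and the `...` suffix are hypothetical and are not taken from this PR.

```python
def truncate_for_display(text: str, max_length: int = 50, suffix: str = "...") -> str:
    """Return text unchanged if it fits, otherwise cut it and append a suffix."""
    if max_length < len(suffix):
        raise ValueError("max_length must be at least len(suffix)")
    if len(text) <= max_length:
        return text
    # Reserve room for the suffix so the result never exceeds max_length.
    return text[: max_length - len(suffix)] + suffix


short_title = truncate_for_display("Report A")
long_title = truncate_for_display("A" * 80, max_length=20)
```

A `__str__` method could then return `truncate_for_display(self.title)`, so a shortened title is visibly marked as shortened.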

assert not s3_fn.exists()


# --- Document: title constraint ---

You can group related tests in a pytest class instead of using a comment.
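
As a rough illustration of that suggestion, related tests can live in a class whose name starts with `Test`, which pytest collects without any extra setup. In this sketch `FakeDocument` is a hypothetical stand-in for the real model so that it runs without Django, and the second test name is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for the Document model, for illustration only."""

    title: str
    doi: str = ""

    def __post_init__(self):
        # Stand-in for the database constraint on the title.
        if not self.title:
            raise ValueError("title is required")


class TestDocumentTitle:
    """Groups the title tests that a comment banner would otherwise delimit."""

    def test_document_requires_title(self):
        try:
            FakeDocument(title="")
        except ValueError:
            return
        raise AssertionError("empty title was accepted")

    def test_document_keeps_title(self):
        assert FakeDocument(title="Report A").title == "Report A"
```

With the real model, a mark such as `@pytest.mark.django_db` can decorate the class once instead of each test function.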



@pytest.mark.django_db
def test_document_requires_title():

It would be great to show in the flow diagram where we want to create this document, since the title may not be something we have right away in the process.

@pytest.mark.django_db
def test_multiple_documents_without_doi_allowed():
    """Multiple Documents with empty DOI are allowed (partial unique constraint)."""
    Document.objects.create(title="Report A", doi="")

So "" and None are treated as the same thing at creation?

def test_document_created_without_source_file():
    """A Document can exist without a linked SourceFile."""
    doc = Document.objects.create(title="Metadata-only paper", doi="10.9999/meta")
    assert doc.pk is not None

Also test that the source file is actually null?

cgoudet (Collaborator) left a comment

Overall this is great. Just a few small style comments.

model = Document

source_file = factory.SubFactory(SourceFileFactory)
tags_pubmed = factory.LazyFunction(list)

Could you create tickets to gradually add the article metadata to Document?

cgoudet (Collaborator) commented Apr 11, 2026

And I forgot to state the obvious: the pre-commit checks are failing.

Development

Successfully merging this pull request may close these issues.

Document model + DocumentChunk FK + FileMetadata removal

2 participants