
Support PDF file build #110

Open

Mateus-Cordeiro wants to merge 5 commits into main from NEM-346-pdf-support

Conversation

@Mateus-Cordeiro
Collaborator

This PR adds PDF support using Docling.

Changes

PDFPlugin

  • The plugin reads the PDF and converts it using Docling's DocumentConverter.
  • OCR and picture-related processing are explicitly disabled.
  • The output of build_file_context() is the Docling document object (a sketch of this step follows below).
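
A simplified sketch of that conversion step. The pipeline-option names (do_ocr, do_picture_classification, do_picture_description, generate_picture_images) are from current Docling releases and may not match the diff exactly:

```python
from io import BytesIO

from docling.datamodel.base_models import DocumentStream, InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def build_file_context(file_name: str, pdf_bytes: bytes):
    # Disable OCR and picture-related processing, as described above.
    opts = PdfPipelineOptions(
        do_ocr=False,
        do_picture_classification=False,
        do_picture_description=False,
        generate_picture_images=False,
    )
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )
    stream = DocumentStream(name=file_name, stream=BytesIO(pdf_bytes))
    # The resulting DoclingDocument is the file context handed to the chunking service.
    return converter.convert(stream).document
```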

Docling Chunking Service

  • Reusable Docling chunking component (a sketch of the text-chunk flow follows this list).
  • Uses contextualization for text (includes surrounding structure like headings, improving retrieval).
  • Token-budget enforcement.
  • Separate indexing for tables, to improve their usability in dce.
    • Small tables are converted to markdown and embedded that way with a TABLE header.
    • Larger tables are exported to a DataFrame, and multiple row/column-aware chunks are created.
    • Rows that are too large are truncated (I expect this to be very rare).
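
A condensed sketch of the text-chunk flow. The EmbeddableChunk field names and the contextualize(chunk) call mirror the diff; the injected callables are illustrative stand-ins for the actual tokenizer and splitter:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class EmbeddableChunk:
    # What gets embedded vs. what gets shown as the search result.
    embeddable_text: str
    content: str


def chunk_text_items(
    chunks: Iterable[object],
    contextualize: Callable[[object], str],  # e.g. the Docling chunker's contextualize()
    count_tokens: Callable[[str], int],      # e.g. tokenizer.count_tokens
    split: Callable[[str], List[str]],       # e.g. the splitter used as a safety net
    tokens_budget: int,
) -> List[EmbeddableChunk]:
    out: List[EmbeddableChunk] = []
    for chunk in chunks:  # table chunks are handled separately (see Tables below)
        display_text = getattr(chunk, "text", "") or ""
        if not display_text:
            continue  # skip chunks with no usable text (see the review discussion below)

        # Contextualization prepends surrounding structure (headings, captions, ...).
        embed_text = contextualize(chunk)

        # Safety net: contextualization can push a chunk past the token budget.
        if count_tokens(embed_text) <= tokens_budget:
            out.append(EmbeddableChunk(embeddable_text=embed_text, content=display_text))
        else:
            for part in split(embed_text):
                out.append(EmbeddableChunk(embeddable_text=part, content=display_text))
    return out
```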

Tables

If the whole table fits in the token budget, we embed one chunk whose embeddable_text looks like:

TABLE: (caption-if-any)
| col1 | col2 | col3 |
|------|------|------|
| ...  | ...  | ...  |
| ...  | ...  | ...  |

If the markdown would exceed the token budget, we embed multiple chunks. Each chunk repeats the table context so rows are interpretable on their own.

Each embeddable_text looks like:

TABLE: (caption-if-any)
COLUMNS: col1 | col2 | col3 | col4
ROWS:
- v11 | v12 | v13 | v14
- v21 | v22 | v23 | v24
- v31 | v32 | v33 | v34

The display text in both cases is markdown.
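
A hedged sketch of that table logic, reusing the EmbeddableChunk shape from the sketch above. The grouping heuristics are illustrative, and truncation of oversized rows is omitted for brevity:

```python
from typing import Callable, List

import pandas as pd


def chunk_table(
    table_md: str,        # full table exported to markdown (kept as display text)
    df: pd.DataFrame,     # same table exported to a DataFrame
    caption: str,
    count_tokens: Callable[[str], int],
    tokens_budget: int,
) -> List[EmbeddableChunk]:
    header = f"TABLE: {caption}".rstrip()
    fast_embed = f"{header}\n{table_md}"
    if count_tokens(fast_embed) <= tokens_budget:
        # The whole table fits: embed the markdown as a single chunk.
        return [EmbeddableChunk(embeddable_text=fast_embed, content=table_md)]

    # Otherwise emit row groups, each repeating the table context (caption + columns).
    prefix = f"{header}\nCOLUMNS: " + " | ".join(str(c) for c in df.columns) + "\nROWS:"
    chunks: List[EmbeddableChunk] = []
    rows: List[str] = []
    for _, row in df.iterrows():
        line = "- " + " | ".join(str(v) for v in row.tolist())
        if rows and count_tokens("\n".join([prefix, *rows, line])) > tokens_budget:
            chunks.append(EmbeddableChunk(embeddable_text="\n".join([prefix, *rows]), content=table_md))
            rows = [line]
        else:
            rows.append(line)
    if rows:
        chunks.append(EmbeddableChunk(embeddable_text="\n".join([prefix, *rows]), content=table_md))
    return chunks
```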

JulienArzul previously approved these changes Feb 5, 2026

if self._is_table_chunk(chunk):
    continue

display_text = getattr(chunk, "text", "") or ""

Collaborator

❓ In what case does a chunk have no "text" attribute?

Shouldn't we ignore this chunk if that's the case? It doesn't make sense IMO to have a display_text with an empty string. Even if we manage to contextualise the chunk below and get a good description, the search would still return an empty string as its result...

Collaborator Author

Good point. I will adapt the implementation. Thank you for noticing.

display_text = getattr(chunk, "text", "") or ""
embed_text = chunker.contextualize(chunk=chunk)

for part in self.splitter.split(embed_text, tokenizer=tokenizer):

Collaborator

Aren't there some parameters we can give to the chunker directly so that it knows what the max size of a contextualisation can be?

It feels a bit strange to apply this splitting after the "contextualisation" was computed: we'll be dividing that contextualisation arbitrarily, without knowing the context, so we might end up with embeddings that have lost all of their meaning. It would be great if the method that creates the contextualisation (and hence knows the full context) were aware of the limitation.

Collaborator

Actually, you're already giving the model_name and max_tokens to the Tokenizer when initialised. Shouldn't that be enough for it to know that it shouldn't create a context bigger than the max_tokens?

Collaborator Author

Indeed, splitting after contextualization can be suboptimal if we split away the context anchor (headings, etc.). In practice, the tokenizer's max_tokens controls chunking, but contextualization can still push a chunk over budget, so we need a safety net. I'll keep the post-contextualization splitting, but I'll make it preserve the prefix. That way, each embedding slice retains the same semantic context instead of arbitrary fragments.
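
A minimal sketch of that prefix-preserving safety net. The assumption that the contextualized text is "context prefix + original chunk text", and the split_body interface, are both illustrative:

```python
from typing import Callable, List


def split_with_context_prefix(
    embed_text: str,
    display_text: str,
    count_tokens: Callable[[str], int],
    split_body: Callable[[str, int], List[str]],  # splits text under a given token budget
    tokens_budget: int,
) -> List[str]:
    """Split an over-budget contextualized chunk while keeping its context prefix."""
    if count_tokens(embed_text) <= tokens_budget:
        return [embed_text]

    # Everything before the chunk's own text is treated as the context prefix
    # (headings, captions, ...). If the text is not found, fall back to no prefix.
    idx = embed_text.rfind(display_text)
    prefix = embed_text[:idx] if idx > 0 else ""
    body = embed_text[idx:] if idx >= 0 else embed_text

    # Leave room for the prefix in every slice, then re-attach it.
    body_budget = max(tokens_budget - count_tokens(prefix), 1)
    return [prefix + part for part in split_body(body, body_budget)]
```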

"""

model_name: str = "nomic-ai/nomic-embed-text-v1.5"
tokens_budget: int = 1900

Collaborator

So those default values correspond to the defaults of "nomic-ai/nomic-embed-text-v1.5"?

I think we should move this class closer to where we choose which model to use for embeddings: the model_name could use the same constant we have for the embeddings model. That would help us remember to change those hardcoded values if we ever change the default model we use.

For now, I think it's fine to hardcode it in your Chunker, but I guess what you had in mind is that this Policy should be provided to the Plugin in divide_context_into_chunks?

Collaborator Author

Hi! Indeed, this is a piece that is hardcoded and could probably be done better if we had access to the embedding model (most embedding models can tell you their maximum token budget, so it could even be configured dynamically). The reason I left it as-is for now is that we always use the same model. I think moving to a non-hardcoded version would be better once we create the functionality for allowing any model to do the embedding.
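
For whenever that refactor happens, a rough sketch of what sharing the constant could look like. EMBEDDINGS_MODEL_NAME, EMBEDDINGS_MAX_TOKENS, the margin, and the class name are hypothetical, not anything that exists in the repo today:

```python
from dataclasses import dataclass

# Hypothetical shared constants, defined next to wherever the embedding model is chosen,
# so this policy cannot silently drift from the model actually used for embeddings.
EMBEDDINGS_MODEL_NAME = "nomic-ai/nomic-embed-text-v1.5"
EMBEDDINGS_MAX_TOKENS = 2048  # placeholder: whatever the chosen model actually supports


@dataclass(frozen=True)
class ChunkingPolicy:
    model_name: str = EMBEDDINGS_MODEL_NAME
    # Keep headroom for the context prefix added during contextualization.
    tokens_budget: int = EMBEDDINGS_MAX_TOKENS - 148
```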

if tokenizer.count_tokens(fast_embed) <= self.policy.tokens_budget:
    return [EmbeddableChunk(embeddable_text=fast_embed, content=table_md)]

df = table.export_to_dataframe(doc=doc)

Collaborator

Stupid question: is using dataframes like that better than simply dividing the Markdown into chunks? (and re-adding the table header to all chunks)

Collaborator Author

That's definitely not a stupid question and it can be done. I used this functionality to avoid re-inventing the wheel.


# Convert the in-memory PDF with Docling; `opts` carries the pipeline options (OCR and picture processing disabled).
converter = DocumentConverter(format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)})
stream = DocumentStream(name=file_name, stream=BytesIO(pdf_bytes))
return converter.convert(stream).document

Collaborator

Do you have an example of how bad this looks when serialised to YAML? 🙂
And whether it could be understood by an LLM if it were directly given that YAML context file.

Collaborator

As we were talking about in Tuesday's meeting, we might need to come up with a solution to create a context that has some meaning but still allows us to re-create the chunks later 🤔
Maybe we actually need to store the path of the PDF file in the context, so that the chunker re-reads the file. And the context could simply give the outline of the file.

Something like:

context_metadata:
    original_file_path: my_absolute_path
context:
    table_of_contents:
        ...

Collaborator Author

It looks really ugly. I can share it with you in Slack. I decided not to care too much about it, because indeed it might make sense to get rid of the context for plain files.
