
Support PDF file build #110

Open

Mateus-Cordeiro wants to merge 5 commits into main from NEM-346-pdf-support

Conversation

@Mateus-Cordeiro
Collaborator

This PR adds PDF support using Docling.

Changes

PDFPlugin

  • The plugin reads the PDF and converts it using Docling's DocumentConverter.
  • OCR and picture-related processing are explicitly disabled.
  • The output of build_file_context() is the Docling document object (a sketch of this step follows below).
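
A simplified sketch of that conversion step. The pipeline-option names (do_ocr, do_picture_classification, do_picture_description, generate_picture_images) are from current Docling releases and may not match the diff exactly:

```python
from io import BytesIO

from docling.datamodel.base_models import DocumentStream, InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def build_file_context(file_name: str, pdf_bytes: bytes):
    # Disable OCR and picture-related processing, as described above.
    opts = PdfPipelineOptions(
        do_ocr=False,
        do_picture_classification=False,
        do_picture_description=False,
        generate_picture_images=False,
    )
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )
    stream = DocumentStream(name=file_name, stream=BytesIO(pdf_bytes))
    # The resulting DoclingDocument is the file context handed to the chunking service.
    return converter.convert(stream).document
```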

Docling Chunking Service

  • Reusable Docling chunking component (a sketch of the text-chunk flow follows this list).
  • Uses contextualization for text (includes surrounding structure like headings, improving retrieval).
  • Token-budget enforcement.
  • Separate indexing for tables, to improve their usability in dce.
    • Small tables are converted to markdown and embedded that way with a TABLE header.
    • Larger tables are exported to a DataFrame, and multiple row/column-aware chunks are created.
    • Rows that are too large are truncated (I expect this to be very rare).
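
A condensed sketch of the text-chunk flow. The EmbeddableChunk field names and the contextualize(chunk) call mirror the diff; the injected callables are illustrative stand-ins for the actual tokenizer and splitter:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class EmbeddableChunk:
    # What gets embedded vs. what gets shown as the search result.
    embeddable_text: str
    content: str


def chunk_text_items(
    chunks: Iterable[object],
    contextualize: Callable[[object], str],  # e.g. the Docling chunker's contextualize()
    count_tokens: Callable[[str], int],      # e.g. tokenizer.count_tokens
    split: Callable[[str], List[str]],       # e.g. the splitter used as a safety net
    tokens_budget: int,
) -> List[EmbeddableChunk]:
    out: List[EmbeddableChunk] = []
    for chunk in chunks:  # table chunks are handled separately (see Tables below)
        display_text = getattr(chunk, "text", "") or ""
        if not display_text:
            continue  # skip chunks with no usable text (see the review discussion below)

        # Contextualization prepends surrounding structure (headings, captions, ...).
        embed_text = contextualize(chunk)

        # Safety net: contextualization can push a chunk past the token budget.
        if count_tokens(embed_text) <= tokens_budget:
            out.append(EmbeddableChunk(embeddable_text=embed_text, content=display_text))
        else:
            for part in split(embed_text):
                out.append(EmbeddableChunk(embeddable_text=part, content=display_text))
    return out
```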

Tables

If the whole table fits in the token budget, we embed one chunk whose embeddable_text looks like:

TABLE: (caption-if-any)
| col1 | col2 | col3 |
|------|------|------|
| ...  | ...  | ...  |
| ...  | ...  | ...  |

If the markdown would exceed the token budget, we embed multiple chunks. Each chunk repeats the table context so rows are interpretable on their own.

Each embeddable_text looks like:

TABLE: (caption-if-any)
COLUMNS: col1 | col2 | col3 | col4
ROWS:
- v11 | v12 | v13 | v14
- v21 | v22 | v23 | v24
- v31 | v32 | v33 | v34

The display text in both cases is markdown.
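
A hedged sketch of that table logic, reusing the EmbeddableChunk shape from the sketch above. The grouping heuristics are illustrative, and truncation of oversized rows is omitted for brevity:

```python
from typing import Callable, List

import pandas as pd


def chunk_table(
    table_md: str,        # full table exported to markdown (kept as display text)
    df: pd.DataFrame,     # same table exported to a DataFrame
    caption: str,
    count_tokens: Callable[[str], int],
    tokens_budget: int,
) -> List[EmbeddableChunk]:
    header = f"TABLE: {caption}".rstrip()
    fast_embed = f"{header}\n{table_md}"
    if count_tokens(fast_embed) <= tokens_budget:
        # The whole table fits: embed the markdown as a single chunk.
        return [EmbeddableChunk(embeddable_text=fast_embed, content=table_md)]

    # Otherwise emit row groups, each repeating the table context (caption + columns).
    prefix = f"{header}\nCOLUMNS: " + " | ".join(str(c) for c in df.columns) + "\nROWS:"
    chunks: List[EmbeddableChunk] = []
    rows: List[str] = []
    for _, row in df.iterrows():
        line = "- " + " | ".join(str(v) for v in row.tolist())
        if rows and count_tokens("\n".join([prefix, *rows, line])) > tokens_budget:
            chunks.append(EmbeddableChunk(embeddable_text="\n".join([prefix, *rows]), content=table_md))
            rows = [line]
        else:
            rows.append(line)
    if rows:
        chunks.append(EmbeddableChunk(embeddable_text="\n".join([prefix, *rows]), content=table_md))
    return chunks
```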

JulienArzul previously approved these changes Feb 5, 2026

if self._is_table_chunk(chunk):
    continue

display_text = getattr(chunk, "text", "") or ""

Collaborator

❓ In what case does a chunk have no "text" attribute?

Shouldn't we ignore this chunk if that's the case? It doesn't make sense IMO to have a display_text with an empty string. Even if we manage to contextualise the chunk below and get a good description, the search would still return an empty string as its result...

Collaborator Author

Good point. I will adapt the implementation. Thank you for noticing.

display_text = getattr(chunk, "text", "") or ""
embed_text = chunker.contextualize(chunk=chunk)

for part in self.splitter.split(embed_text, tokenizer=tokenizer):

Collaborator

Aren't there some parameters we can give to the chunker directly so that it knows what the max size of a contextualisation can be?

It feels a bit strange to apply this splitting after the "contextualisation" was computed: we'll be dividing that contextualisation arbitrarily, without knowing the context, so we might end up with embeddings that have lost all of their meaning. It would be great if the method that creates the contextualisation (and hence knows the full context) were aware of the limitation.

Collaborator

Actually, you're already giving the model_name and max_tokens to the Tokenizer when initialised. Shouldn't that be enough for it to know that it shouldn't create a context bigger than the max_tokens?

Collaborator Author

Indeed, splitting after contextualization can be suboptimal if we split away the context anchor (headings, etc.). In practice, the tokenizer's max_tokens controls chunking, but contextualization can still push a chunk over budget, so we need a safety net. I'll keep the post-contextualization splitting, but I'll make it preserve the prefix. That way, each embedding slice retains the same semantic context instead of arbitrary fragments.
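
A minimal sketch of that prefix-preserving safety net. The assumption that the contextualized text is "context prefix + original chunk text", and the split_body interface, are both illustrative:

```python
from typing import Callable, List


def split_with_context_prefix(
    embed_text: str,
    display_text: str,
    count_tokens: Callable[[str], int],
    split_body: Callable[[str, int], List[str]],  # splits text under a given token budget
    tokens_budget: int,
) -> List[str]:
    """Split an over-budget contextualized chunk while keeping its context prefix."""
    if count_tokens(embed_text) <= tokens_budget:
        return [embed_text]

    # Everything before the chunk's own text is treated as the context prefix
    # (headings, captions, ...). If the text is not found, fall back to no prefix.
    idx = embed_text.rfind(display_text)
    prefix = embed_text[:idx] if idx > 0 else ""
    body = embed_text[idx:] if idx >= 0 else embed_text

    # Leave room for the prefix in every slice, then re-attach it.
    body_budget = max(tokens_budget - count_tokens(prefix), 1)
    return [prefix + part for part in split_body(body, body_budget)]
```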

"""

model_name: str = "nomic-ai/nomic-embed-text-v1.5"
tokens_budget: int = 1900

Collaborator

So those default values correspond to the defaults of "nomic-ai/nomic-embed-text-v1.5"?

I think we should move this class closer to where we choose which model to use for embeddings: the model_name could use the same constant we have for the embeddings model. That would help us remember to change those hardcoded values if we ever change the default model we use.

For now, I think it's fine to hardcode it in your Chunker, but I guess what you had in mind is that this Policy should be provided to the Plugin in divide_context_into_chunks?

Collaborator Author

Hi! Indeed, this is a piece that is hardcoded and could probably be done better if we had access to the embedding model (most embedding models can tell you their maximum token budget, so it could even be configured dynamically). The reason I left it as-is for now is that we always use the same model. I think moving to a non-hardcoded version would be better once we create the functionality for allowing any model to do the embedding.
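
For whenever that refactor happens, a rough sketch of what sharing the constant could look like. EMBEDDINGS_MODEL_NAME, EMBEDDINGS_MAX_TOKENS, the margin, and the class name are hypothetical, not anything that exists in the repo today:

```python
from dataclasses import dataclass

# Hypothetical shared constants, defined next to wherever the embedding model is chosen,
# so this policy cannot silently drift from the model actually used for embeddings.
EMBEDDINGS_MODEL_NAME = "nomic-ai/nomic-embed-text-v1.5"
EMBEDDINGS_MAX_TOKENS = 2048  # placeholder: whatever the chosen model actually supports


@dataclass(frozen=True)
class ChunkingPolicy:
    model_name: str = EMBEDDINGS_MODEL_NAME
    # Keep headroom for the context prefix added during contextualization.
    tokens_budget: int = EMBEDDINGS_MAX_TOKENS - 148
```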

if tokenizer.count_tokens(fast_embed) <= self.policy.tokens_budget:
    return [EmbeddableChunk(embeddable_text=fast_embed, content=table_md)]

df = table.export_to_dataframe(doc=doc)

Collaborator

Stupid question: is using dataframes like that better than simply dividing the Markdown into chunks? (and re-adding the table header to all chunks)

Collaborator Author

That's definitely not a stupid question and it can be done. I used this functionality to avoid re-inventing the wheel.


# Convert the in-memory PDF with Docling; `opts` carries the pipeline options (OCR and picture processing disabled).
converter = DocumentConverter(format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)})
stream = DocumentStream(name=file_name, stream=BytesIO(pdf_bytes))
return converter.convert(stream).document

Collaborator

Do you have an example of how bad this looks when serialised to YAML? 🙂
And whether it could be understood by an LLM if it were directly given that YAML context file.

Collaborator

As we were talking about in Tuesday's meeting, we might need to come up with a solution to create a context that has some meaning but still allows us to re-create the chunks later 🤔
Maybe we actually need to store the path of the PDF file in the context, so that the chunker re-reads the file. And the context could simply give the outline of the file.

Something like:

context_metadata:
    original_file_path: my_absolute_path
context:
    table_of_contents:
        ...

Collaborator Author

It looks really ugly. I can share it with you in Slack. I decided not to care too much about it, because indeed it might make sense to get rid of the context for plain files.
