Conversation
```python
if self._is_table_chunk(chunk):
    continue

display_text = getattr(chunk, "text", "") or ""
```
❓ In what case does a chunk have no "text" attribute?
Shouldn't we ignore this chunk in that case? It doesn't make sense IMO to have a display_text that is an empty string. Even if we manage to contextualise the chunk below and get a good description, a search hit on that chunk would still return an empty string as its result...
Good point. I will adapt the implementation. Thank you for noticing.
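A minimal sketch of what that adjustment could look like, reusing the names from the diff above (treating `chunks` as the iterable is an assumption; the emptiness guard is the new part, everything else stays as in the PR):

```python
for chunk in chunks:
    if self._is_table_chunk(chunk):
        continue

    display_text = getattr(chunk, "text", "") or ""
    if not display_text.strip():
        # No displayable text: indexing this chunk would surface an empty
        # string as a search result, so skip it instead of embedding it.
        continue

    embed_text = chunker.contextualize(chunk=chunk)
```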
```python
display_text = getattr(chunk, "text", "") or ""
embed_text = chunker.contextualize(chunk=chunk)

for part in self.splitter.split(embed_text, tokenizer=tokenizer):
```
Aren't there some parameters we can give to the chunker directly so that it knows what the max size of a contextualisation can be?
It feels a bit strange to apply this splitting after the "contextualisation" was computed: we'd be dividing that contextualisation arbitrarily, without knowing the context, so we might end up with embeddings that have lost all of their meaning. It would be great if the method that creates the contextualisation (and hence knows the full context) were aware of the limitation.
Actually, you're already giving the model_name and max_tokens to the Tokenizer when it is initialised. Shouldn't that be enough for it to know not to create a context bigger than max_tokens?
Indeed, splitting after contextualization can be suboptimal if we split away the context anchor (headings, etc.). In practice, the tokenizer's max_tokens controls chunking, but contextualization can still push a chunk over budget, so we need a safety net. I'll keep the post-contextualization splitting, but I'll make it preserve the prefix. That way, each embedding slice retains the same semantic context instead of arbitrary fragments.
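A rough sketch of what "preserve the prefix" could mean, assuming the contextualized text ends with the raw chunk text (so everything before it is the heading/context prefix). `split_with_context` and its signature are hypothetical; only `tokenizer.count_tokens` is taken from the code in this PR:

```python
def split_with_context(embed_text: str, chunk_text: str, tokenizer, budget: int) -> list[str]:
    """Split an over-budget contextualized text so every slice keeps the prefix."""
    # Everything before the raw chunk text is treated as the context prefix.
    cut = embed_text.rfind(chunk_text)
    prefix = embed_text[:cut] if cut > 0 else ""
    body_budget = max(budget - tokenizer.count_tokens(prefix), 1)

    slices, current = [], []
    for word in chunk_text.split():
        candidate = " ".join(current + [word])
        if current and tokenizer.count_tokens(candidate) > body_budget:
            # Close the current slice and re-attach the prefix to it.
            slices.append(prefix + " ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        slices.append(prefix + " ".join(current))
    return slices
```

The greedy word-level packing is deliberately naive; the point is only that the prefix is repeated on every slice instead of being split away.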
| """ | ||
|
|
||
| model_name: str = "nomic-ai/nomic-embed-text-v1.5" | ||
| tokens_budget: int = 1900 |
So those default values correspond to the defaults of "nomic-ai/nomic-embed-text-v1.5"?
I think we should move this class closer to where we choose which model to use for embeddings: the model_name could use the same constant we have for the embeddings model. That would help us remember to change those hardcoded values if we ever change the default model we use.
For now, I think it's fine to hardcode it in your Chunker, but I guess what you had in mind was that this Policy should be provided to the Plugin in divide_context_into_chunks?
Hi! Indeed, this piece is hardcoded and could probably be done better if we had access to the embedding model (most embedding models can tell you their maximum token budget, so it could even be configured dynamically). The reason I left it as is for now is that we always use the same model. I think moving to a non-hardcoded version makes more sense once we add the functionality for letting any model do the embedding.
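For reference, a hedged sketch of the "configured dynamically" idea: many HuggingFace tokenizers expose their context window as `model_max_length`, so the budget could be derived from the same constant used to select the embedding model. `EMBEDDING_MODEL_NAME`, `default_tokens_budget`, and the safety margin are all hypothetical names and values, not part of this PR:

```python
from transformers import AutoTokenizer

# Hypothetical shared constant; ideally the same one used to pick the
# embedding model elsewhere in the codebase.
EMBEDDING_MODEL_NAME = "nomic-ai/nomic-embed-text-v1.5"


def default_tokens_budget(model_name: str = EMBEDDING_MODEL_NAME, safety_margin: int = 100) -> int:
    """Derive a token budget from the embedding model's tokenizer config."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    max_len = getattr(tokenizer, "model_max_length", None)
    # transformers reports a huge sentinel value when the limit is unset,
    # so fall back to a conservative default in that case.
    if not max_len or max_len > 1_000_000:
        max_len = 2048
    return max_len - safety_margin
```

The margin is arbitrary headroom for special tokens and instruction prefixes; whatever the tokenizer actually reports for this model would still need to be checked against the current 1900 default.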
```python
if tokenizer.count_tokens(fast_embed) <= self.policy.tokens_budget:
    return [EmbeddableChunk(embeddable_text=fast_embed, content=table_md)]

df = table.export_to_dataframe(doc=doc)
```
Stupid question: is using dataframes like that better than simply dividing the markdown into chunks (and re-adding the table header to all chunks)?
That's definitely not a stupid question, and it can be done. I used this functionality to avoid re-inventing the wheel.
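The markdown-splitting alternative the question describes could look roughly like this; `split_markdown_table` is a hypothetical helper, and only `tokenizer.count_tokens` and the token budget come from the code above:

```python
def split_markdown_table(table_md: str, tokenizer, budget: int) -> list[str]:
    """Split a markdown table into row groups, repeating the header row and
    the |---| separator line at the top of every chunk."""
    lines = table_md.strip().splitlines()
    header, rows = lines[:2], lines[2:]  # header + separator, then data rows

    chunks, current = [], []
    for row in rows:
        candidate = "\n".join(header + current + [row])
        if current and tokenizer.count_tokens(candidate) > budget:
            chunks.append("\n".join(header + current))
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append("\n".join(header + current))
    return chunks
```

Going through `export_to_dataframe` gives the same per-row grouping without hand-parsing the markdown, which seems to be the wheel being reused here.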
```python
converter = DocumentConverter(format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)})
stream = DocumentStream(name=file_name, stream=BytesIO(pdf_bytes))
return converter.convert(stream).document
```
Do you have an example of how bad this looks when serialised to YAML? 🙂
And whether it could be understood by an LLM if it were given that YAML context file directly?
As we were talking about in Tuesday's meeting, we might need to come up with a solution that creates a context which has some meaning but still allows re-creating the chunks later 🤔
Maybe we actually need to store the path of the PDF file in the context, so that the chunker re-reads the file. The context itself could simply give the outline of the file.
Something like:

```yaml
context_metadata:
  original_file_path: my_absolute_path
context:
  table_of_contents:
    ...
```
It looks really ugly; I can share it with you on Slack. I decided not to care too much about it because, indeed, it might make sense to get rid of the context for just files.
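To make the proposal above concrete, a hedged sketch of the chunker side: it re-reads the original PDF from the path stored in the context and re-runs the Docling conversion. The YAML field names follow the outline sketched above, and `rebuild_document_from_context` is a hypothetical helper, not something in this PR:

```python
from io import BytesIO
from pathlib import Path

import yaml
from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter


def rebuild_document_from_context(context_file: Path):
    """Re-create the Docling document from a context file that only stores
    the original file path plus an outline."""
    context = yaml.safe_load(context_file.read_text())
    pdf_path = Path(context["context_metadata"]["original_file_path"])

    # The YAML carries no chunk contents; they come from re-converting the PDF.
    stream = DocumentStream(name=pdf_path.name, stream=BytesIO(pdf_path.read_bytes()))
    return DocumentConverter().convert(stream).document
```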
This PR adds PDF support using Docling.

Changes

PDFPlugin
PDFs are converted with Docling's `DocumentConverter`; the output of `build_file_context()` is the Docling document object.

Docling Chunking Service
Chunks are contextualized with the Docling chunker; table chunks are detected via their `TABLE` header and handled separately.

Tables
If the whole table fits in the token budget, we embed one chunk whose `embeddable_text` looks like:

If the markdown would exceed the token budget, we embed multiple chunks. Each chunk repeats the table context so rows are interpretable on their own. Each `embeddable_text` looks like:

In both cases the display text is markdown.