[BUG]: Tables/Chart has missing text #1615

@veerkumar

Description

Version

26.1.2

Which installation method(s) does this occur on?

No response

Describe the bug.

For structured content (tables and charts), the nv-ingest server writes the source_location string to the parquet text field instead of the actual extracted content. The source_location is a placeholder of the form [CHART] source=... page=... or [TABLE] source=... page=...

The replacement is intentional in the server's StoreEmbedTask: when doc_type is STRUCTURED, content_replace is True, so the text field is set to location rather than content. The actual table/chart text (extracted from table_metadata.table_content) is present in metadata["content"], but it is swapped out for location and never reaches the parquet output.

[Code]:
nv-ingest server's embed_text_upload.py:

    writer.append_row(
        {
            "text": location if content_replace else content,
            "source": metadata["source_metadata"],
            "content_metadata": metadata["content_metadata"],
            "vector": metadata["embedding"],
        }
    )

Where:

    content_replace: bool = doc_type in [ContentTypeEnum.IMAGE, ContentTypeEnum.STRUCTURED]
    location: str = metadata["source_metadata"]["source_location"]
    content = metadata["content"]
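
To make the effect concrete, here is a minimal standalone sketch of the selection logic quoted above. It does not import nv-ingest: the ContentTypeEnum class is a stand-in with only the members needed here, parquet_text is a hypothetical helper name, and the metadata values (file name, page number, table text) are made up for illustration.

```python
from enum import Enum


class ContentTypeEnum(Enum):
    # Stand-in for the nv-ingest enum; only the members used below.
    TEXT = "text"
    IMAGE = "image"
    STRUCTURED = "structured"


def parquet_text(doc_type, metadata):
    """Hypothetical helper mirroring the 'text' selection in the snippet above."""
    content_replace = doc_type in [ContentTypeEnum.IMAGE, ContentTypeEnum.STRUCTURED]
    location = metadata["source_metadata"]["source_location"]
    content = metadata["content"]
    return location if content_replace else content


# Illustrative metadata; all values are invented for this example.
metadata = {
    "content": "Quarter | Revenue\nQ1 | 10",
    "source_metadata": {"source_location": "[TABLE] source=report.pdf page=3"},
}

# STRUCTURED rows get the placeholder; TEXT rows keep the extracted content.
structured_text = parquet_text(ContentTypeEnum.STRUCTURED, metadata)
plain_text = parquet_text(ContentTypeEnum.TEXT, metadata)

print(structured_text)
print(plain_text)
```

With this input, the STRUCTURED row carries only the placeholder string, while the same metadata tagged as TEXT would carry the table text.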

Minimum reproducible example

Upload any document containing tables or charts using the V2 APIs.
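
One way to confirm the symptom after an upload is to scan the resulting rows for text values that are placeholders rather than content. The sketch below assumes the parquet rows have already been loaded into a list of dicts with a "text" key (how they are loaded is left open); find_placeholder_rows and the sample rows are hypothetical and not part of nv-ingest.

```python
# Prefixes of the source_location placeholders described in this issue.
PLACEHOLDER_PREFIXES = ("[CHART]", "[TABLE]")


def find_placeholder_rows(rows):
    """Return indices of rows whose 'text' starts with a placeholder prefix."""
    return [
        index
        for index, row in enumerate(rows)
        if str(row.get("text", "")).lstrip().startswith(PLACEHOLDER_PREFIXES)
    ]


# Invented sample rows standing in for the parquet output.
sample_rows = [
    {"text": "Revenue grew in Q1."},
    {"text": "[CHART] source=report.pdf page=2"},
    {"text": "[TABLE] source=report.pdf page=3"},
]

affected = find_placeholder_rows(sample_rows)
print(affected)
```

Any index returned here points at a row where the table or chart text was replaced by its source_location.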

Relevant log output

Other/Misc.

No response
