Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Currently preventing usage
Please provide a clear description of problem this feature solves
In batch mode, ingestor.ingest currently returns a very large amount of data from the Ray workers.
Returning this data takes a long time, and the results include far more fields than needed.
Describe the feature, and optionally a solution or implementation and any alternatives
source_name - document name (filename)
source_location - fully qualified path to ingested file
raw_location - fully qualified path for accessing the related page image, cropped images, audio/video chunks, or frames. Related to #1714, "(retriever) Add .store() task for persisting extracted images (#1675)"
element_type - text / image / table / chart / infographic / audio / custom (can be populated by UDFs)
sequence_number - for citation references. This is the page number for PDFs, and the audio/video/text chunk number for other content types. "sequence_number" is less clear than "page_number" for PDFs, but "page_number" is not clear for non-PDF content types :)
bounding box - [x1, y1, x2, y2] coordinates, if applicable for image/table overlays
page dimensions - W/H for bbox normalization
content_type - top-level file/content type
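
To make the proposed shape concrete, here is a minimal sketch of what a slimmed-down result record could look like, assuming the fields listed above. The names `SlimElement`, `to_slim`, `bounding_box`, and `page_dimensions` are hypothetical and chosen only for illustration; they are not part of the existing API, and the input dict layout is also an assumption.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Tuple


@dataclass
class SlimElement:
    """Hypothetical slim record holding only the fields requested in this issue."""
    source_name: str                # document name (filename)
    source_location: str            # fully qualified path to the ingested file
    raw_location: Optional[str]     # path to page image / crop / chunk, if stored
    element_type: str               # text / image / table / chart / ...
    sequence_number: int            # page number for PDFs, chunk number otherwise
    bounding_box: Optional[Tuple[float, float, float, float]]  # [x1, y1, x2, y2]
    page_dimensions: Optional[Tuple[float, float]]             # (width, height)
    content_type: str               # top-level file/content type

    def normalized_bbox(self) -> Optional[Tuple[float, float, float, float]]:
        """Scale the bbox into the 0..1 range using the page width and height."""
        if self.bounding_box is None or self.page_dimensions is None:
            return None
        x1, y1, x2, y2 = self.bounding_box
        width, height = self.page_dimensions
        return (x1 / width, y1 / height, x2 / width, y2 / height)


# Names of the fields to keep, derived from the dataclass itself.
SLIM_FIELDS = tuple(f.name for f in fields(SlimElement))


def to_slim(record: Dict[str, Any]) -> SlimElement:
    """Project a full result dict (assumed layout) down to the slim fields.

    Keys not listed in SLIM_FIELDS are dropped; missing keys become None.
    """
    return SlimElement(**{name: record.get(name) for name in SLIM_FIELDS})


# Example: a made-up full record with one extra, heavy field that gets dropped.
full_record = {
    "source_name": "report.pdf",
    "source_location": "/data/in/report.pdf",
    "raw_location": None,
    "element_type": "table",
    "sequence_number": 3,
    "bounding_box": (50.0, 25.0, 100.0, 50.0),
    "page_dimensions": (200.0, 100.0),
    "content_type": "pdf",
    "embedding": [0.0] * 1024,  # stands in for fields the caller does not need
}
slim = to_slim(full_record)
```

If the projection ran on the workers before results are returned, only these eight fields would cross the process boundary, which is the reduction this request is asking for.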