Skip to content

(retriever) add image file ingestion (PNG, JPEG, BMP, TIFF, SVG)#1524

Merged
jdye64 merged 5 commits intoNVIDIA:mainfrom
edknv:edwardk/retriever-image-input-format
Mar 10, 2026
Merged

(retriever) add image file ingestion (PNG, JPEG, BMP, TIFF, SVG)#1524
jdye64 merged 5 commits intoNVIDIA:mainfrom
edknv:edwardk/retriever-image-input-format

Conversation

@edknv
Copy link
Copy Markdown
Collaborator

@edknv edknv commented Mar 10, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

exts = ext_map.get(input_type, ["*.pdf"])
import glob as _glob

all_candidates = [str(input_path / e) for e in exts]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using local is fine for now but something we should consider later is Ray data allows users to specify object stores like S3 and this logic would break in that situation

# Raster formats handled natively by Pillow.
_RASTER_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif"}

# SVG requires cairosvg (optional dependency).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add this in the pyproject.toml as an optional dependency then?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea. Added to pyproject.toml as an optional dep.

try:
import cairosvg # type: ignore[import-untyped]
except ImportError:
raise ImportError("cairosvg is required for SVG image support. " "Install it with: pip install cairosvg")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is good enough for now


def extract_image_files(self, params: ExtractParams | None = None, **kwargs: Any) -> "FusedIngestor":
"""Fused mode does not yet support extract_image_files."""
raise NotImplementedError("Fused mode does not yet support extract_image_files")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah probably gonna nuke fused mode for now anyway ...

@edknv edknv marked this pull request as ready for review March 10, 2026 18:20
@edknv edknv requested a review from a team as a code owner March 10, 2026 18:20
@edknv edknv requested a review from ChrisJar March 10, 2026 18:20
@jdye64 jdye64 merged commit bd46e5b into NVIDIA:main Mar 10, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants