
Forward-port manifest-planned ingest branches for BMP/TIFF routing #2096

Open
jioffe502 wants to merge 1 commit into NVIDIA:main from jioffe502:codex/main-manifest-router-bugfix


@jioffe502
Collaborator

Summary

Forward-ports the 26.05 manifest-router fix onto main.

This replaces PR 2068's root CLI input-type routing with manifest-planned extraction branches inside GraphIngestor, while preserving the dedicated PDF fast path and fixing BMP/TIFF ingest.

Motivation

The 26.05 release fix should also exist on main so the release branch does not later merge a large routing change into a diverged codebase unexpectedly.

This PR gets ahead of that integration by applying the same manifest-planned routing behavior directly on top of current main.

Changes

  • Removes root retriever ingest --input-type.
  • Removes CLI-level effective input-type routing.
  • Adds manifest planning for supported input families.
  • Adds branch extraction execution outside the core linear graph.
  • Keeps single-family inputs on the existing dedicated typed graph path.
  • Keeps explicit extraction_mode="auto" as compatibility fallback to MultiTypeExtractOperator.
  • Normalizes branch output schemas before unioning mixed-modality rows.
  • Preserves PDF-only fast path with PDFExtractionCPUActor.
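As a rough illustration of what "manifest planning for supported input families" could look like, the sketch below groups input paths by extension and emits one branch per family in a fixed order. The names `FAMILY_EXTENSIONS`, `Branch`, and `plan_branches` are hypothetical and are not taken from this PR; the image and video extension sets are assumptions, while the pdf, txt, html, and audio sets follow the literals quoted in the review of this PR.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical extension table. The image and video sets are assumptions.
FAMILY_EXTENSIONS: Dict[str, FrozenSet[str]] = {
    "pdf": frozenset({".pdf", ".docx", ".pptx"}),
    "image": frozenset({".png", ".jpg", ".jpeg", ".bmp", ".tiff"}),
    "txt": frozenset({".txt"}),
    "html": frozenset({".html"}),
    "audio": frozenset({".mp3", ".wav", ".m4a"}),
    "video": frozenset({".mp4"}),
}


@dataclass(frozen=True)
class Branch:
    """One extraction branch: a family name plus the paths routed to it."""

    family: str
    paths: Tuple[str, ...]


def plan_branches(paths: List[str]) -> Tuple[List[Branch], List[str]]:
    """Group paths by family and return (branches, unsupported paths)."""
    by_family: Dict[str, List[str]] = defaultdict(list)
    unsupported: List[str] = []
    for path in paths:
        ext = Path(path).suffix.lower()
        family = next(
            (name for name, exts in FAMILY_EXTENSIONS.items() if ext in exts),
            None,
        )
        if family is None:
            unsupported.append(path)
        else:
            by_family[family].append(path)
    # Emit branches in table order so the same inputs always give the same plan.
    branches = [
        Branch(name, tuple(sorted(by_family[name])))
        for name in FAMILY_EXTENSIONS
        if name in by_family
    ]
    return branches, unsupported
```

Under these assumptions, a single-family input yields exactly one branch, which is the case the PR keeps on the existing dedicated typed graph path.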

Validation

Forward-port result:

  • cherry-picked cleanly onto upstream/main
  • no merge conflicts

Focused test suite:

232 passed, 2 skipped, 2 deselected

## Checklist
- [ ] I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Retriever/blob/main/CONTRIBUTING.md).
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
- [ ] If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

@jioffe502 jioffe502 requested review from a team as code owners May 22, 2026 14:40
@jioffe502 jioffe502 requested a review from ChrisJar May 22, 2026 14:40
@greptile-apps
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR forward-ports the 26.05 manifest-planned extraction routing onto main, replacing the root CLI --input-type flag with a manifest planner inside GraphIngestor that classifies input files by extension and emits deterministic per-family extraction branches (pdf, image, txt, html, audio, video). Mixed-modality ingest now runs parallel branch graphs that are unioned before shared post-extraction stages, fixing the BMP/TIFF routing gap.

  • ingest_manifest.py and branch_extraction.py are new modules: the former plans typed branches from a classified input manifest; the latter executes them in both inprocess and Ray batch modes, normalising column schemas before union.
  • RayDataExecutor.ingest is split into build_dataset (lazy pipeline) and ingest (materialise), enabling branch datasets to be built and unioned before a single shared post-extract pass.
  • BatchTuningParams drops four table-structure tuning fields and the CLI drops several flags (--input-type, GPU-per-actor options); these are hard removals without a deprecation cycle affecting the public nemo_retriever.params API.
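The schema normalisation and the lazy build/materialise split described above can be sketched with plain Python generators standing in for Ray datasets. This is a minimal illustration, not the PR's code: `normalize`, the column names, and the free functions `build_dataset` and `ingest` are stand-ins chosen to mirror the summary's wording.

```python
from __future__ import annotations

from itertools import chain
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def normalize(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Lazily project each row onto a shared column list, filling gaps with None."""
    for row in rows:
        yield {col: row.get(col) for col in columns}


def build_dataset(branches: Iterable[Iterable[Row]], columns: List[str]) -> Iterator[Row]:
    """Build the lazy union of all branches; nothing runs until it is consumed."""
    return chain.from_iterable(normalize(rows, columns) for rows in branches)


def ingest(dataset: Iterable[Row]) -> List[Row]:
    """Materialise the lazy pipeline in a single pass."""
    return list(dataset)


# Hypothetical branch outputs with different column sets.
pdf_rows = [{"path": "a.pdf", "text": "hello", "page_number": 1}]
image_rows = [{"path": "b.tiff", "text": "scan"}]
columns = ["path", "text", "page_number"]

rows = ingest(build_dataset([pdf_rows, image_rows], columns))
```

Every row in `rows` carries the same keys, so a shared post-extraction stage can treat mixed-modality rows uniformly.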

Confidence Score: 3/5

Two present-defect regressions need to be resolved before merging: .m4a files are silently dropped from audio processing in the explicit extraction_mode="auto" path, and four public BatchTuningParams fields are hard-removed without a deprecation cycle.

The manifest-planning architecture is well-designed and the new test suite is thorough. However, the .m4a omission from AUDIO_EXTENSIONS means .m4a files routed through explicit extraction_mode="auto" are silently skipped, and the hard removal of BatchTuningParams table-structure fields breaks any existing code that constructs those params — both are present-defect regressions on shipped code paths.

nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py (missing .m4a) and nemo_retriever/src/nemo_retriever/params/models.py (hard-removed public fields) need attention before merging.

Important Files Changed

| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/branch_extraction.py | New module for per-family extraction branch execution in inprocess and batch modes; schema normalisation before union is sound and well-tested. |
| nemo_retriever/src/nemo_retriever/ingest_manifest.py | New manifest-planner module; clean design, no issues found. |
| nemo_retriever/src/nemo_retriever/graph_ingestor.py | Core ingest refactored to route through manifest-planned branches or the existing single-mode path; _ensure_batch_runtime extracted cleanly. |
| nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py | .m4a silently dropped from AUDIO_EXTENSIONS; VideoFrameTextDedupParams defaults changed to disabled. |
| nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py | build_post_extract_graph added; video_text_dedup_params no longer forwarded to MultiTypeExtractOperator in build_graph. |
| nemo_retriever/src/nemo_retriever/params/models.py | Hard-removes four public BatchTuningParams fields without a deprecation cycle. |
| nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py | CLI adapter simplified; internal function changes are acceptable for this adapter layer. |
| nemo_retriever/src/nemo_retriever/graph/executor.py | RayDataExecutor.ingest split cleanly into lazy build_dataset and materialising ingest. |
| nemo_retriever/tests/test_ingest_manifest.py | Good coverage of new manifest/branch logic; missing SPDX header. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GraphIngestor.ingest] --> B{extraction_mode set?}
    B -- yes --> C[_resolve_effective_extraction_inputs]
    B -- no --> D[_plan_default_extraction_branches]
    D --> E{branches count}
    E -- 1 --> F[single-family path]
    E -- 2+ --> G[ExtractionBranchExecutor]
    C --> H[_execute_single_graph]
    F --> H
    G --> I{run_mode}
    I -- inprocess --> J[InprocessExecutor per branch -> concat -> post-graph]
    I -- batch --> K[build_dataset per branch -> union -> post-graph ingest]
    H --> L{run_mode}
    L -- inprocess --> M[InprocessExecutor.ingest]
    L -- batch --> N[RayDataExecutor.ingest]
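
The two decisions in the flowchart (whether `extraction_mode` is set, then how many branches were planned) can be written as a small pure function. This is only a sketch: `choose_path` and its return labels are hypothetical, and the empty-input case is an assumption because the flowchart does not show it.

```python
from __future__ import annotations

from typing import Optional, Sequence, Tuple


def choose_path(
    extraction_mode: Optional[str],
    families: Sequence[str],
    run_mode: str,
) -> Tuple[str, str]:
    """Return (graph kind, run mode) following the flowchart's decisions."""
    if run_mode not in ("inprocess", "batch"):
        raise ValueError(f"unknown run_mode: {run_mode!r}")
    if extraction_mode is not None:
        # An explicit mode bypasses manifest planning entirely.
        return ("single_graph", run_mode)
    if len(families) == 0:
        # Assumption: the flowchart does not cover inputs with no supported family.
        raise ValueError("no supported input families")
    if len(families) == 1:
        # One family stays on the existing single-graph path.
        return ("single_graph", run_mode)
    # Two or more families go through per-branch execution and a union.
    return ("branch_executor", run_mode)
```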

Comments Outside Diff (2)

  1. nemo_retriever/src/nemo_retriever/params/models.py, lines 252-265

    P1 Breaking removal of public BatchTuningParams fields

    BatchTuningParams is exported from nemo_retriever.params and is part of the public library API. This PR removes table_structure_workers, table_structure_batch_size, table_structure_cpus_per_actor, and gpu_table_structure without a deprecation cycle. Any existing library user who constructs BatchTuningParams(table_structure_workers=…) or reads these attributes will get a ValidationError or AttributeError at runtime after upgrading.
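
    One possible shape for a deprecation cycle is sketched below with a plain Python class. `TuningParams` and its `batch_size` field are stand-ins, not the real `BatchTuningParams`, and a pydantic model would need its own mechanism; only the four deprecated field names come from this comment.

```python
from __future__ import annotations

import warnings
from typing import Any

_DEPRECATED_FIELDS = frozenset(
    {
        "table_structure_workers",
        "table_structure_batch_size",
        "table_structure_cpus_per_actor",
        "gpu_table_structure",
    }
)


class TuningParams:
    """Stand-in params class that tolerates removed fields for one release."""

    def __init__(self, *, batch_size: int = 32, **legacy: Any) -> None:
        unknown = set(legacy) - _DEPRECATED_FIELDS
        if unknown:
            raise TypeError(f"unexpected arguments: {sorted(unknown)}")
        for name in legacy:
            warnings.warn(
                f"{name} is deprecated and ignored",
                DeprecationWarning,
                stacklevel=2,
            )
        self.batch_size = batch_size  # hypothetical surviving field

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal lookup fails, so live fields are unaffected.
        if name in _DEPRECATED_FIELDS:
            warnings.warn(
                f"{name} is deprecated and always None",
                DeprecationWarning,
                stacklevel=2,
            )
            return None
        raise AttributeError(name)
```

    With a shim like this, existing callers see a `DeprecationWarning` instead of an exception, and the fields can be removed outright in a later release.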

  2. nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py, lines 681-693

    P2 video_text_dedup_params silently dropped for extraction_mode="auto"

    build_graph() still accepts a video_text_dedup_params argument but no longer threads it through to MultiTypeExtractOperator. Users who call GraphIngestor.extract(extraction_mode="auto") and pass custom VideoFrameTextDedupParams will have their configuration silently ignored — the operator will use VideoFrameTextDedupParams() defaults instead of the caller-supplied values.
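
    The fix amounts to forwarding the argument again. The sketch below uses stand-in classes (`DedupParams` for `VideoFrameTextDedupParams`, `ExtractOperator` for `MultiTypeExtractOperator`) and a hypothetical `enabled` field; the real signatures may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class DedupParams:
    """Stand-in for VideoFrameTextDedupParams; the field is hypothetical."""

    enabled: bool = False


@dataclass
class ExtractOperator:
    """Stand-in for MultiTypeExtractOperator."""

    video_text_dedup_params: DedupParams = field(default_factory=DedupParams)


def build_graph(video_text_dedup_params: Optional[DedupParams] = None) -> ExtractOperator:
    """Forward the caller's params; use defaults only when none were passed."""
    if video_text_dedup_params is None:
        video_text_dedup_params = DedupParams()
    return ExtractOperator(video_text_dedup_params=video_text_dedup_params)
```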


Reviews (1): Last reviewed commit: "Replace ingest input-type routing with m..."

Comment on lines +58 to +62
PDF_EXTENSIONS = {".pdf", ".docx", ".pptx"}
TEXT_EXTENSIONS = {".txt"}
HTML_EXTENSIONS = {".html"}
AUDIO_EXTENSIONS = {".mp3", ".wav"}
IMAGE_EXTENSIONS = SUPPORTED_IMAGE_EXTENSIONS

P1 .m4a silently dropped from AUDIO_EXTENSIONS

The old definition was INPUT_TYPE_EXTENSIONS["audio"] which equalled frozenset({".mp3", ".wav", ".m4a"}). The replacement literal omits .m4a. Any user who passes .m4a files with explicit extraction_mode="auto" will hit _unsupported_extension_message and have those files silently skipped rather than processed as audio. input_type_for_path still recognises .m4a as "audio" (used by the manifest planner), so the regression is confined to the MultiTypeExtractOperator code path — but that path is still reachable via ingestor.extract(extraction_mode="auto").
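
One way to avoid this kind of drift is to derive the operator's set from the shared table instead of repeating a literal, and to pin the behaviour with a small check. In the sketch below only the `"audio"` entry of `INPUT_TYPE_EXTENSIONS` is reproduced (its value is the one quoted above), and `is_audio` is a hypothetical helper.

```python
from __future__ import annotations

from pathlib import Path

# Only the audio entry is shown; its value is the one quoted in this comment.
INPUT_TYPE_EXTENSIONS = {"audio": frozenset({".mp3", ".wav", ".m4a"})}

# Derive the operator's set from the shared table so the two cannot diverge.
AUDIO_EXTENSIONS = INPUT_TYPE_EXTENSIONS["audio"]


def is_audio(path: str) -> bool:
    """Return True when the path's extension is a known audio extension."""
    return Path(path).suffix.lower() in AUDIO_EXTENSIONS
```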


@@ -0,0 +1,266 @@
from __future__ import annotations

P2 New Python source files must include the SPDX license header. test_ingest_manifest.py is the only new file in this PR that is missing it.

Rule Used: Python files added in this PR must include the SPD... (source)
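
A check of this kind can be automated. The sketch below is a hypothetical helper, not part of the repository's tooling: it looks for an `SPDX-License-Identifier:` comment in the first few lines of a file, and the header text used in the example is illustrative rather than the project's exact wording.

```python
from __future__ import annotations


def has_spdx_header(source: str, max_lines: int = 5) -> bool:
    """Return True if a comment in the first max_lines lines carries an SPDX identifier."""
    head = source.splitlines()[:max_lines]
    return any(
        line.lstrip().startswith("#") and "SPDX-License-Identifier:" in line
        for line in head
    )


# Illustrative header; the project's real header text may differ.
with_header = (
    "# SPDX-FileCopyrightText: Example Corp\n"
    "# SPDX-License-Identifier: Apache-2.0\n"
    "\n"
    "from __future__ import annotations\n"
)
without_header = "from __future__ import annotations\n"
```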

