Forward-port manifest-planned ingest branches for BMP/TIFF routing#2096
Forward-port manifest-planned ingest branches for BMP/TIFF routing#2096jioffe502 wants to merge 1 commit into
Conversation
(cherry picked from commit 74975d1)
Greptile SummaryThis PR forward-ports the 26.05 manifest-planned extraction routing onto
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/branch_extraction.py | New module for per-family extraction branch execution in inprocess and batch modes; schema normalisation before union is sound and well-tested. |
| nemo_retriever/src/nemo_retriever/ingest_manifest.py | New manifest-planner module; clean design, no issues found. |
| nemo_retriever/src/nemo_retriever/graph_ingestor.py | Core ingest refactored to route through manifest-planned branches or the existing single-mode path; _ensure_batch_runtime extracted cleanly. |
| nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py | .m4a silently dropped from AUDIO_EXTENSIONS; VideoFrameTextDedupParams defaults changed to disabled. |
| nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py | build_post_extract_graph added; video_text_dedup_params no longer forwarded to MultiTypeExtractOperator in build_graph. |
| nemo_retriever/src/nemo_retriever/params/models.py | Hard-removes four public BatchTuningParams fields without a deprecation cycle. |
| nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py | CLI adapter simplified; internal function changes are acceptable for this adapter layer. |
| nemo_retriever/src/nemo_retriever/graph/executor.py | RayDataExecutor.ingest split cleanly into lazy build_dataset and materialising ingest. |
| nemo_retriever/tests/test_ingest_manifest.py | Good coverage of new manifest/branch logic; missing SPDX header. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[GraphIngestor.ingest] --> B{extraction_mode set?}
B -- yes --> C[_resolve_effective_extraction_inputs]
B -- no --> D[_plan_default_extraction_branches]
D --> E{branches count}
E -- 1 --> F[single-family path]
E -- 2+ --> G[ExtractionBranchExecutor]
C --> H[_execute_single_graph]
F --> H
G --> I{run_mode}
I -- inprocess --> J[InprocessExecutor per branch -> concat -> post-graph]
I -- batch --> K[build_dataset per branch -> union -> post-graph ingest]
H --> L{run_mode}
L -- inprocess --> M[InprocessExecutor.ingest]
L -- batch --> N[RayDataExecutor.ingest]
Comments Outside Diff (2)
-
nemo_retriever/src/nemo_retriever/params/models.py, line 252-265 (link)Breaking removal of public
BatchTuningParamsfieldsBatchTuningParamsis exported fromnemo_retriever.paramsand is part of the public library API. This PR removestable_structure_workers,table_structure_batch_size,table_structure_cpus_per_actor, andgpu_table_structurewithout a deprecation cycle. Any existing library user who constructsBatchTuningParams(table_structure_workers=…)or reads these attributes will get aValidationErrororAttributeErrorat runtime after upgrading.Prompt To Fix With AI
This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/params/models.py Line: 252-265 Comment: **Breaking removal of public `BatchTuningParams` fields** `BatchTuningParams` is exported from `nemo_retriever.params` and is part of the public library API. This PR removes `table_structure_workers`, `table_structure_batch_size`, `table_structure_cpus_per_actor`, and `gpu_table_structure` without a deprecation cycle. Any existing library user who constructs `BatchTuningParams(table_structure_workers=…)` or reads these attributes will get a `ValidationError` or `AttributeError` at runtime after upgrading. How can I resolve this? If you propose a fix, please make it concise.
-
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py, line 681-693 (link)video_text_dedup_paramssilently dropped forextraction_mode="auto"build_graph()still accepts avideo_text_dedup_paramsargument but no longer threads it through toMultiTypeExtractOperator. Users who callGraphIngestor.extract(extraction_mode="auto")and pass customVideoFrameTextDedupParamswill have their configuration silently ignored — the operator will useVideoFrameTextDedupParams()defaults instead of the caller-supplied values.Prompt To Fix With AI
This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py Line: 681-693 Comment: **`video_text_dedup_params` silently dropped for `extraction_mode="auto"`** `build_graph()` still accepts a `video_text_dedup_params` argument but no longer threads it through to `MultiTypeExtractOperator`. Users who call `GraphIngestor.extract(extraction_mode="auto")` and pass custom `VideoFrameTextDedupParams` will have their configuration silently ignored — the operator will use `VideoFrameTextDedupParams()` defaults instead of the caller-supplied values. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 4
nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py:58-62
**`.m4a` silently dropped from `AUDIO_EXTENSIONS`**
The old definition was `INPUT_TYPE_EXTENSIONS["audio"]` which equalled `frozenset({".mp3", ".wav", ".m4a"})`. The replacement literal omits `.m4a`. Any user who passes `.m4a` files with explicit `extraction_mode="auto"` will hit `_unsupported_extension_message` and have those files silently skipped rather than processed as audio. `input_type_for_path` still recognises `.m4a` as "audio" (used by the manifest planner), so the regression is confined to the `MultiTypeExtractOperator` code path — but that path is still reachable via `ingestor.extract(extraction_mode="auto")`.
### Issue 2 of 4
nemo_retriever/src/nemo_retriever/params/models.py:252-265
**Breaking removal of public `BatchTuningParams` fields**
`BatchTuningParams` is exported from `nemo_retriever.params` and is part of the public library API. This PR removes `table_structure_workers`, `table_structure_batch_size`, `table_structure_cpus_per_actor`, and `gpu_table_structure` without a deprecation cycle. Any existing library user who constructs `BatchTuningParams(table_structure_workers=…)` or reads these attributes will get a `ValidationError` or `AttributeError` at runtime after upgrading.
### Issue 3 of 4
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py:681-693
**`video_text_dedup_params` silently dropped for `extraction_mode="auto"`**
`build_graph()` still accepts a `video_text_dedup_params` argument but no longer threads it through to `MultiTypeExtractOperator`. Users who call `GraphIngestor.extract(extraction_mode="auto")` and pass custom `VideoFrameTextDedupParams` will have their configuration silently ignored — the operator will use `VideoFrameTextDedupParams()` defaults instead of the caller-supplied values.
### Issue 4 of 4
nemo_retriever/tests/test_ingest_manifest.py:1
New Python source files must include the SPDX license header. `test_ingest_manifest.py` is the only new file in this PR that is missing it.
Reviews (1): Last reviewed commit: "Replace ingest input-type routing with m..." | Re-trigger Greptile
| PDF_EXTENSIONS = {".pdf", ".docx", ".pptx"} | ||
| TEXT_EXTENSIONS = {".txt"} | ||
| HTML_EXTENSIONS = {".html"} | ||
| AUDIO_EXTENSIONS = {".mp3", ".wav"} | ||
| IMAGE_EXTENSIONS = SUPPORTED_IMAGE_EXTENSIONS |
There was a problem hiding this comment.
.m4a silently dropped from AUDIO_EXTENSIONS
The old definition was INPUT_TYPE_EXTENSIONS["audio"] which equalled frozenset({".mp3", ".wav", ".m4a"}). The replacement literal omits .m4a. Any user who passes .m4a files with explicit extraction_mode="auto" will hit _unsupported_extension_message and have those files silently skipped rather than processed as audio. input_type_for_path still recognises .m4a as "audio" (used by the manifest planner), so the regression is confined to the MultiTypeExtractOperator code path — but that path is still reachable via ingestor.extract(extraction_mode="auto").
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py
Line: 58-62
Comment:
**`.m4a` silently dropped from `AUDIO_EXTENSIONS`**
The old definition was `INPUT_TYPE_EXTENSIONS["audio"]` which equalled `frozenset({".mp3", ".wav", ".m4a"})`. The replacement literal omits `.m4a`. Any user who passes `.m4a` files with explicit `extraction_mode="auto"` will hit `_unsupported_extension_message` and have those files silently skipped rather than processed as audio. `input_type_for_path` still recognises `.m4a` as "audio" (used by the manifest planner), so the regression is confined to the `MultiTypeExtractOperator` code path — but that path is still reachable via `ingestor.extract(extraction_mode="auto")`.
How can I resolve this? If you propose a fix, please make it concise.| @@ -0,0 +1,266 @@ | |||
| from __future__ import annotations | |||
There was a problem hiding this comment.
New Python source files must include the SPDX license header.
test_ingest_manifest.py is the only new file in this PR that is missing it.
Rule Used: Python files added in this PR must include the SPD... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_ingest_manifest.py
Line: 1
Comment:
New Python source files must include the SPDX license header. `test_ingest_manifest.py` is the only new file in this PR that is missing it.
**Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header))
How can I resolve this? If you propose a fix, please make it concise.
Summary
Forward-ports the 26.05 manifest-router fix onto
main.This replaces PR 2068's root CLI input-type routing with manifest-planned extraction branches inside
GraphIngestor, while preserving the dedicated PDF fast path and fixing BMP/TIFF ingest.Motivation
The 26.05 release fix should also exist on
mainso the release branch does not later merge a large routing change into a diverged codebase unexpectedly.This PR gets ahead of that integration by applying the same manifest-planned routing behavior directly on top of current
main.Changes
retriever ingest --input-type.extraction_mode="auto"as compatibility fallback toMultiTypeExtractOperator.PDFExtractionCPUActor.Validation
Forward-port result:
upstream/mainFocused test suite: