Streaming VDB writes by edknv · Pull Request #2041 · NVIDIA/NeMo-Retriever

edknv · 2026-05-15T05:35:23Z

Description

Users reported batch mode ingest chewing up 400 GB+. Profiling on bo767 reproduced a 200 GB peak.

The upload step was the bottleneck. It waited for the whole embedded corpus, then handed it to one worker that held it through several reshape passes simultaneously. The driver also pulled a full copy back at the end. Add in stale workers still holding their old memory, and you hit the huge memory peak.

This PR fixes the issue by making the upload streaming. Each block writes itself to LanceDB. The first block creates the table, and all the rest only append. The search index is built once on the driver after the graph finishes.

Measured on bo767, same host:

	Before	After
Wall clock	57.9 min	48.0 min (−17 %)
Peak `MemAvailable` delta	165.3 GiB	50.3 GiB (−70 %)

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

greptile-apps · 2026-05-15T05:40:32Z

Greptile Summary

This PR converts the LanceDB upload step from a memory-intensive global-batch write (collect entire embedded corpus → single worker → reshape → driver pull) to a streaming per-block write, achieving a 70% peak-memory reduction and 17% wall-clock speedup on the benchmark host.

Streaming VDB writes: IngestVdbOperator.REQUIRES_GLOBAL_BATCH flipped to False; each Ray Data block calls LanceDB.append() with overwrite=True on the first committed batch and table.add() on every subsequent one. build_index() is deferred to a driver-side GraphIngestor._finalize_vdb_upload() hook after the graph completes.
Driver-side memory fix: RayDataExecutor returns ds.materialize() (block refs only) instead of ds.to_pandas(). Results flow through a new IngestResult wrapper whose accessors are all serveable from streaming Ray Dataset primitives.
Connection caching + Lance commit-count control: The lance connection and table handle are cached on the LanceDB instance across batches; ingestor_runtime coalesces blocks to ~1000 rows to bound Lance manifest growth.

Confidence Score: 4/5

Safe to merge except for one backward-compat trap in LanceDB's constructor.

The build_index kwarg removal in LanceDB.__init__ is not guarded with a kwargs.pop before super().__init__(**kwargs). Because VDB.__init__ does self.__dict__.update(kwargs), any caller still using the old LanceDB(build_index=False) API will have the boolean land as an instance attribute, silently shadowing the new build_index() method. The first call to vdb.build_index() in _finalize_vdb_upload would then raise TypeError: 'bool' object is not callable. All other parts of the streaming contract are correct and well-tested.

nemo_retriever/src/nemo_retriever/vdb/lancedb.py — the LanceDB.__init__ kwarg-pop gap.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/vdb/lancedb.py	Adds `append()` and `build_index()` streaming methods; removes legacy `build_index` constructor param but doesn't guard against it landing in `**kwargs` and shadowing the new method via `VDB.__dict__.update`.
nemo_retriever/src/nemo_retriever/vdb/operators.py	Switches `IngestVdbOperator` from global-batch `VDB.run()` to per-batch `VDB.append()`; adds overwrite-flag handshake, accounting-column projection, and `_vdb_uploadable` flag column.
nemo_retriever/src/nemo_retriever/pipeline/ingest_result.py	New module wrapping ingest output in a streaming-friendly accessor; correctly avoids `Dataset.to_pandas()` for all public accessors.
nemo_retriever/src/nemo_retriever/graph_ingestor.py	Adds `_finalize_vdb_upload` driver-side hook that constructs a fresh VDB instance and calls `build_index()` once after the graph finishes streaming writes.
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py	Pins `IngestVdbOperator` concurrency to 1 (required for overwrite-flag correctness) and sets a default `batch_size=1000` to reduce Lance manifest commit count.
nemo_retriever/src/nemo_retriever/graph/executor.py	Replaces `ds.to_pandas()` with `ds.materialize()`, keeping block refs on the driver instead of pulling the full corpus into a single DataFrame.
nemo_retriever/tests/test_lancedb_streaming_append.py	New integration test pinning the streaming round-trip and connection-caching invariants against a real lancedb instance.

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:354
**Legacy `build_index` kwarg still shadows the new `build_index()` method**

`VDB.__init__` performs `self.__dict__.update(kwargs)`. Any unknown kwarg surviving to that call becomes an instance attribute with the same name. The old `LanceDB` constructor accepted `build_index: bool | None = None` as an explicit parameter; the new one removes it — but leaves no `kwargs.pop("build_index", None)` guard before `super().__init__(**kwargs)`.

Consequence: a caller using the old API (`LanceDB(build_index=False)`) passes that key through `**kwargs` → `VDB.__init__(build_index=False)` → `self.build_index = False`. Python's attribute lookup then resolves the instance dict before the class, so `vdb.build_index` is the boolean `False` and `GraphIngestor._finalize_vdb_upload()` raises `TypeError: 'bool' object is not callable` on `vdb.build_index()`.

_{Reviews (4): Last reviewed commit: "fix test" | Re-trigger Greptile}

edknv added 4 commits May 14, 2026 15:37

streaming vdb writes

235171d

Merge branch 'main' into edwardk/stream-vdb-write

51b0fd6

fix merge

06e35ec

drop build_indeex constructor kwarg

dcac989

edknv requested review from a team as code owners May 15, 2026 05:35

edknv requested a review from charlesbluca May 15, 2026 05:35

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/pipeline/ingest_result.py

Comment thread nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Outdated

Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated

Comment thread nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Outdated

address greptile comments

63374b5

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/vdb/operators.py Outdated

gurad append for no table

c905812

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

Comment thread nemo_retriever/tests/test_nv_ingest_vdb_operator.py Outdated

fix test

e23ade3

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Streaming VDB writes#2041

Streaming VDB writes#2041
edknv wants to merge 7 commits into
NVIDIA:mainfrom
edknv:edwardk/stream-vdb-write

edknv commented May 15, 2026

Uh oh!

greptile-apps Bot commented May 15, 2026 •

edited

Loading

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

edknv commented May 15, 2026

Description

Checklist

Uh oh!

greptile-apps Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 15, 2026 •

edited

Loading