Skip to content

Streaming VDB writes#2041

Open
edknv wants to merge 7 commits into
NVIDIA:mainfrom
edknv:edwardk/stream-vdb-write
Open

Streaming VDB writes#2041
edknv wants to merge 7 commits into
NVIDIA:mainfrom
edknv:edwardk/stream-vdb-write

Conversation

@edknv
Copy link
Copy Markdown
Collaborator

@edknv edknv commented May 15, 2026

Description

Users reported batch mode ingest chewing up 400 GB+. Profiling on bo767 reproduced a 200 GB peak.

The upload step was the bottleneck. It waited for the whole embedded corpus, then handed it to one worker that held it through several reshape passes simultaneously. The driver also pulled a full copy back at the end. Add in stale workers still holding their old memory, and you hit the huge memory peak.

This PR fixes the issue by making the upload streaming. Each block writes itself to LanceDB. The first block creates the table, and all the rest only append. The search index is built once on the driver after the graph finishes.

Measured on bo767, same host:

Before After
Wall clock 57.9 min 48.0 min (−17 %)
Peak MemAvailable delta 165.3 GiB 50.3 GiB (−70 %)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@edknv edknv requested review from a team as code owners May 15, 2026 05:35
@edknv edknv requested a review from charlesbluca May 15, 2026 05:35
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR converts the LanceDB upload step from a memory-intensive global-batch write (collect entire embedded corpus → single worker → reshape → driver pull) to a streaming per-block write, achieving a 70% peak-memory reduction and 17% wall-clock speedup on the benchmark host.

  • Streaming VDB writes: IngestVdbOperator.REQUIRES_GLOBAL_BATCH flipped to False; each Ray Data block calls LanceDB.append() with overwrite=True on the first committed batch and table.add() on every subsequent one. build_index() is deferred to a driver-side GraphIngestor._finalize_vdb_upload() hook after the graph completes.
  • Driver-side memory fix: RayDataExecutor returns ds.materialize() (block refs only) instead of ds.to_pandas(). Results flow through a new IngestResult wrapper whose accessors are all serveable from streaming Ray Dataset primitives.
  • Connection caching + Lance commit-count control: The lance connection and table handle are cached on the LanceDB instance across batches; ingestor_runtime coalesces blocks to ~1000 rows to bound Lance manifest growth.

Confidence Score: 4/5

Safe to merge except for one backward-compat trap in LanceDB's constructor.

The build_index kwarg removal in LanceDB.__init__ is not guarded with a kwargs.pop before super().__init__(**kwargs). Because VDB.__init__ does self.__dict__.update(kwargs), any caller still using the old LanceDB(build_index=False) API will have the boolean land as an instance attribute, silently shadowing the new build_index() method. The first call to vdb.build_index() in _finalize_vdb_upload would then raise TypeError: 'bool' object is not callable. All other parts of the streaming contract are correct and well-tested.

nemo_retriever/src/nemo_retriever/vdb/lancedb.py — the LanceDB.__init__ kwarg-pop gap.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/vdb/lancedb.py Adds append() and build_index() streaming methods; removes legacy build_index constructor param but doesn't guard against it landing in **kwargs and shadowing the new method via VDB.__dict__.update.
nemo_retriever/src/nemo_retriever/vdb/operators.py Switches IngestVdbOperator from global-batch VDB.run() to per-batch VDB.append(); adds overwrite-flag handshake, accounting-column projection, and _vdb_uploadable flag column.
nemo_retriever/src/nemo_retriever/pipeline/ingest_result.py New module wrapping ingest output in a streaming-friendly accessor; correctly avoids Dataset.to_pandas() for all public accessors.
nemo_retriever/src/nemo_retriever/graph_ingestor.py Adds _finalize_vdb_upload driver-side hook that constructs a fresh VDB instance and calls build_index() once after the graph finishes streaming writes.
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py Pins IngestVdbOperator concurrency to 1 (required for overwrite-flag correctness) and sets a default batch_size=1000 to reduce Lance manifest commit count.
nemo_retriever/src/nemo_retriever/graph/executor.py Replaces ds.to_pandas() with ds.materialize(), keeping block refs on the driver instead of pulling the full corpus into a single DataFrame.
nemo_retriever/tests/test_lancedb_streaming_append.py New integration test pinning the streaming round-trip and connection-caching invariants against a real lancedb instance.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/src/nemo_retriever/vdb/lancedb.py:354
**Legacy `build_index` kwarg still shadows the new `build_index()` method**

`VDB.__init__` performs `self.__dict__.update(kwargs)`. Any unknown kwarg surviving to that call becomes an instance attribute with the same name. The old `LanceDB` constructor accepted `build_index: bool | None = None` as an explicit parameter; the new one removes it — but leaves no `kwargs.pop("build_index", None)` guard before `super().__init__(**kwargs)`.

Consequence: a caller using the old API (`LanceDB(build_index=False)`) passes that key through `**kwargs``VDB.__init__(build_index=False)``self.build_index = False`. Python's attribute lookup then resolves the instance dict before the class, so `vdb.build_index` is the boolean `False` and `GraphIngestor._finalize_vdb_upload()` raises `TypeError: 'bool' object is not callable` on `vdb.build_index()`.

Reviews (4): Last reviewed commit: "fix test" | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/pipeline/ingest_result.py
Comment thread nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/adt_vdb.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/operators.py Outdated
Comment thread nemo_retriever/tests/test_nv_ingest_vdb_operator.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/vdb/lancedb.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant