[FEA] Multi-node Out of Core Streaming KMeans API by tarang-jain · Pull Request #2066 · rapidsai/cuvs

tarang-jain · 2026-05-07T00:20:10Z

Merge after #2015 and #2017

Allows a stream of input matrices per worker, which are further batched using the `streaming_batch_size` parameter. Reasoning: we should be able to supply Dask partitions (on host) directly, without first concatenating them into one consolidated matrix.
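
To make the batching concrete, here is a minimal host-side sketch of how a list of partitions could be cut into batches of at most `streaming_batch_size` rows. The names `partition_rows`, `batch_span` and `plan_batches` are hypothetical and do not come from the PR, and the sketch assumes a batch never spans two partitions, which the description above does not state.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one host-resident partition; only its row count matters here.
struct partition_rows {
  std::size_t n_rows;
};

// One batch: which partition it comes from, where it starts, and how many rows it has.
struct batch_span {
  std::size_t part;
  std::size_t row_offset;
  std::size_t n_rows;
};

// Cut every partition into consecutive batches of at most `streaming_batch_size` rows.
// Assumption: batches do not cross partition boundaries, and empty partitions yield no batch.
std::vector<batch_span> plan_batches(const std::vector<partition_rows>& parts,
                                     std::size_t streaming_batch_size)
{
  std::vector<batch_span> batches;
  if (streaming_batch_size == 0) { return batches; }
  for (std::size_t p = 0; p < parts.size(); ++p) {
    for (std::size_t off = 0; off < parts[p].n_rows; off += streaming_batch_size) {
      std::size_t len = std::min(streaming_batch_size, parts[p].n_rows - off);
      batches.push_back({p, off, len});
    }
  }
  return batches;
}
```

With partitions of 5, 0 and 3 rows and a batch size of 2, this plan contains five batches, and the empty partition contributes none.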

As part of this PR, we also unify the multi-GPU implementations into one (previously the out-of-core implementation was separate).
Tests: we remove the separate out-of-core test file. The single MG test unit now covers both out-of-core and on-device matrices.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/cluster/detail/kmeans_mg.cuh`:
- Around line 188-197: The variable has_data is declared after it's used in the
if-block; move the declaration bool has_data = (n_local > 0); to before the if
statement that uses it (the block containing RAFT_LOG_WARN and
streaming_batch_size assignment) or replace uses of has_data with the expression
(n_local > 0) directly; ensure you only have one definition of has_data (no
shadowing) so functions/conditions like the if (data_on_device && has_data &&
streaming_batch_size < max_part_rows) see a valid variable.
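
The ordering fix suggested here can be sketched as a small standalone predicate. The function name `needs_batch_size_adjustment` is hypothetical; only `has_data`, `n_local`, `data_on_device`, `streaming_batch_size` and `max_part_rows` appear in the review comment, and what the real branch does after the warning is not shown in this thread.

```cpp
#include <cstddef>

// Hypothetical helper: evaluates the condition quoted in the review comment.
// `has_data` is defined once, before its first use, so the condition below sees a valid variable.
bool needs_batch_size_adjustment(bool data_on_device,
                                 std::size_t n_local,
                                 std::size_t streaming_batch_size,
                                 std::size_t max_part_rows)
{
  const bool has_data = (n_local > 0);
  return data_on_device && has_data && streaming_batch_size < max_part_rows;
}
```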

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 95c4ace5-47ce-4301-a2d0-ca00029462a5

📥 Commits

Reviewing files that changed from the base of the PR and between ad180ed and 28f6036.

📒 Files selected for processing (2)

cpp/src/cluster/detail/kmeans_mg.cuh
cpp/tests/cluster/kmeans_mg.cu

💤 Files with no reviewable changes (1)

cpp/tests/cluster/kmeans_mg.cu

tarang-jain · 2026-05-31T21:09:36Z

/ok to test 089e970

viclafargue

Thanks for working on this @tarang-jain. Unifying the implementations is a good step for maintainability, but I’d like to raise a few concerns around initialization.

This PR changes regular MG KMeans initialization semantics in ways that should be discussed explicitly:

- The truly distributed KMeans++ initialization is dropped. This limits the scaling of centroid initialization to what the root rank allows. That may be an acceptable change, since the simpler init may need fewer communication steps, but it would need to be discussed.
- `init_size` is documented as a host-only parameter, but regular KMeans now uses it, or defaults to min(3 * n_clusters, global_n).
- Some NCCL communication patterns (left over from the multi-GPU Batched KMeans PR) should be revisited:
  - Array init: redundant broadcast from root. All ranks should normally already be correctly initialized with the user-defined centroids.
  - Random init: points are randomly sampled on each rank, merged with an allreduce operation, and then broadcast again from root. We should remove the unnecessary broadcast.
  - KMeans++: if we keep the current root-rank implementation, the communication pattern should be adjusted. Only the root rank needs the full initialization sample, so gather-to-root would be a better fit than a dense allreduce over the full sampled matrix on every rank. The resulting centroids can then be broadcast.
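
For reference, the default mentioned above, min(3 * n_clusters, global_n), can be written as a small helper. This is only a sketch: `resolve_init_size` is a hypothetical name, and treating a non-positive `init_size` as "not set" is an assumption that this thread does not confirm.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper illustrating the default discussed in the review.
// Assumption: init_size <= 0 means the user did not set it.
std::int64_t resolve_init_size(std::int64_t init_size,
                               std::int64_t n_clusters,
                               std::int64_t global_n)
{
  if (init_size > 0) { return init_size; }
  return std::min<std::int64_t>(3 * n_clusters, global_n);
}
```

With 10 clusters and 1000 global rows the default is 30 samples; with only 20 global rows it is capped at 20.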

viclafargue · 2026-06-01T09:28:26Z

+void fit(
+  raft::resources const& handle,
+  const cuvs::cluster::kmeans::params& params,
+  const std::vector<raft::device_matrix_view<const float, int>>& X_parts,


Please expose the new API in cpp/include/cuvs/cluster/kmeans.hpp.

Yeah, we need to find a good way to do that. The problem is that it would look identical to the single-matrix, single-GPU API. If a user passes an SG handle to this API, it would go through this MG implementation.

I added them in the ::mg namespace.
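
The namespace approach can be illustrated with a toy example in which two overloads with identical parameter lists are told apart only by their enclosing namespace. Everything here is a stand-in: the `sketch` namespace, `resources` and `matrix_view` are hypothetical placeholders, not the real `raft` or `cuvs` types, and the actual signatures added to the header are not shown in this thread.

```cpp
#include <string>
#include <vector>

namespace sketch {

struct resources {};    // placeholder for raft::resources
struct matrix_view {};  // placeholder for a device matrix view

namespace kmeans {

// Single-GPU entry point (placeholder body).
inline std::string fit(const resources&, const std::vector<matrix_view>&)
{
  return "single-GPU";
}

namespace mg {

// Multi-GPU entry point: same parameter list, selected explicitly through the namespace.
inline std::string fit(const resources&, const std::vector<matrix_view>&)
{
  return "multi-GPU";
}

}  // namespace mg
}  // namespace kmeans
}  // namespace sketch
```

A caller then opts into the multi-GPU path by writing `sketch::kmeans::mg::fit(...)`, so an SG handle passed to the unqualified `kmeans::fit` cannot be routed to the MG implementation by accident.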

viclafargue · 2026-06-01T09:56:46Z

+      if (n_weights == 0) { continue; }
+
+      auto d_part_wt = raft::make_device_scalar<DataT>(dev_res, DataT{0});
+      cuvs::cluster::kmeans::detail::weightSum(dev_res, weights, d_part_wt.view());


Partition weights are validated per partition, not globally. weightSum throws an error when the partial sum is <= 0, but we should instead fail when the global sum (after reduce operation) is <= 0.

I intentionally removed the global assertion. Earlier we were doing a stream sync to bring the sum to host and validate it there. I don't think syncing the whole stream just for an assertion is worth the performance impact.
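
The difference between the two validation strategies can be shown with a host-only sketch. The function names are hypothetical, and plain `std::vector` containers stand in for device buffers and the cross-rank reduce; the real `weightSum` runs on device and is not reproduced here.

```cpp
#include <numeric>
#include <vector>

using partition_weights = std::vector<std::vector<double>>;

// Per-partition rule: every non-empty partition must have a strictly positive weight sum.
bool passes_per_partition_check(const partition_weights& parts)
{
  for (const auto& w : parts) {
    if (w.empty()) { continue; }  // mirrors `if (n_weights == 0) { continue; }`
    if (std::accumulate(w.begin(), w.end(), 0.0) <= 0.0) { return false; }
  }
  return true;
}

// Global rule: only the sum over all partitions (here a plain loop instead of a reduce)
// must be strictly positive.
bool passes_global_check(const partition_weights& parts)
{
  double total = 0.0;
  for (const auto& w : parts) {
    total += std::accumulate(w.begin(), w.end(), 0.0);
  }
  return total > 0.0;
}
```

For example, the partitions `{0, 0}` and `{1, 2}` fail the per-partition rule because the first partition sums to zero, yet they pass the global rule because the total weight is 3.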

tarang-jain and others added 30 commits April 10, 2026 15:54

combine impls

66d7fd3

Multi-GPU Batched KMeans

07707af

Merge branch 'main' into mg-batched-kmeans

efc270f

rm inertia_check

0a09e6f

change to warning

99a5730

style

a077406

add init_size param

d659875

Merge branch 'main' into combine-batch

ec2e8b7

docs

03a6473

Merge branch 'combine-batch' of https://github.com/tarang-jain/cuvs i…

42a8d9d

…nto combine-batch

rm direct cuda api calls

86af2fa

std::swap instead of raft::copy

d4e4e2c

cache batch norms

0819af5

centroid norms can also be cached per iteration

e0f079c

mg n_iter

c2f7390

pre-commit

b9c3102

do not break c abi

e3956c1

Merge branch 'main' into combine-batch

986d78a

cluster_cost on device

7197b71

Updated testing

84ab315

templating

47d4b94

Merge branch 'main' into combine-batch

a8e1d26

fix checkWeight

384d054

merge upstream:

455b286

Merge branch 'combine-batch' of https://github.com/tarang-jain/cuvs i…

5462809

…nto combine-batch

fix compilation

6ba759c

rel_tol

e76eaac

Co-authored-by: Victor Lafargue <viclafargue@nvidia.com>

pass workspace

afbefdf

Merge branch 'combine-batch' of https://github.com/tarang-jain/cuvs i…

e62a63c

…nto combine-batch

style

e4f08bf

tarang-jain requested a review from a team as a code owner May 28, 2026 20:38

tarang-jain requested a review from msarahan May 28, 2026 20:38

tarang-jain and others added 8 commits May 28, 2026 13:39

merge upstream

9851017

update cmakelists

f572877

merge upstream

edaa7e7

rm batched tests

588bb6a

Merge branch 'main' into mnmg-streaming

ad180ed

rm unnecessary test stream sycns

72cc34b

reset bs; assertion

ed50703

Merge branch 'mnmg-streaming' of github.com:tarang-jain/cuvs into mnm…

28f6036

…g-streaming

coderabbitai Bot reviewed May 29, 2026

Comment thread cpp/src/cluster/detail/kmeans_mg.cuh Outdated

tarang-jain and others added 5 commits May 29, 2026 13:22

rm has_data flag

a811c56

Merge branch 'main' into mnmg-streaming

95f334c

fix export

d176314

Merge branch 'mnmg-streaming' of github.com:tarang-jain/cuvs into mnm…

1db9e02

…g-streaming

avoid pinned scalar;get_nccl_comms before omp

089e970

viclafargue requested a review from dantegd June 1, 2026 10:06

viclafargue reviewed Jun 1, 2026

tarang-jain added 6 commits June 1, 2026 09:34

use root from macro

6cc895c

avoid copy and rank alloc with initarray

4abe6f2

Merge branch 'main' of https://github.com/rapidsai/cuvs into mnmg-str…

ebf188a

…eaming

fix compilation; guardrail MG CMake flag

785e4a3

get n_features from centroids

9a526c8

add sigs to header

f08e581

cjnolet added this to cuVS Library & Integrations Roadmap Jun 2, 2026

github-project-automation Bot moved this to Todo in cuVS Library & Integrations Roadmap Jun 2, 2026

cjnolet moved this from Todo to Done in cuVS Library & Integrations Roadmap Jun 2, 2026

cjnolet removed this from cuVS Library & Integrations Roadmap Jun 2, 2026

Merge branch 'main' into mnmg-streaming

51efb42

tarang-jain wants to merge 156 commits into rapidsai:main from tarang-jain:mnmg-streaming

3 participants
