SNMG Batched KMeans Python API#2154
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (2)
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR adds single-node multi-GPU k-means fitting via new C and Python APIs. The C layer declares and implements fitting with DLPack tensor validation and dtype dispatch; the Python layer wraps the C API with NumPy host-array validation and tests across dtypes, initialization methods, and error cases. ChangesMulti-GPU K-Means APIs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (5)
c/tests/cluster/kmeans_mg_c.cu (1)
112-115: 💤 Low valueCentroid comparison may be sensitive to cluster ordering.
The test compares centroids positionally, expecting
{1.5, 1.5, 10.5, 10.5}. While the well-separated initial centroids{0,0}and{12,12}should converge deterministically to the two clusters, k-means implementations may still reorder clusters internally. If this test ever becomes flaky, consider sorting centroids before comparison or using an order-invariant comparison.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@c/tests/cluster/kmeans_mg_c.cu` around lines 112 - 115, The positional comparison of centroids (centroids_data vs kExpectedCentroids) is sensitive to cluster ordering; change the test to perform an order-invariant comparison by grouping centroids into kNClusters vectors of length kNFeatures (using centroids_data and kNFeatures to slice), sort those centroid vectors using a deterministic key (e.g., lexicographic compare on feature values or by the first feature), do the same sorting for the expected centroids (kExpectedCentroids), and then run EXPECT_NEAR pairwise on the sorted lists to ensure the test is robust to cluster reordering.c/src/cluster/mg_kmeans.cpp (1)
144-144: 💤 Low valueUnqualified
Arrayatmg_kmeans.cppisn’t an issue, but can be made clearer
Arrayis an enumerator from the C API unscoped enumcuvsKMeansInitMethod(defined inc/include/cuvs/cluster/kmeans.hand pulled in viamg_kmeans.h), so it’s expected thatif (params.init == Array)uses an unqualified name—there’s nocuvsKMeansInitMethod::Arrayform to qualify.- Optional: move
convert_params(params)before the check and comparekmeans_params.inittocuvs::cluster::kmeans::params::InitMethod::Arrayfor consistency with the C++ enum.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@c/src/cluster/mg_kmeans.cpp` at line 144, The condition uses the unscoped C enum value Array (params.init == Array); to make it clearer/consistent, call convert_params(params) first to build the C++ struct and then compare the C++ enum: replace the direct check of params.init with a check against kmeans_params.init == cuvs::cluster::kmeans::params::InitMethod::Array (use convert_params to produce kmeans_params), so the code references the C++ scoped enum rather than the unqualified C enumerator.python/cuvs/cuvs/cluster/mg/kmeans/__init__.py (1)
1-9: ⚡ Quick winConsider adding a module docstring.
This module serves as the public API entry point for single-node multi-GPU k-means. A brief docstring would help users understand the package's purpose and available exports.
📝 Suggested module docstring
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION. # SPDX-License-Identifier: Apache-2.0 +"""Single-node multi-GPU (SNMG) k-means clustering. + +This module provides k-means fitting across multiple GPUs on a single node, +with host-memory input arrays distributed across available devices. +""" + from cuvs.cluster.kmeans import KMeansParams🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs/cuvs/cluster/mg/kmeans/__init__.py` around lines 1 - 9, Add a concise module docstring at the top of the module to describe that this package is the public API entry point for single-node multi-GPU k-means and list the primary exports; update the module containing KMeansParams, FitOutput, and fit to include a short triple-quoted string explaining purpose, intended use, and the exported symbols (FitOutput, KMeansParams, fit) so users see what this subpackage provides.python/cuvs/cuvs/tests/test_mg_kmeans.py (1)
31-47: ⚡ Quick winConsider adding docstrings to test helper functions.
The helper functions
make_inputs,make_sample_weights, andpredict_labels_hostlack documentation explaining their purpose, parameters, and return values, which would improve test maintainability.Also applies to: 50-57
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs/cuvs/tests/test_mg_kmeans.py` around lines 31 - 47, Add concise docstrings to the test helper functions make_inputs, make_sample_weights, and predict_labels_host describing their purpose, parameters (dtype, n_rows, n_cols, n_clusters where applicable), return values (e.g., X and centroids for make_inputs; sample weights array for make_sample_weights; predicted labels for predict_labels_host), and any important behavior (e.g., deterministic RNG seeds and array contiguity). Place the docstring immediately under each function signature using a short triple-quoted string.python/cuvs/cuvs/tests/test_kmeans.py (1)
18-28: ⚡ Quick winConsider adding a docstring to the helper function.
The
make_well_separated_kmeans_inputhelper generates synthetic test data but lacks documentation explaining its purpose, parameters, and return values.📝 Suggested docstring
def make_well_separated_kmeans_input(rng, n_rows, n_cols, n_clusters, dtype): + """Generate well-separated synthetic k-means input with deterministic structure. + + Creates cluster centers with large separation (scale=10.0) and adds small + Gaussian noise (scale=0.01) to ensure clusters remain distinct. + + Args: + rng: NumPy random generator + n_rows: Number of data points + n_cols: Number of features + n_clusters: Number of clusters + dtype: NumPy dtype for the output arrays + + Returns: + Tuple of (X, initial_centroids) as contiguous NumPy arrays + """ labels = np.arange(n_rows) % n_clusters🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs/cuvs/tests/test_kmeans.py` around lines 18 - 28, Add a clear docstring to the helper function make_well_separated_kmeans_input describing its purpose (generate well-separated KMeans test data), parameters (rng: random generator, n_rows: int, n_cols: int, n_clusters: int, dtype: numpy dtype), and return values (X: contiguous ndarray of shape (n_rows, n_cols) with clustered samples, initial_centroids: ndarray of shape (n_clusters, n_cols) containing the initial centroids copied from X); place the docstring immediately below the def line and mention that X is returned as a contiguous array and that initial_centroids is a copy used for initialization.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@c/include/cuvs/cluster/mg_kmeans.h`:
- Around line 34-36: Update the Doxygen for the function in mg_kmeans.h so the
documented parameter type matches the actual signature: replace references to
cuvsMultiGpuResources_t with cuvsResources_t and add a short note that the
cuvsResources_t must represent a multi-GPU resource created by
cuvsMultiGpuResourcesCreate or cuvsMultiGpuResourcesCreateWithDeviceIds (the
implementation already validates this). Ensure the comment references the exact
symbol cuvsResources_t and mentions the creation functions
cuvsMultiGpuResourcesCreate / cuvsMultiGpuResourcesCreateWithDeviceIds to avoid
confusion.
In `@fern/pages/python_api/index.md`:
- Line 8: The sidebar shows two identical "[Kmeans]" link labels; update the
link label for the multi-GPU page to a distinct name (e.g., "Kmeans (multi-GPU)"
or "Kmeans — multi‑GPU") so the entry referencing
"/api-reference/python-api-cluster-mg-kmeans" is unambiguous; edit the link text
in fern/pages/python_api/index.md where the label "[Kmeans]" appears and leave
the URL unchanged.
In `@fern/pages/python_api/python-api-cluster-mg-kmeans.md`:
- Line 27: The `resources` parameter on the KMeans cluster API is undocumented;
update the `resources` row in python-api-cluster-mg-kmeans.md to describe its
behavior: state that `resources` is an optional cuvs.common.Resources object
controlling compute resources (CPUs, GPUs, memory) for training/inference,
document the default when omitted, list accepted fields/units (e.g., cpu_count,
gpu_count, memory_gb) and how the cluster uses them (scheduling/training
limits), and add a short usage example showing how to pass a
cuvs.common.Resources instance; reference the `resources` symbol and its type
`cuvs.common.Resources` in the description.
---
Nitpick comments:
In `@c/src/cluster/mg_kmeans.cpp`:
- Line 144: The condition uses the unscoped C enum value Array (params.init ==
Array); to make it clearer/consistent, call convert_params(params) first to
build the C++ struct and then compare the C++ enum: replace the direct check of
params.init with a check against kmeans_params.init ==
cuvs::cluster::kmeans::params::InitMethod::Array (use convert_params to produce
kmeans_params), so the code references the C++ scoped enum rather than the
unqualified C enumerator.
In `@c/tests/cluster/kmeans_mg_c.cu`:
- Around line 112-115: The positional comparison of centroids (centroids_data vs
kExpectedCentroids) is sensitive to cluster ordering; change the test to perform
an order-invariant comparison by grouping centroids into kNClusters vectors of
length kNFeatures (using centroids_data and kNFeatures to slice), sort those
centroid vectors using a deterministic key (e.g., lexicographic compare on
feature values or by the first feature), do the same sorting for the expected
centroids (kExpectedCentroids), and then run EXPECT_NEAR pairwise on the sorted
lists to ensure the test is robust to cluster reordering.
In `@python/cuvs/cuvs/cluster/mg/kmeans/__init__.py`:
- Around line 1-9: Add a concise module docstring at the top of the module to
describe that this package is the public API entry point for single-node
multi-GPU k-means and list the primary exports; update the module containing
KMeansParams, FitOutput, and fit to include a short triple-quoted string
explaining purpose, intended use, and the exported symbols (FitOutput,
KMeansParams, fit) so users see what this subpackage provides.
In `@python/cuvs/cuvs/tests/test_kmeans.py`:
- Around line 18-28: Add a clear docstring to the helper function
make_well_separated_kmeans_input describing its purpose (generate well-separated
KMeans test data), parameters (rng: random generator, n_rows: int, n_cols: int,
n_clusters: int, dtype: numpy dtype), and return values (X: contiguous ndarray
of shape (n_rows, n_cols) with clustered samples, initial_centroids: ndarray of
shape (n_clusters, n_cols) containing the initial centroids copied from X);
place the docstring immediately below the def line and mention that X is
returned as a contiguous array and that initial_centroids is a copy used for
initialization.
In `@python/cuvs/cuvs/tests/test_mg_kmeans.py`:
- Around line 31-47: Add concise docstrings to the test helper functions
make_inputs, make_sample_weights, and predict_labels_host describing their
purpose, parameters (dtype, n_rows, n_cols, n_clusters where applicable), return
values (e.g., X and centroids for make_inputs; sample weights array for
make_sample_weights; predicted labels for predict_labels_host), and any
important behavior (e.g., deterministic RNG seeds and array contiguity). Place
the docstring immediately under each function signature using a short
triple-quoted string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7a754ca0-8545-4e19-a93e-76804f4836ed
📒 Files selected for processing (23)
c/CMakeLists.txtc/include/cuvs/cluster/mg_kmeans.hc/include/cuvs/core/all.hc/src/cluster/mg_kmeans.cppc/tests/CMakeLists.txtc/tests/cluster/kmeans_mg_c.cufern/docs.ymlfern/pages/c_api/c-api-cluster-mg-kmeans.mdfern/pages/c_api/index.mdfern/pages/python_api/index.mdfern/pages/python_api/python-api-cluster-mg-kmeans.mdpython/cuvs/cuvs/cluster/CMakeLists.txtpython/cuvs/cuvs/cluster/__init__.pypython/cuvs/cuvs/cluster/kmeans/kmeans.pxdpython/cuvs/cuvs/cluster/kmeans/kmeans.pyxpython/cuvs/cuvs/cluster/mg/CMakeLists.txtpython/cuvs/cuvs/cluster/mg/__init__.pypython/cuvs/cuvs/cluster/mg/kmeans/CMakeLists.txtpython/cuvs/cuvs/cluster/mg/kmeans/__init__.pypython/cuvs/cuvs/cluster/mg/kmeans/kmeans.pxdpython/cuvs/cuvs/cluster/mg/kmeans/kmeans.pyxpython/cuvs/cuvs/tests/test_kmeans.pypython/cuvs/cuvs/tests/test_mg_kmeans.py
💤 Files with no reviewable changes (1)
- python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
| * @param[in] res cuvsMultiGpuResources_t opaque C handle | ||
| * created by cuvsMultiGpuResourcesCreate or | ||
| * cuvsMultiGpuResourcesCreateWithDeviceIds. |
There was a problem hiding this comment.
Documentation refers to cuvsMultiGpuResources_t but parameter is cuvsResources_t.
The Doxygen comment at line 34 mentions cuvsMultiGpuResources_t as the expected handle type, but the actual parameter type in the function signature (line 50) is cuvsResources_t. While the implementation validates that the handle is a multi-GPU resource, the documentation could cause confusion for API consumers.
📝 Suggested documentation fix
-* `@param`[in] res cuvsMultiGpuResources_t opaque C handle
+* `@param`[in] res cuvsResources_t opaque C handle (must be a multi-GPU resource)
created by cuvsMultiGpuResourcesCreate or
cuvsMultiGpuResourcesCreateWithDeviceIds.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| * @param[in] res cuvsMultiGpuResources_t opaque C handle | |
| * created by cuvsMultiGpuResourcesCreate or | |
| * cuvsMultiGpuResourcesCreateWithDeviceIds. | |
| * `@param`[in] res cuvsResources_t opaque C handle (must be a multi-GPU resource) | |
| * created by cuvsMultiGpuResourcesCreate or | |
| * cuvsMultiGpuResourcesCreateWithDeviceIds. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@c/include/cuvs/cluster/mg_kmeans.h` around lines 34 - 36, Update the Doxygen
for the function in mg_kmeans.h so the documented parameter type matches the
actual signature: replace references to cuvsMultiGpuResources_t with
cuvsResources_t and add a short note that the cuvsResources_t must represent a
multi-GPU resource created by cuvsMultiGpuResourcesCreate or
cuvsMultiGpuResourcesCreateWithDeviceIds (the implementation already validates
this). Ensure the comment references the exact symbol cuvsResources_t and
mentions the creation functions cuvsMultiGpuResourcesCreate /
cuvsMultiGpuResourcesCreateWithDeviceIds to avoid confusion.
| ## Cluster | ||
|
|
||
| - [Kmeans](/api-reference/python-api-cluster-kmeans) | ||
| - [Kmeans](/api-reference/python-api-cluster-mg-kmeans) |
There was a problem hiding this comment.
Use a distinct label for the multi-GPU page.
This adds a second identical "Kmeans" label, making the two links ambiguous in the sidebar.
🧭 Proposed fix
-- [Kmeans](/api-reference/python-api-cluster-mg-kmeans)
+- [Multi-GPU Kmeans](/api-reference/python-api-cluster-mg-kmeans)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - [Kmeans](/api-reference/python-api-cluster-mg-kmeans) | |
| - [Multi-GPU Kmeans](/api-reference/python-api-cluster-mg-kmeans) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fern/pages/python_api/index.md` at line 8, The sidebar shows two identical
"[Kmeans]" link labels; update the link label for the multi-GPU page to a
distinct name (e.g., "Kmeans (multi-GPU)" or "Kmeans — multi‑GPU") so the entry
referencing "/api-reference/python-api-cluster-mg-kmeans" is unambiguous; edit
the link text in fern/pages/python_api/index.md where the label "[Kmeans]"
appears and leave the URL unchanged.
| | `X` | `host array-like` | Training instances, shape (m, k). Must be C-contiguous float32 or float64 host data. | | ||
| | `centroids` | `host array-like, optional` | Initial centroids when ``params.init_method == "Array"`` and output centroids for all init methods. If omitted, a host NumPy output array is allocated unless ``init_method == "Array"``. | | ||
| | `sample_weights` | `host array-like, optional` | Optional weights per observation. Must be C-contiguous and have the same dtype as X. | | ||
| | `resources` | `cuvs.common.Resources, optional` | | |
There was a problem hiding this comment.
Document the resources parameter behavior.
resources is currently undocumented (empty description), which makes the API contract incomplete for users.
📝 Proposed doc fix
-| `resources` | `cuvs.common.Resources, optional` | |
+| `resources` | `cuvs.common.Resources, optional` | Multi-GPU resources handle. If omitted, default resources are used by the wrapper. |📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| | `resources` | `cuvs.common.Resources, optional` | | | |
| | `resources` | `cuvs.common.Resources, optional` | Multi-GPU resources handle. If omitted, default resources are used by the wrapper. | |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fern/pages/python_api/python-api-cluster-mg-kmeans.md` at line 27, The
`resources` parameter on the KMeans cluster API is undocumented; update the
`resources` row in python-api-cluster-mg-kmeans.md to describe its behavior:
state that `resources` is an optional cuvs.common.Resources object controlling
compute resources (CPUs, GPUs, memory) for training/inference, document the
default when omitted, list accepted fields/units (e.g., cpu_count, gpu_count,
memory_gb) and how the cluster uses them (scheduling/training limits), and add a
short usage example showing how to pass a cuvs.common.Resources instance;
reference the `resources` symbol and its type `cuvs.common.Resources` in the
description.
| * closest cluster center. | ||
| * @param[out] n_iter Number of iterations run. | ||
| */ | ||
| CUVS_EXPORT cuvsError_t cuvsMultiGpuKMeansFit_v2(cuvsResources_t res, |
There was a problem hiding this comment.
we dont need this suffix. Can add breaking changes in this release.
Closes #2149 and #2155