Check input dimensions #76

Open

hcho3 wants to merge 3 commits into rapidsai:main from hcho3:check_input_dims

Conversation

Contributor

@hcho3 hcho3 commented Mar 5, 2026

Closes #72

@hcho3 hcho3 requested a review from a team as a code owner March 5, 2026 23:29
@hcho3 hcho3 added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels and removed the Cython / Python label Mar 5, 2026

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added ability to query the number of features expected by a model.
  • Improvements

    • Added input validation to reject predictions with mismatched feature dimensions and provide clear error messages.
  • Tests

    • Added test coverage verifying that shape mismatches raise descriptive errors.

Walkthrough

Adds a num_features property to the ForestInference interface and implementations, and enforces input dimension validation (2D and matching feature count) in predict/predict_per_tree/apply; includes tests that assert a ValueError is raised for mismatched feature counts.

Changes

  • Interface (python/nvforest/nvforest/_base.py): adds the abstract property num_features(self) -> int to ForestInference.
  • Implementations (python/nvforest/nvforest/_forest_inference.py): adds a public num_features property to the CPU/GPU classifier and regressor inference classes, delegating to self.forest.num_features.
  • Cython validation (python/nvforest/nvforest/detail/forest_inference.pyx): adds _validate_input_dims(X) to ForestInference_impl and ForestInferenceImpl, and calls it at the start of predict, predict_per_tree, and apply to enforce 2D input and a matching feature count, raising ValueError on mismatch.
  • Tests (python/nvforest/tests/test_nvforest.py): adds a parametrized test_incorrect_data_shape() test asserting that a ValueError with the expected-feature message is raised when the input feature count differs from the model's.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 5.88%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title 'Check input dimensions' accurately describes the main change: adding validation to check input data dimensions against model training features.
  • Description check: ✅ Passed. The description 'Closes #72' references the issue that outlines the input dimension validation requirements.
  • Linked Issues check: ✅ Passed. The PR implements all requirements from issue #72: adds the num_features property to ForestInference classes, implements input dimension validation in predict/apply methods, and includes test coverage for dimension mismatch errors.
  • Out of Scope Changes check: ✅ Passed. All changes are directly related to issue #72's requirements: adding num_features, validating input dimensions, and testing the validation logic.



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
python/nvforest/nvforest/_base.py (1)

93-97: Consider adding a docstring for the new num_features property.

While other abstract properties in this file also lack docstrings, the coding guidelines recommend NumPy-style docstrings for public functions/properties. Consider adding a brief docstring for completeness, such as:

```python
@property
@abstractmethod
def num_features(self) -> int:
    """Return the number of features expected by the model."""
    pass
```

As per coding guidelines: "All public functions should have NumPy-style docstrings."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/nvforest/nvforest/_base.py` around lines 93 - 97, add a NumPy-style
docstring to the abstract property num_features on ForestInference (the `@property`
`@abstractmethod` def num_features(self) -> int) describing that it returns the
number of features expected by the model; keep it brief (one-line summary and
optional short description/returns section) consistent with other public APIs.
python/nvforest/tests/test_nvforest.py (1)

861-873: Good test coverage for the new feature validation.

The test correctly verifies:

  1. The num_features property returns the expected value.
  2. A ValueError is raised with a helpful message when input dimensions mismatch.

Consider extending coverage for completeness:

  • Test with both CPU and GPU devices (parameterize with @pytest.mark.parametrize("device", ("cpu", "gpu"))).
  • Test other inference methods (predict_proba, predict_per_tree, apply) that also validate input dimensions.
  • Test with too many features (e.g., 6 columns instead of 5), not just too few.
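
A self-contained sketch of what that broader coverage could look like, using a stub in place of a real fitted model. The method names follow the ones mentioned in this review; the stub class and its validation body are illustrative, not the project's actual implementation:

```python
import numpy as np
import pytest

class _StubModel:
    """Stand-in for a fitted ForestInference model expecting 5 features."""
    num_features = 5

    def predict(self, X):
        # Hypothetical check mirroring the validation this PR adds.
        if X.shape[1] != self.num_features:
            raise ValueError(
                f"Expected {self.num_features} features, got {X.shape[1]}"
            )
        return np.zeros(X.shape[0])

    # The real methods differ, but for shape validation they behave alike.
    predict_proba = predict_per_tree = apply = predict

fm = _StubModel()
# Exercise both too few and too many columns for every validated method.
for method in ("predict", "predict_proba", "predict_per_tree", "apply"):
    for n_cols in (4, 6):
        with pytest.raises(ValueError, match=f"Expected {fm.num_features} features"):
            getattr(fm, method)(np.zeros((1, n_cols)))
```

In the real test, fm would come from nvforest.load_from_sklearn(clf, device=device) with device parametrized over ("cpu", "gpu").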
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/nvforest/tests/test_nvforest.py` around lines 861 - 873, Extend
test_incorrect_data_shape to parametrize over device and to exercise all
inference methods that validate input shape: add
`@pytest.mark.parametrize`("device", ("cpu","gpu")) and call
nvforest.load_from_sklearn(clf, device=device) to get fm; keep the existing
assert on fm.num_features, then for each method name in
("predict","predict_proba","predict_per_tree","apply") use
pytest.raises(ValueError, match=f"Expected {n_features} features") to call
getattr(fm, method)(np.zeros((1, 4))) and also test too-many-features by calling
each method with np.zeros((1, n_features+1)); ensure you reference the existing
test_incorrect_data_shape and the fm object when adding these checks.


📥 Commits

Reviewing files that changed from the base of the PR and between d8aa3ee and 47025ff.

📒 Files selected for processing (4)
  • python/nvforest/nvforest/_base.py
  • python/nvforest/nvforest/_forest_inference.py
  • python/nvforest/nvforest/detail/forest_inference.pyx
  • python/nvforest/tests/test_nvforest.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2



📥 Commits

Reviewing files that changed from the base of the PR and between 47025ff and 1c077cf.

📒 Files selected for processing (1)
  • python/nvforest/tests/test_nvforest.py

Comment on lines +877 to +880

```python
clf = RandomForestClassifier(max_features="sqrt", n_estimators=10)
clf.fit(X, y)

fm = nvforest.load_from_sklearn(clf, device="cpu")
```

⚠️ Potential issue | 🟠 Major

Cover the actual issue-72 loading path here.

This only exercises load_from_sklearn(..., device="cpu"), but issue #72 is about feature-count validation after loading a serialized model via ForestInference.load / nvforest.load_model. A loader-specific num_features regression would still slip through, so please add at least one saved-model case that goes through load_model and reproduces the user-facing path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/nvforest/tests/test_nvforest.py` around lines 877 - 880, The test
currently only exercises nvforest.load_from_sklearn; add a saved-model test that
goes through the serialized-load path (use nvforest.save_model or
ForestInference.save if available to write the model to a temporary file, then
call nvforest.load_model or ForestInference.load with device="cpu") to reproduce
issue-72's validation path; ensure the loaded model's feature count is validated
by asserting expected behavior (e.g., raises or matches num_features) after
loading from disk so any regression in the load_model / ForestInference.load
path is caught.

Comment on lines +882 to +884

```python
with pytest.raises(ValueError, match=f"Expected {n_features} features"):
    X_test = np.zeros((1, input_size))
    _ = predict_func(fm, X_test)
```

⚠️ Potential issue | 🟡 Minor

Assert the received feature count in the error message too.

Right now any message that starts with Expected 5 features passes. Matching the actual width (input_size) here would enforce the expected-vs-actual detail that users need when debugging shape errors.

Suggested assertion tightening:

```diff
-    with pytest.raises(ValueError, match=f"Expected {n_features} features"):
+    with pytest.raises(
+        ValueError,
+        match=rf"Expected {n_features} features.*got {input_size}",
+    ):
```

As per coding guidelines "Error messages should be helpful and include expected vs actual values".
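
Since pytest's match= argument is checked with re.search against the string form of the exception, the tightened pattern enforces the expected-vs-actual detail. The counts 5 and 4 below are illustrative stand-ins for n_features and input_size:

```python
import re

# What the suggested rf-string would expand to for n_features=5, input_size=4.
pattern = r"Expected 5 features.*got 4"
assert re.search(pattern, "Expected 5 features, got 4")
# re.search scans anywhere in the message, so prefixes and infixes are fine.
assert re.search(pattern, "Bad input: Expected 5 features but got 4")
# A message lacking the actual count no longer passes.
assert not re.search(pattern, "Expected 5 features")
```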

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/nvforest/tests/test_nvforest.py` around lines 882 - 884, The test
currently only asserts the error starts with "Expected {n_features} features";
update the pytest.raises match to require the actual received feature count
(input_size) too so the exception message includes both expected and actual
counts. Locate the block using predict_func, fm, input_size and n_features and
change the match to assert something like "Expected {n_features} features, got
{input_size}" (or the project's chosen wording) so the raised ValueError
contains expected vs actual values.


Labels

Cython / Python, improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Check the dimensions of the input data passed to nvForest

1 participant