
Refactor tensor class in C++ unit tests#2962

Open
timmoon10 wants to merge 12 commits into NVIDIA:main from timmoon10:tmoon/refactor-cpp-test-tensor

Conversation

@timmoon10
Collaborator

Description

The tensor wrapper in the C++ unit tests has become unwieldy, with complicated interactions between recipes and memory management. This recently resulted in bugs where we accidentally failed to allocate a required buffer (#2943). This PR disentangles memory management from the recipe logic by adding a simple RAII class to manage GPU and CPU buffers. I've also added more explicit checks, e.g. where we assume a tensor holds a single FP32 value.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add class to manage GPU buffer, CPU buffer, and memory transfers between them.
  • Remove memory management logic from tensor class in C++ tests.
  • Add checks to accessors that make implicit assumptions on buffer size and dtype.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 7 commits April 30, 2026 01:51
Refactor test tensor wrapper by removing recipe-specific logic whenever possible.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
- Fix syntax error in switch case (:: -> :)
- Fix double-underscore typo in variable name
- Fix wrong buffer passed to set_amax_columnwise
- Fix unique_ptr assignment from raw pointer (use reset())
- Remove dead duplicate NVTE_MXFP8_1D_SCALING branch in get_scales()
- Rename cpu_data -> cpu_buffer to match Buffer class API
- Remove const from Tensor::to_cpu/from_cpu and their callers,
  since both methods write to the CPU buffer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
CPU and GPU types are inconsistent, so the type checks cause too many problems.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR refactors the C++ unit test tensor infrastructure by extracting a new inner Buffer class that owns matching GPU and CPU allocations, simplifying the Tensor class to delegate memory transfers through it, and adding explicit type/size guards to accessors. The operator tests are also updated to capture scale values before GPU operations run, avoiding a race between output.scale() calls and the kernel overwriting the scale buffer.

  • New Tensor::Buffer RAII class manages paired GPU/CPU buffers and cudaMemcpy transfers; Tensor now holds std::optional<Buffer> members instead of raw unique_ptr pairs, eliminating the previous tangle of manual alloc/free logic.
  • Operator test fixes capture ref_scale from output.scale() before nvte_quantize runs so the reference value is stable.
  • Added accessor guards (NVTE_CHECK on size and dtype) in amax(), scale(), set_scale_inv(), etc., ensuring misuse gives a clear error instead of a silent bad read.

Confidence Score: 4/5

Safe to merge after fixing the one-line null dereference in the Tensor constructor's columnwise-only FP8 path.

In the new Tensor constructor, the shared scale-inverse buffer is correctly created, but when rowwise=false and columnwise=true the if(rowwise) guard means scale_inv_rowwise_ is never assigned. The following if(columnwise) block then calls scale_inv_rowwise_->gpu_buffer() — a null shared_ptr dereference — crashing any test that constructs a columnwise-only FP8 tensor with delayed scaling. The rest of the PR is clean.

tests/cpp/test_common.cu — specifically the NVTE_DELAYED_TENSOR_SCALING branch of the Tensor constructor around the scale_inv_columnwise_ assignment.

Important Files Changed

Filename Overview
tests/cpp/test_common.cu Core refactor: adds Buffer class, rewrites Tensor constructor with scaling-mode switch; contains a null-dereference when rowwise=false and columnwise=true under NVTE_DELAYED_TENSOR_SCALING with an FP8 dtype.
tests/cpp/test_common.h New Buffer inner class declared with clean RAII; accessor guards added to rowwise_cpu_dptr/columnwise_cpu_dptr; shared_ptr used for scale_inv to allow shared ownership between rowwise and columnwise.
tests/cpp/operator/test_act.cu Captures ref_scale before GPU ops to avoid reading scale after the kernel may have modified it.
tests/cpp/operator/test_cast.cu Captures output_c.scale() before nvte_quantize; uses ref_scale consistently for compute_ref and scale_inv comparison.
tests/cpp/operator/test_cast_current_scaling.cu Initializes ref_scale=1.0 before conditional assignment and replaces isFp8Type(otype) with the already-cached is_out_fp8 guard.
tests/cpp_distributed/test_comm_gemm.cu Guards set_scale/set_scale_inv with isFp8Type check to avoid calling the new NVTE_CHECK-gated setters on non-FP8 tensors that lack a scale buffer.

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class Tensor {
        -TensorWrapper tensor_
        -optional~Buffer~ data_rowwise_
        -optional~Buffer~ data_columnwise_
        -shared_ptr~Buffer~ scale_inv_rowwise_
        -shared_ptr~Buffer~ scale_inv_columnwise_
        -optional~Buffer~ amax_rowwise_
        -optional~Buffer~ amax_columnwise_
        -optional~Buffer~ scale_
        -bool rowwise_
        -bool columnwise_
        +to_cpu()
        +from_cpu()
        +rowwise_cpu_dptr~T~()
        +columnwise_cpu_dptr~T~()
        +set_scale_inv(float)
        +amax() float
        +scale() float
    }
    class Buffer {
        -unique_ptr~unsigned char[]~ cpu_buffer_
        -unique_ptr~unsigned char[], GPUDeleter~ gpu_buffer_
        -size_t size_
        -DType dtype_
        -size_t bytes_
        +to_cpu()
        +from_cpu()
        +cpu_buffer~T~() T*
        +gpu_buffer~T~() T*
        +size() size_t
        +dtype() DType
    }
    class GPUDeleter {
        +operator()(void* ptr)
    }
    Tensor "1" *-- "0..7" Buffer : owns (optional/shared)
    Buffer *-- GPUDeleter : uses

Reviews (5): Last reviewed commit: "Merge branch 'main' into tmoon/refactor-..."

Comment thread tests/cpp/test_common.h
Comment thread tests/cpp/test_common.cu Outdated
Also adopt review suggestions from @greptile-apps.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Comment thread tests/cpp/test_common.cu
Comment on lines +940 to +950
// Fill scales
if (t->scaling_mode() == NVTE_DELAYED_TENSOR_SCALING) {
if (isFp8Type(t->dtype())) {
// FP8 tensor scale is set to 1
t->set_scale_inv(1.0);
}
} else {
// Block scales are filled randomly
t->fill_uniform_rowwise_scale_inv();
t->fill_uniform_columnwise_scale_inv();
}
Collaborator Author

@timmoon10 May 6, 2026


This is weird, but it approximates the previous behavior.

Comment thread tests/cpp/test_common.cu Outdated
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Comment thread tests/cpp/test_common.h Outdated
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
@timmoon10
Collaborator Author

/te-ci core L1

Collaborator

@Oleg-Goncharov left a comment


LGTM, this looks much cleaner now, but the cast+transpose current scaling tests are failing with a segmentation fault.

Comment thread tests/cpp/test_common.cu
@timmoon10
Collaborator Author

/te-ci core L1

Comment thread tests/cpp/test_common.cu
Comment on lines 370 to 373
if (columnwise) {
tensor_.set_columnwise_scale_inv(rowwise_scale_inv, DType::kFloat32,
std::vector<size_t>{1});
columnwise_scale_inv_cpu_data_ = std::make_unique<unsigned char[]>(sizeof(float));
std::fill_n(columnwise_scale_inv_cpu_data_.get(), sizeof(float), 0);
}
} else {
if (scaling_mode == NVTE_NVFP4_1D_SCALING) {
// Used for NVFP4 second stage scaling
amax_cpu_data_ = std::make_shared<float>(0);
amax_cpu_data_columnwise_ = std::make_shared<float>(0);
cudaMalloc((void**)&amax, sizeof(float)); // NOLINT(*)
cudaMalloc((void**)&amax_columnwise, sizeof(float)); // NOLINT(*)
cudaMemset(amax, 0, sizeof(float));
cudaMemset(amax_columnwise, 0, sizeof(float));
tensor_.set_amax(amax, DType::kFloat32, std::vector<size_t>{1});
tensor_.set_columnwise_amax(amax_columnwise, DType::kFloat32, std::vector<size_t>{1});
scale_inv_columnwise_ = scale_inv;
tensor_.set_columnwise_scale_inv(scale_inv_rowwise_->gpu_buffer(), DType::kFloat32, std::vector<size_t>{1});
}
Contributor


P1 When a tensor is constructed with rowwise=false and columnwise=true for an FP8 dtype under NVTE_DELAYED_TENSOR_SCALING, the if (rowwise) branch is skipped so scale_inv_rowwise_ stays nullptr. The very next block then calls scale_inv_rowwise_->gpu_buffer() to register the columnwise scale-inverse, dereferencing a null shared_ptr and crashing. The intent is to share the one scale_inv buffer; just use scale_inv->gpu_buffer() directly (as is already done two lines above for the rowwise path).

Suggested change
if (columnwise) {
tensor_.set_columnwise_scale_inv(rowwise_scale_inv, DType::kFloat32,
std::vector<size_t>{1});
columnwise_scale_inv_cpu_data_ = std::make_unique<unsigned char[]>(sizeof(float));
std::fill_n(columnwise_scale_inv_cpu_data_.get(), sizeof(float), 0);
}
} else {
if (scaling_mode == NVTE_NVFP4_1D_SCALING) {
// Used for NVFP4 second stage scaling
amax_cpu_data_ = std::make_shared<float>(0);
amax_cpu_data_columnwise_ = std::make_shared<float>(0);
cudaMalloc((void**)&amax, sizeof(float)); // NOLINT(*)
cudaMalloc((void**)&amax_columnwise, sizeof(float)); // NOLINT(*)
cudaMemset(amax, 0, sizeof(float));
cudaMemset(amax_columnwise, 0, sizeof(float));
tensor_.set_amax(amax, DType::kFloat32, std::vector<size_t>{1});
tensor_.set_columnwise_amax(amax_columnwise, DType::kFloat32, std::vector<size_t>{1});
scale_inv_columnwise_ = scale_inv;
tensor_.set_columnwise_scale_inv(scale_inv_rowwise_->gpu_buffer(), DType::kFloat32, std::vector<size_t>{1});
}
if (columnwise) {
scale_inv_columnwise_ = scale_inv;
tensor_.set_columnwise_scale_inv(scale_inv->gpu_buffer(), DType::kFloat32, std::vector<size_t>{1});
}
