Refactor tensor class in C++ unit tests (#2962)
Conversation
Refactor test tensor wrapper by removing recipe-specific logic whenever possible. Signed-off-by: Tim Moon <tmoon@nvidia.com>
- Fix syntax error in switch case (`::` -> `:`)
- Fix double-underscore typo in variable name
- Fix wrong buffer passed to set_amax_columnwise
- Fix unique_ptr assignment from raw pointer (use reset())
- Remove dead duplicate NVTE_MXFP8_1D_SCALING branch in get_scales()
- Rename cpu_data -> cpu_buffer to match Buffer class API
- Remove const from Tensor::to_cpu/from_cpu and their callers, since both methods write to the CPU buffer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
CPU and GPU types are inconsistent, so the type checks cause too many problems. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Greptile Summary: This PR refactors the C++ unit test tensor infrastructure by extracting a new inner `Buffer` RAII class that manages paired CPU and GPU allocations.
Confidence Score: 4/5

Safe to merge after fixing the one-line null dereference in the Tensor constructor's columnwise-only FP8 path. In the new Tensor constructor, the shared scale-inverse buffer is correctly created, but when rowwise=false and columnwise=true the if (rowwise) guard means scale_inv_rowwise_ is never assigned. The following if (columnwise) block then calls scale_inv_rowwise_->gpu_buffer() — a null shared_ptr dereference — crashing any test that constructs a columnwise-only FP8 tensor with delayed scaling. The rest of the PR is clean.

Affected location: tests/cpp/test_common.cu, specifically the NVTE_DELAYED_TENSOR_SCALING branch of the Tensor constructor around the scale_inv_columnwise_ assignment.

Important Files Changed
Class Diagram

```mermaid
%%{init: {'theme': 'neutral'}}%%
classDiagram
    class Tensor {
        -TensorWrapper tensor_
        -optional~Buffer~ data_rowwise_
        -optional~Buffer~ data_columnwise_
        -shared_ptr~Buffer~ scale_inv_rowwise_
        -shared_ptr~Buffer~ scale_inv_columnwise_
        -optional~Buffer~ amax_rowwise_
        -optional~Buffer~ amax_columnwise_
        -optional~Buffer~ scale_
        -bool rowwise_
        -bool columnwise_
        +to_cpu()
        +from_cpu()
        +rowwise_cpu_dptr~T~()
        +columnwise_cpu_dptr~T~()
        +set_scale_inv(float)
        +amax() float
        +scale() float
    }
    class Buffer {
        -unique_ptr~unsigned char[]~ cpu_buffer_
        -unique_ptr~unsigned char[], GPUDeleter~ gpu_buffer_
        -size_t size_
        -DType dtype_
        -size_t bytes_
        +to_cpu()
        +from_cpu()
        +cpu_buffer~T~() T*
        +gpu_buffer~T~() T*
        +size() size_t
        +dtype() DType
    }
    class GPUDeleter {
        +operator()(void* ptr)
    }
    Tensor "1" *-- "0..7" Buffer : owns (optional/shared)
    Buffer *-- GPUDeleter : uses
```
Reviews (5). Last reviewed commit: "Merge branch 'main' into tmoon/refactor-..."
Also adopt review suggestions from @greptile-apps. Signed-off-by: Tim Moon <tmoon@nvidia.com>
```cpp
// Fill scales
if (t->scaling_mode() == NVTE_DELAYED_TENSOR_SCALING) {
  if (isFp8Type(t->dtype())) {
    // FP8 tensor scale is set to 1
    t->set_scale_inv(1.0);
  }
} else {
  // Block scales are filled randomly
  t->fill_uniform_rowwise_scale_inv();
  t->fill_uniform_columnwise_scale_inv();
}
```
This is weird, but it approximates the previous behavior.
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
/te-ci core L1
Oleg-Goncharov
left a comment
LGTM, this looks much cleaner now, but the cast+transpose current scaling tests are failing with a segmentation fault.
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci core L1
```cpp
  if (columnwise) {
    tensor_.set_columnwise_scale_inv(rowwise_scale_inv, DType::kFloat32,
                                     std::vector<size_t>{1});
    columnwise_scale_inv_cpu_data_ = std::make_unique<unsigned char[]>(sizeof(float));
    std::fill_n(columnwise_scale_inv_cpu_data_.get(), sizeof(float), 0);
  }
} else {
  if (scaling_mode == NVTE_NVFP4_1D_SCALING) {
    // Used for NVFP4 second stage scaling
    amax_cpu_data_ = std::make_shared<float>(0);
    amax_cpu_data_columnwise_ = std::make_shared<float>(0);
    cudaMalloc((void**)&amax, sizeof(float));  // NOLINT(*)
    cudaMalloc((void**)&amax_columnwise, sizeof(float));  // NOLINT(*)
    cudaMemset(amax, 0, sizeof(float));
    cudaMemset(amax_columnwise, 0, sizeof(float));
    tensor_.set_amax(amax, DType::kFloat32, std::vector<size_t>{1});
    tensor_.set_columnwise_amax(amax_columnwise, DType::kFloat32, std::vector<size_t>{1});
    scale_inv_columnwise_ = scale_inv;
    tensor_.set_columnwise_scale_inv(scale_inv_rowwise_->gpu_buffer(), DType::kFloat32, std::vector<size_t>{1});
  }
```
When a tensor is constructed with rowwise=false and columnwise=true for an FP8 dtype under NVTE_DELAYED_TENSOR_SCALING, the if (rowwise) branch is skipped so scale_inv_rowwise_ stays nullptr. The very next block then calls scale_inv_rowwise_->gpu_buffer() to register the columnwise scale-inverse, dereferencing a null shared_ptr and crashing. The intent is to share the one scale_inv buffer; just use scale_inv->gpu_buffer() directly (as is already done two lines above for the rowwise path).
Suggested change:

```diff
   if (columnwise) {
     tensor_.set_columnwise_scale_inv(rowwise_scale_inv, DType::kFloat32,
                                      std::vector<size_t>{1});
     columnwise_scale_inv_cpu_data_ = std::make_unique<unsigned char[]>(sizeof(float));
     std::fill_n(columnwise_scale_inv_cpu_data_.get(), sizeof(float), 0);
   }
 } else {
   if (scaling_mode == NVTE_NVFP4_1D_SCALING) {
     // Used for NVFP4 second stage scaling
     amax_cpu_data_ = std::make_shared<float>(0);
     amax_cpu_data_columnwise_ = std::make_shared<float>(0);
     cudaMalloc((void**)&amax, sizeof(float));  // NOLINT(*)
     cudaMalloc((void**)&amax_columnwise, sizeof(float));  // NOLINT(*)
     cudaMemset(amax, 0, sizeof(float));
     cudaMemset(amax_columnwise, 0, sizeof(float));
     tensor_.set_amax(amax, DType::kFloat32, std::vector<size_t>{1});
     tensor_.set_columnwise_amax(amax_columnwise, DType::kFloat32, std::vector<size_t>{1});
-    scale_inv_columnwise_ = scale_inv;
-    tensor_.set_columnwise_scale_inv(scale_inv_rowwise_->gpu_buffer(), DType::kFloat32, std::vector<size_t>{1});
   }
+  if (columnwise) {
+    scale_inv_columnwise_ = scale_inv;
+    tensor_.set_columnwise_scale_inv(scale_inv->gpu_buffer(), DType::kFloat32, std::vector<size_t>{1});
+  }
```
Description
The tensor wrapper in the C++ unit tests has become unwieldy, with complicated interactions between recipes and memory management. This has recently resulted in bugs where we accidentally didn't allocate a required buffer (#2943). This PR disentangles the memory management from the recipe logic by adding a simple RAII class to manage GPU and CPU buffers. I've also added more explicit checks, e.g. when we assume a tensor is a single FP32 value.