- Introduction
- Background: Why Floating-Point Exception Detection Matters
- System Requirements
- Installation
- Basic Usage
- Environment Variables Reference
- Advanced Features
- Understanding the Output
- Case Studies and Debugging Workflows
- Performance Considerations
- Troubleshooting
- References
Nixnan is a binary instrumentation tool for detecting floating-point exceptional values (NaN, Infinity, Subnormals, Division-by-Zero) in NVIDIA CUDA programs. Built on top of NVBit (NVIDIA Binary Instrumentation Tool), nixnan provides runtime detection capabilities without requiring source code modification or recompilation.
- Binary-level instrumentation: Works with closed-source CUDA libraries
- Multiple precision support: Detects exceptions in FP16, FP32, and FP64 operations
- Tensor Core support: Monitors MMA (Matrix Multiply-Accumulate) instructions including HMMA operations
- Exponent histogram tracking: Monitors numerical ranges during execution
- Source line information: Reports exception locations with file and line numbers (when debug info available)
- Low overhead modes: Sampling support for reduced performance impact
- Exceptions being written into memory: Reports exceptions flowing into memory via STG ("store global")
GPUs are now the dominant platform for machine learning and high-performance computing workloads. Unfortunately, NVIDIA GPUs do not have hardware-level exception trap mechanisms. This means:
- Silent failures: Exceptional values (NaN, INF) can propagate through computations undetected
- Unreliable results: Programs may produce normal-looking outputs that are actually corrupted
- Difficult debugging: Without trapping, locating the source of exceptions is extremely challenging
- Closed-source barriers: Many GPU libraries are binary-only, making source-level debugging impossible
According to IEEE 754, there are five types of floating-point exceptions:
| Exception | Description | Exceptional Value |
|---|---|---|
| Invalid Operation | Mathematically undefined (e.g., sqrt(-1), 0/0) | NaN |
| Division by Zero | Non-zero divided by zero | Infinity (INF) |
| Overflow | Result exceeds representable range | Infinity (INF) |
| Underflow | Result too small to represent normally | Subnormal |
| Inexact | Result requires rounding | Rounded value |
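Each exceptional value in the table has a recognizable IEEE 754 bit pattern. The following minimal Python sketch decodes a value as binary32 and classifies it; the function name `classify_fp32` is illustrative and is not part of nixnan.

```python
import struct


def classify_fp32(value: float) -> str:
    """Classify a value after encoding it as IEEE 754 binary32.

    Note: struct.pack raises OverflowError for finite values outside the
    FP32 range, so callers should pass values that fit (or inf/nan).
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    exponent = (bits >> 23) & 0xFF  # 8-bit biased exponent field
    fraction = bits & 0x7FFFFF      # 23-bit fraction field
    if exponent == 0xFF:
        # All-ones exponent: NaN if the fraction is non-zero, else infinity.
        return "NaN" if fraction else "INF"
    if exponent == 0:
        # All-zeros exponent: subnormal if the fraction is non-zero, else zero.
        return "subnormal" if fraction else "zero"
    return "normal"
```

For example, `classify_fp32(1e-40)` falls below the smallest normal FP32 value (about 1.18e-38) and is reported as subnormal.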
Consider this common scenario in machine learning:
# Uninitialized tensor - carries garbage values
x = torch.FloatTensor(20, 32, 128).cuda()
# This may contain uninitialized values that may propagate, later generating NaNs
Or in numerical algorithms:
// Division without zero-check
const float recipPrecision = 0.5f / eb; // If eb is subnormal or zero, this could explode
Tools like nixnan help identify these issues before they cause training failures or incorrect scientific results.
Unlike source-level analysis, binary instrumentation:
- Works on closed-source code: Libraries like cuBLAS, cuSPARSE, cuDNN
- Sees optimized code: Catches issues introduced by compiler optimizations
- Detects precision changes: Finds when FP64 operations are downgraded to FP32
- Monitors actual execution: Not static analysis - catches runtime-dependent issues
- Operating System: Linux on x86_64
- CUDA Version: 12.x or compatible
- Compute Capability: >= 8.6 (Ampere or newer recommended)
- GPU Driver: Compatible with CUDA 12
- Build Tools: GCC, Make
# Clone the repository
git clone https://github.com/parfloat/nixnan.git
cd nixnan
# Build the instrumentation library
make
# This produces nixnan.so in nvbit_release/tools/nixnan/
# Compile the basic example
cd examples
nvcc -arch=sm_86 -lineinfo basic.cu -o basic
# Run with nixnan instrumentation
LD_PRELOAD=../nvbit_release/tools/nixnan/nixnan.so ./basic
The simplest way to use nixnan is via LD_PRELOAD:
LD_PRELOAD=/path/to/nixnan.so ./your_cuda_program [args]
LD_PRELOAD=/path/to/nixnan.so python train.py
- This is a great source of info covering NixNan + other tools.
- Ask to be included in more projects in progress - send email to ganeshutah at gmail.
- Our Private Github
--- NVBit (NVidia Binary Instrumentation Tool v1.7.2) Loaded ---
Running #nixnan: kernel [ampere_sgemm_32x128_nn] ...
#nixnan LOC-EXCEP INFO: Warning: in kernel [ampere_sgemm_32x128_nn],
(SUB) found @ /unknown_path in [ampere_sgemm_32x128_nn]:0 [FP32]
#nixnan LOC-EXCEP INFO: in kernel [ampere_sgemm_32x128_nn],
NaN found @ /source/file.cu:120 [FP32]
------------ Nixnan Report -----------
--- FP16 Operations ---
Total NaN found: 0
Total INF found: 0
Total underflow (subnormal): 0
Total Division by 0: 0
--- FP32 Operations ---
Total NaN found: 2
Total INF found: 1
Total underflow (subnormal): 2
Total Division by 0: 1
--- FP64 Operations ---
Total NaN found: 0
Total INF found: 0
Total underflow (subnormal): 0
Total Division by 0: 0
--- Other Stats ---
Kernels: 4
The total number of exceptions are: 128
Nixnan's behavior is controlled through environment variables. These are read at initialization using the NVBit GET_VAR_INT and GET_VAR_STR macros.
| Variable | Type | Default | Description |
|---|---|---|---|
| INSTR_BEGIN | Integer | 0 | Beginning of the instruction interval where to apply instrumentation |
| INSTR_END | Integer | UINT32_MAX | End of the instruction interval where to apply instrumentation |
| SAMPLING | Integer | 0 | Instrument a repeat kernel every SAMPLING times. Set to N to instrument only every Nth kernel invocation (reduces overhead for repeatedly-called kernels) |
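The effect of SAMPLING can be pictured as a per-kernel launch counter. The sketch below is a Python model of that idea, not nixnan's implementation: the class name `KernelSampler` is hypothetical, and it assumes that the first launch of each kernel is instrumented and that the default value 0 means every launch is instrumented.

```python
from collections import defaultdict


class KernelSampler:
    """Model of 'instrument every Nth invocation' with per-kernel counters."""

    def __init__(self, sampling: int = 0):
        self.sampling = sampling
        self.launches = defaultdict(int)  # kernel name -> launches seen so far

    def should_instrument(self, kernel_name: str) -> bool:
        seen = self.launches[kernel_name]
        self.launches[kernel_name] += 1
        if self.sampling <= 0:
            # Assumption: 0 (the documented default) disables sampling.
            return True
        # Assumption: launches 0, N, 2N, ... of each kernel are instrumented.
        return seen % self.sampling == 0
```

With `KernelSampler(64)`, a kernel launched 6400 times would be instrumented 100 times under these assumptions, which is where the overhead reduction for repeatedly-called kernels comes from.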
| Variable | Type | Default | Description |
|---|---|---|---|
| TOOL_VERBOSE | Integer | 0 | Enable verbosity inside the tool. Set to 1 for detailed instrumentation logs |
| ENABLE_FUN_DETAIL | Integer | 0 | Enable detailed function information for kernel. Shows additional context about instrumented functions |
| PRINT_ILL_INSTR | Integer | 0 | Print the instruction which caused the exception. Useful for debugging specific SASS instructions |
| LINE_INFO | Integer | 0 | Enable debug information for source code locations. Warning: May cause crashes on some programs; set to 0 if you encounter issues |
| LOGFILE | String | (stderr) | Path to the optional log file. Default is to print to stderr. Useful when the instrumented program is capturing stderr |
| Variable | Type | Default | Description |
|---|---|---|---|
| INSTR_MEM | Integer | 0 | Instrument memory instructions for NaN/Inf detection. Monitors load/store operations for exceptional values |
| Variable | Type | Default | Description |
|---|---|---|---|
| HISTOGRAM | Integer | 0 | Enable whole-program exponent range tracking. Generates reports like "Exponent range for f16: [-5, 3]" |
| BIN_SPEC_FILE | String | (none) | Path to JSON specification file for targeted range monitoring |
# Basic usage with verbose output
TOOL_VERBOSE=1 LD_PRELOAD=./nixnan.so ./my_program
# Enable source line information (compile with -lineinfo)
LINE_INFO=1 LD_PRELOAD=./nixnan.so ./my_program
# Sample every 64th kernel invocation (for long-running programs)
SAMPLING=64 LD_PRELOAD=./nixnan.so ./my_program
# Log to file instead of stderr
LOGFILE=/tmp/nixnan.log LD_PRELOAD=./nixnan.so ./my_program
# Enable memory instrumentation
INSTR_MEM=1 LD_PRELOAD=./nixnan.so ./my_program
# Limit instrumentation to specific instruction range
INSTR_BEGIN=100 INSTR_END=500 LD_PRELOAD=./nixnan.so ./my_program
# Enable histogram tracking
HISTOGRAM=1 LD_PRELOAD=./nixnan.so ./my_program
# Combined: verbose, line info, and logging
TOOL_VERBOSE=1 LINE_INFO=1 LOGFILE=./debug.log LD_PRELOAD=./nixnan.so ./my_program
Nixnan supports instrumentation of Tensor Core operations, including:
- HMMA instructions: Half-precision Matrix Multiply-Accumulate
- IMMA instructions: Integer Matrix Multiply-Accumulate
- Various formats: F16, BF16, TF32, F32 accumulation
Example detection output:
HMMA.1688.F32.TF32 R4, R132.reuse, R2, R4 ; : MMA being used!
#nixnan LOC-EXCEP INFO: in kernel [void cutlass::Kernel],
NaN found @ /unknown_path in [void cutlass::Kernel]:0 [FP32]
HISTOGRAM=1 LD_PRELOAD=./nixnan.so ./my_program
Output:
Exponent range for f16: [-5, 3]
Exponent range for f32: [-12, 15]
Exponent range for f64: [-50, 100]
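To make the reported ranges concrete, here is a small Python sketch that computes the same kind of `[min, max]` exponent interval for a list of values. It assumes the report uses the unbiased base-2 exponent of each finite non-zero value; nixnan's exact convention (for example, how subnormals are binned) is not stated here, and `exponent_range` is an illustrative name.

```python
import math


def exponent_range(values):
    """Return (min_exp, max_exp) over finite, non-zero values, or None.

    math.frexp returns a mantissa in [0.5, 1), so subtracting 1 from its
    exponent gives the usual 1.xxx * 2**e form.
    """
    exps = [
        math.frexp(v)[1] - 1
        for v in values
        if v != 0.0 and math.isfinite(v)
    ]
    return (min(exps), max(exps)) if exps else None
```

For instance, `exponent_range([0.04, 1.0, 8.0])` gives `(-5, 3)`, since 0.04 is 1.28 * 2**-5 and 8.0 is 2**3.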
Create a JSON specification file:
{
"f32": {
"ranges": [
{"min": -126, "max": -120, "report_frequency": 1000},
{"min": 120, "max": 127, "report_frequency": 100}
]
},
"f16": {
"ranges": [
{"min": -14, "max": -10, "report_frequency": 500}
]
}
}
Run with specification:
BIN_SPEC_FILE=./ranges.json LD_PRELOAD=./nixnan.so ./my_program
When INSTR_MEM=1, nixnan also monitors memory operations:
INSTR_MEM=1 LD_PRELOAD=./nixnan.so ./my_program
This detects exceptional values being loaded from or stored to GPU memory, helping identify:
- Uninitialized memory containing NaN patterns
- Corrupted data in global memory
- Exception propagation through memory
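The first item in the list above is easy to reproduce on the host: a buffer filled with certain byte patterns decodes to NaN when read as FP32. The Python sketch below scans a raw little-endian buffer and counts NaN and INF words. It illustrates the idea only, it is not how nixnan inspects device memory, and `scan_fp32_buffer` is a hypothetical helper.

```python
import math
import struct


def scan_fp32_buffer(raw: bytes) -> dict:
    """Count NaN and INF values in a buffer of little-endian FP32 words.

    len(raw) must be a multiple of 4, otherwise struct raises struct.error.
    """
    counts = {"NaN": 0, "INF": 0}
    for (value,) in struct.iter_unpack("<f", raw):
        if math.isnan(value):
            counts["NaN"] += 1
        elif math.isinf(value):
            counts["INF"] += 1
    return counts
```

An all-ones fill such as `bytes.fromhex("ffffffff")` has an all-ones exponent and a non-zero fraction, so it is counted as one NaN.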
#nixnan LOC-EXCEP INFO: in kernel [kernel_name],
NaN found @ /path/to/source.cu:120 [FP32]
Components:
- kernel_name: CUDA kernel where exception occurred
- path/to/source.cu:120: Source file and line (if compiled with -lineinfo)
- FP32: Floating-point precision (FP16, FP32, or FP64)
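When a log contains many such messages, it can help to extract the components programmatically. The Python sketch below parses only the two-line form shown above; other message variants that appear in real logs would need additional patterns. The regular expression and the name `parse_exceptions` are assumptions for illustration.

```python
import re

# Matches: "#nixnan LOC-EXCEP INFO: in kernel [K],\n  KIND found @ PATH:LINE [FPxx]"
LOC_EXCEP = re.compile(
    r"#nixnan LOC-EXCEP INFO: in kernel \[(?P<kernel>.+?)\],\s*"
    r"(?P<kind>\w+) found @ (?P<path>\S+):(?P<line>\d+) \[(?P<precision>FP\d+)\]"
)


def parse_exceptions(log_text: str) -> list:
    """Return one dict per exception message found in the log text."""
    records = []
    for match in LOC_EXCEP.finditer(log_text):
        record = match.groupdict()
        record["line"] = int(record["line"])  # line 0 means no source info
        records.append(record)
    return records
```

Grouping the resulting records by `kernel` gives a quick view of which kernels are responsible for most exceptions.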
------------ Nixnan Report -----------
--- FP16 Operations ---
Total NaN found: X
Total INF found: X
Total underflow (subnormal): X
Total Division by 0: X
--- FP32 Operations ---
...
--- FP64 Operations ---
...
--- Other Stats ---
Kernels: N
The total number of exceptions are: M
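The summary block is regular enough to be turned into a dictionary, for example to compare runs in a script. This Python sketch assumes the layout shown above (section headers of the form `--- FP32 Operations ---` followed by `Total ...: N` lines); `parse_report` is an illustrative helper, not a nixnan utility.

```python
import re


def parse_report(text: str) -> dict:
    """Map each precision (FP16, FP32, FP64) to its per-exception totals."""
    totals = {}
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.fullmatch(r"--- (FP\d+) Operations ---", line)
        if header:
            section = header.group(1)
            totals[section] = {}
            continue
        if line.startswith("--- "):
            # Any other section (e.g. "--- Other Stats ---") ends the counters.
            section = None
            continue
        counter = re.fullmatch(r"Total (.+?)(?: found)?: (\d+)", line)
        if counter and section:
            totals[section][counter.group(1)] = int(counter.group(2))
    return totals
```

A caller could then check, for instance, `parse_report(log)["FP32"]["NaN"]` and fail a CI job when it is non-zero.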
| Exception | Severity | Typical Impact |
|---|---|---|
| NaN | High | Computation is corrupted; NaN propagates |
| INF | High | Overflow occurred; may cascade to NaN |
| Division by 0 | High | Usually indicates logic error |
| Subnormal | Medium | Precision loss; may be flushed to zero |
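The "may cascade to NaN" entry in the table follows directly from IEEE 754 arithmetic. The short Python sketch below shows three operations on INF that produce NaN, and one that silently turns INF back into an ordinary value; Python floats are binary64, but the same rules apply to FP16 and FP32.

```python
import math

inf = float("inf")

# Operations on INF that IEEE 754 defines as invalid, yielding NaN.
cascades = {
    "inf - inf": inf - inf,
    "inf * 0": inf * 0.0,
    "inf / inf": inf / inf,
}

# INF can also vanish: dividing a finite value by INF gives zero.
masked = 1.0 / inf

# Once produced, NaN propagates through further arithmetic.
propagated = cascades["inf - inf"] + 1.0
```

Here every entry of `cascades` is NaN, `masked` is `0.0`, and `propagated` is still NaN.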
Problem: NaN values appearing at the output of a PyTorch-based neural network.
Detection:
LD_PRELOAD=./nixnan.so python run_sru.py
Output:
Running #nixnan: kernel [ampere_sgemm_32x128_nn] ...
#nixnan LOC-EXCEP INFO: in kernel [ampere_sgemm_32x128_nn],
NaN found in [ampere_sgemm_32x128_nn]:0 [FP32]
Root Cause: The input tensor was created with uninitialized memory:
x = torch.FloatTensor(20, 32, 128).cuda() # WRONG: uninitialized
Fix:
x = torch.randn(20, 32, 128).cuda() # CORRECT: initialized
Problem: NaN exceptions in a GPU-based data compressor.
Detection with line info:
LINE_INFO=1 LD_PRELOAD=./nixnan.so ./compressor
Output:
#nixnan LOC-EXCEP INFO: NaN appears at the destination @
/home/user/compressor/main1.cu:120
Instruction: FFMA R3, R4, -R0, 1 ;
Root Cause: Line 120 contained:
const float recipPrecision = 0.5f / eb; // eb was subnormal, causing INF
Fix: Add input validation for the error bound parameter.
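One possible shape for such a validation is sketched below in Python; in the CUDA source the same test would sit before the division. The function name `checked_recip_precision` and the choice to reject every subnormal value are assumptions, since the actual fix applied to the compressor is not shown here.

```python
import math

FLT_MIN = 1.17549435e-38  # smallest positive normal FP32 value
FLT_MAX = 3.40282347e38   # largest finite FP32 value


def checked_recip_precision(eb: float) -> float:
    """Return 0.5 / eb after rejecting error bounds that are unsafe in FP32."""
    if not math.isfinite(eb) or abs(eb) < FLT_MIN:
        # Zero, subnormal, NaN, or INF: the reciprocal would be meaningless
        # or could exceed the FP32 range.
        raise ValueError(f"error bound {eb!r} is zero, subnormal, or not finite")
    return 0.5 / eb
```

For a subnormal bound such as `1e-40`, the unchecked quotient is 5e39, which is above `FLT_MAX` and therefore becomes INF in FP32; the check turns that silent overflow into an explicit error.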
Problem: Residual always NaN from the first iteration.
Detection:
LD_PRELOAD=./nixnan.so ./gmres_solver
Output:
#nixnan LOC-EXCEP INFO: in kernel [csrsv2_solve_upper_nontrans_byLevel_kernel],
DIV0 found @ /unknown_path:0 [FP64]
#nixnan LOC-EXCEP INFO: in kernel [MassIPTwoVec],
NaN found @ /home/user/customKernels.cu:31 [FP64]
Root Cause: Division by zero in LU factorization due to near-singular matrix.
Fix: Used cuSPARSE's matrix diagonal boosting API:
cusparseSetMatFillMode(descr, CUSPARSE_FILL_MODE_LOWER);
cusparseXcsrilu02_zeroPivot(handle, info, &position);
// Boost small pivots
- Initial Detection:
  LD_PRELOAD=./nixnan.so ./your_program
- Enable Line Information (recompile with -lineinfo):
  nvcc -lineinfo -g your_program.cu -o your_program
  LINE_INFO=1 LD_PRELOAD=./nixnan.so ./your_program
- Identify First Exception: Look for the first LOC-EXCEP INFO message
- Analyze Exception Flow: Check if exceptions:
  - Appear (generated fresh)
  - Propagate (passed through operations)
  - Disappear (masked by operations like FSEL)
- Examine Instruction Context:
  PRINT_ILL_INSTR=1 LD_PRELOAD=./nixnan.so ./your_program
- For Long-Running Programs, Use Sampling:
  SAMPLING=64 LD_PRELOAD=./nixnan.so ./your_program
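The three flow categories in the workflow above (appear, propagate, disappear) can be stated precisely in terms of one instruction's inputs and output. The Python sketch below is a model of that classification, assuming that "exceptional" means NaN or INF; `classify_flow` is an illustrative name and does not describe nixnan's internal logic.

```python
import math


def is_exceptional(x: float) -> bool:
    """True for NaN or +/-INF (subnormals are ignored in this sketch)."""
    return math.isnan(x) or math.isinf(x)


def classify_flow(inputs, output) -> str:
    """Classify how an exception relates to one instruction."""
    any_input = any(is_exceptional(x) for x in inputs)
    out = is_exceptional(output)
    if out and not any_input:
        return "appear"      # generated fresh by this instruction
    if out and any_input:
        return "propagate"   # passed through from an operand
    if any_input:
        return "disappear"   # masked, e.g. by a select on a clean operand
    return "none"
```

A division of a finite value by zero yielding INF is classified as "appear", while a select that picks the clean operand over a NaN operand is "disappear".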
Binary instrumentation inherently adds overhead. Typical slowdowns:
| Mode | Slowdown | Use Case |
|---|---|---|
| Basic detection | 10-50x | Development/debugging |
| With line info | 20-100x | Detailed debugging |
| With sampling=64 | 2-10x | Long-running programs |
| Memory instrumentation | 50-200x | Deep analysis |
- Use Sampling for Repeated Kernels:
  SAMPLING=256 LD_PRELOAD=./nixnan.so ./my_program
  This instruments only every 256th invocation of a kernel.
- Limit Instruction Range:
  INSTR_BEGIN=1000 INSTR_END=2000 LD_PRELOAD=./nixnan.so ./my_program
- Disable Line Info (if causing issues):
  LINE_INFO=0 LD_PRELOAD=./nixnan.so ./my_program
- Two-Phase Approach:
  - First run: Fast detection to identify problematic kernels
  - Second run: Detailed analysis on specific kernels
On a benchmark of 151 HPC and ML programs:
- Over 60% experienced less than 10x slowdown
- Sampling with factor 64 reduced geometric mean slowdown to ~5x
- Compared to BinFPE: 16x faster geometric-mean runtime
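The figures above are geometric means of per-program slowdowns. To compute the same statistic for your own measurements, a minimal Python sketch follows; the timing lists are placeholders to be supplied by the reader, and `geomean_slowdown` is a hypothetical helper.

```python
from statistics import geometric_mean


def geomean_slowdown(native_seconds, instrumented_seconds) -> float:
    """Geometric mean of per-program slowdown ratios (instrumented / native)."""
    ratios = [
        instrumented / native
        for native, instrumented in zip(native_seconds, instrumented_seconds)
    ]
    return geometric_mean(ratios)
```

The geometric mean is the usual choice for ratios because a 2x and an 8x slowdown average to 4x, rather than the 5x an arithmetic mean would report.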
Symptom: Program crashes when enabling source line information.
Solution:
LINE_INFO=0 LD_PRELOAD=./nixnan.so ./my_program
The line info feature may not work with all programs. Use without it for initial detection.
Symptom: Exception locations show /unknown_path instead of source files.
Solution: Recompile your CUDA code with debug information:
nvcc -lineinfo -g your_program.cu -o your_program
Symptom: Tool fails to load or produces errors about NVBit version.
Solution: Ensure your CUDA driver and NVBit versions are compatible. Check:
nvidia-smi # Check driver version
nvcc --version # Check CUDA toolkit version
Symptom: Exceptions detected but no source location available.
Explanation: For closed-source libraries (cuBLAS, cuDNN, etc.), source information is unavailable. The tool still detects exceptions but can only report kernel names.
Workaround: Use the kernel name to identify which library function is causing issues, then check your inputs to that function.
Symptom: Program runs extremely slowly.
Solution: Use sampling:
SAMPLING=128 LD_PRELOAD=./nixnan.so ./my_program
Symptom: Nixnan output interferes with program output.
Solution: Redirect nixnan output to a file:
LOGFILE=/tmp/nixnan.log LD_PRELOAD=./nixnan.so ./my_program
- GPU-FPX Paper: Li, X., Laguna, I., Fang, B., Swirydowicz, K., Li, A., & Gopalakrishnan, G. (2023). "Design and Evaluation of GPU-FPX: A Low-Overhead tool for Floating-Point Exception Detection in NVIDIA GPUs." HPDC '23. https://doi.org/10.1145/3588195.3592991
- Array Programming Paper: Li, X., Baranowski, M., Dam, H., & Gopalakrishnan, G. (2025). "Array Programming on GPUs: Challenges and Opportunities." ARRAY '25. https://doi.org/10.1145/3736112.3736144
- NVBit: Villa, O., Stephenson, M., Nellans, D., & Keckler, S. W. (2019). "NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs." MICRO '19.
- GPU-FPX: https://github.com/LLNL/GPU-FPX
- FPChecker: LLVM-based exception detection for Clang-compiled CUDA
- BinFPE: Earlier SASS-level binary instrumentation tool
- FloatGuard: Exception detection for AMD GPUs
- IEEE 754-2008: Standard for Floating-Point Arithmetic
- IEEE 754-2019: Latest revision with updated NaN handling
- NVIDIA CUDA Floating-Point Documentation: https://docs.nvidia.com/cuda/floating-point/
- IEEE-754 Floating Point Converter: https://www.h-schmidt.net/FloatConverter/IEEE754.html
Nixnan instruments the following SASS floating-point instructions:
| Instruction | Description |
|---|---|
| FADD | FP32 Add |
| FADD32I | FP32 Add (immediate) |
| FFMA | FP32 Fused Multiply and Add |
| FFMA32I | FP32 Fused Multiply and Add (immediate) |
| FMUL | FP32 Multiply |
| FMUL32I | FP32 Multiply (immediate) |
| MUFU | FP32 Multi Function Operation (sin, cos, sqrt, rcp, etc.) |
| DADD | FP64 Add |
| DFMA | FP64 Fused Multiply Add |
| DMUL | FP64 Multiply |
| Instruction | Description |
|---|---|
| FSEL | Floating Point Select |
| FSET | FP32 Compare And Set |
| FSETP | FP32 Compare And Set Predicate |
| FMNMX | FP32 Minimum/Maximum |
| DSETP | FP64 Compare And Set Predicate |
| Instruction | Description |
|---|---|
| HMMA | Half-precision Matrix Multiply-Accumulate |
| IMMA | Integer Matrix Multiply-Accumulate |
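When reading PRINT_ILL_INSTR output or a SASS dump, it can be handy to check whether a given instruction belongs to the tables above. The Python sketch below mirrors those tables as a set and strips modifiers such as `.1688.F32.TF32`. It reflects this document's tables only and is not nixnan's matching code; the helper names are hypothetical.

```python
# Opcodes copied from the instruction tables in this document.
INSTRUMENTED_OPCODES = {
    "FADD", "FADD32I", "FFMA", "FFMA32I", "FMUL", "FMUL32I", "MUFU",
    "DADD", "DFMA", "DMUL",
    "FSEL", "FSET", "FSETP", "FMNMX", "DSETP",
    "HMMA", "IMMA",
}


def base_opcode(sass_line: str) -> str:
    """Return the opcode without modifiers, skipping a predicate guard."""
    tokens = sass_line.split()
    if tokens and tokens[0].startswith("@"):
        tokens = tokens[1:]  # drop a guard such as "@P0" or "@!P1"
    return tokens[0].split(".")[0] if tokens else ""


def is_instrumented(sass_line: str) -> bool:
    """True if the line's base opcode appears in the tables above."""
    return base_opcode(sass_line) in INSTRUMENTED_OPCODES
```

For example, `is_instrumented("HMMA.1688.F32.TF32 R4, R132.reuse, R2, R4 ;")` is true, while an integer instruction such as `IADD3` is not in the set.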
This tutorial is part of the nixnan project. For the latest updates, visit: https://github.com/parfloat/nixnan