Skip to content

Windows 5090 Hints #2

@jjmlovesgit

Description

@jjmlovesgit

This is the definitive "Survival Guide" for compiling custom CUDA 12.9 kernels for Blackwell (SM120) hardware on Windows. Share this with you coding partner to outline the Windows build for FlashRT

The root cause of almost every failure in this process is how strictly Visual Studio 2026 (MSVC 19.50) handles explicit C++ template instantiations compared to Linux compilers, combined with legacy Linux-only dynamic loading logic.

Here is the exact playbook to beat the compiler.

🛠️ The Prerequisites
Target Hardware: NVIDIA RTX 5090 (Blackwell SM120)

CUDA Toolkit: Version 12.9 installed at C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9

Compiler: Visual Studio 2026 (MSVC 19.50)

Build System: Ninja (ninja -j16)

Step 1: Update the Linux Dynamic Loading (fmha_dispatch.cu)
The attention dispatcher attempts to load external .so libraries using Linux-specific headers (dlfcn.h), which causes a fatal crash on Windows.

The Fix: You must replace the entire contents of csrc\attention\fmha_dispatch.cu with Windows-safe stubs.

Replacement Code:

C++
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <cstdio>

// Global variables from original implementation
void* g_fmha_lib = nullptr;
void* g_fmha_strided_lib = nullptr;

void init_fmha_dispatch() {}
void init_fmha_strided_dispatch() {}

// ── Linker Stubs for bindings.cpp (Windows) ──

bool has_cutlass_fmha() { return false; }
int load_fmha_library(const char* path) { return -1; }
int load_fmha_strided_library(const char* path) { return -1; }

int fmha_forward(const __nv_bfloat16* q, const __nv_bfloat16* k, const __nv_bfloat16* v,
__nv_bfloat16* out, int b, int s, int h, int d, float scale, cudaStream_t stream) {
printf("Dynamic fmha_forward disabled on Windows.\n");
return -1;
}

int fmha_strided_forward(const __nv_bfloat16* qkv, __nv_bfloat16* out,
int b, int s, int d, float scale, cudaStream_t stream) {
printf("Dynamic fmha_strided_forward disabled on Windows.\n");
return -1;
}

int fmha_strided_full(const void* q, const void* k, const void* v, void* out,
int b, int s_q, int s_k, int h_q, int h_k, int d, int is_causal, int is_alibi,
cudaStream_t stream) {
printf("Dynamic fmha_strided_full disabled on Windows.\n");
return -1;
}

Step 2: The "Implicit Is Better" Template Fix
The MSVC 19.50 linker will fail with no instance of function template matches the specified type on almost every kernel file because it chokes on explicit instantiations involving const pointers and aliased types like __half.

The Fix: Open the following files in csrc\kernels:

norm.cu

activation.cu

quantize.cu

elementwise.cu

rope.cu

fusion.cu

Scroll to the bottom of each file and delete the entire "Explicit Instantiations" block. Remove every single line that starts with template global void or template __declspec(global).

Because the host wrapper functions are in the exact same .cu files as the kernel definitions, nvcc and MSVC will implicitly generate the exact, perfect symbols they need automatically. Less code equals fewer linker errors.

Step 3: Build & Deploy
Once the files are sanitized, open your terminal in the build/ directory and fire off Ninja:

PowerShell
ninja -j16
When it hits [Linking CXX shared module], the build is complete. Move the generated Python extension into your main module folder:

PowerShell
copy flash_vla_kernels.cp311-win_amd64.pyd ..\flash_vla

Step 4: Cure "DLL Hell"
Python 3.8+ on Windows intentionally ignores the system PATH variable when looking for C-extension dependencies for security reasons. Even if the .pyd is perfectly compiled, importing it will throw: ImportError: DLL load failed: The specified module could not be found.

The Fix: You must explicitly inject the CUDA bin path into Python's secure loader at runtime. Open flash_vla_init_.py and add this code block to the absolute top of the file, before any other imports occur:

Python
import os
import sys

Permanently tell Windows Python where the CUDA 12.9 DLLs live

if sys.platform == 'win32':
cuda_bin = r'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin'
if os.path.exists(cuda_bin):
os.add_dll_directory(cuda_bin)
After saving that, your Blackwell kernels are permanently unlocked and ready for inference.

Phase 4: Model Ignition (The Weights)
Since you are running on an RTX 5090 (SM120), the system is designed to utilize NVFP4 (4-bit floating point) formats. This means when we load the model, we aren't just moving files; we are mapping the architecture to the specialized machine code generated during your compilation.

Registering the Architecture: Ensure your neural network specifications are registered within the FlashRT engine's internal routing table to bypass default fallbacks. see and share the FlashRT\docs with your coding partner

Blackwell Weight Mapping: Utilize the weight prefix (e.g., fmt.blocks.) to ensure the engine correctly identifies the blocks for your high-performance CUDA graph.

entify the Correct Weight Prefix
The Blackwell engine needs to know exactly where to find the transformer blocks to route them through your custom C++/CUDA kernels.

Hint: Check your model's state dict for the block naming convention.

Advice: In your recent builds, the Blackwell engine looked for the weight prefix fmt.blocks.. If your weights use a different prefix (like model.layers.), the engine may fall back to standard (slower) inference without warning.

  1. Strategic Memory Allocation
    The RTX 5090 has massive VRAM (31.8 GB), but Blackwell performance shines when weights are aligned for contiguous memory access.

Hint: Use safetensors instead of standard .pth or .ckpt files whenever possible.

Advice: Safetensors allow for "zero-copy" loading, which speeds up the initial VRAM allocation and prevents the CPU from becoming a bottleneck during the "Model Ignition" phase.

  1. Handle Quantization Alignment
    Blackwell (SM120) is designed for extreme efficiency with NVFP4.

Hint: If your source weights are in BF16 or FP16, the engine must perform an on-the-fly quantization or cast to match your compiled templates.

Advice: During weight mapping, ensure that your norm and activation layers (which we fixed in Step 2 of the guide) are not being double-cast, as this can introduce numerical instability in the 512-dimensional embedding space.

  1. Register the Architecture in the Routing Table
    The FlashRT engine uses an internal routing table to determine if it should use the local custom kernels or a remote fallback.

Hint: You must explicitly register the architecture dimensions (e.g., your 512-dimensional hidden size).

Advice: If the mapping isn't registered, the system might default to the "Pi0.5 Default" frontend rather than your "IMTalker Frontend," which would negate the speed gains from your custom Blackwell build.

  1. Validate with a "Dry" Warmup
    Before running real-time tasks, perform a deep warmup to verify the mapping.

Output to Look For: ✅ Blackwell SM120 Engine Online followed by ✅ IMTalker CUDA Graph Captured.

Warning: If you see ⚠️ Deep warmup failed, it usually indicates a mismatch between the weight shapes in your mapping and the expected shapes in the kernel templates (e.g., a 1024 vs 544 dimension mismatch).

Which model are you currently preparing to map? Knowing the specific weight structure (e.g., FlashVLA or a different VLA variant) will help in providing the exact prefix and dimension registration needed.

Local Inference Priority: Your successful compilation allows the system to prioritize Local Blackwell Inference over standard API fallbacks, drastically reducing latency.

Phase 5: Inference & Application (The Action)
Once the weights are "ignited," you can begin high-fidelity, real-time generation:

In my case using FlashRT on Windows provide Inference Speed optimized to generate actions and video frames at approximately 0.9x to 1.1x real-time.

Zero-Footprint Execution: By utilizing the CUDA Graph you captured during Phase 2, you eliminate CPU-launch overhead for every inference step, which is critical for physical hardware control or streaming interfaces.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions