Labels: question, enhancement, help wanted
Issue Description
I'm unable to compile the low-latency-llama demo on RTX 4090 GPUs due to ThunderKittens compatibility issues. The project seems designed primarily for H100+ architectures.
Environment
- GPU: RTX 4090
- CUDA: 12.4
- OS: Linux
- Python: 3.12
Compilation Errors
error: barrier is not a template
error: identifier "semaphore" is undefined
error: name followed by "::" must be a class or namespace name (move<T>::lds, etc.)
static_assert(NUM_PAGES == 13, "NUM_PAGES must be 13"); // Fails on RTX 4090
Root Cause
The ThunderKittens code used by the demo relies on Hopper/Blackwell-specific features that the RTX 4090 (Ada Lovelace, sm_89) does not provide:
- TMA operations
- Advanced barrier/semaphore primitives
- Architecture-specific memory limits (RTX 4090: about 99 KB of shared memory per block, 100 KB per SM, vs. H100: 227 KB per block)
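To make the shared-memory point concrete, here is a rough sketch of the page-budget arithmetic behind the failing `NUM_PAGES == 13` assertion. The per-block limits (99 KB for sm_89, 227 KB for sm_90) are the documented opt-in maxima; the page size and the reserved scratch region are assumptions of mine for illustration, not values taken from the project, and all names are hypothetical.

```cpp
#include <cstddef>

// Opt-in maximum shared memory per thread block, in bytes.
constexpr std::size_t kSmemAdaSm89    = 99 * 1024;   // RTX 4090
constexpr std::size_t kSmemHopperSm90 = 227 * 1024;  // H100

// Hypothetical layout parameters, NOT taken from the project.
constexpr std::size_t kAssumedPageBytes     = 16 * 1024;  // assumed page size
constexpr std::size_t kAssumedReservedBytes = 16 * 1024;  // assumed non-page scratch space

// Number of whole pages that fit once the reserved region is set aside.
constexpr std::size_t pages_that_fit(std::size_t smem_bytes,
                                     std::size_t page_bytes = kAssumedPageBytes,
                                     std::size_t reserved_bytes = kAssumedReservedBytes) {
    return smem_bytes <= reserved_bytes
               ? 0
               : (smem_bytes - reserved_bytes) / page_bytes;
}

// True when a fixed page count fits in the given shared-memory budget.
constexpr bool layout_fits(std::size_t num_pages, std::size_t smem_bytes) {
    return num_pages <= pages_that_fit(smem_bytes);
}

// Under these assumptions a 13-page layout fits on H100 but not on RTX 4090.
static_assert(layout_fits(13, kSmemHopperSm90), "13 pages fit in the sm_90 budget");
static_assert(!layout_fits(13, kSmemAdaSm89), "13 pages exceed the sm_89 budget");
```

Whatever the real page size is, a layout sized for roughly 227 KB cannot fit in 99 KB, so the page count would have to become a per-architecture value rather than a fixed 13.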
Questions
- Is RTX 4090 officially supported?
- Are there plans to add RTX 4090 support?
- Would contributions for RTX 4090 compatibility be welcome?
Workaround
I've created a basic CUDA test that compiles and runs successfully on the RTX 4090, which confirms the toolchain and driver setup are correct. The failures come specifically from the advanced ThunderKittens features listed under Root Cause.
Potential Solutions
- Conditional compilation for different architectures
- Fallback implementations using standard CUDA operations
- Architecture-specific configuration constants
- Clear documentation of supported GPUs
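As a starting point for discussion, the first three ideas could be combined roughly as below. This is only a sketch under my own assumptions: `DEMO_TARGET_SM`, `ArchConfig`, `LoadPath`, `config_for`, and the 16 KB page size are hypothetical names and values, not part of ThunderKittens or the demo. The point is that the page count and the load path are derived from the target architecture instead of being hard-coded.

```cpp
#include <cstddef>

// Which data-movement path a kernel would use (hypothetical enum).
enum class LoadPath { Tma, StandardCuda };

// Per-architecture configuration constants (hypothetical struct).
struct ArchConfig {
    int sm;                      // compute capability, e.g. 89 for sm_89
    std::size_t max_smem_bytes;  // opt-in shared memory per block
    LoadPath load_path;          // TMA where available, fallback otherwise
};

// Only sm_89 and sm_90 are filled in; anything else gets a conservative
// 48 KB budget (the default per-block limit without opt-in).
constexpr ArchConfig config_for(int sm) {
    return sm == 90 ? ArchConfig{90, 227 * 1024, LoadPath::Tma}
         : sm == 89 ? ArchConfig{89, 99 * 1024, LoadPath::StandardCuda}
                    : ArchConfig{sm, 48 * 1024, LoadPath::StandardCuda};
}

// Hypothetical build flag, e.g. -DDEMO_TARGET_SM=90; defaults to RTX 4090.
#ifndef DEMO_TARGET_SM
#define DEMO_TARGET_SM 89
#endif

constexpr ArchConfig kConfig = config_for(DEMO_TARGET_SM);

// Assumed page size, NOT taken from the project.
constexpr std::size_t kDemoPageBytes = 16 * 1024;

// Derive the page count from the budget instead of asserting a fixed 13.
constexpr std::size_t kNumPages = kConfig.max_smem_bytes / kDemoPageBytes;
static_assert(kNumPages >= 1, "target must fit at least one page");
```

A kernel could then branch on `kConfig.load_path` at compile time, keeping the TMA path for Hopper and routing other targets to standard CUDA loads.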
RTX 4090 is widely used in research/development, so adding support would significantly expand the user base. Happy to contribute if there's interest!
Additional Info
- Basic CUDA compilation works fine
- Python bindings work correctly
- Issue is specifically with ThunderKittens library features
- Already fixed some Makefile and config issues locally