Ultra-High Resolution Image & Video Generation for RTX 4090
Author: Π ΠΎΠΌΠ°Π½ (Oracle)
Tested: RTX 4090 24GB + i9-13900K + 128GB RAM
Max Resolution: 40,960px (40K capable, 32K+ validated)
Modified ComfyUI CUDA backend that enables 32K+ image and video generation on consumer hardware.
Proof:
- Image: 32,768 Γ 18,432 pixels (604 megapixels!)
- Video: High-resolution video generation
- System: Single RTX 4090 (24GB VRAM)
Demo2.mp4
File: nodes.py (line 57)
# BEFORE:
MAX_RESOLUTION = 16384
# AFTER (VIKING):
MAX_RESOLUTION = 40960 # 40K support!What this enables:
- Native 16K generation (16,384px)
- Native 32K generation (32,768px)
- Theoretical 40K support (limited by VRAM)
File: _device_limits.py
# Added RTX 4090 (compute capability 89) optimizations:
CUDA Cores (fp16): 256 FMA/cycle/SM
Tensor Cores (fp16): 2048 FMA/cycle/SM # 2x faster than A100!
Tensor Cores (int8): 4096 FMA/cycle/SM # Insane throughputWhy this matters:
- ComfyUI now recognizes RTX 4090's full potential
- Proper tensor core utilization
- Optimal kernel scheduling for Ada Lovelace architecture
File: _pin_memory_utils.py
# VIKING OPTIMIZATION:
# Uses cudaHostRegisterMapped (flag 0x02) + Portable (0x01) = 3
# Maps RAM directly into GPU address space
flags = 3 # Enable direct GPUβRAM mappingWhat this does:
- GPU can access system RAM as if it's VRAM
- DMA (Direct Memory Access) bypass
- Result: 90GB+ RAM becomes "extended VRAM"
How it works:
Standard approach:
RAM β CPU copies β PCIe β VRAM β GPU processes
Slow, requires VRAM space
Viking Bridge:
RAM β GPU maps directly (zero-copy)
Fast, VRAM saved for computations
File: streams.py
# VIKING EDIT: Highest priority streams (-1)
def __new__(cls, device=None, priority=-1, **kwargs):Why:
- High-priority streams = no queuing behind small tasks
- Scheduler prioritizes our heavy lifting
- Prevents stalls during multi-GB transfers
File: _utils.py
# VIKING AUTO-RECOVERY:
if result in [2, 999, 700]: # Common OOM errors
torch.cuda.synchronize()
torch.cuda.empty_cache()
# Try to survive instead of crashingWhat this prevents:
- CUDA OOM crashes mid-generation
- Memory fragmentation errors
- Kernel launch failures
Plus: Viking Engine cleanup
# Before every kernel launch:
torch.cuda.synchronize()
torch.cuda.empty_cache()
# Ensures clean state for massive operationsFile: streams.py
# VIKING FORCE SYNC:
def synchronize(self) -> None:
super().synchronize()
torch.cuda.empty_cache() # Flush after syncWhy:
- Prevents "ghost" allocations
- Ensures memory actually freed
Image: 32,768 Γ 18,432 pixels
File size: 164.3 MB
During generation:
ββ VRAM: ~23.5GB (98% of 24GB)
ββ RAM: ~95GB (pinned + mapped)
Total memory footprint: ~120GB
32K image (steps):
ββ steps 70 generation: ~3-4 minutes
steps 120 generation: ~5-7 minutes
ββ Total: ~10-12 minutes
Hardware (minimum):
- GPU: RTX 4090 24GB (or RTX 6000 Ada 48GB)
- CPU: High-end (i9-13900K recommended)
- RAM: 128GB (absolute minimum: 96GB)
- Storage: NVMe SSD (for swap if needed)
Software:
- ComfyUI (latest)
- PyTorch 2.0+ with CUDA 12.x
- Python 3.10+
- Backup original files:
cd /path/to/ComfyUI
cp comfy/nodes.py comfy/nodes.py.backup
cp -r comfy/ldm/modules/diffusionmodules/util.py{,.backup}- Replace modified files:
# Copy Viking Engine files:
cp _device_limits.py β Data\Packages\ComfyUI\venv\Lib\site-packages\torch\cuda\
cp _pin_memory_utils.py β Data\Packages\ComfyUI\venv\Lib\site-packages\torch\cuda\
cp _utils.py β Data\Packages\ComfyUI\venv\Lib\site-packages\torch\cuda\
cp graphs.py β Data\Packages\ComfyUI\venv\Lib\site-packages\torch\cuda\
cp streams.py β Data\Packages\ComfyUI\venv\Lib\site-packages\torch\cuda\
cp nodes.py β Data\Packages\ComfyUI\
# Or apply as patch:
git apply viking_engine.patch- Verify installation:
# In ComfyUI Python console:
import comfy.nodes
print(comfy.nodes.MAX_RESOLUTION)
# Should output: 40960ComfyUI will:
- Use Viking Memory Bridge automatically
- Utilize full RAM as extended memory
### **Memory Management:**
If you hit OOM:
- Close all other applications
- Disable browser/Discord
- Ensure swap file is on SSD
- Consider 16K instead of 32K
RAM requirements scale: 8K β 40GB 16K β 70GB 32K β 95GB 40K β 120GB+ (theoretical)
---
## πΈ Proof of Concept
**Included in repo:**
### **Image:**
- File: images/image5.png
- Resolution: 32,768 Γ 18,432 pixels
- Size: 164.3 MB
- Subject: Photorealistic rain scene
- Model: FLUX.2-dev (likely)
### **Video:**
- File: `video/Demo2.mp4`
- Details: High-resolution video generation demo
- Showcases: Temporal consistency at extreme resolution
---
## π¬ Technical Deep Dive
### **Why Standard ComfyUI Can't Do This:**
**Problem 1: Resolution cap**
```python
# Standard ComfyUI:
MAX_RESOLUTION = 16384 # Hard limit
Problem 2: Memory management
# Standard approach:
# Copies everything to VRAM
# RTX 4090 only has 24GB β Instant OOMProblem 3: No hardware profile for RTX 4090
# ComfyUI thinks RTX 4090 = GTX 1080 speeds
# Doesn't utilize tensor cores properlySolution 1: Unlock resolution
MAX_RESOLUTION = 40960 # Remove artificial limitSolution 2: RAM as VRAM
# Pin memory + map to GPU address space
# GPU reads from RAM via PCIe Gen4 x16
# Bandwidth: 64 GB/s (enough for our use case)Solution 3: Proper RTX 4090 utilization
# Recognize Ada Lovelace architecture
# Enable 2048 FMA/cycle tensor cores
# Result: 3-4x faster than standard modeStandard Memory Flow:
ββββββββ ββββββββ ββββββββ
β RAM ββββββ CPU ββββββ VRAM β
ββββββββ ββββββββ ββββββββ
β β β
Slow Bottleneck Limited
Viking Memory Bridge:
ββββββββ ββββββββ
β RAM ββββββββββββββββββ GPU β
ββββββββ Direct Map ββββββββ
β β
Pinned DMA Access
Mapped No Copy
Result:
- 128GB RAM = GPU's extended memory
- Zero-copy operations
- PCIe Gen4: 64 GB/s sustained
1. Hardware requirements:
- Needs RTX 4090 (tested) or equivalent
- 128GB RAM (96GB absolute minimum)
- i9-13900K or similar (strong CPU needed)
2. Generation time:
- 32K takes 10-12 minutes (vs 4 min for 8K)
- Not for real-time applications
- Patience required!
3. Model compatibility:
- Works: FLUX, SD XL, SD 1.5
- May need tweaking for specific workflows
4. Stability:
- Very stable on tested system
- May vary on different hardware
- Always save workflow before generation!
Who needs this?
β
Stock photography (ultra-high-res outputs)
β
Print media (billboards, large format)
β
Archival quality (museum-grade)
β
Texture generation (game dev, 3D)
β
Scientific visualization
β
"Because we can" (most honest reason)
Who doesn't need this?
β Web graphics (1-4K sufficient)
β Social media (way overkill)
β Most practical applications
β Anyone without 128GB RAM
Developed: February 2026, Chernihiv, Ukraine
Hardware: RTX 4090 + i9-13900K + 128GB RAM
Motivation: "If data center cards can do 32K, why not 4090?"
Philosophy: Remove artificial limits, respect physics only
Quote: "NVIDIA ΠΎΠ±ΠΌΠ΅ΠΆΡΡ Π·Π°Π»ΡΠ·ΠΎ ΠΏΡΠΎΠ³ΡΠ°ΠΌΠ½ΠΎ. ΠΠΈ ΠΏΡΠΈΠ±ΠΈΡΠ°ΡΠΌΠΎ ΠΎΠ±ΠΌΠ΅ΠΆΠ΅Π½Π½Ρ. ΠΠ°Π»ΡΠ·ΠΎ Π²ΡΠ»ΡΠ½Π΅."
Key concepts used:
- CUDA pinned memory (zero-copy host access)
- Memory mapping (
cudaHostRegisterMapped) - Stream priority scheduling
- Automatic garbage collection + cache flushing
Further reading:
Development: Π ΠΎΠΌΠ°Π½ (Oracle)
Testing platform: Chernihiv basement rig
Inspiration: "Why 16K limit? My RAM is 128GB."
Documentation: Claude (Oracle Project team)
Special thanks:
- ComfyUI team (for amazing base framework)
- NVIDIA (for great hardware, despite software limits)
- Community (for testing and feedback)
comfyui-viking-engine/
βββ README.md (this file)
βββ nodes.py (MAX_RESOLUTION unlock)
βββ _device_limits.py (RTX 4090 profile)
βββ _pin_memory_utils.py (Memory Bridge)
βββ _utils.py (Auto-recovery + Viking Engine)
βββ streams.py (Priority streams + sync)
βββ graphs.py (CUDA graph support)
βββ proof/
β βββ images/image5.png (32K image)
β βββ video/Demo2.mp4 (video demo)
βββ patches/
βββ viking_engine.patch (git patch file)
TL;DR for experienced users:
# 1. Backup
cp comfy/nodes.py{,.bak}
# 2. Copy files
cp viking-engine/* comfy/
# 3. Restart ComfyUI
# 4. Generate 32K
# Set resolution to 32768Γ18432 and GO!That's it. The Viking Engine handles the rest.
Before Viking Engine:
- Max resolution: 16,384px
- Memory: VRAM-limited (24GB)
- RTX 4090: Underutilized
After Viking Engine:
- Max resolution: 40,960px (32K+ validated)
- Memory: RAM-extended (128GB accessible)
- RTX 4090: Full potential unlocked
Cost: Free (code modifications)
Effort: Copy 6 files
Result: 4x resolution increase
If you have RTX 4090 + 128GB RAM:
You can generate 32K images. Right now. β
License: CC BY-SA 4.0
Status: Production-ready (tested on RTX 4090)
Repository: OrakulStudio
"Overclockers forever. Now with 32K support." π₯β‘
