Nodes to run Hunyuan Image 3 locally with BF16 and NF4 quantized options in ComfyUI
Updated Apr 30, 2026 - Python
Stateless LLM runtime that dynamically routes, loads, executes, and unloads models per request with bounded VRAM caching and intelligent model selection.
ComfyUI custom node that controls node execution order by routing data of any type linearly through an unlimited number of I/O slots. It can optionally free VRAM and RAM at any point in a workflow using ComfyUI's own device-agnostic memory management utilities, which safely unload all models while passing all connected data and models through to the next node.
KeSSie: huge-context semantic recall for large language models
Ultra-low-bit KV-cache compression layer built on top of llama.cpp for LLM inference. Reduces VRAM overhead by ~75-80% using custom CUDA kernels.
LEMA (Layer-wise Efficient Memory Abstraction): A hardware-aware framework for fine-tuning LLMs in VRAM-constrained environments using asynchronous binary pre-fetching and triple-tier memory orchestration.
Predictive VRAM Virtualization Engine
WIP: Turn your PC into a private, autonomous AI lab without melting your GPU.
NMOS (Neural Memory OS) is a predictive partial execution engine enabling 70B-level reasoning on 4GB VRAM. It uses the “Zero-Lag” hypothesis, leveraging typing latency as a compute window to mask memory limits via async layer prefetching and speculative decoding.
Perkunas AI Training Platform is a memory-aware model training and serving system for serious language model experimentation under tight hardware limits. It combines streaming training, rich telemetry, guarded recovery, checkpoint export, and OpenAI-compatible serving.
Sticky-block topology lottery scheduler for transformer fine-tuning.
INT8 Sparse Tensor Core GEMM for PyTorch — built for Windows
Know before you train — VRAM estimation for LLM fine-tuning.
PyTorch/Hugging Face batching utility that sorts variable-length text by difficulty, then dynamically increases batch size on easier samples using a pre-trained VRAM predictor to improve GPU utilization and throughput while reducing OOM risk with fallback handling.
Adaptive dual-tier serving for Gemma 4 on consumer 16GB GPUs. Complexity + real-time VRAM routing between vLLM E4B and llama.cpp 27B. Production stack with OpenWebUI, monitoring, and more.
Secure, device-agnostic ComfyUI custom nodes for unloading one model or all models at any point in a workflow, followed by a robust set of VRAM and RAM cleanup operations that use ComfyUI's own memory management utilities. Optionally persists and routes any given data through the memory cleanup, letting the user control the execution order of nodes.
A Proof of Concept for the LEMA (Layer-wise Efficient Memory Abstraction) framework. Enables stable fine-tuning of Llama-2-7B on consumer-grade hardware (16GB VRAM) through layer-wise weight streaming and triple-buffer memory virtualization.
BoneMemory: Universal Async Core for AI-Toolkit. The hardware-agnostic VRAM manager for every model: Image, Video, and Audio. Whether you’re training a Rank 16 LoRA on an entry-level GPU or pushing Rank 1024 on an RTX 4090, BoneMemory eliminates memory bottlenecks. Architecture that makes any hardware punch above its weight. Zero OOM.
Accelerate INT8 sparse inference in PyTorch on Windows with minimal setup. Achieve high performance using Sparse Tensor Cores without Linux dependencies.
ComfyUI Custom Node for running Transformer LLMs with zero dependency conflicts. Provides device-agnostic VRAM/RAM cleanup options post-run & optional dynamic LLM prompt formatting with variable inputs of any type.