Description
I am attempting to run a model on a second GPU using a local ComfyUI worker, but I hit a CUDA error. The master GPU generates images without issue. The model I am testing is realismSDXL.
Additionally, I am exploring running the Flux-Dev model in FP8, whose weights are slightly larger than a single P100's 16 GB of VRAM, so I would like to know whether there is a recommended way to split the model across two P100 GPUs.
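For reference, this is the quick sanity check I run inside the worker's environment to confirm which devices the process can see (plain PyTorch, nothing ComfyUI-specific). Note that in the log below the second P100 appears as cuda:0, presumably because the launcher restricts the visible devices:

```python
import torch

# Enumerate the CUDA devices visible to this process and their total VRAM.
# Inside the worker, device indices are relative to CUDA_VISIBLE_DEVICES,
# so the second physical P100 can show up as cuda:0.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
```

The full worker log from the failed run follows.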
==================================================
=== ComfyUI Worker Session Started ===
Worker: Worker 1
Port: 8189
CUDA Device: 1
Started: 2025-12-02 01:42:15
Command: /usr/local/bin/python /app/main.py --port 8189 --enable-cors-header --base-dir /app --listen
Note: Worker will stop when master shuts down
==============================
[Worker Monitor] Monitoring master PID: 1
[Distributed] Started worker PID: 76
[Distributed] Monitoring master PID: 1
Checkpoint files will always be loaded safely.
Total VRAM 16384 MB, total RAM 100554 MB
pytorch version: 2.4.1+cu118
Set vram state to: NORMAL_VRAM
Device: cuda:0 GRID P100-16Q : cudaMallocAsync
Using pytorch attention
Python version: 3.12.12 (main, Nov 18 2025, 05:56:04) [GCC 14.2.0]
ComfyUI version: 0.3.36
ComfyUI frontend version: 1.20.5
[Prompt Server] web root: /usr/local/lib/python3.12/site-packages/comfyui_frontend_package/static
Import times for custom nodes:
0.0 seconds: /app/custom_nodes/websocket_image_save.py
0.0 seconds: /app/custom_nodes/ComfyUI-Distributed
Starting server
To see the GUI go to: http://0.0.0.0:8189
To see the GUI go to: http://[::]:8189
got prompt
model weight dtype torch.float16, manual cast: torch.float32
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
CLIP/text encoder model load device: cpu, offload device: cpu, current: cpu, dtype: torch.float16
Token indices sequence length is longer than the specified maximum sequence length for this model (82 > 77). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (82 > 77). Running this sequence through the model will result in indexing errors
!!! Exception during processing !!! CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/app/execution.py", line 349, in execute
output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/execution.py", line 224, in get_output_data
return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/execution.py", line 196, in _map_node_over_list
process_inputs(input_dict, i)
File "/app/execution.py", line 185, in process_inputs
results.append(getattr(obj, func)(**inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nodes.py", line 1516, in sample
return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nodes.py", line 1483, in common_ksampler
samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/comfy/sample.py", line 43, in sample
sampler = comfy.samplers.KSampler(model, steps=steps, device=model.load_device, sampler=sampler_name, scheduler=scheduler, denoise=denoise, model_options=model.model_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/comfy/samplers.py", line 1083, in __init__
self.set_steps(steps, denoise)
File "/app/comfy/samplers.py", line 1104, in set_steps
self.sigmas = self.calculate_sigmas(steps).to(self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Prompt executed in 24.10 seconds
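Since the traceback fails on the very first tensor transfer to the GPU (calculate_sigmas(...).to(self.device)), I put together a minimal repro sketch to run inside the worker's environment. My assumption: if this also raises "operation not supported", the problem is in device/allocator setup rather than in the workflow or the model:

```python
import torch

# Mirror the failing call: build a small CPU tensor (like the sampler's
# sigma schedule) and move it to the worker's GPU.
print(torch.cuda.get_device_name(0))
sigmas = torch.linspace(1.0, 0.0, 20)
print(sigmas.to("cuda:0"))  # raises RuntimeError if the device rejects the transfer
```

One detail in the log that stands out is "Device: cuda:0 GRID P100-16Q : cudaMallocAsync". As far as I know, ComfyUI has a --disable-cuda-malloc launch flag that switches off the cudaMallocAsync allocator, and the async memory pool is reportedly unsupported on some virtualized GPUs, so that flag seems worth trying here. This is a guess on my part, not a confirmed diagnosis.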
Environment Details:
- ComfyUI version: 0.3.36
- Frontend version: 1.20.5
- PyTorch: 2.4.1+cu118
- Python: 3.12.12
- GPU: NVIDIA GRID P100-16Q (16 GB VRAM)
- Worker port: 8189
Additional Information:
- The master GPU can successfully generate images.
- I am using a local worker for the second GPU (CUDA Device: 1).
- I would like guidance on running larger models (like Flux-Dev in FP8) across multiple P100 GPUs, if possible.
Questions:
- Is there a recommended way to split a model across two P100 GPUs for ComfyUI? (A naive sketch of the kind of split I mean follows below.)
- Are there any known issues with running FP16/FP8 models on GRID P100 vGPUs that could cause "CUDA error: operation not supported"?
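For context, the kind of split I have in mind is a naive pipeline split, sketched below in plain PyTorch. Everything here is hypothetical (TwoGPUPipeline and the blocks list are illustrative, not ComfyUI or Flux internals):

```python
import torch
import torch.nn as nn

class TwoGPUPipeline(nn.Module):
    """Naive pipeline split: early blocks on cuda:0, late blocks on cuda:1."""

    def __init__(self, blocks: nn.ModuleList, split: int):
        super().__init__()
        self.first = blocks[:split].to("cuda:0")   # first half of the weights
        self.second = blocks[split:].to("cuda:1")  # second half of the weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to("cuda:0")
        for blk in self.first:
            x = blk(x)
        x = x.to("cuda:1")  # activations cross the PCIe bus here
        for blk in self.second:
            x = blk(x)
        return x
```

Whether something like this is feasible for the diffusion model inside ComfyUI, or whether ComfyUI-Distributed already supports some form of weight offloading or sharding, is what I am hoping to learn.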