Unable to run model on second GPU; CUDA error: operation not supported #57

@mr-gorjan

Description

I am attempting to run a model on a second GPU using a local ComfyUI worker, but the run fails with "CUDA error: operation not supported" (full log below). The master GPU generates images without issues. The model I am testing is realismSDXL.

Additionally, I am exploring running the Flux-Dev model in FP8, which is slightly larger than a single P100's 16 GB of VRAM. Is there a recommended way to split the model across two P100 GPUs?
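For context, this is roughly how I understand the worker to be pinned to the second GPU. This is a sketch of the launch, assuming CUDA_VISIBLE_DEVICES is the mechanism the worker uses to select the device; with only device 1 visible, PyTorch renumbers it as cuda:0 inside the process, which would explain why the log below reports "Device: cuda:0" even though the worker is configured for CUDA Device 1:

```shell
# Hypothetical launch sketch: expose only physical GPU 1 to the worker.
# Inside the process PyTorch then enumerates that GPU as cuda:0.
CUDA_VISIBLE_DEVICES=1 python /app/main.py --port 8189 --enable-cors-header --base-dir /app --listen
```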

==================================================
=== ComfyUI Worker Session Started ===
Worker: Worker 1
Port: 8189
CUDA Device: 1
Started: 2025-12-02 01:42:15
Command: /usr/local/bin/python /app/main.py --port 8189 --enable-cors-header --base-dir /app --listen
Note: Worker will stop when master shuts down
==============================

[Worker Monitor] Monitoring master PID: 1
[Distributed] Started worker PID: 76
[Distributed] Monitoring master PID: 1
Checkpoint files will always be loaded safely.
Total VRAM 16384 MB, total RAM 100554 MB
pytorch version: 2.4.1+cu118
Set vram state to: NORMAL_VRAM
Device: cuda:0 GRID P100-16Q : cudaMallocAsync
Using pytorch attention
Python version: 3.12.12 (main, Nov 18 2025, 05:56:04) [GCC 14.2.0]
ComfyUI version: 0.3.36
ComfyUI frontend version: 1.20.5
[Prompt Server] web root: /usr/local/lib/python3.12/site-packages/comfyui_frontend_package/static

Import times for custom nodes:
   0.0 seconds: /app/custom_nodes/websocket_image_save.py
   0.0 seconds: /app/custom_nodes/ComfyUI-Distributed

Starting server

To see the GUI go to: http://0.0.0.0:8189
To see the GUI go to: http://[::]:8189
got prompt
model weight dtype torch.float16, manual cast: torch.float32
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
CLIP/text encoder model load device: cpu, offload device: cpu, current: cpu, dtype: torch.float16
Token indices sequence length is longer than the specified maximum sequence length for this model (82 > 77). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (82 > 77). Running this sequence through the model will result in indexing errors
!!! Exception during processing !!! CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/app/execution.py", line 349, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/execution.py", line 224, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/execution.py", line 196, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/app/execution.py", line 185, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nodes.py", line 1516, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nodes.py", line 1483, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/comfy/sample.py", line 43, in sample
    sampler = comfy.samplers.KSampler(model, steps=steps, device=model.load_device, sampler=sampler_name, scheduler=scheduler, denoise=denoise, model_options=model.model_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/comfy/samplers.py", line 1083, in __init__
    self.set_steps(steps, denoise)
  File "/app/comfy/samplers.py", line 1104, in set_steps
    self.sigmas = self.calculate_sigmas(steps).to(self.device)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Prompt executed in 24.10 seconds
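Since the traceback itself warns that the reported stack may be inaccurate (CUDA errors can surface asynchronously at a later API call), I can rerun the worker with synchronous kernel launches to pin down the real failing call. This is just the debugging step the error message recommends:

```shell
# Debugging sketch: CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches,
# so the Python stack trace points at the CUDA call that actually failed.
# Much slower; for debugging only.
CUDA_LAUNCH_BLOCKING=1 python /app/main.py --port 8189 --enable-cors-header --base-dir /app --listen
```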

Environment Details:

  • ComfyUI version: 0.3.36
  • Frontend version: 1.20.5
  • PyTorch: 2.4.1+cu118
  • Python: 3.12.12
  • GPU: NVIDIA GRID P100-16Q (16 GB VRAM)
  • Worker port: 8189

Additional Information:

  • The master GPU can successfully generate images.
  • I am using a local worker for the second GPU (CUDA Device: 1).
  • I would like guidance on running larger models (like Flux-Dev in FP8) across multiple P100 GPUs, if possible.

Questions:

  1. Is there a recommended way to split a model across two P100 GPUs for ComfyUI?
  2. Are there any known issues with running FP16/FP8 models on GRID P100 vGPUs that could cause CUDA error: operation not supported?
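Regarding question 2: the worker log shows the cudaMallocAsync allocator is active on this device ("Device: cuda:0 GRID P100-16Q : cudaMallocAsync"), and the failing operation is a plain .to(self.device) copy of a small CPU tensor. One workaround I plan to try, assuming ComfyUI's --disable-cuda-malloc flag behaves as documented, is falling back to PyTorch's default caching allocator, since stream-ordered allocation may not be supported by the GRID vGPU driver:

```shell
# Hypothetical workaround: --disable-cuda-malloc switches ComfyUI from the
# cudaMallocAsync backend (shown in the log for this device) back to PyTorch's
# default caching allocator.
python /app/main.py --port 8189 --enable-cors-header --base-dir /app --listen --disable-cuda-malloc
```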
