Hi there,
Wonderful work, and thanks for sharing!
I tried running tokasaurus with the Qwen2.5 32B model. Running with tp_size works well, but pp_size fails.
Hardware: Hopper GPUs with NVLink, pp_size = 2, and sufficient GPU memory.
Error 1:
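(Side note: the repeated TORCH_CUDA_ARCH_LIST warnings in the logs below appear unrelated to the failure; they can be silenced by pinning the arch list before launch. A minimal sketch, assuming Hopper, i.e. compute capability 9.0:)

```python
import os

# Hopper GPUs are compute capability 9.0 (sm_90); restricting the arch list
# stops torch.utils.cpp_extension from compiling for every visible arch and
# silences the repeated warnings.
os.environ["TORCH_CUDA_ARCH_LIST"] = "9.0"
```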
Starting 7 processes: ['model_worker_pp0', 'model_worker_pp1', 'model_worker_pp2', 'model_worker_pp3', 'fanout_worker', 'manager', 'server']
Running in the main process: server
2025-10-09 09:14:22 | INFO | server | Starting web server
W1009 09:14:27.610000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:27.610000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.191000 297514 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.191000 297514 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.300000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.300000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.433000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.433000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.454000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.454000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.946000 297515 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.946000 297515 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:29 | INFO | model_worker_pp1 | Pipeline worker 1 started!
2025-10-09 09:14:30 | INFO | fanout_worker | Fanout worker started!
2025-10-09 09:14:30 | INFO | model_worker_pp3 | Pipeline worker 3 started!
2025-10-09 09:14:30 | INFO | model_worker_pp0 | Pipeline worker 0 started!
2025-10-09 09:14:30 | INFO | model_worker_pp2 | Pipeline worker 2 started!
2025-10-09 09:14:30 | INFO | manager | Manager started
2025-10-09 09:14:31 | INFO | model_worker_pp2 | Creating model on device cuda:2 with dtype torch.bfloat16
Building layers 32 to 48
Loading from safetensors
Loading safetensors files: 0%| | 0/14 [00:00<?, ?it/s]2025-10-09 09:14:31 | INFO | model_worker_pp1 | Creating model on device cuda:1 with dtype torch.bfloat16
2025-10-09 09:14:31 | INFO | model_worker_pp3 | Creating model on device cuda:3 with dtype torch.bfloat16
Building layers 16 to 32
2025-10-09 09:14:31 | INFO | model_worker_pp0 | Creating model on device cuda:0 with dtype torch.bfloat16
Building layers 48 to 64
Building layers 0 to 16
Loading from safetensors
Loading safetensors files: 0%| | 0/14 [00:00<?, ?it/s]Loading from safetensors
Loading safetensors files: 0%| | 0/14 [00:00<?, ?it/s]Loading from safetensors
Loading safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:05<00:00, 2.65it/s]
Loading safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:04<00:00, 2.95it/s]
Loading safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:04<00:00, 2.97it/s]
2025-10-09 09:14:36 | INFO | model_worker_pp1 | Created model
2025-10-09 09:14:36 | INFO | model_worker_pp2 | Created model
Capturing cudagraphs for model_worker_pp1: 0%| | 0/8 [00:00<?, ?it/s]2025-10-09 09:14:36,832 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank2]:W1009 09:14:36.832000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank2]:W1009 09:14:36.832000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36,834 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank1]:W1009 09:14:36.835000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank1]:W1009 09:14:36.835000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36 | INFO | model_worker_pp3 | Created model
Capturing cudagraphs for model_worker_pp3: 0%| | 0/8 [00:00<?, ?it/s][rank2]:W1009 09:14:36.860000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank2]:W1009 09:14:36.860000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36,869 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-10-09 09:14:36,872 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank3]:W1009 09:14:36.872000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:36.872000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank3]:W1009 09:14:36.880000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:36.880000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36,889 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank1]:W1009 09:14:36.946000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank1]:W1009 09:14:36.946000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36,955 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
Loading safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:04<00:00, 2.90it/s]
2025-10-09 09:14:37 | INFO | model_worker_pp0 | Created model
Capturing cudagraphs for model_worker_pp0: 0%| | 0/8 [00:00<?, ?it/s]2025-10-09 09:14:37,075 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank0]:W1009 09:14:37.075000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1009 09:14:37.075000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank0]:W1009 09:14:37.083000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1009 09:14:37.083000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:37,093 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-10-09 09:14:37,169 - INFO - flashinfer.jit: Loading JIT ops: sampling
[rank3]:W1009 09:14:37.169000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:37.169000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank3]:W1009 09:14:37.176000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:37.176000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:37,185 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
Capturing cudagraphs for model_worker_pp1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.54it/s]
Warmup loop for model_worker_pp1: 0%| | 0/84 [00:00<?, ?it/s]2025-10-09 09:14:39,117 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
[rank1]:W1009 09:14:39.117000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank1]:W1009 09:14:39.117000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank1]:W1009 09:14:39.125000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank1]:W1009 09:14:39.125000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:39,140 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,149 - INFO - flashinfer.jit: Loading JIT ops: page
[rank1]:W1009 09:14:39.150000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank1]:W1009 09:14:39.150000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank1]:W1009 09:14:39.156000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank1]:W1009 09:14:39.156000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
Capturing cudagraphs for model_worker_pp0: 88%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 7/8 [00:02<00:00, 3.69it/s]2025-10-09 09:14:39,163 - INFO - flashinfer.jit: Finished loading JIT ops: page
Capturing cudagraphs for model_worker_pp2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.41it/s]
Warmup loop for model_worker_pp2: 0%| | 0/84 [00:00<?, ?it/s]2025-10-09 09:14:39,213 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
[rank2]:W1009 09:14:39.213000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank2]:W1009 09:14:39.213000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank2]:W1009 09:14:39.220000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank2]:W1009 09:14:39.220000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
Capturing cudagraphs for model_worker_pp3: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.38it/s]
2025-10-09 09:14:39,229 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,238 - INFO - flashinfer.jit: Loading JIT ops: page
[rank2]:W1009 09:14:39.238000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank2]:W1009 09:14:39.238000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank2]:W1009 09:14:39.244000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank2]:W1009 09:14:39.244000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:39,251 - INFO - flashinfer.jit: Finished loading JIT ops: page
Warmup loop for model_worker_pp3: 0%| | 0/84 [00:00<?, ?it/s]2025-10-09 09:14:39,267 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
[rank3]:W1009 09:14:39.267000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:39.267000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank3]:W1009 09:14:39.274000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:39.274000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:39,282 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,291 - INFO - flashinfer.jit: Loading JIT ops: page
[rank3]:W1009 09:14:39.291000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:39.291000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank3]:W1009 09:14:39.297000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:39.297000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:39,304 - INFO - flashinfer.jit: Finished loading JIT ops: page
Capturing cudagraphs for model_worker_pp0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.45it/s]
Warmup loop for model_worker_pp0: 0%| | 0/84 [00:00<?, ?it/s]2025-10-09 09:14:39,425 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
[rank0]:W1009 09:14:39.425000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1009 09:14:39.425000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
Warmup loop for model_worker_pp1: 1%|█▊ | 1/84 [00:00<00:26, 3.13it/s][rank0]:W1009 09:14:39.432000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1009 09:14:39.432000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:39,441 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,451 - INFO - flashinfer.jit: Loading JIT ops: page
[rank0]:W1009 09:14:39.451000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1009 09:14:39.451000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank0]:W1009 09:14:39.457000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1009 09:14:39.457000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:39,463 - INFO - flashinfer.jit: Finished loading JIT ops: page
Warmup loop for model_worker_pp1: 27%|████████████████████████████████████████▊
Traceback (most recent call last):
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/utils.py", line 502, in wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/pipeline_worker.py", line 278, in pipeline_worker_model_loop
setup_and_run_loop(
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 896, in setup_and_run_loop
run_warmup_batches(
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 483, in run_warmup_batches
run_overlapped_loop(
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 296, in run_overlapped_loop
run_model(run_work)
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/pipeline_worker.py", line 213, in run_model
output_batch_state = model_runner.run(
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 849, in run
return self.graphs[graph_index].run(input_batch_state, non_blocking)
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 618, in run
assert self.output_batch_state.outputs is not None
AssertionError
Error 2: if I comment out the assertion, it then fails on the next replace() call:
input_batch_state.outputs = replace(self.output_batch_state.outputs)
File "/opt/conda/envs/tokasaurus/lib/python3.10/dataclasses.py", line 1424, in replace
raise TypeError("replace() should be called on dataclass instances")
TypeError: replace() should be called on dataclass instances
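For what it's worth, Error 2 looks like a downstream symptom of the same root cause as Error 1: `self.output_batch_state.outputs` is still `None` when the next stage tries to copy it, and `dataclasses.replace` rejects anything that is not a dataclass instance. A minimal sketch (the `Outputs` class here is hypothetical, just to reproduce the exception):

```python
from dataclasses import dataclass, replace


@dataclass
class Outputs:
    logits: object = None


# Normal case: replace() shallow-copies a dataclass instance.
copied = replace(Outputs())

# Failure case: outputs was never populated, so it is still None.
try:
    replace(None)
except TypeError as e:
    # Same message as in the traceback above:
    # replace() should be called on dataclass instances
    print(e)
```

So commenting out the assertion only moves the crash later; the real question is why the pipeline worker's output batch state has no outputs at that point in the warmup loop.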
Hi, there,
wonderful work and thanks for sharing.
I tried to run toka with Qwen2.5 32b model, tp_size works well, but pp_size failed.
Hardware: Hopper GPUs with NVLink. pp_size = 2. GPU memory is enough.
Error1:
Starting 7 processes: ['model_worker_pp0', 'model_worker_pp1', 'model_worker_pp2', 'model_worker_pp3', 'fanout_worker', 'manager', 'server']
Running in the main process: server
2025-10-09 09:14:22 | INFO | server | Starting web server
W1009 09:14:27.610000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:27.610000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.191000 297514 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.191000 297514 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.300000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.300000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.433000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.433000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.454000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.454000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1009 09:14:28.946000 297515 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W1009 09:14:28.946000 297515 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:29 | INFO | model_worker_pp1 | Pipeline worker 1 started!
2025-10-09 09:14:30 | INFO | fanout_worker | Fanout worker started!
2025-10-09 09:14:30 | INFO | model_worker_pp3 | Pipeline worker 3 started!
2025-10-09 09:14:30 | INFO | model_worker_pp0 | Pipeline worker 0 started!
2025-10-09 09:14:30 | INFO | model_worker_pp2 | Pipeline worker 2 started!
2025-10-09 09:14:30 | INFO | manager | Manager started
2025-10-09 09:14:31 | INFO | model_worker_pp2 | Creating model on device cuda:2 with dtype torch.bfloat16
Building layers 32 to 48
Loading from safetensors
Loading safetensors files: 0%| | 0/14 [00:00<?, ?it/s]2025-10-09 09:14:31 | INFO | model_worker_pp1 | Creating model on device cuda:1 with dtype torch.bfloat16
2025-10-09 09:14:31 | INFO | model_worker_pp3 | Creating model on device cuda:3 with dtype torch.bfloat16
Building layers 16 to 32
2025-10-09 09:14:31 | INFO | model_worker_pp0 | Creating model on device cuda:0 with dtype torch.bfloat16
Building layers 48 to 64
Building layers 0 to 16
Loading from safetensors
Loading safetensors files: 0%| | 0/14 [00:00<?, ?it/s]Loading from safetensors
Loading safetensors files: 0%| | 0/14 [00:00<?, ?it/s]Loading from safetensors
Loading safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:05<00:00, 2.65it/s]
Loading safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:04<00:00, 2.95it/s]
Loading safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:04<00:00, 2.97it/s]
2025-10-09 09:14:36 | INFO | model_worker_pp1 | Created model
2025-10-09 09:14:36 | INFO | model_worker_pp2 | Created model
Capturing cudagraphs for model_worker_pp1: 0%| | 0/8 [00:00<?, ?it/s]2025-10-09 09:14:36,832 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank2]:W1009 09:14:36.832000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank2]:W1009 09:14:36.832000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36,834 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank1]:W1009 09:14:36.835000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank1]:W1009 09:14:36.835000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36 | INFO | model_worker_pp3 | Created model
Capturing cudagraphs for model_worker_pp3: 0%| | 0/8 [00:00<?, ?it/s][rank2]:W1009 09:14:36.860000 297512 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank2]:W1009 09:14:36.860000 297512 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36,869 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-10-09 09:14:36,872 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank3]:W1009 09:14:36.872000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:36.872000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank3]:W1009 09:14:36.880000 297513 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank3]:W1009 09:14:36.880000 297513 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36,889 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank1]:W1009 09:14:36.946000 297511 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank1]:W1009 09:14:36.946000 297511 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:36,955 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
Loading safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:04<00:00, 2.90it/s]
2025-10-09 09:14:37 | INFO | model_worker_pp0 | Created model
Capturing cudagraphs for model_worker_pp0: 0%| | 0/8 [00:00<?, ?it/s]2025-10-09 09:14:37,075 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[rank0]:W1009 09:14:37.075000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1009 09:14:37.075000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[rank0]:W1009 09:14:37.083000 297510 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
[rank0]:W1009 09:14:37.083000 297510 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
2025-10-09 09:14:37,093 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-10-09 09:14:37,169 - INFO - flashinfer.jit: Loading JIT ops: sampling
2025-10-09 09:14:37,185 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
Capturing cudagraphs for model_worker_pp1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.54it/s]
Warmup loop for model_worker_pp1: 0%| | 0/84 [00:00<?, ?it/s]
2025-10-09 09:14:39,117 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,140 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,149 - INFO - flashinfer.jit: Loading JIT ops: page
Capturing cudagraphs for model_worker_pp0: 88%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 7/8 [00:02<00:00, 3.69it/s]
2025-10-09 09:14:39,163 - INFO - flashinfer.jit: Finished loading JIT ops: page
Capturing cudagraphs for model_worker_pp2: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.41it/s]
Warmup loop for model_worker_pp2: 0%| | 0/84 [00:00<?, ?it/s]
2025-10-09 09:14:39,213 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
Capturing cudagraphs for model_worker_pp3: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.38it/s]
2025-10-09 09:14:39,229 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,238 - INFO - flashinfer.jit: Loading JIT ops: page
2025-10-09 09:14:39,251 - INFO - flashinfer.jit: Finished loading JIT ops: page
Warmup loop for model_worker_pp3: 0%| | 0/84 [00:00<?, ?it/s]
2025-10-09 09:14:39,267 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,282 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,291 - INFO - flashinfer.jit: Loading JIT ops: page
2025-10-09 09:14:39,304 - INFO - flashinfer.jit: Finished loading JIT ops: page
Capturing cudagraphs for model_worker_pp0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.45it/s]
Warmup loop for model_worker_pp0: 0%| | 0/84 [00:00<?, ?it/s]
2025-10-09 09:14:39,425 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
Warmup loop for model_worker_pp1: 1%|█▊ | 1/84 [00:00<00:26, 3.13it/s]
2025-10-09 09:14:39,441 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90
2025-10-09 09:14:39,451 - INFO - flashinfer.jit: Loading JIT ops: page
2025-10-09 09:14:39,463 - INFO - flashinfer.jit: Finished loading JIT ops: page
Warmup loop for model_worker_pp1: 27%|████████████████████████████████████████▊
Traceback (most recent call last):
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/utils.py", line 502, in wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/pipeline_worker.py", line 278, in pipeline_worker_model_loop
setup_and_run_loop(
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 896, in setup_and_run_loop
run_warmup_batches(
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 483, in run_warmup_batches
run_overlapped_loop(
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 296, in run_overlapped_loop
run_model(run_work)
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/pipeline_worker.py", line 213, in run_model
output_batch_state = model_runner.run(
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 849, in run
return self.graphs[graph_index].run(input_batch_state, non_blocking)
File "/opt/conda/envs/tokasaurus/lib/python3.10/site-packages/tokasaurus/model/utils.py", line 618, in run
assert self.output_batch_state.outputs is not None
AssertionError
Error2: if I comment out that assertion as an experiment, it then fails on the next `replace()` call:
input_batch_state.outputs = replace(self.output_batch_state.outputs)
File "/opt/conda/envs/tokasaurus/lib/python3.10/dataclasses.py", line 1424, in replace
raise TypeError("replace() should be called on dataclass instances")
TypeError: replace() should be called on dataclass instances
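Both errors seem to share one root cause: on this rank, `output_batch_state.outputs` is still `None` when the warmup loop reads it, and `dataclasses.replace()` only accepts dataclass instances, so commenting out the assertion just moves the failure one line later. A minimal sketch (using a hypothetical `Outputs` dataclass, not the actual tokasaurus type) reproduces the second error in isolation:

```python
from dataclasses import dataclass, is_dataclass, replace

# Hypothetical stand-in for the outputs dataclass tokasaurus passes between
# pipeline stages; only the shape matters for this repro.
@dataclass
class Outputs:
    logits: int = 0

# replace() works on a populated dataclass instance...
copied = replace(Outputs(logits=1))
assert is_dataclass(copied)

# ...but if a pipeline stage never filled in `outputs`, the field is still
# None, and replace(None) raises the same TypeError as in the traceback.
try:
    replace(None)
    err = None
except TypeError as exc:
    err = str(exc)

print(err)
```

So the assertion at `model/utils.py:618` is not the bug itself; it is catching the missing `outputs` earlier, which suggests the pp>1 path never populated `output_batch_state.outputs` for this worker.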