Open
Description
I’m running a customized version of art-e with additional tools locally. Training proceeds normally until the first validation stage, where it hangs for hours.
I am running Qwen2.5-1.5B-Instruct on a single A100. Below is the code I use to load the model.
import art
from art.local.backend import LocalBackend

# Local backend that stores its state under ./.art
backend = LocalBackend(path="./.art")

model = art.TrainableModel(
    name="Qwen2.5-1.5B",
    project="my-model",
    base_model="Qwen/Qwen2.5-1.5B-Instruct",
)

# Override the internal defaults for model init, LoRA, trainer, and vLLM engine
model._internal_config = art.dev.InternalModelConfig(
    init_args=art.dev.InitArgs(
        max_seq_length=4096,
    ),
    peft_args=art.dev.PeftArgs(
        r=8,
        lora_alpha=8,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    ),
    trainer_args=art.dev.TrainerArgs(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=2,
    ),
    engine_args=art.dev.EngineArgs(
        gpu_memory_utilization=0.8,
        enforce_eager=True,
    ),
)

await model.register(backend)
Training prints the following and then hangs:
train: 0%| | 0/3 [00:00<?, ?it/s]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 15,000,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 2,179,072 of 1,545,893,376 (0.14% trained)
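To make a hang like this fail fast instead of blocking for hours, the awaited step can be wrapped in a timeout. This is only a minimal sketch using the standard library's `asyncio.wait_for`; the helper name `run_with_timeout` and its parameters are hypothetical and not part of ART, and which coroutine to wrap (the training or validation call) is an assumption that depends on the training script.

```python
import asyncio


async def run_with_timeout(coro, seconds: float, label: str = "step"):
    """Await `coro`; raise TimeoutError naming `label` if it exceeds `seconds`.

    Hypothetical helper for illustration only, not an ART API.
    """
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising, so the
        # caller gets a clear error instead of an indefinite hang.
        raise TimeoutError(f"{label} did not finish within {seconds} s") from None
```

In a script, the suspect call would be passed in as a coroutine, for example `await run_with_timeout(some_coroutine(), 600, "validation")`, where `some_coroutine` stands in for whatever step is hanging.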
I’ve been monitoring VLLM.log, and it just keeps repeating the same entries.
From VLLM.log:
INFO: 127.0.0.1:46160 - "GET /metrics HTTP/1.1" 200
INFO: 127.0.0.1:46166 - "GET /metrics HTTP/1.1" 200
INFO: 127.0.0.1:46172 - "GET /metrics HTTP/1.1" 200
INFO: 127.0.0.1:46178 - "GET /metrics HTTP/1.1" 200
INFO: 127.0.0.1:46184 - "GET /metrics HTTP/1.1" 200
INFO: 127.0.0.1:46190 - "GET /metrics HTTP/1.1" 200
INFO: 127.0.0.1:46196 - "GET /metrics HTTP/1.1" 200
INFO: 127.0.0.1:46204 - "GET /metrics HTTP/1.1" 200
From debug_internal.log:
{"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
{"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
{"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
{"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
{"time":"2025-11","level":"ERROR","msg":"monitor: error sampling metrics: GET http://localhost:8000/metrics giving up after 4 attempt(s): Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused"}
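The errors above dial `[::1]:8000` (the IPv6 loopback), while the requests in VLLM.log come from `127.0.0.1`. Whether that mismatch matters here is unconfirmed, but a quick check of which loopback addresses accept connections on the port may help narrow it down. The following is a minimal sketch using only the standard library; the helper name `can_connect` is hypothetical, and port 8000 is taken from the log lines above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unavailable address families.
        return False


if __name__ == "__main__":
    # Port 8000 comes from the debug_internal.log entries; adjust if needed.
    for host in ("127.0.0.1", "::1", "localhost"):
        print(f"{host}:8000 reachable = {can_connect(host, 8000)}")
```

If `127.0.0.1` is reachable but `::1` is not, the process polling `http://localhost:8000/metrics` may be resolving `localhost` to IPv6 while the server listens on IPv4 only.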
This looks similar to issue 329. I tried installing the latest version from GitHub, but the problem persists. Any pointers or examples would be greatly appreciated!