Training gets stuck after completing just one trajectory #459

@abd-hsn

I’m running a customized version of art-e with additional tools locally. Training proceeds normally until the first validation stage, where it gets stuck for hours.

I am running Qwen2.5-1.5B-Instruct on a single A100. Below is the code I use to load the model.

    import art
    from art.local.backend import LocalBackend

    backend = LocalBackend(path="./.art")

    model = art.TrainableModel(
        name="Qwen2.5-1.5B",
        project="my-model",
        base_model="Qwen/Qwen2.5-1.5B-Instruct",
    )

    model._internal_config = art.dev.InternalModelConfig(
        init_args=art.dev.InitArgs(
            max_seq_length=4096,
        ),
        peft_args=art.dev.PeftArgs(
            r=8,
            lora_alpha=8,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        ),
        trainer_args=art.dev.TrainerArgs(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=2,
        ),
        engine_args=art.dev.EngineArgs(
            gpu_memory_utilization=0.8,
            enforce_eager=True,
        ),
    )
    await model.register(backend)

Training prints the following and then hangs:

    train:   0%|          | 0/3 [00:00<?, ?it/s]
    ==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
       \\   /|    Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 15,000,000
    O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
    \        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
     "-____-"     Trainable parameters = 2,179,072 of 1,545,893,376 (0.14% trained)

I’ve been monitoring VLLM.log, and it keeps repeating the same entries.
From VLLM.log:

    INFO:     127.0.0.1:46160 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46166 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46172 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46178 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46184 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46190 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46196 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46204 - "GET /metrics HTTP/1.1" 200

From debug_internal.log:

    {"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
    {"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
    {"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
    {"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
    {"time":"2025-11","level":"ERROR","msg":"monitor: error sampling metrics: GET http://localhost:8000/metrics giving up after 4 attempt(s): Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused"}
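
One detail in the two logs may be worth checking: the requests that succeed in VLLM.log come from 127.0.0.1 (IPv4), while the refused requests in debug_internal.log dial [::1] (IPv6). The sketch below probes both loopback addresses to see which ones accept TCP connections. It assumes the server is on port 8000 (taken from the URL in debug_internal.log), and `can_connect` and `probe_loopback` are hypothetical helper names, not part of ART or vLLM. This only tests reachability; it does not by itself explain the hang.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unsupported address families.
        return False


def probe_loopback(port: int) -> dict[str, bool]:
    """Check the IPv4 loopback, the IPv6 loopback, and the name 'localhost'."""
    return {host: can_connect(host, port) for host in ("127.0.0.1", "::1", "localhost")}


if __name__ == "__main__":
    # Port 8000 is an assumption taken from the URL in debug_internal.log.
    for host, ok in probe_loopback(8000).items():
        print(f"{host}: {'reachable' if ok else 'not reachable'}")
```

If 127.0.0.1 is reachable but ::1 is not, the refused requests could simply be a client resolving `localhost` to IPv6 while the server listens on IPv4 only.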

This looks similar to issue #329. I tried installing the latest version from GitHub, but the issue persists. Any pointers or examples would be greatly appreciated!
