Training gets stuck after completing just one trajectory

I’m running a customized version of art-e with additional tools locally. Training proceeds normally until the first validation stage, where it gets stuck for hours.

I am running Qwen2.5-1.5B-Instruct on a single A100. Below is the code I use to load the model:
```
    import art
    from art.local.backend import LocalBackend

    backend = LocalBackend(path="./.art")

    model = art.TrainableModel(
        name="Qwen2.5-1.5B",
        project="my-model",
        base_model="Qwen/Qwen2.5-1.5B-Instruct",
    )

    model._internal_config = art.dev.InternalModelConfig(
        init_args=art.dev.InitArgs(
            max_seq_length=4096,
        ),
        peft_args=art.dev.PeftArgs(
            r=8,
            lora_alpha=8,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        ),
        trainer_args=art.dev.TrainerArgs(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=2,
        ),
        engine_args=art.dev.EngineArgs(
            gpu_memory_utilization=0.8,
            enforce_eager=True,
        ),
    )
    await model.register(backend)
```

Training prints the following and then hangs:
```
    train:   0%|          | 0/3 [00:00<?, ?it/s]
    ==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
       \\   /|    Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 15,000,000
    O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
    \        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
     "-____-"     Trainable parameters = 2,179,072 of 1,545,893,376 (0.14% trained)
```
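As a sanity check that the LoRA settings were applied, the trainable-parameter count in the banner can be reproduced by hand. The sketch below assumes the published Qwen2.5-1.5B dimensions (hidden size 1536, 28 layers, 2 key/value heads of dimension 128), which are not stated anywhere in the logs above. It counts `r * (in_features + out_features)` parameters for each adapted projection.

```python
# Assumed Qwen2.5-1.5B dimensions (taken from the published model config,
# not from the logs in this issue).
HIDDEN = 1536
NUM_LAYERS = 28
NUM_KV_HEADS = 2
HEAD_DIM = 128
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # width of the k_proj / v_proj outputs

RANK = 8  # r=8 from PeftArgs above


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
trainable = per_layer * NUM_LAYERS

total = 1_545_893_376  # total parameter count reported in the banner
share = f"{100 * trainable / total:.2f}%"

print(trainable)  # 2179072
print(share)      # 0.14%
```

Under those assumptions the result matches the banner's `2,179,072` and `0.14%`, which suggests the `r=8` adapter on the four attention projections was picked up as configured.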

I've been monitoring `VLLM.log`, and it keeps repeating the same entries:

    INFO:     127.0.0.1:46160 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46166 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46172 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46178 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46184 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46190 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46196 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46204 - "GET /metrics HTTP/1.1" 200

From `debug_internal.log`:

```
{"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
{"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
{"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
{"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
{"time":"2025-11","level":"ERROR","msg":"monitor: error sampling metrics: GET http://localhost:8000/metrics giving up after 4 attempt(s): Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused"}
```
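The two logs disagree about reachability: `VLLM.log` shows `/metrics` requests arriving from `127.0.0.1`, while the monitor in `debug_internal.log` is refused on `[::1]:8000`, the IPv6 loopback. A small probe like the one below could show which loopback address, if any, accepts connections on a given port. It is only a diagnostic sketch using the standard library. Port 8000 is taken from the error messages, and whether the vLLM server actually listens on that port is an assumption, as is any link between these errors and the hang.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and an unavailable IPv6 stack.
        return False


def probe_loopback(port: int, timeout: float = 1.0) -> dict:
    """Check the IPv4 and IPv6 loopback addresses separately."""
    return {host: can_connect(host, port, timeout) for host in ("127.0.0.1", "::1")}


if __name__ == "__main__":
    # Port 8000 comes from debug_internal.log; adjust if the server uses another port.
    print(probe_loopback(8000))
```

If `127.0.0.1` connects and `::1` does not, the refused requests may just mean that `localhost` resolves to IPv6 first for the monitor, rather than that the server is down.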

This looks similar to [#329](https://github.com/OpenPipe/ART/issues/329). I tried installing the latest version from GitHub, but the issue persists. Any pointers or examples would be greatly appreciated!


Training gets stuck after completing just one trajectory #459
