
Conversation

@pcmoritz (Collaborator) commented Jan 13, 2026

The engine can e.g. be started with

uv run --extra gpu --extra tinker -m tx.tinker.api --base-model Qwen/Qwen3-4B --backend "skyrl_train"

and then you can e.g. run

uv run --with wandb --with tinker sl_loop.py base_url=http://localhost:8000 model_name=Qwen/Qwen3-8B lora_rank=1

@pcmoritz pcmoritz added the tx label Jan 13, 2026
@gemini-code-assist bot (Contributor) left a comment
Code Review

This pull request introduces a new SkyRL-train backend for supervised training. The changes include updating project dependencies in pyproject.toml and adding the new backend implementation in skyrl-tx/tx/tinker/backends/skyrl_train.py. While this is a good starting point for the new backend, my review has identified several issues that need to be addressed. The most critical issue is in the forward_backward method, which is currently a stub and does not perform a backward pass or return actual losses, preventing any training from occurring. Other significant issues include the use of hardcoded paths and hyperparameters, potentially incorrect token padding, and breaking encapsulation by accessing private members of a library class. Addressing these points will be crucial for the backend to be functional and maintainable.
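The review's central complaint is that forward_backward is a stub that returns placeholder data. The contract it should satisfy can be illustrated with a minimal sketch; the dataclass and function names below are assumptions for illustration, not skyrl-train's actual API:

```python
from dataclasses import dataclass


@dataclass
class ForwardBackwardResult:
    """Hypothetical result shape: real per-token values, not placeholders."""
    logprobs: list[float]          # per-token log-probabilities from the model
    elementwise_loss: list[float]  # per-token loss values


def forward_backward_result(per_token_logprobs: list[float]) -> ForwardBackwardResult:
    # A stub would return zeros; a functional backend derives the loss from
    # the model's log-probabilities (for cross-entropy, loss = -logprob).
    return ForwardBackwardResult(
        logprobs=per_token_logprobs,
        elementwise_loss=[-lp for lp in per_token_logprobs],
    )
```

In the real backend these values would come out of the actors' backward pass rather than being computed client-side.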

Comment on lines 192 to 197
ray.get([actor.save_checkpoint.remote(output_path) for actor in self._actor_group._actor_handlers])

def load_checkpoint(self, checkpoint_path, model_id: str) -> None:
if model_id != self._model_id:
raise ValueError(f"Model {model_id} not found")
ray.get([actor.load_checkpoint.remote(Path(checkpoint_path)) for actor in self._actor_group._actor_handlers])

Severity: medium

Accessing the private member _actor_handlers of PPORayActorGroup breaks encapsulation and makes the code dependent on the internal implementation of the skyrl-train library. This could lead to breakages if the library is updated. It would be more robust to use a public API from PPORayActorGroup for this purpose, or request one if it doesn't exist.
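One way to address this would be a public broadcast helper on the actor group itself. This is a hypothetical sketch of such an API (the class body, `run_on_all_actors`, and the constructor shape are assumptions, not skyrl-train's actual implementation):

```python
class PPORayActorGroup:
    """Sketch: expose a public method so callers never touch _actor_handlers."""

    def __init__(self, actor_handlers):
        self._actor_handlers = actor_handlers

    def run_on_all_actors(self, method_name: str, *args, **kwargs):
        """Invoke `method_name` on every actor; return the list of Ray futures."""
        return [
            getattr(actor, method_name).remote(*args, **kwargs)
            for actor in self._actor_handlers
        ]
```

The backend's checkpoint code could then call `ray.get(group.run_on_all_actors("save_checkpoint", output_path))` without depending on the library's internals.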

@vercel bot commented Jan 25, 2026

@pcmoritz is attempting to deploy a commit to the "Tyler's projects" team on Vercel.

A member of the Team first needs to authorize it.

@pcmoritz pcmoritz changed the title [tx] [WIP] Add SkyRL-train backend [tx] Add SkyRL-train backend Jan 25, 2026
@pcmoritz pcmoritz changed the title [tx] Add SkyRL-train backend [tx] Add experimental SkyRL-train backend that supports SFT Jan 25, 2026
tyler-griggs added a commit that referenced this pull request Jan 26, 2026
…ropy")

Enables supervised fine-tuning using the Tinker-compatible API.

Changes:
- ppo_utils.py: Add CROSS_ENTROPY loss type and cross_entropy_loss() function
- worker.py: Add SFT code path that returns per-token logprobs and elementwise_loss
- worker_dispatch.py: Add loss_fn and loss_fn_config params to forward_backward()
- dispatch.py: Update MeshDispatch to pass through kwargs (loss_fn, loss_fn_config)
- replay_buffer.py: Make action_log_probs optional in Experience
- worker_utils.py: Use .get() for optional fields; handle non-scalar metrics

New:
- examples/sft/: Minimal SFT example demonstrating the API

This enables PR #871 (SkyRL-train backend for Tinker) to return proper
per-token values instead of placeholder data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
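The per-token cross-entropy loss the commit adds to ppo_utils.py can be illustrated with a minimal pure-Python sketch. This is an assumption-laden illustration of the math only; the actual `cross_entropy_loss()` in skyrl-train presumably operates on batched torch tensors:

```python
import math


def cross_entropy_loss(logits: list[list[float]], targets: list[int]) -> list[float]:
    """Per-token cross-entropy: -log softmax(logits)[target] at each position.

    logits: one [vocab]-sized list of floats per token position.
    targets: one target token id per position.
    Returns the per-token losses (the elementwise_loss the commit describes).
    """
    losses = []
    for row, tgt in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # log-partition
        losses.append(log_z - row[tgt])  # -log p(target)
    return losses
```

Returning the losses per token (rather than a single scalar) is what lets the Tinker-compatible API hand back real per-token values instead of placeholders.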