[skyrl-train] Add SFT support via forward_backward(loss_fn="cross_entropy")#961

Merged
tyler-griggs merged 8 commits into main from tyler/sft-support on Jan 27, 2026
Conversation


tyler-griggs (Member) commented on Jan 26, 2026

Summary

Enables SFT using the Tinker-compatible API:

metrics = dispatch.forward_backward("policy", batch, loss_fn="cross_entropy")

Key Changes

  • Add loss_fn parameter to forward_backward() (overrides config's policy_loss_type)
  • Implement cross_entropy loss in PolicyLossRegistry
  • Return per-sequence loss_fn_outputs for Tinker API compatibility:
    metrics["loss_fn_outputs"] = [
        {"logprobs": [...], "elementwise_loss": [...]},  # sequence 1
        {"logprobs": [...], "elementwise_loss": [...]},  # sequence 2
        ...
    ]
  • Make action_log_probs optional in Experience (SFT batches don't have rollout log probs)
  • Update MeshDispatch to pass through kwargs to worker methods
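The per-sequence output shape above can be illustrated with a minimal, self-contained sketch of a masked cross-entropy computation. This is a hypothetical stand-in (the function name `cross_entropy_outputs` and pure-Python implementation are not from skyrl-train); it only mirrors the `{"logprobs": [...], "elementwise_loss": [...]}` dict structure the PR returns for one sequence:

```python
import math

def cross_entropy_outputs(logits, labels, loss_mask):
    """Hypothetical sketch: per-token logprobs and elementwise loss for
    one sequence, in the same dict shape the PR's loss_fn_outputs uses."""
    logprobs, elementwise_loss = [], []
    for row, label, mask in zip(logits, labels, loss_mask):
        # Numerically stable log-softmax over the vocab dimension.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        lp = row[label] - lse
        logprobs.append(lp)
        # Masked-out tokens (e.g. prompt tokens) contribute zero loss.
        elementwise_loss.append(-lp * mask)
    return {"logprobs": logprobs, "elementwise_loss": elementwise_loss}

out = cross_entropy_outputs(
    logits=[[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],  # 3 tokens, vocab size 2
    labels=[0, 1, 0],
    loss_mask=[1, 1, 0],
)
```

In the real API these values come back per sequence inside `metrics["loss_fn_outputs"]` from `forward_backward`.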



gemini-code-assist (bot) left a comment


Code Review

This pull request introduces supervised fine-tuning (SFT) support, which is a great addition to the training framework. The changes are well-structured, and the new SFT example is clear and helpful. The implementation correctly adds a cross_entropy loss function and adapts the forward_backward path to support it, including returning per-token outputs for Tinker API compatibility. I've identified one high-severity issue regarding an in-place modification of the shared configuration object, which could lead to unexpected side effects, and a related medium-severity style issue. Overall, this is a solid contribution that significantly enhances the framework's capabilities.

Comment thread: skyrl-train/skyrl_train/workers/worker.py
…ropy")

Enables supervised fine-tuning using the Tinker-compatible API.

Changes:
- ppo_utils.py: Add CROSS_ENTROPY loss type and cross_entropy_loss() function
- worker.py: Add SFT code path that returns per-token logprobs and elementwise_loss
- worker_dispatch.py: Add loss_fn and loss_fn_config params to forward_backward()
- dispatch.py: Update MeshDispatch to pass through kwargs (loss_fn, loss_fn_config)
- replay_buffer.py: Make action_log_probs optional in Experience
- worker_utils.py: Use .get() for optional fields; handle non-scalar metrics

New:
- examples/sft/: Minimal SFT example demonstrating the API

This enables PR #871 (SkyRL-train backend for Tinker) to return proper
per-token values instead of placeholder data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- validate_dispatch_args now accepts data as positional or keyword arg
- worker_dispatch only passes loss_fn/loss_fn_config when non-None
  (critic worker doesn't accept these params)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…sion error

loss_fn_outputs is a list of dicts (per-sequence data for Tinker API),
not a tensor/scalar. Extract before all_reduce and add back after.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
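The fix described in this commit can be sketched as follows. This is a hypothetical simplification (the function name `reduce_metrics` and the injected `all_reduce` callable are stand-ins, not skyrl-train APIs): pop the non-reducible list out of the metrics dict before the collective, then re-attach it.

```python
def reduce_metrics(metrics, all_reduce):
    """Hypothetical sketch: loss_fn_outputs is a list of per-sequence
    dicts, not a tensor/scalar, so it cannot pass through all_reduce.
    Extract it first, reduce the remaining scalars, then add it back."""
    loss_fn_outputs = metrics.pop("loss_fn_outputs", None)
    reduced = all_reduce(metrics)  # e.g. mean across data-parallel ranks
    if loss_fn_outputs is not None:
        reduced["loss_fn_outputs"] = loss_fn_outputs
    return reduced

# Single-process stand-in for the collective: an identity "reduce".
m = reduce_metrics(
    {"loss": 0.5, "loss_fn_outputs": [{"logprobs": [-0.1]}]},
    all_reduce=lambda d: dict(d),
)
```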
With DP>1, each rank returns loss_fn_outputs for its data chunk.
Previously only statuses[0] was returned, dropping other ranks' outputs.
Now concatenate all loss_fn_outputs in rank order.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
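The rank-order concatenation described above can be sketched like this (a hypothetical simplification; `gather_loss_fn_outputs` and the `statuses` list-of-dicts shape are stand-ins for the worker status handling in the PR):

```python
def gather_loss_fn_outputs(statuses):
    """Hypothetical sketch: with DP > 1 each rank's status dict holds
    loss_fn_outputs for its own data chunk. Concatenate every rank's
    list in rank order instead of returning only statuses[0]."""
    merged = dict(statuses[0])
    merged["loss_fn_outputs"] = [
        out for status in statuses for out in status.get("loss_fn_outputs", [])
    ]
    return merged

merged = gather_loss_fn_outputs([
    {"loss": 0.4, "loss_fn_outputs": [{"logprobs": [-0.1]}]},  # rank 0
    {"loss": 0.6, "loss_fn_outputs": [{"logprobs": [-0.2]}]},  # rank 1
])
```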
Tinker expects variable-length arrays that align with input weights,
not padded to batch max. Use loss_mask to determine valid length
per sample and slice accordingly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
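The trimming step above can be sketched as follows. This is a hypothetical illustration (`trim_output` is a stand-in name), and it assumes the loss mask is 1 for valid tokens, 0 for padding, with padding at the end of each row:

```python
def trim_output(output, loss_mask):
    """Hypothetical sketch: use the loss mask to find this sample's
    valid (unpadded) length, then slice every per-token array to it so
    lengths align with the input weights rather than the batch max."""
    n = int(sum(loss_mask))
    return {key: values[:n] for key, values in output.items()}

trimmed = trim_output(
    {"logprobs": [-0.1, -0.2, 0.0, 0.0],
     "elementwise_loss": [0.1, 0.2, 0.0, 0.0]},
    loss_mask=[1, 1, 0, 0],
)
```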
Verifies:
- loss_fn="cross_entropy" returns loss_fn_outputs
- Each DP rank returns outputs for its data chunk
- Output structure has logprobs and elementwise_loss keys
- Arrays are trimmed to valid length

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tyler-griggs tyler-griggs merged commit 3683ceb into main Jan 27, 2026
4 of 5 checks passed
erictang000 added a commit that referenced this pull request Jan 31, 2026
…ptim_step (#901)

## Summary
- Add `forward_backward()` and `optim_step()` methods to
`MegatronPolicyWorkerBase` to match FSDP worker interface
- Update trainer to use unified interface for both Megatron and FSDP
strategies (removes strategy branching)
- Mark `ppo_train()` as deprecated (kept for backward compatibility)
- Update `test_megatron_worker.py` to use the new interface
- Add `get_lr` and `set_lr` to the megatron worker to be in line with
behavior from #978
- Add SFT behavior from #961, allowing the Megatron backend to be used
with the TX SkyRL-Train integration

This brings Megatron up to parity with FSDP following the refactoring in
PR #859.
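The unified interface described above can be sketched as a backend-agnostic training step. `_StubWorker` and `train_step` are hypothetical stand-ins (not skyrl-train classes); the point is that the trainer calls the same two methods on either backend, so no strategy branching is needed:

```python
class _StubWorker:
    """Stand-in worker used only to exercise the sketch; a real FSDP or
    Megatron worker would run the actual forward/backward pass."""
    def __init__(self):
        self.stepped = False

    def forward_backward(self, batch, loss_fn=None):
        return {"loss": 0.0, "loss_fn": loss_fn}

    def optim_step(self):
        self.stepped = True

def train_step(worker, batch):
    """One training step via the unified interface: forward/backward,
    then an optimizer step, identical for both backends."""
    metrics = worker.forward_backward(batch, loss_fn="cross_entropy")
    worker.optim_step()
    return metrics

worker = _StubWorker()
metrics = train_step(worker, batch=[])
```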

## Test plan
- [x] Run `test_megatron_worker.py` to verify forward_backward +
optim_step works correctly
- [x] Verify metrics match between Megatron and FSDP implementations

Co-Authored-By: Eric Tang <erictang000@gmail.com>

---------

Co-authored-by: Eric Tang <erictang000@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>