Stage 4: Add multi-engine load-balancing to sample() API #934
Conversation
…pter

Changes:
1. Extract _select_engine_idx() as shared engine selection logic
2. Update sample() to use load-balancing and pause/resume support
3. Update chat_completion() to use the shared _select_engine_idx()
4. Create TinkerInferenceAdapter in skyrl-train for the core Tinker logic
5. Update SkyRLInferenceClient in skyrl-tx to use the adapter

The core Tinker inference logic now lives in skyrl-train's TinkerInferenceAdapter, with skyrl-tx as a thin wrapper for Tinker type conversion and database storage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests cover:
- TinkerSampleResult initialization and to_dict()
- Prompt token extraction from ModelInput
- Sampling params conversion
- Output conversion with stop reason mapping
- Multiple samples handling
- Async sample() client interaction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
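As a rough illustration of what "prompt token extraction from ModelInput" might involve, here is a minimal, hypothetical sketch. `Chunk`, `ModelInput`, and `extract_prompt_tokens` are stand-ins invented for this example, not the real skyrl-train or Tinker types:

```python
# Hypothetical stand-ins for the Tinker-style input types; the real
# ModelInput in skyrl-train/tinker may be shaped differently.
class Chunk:
    def __init__(self, tokens: list[int]):
        self.tokens = tokens


class ModelInput:
    def __init__(self, chunks: list[Chunk]):
        self.chunks = chunks


def extract_prompt_tokens(model_input: ModelInput) -> list[int]:
    # Flatten every chunk's token list into one prompt token sequence,
    # using the kind of list comprehension mentioned later in this thread.
    return [tok for chunk in model_input.chunks for tok in chunk.tokens]
```

A unit test for this would simply compare the flattened output against the concatenation of the chunk token lists.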
Code Review
This pull request introduces a TinkerInferenceAdapter to decouple skyrl-train from Tinker-specific types, and adds load-balancing capabilities to the sample() method. The changes are well-structured and include comprehensive unit tests for the new adapter.
My review focuses on ensuring the new load-balancing feature is fully integrated, and on improving code maintainability by reducing duplication. I've identified a critical issue where the session_id for load-balancing is not passed through in skyrl-tx, and a couple of medium-severity issues related to duplicated logic that should be delegated to the new adapter.
- Pass request.sampling_session_id or request.seq_id to adapter.sample()
- Ensures multi-step sampling sessions stay pinned to the same engine
- Fixes load-balancing for KV cache reuse scenarios

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Tests sample() with 2 vLLM engines (TP=1 each, 2 GPUs total)
- Verifies requests complete successfully across multiple engines
- Uses different session_ids to test the engine selection logic
- Simple test: 3 requests × 2 samples = 6 total samples
- Passes in ~52 seconds

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Fix import ordering and code style
- Improve consistency across adapter files

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Apply PR NovaSky-AI#934 review comment - more concise and idiomatic Python. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…EngineClient calls

The two-layer architecture (TinkerInferenceAdapter + SkyRLInferenceClient) was over-engineering. Merged all functionality into SkyRLInferenceClient, which now calls InferenceEngineClient.sample() directly.

Changes:
- Deleted skyrl_train/inference_engines/tinker_adapter.py
- Deleted tests/cpu/test_tinker_adapter.py
- Updated SkyRLInferenceClient to call inference_client.sample() directly
- Kept the list comprehension for token extraction
- GPU tests still pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
```python
# Extract session_id for consistent engine routing
# Prefer sampling_session_id (string), fall back to seq_id (int)
session_id = request.sampling_session_id if hasattr(request, "sampling_session_id") else None
if session_id is None and hasattr(request, "seq_id"):
```
Might be good to either leave a comment or remove this
Removed :)
After comparing with the official Tinker backend, we found that:
1. sampling_session_id and seq_id are only used to identify model checkpoints
2. The Tinker backend does NOT use these for sticky routing/session affinity
3. Multi-turn conversations have no KV cache benefit from sticky routing (each sample() call is for an independent problem)

Changes:
- Removed the session_id extraction logic from SkyRLInferenceClient
- Removed the session_id parameter from InferenceEngineClient.sample()
- Simplified _select_engine_idx() to always use random selection
- Removed the unused hash_with_sha256 import
- Updated the multi-engine test to verify random load-balancing works

This matches the official Tinker backend behavior.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous commit incorrectly removed session_id entirely, but it's needed for other callers (e.g., chat completions) where session_id represents a conversation ID and sticky routing provides KV cache benefits.

Changes:
- Restored the session_id parameter to InferenceEngineClient.sample()
- Restored the session_id logic in _select_engine_idx() (None = random, else hash)
- Restored the hash_with_sha256 import
- SkyRLInferenceClient (Tinker path) doesn't pass session_id (gets None/random)
- Updated docstrings to clarify when session_id should and shouldn't be used

For Tinker: no session_id = random load-balancing (each call is independent)
For chat: pass the conversation ID = sticky routing (KV cache benefits)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
## Summary

Adds multi-engine load-balancing to `InferenceEngineClient.sample()`.

**Routing logic:**
- `session_id=None`: Random load-balancing (used by the Tinker API)
- `session_id` provided: Sticky routing via `hash_with_sha256(session_id) % num_engines` (for chat/conversations)

**Why Tinker uses random routing:** Tinker's `sampling_session_id`/`seq_id` identify model checkpoints, not conversations. Each `sample()` call is independent, with no KV cache benefit from sticky routing. This matches the official Tinker backend behavior.

**Changes:**
- Added an `InferenceEngineClient.sample()` method with multi-engine support
- `SkyRLInferenceClient` doesn't pass `session_id` → random routing
- Added GPU test: `test_multi_engine_load_balancing`

**Tests:** ✅ All existing tests pass + the new multi-engine GPU test

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
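The routing rule summarized above can be sketched as follows. This is an illustrative reimplementation, not the repository's code: `hash_with_sha256` is approximated with the standard library's `hashlib`, and `select_engine_idx` mirrors the `_select_engine_idx` described in this thread.

```python
import hashlib
import random
from typing import Optional


def hash_with_sha256(value: str) -> int:
    # Stand-in for the repo's hash_with_sha256 helper: a deterministic
    # integer digest so the same session always maps to the same engine.
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def select_engine_idx(num_engines: int, session_id: Optional[str] = None) -> int:
    """Pick an engine index: random when session_id is None, sticky otherwise."""
    if session_id is None:
        # Tinker path: each sample() call is independent, so spread load randomly.
        return random.randrange(num_engines)
    # Chat path: hash the conversation ID so follow-up requests hit the
    # same engine and can reuse its KV cache.
    return hash_with_sha256(session_id) % num_engines
```

With this sketch, repeated calls with the same `session_id` always return the same index, while calls without one scatter across all engines.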