Conversation

@tyler-griggs (Member) commented on Jan 24, 2026

Summary

Adds multi-engine load-balancing to InferenceEngineClient.sample().

Routing logic:

  • session_id=None: Random load-balancing (used by Tinker API)
  • session_id provided: Sticky routing via hash_with_sha256(session_id) % num_engines (for chat/conversations)

Why Tinker uses random routing:
Tinker's sampling_session_id/seq_id identify model checkpoints, not conversations. Each sample() call is independent with no KV cache benefit from sticky routing. This matches the official Tinker backend behavior.
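
The routing rule above can be sketched as follows. This is a minimal standalone sketch, not the actual skyrl-train implementation: the real `_select_engine_idx` and `hash_with_sha256` live in the SkyRL codebase and may differ in signature and details.

```python
import hashlib
import random


def hash_with_sha256(value: str) -> int:
    # Deterministic across processes, unlike Python's salted built-in hash().
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def select_engine_idx(num_engines: int, session_id=None) -> int:
    """Pick an engine index: random when no session, sticky hash otherwise."""
    if session_id is None:
        # Tinker path: each sample() call is independent, so spread load randomly.
        return random.randrange(num_engines)
    # Chat path: the same conversation always routes to the same engine,
    # letting that engine reuse its KV cache for the shared prompt prefix.
    return hash_with_sha256(str(session_id)) % num_engines
```

Because the hash is deterministic, a given `session_id` maps to the same engine on every call as long as `num_engines` is unchanged.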

Changes:

  • Added InferenceEngineClient.sample() method with multi-engine support
  • SkyRLInferenceClient doesn't pass session_id → random routing
  • Added GPU test: test_multi_engine_load_balancing

Tests: ✅ All existing tests pass + new multi-engine GPU test

tyler-griggs and others added 2 commits January 24, 2026 17:59
…pter

Changes:
1. Extract _select_engine_idx() as shared engine selection logic
2. Update sample() to use load-balancing and pause/resume support
3. Update chat_completion() to use shared _select_engine_idx()
4. Create TinkerInferenceAdapter in skyrl-train for core Tinker logic
5. Update SkyRLInferenceClient in skyrl-tx to use the adapter

The core Tinker inference logic now lives in skyrl-train's
TinkerInferenceAdapter, with skyrl-tx as a thin wrapper for
Tinker type conversion and database storage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests cover:
- TinkerSampleResult initialization and to_dict()
- Prompt token extraction from ModelInput
- Sampling params conversion
- Output conversion with stop reason mapping
- Multiple samples handling
- Async sample() client interaction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a TinkerInferenceAdapter to decouple skyrl-train from Tinker-specific types, and adds load-balancing capabilities to the sample() method. The changes are well-structured and include comprehensive unit tests for the new adapter.

My review focuses on ensuring the new load-balancing feature is fully integrated, and on improving code maintainability by reducing duplication. I've identified a critical issue where the session_id for load-balancing is not passed through in skyrl-tx, and a couple of medium-severity issues related to duplicated logic that should be delegated to the new adapter.

tyler-griggs and others added 5 commits January 24, 2026 21:06
- Pass request.sampling_session_id or request.seq_id to adapter.sample()
- Ensures multi-step sampling sessions stay pinned to same engine
- Fixes load-balancing for KV cache reuse scenarios

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Tests sample() with 2 vLLM engines (TP=1 each, 2 GPUs total)
- Verifies requests complete successfully across multiple engines
- Uses different session_ids to test engine selection logic
- Simple test: 3 requests × 2 samples = 6 total samples
- Passes in ~52 seconds

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
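
The shape of that GPU test can be sketched with a stand-in client. All names here are illustrative stand-ins, not the actual SkyRL test fixtures; the real test drives two live vLLM engines.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeSampleResult:
    samples: list = field(default_factory=list)


class FakeEngineClient:
    """Stand-in for InferenceEngineClient; the real signature may differ."""

    async def sample(self, prompt_token_ids, num_samples):
        return FakeSampleResult(samples=[f"sample-{i}" for i in range(num_samples)])


async def run_multi_engine_smoke_test(client, num_requests=3, samples_per_request=2):
    # Fire independent requests concurrently and count the samples that come back.
    tasks = [
        client.sample(prompt_token_ids=[1, 2, 3], num_samples=samples_per_request)
        for _ in range(num_requests)
    ]
    results = await asyncio.gather(*tasks)
    return sum(len(r.samples) for r in results)


total = asyncio.run(run_multi_engine_smoke_test(FakeEngineClient()))
```

With the defaults this mirrors the commit's arithmetic: 3 requests × 2 samples = 6 total samples.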
- Fix import ordering and code style
- Improve consistency across adapter files

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Apply PR NovaSky-AI#934 review comment - more concise and idiomatic Python.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…EngineClient calls

The two-layer architecture (TinkerInferenceAdapter + SkyRLInferenceClient) was
over-engineering. Merged all functionality into SkyRLInferenceClient which now
calls InferenceEngineClient.sample() directly.

Changes:
- Deleted skyrl_train/inference_engines/tinker_adapter.py
- Deleted tests/cpu/test_tinker_adapter.py
- Updated SkyRLInferenceClient to call inference_client.sample() directly
- Kept list comprehension for token extraction
- GPU tests still pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@tyler-griggs changed the title from "Add TinkerInferenceAdapter and load-balancing to sample() (Stage 4)" to "Add load-balancing to sample()" on Jan 24, 2026
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
```python
# Extract session_id for consistent engine routing
# Prefer sampling_session_id (string), fall back to seq_id (int)
session_id = request.sampling_session_id if hasattr(request, "sampling_session_id") else None
if session_id is None and hasattr(request, "seq_id"):
    session_id = request.seq_id
```
A collaborator commented on the diff:

Might be good to either leave a comment or remove this

@tyler-griggs (Member, Author) replied:

Removed :)

After comparing with the official Tinker backend, we found that:
1. sampling_session_id and seq_id are only used to identify model checkpoints
2. Tinker backend does NOT use these for sticky routing/session affinity
3. Multi-turn conversations have no KV cache benefit from sticky routing
   (each sample() call is for an independent problem)

Changes:
- Removed session_id extraction logic from SkyRLInferenceClient
- Removed session_id parameter from InferenceEngineClient.sample()
- Simplified _select_engine_idx() to always use random selection
- Removed unused hash_with_sha256 import
- Updated multi-engine test to verify random load-balancing works

This matches the official Tinker backend behavior.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@tyler-griggs changed the title from "Add load-balancing to sample()" to "Stage 4: Add multi-engine load-balancing to sample() API" on Jan 24, 2026
tyler-griggs and others added 2 commits January 24, 2026 22:36
The previous commit incorrectly removed session_id entirely, but it's needed
for other callers (e.g., chat completions) where session_id represents
conversation ID and sticky routing provides KV cache benefits.

Changes:
- Restored session_id parameter to InferenceEngineClient.sample()
- Restored session_id logic in _select_engine_idx() (None = random, else hash)
- Restored hash_with_sha256 import
- SkyRLInferenceClient (Tinker path) doesn't pass session_id (gets None/random)
- Updated docstrings to clarify when session_id should/shouldn't be used

For Tinker: No session_id = random load-balancing (each call is independent)
For chat: Pass conversation ID = sticky routing (KV cache benefits)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
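
The restored contract (None = random, else hash) can be checked as a property: the same session ID always lands on one engine, while session-less calls spread across all of them. This is a hypothetical router mirroring the described behavior, not the actual SkyRL code.

```python
import hashlib
import random

NUM_ENGINES = 4


def route(session_id=None, num_engines=NUM_ENGINES):
    """Hypothetical router: None -> random engine, else sticky SHA-256 hash."""
    if session_id is None:
        return random.randrange(num_engines)
    return int(hashlib.sha256(session_id.encode("utf-8")).hexdigest(), 16) % num_engines


random.seed(0)  # reproducible random draws for the demo
# Sticky: repeated calls with one conversation ID always pick the same engine.
picks = {route("conv-42") for _ in range(100)}
# Random: many independent (Tinker-style) calls cover every engine.
spread = {route() for _ in range(1000)}
```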
@tyler-griggs tyler-griggs merged commit 118d070 into NovaSky-AI:main Jan 24, 2026
4 of 5 checks passed
tanmaysachan pushed a commit to tanmaysachan/SkyRL that referenced this pull request Jan 25, 2026