Stage 4: Add multi-engine load-balancing to sample() API #934
Conversation
…pter

Changes:
1. Extract _select_engine_idx() as shared engine selection logic
2. Update sample() to use load-balancing and pause/resume support
3. Update chat_completion() to use the shared _select_engine_idx()
4. Create TinkerInferenceAdapter in skyrl-train for the core Tinker logic
5. Update SkyRLInferenceClient in skyrl-tx to use the adapter

The core Tinker inference logic now lives in skyrl-train's TinkerInferenceAdapter, with skyrl-tx as a thin wrapper for Tinker type conversion and database storage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests cover:
- TinkerSampleResult initialization and to_dict()
- Prompt token extraction from ModelInput
- Sampling params conversion
- Output conversion with stop reason mapping
- Multiple samples handling
- Async sample() client interaction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
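As a rough illustration of what "prompt token extraction from ModelInput" might involve, here is a minimal, hypothetical sketch. `Chunk`, `ModelInput`, and `extract_prompt_tokens` are stand-ins invented for this example, not the real skyrl-train or Tinker types:

```python
# Hypothetical stand-ins for the Tinker-style input types; the real
# ModelInput in skyrl-train/tinker may be shaped differently.
class Chunk:
    def __init__(self, tokens: list[int]):
        self.tokens = tokens


class ModelInput:
    def __init__(self, chunks: list[Chunk]):
        self.chunks = chunks


def extract_prompt_tokens(model_input: ModelInput) -> list[int]:
    # Flatten every chunk's token list into one prompt token sequence,
    # using the kind of list comprehension mentioned later in this thread.
    return [tok for chunk in model_input.chunks for tok in chunk.tokens]
```

A unit test for this would simply compare the flattened output against the concatenation of the chunk token lists.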
Code Review
This pull request introduces a TinkerInferenceAdapter to decouple skyrl-train from Tinker-specific types, and adds load-balancing capabilities to the sample() method. The changes are well-structured and include comprehensive unit tests for the new adapter.
My review focuses on ensuring the new load-balancing feature is fully integrated, and on improving code maintainability by reducing duplication. I've identified a critical issue where the session_id for load-balancing is not passed through in skyrl-tx, and a couple of medium-severity issues related to duplicated logic that should be delegated to the new adapter.
- Pass request.sampling_session_id or request.seq_id to adapter.sample()
- Ensures multi-step sampling sessions stay pinned to the same engine
- Fixes load-balancing for KV cache reuse scenarios

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Tests sample() with 2 vLLM engines (TP=1 each, 2 GPUs total)
- Verifies requests complete successfully across multiple engines
- Uses different session_ids to test the engine selection logic
- Simple test: 3 requests × 2 samples = 6 total samples
- Passes in ~52 seconds

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Fix import ordering and code style
- Improve consistency across adapter files

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Apply PR NovaSky-AI#934 review comment - more concise and idiomatic Python. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…EngineClient calls

The two-layer architecture (TinkerInferenceAdapter + SkyRLInferenceClient) was over-engineering. Merged all functionality into SkyRLInferenceClient, which now calls InferenceEngineClient.sample() directly.

Changes:
- Deleted skyrl_train/inference_engines/tinker_adapter.py
- Deleted tests/cpu/test_tinker_adapter.py
- Updated SkyRLInferenceClient to call inference_client.sample() directly
- Kept the list comprehension for token extraction
- GPU tests still pass

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
```python
# Extract session_id for consistent engine routing
# Prefer sampling_session_id (string), fall back to seq_id (int)
session_id = request.sampling_session_id if hasattr(request, "sampling_session_id") else None
if session_id is None and hasattr(request, "seq_id"):
```
Might be good to either leave a comment or remove this
Removed :)
After comparing with the official Tinker backend, we found that:
1. sampling_session_id and seq_id are only used to identify model checkpoints
2. The Tinker backend does NOT use these for sticky routing/session affinity
3. Multi-turn conversations have no KV cache benefit from sticky routing (each sample() call is for an independent problem)

Changes:
- Removed the session_id extraction logic from SkyRLInferenceClient
- Removed the session_id parameter from InferenceEngineClient.sample()
- Simplified _select_engine_idx() to always use random selection
- Removed the unused hash_with_sha256 import
- Updated the multi-engine test to verify random load-balancing works

This matches the official Tinker backend behavior.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous commit incorrectly removed session_id entirely, but it's needed for other callers (e.g., chat completions) where session_id represents a conversation ID and sticky routing provides KV cache benefits.

Changes:
- Restored the session_id parameter to InferenceEngineClient.sample()
- Restored the session_id logic in _select_engine_idx() (None = random, else hash)
- Restored the hash_with_sha256 import
- SkyRLInferenceClient (Tinker path) doesn't pass session_id (gets None/random)
- Updated docstrings to clarify when session_id should and shouldn't be used

For Tinker: no session_id = random load-balancing (each call is independent)
For chat: pass the conversation ID = sticky routing (KV cache benefits)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
## Summary

Adds multi-engine load-balancing to `InferenceEngineClient.sample()`.

**Routing logic:**
- `session_id=None`: Random load-balancing (used by the Tinker API)
- `session_id` provided: Sticky routing via `hash_with_sha256(session_id) % num_engines` (for chat/conversations)

**Why Tinker uses random routing:** Tinker's `sampling_session_id`/`seq_id` identify model checkpoints, not conversations. Each `sample()` call is independent, with no KV cache benefit from sticky routing. This matches the official Tinker backend behavior.

**Changes:**
- Added an `InferenceEngineClient.sample()` method with multi-engine support
- `SkyRLInferenceClient` doesn't pass `session_id` → random routing
- Added GPU test: `test_multi_engine_load_balancing`

**Tests:** ✅ All existing tests pass + the new multi-engine GPU test

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
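The routing rule summarized above can be sketched as follows. This is an illustrative reimplementation, not the repository's code: `hash_with_sha256` is approximated with the standard library's `hashlib`, and `select_engine_idx` mirrors the `_select_engine_idx` described in this thread.

```python
import hashlib
import random
from typing import Optional


def hash_with_sha256(value: str) -> int:
    # Stand-in for the repo's hash_with_sha256 helper: a deterministic
    # integer digest so the same session always maps to the same engine.
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def select_engine_idx(num_engines: int, session_id: Optional[str] = None) -> int:
    """Pick an engine index: random when session_id is None, sticky otherwise."""
    if session_id is None:
        # Tinker path: each sample() call is independent, so spread load randomly.
        return random.randrange(num_engines)
    # Chat path: hash the conversation ID so follow-up requests hit the
    # same engine and can reuse its KV cache.
    return hash_with_sha256(session_id) % num_engines
```

With this sketch, repeated calls with the same `session_id` always return the same index, while calls without one scatter across all engines.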