Skip to content

feat(spider-scheduler): Add round-robin scheduler core implementation.#334

Open
LinZhihao-723 wants to merge 18 commits into
y-scope:mainfrom
LinZhihao-723:scheduler-core-impl
Open

feat(spider-scheduler): Add round-robin scheduler core implementation.#334
LinZhihao-723 wants to merge 18 commits into
y-scope:mainfrom
LinZhihao-723:scheduler-core-impl

Conversation

@LinZhihao-723

@LinZhihao-723 LinZhihao-723 commented Jun 5, 2026

Copy link
Copy Markdown
Member

Description

This PR depends on #332.

This PR implements a round-robin scheduling algorithm.

The scheduler operates in discrete, interval-paced ticks. Each tick consumes the results of an asynchronous inbound-queue poll (all three lanes are polled concurrently per round, throttled by the remaining buffer capacities), loads new tasks into the internal buffers, and then makes scheduling decisions until the dispatch queue reaches capacity. Key behaviors:

  • Round-robin fairness: a fixed-capacity active job pool is scheduled via a rotation of slots; each full cycle dispatches at most one task per active job, plus at most one commit task and one cleanup task.
  • Pending jobs: jobs beyond the active pool's capacity are buffered in FIFO order and promoted whenever an active job runs out of schedulable tasks.
  • Deduplication: storage may deliver duplicate task entries; every buffered task is unique, so each task reaches the dispatch queue at most once.
  • Finalizing jobs: a commit-ready or cleanup-ready entry marks its job as finalizing — the job leaves the active/pending sets, its buffered tasks are discarded, and any further regular tasks for it are ignored. The finalizing gate persists after the finalizing task dispatches and is retired by a (currently hard-coded) 6-hour expiry.
  • Session bumps: when storage reports a newer session, the scheduler clears all placement state, bumps the dispatch queue's session (draining stale assignments), and drops entries from lanes still reporting an older session.
  • Observability: tracing instrumentation across the scheduling flow (startup/shutdown, session bumps, job state transitions, finalizing-gate changes, and decision-loop summaries).

The module is laid out as round_robin/{mod.rs, implementation.rs, tests.rs}, with internals exposed to the sibling test module via pub(super) and only RoundRobinConfig/RoundRobinCore exported.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ensure all workflows pass.
  • Add unit tests to cover round-robin behaviors:
    • Black-box tests that assert on the task assignments generated by the scheduling algorithm.
    • White-box tests that assert internal data structures match our expectations.
  • Test on AWS with the storage service to make sure the scheduling overhead is negligible in real workloads, and not observable in the task execution critical path.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new scheduler component with round-robin task scheduling that distributes tasks fairly across jobs while respecting capacity constraints.
    • Implemented task dispatch queue infrastructure to manage assignment delivery and maintain session consistency.
  • Refactor

    • Enhanced task identifier representation for improved type clarity and serialization support.

@LinZhihao-723 LinZhihao-723 requested review from a team and sitaowang1998 as code owners June 5, 2026 02:00
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

This PR introduces the Spider scheduler component into the workspace while refactoring the TaskId type representation. The TaskId moves from a marker-based generic alias to a discriminated enum (with Index, Commit, and Cleanup variants) in spider-core, displacing the previous definition from spider-storage. The new spider-scheduler crate implements a round-robin task dispatcher that polls storage placement state across three inbound lanes (ready, commit-ready, cleanup-ready), maintains per-job FIFO task queues, and dispatches assignments through a session-aware dispatch queue for execution managers to consume.

Changes

TaskId Type Refactor

Layer / File(s) Summary
TaskId Enum Definition
components/spider-core/src/types/id.rs
TaskId changes from a marker-based Id type alias to a three-variant enum (Index(TaskIndex), Commit, Cleanup) with direct Serialize/Deserialize derives; SignedTaskId alias and its test module are removed.
TaskId Import Updates
components/spider-storage/src/cache.rs, components/spider-storage/src/cache/job.rs, components/spider-storage/src/task_instance_pool.rs, components/spider-storage/tests/scheduling_infra.rs
All modules importing TaskId update their sources from the removed crate::cache export to spider_core::types::id; the cache module's TaskId enum definition is removed.
Test Fixture Updates
components/spider-tdl/src/task.rs, components/spider-tdl/src/task_context.rs, components/spider-tdl/tests/test_task_macro.rs, tests/huntsman/task-executor/src/lib.rs, tests/huntsman/task-executor/tests/test_process_pool.rs, tests/huntsman/tdl-integration/tests/complex.rs
Test helpers and fixtures constructing TaskContext task identifiers switch from the removed TaskId::new() to the deterministic TaskId::Index(0) variant.

Spider Scheduler Implementation

Layer / File(s) Summary
Workspace and Crate Setup
Cargo.toml, components/spider-scheduler/Cargo.toml
spider-scheduler is added to the workspace member list; the crate manifest declares dependencies on tokio, async-trait, serde, local spider-core, thiserror, and tracing with pinned versions and selected feature flags.
Core Abstractions and Error Types
components/spider-scheduler/src/error.rs, components/spider-scheduler/src/storage_client.rs, components/spider-scheduler/src/types.rs, components/spider-scheduler/src/core.rs
StorageClientError and SchedulerError enums define error conditions; InboundEntry and TaskAssignment structs carry task/job/resource-group identifiers; SchedulerStorageClient trait abstracts storage polling (ready, commit-ready, cleanup-ready lanes) and job-state queries; SchedulerCore trait defines the async scheduling loop contract with associated sink and storage client types.
Dispatch Queue Infrastructure
components/spider-scheduler/src/dispatch_queue.rs
DispatchQueueSink/DispatchQueueSource traits define producer/consumer contracts; DispatchQueueWriter enqueues assignments and atomically bumps session IDs (validating monotonicity, draining queued items, and updating shared session state); DispatchQueueReader dequeues assignments with session-stamping under a read lock; create_dispatch_queue factory wires bounded async channels and shared session state; extensive test coverage validates delivery correctness, session pairing across bumps, and load distribution across multiple consumers.
Round-Robin Scheduler Core
components/spider-scheduler/src/core_impl/mod.rs, components/spider-scheduler/src/core_impl/round_robin/mod.rs, components/spider-scheduler/src/core_impl/round_robin/implementation.rs
RoundRobinConfig validates non-zero queue capacities; RoundRobinCore implements SchedulerCore::run by constructing and starting the internal scheduler; RoundRobin maintains buffered task IDs, active/pending job queues (via JobTaskQueue), commit/cleanup finalization lanes, and inbound poll state; the tick loop consumes inbound results, ingests storage entries (session-aware, deduplicating, buffering by job), makes round-robin dispatch decisions (rotating job/commit/cleanup lanes), and retires expired finalizers; AsyncInboundQueueReader spawns per-lane background tasks subject to configurable limits and non-blockingly collects results.
Round-Robin Scheduler Tests
components/spider-scheduler/src/core_impl/round_robin/tests.rs
Mock storage client serves scripted batches per lane; configuration validation rejects zero-capacity settings; strict round-robin ordering test validates dispatch sequence when all jobs fit active pool; promotion and scheduling test validates round-robin dispatch once pending jobs are promoted; commit/cleanup lane test validates single-dispatch-per-cycle and within-batch deduplication; white-box finalizing scenario test validates removal of finalized jobs, discarding of buffered regular tasks, and FIFO ordering for surviving jobs; session-bump test validates clearing of old-session buffered tasks and new-session assignment stamping.
Crate Entrypoint and Public API
components/spider-scheduler/src/lib.rs
Crate-level documentation describes the scheduler pipeline (storage polling → core decision loop → dispatch queue → execution manager fan-out); public modules and re-exports define the library's API surface for downstream consumers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • y-scope/spider#325: Updates to the TaskId type representation propagate into the test-executor harness used by Huntsman integration tests, affecting request builders and task context construction fixtures in the same test area.

Suggested reviewers

  • sitaowang1998
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: a round-robin scheduler core implementation is the primary feature added in this PR, with the new spider-scheduler component containing the core logic.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
components/spider-scheduler/src/core_impl/round_robin/implementation.rs (1)

309-309: ⚡ Quick win

Reduce per-tick log level to debug/trace.

These are hot-path logs and currently emit every tick at info, which can be noisy and expensive under small tick intervals. Consider downgrading to debug (or sampling).

Also applies to: 824-828

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/spider-scheduler/src/core_impl/round_robin/implementation.rs` at
line 309, Change the hot-path tracing::info!("Starting scheduling tick."); to a
lower-volume level (tracing::debug! or tracing::trace!) and do the same for the
other similar per-tick info logs (the other tracing::info! calls that log
scheduling tick/start messages). If you want finer control, wrap the call with a
sampling or feature-gated conditional (e.g., tracing::debug!(target:
"scheduler", ...) or use tracing::span/Level::TRACE) so frequent ticks don’t
spam logs; update the same log invocations that emit per-tick messages to use
the chosen lower level.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/spider-scheduler/src/core_impl/round_robin/implementation.rs`:
- Around line 241-242: The finalizing_job_queue currently stores timestamps as
SystemTime which makes elapsed checks fallible; change the queue element type
from (JobId, SystemTime) to (JobId, Instant) and replace uses of
SystemTime::now() / SystemTime::elapsed() with Instant::now() and
Instant::elapsed() so expiry checks are monotonic and non-fallible; update all
code paths that push into finalizing_job_queue and that check/compare expiry
(the retirement/expiry logic that inspects finalizing_job_queue and computes
elapsed) to use Instant-based semantics and adjust any test/timeout constants if
needed.
- Around line 552-555: The buffered_tasks dedup uses inbound_entry.task_id
directly, but downstream logic normalizes finalizing-lane IDs to
TaskId::Commit/TaskId::Cleanup; update the dedupe key by canonicalizing
inbound_entry.task_id for finalizing lanes before inserting/checking
buffered_tasks (i.e., map any equivalent/alias finalizing IDs to the canonical
TaskId::Commit or TaskId::Cleanup via a small match/normalize function) so
buffered entries match what dispatch/removal expects and cannot survive
indefinitely; apply the same canonicalization in both locations that currently
insert/check buffered_tasks (the spots referencing
buffered_tasks.insert((inbound_entry.job_id, inbound_entry.task_id))).

In `@components/spider-scheduler/src/dispatch_queue.rs`:
- Around line 82-84: The comment notes enqueue and bump_session_id must not run
concurrently but the code doesn't enforce it; modify DispatchQueue so both
enqueue (the send path using cloned writers) and bump_session_id (the
draining/rewind path) acquire the same lock before operating: move session_id
and the sender/receiver drain logic under a shared Mutex (or use a single RwLock
with exclusive write for bump_session_id and read for enqueue) so enqueue cannot
send while bump_session_id is draining; update functions enqueue and
bump_session_id to lock that mutex before accessing session_id or performing
send/drain to prevent dropped assignments.

---

Nitpick comments:
In `@components/spider-scheduler/src/core_impl/round_robin/implementation.rs`:
- Line 309: Change the hot-path tracing::info!("Starting scheduling tick."); to
a lower-volume level (tracing::debug! or tracing::trace!) and do the same for
the other similar per-tick info logs (the other tracing::info! calls that log
scheduling tick/start messages). If you want finer control, wrap the call with a
sampling or feature-gated conditional (e.g., tracing::debug!(target:
"scheduler", ...) or use tracing::span/Level::TRACE) so frequent ticks don’t
spam logs; update the same log invocations that emit per-tick messages to use
the chosen lower level.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7aa4dcab-6b24-416b-8681-416cee8217ab

📥 Commits

Reviewing files that changed from the base of the PR and between 85a5130 and 4c300f3.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • Cargo.toml
  • components/spider-core/src/types/id.rs
  • components/spider-scheduler/Cargo.toml
  • components/spider-scheduler/src/core.rs
  • components/spider-scheduler/src/core_impl/mod.rs
  • components/spider-scheduler/src/core_impl/round_robin/implementation.rs
  • components/spider-scheduler/src/core_impl/round_robin/mod.rs
  • components/spider-scheduler/src/core_impl/round_robin/tests.rs
  • components/spider-scheduler/src/dispatch_queue.rs
  • components/spider-scheduler/src/error.rs
  • components/spider-scheduler/src/lib.rs
  • components/spider-scheduler/src/storage_client.rs
  • components/spider-scheduler/src/types.rs
  • components/spider-storage/src/cache.rs
  • components/spider-storage/src/cache/job.rs
  • components/spider-storage/src/task_instance_pool.rs
  • components/spider-storage/tests/scheduling_infra.rs
  • components/spider-tdl/src/task.rs
  • components/spider-tdl/src/task_context.rs
  • components/spider-tdl/tests/test_task_macro.rs
  • tests/huntsman/task-executor/src/lib.rs
  • tests/huntsman/task-executor/tests/test_process_pool.rs
  • tests/huntsman/tdl-integration/tests/complex.rs
💤 Files with no reviewable changes (1)
  • components/spider-storage/src/cache.rs

Comment on lines +241 to +242
pub(super) finalizing_job_queue: VecDeque<(JobId, SystemTime)>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use monotonic time for finalizing-job expiry.

SystemTime::elapsed() is fallible on clock rollback. That propagates as an error and can terminate the scheduler loop. Use Instant for expiry tracking so retirement is monotonic and non-fallible.

Proposed fix
 use std::{
     collections::{HashMap, HashSet, VecDeque},
-    time::{Duration, SystemTime},
+    time::{Duration, Instant},
 };
 ...
-    pub(super) finalizing_job_queue: VecDeque<(JobId, SystemTime)>,
+    pub(super) finalizing_job_queue: VecDeque<(JobId, Instant)>,
 ...
     fn mark_job_finalizing(&mut self, job_id: JobId) {
         if self.finalizing_jobs.insert(job_id) {
             self.finalizing_job_queue
-                .push_back((job_id, SystemTime::now()));
+                .push_back((job_id, Instant::now()));
         }
     }
 ...
-    fn retire_expired_finalizing_jobs(&mut self) -> Result<(), SchedulerError> {
+    fn retire_expired_finalizing_jobs(&mut self) -> Result<(), SchedulerError> {
         const EXPIRATION_TIME: Duration = Duration::from_hours(6);
         while let Some((job_id, insertion_time)) = self.finalizing_job_queue.front() {
-            if insertion_time.elapsed()? > EXPIRATION_TIME {
+            if insertion_time.elapsed() > EXPIRATION_TIME {
                 ...
             } else {
                 break;
             }
         }
         Ok(())
     }

Also applies to: 456-457, 471-475

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/spider-scheduler/src/core_impl/round_robin/implementation.rs`
around lines 241 - 242, The finalizing_job_queue currently stores timestamps as
SystemTime which makes elapsed checks fallible; change the queue element type
from (JobId, SystemTime) to (JobId, Instant) and replace uses of
SystemTime::now() / SystemTime::elapsed() with Instant::now() and
Instant::elapsed() so expiry checks are monotonic and non-fallible; update all
code paths that push into finalizing_job_queue and that check/compare expiry
(the retirement/expiry logic that inspects finalizing_job_queue and computes
elapsed) to use Instant-based semantics and adjust any test/timeout constants if
needed.

Comment thread components/spider-scheduler/src/dispatch_queue.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant