feat(spider-scheduler): Add round-robin scheduler core implementation. by LinZhihao-723 · Pull Request #334 · y-scope/spider

LinZhihao-723 · 2026-06-05T02:00:52Z

Description

This PR depends on #332.

This PR implements a round-robin scheduling algorithm.

The scheduler operates in discrete, interval-paced ticks. Each tick consumes the results of an asynchronous inbound-queue poll (all three lanes are polled concurrently per round, throttled by the remaining buffer capacities), loads new tasks into the internal buffers, and then makes scheduling decisions until the dispatch queue reaches capacity. Key behaviors:

Round-robin fairness: a fixed-capacity active job pool is scheduled via a rotation of slots; each full cycle dispatches at most one task per active job, plus at most one commit task and one cleanup task.
Pending jobs: jobs beyond the active pool's capacity are buffered in FIFO order and promoted whenever an active job runs out of schedulable tasks.
Deduplication: storage may deliver duplicate task entries; every buffered task is unique, so each task reaches the dispatch queue at most once.
Finalizing jobs: a commit-ready or cleanup-ready entry marks its job as finalizing — the job leaves the active/pending sets, its buffered tasks are discarded, and any further regular tasks for it are ignored. The finalizing gate persists after the finalizing task dispatches and is retired by a (currently hard-coded) 6-hour expiry.
Session bumps: when storage reports a newer session, the scheduler clears all placement state, bumps the dispatch queue's session (draining stale assignments), and drops entries from lanes still reporting an older session.
Observability: tracing instrumentation across the scheduling flow (startup/shutdown, session bumps, job state transitions, finalizing-gate changes, and decision-loop summaries).

The module is laid out as round_robin/{mod.rs, implementation.rs, tests.rs}, with internals exposed to the sibling test module via pub(super) and only RoundRobinConfig/RoundRobinCore exported.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ensure all workflows pass.
Add unit tests to cover round-robin behaviors:
- Black-box tests that assert on the task assignments generated by the scheduling algorithm.
- White-box tests that assert internal data structures match our expectations.
Test on AWS with the storage service to make sure the scheduling overhead is negligible in real workloads, and not observable in the task execution critical path.

Summary by CodeRabbit

Release Notes

New Features
- Introduced a new scheduler component with round-robin task scheduling that distributes tasks fairly across jobs while respecting capacity constraints.
- Implemented task dispatch queue infrastructure to manage assignment delivery and maintain session consistency.
Refactor
- Enhanced task identifier representation for improved type clarity and serialization support.

…/spider into scheduler-skeleton

coderabbitai · 2026-06-05T02:01:02Z

Walkthrough

This PR introduces the Spider scheduler component into the workspace while refactoring the TaskId type representation. The TaskId moves from a marker-based generic alias to a discriminated enum (with Index, Commit, and Cleanup variants) in spider-core, displacing the previous definition from spider-storage. The new spider-scheduler crate implements a round-robin task dispatcher that polls storage placement state across three inbound lanes (ready, commit-ready, cleanup-ready), maintains per-job FIFO task queues, and dispatches assignments through a session-aware dispatch queue for execution managers to consume.

Changes

TaskId Type Refactor

Layer / File(s)	Summary
TaskId Enum Definition `components/spider-core/src/types/id.rs`	`TaskId` changes from a marker-based `Id` type alias to a three-variant enum (`Index(TaskIndex)`, `Commit`, `Cleanup`) with direct `Serialize`/`Deserialize` derives; `SignedTaskId` alias and its test module are removed.
TaskId Import Updates `components/spider-storage/src/cache.rs`, `components/spider-storage/src/cache/job.rs`, `components/spider-storage/src/task_instance_pool.rs`, `components/spider-storage/tests/scheduling_infra.rs`	All modules importing `TaskId` update their sources from the removed `crate::cache` export to `spider_core::types::id`; the cache module's `TaskId` enum definition is removed.
Test Fixture Updates `components/spider-tdl/src/task.rs`, `components/spider-tdl/src/task_context.rs`, `components/spider-tdl/tests/test_task_macro.rs`, `tests/huntsman/task-executor/src/lib.rs`, `tests/huntsman/task-executor/tests/test_process_pool.rs`, `tests/huntsman/tdl-integration/tests/complex.rs`	Test helpers and fixtures constructing `TaskContext` task identifiers switch from the removed `TaskId::new()` to the deterministic `TaskId::Index(0)` variant.

Spider Scheduler Implementation

Layer / File(s)	Summary
Workspace and Crate Setup `Cargo.toml`, `components/spider-scheduler/Cargo.toml`	`spider-scheduler` is added to the workspace member list; the crate manifest declares dependencies on tokio, async-trait, serde, local spider-core, thiserror, and tracing with pinned versions and selected feature flags.
Core Abstractions and Error Types `components/spider-scheduler/src/error.rs`, `components/spider-scheduler/src/storage_client.rs`, `components/spider-scheduler/src/types.rs`, `components/spider-scheduler/src/core.rs`	`StorageClientError` and `SchedulerError` enums define error conditions; `InboundEntry` and `TaskAssignment` structs carry task/job/resource-group identifiers; `SchedulerStorageClient` trait abstracts storage polling (ready, commit-ready, cleanup-ready lanes) and job-state queries; `SchedulerCore` trait defines the async scheduling loop contract with associated sink and storage client types.
Dispatch Queue Infrastructure `components/spider-scheduler/src/dispatch_queue.rs`	`DispatchQueueSink`/`DispatchQueueSource` traits define producer/consumer contracts; `DispatchQueueWriter` enqueues assignments and atomically bumps session IDs (validating monotonicity, draining queued items, and updating shared session state); `DispatchQueueReader` dequeues assignments with session-stamping under a read lock; `create_dispatch_queue` factory wires bounded async channels and shared session state; extensive test coverage validates delivery correctness, session pairing across bumps, and load distribution across multiple consumers.
Round-Robin Scheduler Core `components/spider-scheduler/src/core_impl/mod.rs`, `components/spider-scheduler/src/core_impl/round_robin/mod.rs`, `components/spider-scheduler/src/core_impl/round_robin/implementation.rs`	`RoundRobinConfig` validates non-zero queue capacities; `RoundRobinCore` implements `SchedulerCore::run` by constructing and starting the internal scheduler; `RoundRobin` maintains buffered task IDs, active/pending job queues (via `JobTaskQueue`), commit/cleanup finalization lanes, and inbound poll state; the `tick` loop consumes inbound results, ingests storage entries (session-aware, deduplicating, buffering by job), makes round-robin dispatch decisions (rotating job/commit/cleanup lanes), and retires expired finalizers; `AsyncInboundQueueReader` spawns per-lane background tasks subject to configurable limits and non-blockingly collects results.
Round-Robin Scheduler Tests `components/spider-scheduler/src/core_impl/round_robin/tests.rs`	Mock storage client serves scripted batches per lane; configuration validation rejects zero-capacity settings; strict round-robin ordering test validates dispatch sequence when all jobs fit active pool; promotion and scheduling test validates round-robin dispatch once pending jobs are promoted; commit/cleanup lane test validates single-dispatch-per-cycle and within-batch deduplication; white-box finalizing scenario test validates removal of finalized jobs, discarding of buffered regular tasks, and FIFO ordering for surviving jobs; session-bump test validates clearing of old-session buffered tasks and new-session assignment stamping.
Crate Entrypoint and Public API `components/spider-scheduler/src/lib.rs`	Crate-level documentation describes the scheduler pipeline (storage polling → core decision loop → dispatch queue → execution manager fan-out); public modules and re-exports define the library's API surface for downstream consumers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

y-scope/spider#325: Updates to the TaskId type representation propagate into the test-executor harness used by Huntsman integration tests, affecting request builders and task context construction fixtures in the same test area.

Suggested reviewers

sitaowang1998

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: a round-robin scheduler core implementation is the primary feature added in this PR, with the new spider-scheduler component containing the core logic.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

components/spider-scheduler/src/core_impl/round_robin/implementation.rs (1)
309-309: ⚡ Quick win

Reduce per-tick log level to debug/trace.

These are hot-path logs and currently emit every tick at info, which can be noisy and expensive under small tick intervals. Consider downgrading to debug (or sampling).

Also applies to: 824-828
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/spider-scheduler/src/core_impl/round_robin/implementation.rs` at
line 309, Change the hot-path tracing::info!("Starting scheduling tick."); to a
lower-volume level (tracing::debug! or tracing::trace!) and do the same for the
other similar per-tick info logs (the other tracing::info! calls that log
scheduling tick/start messages). If you want finer control, wrap the call with a
sampling or feature-gated conditional (e.g., tracing::debug!(target:
"scheduler", ...) or use tracing::span/Level::TRACE) so frequent ticks don’t
spam logs; update the same log invocations that emit per-tick messages to use
the chosen lower level.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/spider-scheduler/src/core_impl/round_robin/implementation.rs`:
- Around line 241-242: The finalizing_job_queue currently stores timestamps as
SystemTime which makes elapsed checks fallible; change the queue element type
from (JobId, SystemTime) to (JobId, Instant) and replace uses of
SystemTime::now() / SystemTime::elapsed() with Instant::now() and
Instant::elapsed() so expiry checks are monotonic and non-fallible; update all
code paths that push into finalizing_job_queue and that check/compare expiry
(the retirement/expiry logic that inspects finalizing_job_queue and computes
elapsed) to use Instant-based semantics and adjust any test/timeout constants if
needed.
- Around line 552-555: The buffered_tasks dedup uses inbound_entry.task_id
directly, but downstream logic normalizes finalizing-lane IDs to
TaskId::Commit/TaskId::Cleanup; update the dedupe key by canonicalizing
inbound_entry.task_id for finalizing lanes before inserting/checking
buffered_tasks (i.e., map any equivalent/alias finalizing IDs to the canonical
TaskId::Commit or TaskId::Cleanup via a small match/normalize function) so
buffered entries match what dispatch/removal expects and cannot survive
indefinitely; apply the same canonicalization in both locations that currently
insert/check buffered_tasks (the spots referencing
buffered_tasks.insert((inbound_entry.job_id, inbound_entry.task_id))).

In `@components/spider-scheduler/src/dispatch_queue.rs`:
- Around line 82-84: The comment notes enqueue and bump_session_id must not run
concurrently but the code doesn't enforce it; modify DispatchQueue so both
enqueue (the send path using cloned writers) and bump_session_id (the
draining/rewind path) acquire the same lock before operating: move session_id
and the sender/receiver drain logic under a shared Mutex (or use a single RwLock
with exclusive write for bump_session_id and read for enqueue) so enqueue cannot
send while bump_session_id is draining; update functions enqueue and
bump_session_id to lock that mutex before accessing session_id or performing
send/drain to prevent dropped assignments.

---

Nitpick comments:
In `@components/spider-scheduler/src/core_impl/round_robin/implementation.rs`:
- Line 309: Change the hot-path tracing::info!("Starting scheduling tick."); to
a lower-volume level (tracing::debug! or tracing::trace!) and do the same for
the other similar per-tick info logs (the other tracing::info! calls that log
scheduling tick/start messages). If you want finer control, wrap the call with a
sampling or feature-gated conditional (e.g., tracing::debug!(target:
"scheduler", ...) or use tracing::span/Level::TRACE) so frequent ticks don’t
spam logs; update the same log invocations that emit per-tick messages to use
the chosen lower level.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7aa4dcab-6b24-416b-8681-416cee8217ab

📥 Commits

Reviewing files that changed from the base of the PR and between 85a5130 and 4c300f3.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (23)

Cargo.toml
components/spider-core/src/types/id.rs
components/spider-scheduler/Cargo.toml
components/spider-scheduler/src/core.rs
components/spider-scheduler/src/core_impl/mod.rs
components/spider-scheduler/src/core_impl/round_robin/implementation.rs
components/spider-scheduler/src/core_impl/round_robin/mod.rs
components/spider-scheduler/src/core_impl/round_robin/tests.rs
components/spider-scheduler/src/dispatch_queue.rs
components/spider-scheduler/src/error.rs
components/spider-scheduler/src/lib.rs
components/spider-scheduler/src/storage_client.rs
components/spider-scheduler/src/types.rs
components/spider-storage/src/cache.rs
components/spider-storage/src/cache/job.rs
components/spider-storage/src/task_instance_pool.rs
components/spider-storage/tests/scheduling_infra.rs
components/spider-tdl/src/task.rs
components/spider-tdl/src/task_context.rs
components/spider-tdl/tests/test_task_macro.rs
tests/huntsman/task-executor/src/lib.rs
tests/huntsman/task-executor/tests/test_process_pool.rs
tests/huntsman/tdl-integration/tests/complex.rs

💤 Files with no reviewable changes (1)

components/spider-storage/src/cache.rs

coderabbitai · 2026-06-05T02:11:33Z

+    pub(super) finalizing_job_queue: VecDeque<(JobId, SystemTime)>,
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use monotonic time for finalizing-job expiry.

SystemTime::elapsed() is fallible on clock rollback. That propagates as an error and can terminate the scheduler loop. Use Instant for expiry tracking so retirement is monotonic and non-fallible.

Proposed fix

use std::{ collections::{HashMap, HashSet, VecDeque}, - time::{Duration, SystemTime}, + time::{Duration, Instant}, }; ... - pub(super) finalizing_job_queue: VecDeque<(JobId, SystemTime)>, + pub(super) finalizing_job_queue: VecDeque<(JobId, Instant)>, ... fn mark_job_finalizing(&mut self, job_id: JobId) { if self.finalizing_jobs.insert(job_id) { self.finalizing_job_queue - .push_back((job_id, SystemTime::now())); + .push_back((job_id, Instant::now())); } } ... - fn retire_expired_finalizing_jobs(&mut self) -> Result<(), SchedulerError> { + fn retire_expired_finalizing_jobs(&mut self) -> Result<(), SchedulerError> { const EXPIRATION_TIME: Duration = Duration::from_hours(6); while let Some((job_id, insertion_time)) = self.finalizing_job_queue.front() { - if insertion_time.elapsed()? > EXPIRATION_TIME { + if insertion_time.elapsed() > EXPIRATION_TIME { ... } else { break; } } Ok(()) }

Also applies to: 456-457, 471-475

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/spider-scheduler/src/core_impl/round_robin/implementation.rs` around lines 241 - 242, The finalizing_job_queue currently stores timestamps as SystemTime which makes elapsed checks fallible; change the queue element type from (JobId, SystemTime) to (JobId, Instant) and replace uses of SystemTime::now() / SystemTime::elapsed() with Instant::now() and Instant::elapsed() so expiry checks are monotonic and non-fallible; update all code paths that push into finalizing_job_queue and that check/compare expiry (the retirement/expiry logic that inspects finalizing_job_queue and computes elapsed) to use Instant-based semantics and adjust any test/timeout constants if needed.

LinZhihao-723 and others added 17 commits May 28, 2026 22:35

WIP

ac0dd77

Done.

dced92e

Merge branch 'main' into scheduler-skeleton

7e87422

Update dispatch queue's trait.

9c436bd

Merge branch 'scheduler-skeleton' of https://github.com/LinZhihao-723…

2013cbf

…/spider into scheduler-skeleton

Fix done.

5650d5a

Merge branch 'task-id-fix' into dispatch-queue-impl

0c8575e

Add channel-based dispatch queue implementation.

10cb6ad

Done.

dda7770

Polish.

25045e2

Core implementation done.

8b722d7

Add black-box unit testing.

1068066

Add cleanup and commit tasks.

bc61958

Add the last.

8c8bdf3

Add commit and cleanup testing.

9ef086c

Add test for session bump.

7af3840

Done with implementation and testing.

4c300f3

LinZhihao-723 requested review from a team and sitaowang1998 as code owners June 5, 2026 02:00

Merge oss-main

e678b75

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(spider-scheduler): Add round-robin scheduler core implementation.#334

feat(spider-scheduler): Add round-robin scheduler core implementation.#334
LinZhihao-723 wants to merge 18 commits into
y-scope:mainfrom
LinZhihao-723:scheduler-core-impl

LinZhihao-723 commented Jun 5, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		pub(super) finalizing_job_queue: VecDeque<(JobId, SystemTime)>,

Conversation

LinZhihao-723 commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LinZhihao-723 commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading