fix: drain tasks before checkpoint and enforce per-domain limit (#9 #10)#22
Conversation
…imit - Wait for active_tasks to reach zero before saving the checkpoint on pause, so URLs enqueued by still-running spawn tasks are not lost - Add a per-domain HashMap<String, Arc<Semaphore>> that lazily creates a semaphore for each host and acquires a permit before dispatching, so a spider's concurrent_requests_per_domain setting is actually enforced Closes #9 Closes #10 https://claude.ai/code/session_012RmdaovmNWZVAim4XxCWwn
📝 WalkthroughWalkthrough
ChangesPer-domain concurrency limits and task-aware shutdown
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/spiders/engine.rs (1)
229-235: ⚖️ Poor tradeoffConsider adding a timeout to the drain loop to prevent indefinite waiting.
If a spawned task hangs (e.g., due to a stalled network call), this loop will spin indefinitely. Consider adding a maximum wait duration and proceeding with the checkpoint save even if some tasks haven't completed.
♻️ Example with timeout
+ let drain_deadline = Instant::now() + Duration::from_secs(30); while self.active_tasks.load(Ordering::SeqCst) > 0 { + if Instant::now() >= drain_deadline { + // Log warning about timed-out tasks, proceed with checkpoint + break; + } tokio::time::sleep(Duration::from_millis(50)).await; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/spiders/engine.rs` around lines 229 - 235, The drain loop waiting for in-flight tasks (uses self.paused and checks self.active_tasks with tokio::time::sleep(Duration::from_millis(50)).await) can block forever; modify it to enforce a maximum wait duration by recording a start Instant and breaking the loop when elapsed exceeds a configurable timeout (e.g., few seconds) so the checkpoint proceeds even if tasks hang, and log or return a warning/error indicating a forced drain due to timeout; ensure the timeout value is configurable and referenced where self.paused/self.active_tasks are handled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/spiders/engine.rs`:
- Around line 229-235: The drain loop waiting for in-flight tasks (uses
self.paused and checks self.active_tasks with
tokio::time::sleep(Duration::from_millis(50)).await) can block forever; modify
it to enforce a maximum wait duration by recording a start Instant and breaking
the loop when elapsed exceeds a configurable timeout (e.g., few seconds) so the
checkpoint proceeds even if tasks hang, and log or return a warning/error
indicating a forced drain due to timeout; ensure the timeout value is
configurable and referenced where self.paused/self.active_tasks are handled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 043fc56a-7373-4a9d-9a47-c98c91c0ecee
📒 Files selected for processing (1)
src/spiders/engine.rs
Summary
Two engine reliability fixes:
request_pause()triggered the loop break, the checkpoint was saved immediately whiletokio::spawned tasks were still running. Any URLs those tasks enqueued were dropped on the floor. Now waits foractive_tasksto drain (50ms poll) before serializing the checkpoint.concurrent_requests_per_domainwas a no-op: The value was recorded inCrawlStatsbut never used to throttle anything; every request contended on the single global semaphore. Added a lazily-populatedHashMap<String, Arc<Semaphore>>keyed on host. Each spawned request acquires a per-domain permit (when the cap is set) in addition to the global permit; both are released on completion.Test plan
cargo test— full suite greencargo clippy --all-targets -- -D warningscleancargo fmt --checkcleanThe per-domain limiter is exercised indirectly by the existing spider tests; a focused load test would be the next step but is non-trivial without a controllable test server.
Closes #9
Closes #10
Generated by Claude Code
Summary by CodeRabbit