fix: drain tasks before checkpoint and enforce per-domain limit (#9 #10) #22

Merged
Liohtml merged 1 commit into master from claude/engine-reliability
May 28, 2026

@Liohtml Liohtml commented May 28, 2026

Summary

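Two engine reliability fixes:

  • Drain in-flight tasks before saving the checkpoint on pause, so URLs enqueued by still-running tasks are not lost (#9)
  • Enforce concurrent_requests_per_domain with lazily created per-host semaphores (#10)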

Test plan

  • cargo test — full suite green
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --check clean

The per-domain limiter is exercised indirectly by the existing spider tests; a focused load test would be the next step but is non-trivial without a controllable test server.

Closes #9
Closes #10


Generated by Claude Code

Summary by CodeRabbit

  • New Features
    • Added per-domain concurrent request limit configuration
    • Improved shutdown and pause handling to wait for all in-flight tasks before saving progress, preventing loss of enqueued URLs

…imit

- Wait for active_tasks to reach zero before saving the checkpoint on
  pause, so URLs enqueued by still-running spawn tasks are not lost
- Add a per-domain HashMap<String, Arc<Semaphore>> that lazily creates a
  semaphore for each host and acquires a permit before dispatching, so a
  spider's concurrent_requests_per_domain setting is actually enforced

Closes #9
Closes #10

https://claude.ai/code/session_012RmdaovmNWZVAim4XxCWwn

coderabbitai Bot commented May 28, 2026

Walkthrough

CrawlerEngine now enforces optional per-domain request concurrency caps alongside the existing global limiter using lazily-created, per-host semaphores. During pause or shutdown, the engine waits for all in-flight tasks to complete before saving checkpoints, ensuring no URLs are lost.

Changes

Per-domain concurrency limits and task-aware shutdown

| Layer / File(s) | Summary |
| --- | --- |
| Per-domain semaphore data structure and initialization (src/spiders/engine.rs) | Imports Duration and Instant from std::time. Adds domain_limiters field to CrawlerEngine as an Arc<Mutex<HashMap>> storing per-host semaphores. Field initialized as empty HashMap in new(). |
| Per-domain permit acquisition and release in crawl loop (src/spiders/engine.rs) | Crawl loop acquires a per-domain semaphore permit when concurrent_requests_per_domain() is configured, keying by req.domain(). Creates a new Semaphore per host on first encounter, reuses existing ones thereafter. Permit is dropped after task spawn. |
| Pause handling with active task completion wait (src/spiders/engine.rs) | When pause is triggered, engine polls active_tasks until it reaches zero before saving a checkpoint, ensuring any URLs from in-flight requests are included in persistence. |
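
The lazily created per-host limiter described above can be sketched in isolation. This is a minimal illustration, not the code merged in src/spiders/engine.rs: the engine presumably uses an async semaphore together with tokio, whereas this sketch substitutes a small blocking semaphore built from std's Mutex and Condvar so that it has no dependencies. The names DomainLimiters, limiter_for, Permit, and peak_in_flight are hypothetical. The sketch also assumes the permit is held until the simulated request finishes, which is what keeps the observed peak at or below the limit.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Blocking counting semaphore (std-only stand-in for an async semaphore).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

/// Guard that returns its permit to the semaphore when dropped.
struct Permit {
    sem: Arc<Semaphore>,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then hands out a guard for it.
    fn acquire(sem: &Arc<Semaphore>) -> Permit {
        let mut free = sem.permits.lock().unwrap();
        while *free == 0 {
            free = sem.available.wait(free).unwrap();
        }
        *free -= 1;
        Permit { sem: Arc::clone(sem) }
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.available.notify_one();
    }
}

/// Hypothetical registry: one semaphore per host, created on first use.
struct DomainLimiters {
    per_domain: usize,
    by_host: Mutex<HashMap<String, Arc<Semaphore>>>,
}

impl DomainLimiters {
    fn new(per_domain: usize) -> Self {
        DomainLimiters { per_domain, by_host: Mutex::new(HashMap::new()) }
    }

    /// Returns the host's semaphore, inserting it if the host is new.
    /// The map lock is released on return, before any permit is awaited.
    fn limiter_for(&self, host: &str) -> Arc<Semaphore> {
        let mut map = self.by_host.lock().unwrap();
        let per_domain = self.per_domain;
        Arc::clone(
            map.entry(host.to_string())
                .or_insert_with(|| Arc::new(Semaphore::new(per_domain))),
        )
    }
}

/// Fires `requests` simulated fetches at one host and reports the highest
/// number that were in flight at the same time.
fn peak_in_flight(host: &str, requests: usize, per_domain: usize) -> usize {
    let limiters = Arc::new(DomainLimiters::new(per_domain));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::new();
    for _ in 0..requests {
        let limiters = Arc::clone(&limiters);
        let in_flight = Arc::clone(&in_flight);
        let peak = Arc::clone(&peak);
        let host = host.to_string();
        handles.push(thread::spawn(move || {
            let sem = limiters.limiter_for(&host);
            let _permit = Semaphore::acquire(&sem); // held until the closure ends
            let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst);
            thread::sleep(Duration::from_millis(10)); // simulated request
            in_flight.fetch_sub(1, Ordering::SeqCst);
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Calling peak_in_flight("example.com", 8, 2) should never report more than 2. A counter of this kind is one way to approximate the focused load test mentioned in the test plan without a controllable server.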

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

  • #10: Implements per-domain semaphores in src/spiders/engine.rs by adding domain_limiters field and domain-scoped permit acquisition, directly addressing the missing per-domain concurrency enforcement.

Poem

A crawler learns restraint at last,
One domain at a time held fast,
When paused, it waits for tasks to cease, 🐰
Before the checkpoint finds its peace,
No URLs drop when shutdown flows,
The engine bows, the limiter knows.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the two main fixes: draining tasks before checkpoint (fixing issue #9) and enforcing per-domain concurrency limits (fixing issue #10), which align with the core changes in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/spiders/engine.rs (1)

229-235: ⚖️ Poor tradeoff

Consider adding a timeout to the drain loop to prevent indefinite waiting.

If a spawned task hangs (e.g., due to a stalled network call), this loop will spin indefinitely. Consider adding a maximum wait duration and proceeding with the checkpoint save even if some tasks haven't completed.

♻️ Example with timeout
+        let drain_deadline = Instant::now() + Duration::from_secs(30);
         while self.active_tasks.load(Ordering::SeqCst) > 0 {
+            if Instant::now() >= drain_deadline {
+                // Log warning about timed-out tasks, proceed with checkpoint
+                break;
+            }
             tokio::time::sleep(Duration::from_millis(50)).await;
         }
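
The bounded drain suggested above can be tried outside the engine. The following is a rough, dependency-free sketch: it uses std::thread::sleep where the engine's loop awaits tokio::time::sleep, and the function name drain_with_deadline and its parameters are hypothetical. It returns whether the counter reached zero, so a caller can log a warning before saving the checkpoint anyway.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Polls `active_tasks` until it reaches zero or `timeout` elapses.
/// Returns true if every task finished, false if the deadline was hit first.
fn drain_with_deadline(active_tasks: &AtomicUsize, timeout: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while active_tasks.load(Ordering::SeqCst) > 0 {
        if Instant::now() >= deadline {
            // Some tasks are still running; the caller decides whether to
            // checkpoint with a warning or to keep waiting.
            return false;
        }
        thread::sleep(poll);
    }
    true
}
```

A caller on the pause path might treat a false result as "checkpoint may be incomplete" and record how many tasks were still active at that point.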
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/spiders/engine.rs` around lines 229 - 235, The drain loop waiting for
in-flight tasks (uses self.paused and checks self.active_tasks with
tokio::time::sleep(Duration::from_millis(50)).await) can block forever; modify
it to enforce a maximum wait duration by recording a start Instant and breaking
the loop when elapsed exceeds a configurable timeout (e.g., a few seconds) so the
checkpoint proceeds even if tasks hang, and log or return a warning/error
indicating a forced drain due to timeout; ensure the timeout value is
configurable and referenced where self.paused/self.active_tasks are handled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 043fc56a-7373-4a9d-9a47-c98c91c0ecee

📥 Commits

Reviewing files that changed from the base of the PR and between 805f55b and 6563843.

📒 Files selected for processing (1)
  • src/spiders/engine.rs

@Liohtml Liohtml merged commit ef0407c into master May 28, 2026
6 checks passed