fix: drain tasks before checkpoint and enforce per-domain limit (#9 #10) by Liohtml · Pull Request #22 · Liohtml/RUSTScrapling

Liohtml · 2026-05-28T22:06:21Z

Summary

Two engine reliability fixes:

- #9, "[repo-monitor] Medium: Checkpoint on pause loses in-flight request results" (Medium): when request_pause() triggered the loop break, the checkpoint was saved immediately while tasks started with tokio::spawn were still running, so any URLs those tasks enqueued were lost. The engine now waits for active_tasks to drain (polling every 50 ms) before serializing the checkpoint.
- #10, "[repo-monitor] Medium: concurrent_requests_per_domain setting silently ignored, no per-domain rate limiting enforced" (Medium): the value was recorded in CrawlStats but never used to throttle anything; every request contended on the single global semaphore. The fix adds a lazily-populated HashMap<String, Arc<Semaphore>> keyed on host. Each spawned request acquires a per-domain permit (when the cap is set) in addition to the global permit; both are released on completion.
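The lazily-populated per-host limiter map described for #10 can be sketched roughly as follows. This is not the code from the PR: to stay dependency-free it swaps the async Semaphore for a small blocking counting semaphore built on std's Mutex and Condvar, and the names HostLimiter, HostPermit, DomainLimiters, and limiter_for are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical blocking counting semaphore; a std-only stand-in for the
/// async semaphore mentioned in the PR description.
pub struct HostLimiter {
    available: Mutex<usize>,
    freed: Condvar,
}

impl HostLimiter {
    pub fn new(permits: usize) -> Self {
        Self { available: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Blocks until a permit is free, then returns a guard that gives the
    /// permit back when it is dropped.
    pub fn acquire(limiter: &Arc<HostLimiter>) -> HostPermit {
        let mut available = limiter.available.lock().unwrap();
        while *available == 0 {
            available = limiter.freed.wait(available).unwrap();
        }
        *available -= 1;
        HostPermit { limiter: Arc::clone(limiter) }
    }

    /// Number of permits currently free.
    pub fn available(&self) -> usize {
        *self.available.lock().unwrap()
    }
}

/// RAII guard: dropping it releases one permit and wakes one waiter.
pub struct HostPermit {
    limiter: Arc<HostLimiter>,
}

impl Drop for HostPermit {
    fn drop(&mut self) {
        *self.limiter.available.lock().unwrap() += 1;
        self.limiter.freed.notify_one();
    }
}

/// Map from host to limiter, filled in on first use of each host.
#[derive(Default)]
pub struct DomainLimiters {
    by_host: Mutex<HashMap<String, Arc<HostLimiter>>>,
}

impl DomainLimiters {
    /// Returns the limiter for `host`, creating it with `per_domain_cap`
    /// permits the first time the host is seen.
    pub fn limiter_for(&self, host: &str, per_domain_cap: usize) -> Arc<HostLimiter> {
        let mut map = self.by_host.lock().unwrap();
        map.entry(host.to_string())
            .or_insert_with(|| Arc::new(HostLimiter::new(per_domain_cap)))
            .clone()
    }
}
```

In this sketch a caller would hold the HostPermit for the lifetime of the request, so the slot is returned when the request finishes rather than when it is dispatched. The map lock is held only long enough to look up or create the entry, never while waiting for a permit.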

Test plan

cargo test — full suite green
cargo clippy --all-targets -- -D warnings clean
cargo fmt --check clean

The per-domain limiter is exercised indirectly by the existing spider tests; a focused load test would be the next step but is non-trivial without a controllable test server.

Closes #9
Closes #10

Generated by Claude Code

Summary by CodeRabbit

New Features
- Added per-domain concurrent request limit configuration
- Improved shutdown and pause handling to wait for all in-flight tasks before saving progress, preventing loss of enqueued URLs

…imit

- Wait for active_tasks to reach zero before saving the checkpoint on pause, so URLs enqueued by still-running spawn tasks are not lost
- Add a per-domain HashMap<String, Arc<Semaphore>> that lazily creates a semaphore for each host and acquires a permit before dispatching, so a spider's concurrent_requests_per_domain setting is actually enforced

Closes #9
Closes #10

https://claude.ai/code/session_012RmdaovmNWZVAim4XxCWwn

coderabbitai · 2026-05-28T22:06:34Z

📝 Walkthrough

CrawlerEngine now enforces optional per-domain request concurrency caps alongside the existing global limiter using lazily-created, per-host semaphores. During pause or shutdown, the engine waits for all in-flight tasks to complete before saving checkpoints, ensuring no URLs are lost.

Changes

Per-domain concurrency limits and task-aware shutdown

| Layer / File(s) | Summary |
| --- | --- |
| Per-domain semaphore data structure and initialization (`src/spiders/engine.rs`) | Imports `Duration` and `Instant` from `std::time`. Adds `domain_limiters` field to `CrawlerEngine` as an `Arc<Mutex<HashMap>>` storing per-host semaphores. Field initialized as empty `HashMap` in `new()`. |
| Per-domain permit acquisition and release in crawl loop (`src/spiders/engine.rs`) | Crawl loop acquires a per-domain semaphore permit when `concurrent_requests_per_domain()` is configured, keying by `req.domain()`. Creates a new `Semaphore` per host on first encounter, reuses existing ones thereafter. Permit is dropped after task spawn. |
| Pause handling with active task completion wait (`src/spiders/engine.rs`) | When pause is triggered, engine polls `active_tasks` until it reaches zero before saving a checkpoint, ensuring any URLs from in-flight requests are included in persistence. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

#10: Implements per-domain semaphores in src/spiders/engine.rs by adding domain_limiters field and domain-scoped permit acquisition, directly addressing the missing per-domain concurrency enforcement.

Poem

A crawler learns restraint at last,
One domain at a time held fast,
When paused, it waits for tasks to cease, 🐰
Before the checkpoint finds its peace,
No URLs drop when shutdown flows,
The engine bows, the limiter knows.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the two main fixes: draining tasks before checkpoint (fixing issue `#9`) and enforcing per-domain concurrency limits (fixing issue `#10`), which align with the core changes in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/spiders/engine.rs (1)

229-235: ⚖️ Poor tradeoff

Consider adding a timeout to the drain loop to prevent indefinite waiting.

If a spawned task hangs (e.g., due to a stalled network call), this loop will spin indefinitely. Consider adding a maximum wait duration and proceeding with the checkpoint save even if some tasks haven't completed.

♻️ Example with timeout

+        let drain_deadline = Instant::now() + Duration::from_secs(30);
         while self.active_tasks.load(Ordering::SeqCst) > 0 {
+            if Instant::now() >= drain_deadline {
+                // Log warning about timed-out tasks, proceed with checkpoint
+                break;
+            }
             tokio::time::sleep(Duration::from_millis(50)).await;
         }
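The bounded drain suggested above can be sketched as a standalone helper. This is an illustration under stated assumptions, not the engine's code: it uses std::thread::sleep instead of tokio::time::sleep so it runs without dependencies, and the function name drain_with_deadline and its parameters are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Polls `active_tasks` every `poll` until it reaches zero or `timeout`
/// has elapsed. Returns `true` if the counter drained, `false` if the
/// deadline was hit first (the caller can then log a warning and save
/// the checkpoint anyway).
pub fn drain_with_deadline(
    active_tasks: &AtomicUsize,
    poll: Duration,
    timeout: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    while active_tasks.load(Ordering::SeqCst) > 0 {
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(poll);
    }
    true
}
```

Returning a boolean rather than breaking silently lets the caller decide whether a forced drain should be logged or reported as an error. In an async engine the sleep would be the awaited tokio::time::sleep already shown in the diff.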

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/spiders/engine.rs` around lines 229 - 235, The drain loop waiting for
in-flight tasks (uses self.paused and checks self.active_tasks with
tokio::time::sleep(Duration::from_millis(50)).await) can block forever; modify
it to enforce a maximum wait duration by recording a start Instant and breaking
the loop when elapsed exceeds a configurable timeout (e.g., a few seconds) so the
checkpoint proceeds even if tasks hang, and log or return a warning/error
indicating a forced drain due to timeout; ensure the timeout value is
configurable and referenced where self.paused/self.active_tasks are handled.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 043fc56a-7373-4a9d-9a47-c98c91c0ecee

📥 Commits

Reviewing files that changed from the base of the PR and between 805f55b and 6563843.

📒 Files selected for processing (1)

src/spiders/engine.rs

coderabbitai Bot reviewed May 28, 2026

Liohtml merged commit ef0407c into master May 28, 2026
6 checks passed

Liohtml merged 1 commit into master from claude/engine-reliability
