fix: robots.txt timeout and multi-agent group parsing (#3 #6) #19

Merged
Liohtml merged 1 commit into master from claude/robots-txt-fixes on May 28, 2026

Liohtml (Owner) commented May 28, 2026

Summary

Two robots.txt fixes in src/spiders/robots.rs: a timeout on the robots.txt fetch (#3) and correct parsing of multi-agent User-agent groups (#6).

Test plan

  • 5 new unit tests covering multi-agent groups, specific-over-wildcard, wildcard fallback, no-match allow-all, and crawl-delay parsing
  • cargo test — full suite green
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --check clean

Note: the timeout is a hard-coded 10s for now. A follow-up could plumb the spider's FetcherConfig through to honor user-configured timeouts/proxies for robots.txt fetches.

Closes #3
Closes #6


Generated by Claude Code

Summary by CodeRabbit

  • Improvements

    • robots.txt parsing now correctly handles multi-agent directive groups, specific-over-wildcard user-agent precedence, and wildcard fallback; the fetch now uses a 10-second timeout instead of waiting indefinitely.
    • If the robots.txt fetch or parse fails, the crawler falls back to allow-all rules.
  • Tests

    • Expanded test suite to cover multi-agent grouping, specific vs. wildcard agent precedence, unknown agents, and crawl-delay parsing.
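
The fallback behavior described above can be sketched roughly as follows. This is a minimal, std-only illustration rather than the code in the PR: the fields of RobotsRules and the helper rules_or_allow_all are assumptions made for the example; only the name allow_all() comes from the review.

```rust
/// Hypothetical stand-in for the PR's `RobotsRules`; the real fields are
/// not shown on this page, so these two are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct RobotsRules {
    disallow: Vec<String>,
    crawl_delay: Option<f64>,
}

impl RobotsRules {
    /// Permissive default: nothing is disallowed and no delay is requested.
    fn allow_all() -> Self {
        Self { disallow: Vec::new(), crawl_delay: None }
    }
}

/// Hypothetical helper: collapse "fetch failed" and "parse failed" into a
/// single permissive result, so the caller always has rules to cache.
fn rules_or_allow_all<E>(
    fetched: Result<String, E>,
    parse: impl Fn(&str) -> Option<RobotsRules>,
) -> RobotsRules {
    fetched
        .ok()                                  // fetch error (e.g. timeout) -> None
        .and_then(|body| parse(&body))         // parse failure -> None
        .unwrap_or_else(RobotsRules::allow_all)
}
```

Routing both failure modes through one expression keeps the cache-insert path identical for success and failure, which appears to be the intent of the consolidated error handling.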

- Use a purpose-built reqwest client with a 10s timeout in fetch_robots so
  a hanging robots.txt endpoint cannot block crawl setup indefinitely
- Parse robots.txt into groups so multiple consecutive User-agent: lines
  share the following directives (standard grouping pattern that was
  previously silently ignored)
- Specific agent matches still win over wildcard groups

Closes #3
Closes #6

https://claude.ai/code/session_012RmdaovmNWZVAim4XxCWwn

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: efddf386-49e7-4e68-a061-1407d00caf09

📥 Commits

Reviewing files that changed from the base of the PR and between 805f55b and b5e2952.

📒 Files selected for processing (1)
  • src/spiders/robots.rs

Walkthrough

The PR refactors robots.txt fetching and parsing to fix two issues: HTTP timeout handling and multi-agent group support. A new RobotsGroup struct groups consecutive User-agent: directives with their rules. fetch_robots now uses a timed reqwest::Client instead of reqwest::get, and parse_robots selects matching groups by exact agent or wildcard fallback. Parsing helpers and tests are added to support the new group-based logic.
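
As a rough illustration of the grouping step, the sketch below tokenizes a robots.txt body into groups so that consecutive User-agent: lines share the directives that follow. The names RobotsGroup, parse_groups, and strip_prefix_ci come from the walkthrough; the field layout, signatures, and comment handling are assumptions, and only Disallow and Crawl-delay are handled here.

```rust
/// One contiguous block: one or more `User-agent:` lines plus their rules.
/// The field layout is an assumption; the PR's struct may differ.
#[derive(Debug, Default, Clone, PartialEq)]
struct RobotsGroup {
    agents: Vec<String>, // lowercased agent tokens
    disallow: Vec<String>,
    crawl_delay: Option<f64>,
}

/// Case-insensitive prefix match; returns the trimmed remainder on success.
fn strip_prefix_ci<'a>(line: &'a str, prefix: &str) -> Option<&'a str> {
    let head = line.get(..prefix.len())?; // None if too short or mid-character
    if head.eq_ignore_ascii_case(prefix) {
        Some(line[prefix.len()..].trim())
    } else {
        None
    }
}

fn parse_groups(body: &str) -> Vec<RobotsGroup> {
    let mut groups = Vec::new();
    let mut current = RobotsGroup::default();
    let mut seen_rule = false; // true once the current group has a directive

    for raw in body.lines() {
        // Drop trailing comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        if let Some(agent) = strip_prefix_ci(line, "user-agent:") {
            // A User-agent line that follows rules starts a new group;
            // consecutive User-agent lines extend the current one.
            if seen_rule {
                groups.push(std::mem::take(&mut current));
                seen_rule = false;
            }
            current.agents.push(agent.to_ascii_lowercase());
        } else if current.agents.is_empty() {
            // Directive before any User-agent line: ignored in this sketch.
            continue;
        } else if let Some(path) = strip_prefix_ci(line, "disallow:") {
            seen_rule = true;
            if !path.is_empty() {
                current.disallow.push(path.to_string());
            }
        } else if let Some(delay) = strip_prefix_ci(line, "crawl-delay:") {
            seen_rule = true;
            current.crawl_delay = delay.parse().ok();
        }
        // Other directives (Allow, Sitemap, ...) are skipped in this sketch.
    }
    if !current.agents.is_empty() {
        groups.push(current);
    }
    groups
}
```

For a body such as "User-agent: FooBot", "User-agent: BarBot", "Disallow: /private", this yields a single group listing both agents, which is the case the previous single-pass matching reportedly skipped.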

Changes

Robots.txt Parser and Fetch Refactoring

| Layer / File(s) | Summary |
|---|---|
| Data structures and helper constructors (src/spiders/robots.rs) | Duration import added for the timeout. RobotsGroup struct introduced to represent contiguous robots.txt User-agent: blocks with disallow and optional crawl_delay. RobotsRules::allow_all() constructor standardizes fallback behavior. |
| HTTP fetch with timeout and unified error handling (src/spiders/robots.rs) | fetch_robots now builds a dedicated reqwest::Client with a 10-second timeout and consolidates error and parse-failure handling by falling back to allow_all() before caching. The is_allowed cache-miss path is simplified to return allow-all. |
| Group-based user-agent selection and matching (src/spiders/robots.rs) | parse_robots rewritten to select the first RobotsGroup matching the lowercased target user-agent, fall back to the wildcard (*) group if present, or return allow-all if no group matches. Replaces the previous single-pass matching, which incorrectly skipped multi-agent blocks. |
| Robots.txt group tokenization and directive extraction (src/spiders/robots.rs) | parse_groups tokenizes robots.txt into contiguous RobotsGroup blocks by tracking User-agent: boundaries and grouping Disallow/Crawl-delay rules per block. strip_prefix_ci performs case-insensitive directive prefix matching. |
| Test coverage for multi-agent groups and fallback logic (src/spiders/robots.rs) | Unit tests expanded to validate multi-agent group handling, specific-agent precedence over wildcard, wildcard fallback when the agent is absent, allow-all for unknown agents when no wildcard group exists, and Crawl-delay parsing. |
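
The selection order summarized above (exact agent first, then the wildcard group, then allow-all) might look roughly like the sketch below. The struct fields and the function name select_rules are assumptions for illustration; in the PR this logic is described as living inside parse_robots.

```rust
/// Assumed shapes for illustration; the PR's actual definitions may differ.
#[derive(Debug, Clone, PartialEq)]
struct RobotsGroup {
    agents: Vec<String>, // lowercased agent tokens, "*" for the wildcard
    disallow: Vec<String>,
    crawl_delay: Option<f64>,
}

#[derive(Debug, Clone, PartialEq)]
struct RobotsRules {
    disallow: Vec<String>,
    crawl_delay: Option<f64>,
}

impl RobotsRules {
    fn allow_all() -> Self {
        Self { disallow: Vec::new(), crawl_delay: None }
    }
}

/// True if the group names `name` among its agents.
fn lists_agent(group: &RobotsGroup, name: &str) -> bool {
    group.agents.iter().any(|a| a.as_str() == name)
}

/// Hypothetical selection helper: the first group naming the agent wins,
/// otherwise the first wildcard group, otherwise allow-all.
fn select_rules(groups: &[RobotsGroup], user_agent: &str) -> RobotsRules {
    let target = user_agent.to_ascii_lowercase();
    groups
        .iter()
        .find(|g| lists_agent(g, &target))
        .or_else(|| groups.iter().find(|g| lists_agent(g, "*")))
        .map(|g| RobotsRules {
            disallow: g.disallow.clone(),
            crawl_delay: g.crawl_delay,
        })
        .unwrap_or_else(RobotsRules::allow_all)
}
```

Running the specific-agent search over all groups before considering the wildcard means a "*" group that appears earlier in the file cannot shadow a later agent-specific group.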

🎯 4 (Complex) | ⏱️ ~45 minutes

A rabbit hops through the group-based rows,
With timeout set to swift repose,
No more hangs or groups ignored,
The robots.txt now well-restored! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title concisely and accurately captures both main fixes: timeout handling and multi-agent group parsing, with issue numbers for traceability. |
| Linked Issues check | ✅ Passed | The pull request directly addresses both linked issues: #3 (timeout via 10s client) and #6 (two-pass group parsing with specific-over-wildcard logic) with full implementation and test coverage. |
| Out of Scope Changes check | ✅ Passed | All changes in src/spiders/robots.rs are directly scoped to fixing issues #3 and #6; no extraneous modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.31%, which is sufficient. The required threshold is 80.00%. |

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

Liohtml merged commit 0e2a347 into master on May 28, 2026
6 checks passed