fix: robots.txt timeout and multi-agent group parsing (#3 #6) #19
- Use a purpose-built reqwest client with a 10s timeout in fetch_robots so a hanging robots.txt endpoint cannot block crawl setup indefinitely
- Parse robots.txt into groups so multiple consecutive User-agent: lines share the following directives (standard grouping pattern that was previously silently ignored)
- Specific agent matches still win over wildcard groups

Closes #3
Closes #6

https://claude.ai/code/session_012RmdaovmNWZVAim4XxCWwn
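The grouping behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code in `src/spiders/robots.rs`: the names `Group`, `parse_groups`, and `select_group` are hypothetical, agent matching is a simple case-insensitive substring check, and any non-`User-agent` line (including `Sitemap:`) is treated as a directive that closes the current run of agents.

```rust
/// One robots.txt group: the agents it applies to plus its directives.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Default, Clone, PartialEq)]
struct Group {
    agents: Vec<String>,          // lowercased User-agent values
    rules: Vec<(String, String)>, // (directive, value), e.g. ("disallow", "/private")
}

/// Step 1: collect groups. Consecutive `User-agent:` lines accumulate into the
/// same group; the first `User-agent:` seen after a directive opens a new one.
fn parse_groups(body: &str) -> Vec<Group> {
    let mut groups: Vec<Group> = Vec::new();
    let mut in_rules = true; // true means the next User-agent line opens a group
    for raw in body.lines() {
        // Strip trailing comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else { continue };
        let key = key.trim().to_ascii_lowercase();
        let value = value.trim().to_string();
        if key == "user-agent" {
            if in_rules {
                groups.push(Group::default());
                in_rules = false;
            }
            // Safe: a group is always pushed before `in_rules` becomes false.
            groups.last_mut().unwrap().agents.push(value.to_ascii_lowercase());
        } else if let Some(group) = groups.last_mut() {
            // Directives seen before any User-agent line are ignored.
            group.rules.push((key, value));
            in_rules = true;
        }
    }
    groups
}

/// Step 2: pick the group for `agent`. A group naming the agent explicitly
/// wins over a wildcard (`*`) group.
fn select_group<'a>(groups: &'a [Group], agent: &str) -> Option<&'a Group> {
    let agent = agent.to_ascii_lowercase();
    groups
        .iter()
        .find(|g| {
            g.agents
                .iter()
                .any(|a| a.as_str() != "*" && !a.is_empty() && agent.contains(a.as_str()))
        })
        .or_else(|| groups.iter().find(|g| g.agents.iter().any(|a| a.as_str() == "*")))
}

fn main() {
    let body = "User-agent: AlphaBot\nUser-agent: BetaBot\nDisallow: /private\n\nUser-agent: *\nDisallow: /\n";
    let groups = parse_groups(body);

    // Both named agents share the first group's directives.
    let beta = select_group(&groups, "BetaBot/2.1").expect("specific group");
    println!("BetaBot rules: {:?}", beta.rules);

    // Any other agent falls back to the wildcard group.
    let other = select_group(&groups, "OtherCrawler/1.0").expect("wildcard group");
    println!("Other rules: {:?}", other.rules);
}
```

Keeping collection and selection as separate steps means the "specific beats wildcard" rule lives in one place and does not depend on the order of groups in the file.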
No actionable comments were generated in the recent review.
Walkthrough

The PR refactors robots.txt fetching and parsing to fix two issues: HTTP timeout handling and multi-agent group support.

Changes: Robots.txt Parser and Fetch Refactoring
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 5 passed
Actionable comments posted: 0
Summary
Two `robots.txt` fixes in `src/spiders/robots.rs`:

- `reqwest::get()` created a default client with no timeout and ignored all `FetcherConfig`. A hanging `/robots.txt` endpoint would block crawl setup indefinitely. Replaced with a purpose-built `reqwest::Client::builder().timeout(10s).build()`.
- `User-agent:` grouping (multiple `User-agent:` lines preceding one set of directives) was not handled; only the last agent of the group got the rules. Rewrote the parser as a two-step process: collect groups (each holds its applicable agents + directives), then pick the most specific matching group (specific agent > wildcard `*`).

Test plan

- `cargo test`: full suite green
- `cargo clippy --all-targets -- -D warnings` clean
- `cargo fmt --check` clean

Note: the timeout is a hard-coded 10s for now. A follow-up could plumb the spider's `FetcherConfig` through to honor user-configured timeouts/proxies for robots.txt fetches.

Closes #3
Closes #6

Generated by Claude Code
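One possible shape for the follow-up mentioned in the note, sketched under assumptions: the real `FetcherConfig` fields are not shown in this PR, so `FetcherConfigSketch` and its `timeout` field are hypothetical stand-ins, as is the helper `robots_timeout`. The idea is only that a configured timeout would take precedence and the current 10s value would remain the fallback.

```rust
use std::time::Duration;

/// Fallback used when nothing is configured; mirrors the 10s hard-coded in this PR.
const ROBOTS_DEFAULT_TIMEOUT: Duration = Duration::from_secs(10);

/// Hypothetical stand-in for the spider's FetcherConfig. The real struct and
/// its field names are assumptions here, not taken from the codebase.
#[derive(Debug, Default)]
struct FetcherConfigSketch {
    timeout: Option<Duration>,
}

/// Choose the timeout for the robots.txt fetch: the user-configured value if
/// one is present, otherwise the 10s default.
fn robots_timeout(config: Option<&FetcherConfigSketch>) -> Duration {
    config
        .and_then(|c| c.timeout)
        .unwrap_or(ROBOTS_DEFAULT_TIMEOUT)
}

fn main() {
    let configured = FetcherConfigSketch {
        timeout: Some(Duration::from_secs(3)),
    };
    // In the real fetch path this Duration would be handed to
    // reqwest::Client::builder().timeout(..) before building the client.
    println!("configured: {:?}", robots_timeout(Some(&configured)));
    println!("default:    {:?}", robots_timeout(None));
}
```

Proxy settings would need the same treatment, but how they are represented in the config is not visible from this PR.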
Summary by CodeRabbit
Improvements
Tests