fix: harden engine against path traversal and allowed_domains bypass (High) #18
- Sanitize spider.name() before interpolating it into cache/checkpoint paths so a name like "../etc/passwd" cannot escape the .scrapling/ directory
- Reject requests whose domain cannot be parsed (data:, file://, malformed URLs) when allowed_domains is set, instead of silently allowing them

Closes #1
Closes #2

https://claude.ai/code/session_012RmdaovmNWZVAim4XxCWwn
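The second change above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `is_offsite`, the `HashSet<String>` allow-list, and exact-match comparison (no subdomain handling) are all assumptions made for the example. The point it shows is that a `None` domain falls into an explicit reject arm instead of skipping the check.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the engine's offsite check (name and
/// signature are assumptions, not taken from the PR).
/// Returns true when the request should be rejected as offsite.
fn is_offsite(domain: Option<&str>, allowed: &HashSet<String>) -> bool {
    // An empty allow-list is treated here as "allowed_domains not set":
    // nothing is filtered.
    if allowed.is_empty() {
        return false;
    }
    match domain {
        // Parsed domain: offsite unless it is on the allow-list.
        // Exact matching is a simplification for this sketch.
        Some(d) => !allowed.contains(d),
        // Unparseable domain (data:, file://, malformed URL):
        // reject instead of silently allowing.
        None => true,
    }
}

fn main() {
    let allowed = HashSet::from(["example.com".to_string()]);

    assert!(!is_offsite(Some("example.com"), &allowed));
    assert!(is_offsite(Some("evil.test"), &allowed));
    assert!(is_offsite(None, &allowed));

    // With no allow-list configured, even an unparseable domain passes.
    assert!(!is_offsite(None, &HashSet::new()));
}
```

With an `if let Some(d) = domain { ... }` form, the `None` case has no branch at all and the request passes by default. The `match` forces that case to be handled.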
📝 Walkthrough
This PR hardens the spider engine against two security issues: path traversal via unsanitized spider names and domain validation bypass for unparseable URLs. A new sanitization utility normalizes spider names for safe filesystem paths, applied to cache and checkpoint directory construction. Additionally, the allowed_domains enforcement is tightened to reject requests with missing or unparseable domains instead of silently allowing them.
Changes
Security hardening: path sanitization and domain validation
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 0
Summary
Two High-severity security fixes in `src/spiders/engine.rs`:
- `spider.name()`: A spider name containing `../` was interpolated directly into the cache/checkpoint directory paths, allowing writes outside `.scrapling/`. Added `sanitize_path_segment()`, which replaces any character that is not alphanumeric, `_`, or `-` with `_`. Empty inputs become `_`.
- `allowed_domains` whitelist bypass: When `request.domain()` returned `None` (malformed URL, `data:`, `file://`), the request was silently allowed through. Replaced the `if let Some(...)` with a `match` that explicitly rejects unparseable domains as offsite.

Test plan
- `sanitize_path_segment` covering traversal chars, safe chars, and empty input (3 tests added)
- `cargo test`: full suite passing (184 + 3 = 187)
- `cargo clippy --all-targets -- -D warnings` clean
- `cargo fmt --check` clean

Closes #1
Closes #2
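
For readers without the diff open, the sanitization rule described in the summary can be sketched as below. This is a reconstruction from the description only, not the function body from the PR. It assumes "alphanumeric" means ASCII alphanumeric, and the `.scrapling/` path layout in `main` is purely illustrative.

```rust
/// Sketch of the described rule: keep alphanumerics, `_`, and `-`,
/// replace every other character with `_`, and map an empty input to `_`.
/// Assumption: ASCII-only alphanumerics; the PR may accept Unicode ones.
fn sanitize_path_segment(name: &str) -> String {
    if name.is_empty() {
        return "_".to_string();
    }
    name.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '_' || c == '-' {
                c
            } else {
                '_'
            }
        })
        .collect()
}

fn main() {
    // Both dots and slashes are replaced, so no `..` or `/` component survives.
    let dir = format!(".scrapling/{}", sanitize_path_segment("../etc/passwd"));
    assert_eq!(dir, ".scrapling/___etc_passwd");

    // Already-safe names pass through unchanged.
    assert_eq!(sanitize_path_segment("my-spider_01"), "my-spider_01");

    // An empty name still yields a usable directory segment.
    assert_eq!(sanitize_path_segment(""), "_");
}
```

One consequence of a replace-with-`_` scheme is that distinct names such as `a/b` and `a.b` map to the same segment. Whether that collision matters depends on how spider names are chosen in practice.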
Generated by Claude Code
Summary by CodeRabbit
Bug Fixes
Tests