Skip to content

fix: harden engine against path traversal and allowed_domains bypass (High)#18

Merged
Liohtml merged 1 commit into
masterfrom
claude/engine-security-issues
May 28, 2026
Merged

fix: harden engine against path traversal and allowed_domains bypass (High)#18
Liohtml merged 1 commit into
masterfrom
claude/engine-security-issues

Conversation

@Liohtml
Copy link
Copy Markdown
Owner

@Liohtml Liohtml commented May 28, 2026

Summary

Two High-severity security fixes in src/spiders/engine.rs:

Test plan

  • New unit tests for sanitize_path_segment covering traversal chars, safe chars, and empty input (3 tests added)
  • cargo test — full suite passing (184 + 3 = 187)
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --check clean

Closes #1
Closes #2


Generated by Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced security handling for spider naming and cache directory creation
    • Improved domain validation to reject requests with unparseable domains
  • Tests

    • Added comprehensive unit tests for validation logic improvements

Review Change Stack

- Sanitize spider.name() before interpolating it into cache/checkpoint paths
  so a name like "../etc/passwd" cannot escape the .scrapling/ directory
- Reject requests whose domain cannot be parsed (data:, file://, malformed
  URLs) when allowed_domains is set, instead of silently allowing them

Closes #1
Closes #2

https://claude.ai/code/session_012RmdaovmNWZVAim4XxCWwn
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3fb52caa-00e9-4895-93c1-f77e0568b36a

📥 Commits

Reviewing files that changed from the base of the PR and between 805f55b and 55abaca.

📒 Files selected for processing (1)
  • src/spiders/engine.rs

📝 Walkthrough

Walkthrough

This PR hardens the spider engine against two security issues: path traversal via unsanitized spider names and domain validation bypass for unparseable URLs. A new sanitization utility normalizes spider names for safe filesystem paths, applied to cache and checkpoint directory construction. Additionally, the allowed_domains enforcement is tightened to reject requests with missing or unparseable domains instead of silently allowing them.

Changes

Security hardening: path sanitization and domain validation

Layer / File(s) Summary
Path sanitization utility and tests
src/spiders/engine.rs
Introduces sanitize_path_segment helper to normalize spider names into safe filesystem path segments by replacing non-[A-Za-z0-9_-] characters with _. Includes unit tests covering traversal-like inputs (../), safe character preservation, and empty/edge-case inputs.
Apply sanitization to cache and checkpoint paths
src/spiders/engine.rs
Updates development-mode cache directory and checkpoint directory construction to use sanitize_path_segment(spider.name()) instead of raw spider names, preventing path traversal attacks.
Enforce domain validation for unparseable URLs
src/spiders/engine.rs
Reworks allowed_domains whitelist enforcement to treat requests with unparseable domains (request.domain() == None) as offsite requests, closing a bypass where malformed URLs or non-HTTP schemes would bypass the domain whitelist.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A rabbit's ode to safer paths:
Traversals thwarted, domains locked tight,
No ../ escapes in the dark of night,
Spider names sanitized, URLs checked twice—
Security's served with some fuzzy advice! 🔐

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title directly and clearly describes the main changes: hardening the engine against path traversal and allowed_domains bypass, which are the two primary fixes in the changeset.
Linked Issues check ✅ Passed The code changes address both linked issues: sanitize_path_segment prevents path traversal via spider.name() [#2], and allowed_domains enforcement now rejects requests with None domains [#1].
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the two high-severity vulnerabilities in src/spiders/engine.rs; no unrelated modifications are present.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/engine-security-issues

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@Liohtml Liohtml merged commit 39b1193 into master May 28, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants