fix: harden engine against path traversal and allowed_domains bypass (High) by Liohtml · Pull Request #18 · Liohtml/RUSTScrapling

Liohtml · 2026-05-28T21:58:52Z

Summary

Two High-severity security fixes in src/spiders/engine.rs:

[repo-monitor] High: Path traversal via spider.name() — ../sequences write to arbitrary filesystem paths #2 — Path traversal via spider.name(): A spider name containing ../ was interpolated directly into the cache/checkpoint directory paths, allowing writes outside .scrapling/. Added sanitize_path_segment() which replaces any character that is not alphanumeric, _, or - with _. Empty inputs become _.
[repo-monitor] High: allowed_domains whitelist bypassed when URL has no parseable host #1 — allowed_domains whitelist bypass: When request.domain() returned None (malformed URL, data:, file://), the request was silently allowed through. Replaced the if let Some(...) with a match that explicitly rejects unparseable domains as offsite.

Test plan

New unit tests for sanitize_path_segment covering traversal chars, safe chars, and empty input (3 tests added)
cargo test — full suite passing (184 + 3 = 187)
cargo clippy --all-targets -- -D warnings clean
cargo fmt --check clean

Closes #1
Closes #2

Generated by Claude Code

Summary by CodeRabbit

Bug Fixes
- Enhanced security handling for spider naming and cache directory creation
- Improved domain validation to reject requests with unparseable domains
Tests
- Added comprehensive unit tests for validation logic improvements

- Sanitize spider.name() before interpolating it into cache/checkpoint paths so a name like "../etc/passwd" cannot escape the .scrapling/ directory - Reject requests whose domain cannot be parsed (data:, file://, malformed URLs) when allowed_domains is set, instead of silently allowing them Closes #1 Closes #2 https://claude.ai/code/session_012RmdaovmNWZVAim4XxCWwn

coderabbitai · 2026-05-28T21:59:02Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3fb52caa-00e9-4895-93c1-f77e0568b36a

📥 Commits

Reviewing files that changed from the base of the PR and between 805f55b and 55abaca.

📒 Files selected for processing (1)

src/spiders/engine.rs

📝 Walkthrough

Walkthrough

This PR hardens the spider engine against two security issues: path traversal via unsanitized spider names and domain validation bypass for unparseable URLs. A new sanitization utility normalizes spider names for safe filesystem paths, applied to cache and checkpoint directory construction. Additionally, the allowed_domains enforcement is tightened to reject requests with missing or unparseable domains instead of silently allowing them.

Changes

Security hardening: path sanitization and domain validation

Layer / File(s)	Summary
Path sanitization utility and tests `src/spiders/engine.rs`	Introduces `sanitize_path_segment` helper to normalize spider names into safe filesystem path segments by replacing non-`[A-Za-z0-9_-]` characters with `_`. Includes unit tests covering traversal-like inputs (`../`), safe character preservation, and empty/edge-case inputs.
Apply sanitization to cache and checkpoint paths `src/spiders/engine.rs`	Updates development-mode cache directory and checkpoint directory construction to use `sanitize_path_segment(spider.name())` instead of raw spider names, preventing path traversal attacks.
Enforce domain validation for unparseable URLs `src/spiders/engine.rs`	Reworks `allowed_domains` whitelist enforcement to treat requests with unparseable domains (`request.domain() == None`) as offsite requests, closing a bypass where malformed URLs or non-HTTP schemes would bypass the domain whitelist.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A rabbit's ode to safer paths:
Traversals thwarted, domains locked tight,
No ../ escapes in the dark of night,
Spider names sanitized, URLs checked twice—
Security's served with some fuzzy advice! 🔐

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title directly and clearly describes the main changes: hardening the engine against path traversal and allowed_domains bypass, which are the two primary fixes in the changeset.
Linked Issues check	✅ Passed	The code changes address both linked issues: sanitize_path_segment prevents path traversal via spider.name() [`#2`], and allowed_domains enforcement now rejects requests with None domains [`#1`].
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing the two high-severity vulnerabilities in src/spiders/engine.rs; no unrelated modifications are present.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/engine-security-issues

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai · 2026-05-28T22:01:28Z

Actionable comments posted: 0

Liohtml merged commit 39b1193 into master May 28, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: harden engine against path traversal and allowed_domains bypass (High)#18

fix: harden engine against path traversal and allowed_domains bypass (High)#18
Liohtml merged 1 commit into
masterfrom
claude/engine-security-issues

Liohtml commented May 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Liohtml commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Liohtml commented May 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading