fix(security): bound pickle metadata reads in metadata extraction #712
Conversation
Walkthrough

Adds a configurable read-size cap for pickle metadata extraction: validates the configured limit (> 0), checks the file size against it, performs a capped read (limit + 1) to detect over-reads, infers the protocol from the bytes read, and raises ValueError when limits are exceeded.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Scanner as PickleScanner
    participant Config as Configuration
    participant FileIO as File I/O
    participant Validator as Validator
    participant Metadata as MetadataExtractor

    Scanner->>Config: read max_metadata_read_size
    Config-->>Scanner: return limit
    Scanner->>FileIO: stat file -> size
    FileIO-->>Scanner: return size
    Scanner->>Validator: ensure limit > 0
    Validator-->>Scanner: ok / raise ValueError
    Scanner->>Validator: compare file size <= limit
    alt file size > limit
        Validator-->>Scanner: raise ValueError
    else file size <= limit
        Scanner->>FileIO: read up to (limit + 1) bytes
        FileIO-->>Scanner: return pickle_data
        Scanner->>Validator: ensure len(pickle_data) <= limit
        alt read exceeded limit
            Validator-->>Scanner: raise ValueError
        else within limit
            Scanner->>Scanner: detect protocol from pickle_data
            Scanner->>Metadata: extract metadata from limited data
            Metadata-->>Scanner: return metadata
        end
    end
```
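The flow in the diagram can be sketched as a small standalone function. This is an illustrative sketch only: the name `read_pickle_head` and the exact messages are assumptions, not the PR's actual code; only the validate → stat → capped read → over-read check sequence mirrors the change.

```python
import os

# Illustrative sketch of the capped-read flow; names and messages are
# hypothetical, only the logic mirrors the PR's sequence diagram.
DEFAULT_LIMIT = 10 * 1024 * 1024  # 10 MiB, the default described in the PR

def read_pickle_head(file_path: str, limit: int = DEFAULT_LIMIT) -> bytes:
    # Validator: reject non-positive limits before any I/O
    if limit <= 0:
        raise ValueError(f"Invalid pickle metadata read limit: {limit} (must be greater than 0)")
    # Validator: compare file size against the limit
    size = os.path.getsize(file_path)
    if size > limit:
        raise ValueError(f"file size {size} exceeds read limit {limit}")
    # File I/O: read limit + 1 bytes so an over-read is detectable
    with open(file_path, "rb") as f:
        data = f.read(limit + 1)
    if len(data) > limit:
        raise ValueError(f"read limit exceeded ({limit} bytes)")
    return data
```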
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 6129-6148: The code allows non-positive max_metadata_read_size
which causes f.read(-1) to read the entire file and bypass the guard; change the
logic in the pickle metadata read block (where max_metadata_read_size,
get_file_size, open(file_path) and pickle_data are used) to reject non-positive
values or coerce them to a safe default (e.g., 10*1024*1024) before reading;
specifically, validate max_metadata_read_size > 0 at the start of the try block
and if it is <= 0 raise a ValueError (or set it to the documented default) so
that f.read is always called with a bounded positive length and the subsequent
length checks remain effective.
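The guard this comment asks for matters because of how Python's `read()` treats negative sizes; a quick self-contained check:

```python
import io

# file.read() / BytesIO.read() treat a negative size as "read to EOF",
# so passing a non-positive configured limit straight through to read()
# would silently disable the cap rather than reading zero bytes.
buf = io.BytesIO(b"x" * 1024)
assert len(buf.read(-1)) == 1024  # -1 reads the entire stream
```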
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2bc940b8-4f9f-4bb4-9f56-2b5dbb99ec89
📒 Files selected for processing (2)
- modelaudit/scanners/pickle_scanner.py
- tests/test_metadata_extractor.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_metadata_extractor.py`:
- Around line 469-489: The parametrized test
test_pickle_metadata_enforces_read_limit currently only asserts failure cases;
add a valid success case (e.g., limit 256) to the ("limit", "expected_error")
matrix and update assertions so that when limit is the success value the test
asserts that PickleScanner({"max_metadata_pickle_read_size":
limit}).extract_metadata(...) does not contain "extraction_error" (i.e., assert
"extraction_error" not in metadata), using the same tmp_path pkl_file setup and
referencing PickleScanner and the max_metadata_pickle_read_size config key.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4e0768a7-e299-4842-8114-1b30f84ba28b
📒 Files selected for processing (2)
- modelaudit/scanners/pickle_scanner.py
- tests/test_metadata_extractor.py
```python
@pytest.mark.parametrize(
    ("limit", "expected_error"),
    [
        (64, "read limit exceeded"),
        (0, "must be greater than 0"),
        (-1, "must be greater than 0"),
    ],
)
def test_pickle_metadata_enforces_read_limit(self, tmp_path: Path, limit: int, expected_error: str) -> None:
    """Ensure pickle metadata extraction rejects oversized and invalid read limits."""
    from modelaudit.scanners.pickle_scanner import PickleScanner

    pkl_file = tmp_path / "oversized.pkl"
    pkl_file.write_bytes(b"x" * 128)

    scanner = PickleScanner({"max_metadata_pickle_read_size": limit})
    metadata = scanner.extract_metadata(str(pkl_file))

    assert "extraction_error" in metadata
    assert expected_error in metadata["extraction_error"]
```
🧹 Nitpick | 🔵 Trivial
Add a success-path case to this limit matrix.
This parametrization only checks failing paths. Add one valid limit case (e.g., 256) and assert metadata extraction succeeds without extraction_error to prevent over-restrictive regressions.
Proposed test refinement:

```diff
 @pytest.mark.parametrize(
     ("limit", "expected_error"),
     [
         (64, "read limit exceeded"),
         (0, "must be greater than 0"),
         (-1, "must be greater than 0"),
+        (256, None),
     ],
 )
-def test_pickle_metadata_enforces_read_limit(self, tmp_path: Path, limit: int, expected_error: str) -> None:
+def test_pickle_metadata_enforces_read_limit(
+    self, tmp_path: Path, limit: int, expected_error: str | None
+) -> None:
@@
-    assert "extraction_error" in metadata
-    assert expected_error in metadata["extraction_error"]
+    if expected_error is None:
+        assert "extraction_error" not in metadata
+        assert metadata.get("pickle_size") == 128
+    else:
+        assert "extraction_error" in metadata
+        assert expected_error in metadata["extraction_error"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_metadata_extractor.py` around lines 469 - 489, The parametrized
test test_pickle_metadata_enforces_read_limit currently only asserts failure
cases; add a valid success case (e.g., limit 256) to the ("limit",
"expected_error") matrix and update assertions so that when limit is the success
value the test asserts that PickleScanner({"max_metadata_pickle_read_size":
limit}).extract_metadata(...) does not contain "extraction_error" (i.e., assert
"extraction_error" not in metadata), using the same tmp_path pkl_file setup and
referencing PickleScanner and the max_metadata_pickle_read_size config key.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 6885-6895: The code reads max_metadata_read_size from self.config
but only rejects non-positive values and allows arbitrarily large sizes; change
the logic around max_metadata_read_size in pickle_scanner.py (the variable
max_metadata_read_size, the config.get call) so that after parsing you enforce a
hard upper bound of 10 * 1024 * 1024 (10 MiB) — e.g., validate > 0 then apply
max_metadata_read_size = min(max_metadata_read_size, 10 * 1024 * 1024) (or
replace oversized values with the cap) to ensure metadata reads cannot exceed 10
MiB.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 38d7bd2d-53f6-42de-9347-8bad4c7dc0e8
📒 Files selected for processing (1)
modelaudit/scanners/pickle_scanner.py
```python
max_metadata_read_size = int(self.config.get("max_metadata_pickle_read_size", 10 * 1024 * 1024))

try:
    import pickle
    import pickletools
    from io import BytesIO

    if max_metadata_read_size <= 0:
        raise ValueError(
            f"Invalid pickle metadata read limit: {max_metadata_read_size} (must be greater than 0)"
        )
```
Enforce a hard 10 MiB upper bound for max_metadata_pickle_read_size.
The new validation rejects non-positive values, but any very large positive config still weakens the metadata DoS protection.
🔧 Proposed fix

```diff
 if max_metadata_read_size <= 0:
     raise ValueError(
         f"Invalid pickle metadata read limit: {max_metadata_read_size} (must be greater than 0)"
     )
+hard_cap = 10 * 1024 * 1024
+if max_metadata_read_size > hard_cap:
+    raise ValueError(
+        f"Invalid pickle metadata read limit: {max_metadata_read_size} (must be <= {hard_cap})"
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/pickle_scanner.py` around lines 6885 - 6895, The code
reads max_metadata_read_size from self.config but only rejects non-positive
values and allows arbitrarily large sizes; change the logic around
max_metadata_read_size in pickle_scanner.py (the variable
max_metadata_read_size, the config.get call) so that after parsing you enforce a
hard upper bound of 10 * 1024 * 1024 (10 MiB) — e.g., validate > 0 then apply
max_metadata_read_size = min(max_metadata_read_size, 10 * 1024 * 1024) (or
replace oversized values with the cap) to ensure metadata reads cannot exceed 10
MiB.
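A minimal sketch of the clamp this prompt describes. The helper name `resolve_read_limit` is hypothetical; it combines the existing positivity check with the suggested hard upper bound.

```python
HARD_CAP = 10 * 1024 * 1024  # 10 MiB, the documented default

def resolve_read_limit(config: dict) -> int:
    # Hypothetical helper: parse the configured limit, reject non-positive
    # values, then clamp oversized values instead of trusting the config.
    limit = int(config.get("max_metadata_pickle_read_size", HARD_CAP))
    if limit <= 0:
        raise ValueError(f"Invalid pickle metadata read limit: {limit} (must be greater than 0)")
    return min(limit, HARD_CAP)
```

Clamping with `min` keeps oversized configs working (at reduced scope), whereas the diff above raises instead; the review prompt allows either.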
Motivation
Metadata extraction previously performed an unbounded `f.read()` on arbitrarily large `.pkl` files.

Description

- Bound the pickle read in `PickleScanner.extract_metadata()` using a new config key `max_metadata_pickle_read_size` with a default of 10 MiB, and raise a `ValueError` when the file or read exceeds the limit.
- Read at most `max_metadata_pickle_read_size + 1` bytes and surface a clear `extraction_error` when the limit is exceeded, to avoid large allocations while preserving opcode analysis for small files.
- Add `test_pickle_metadata_enforces_read_limit` to `tests/test_metadata_extractor.py` that verifies oversized pickle metadata extraction is rejected.

Testing

- `uv run ruff format modelaudit/ tests/` (reformatted 2 files) and `uv run ruff check --fix modelaudit/ tests/` (passed).
- `uv run mypy modelaudit/` (passed with no issues).
- Manually verified that `PickleScanner({'max_metadata_pickle_read_size': 64}).extract_metadata()` returns an `extraction_error` for a 128-byte file, which passed.
- `uv run pytest -n auto -m "not slow and not integration" --maxfail=1` was attempted but encountered unrelated, pre-existing failures in other tests; the change-specific checks and linters passed.

Codex Task
Summary by CodeRabbit
New Features
Tests