fix: bound XZ decompression memory in r_serialized scanner #713
Conversation
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 185-193: The XZ branch currently calls _read_xz_with_memlimit
directly and skips the decompression-ratio guard in _read_decompressed_stream,
allowing oversized decompressed payloads; modify the XZ path to route the
decompressed data through the same decompression-ratio check used by
_read_decompressed_stream (or replicate that guard) by comparing
total_decompressed against max_decompressed_bytes and max_scan_bytes (use
max_decompressed_bytes for mem limits and set truncated = total_decompressed >
max_scan_bytes) before slicing to max_scan_bytes, ensuring
_read_xz_with_memlimit returns both the raw decompressed length and the payload
so the same truncation/ratio logic can apply.
- Around line 149-166: The _read_xz_with_memlimit helper currently returns
partial output if the XZ input ends before the LZMADecompressor signals eof and
it also bypasses the decompression-ratio guard used elsewhere; update
_read_xz_with_memlimit to (1) track compressed bytes read and decompressed bytes
produced and enforce the existing max_decompression_ratio limit (same semantics
as _read_decompressed_stream), and (2) treat premature end-of-input as an error
by raising an exception when the file is exhausted but decompressor.eof is False
(instead of returning partial bytes). Then update the XZ branch in
_read_payload_for_analysis to use this corrected _read_xz_with_memlimit (or
apply the same ratio/enforcement logic inline) so XZ is protected consistently
with gzip/bzip2; reference functions/fields: _read_xz_with_memlimit,
_read_payload_for_analysis, _read_decompressed_stream, max_decompression_ratio,
and _XZ_READ_CHUNK_SIZE.
In `@tests/scanners/test_r_serialized_scanner.py`:
- Around line 171-183: Add a new test in
tests/scanners/test_r_serialized_scanner.py that mirrors
test_scan_xz_memory_limited_stream_is_handled_fail_closed but uses a scanner
config with a sufficiently large r_max_decompressed_bytes (or default) so the
XZ-compressed RDS produced by _write_xz_r_serialized(path, "safe", dict_size=1
<< 24) is successfully scanned; assert RSerializedScanner.can_handle(str(path)),
that result.success is True, and that the "R Serialized Decompression" check
(use _check_by_name(result, "R Serialized Decompression")) exists and has status
CheckStatus.PASSED to ensure benign XZ handling is preserved by
RSerializedScanner.scan.
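The project helpers (`_write_xz_r_serialized`, `RSerializedScanner`) are not shown here, but the XZ-side mechanics the suggested test relies on can be demonstrated standalone: a stream that declares a 16 MiB LZMA2 dictionary is refused by a tightly capped decompressor yet decompresses normally without a cap, which is exactly the benign-handling behavior the new test should preserve:

```python
import lzma

def xz_with_dict(data: bytes, dict_size: int) -> bytes:
    """Compress with an explicit LZMA2 dictionary size, mirroring the
    dict_size=1 << 24 stream the suggested test uses."""
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)

blob = xz_with_dict(b"safe", 1 << 24)  # declares a 16 MiB dictionary

# A memlimit below the declared dictionary size refuses the stream...
capped = lzma.LZMADecompressor(format=lzma.FORMAT_XZ, memlimit=1 << 20)
try:
    capped.decompress(blob)
except lzma.LZMAError:
    pass  # fail-closed path

# ...while an uncapped decompress handles the same blob fine.
assert lzma.decompress(blob) == b"safe"
```

With a generous `r_max_decompressed_bytes`, the scanner should land on the second path and the "R Serialized Decompression" check should pass.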
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 51d8d029-0701-4dfc-911c-8f3ce12e6db7
📒 Files selected for processing (2)
modelaudit/scanners/r_serialized_scanner.py
tests/scanners/test_r_serialized_scanner.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 146-153: The call to cls._read_xz_with_memlimit in
r_serialized_scanner.py uses a magic literal max_decompression_ratio=250.0;
replace this literal with a descriptive class-level constant (e.g.
_XZ_MAX_DECOMPRESSION_RATIO) declared on the same class, set to 250.0, and use
that constant in the cls._read_xz_with_memlimit invocation so the ratio is
configurable and tracked with other XZ-related limits.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1d22c7e9-e0cd-4dcc-a856-6296b10cd54b
📒 Files selected for processing (2)
modelaudit/scanners/r_serialized_scanner.py
tests/scanners/test_r_serialized_scanner.py
```python
prefix, _truncated, _total_decompressed = cls._read_xz_with_memlimit(
    path=path,
    output_limit=limit,
    memlimit=cls._XZ_DECOMPRESS_MEMLIMIT,
    compressed_size=os.path.getsize(path),
    max_decompressed_bytes=limit + cls._XZ_READ_CHUNK_SIZE,
    max_decompression_ratio=250.0,
)
```
🧹 Nitpick | 🔵 Trivial
Replace the hard-coded decompression ratio with a named constant.
max_decompression_ratio=250.0 here is a magic number and can drift from scanner defaults/config intent; promoting it to a class constant would reduce maintenance risk.
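The suggested promotion is a small mechanical change; a minimal skeleton (not the real scanner class) could look like:

```python
class RSerializedScanner:
    """Skeleton only: shows the constant promotion the nitpick suggests."""

    # Named class constant replacing the 250.0 literal at the call site,
    # kept alongside the other XZ-related limits.
    _XZ_MAX_DECOMPRESSION_RATIO = 250.0

    @classmethod
    def _ratio_for_xz(cls) -> float:
        # Call sites reference the constant instead of a magic number.
        return cls._XZ_MAX_DECOMPRESSION_RATIO
```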
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/r_serialized_scanner.py` around lines 146 - 153, The call
to cls._read_xz_with_memlimit in r_serialized_scanner.py uses a magic literal
max_decompression_ratio=250.0; replace this literal with a descriptive
class-level constant (e.g. _XZ_MAX_DECOMPRESSION_RATIO) declared on the same
class, set to 250.0, and use that constant in the cls._read_xz_with_memlimit
invocation so the ratio is configurable and tracked with other XZ-related
limits.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 106-110: The can_handle probe currently catches EOFError, OSError,
gzip.BadGzipFile, and lzma.LZMAError when calling cls._read_decompressed_prefix
but omits MemoryError, which can cause probe-time allocation failures to escape
and stop fail-open routing; update the except clause in
ModelSerializedScanner.can_handle (the try around
cls._read_decompressed_prefix(path, compression,
cls._CAN_HANDLE_DECOMPRESSED_LIMIT)) to also catch MemoryError so
corrupt/compressed wrappers and allocation failures both return True and
preserve the intended routing behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 88ac5c0c-a07c-49f2-90b2-02d88f6ec7f2
📒 Files selected for processing (2)
modelaudit/scanners/r_serialized_scanner.py
tests/scanners/test_r_serialized_scanner.py
```diff
         try:
             prefix = cls._read_decompressed_prefix(path, compression, cls._CAN_HANDLE_DECOMPRESSED_LIMIT)
-        except (EOFError, OSError, ValueError, gzip.BadGzipFile, lzma.LZMAError):
+        except (EOFError, OSError, gzip.BadGzipFile, lzma.LZMAError):
             # Corrupt compressed wrappers should still route to this scanner.
             return True
```
Catch probe-time memory failures in can_handle to preserve fail-open routing intent.
can_handle treats corrupt compressed wrappers as routable to this scanner, but MemoryError is not included in this probe exception list. A probe-time allocation failure can bypass that intent and abort routing unexpectedly.
💡 Proposed fix

```diff
-        except (EOFError, OSError, gzip.BadGzipFile, lzma.LZMAError):
+        except (EOFError, OSError, MemoryError, gzip.BadGzipFile, lzma.LZMAError):
             # Corrupt compressed wrappers should still route to this scanner.
             return True
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/r_serialized_scanner.py` around lines 106 - 110, The
can_handle probe currently catches EOFError, OSError, gzip.BadGzipFile, and
lzma.LZMAError when calling cls._read_decompressed_prefix but omits MemoryError,
which can cause probe-time allocation failures to escape and stop fail-open
routing; update the except clause in ModelSerializedScanner.can_handle (the try
around cls._read_decompressed_prefix(path, compression,
cls._CAN_HANDLE_DECOMPRESSED_LIMIT)) to also catch MemoryError so
corrupt/compressed wrappers and allocation failures both return True and
preserve the intended routing behavior.
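The fail-open intent described above can be sketched with a standalone probe function. The helper names and parameters here are illustrative stand-ins for `can_handle` and `_read_decompressed_prefix`:

```python
import gzip
import lzma

def can_handle_probe(read_prefix, looks_like_rds) -> bool:
    """Fail open: probe-time errors, now including MemoryError, still
    claim the file for this scanner so it can report the problem itself."""
    try:
        prefix = read_prefix()
    except (EOFError, OSError, MemoryError, gzip.BadGzipFile, lzma.LZMAError):
        # Corrupt wrappers and allocation failures route here anyway.
        return True
    return looks_like_rds(prefix)

def failing_probe() -> bytes:
    raise MemoryError("probe-time allocation failure")
```

Without `MemoryError` in the tuple, `failing_probe` would propagate out of routing instead of returning `True`.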
Motivation
The scanner previously called `lzma.open()` for XZ-compressed files without limiting the decompressor's memory, allowing a crafted XZ header (large dictionary) to force large allocations and cause a denial of service.

Description
- Add `_XZ_DECOMPRESS_MEMLIMIT` and `_XZ_READ_CHUNK_SIZE` and implement `_read_xz_with_memlimit()`, which uses `lzma.LZMADecompressor(..., memlimit=...)` to keep allocations bounded.
- Replace the `lzma.open()` usage in both `_read_decompressed_prefix()` and `_read_payload_for_analysis()` with the bounded helper to protect prefix probing and full payload reads.
- Extend `scan()` decompression error handling to include `MemoryError` so allocation failures are reported as a fail-closed decompression check.
- Add `test_scan_xz_memory_limited_stream_is_handled_fail_closed`, which crafts an XZ payload with a large LZMA2 dictionary and verifies the scanner fails safely when configured with a low decompression budget.
Testing

- `uv run ruff check modelaudit/ tests/` and `uv run ruff format --check modelaudit/ tests/`, which passed.
- `uv run mypy modelaudit/`, which passed.
- `uv run pytest tests/scanners/test_r_serialized_scanner.py -q`, which passed (10 passed).
- `uv run pytest -n auto -m "not slow and not integration" --maxfail=1`, which still fails due to an unrelated pre-existing test (`tests/utils/helpers/test_secure_hasher.py::TestErrorHandling::test_hash_permission_denied`), not caused by these changes.

Codex Task