fix: bound XZ decompression memory in r_serialized scanner #713

Merged
mldangelo merged 6 commits into main from codex/fix-unbounded-lzma-decompression-issue on Mar 18, 2026

fix: bound XZ decompression memory in r_serialized scanner#713
mldangelo merged 6 commits intomainfrom
codex/fix-unbounded-lzma-decompression-issue

Conversation

@mldangelo
Member

@mldangelo mldangelo commented Mar 16, 2026

Motivation

  • The R serialized scanner used lzma.open() for XZ-compressed files without limiting the decompressor's memory, allowing a crafted XZ header (large dictionary) to force large allocations and cause a Denial-of-Service.

Description

  • Add class constants _XZ_DECOMPRESS_MEMLIMIT and _XZ_READ_CHUNK_SIZE and implement _read_xz_with_memlimit() that uses lzma.LZMADecompressor(..., memlimit=...) to keep allocations bounded.
  • Replace direct lzma.open() usage in both _read_decompressed_prefix() and _read_payload_for_analysis() with the bounded helper to protect prefix probing and full payload reads.
  • Update scan() decompression error handling to include MemoryError so allocation failures are reported as a fail-closed decompression check.
  • Add a regression test test_scan_xz_memory_limited_stream_is_handled_fail_closed that crafts an XZ payload with a large LZMA2 dictionary and verifies the scanner fails safely when configured with a low decompression budget.
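The bounded helper described above can be sketched roughly as follows. The constant values, function shape, and names here are illustrative assumptions for exposition, not the PR's exact code:

```python
import lzma

# Illustrative limits; the PR's actual _XZ_DECOMPRESS_MEMLIMIT and
# _XZ_READ_CHUNK_SIZE values may differ.
XZ_DECOMPRESS_MEMLIMIT = 64 * 1024 * 1024  # cap decoder memory at 64 MiB
XZ_READ_CHUNK_SIZE = 64 * 1024             # feed compressed input 64 KiB at a time


def read_xz_with_memlimit(path: str, output_limit: int) -> bytes:
    """Decompress at most output_limit bytes of an .xz file with a bounded decoder."""
    decompressor = lzma.LZMADecompressor(
        format=lzma.FORMAT_XZ, memlimit=XZ_DECOMPRESS_MEMLIMIT
    )
    output = bytearray()
    with open(path, "rb") as fh:
        while len(output) < output_limit and not decompressor.eof:
            chunk = fh.read(XZ_READ_CHUNK_SIZE)
            if not chunk:
                break
            # A crafted header declaring a huge dictionary raises
            # lzma.LZMAError here instead of allocating past the budget.
            output += decompressor.decompress(
                chunk, max_length=output_limit - len(output)
            )
    return bytes(output)
```

In this shape, an lzma.LZMAError or MemoryError raised during the read is what scan() would report as a fail-closed decompression check.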

Testing

  • Ran linters and format checks with uv run ruff check modelaudit/ tests/ and uv run ruff format --check modelaudit/ tests/ which passed.
  • Ran type checks with uv run mypy modelaudit/ which passed.
  • Ran the focused tests with uv run pytest tests/scanners/test_r_serialized_scanner.py -q which passed (10 passed).
  • Ran the wider test suite with uv run pytest -n auto -m "not slow and not integration" --maxfail=1 which still fails due to an unrelated pre-existing test (tests/utils/helpers/test_secure_hasher.py::TestErrorHandling::test_hash_permission_denied), not caused by these changes.

Codex Task

Summary by CodeRabbit

  • New Features

    • Added memory-bounded XZ decompression for R-serialized payloads with chunked processing and decompression checks to avoid resource exhaustion.
  • Bug Fixes

    • Improved handling for truncated, concatenated, and oversized XZ streams, including graceful failure when memory limits are exceeded.
  • Tests

    • Added tests covering XZ decompression, memory limit enforcement, truncated streams, concatenated streams, and malicious payload detection.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title directly and specifically describes the main change: adding memory bounds to XZ decompression in the R serialized scanner, the primary security fix in this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 185-193: The XZ branch currently calls _read_xz_with_memlimit
directly and skips the decompression-ratio guard in _read_decompressed_stream,
allowing oversized decompressed payloads; modify the XZ path to route the
decompressed data through the same decompression-ratio check used by
_read_decompressed_stream (or replicate that guard) by comparing
total_decompressed against max_decompressed_bytes and max_scan_bytes (use
max_decompressed_bytes for mem limits and set truncated = total_decompressed >
max_scan_bytes) before slicing to max_scan_bytes, ensuring
_read_xz_with_memlimit returns both the raw decompressed length and the payload
so the same truncation/ratio logic can apply.
- Around line 149-166: The _read_xz_with_memlimit helper currently returns
partial output if the XZ input ends before the LZMADecompressor signals eof and
it also bypasses the decompression-ratio guard used elsewhere; update
_read_xz_with_memlimit to (1) track compressed bytes read and decompressed bytes
produced and enforce the existing max_decompression_ratio limit (same semantics
as _read_decompressed_stream), and (2) treat premature end-of-input as an error
by raising an exception when the file is exhausted but decompressor.eof is False
(instead of returning partial bytes). Then update the XZ branch in
_read_payload_for_analysis to use this corrected _read_xz_with_memlimit (or
apply the same ratio/enforcement logic inline) so XZ is protected consistently
with gzip/bzip2; reference functions/fields: _read_xz_with_memlimit,
_read_payload_for_analysis, _read_decompressed_stream, max_decompression_ratio,
and _XZ_READ_CHUNK_SIZE.
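One way to combine the two suggestions above (ratio enforcement with the same semantics as _read_decompressed_stream, and treating premature end-of-input as an error) is sketched below. The function name, return shape, and limits are illustrative, not the repository's actual code:

```python
import lzma

CHUNK_SIZE = 64 * 1024  # illustrative stand-in for _XZ_READ_CHUNK_SIZE


def read_xz_bounded(
    path: str, output_limit: int, memlimit: int, max_ratio: float
) -> tuple[bytes, bool]:
    """Chunked XZ read that enforces a decompression-ratio cap and
    rejects streams that end before the XZ end-of-stream marker."""
    decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ, memlimit=memlimit)
    compressed_read = 0
    output = bytearray()
    with open(path, "rb") as fh:
        while not decomp.eof:
            chunk = fh.read(CHUNK_SIZE)
            if not chunk:
                # Input exhausted before the stream finished: fail closed
                # instead of returning partial output.
                raise lzma.LZMAError("truncated XZ stream")
            compressed_read += len(chunk)
            output += decomp.decompress(
                chunk, max_length=output_limit - len(output)
            )
            if len(output) > max_ratio * compressed_read:
                raise lzma.LZMAError("decompression ratio limit exceeded")
            if len(output) >= output_limit:
                return bytes(output), True  # truncated at the output budget
    return bytes(output), False
```

Returning the truncation flag alongside the payload lets the caller apply the same truncation/ratio bookkeeping that _read_decompressed_stream uses for gzip and bzip2.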

In `@tests/scanners/test_r_serialized_scanner.py`:
- Around line 171-183: Add a new test in
tests/scanners/test_r_serialized_scanner.py that mirrors
test_scan_xz_memory_limited_stream_is_handled_fail_closed but uses a scanner
config with a sufficiently large r_max_decompressed_bytes (or default) so the
XZ-compressed RDS produced by _write_xz_r_serialized(path, "safe", dict_size=1
<< 24) is successfully scanned; assert RSerializedScanner.can_handle(str(path)),
that result.success is True, and that the "R Serialized Decompression" check
(use _check_by_name(result, "R Serialized Decompression")) exists and has status
CheckStatus.PASSED to ensure benign XZ handling is preserved by
RSerializedScanner.scan.
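For reference, an XZ payload whose LZMA2 filter declares an oversized dictionary (the kind of stream the _write_xz_r_serialized test helper mentioned above would produce) can be built with the standard library alone. The memlimit values below are illustrative:

```python
import lzma

# Declare a 16 MiB LZMA2 dictionary; the decoder must reserve roughly
# dict_size bytes, so this header field is what drives decoder memory.
filters = [{"id": lzma.FILTER_LZMA2, "dict_size": 1 << 24}]
payload = lzma.compress(b"safe", format=lzma.FORMAT_XZ, filters=filters)

# A decompressor whose budget is below the declared dictionary rejects
# the stream up front rather than allocating:
decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ, memlimit=1 << 20)
try:
    decomp.decompress(payload)
except lzma.LZMAError:
    print("rejected by memory limit")  # the fail-closed path the test asserts on
```

With a generous memlimit (anything above the declared dictionary plus decoder overhead), the same payload decompresses normally, which is what the suggested benign-path test checks.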

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 51d8d029-0701-4dfc-911c-8f3ce12e6db7

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and 04c1d9b.

📒 Files selected for processing (2)
  • modelaudit/scanners/r_serialized_scanner.py
  • tests/scanners/test_r_serialized_scanner.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 146-153: The call to cls._read_xz_with_memlimit in
r_serialized_scanner.py uses a magic literal max_decompression_ratio=250.0;
replace this literal with a descriptive class-level constant (e.g.
_XZ_MAX_DECOMPRESSION_RATIO) declared on the same class, set to 250.0, and use
that constant in the cls._read_xz_with_memlimit invocation so the ratio is
configurable and tracked with other XZ-related limits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d22c7e9-e0cd-4dcc-a856-6296b10cd54b

📥 Commits

Reviewing files that changed from the base of the PR and between 481d4a5 and 143ce8e.

📒 Files selected for processing (2)
  • modelaudit/scanners/r_serialized_scanner.py
  • tests/scanners/test_r_serialized_scanner.py

Comment on lines +146 to +153

    prefix, _truncated, _total_decompressed = cls._read_xz_with_memlimit(
        path=path,
        output_limit=limit,
        memlimit=cls._XZ_DECOMPRESS_MEMLIMIT,
        compressed_size=os.path.getsize(path),
        max_decompressed_bytes=limit + cls._XZ_READ_CHUNK_SIZE,
        max_decompression_ratio=250.0,
    )

🧹 Nitpick | 🔵 Trivial

Replace the hard-coded decompression ratio with a named constant.

max_decompression_ratio=250.0 here is a magic number and can drift from scanner defaults/config intent; promoting it to a class constant would reduce maintenance risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/r_serialized_scanner.py` around lines 146 - 153, The call
to cls._read_xz_with_memlimit in r_serialized_scanner.py uses a magic literal
max_decompression_ratio=250.0; replace this literal with a descriptive
class-level constant (e.g. _XZ_MAX_DECOMPRESSION_RATIO) declared on the same
class, set to 250.0, and use that constant in the cls._read_xz_with_memlimit
invocation so the ratio is configurable and tracked with other XZ-related
limits.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 106-110: The can_handle probe currently catches EOFError, OSError,
gzip.BadGzipFile, and lzma.LZMAError when calling cls._read_decompressed_prefix
but omits MemoryError, which can cause probe-time allocation failures to escape
and stop fail-open routing; update the except clause in
ModelSerializedScanner.can_handle (the try around
cls._read_decompressed_prefix(path, compression,
cls._CAN_HANDLE_DECOMPRESSED_LIMIT)) to also catch MemoryError so
corrupt/compressed wrappers and allocation failures both return True and
preserve the intended routing behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 88ac5c0c-a07c-49f2-90b2-02d88f6ec7f2

📥 Commits

Reviewing files that changed from the base of the PR and between 143ce8e and 010a723.

📒 Files selected for processing (2)
  • modelaudit/scanners/r_serialized_scanner.py
  • tests/scanners/test_r_serialized_scanner.py

Comment on lines 106 to 110

        try:
            prefix = cls._read_decompressed_prefix(path, compression, cls._CAN_HANDLE_DECOMPRESSED_LIMIT)
-       except (EOFError, OSError, ValueError, gzip.BadGzipFile, lzma.LZMAError):
+       except (EOFError, OSError, gzip.BadGzipFile, lzma.LZMAError):
            # Corrupt compressed wrappers should still route to this scanner.
            return True

⚠️ Potential issue | 🟠 Major

Catch probe-time memory failures in can_handle to preserve fail-open routing intent.

can_handle treats corrupt compressed wrappers as routable to this scanner, but MemoryError is not included in this probe exception list. A probe-time allocation failure can bypass that intent and abort routing unexpectedly.

💡 Proposed fix
-        except (EOFError, OSError, gzip.BadGzipFile, lzma.LZMAError):
+        except (EOFError, OSError, MemoryError, gzip.BadGzipFile, lzma.LZMAError):
             # Corrupt compressed wrappers should still route to this scanner.
             return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/r_serialized_scanner.py` around lines 106 - 110, The
can_handle probe currently catches EOFError, OSError, gzip.BadGzipFile, and
lzma.LZMAError when calling cls._read_decompressed_prefix but omits MemoryError,
which can cause probe-time allocation failures to escape and stop fail-open
routing; update the except clause in ModelSerializedScanner.can_handle (the try
around cls._read_decompressed_prefix(path, compression,
cls._CAN_HANDLE_DECOMPRESSED_LIMIT)) to also catch MemoryError so
corrupt/compressed wrappers and allocation failures both return True and
preserve the intended routing behavior.

@mldangelo mldangelo merged commit 26d5b44 into main on Mar 18, 2026
25 checks passed
@mldangelo mldangelo deleted the codex/fix-unbounded-lzma-decompression-issue branch on March 18, 2026 at 14:00