fix: bound XZ decompression memory in r_serialized scanner #713

Merged
mldangelo merged 6 commits into main from codex/fix-unbounded-lzma-decompression-issue on Mar 18, 2026

fix: bound XZ decompression memory in r_serialized scanner#713
mldangelo merged 6 commits intomainfrom
codex/fix-unbounded-lzma-decompression-issue

Conversation

@mldangelo
Member

@mldangelo mldangelo commented Mar 16, 2026

Motivation

  • The R serialized scanner used lzma.open() for XZ-compressed files without limiting the decompressor's memory, allowing a crafted XZ header (large dictionary) to force large allocations and cause a Denial-of-Service.

Description

  • Add class constants _XZ_DECOMPRESS_MEMLIMIT and _XZ_READ_CHUNK_SIZE and implement _read_xz_with_memlimit() that uses lzma.LZMADecompressor(..., memlimit=...) to keep allocations bounded.
  • Replace direct lzma.open() usage in both _read_decompressed_prefix() and _read_payload_for_analysis() with the bounded helper to protect prefix probing and full payload reads.
  • Update scan() decompression error handling to include MemoryError so allocation failures are reported as a fail-closed decompression check.
  • Add a regression test test_scan_xz_memory_limited_stream_is_handled_fail_closed that crafts an XZ payload with a large LZMA2 dictionary and verifies the scanner fails safely when configured with a low decompression budget.
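The bounded helper described above can be sketched roughly as follows. The constant values, function shape, and names here are illustrative assumptions for exposition, not the PR's exact code:

```python
import lzma

# Illustrative limits; the PR's actual _XZ_DECOMPRESS_MEMLIMIT and
# _XZ_READ_CHUNK_SIZE values may differ.
XZ_DECOMPRESS_MEMLIMIT = 64 * 1024 * 1024  # cap decoder memory at 64 MiB
XZ_READ_CHUNK_SIZE = 64 * 1024             # feed compressed input 64 KiB at a time


def read_xz_with_memlimit(path: str, output_limit: int) -> bytes:
    """Decompress at most output_limit bytes of an .xz file with a bounded decoder."""
    decompressor = lzma.LZMADecompressor(
        format=lzma.FORMAT_XZ, memlimit=XZ_DECOMPRESS_MEMLIMIT
    )
    output = bytearray()
    with open(path, "rb") as fh:
        while len(output) < output_limit and not decompressor.eof:
            chunk = fh.read(XZ_READ_CHUNK_SIZE)
            if not chunk:
                break
            # A crafted header declaring a huge dictionary raises
            # lzma.LZMAError here instead of allocating past the budget.
            output += decompressor.decompress(
                chunk, max_length=output_limit - len(output)
            )
    return bytes(output)
```

In this shape, an lzma.LZMAError or MemoryError raised during the read is what scan() would report as a fail-closed decompression check.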

Testing

  • Ran linters and format checks with uv run ruff check modelaudit/ tests/ and uv run ruff format --check modelaudit/ tests/ which passed.
  • Ran type checks with uv run mypy modelaudit/ which passed.
  • Ran the focused tests with uv run pytest tests/scanners/test_r_serialized_scanner.py -q which passed (10 passed).
  • Ran the wider test suite with uv run pytest -n auto -m "not slow and not integration" --maxfail=1 which still fails due to an unrelated pre-existing test (tests/utils/helpers/test_secure_hasher.py::TestErrorHandling::test_hash_permission_denied), not caused by these changes.

Codex Task

Summary by CodeRabbit

  • New Features

    • Added memory-bounded XZ decompression for R-serialized payloads with chunked processing and decompression checks to avoid resource exhaustion.
  • Bug Fixes

    • Improved handling for truncated, concatenated, and oversized XZ streams, including graceful failure when memory limits are exceeded.
  • Tests

    • Added tests covering XZ decompression, memory limit enforcement, truncated streams, concatenated streams, and malicious payload detection.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title directly and specifically describes the main change: adding memory bounds to XZ decompression in the R serialized scanner, the primary security fix in this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 185-193: The XZ branch currently calls _read_xz_with_memlimit
directly and skips the decompression-ratio guard in _read_decompressed_stream,
allowing oversized decompressed payloads; modify the XZ path to route the
decompressed data through the same decompression-ratio check used by
_read_decompressed_stream (or replicate that guard) by comparing
total_decompressed against max_decompressed_bytes and max_scan_bytes (use
max_decompressed_bytes for mem limits and set truncated = total_decompressed >
max_scan_bytes) before slicing to max_scan_bytes, ensuring
_read_xz_with_memlimit returns both the raw decompressed length and the payload
so the same truncation/ratio logic can apply.
- Around line 149-166: The _read_xz_with_memlimit helper currently returns
partial output if the XZ input ends before the LZMADecompressor signals eof and
it also bypasses the decompression-ratio guard used elsewhere; update
_read_xz_with_memlimit to (1) track compressed bytes read and decompressed bytes
produced and enforce the existing max_decompression_ratio limit (same semantics
as _read_decompressed_stream), and (2) treat premature end-of-input as an error
by raising an exception when the file is exhausted but decompressor.eof is False
(instead of returning partial bytes). Then update the XZ branch in
_read_payload_for_analysis to use this corrected _read_xz_with_memlimit (or
apply the same ratio/enforcement logic inline) so XZ is protected consistently
with gzip/bzip2; reference functions/fields: _read_xz_with_memlimit,
_read_payload_for_analysis, _read_decompressed_stream, max_decompression_ratio,
and _XZ_READ_CHUNK_SIZE.
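One way to combine the two suggestions above (ratio enforcement with the same semantics as _read_decompressed_stream, and treating premature end-of-input as an error) is sketched below. The function name, return shape, and limits are illustrative, not the repository's actual code:

```python
import lzma

CHUNK_SIZE = 64 * 1024  # illustrative stand-in for _XZ_READ_CHUNK_SIZE


def read_xz_bounded(
    path: str, output_limit: int, memlimit: int, max_ratio: float
) -> tuple[bytes, bool]:
    """Chunked XZ read that enforces a decompression-ratio cap and
    rejects streams that end before the XZ end-of-stream marker."""
    decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ, memlimit=memlimit)
    compressed_read = 0
    output = bytearray()
    with open(path, "rb") as fh:
        while not decomp.eof:
            chunk = fh.read(CHUNK_SIZE)
            if not chunk:
                # Input exhausted before the stream finished: fail closed
                # instead of returning partial output.
                raise lzma.LZMAError("truncated XZ stream")
            compressed_read += len(chunk)
            output += decomp.decompress(
                chunk, max_length=output_limit - len(output)
            )
            if len(output) > max_ratio * compressed_read:
                raise lzma.LZMAError("decompression ratio limit exceeded")
            if len(output) >= output_limit:
                return bytes(output), True  # truncated at the output budget
    return bytes(output), False
```

Returning the truncation flag alongside the payload lets the caller apply the same truncation/ratio bookkeeping that _read_decompressed_stream uses for gzip and bzip2.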

In `@tests/scanners/test_r_serialized_scanner.py`:
- Around line 171-183: Add a new test in
tests/scanners/test_r_serialized_scanner.py that mirrors
test_scan_xz_memory_limited_stream_is_handled_fail_closed but uses a scanner
config with a sufficiently large r_max_decompressed_bytes (or default) so the
XZ-compressed RDS produced by _write_xz_r_serialized(path, "safe", dict_size=1
<< 24) is successfully scanned; assert RSerializedScanner.can_handle(str(path)),
that result.success is True, and that the "R Serialized Decompression" check
(use _check_by_name(result, "R Serialized Decompression")) exists and has status
CheckStatus.PASSED to ensure benign XZ handling is preserved by
RSerializedScanner.scan.
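For reference, an XZ payload whose LZMA2 filter declares an oversized dictionary (the kind of stream the _write_xz_r_serialized test helper mentioned above would produce) can be built with the standard library alone. The memlimit values below are illustrative:

```python
import lzma

# Declare a 16 MiB LZMA2 dictionary; the decoder must reserve roughly
# dict_size bytes, so this header field is what drives decoder memory.
filters = [{"id": lzma.FILTER_LZMA2, "dict_size": 1 << 24}]
payload = lzma.compress(b"safe", format=lzma.FORMAT_XZ, filters=filters)

# A decompressor whose budget is below the declared dictionary rejects
# the stream up front rather than allocating:
decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ, memlimit=1 << 20)
try:
    decomp.decompress(payload)
except lzma.LZMAError:
    print("rejected by memory limit")  # the fail-closed path the test asserts on
```

With a generous memlimit (anything above the declared dictionary plus decoder overhead), the same payload decompresses normally, which is what the suggested benign-path test checks.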

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 51d8d029-0701-4dfc-911c-8f3ce12e6db7

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and 04c1d9b.

📒 Files selected for processing (2)
  • modelaudit/scanners/r_serialized_scanner.py
  • tests/scanners/test_r_serialized_scanner.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 146-153: The call to cls._read_xz_with_memlimit in
r_serialized_scanner.py uses a magic literal max_decompression_ratio=250.0;
replace this literal with a descriptive class-level constant (e.g.
_XZ_MAX_DECOMPRESSION_RATIO) declared on the same class, set to 250.0, and use
that constant in the cls._read_xz_with_memlimit invocation so the ratio is
configurable and tracked with other XZ-related limits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d22c7e9-e0cd-4dcc-a856-6296b10cd54b

📥 Commits

Reviewing files that changed from the base of the PR and between 481d4a5 and 143ce8e.

📒 Files selected for processing (2)
  • modelaudit/scanners/r_serialized_scanner.py
  • tests/scanners/test_r_serialized_scanner.py

Comment on lines +146 to +153

    prefix, _truncated, _total_decompressed = cls._read_xz_with_memlimit(
        path=path,
        output_limit=limit,
        memlimit=cls._XZ_DECOMPRESS_MEMLIMIT,
        compressed_size=os.path.getsize(path),
        max_decompressed_bytes=limit + cls._XZ_READ_CHUNK_SIZE,
        max_decompression_ratio=250.0,
    )

🧹 Nitpick | 🔵 Trivial

Replace the hard-coded decompression ratio with a named constant.

max_decompression_ratio=250.0 here is a magic number and can drift from scanner defaults/config intent; promoting it to a class constant would reduce maintenance risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/r_serialized_scanner.py` around lines 146 - 153, The call
to cls._read_xz_with_memlimit in r_serialized_scanner.py uses a magic literal
max_decompression_ratio=250.0; replace this literal with a descriptive
class-level constant (e.g. _XZ_MAX_DECOMPRESSION_RATIO) declared on the same
class, set to 250.0, and use that constant in the cls._read_xz_with_memlimit
invocation so the ratio is configurable and tracked with other XZ-related
limits.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/r_serialized_scanner.py`:
- Around line 106-110: The can_handle probe currently catches EOFError, OSError,
gzip.BadGzipFile, and lzma.LZMAError when calling cls._read_decompressed_prefix
but omits MemoryError, which can cause probe-time allocation failures to escape
and stop fail-open routing; update the except clause in
ModelSerializedScanner.can_handle (the try around
cls._read_decompressed_prefix(path, compression,
cls._CAN_HANDLE_DECOMPRESSED_LIMIT)) to also catch MemoryError so
corrupt/compressed wrappers and allocation failures both return True and
preserve the intended routing behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 88ac5c0c-a07c-49f2-90b2-02d88f6ec7f2

📥 Commits

Reviewing files that changed from the base of the PR and between 143ce8e and 010a723.

📒 Files selected for processing (2)
  • modelaudit/scanners/r_serialized_scanner.py
  • tests/scanners/test_r_serialized_scanner.py

Comment on lines 106 to 110

        try:
            prefix = cls._read_decompressed_prefix(path, compression, cls._CAN_HANDLE_DECOMPRESSED_LIMIT)
-       except (EOFError, OSError, ValueError, gzip.BadGzipFile, lzma.LZMAError):
+       except (EOFError, OSError, gzip.BadGzipFile, lzma.LZMAError):
            # Corrupt compressed wrappers should still route to this scanner.
            return True

⚠️ Potential issue | 🟠 Major

Catch probe-time memory failures in can_handle to preserve fail-open routing intent.

can_handle treats corrupt compressed wrappers as routable to this scanner, but MemoryError is not included in this probe exception list. A probe-time allocation failure can bypass that intent and abort routing unexpectedly.

💡 Proposed fix
-        except (EOFError, OSError, gzip.BadGzipFile, lzma.LZMAError):
+        except (EOFError, OSError, MemoryError, gzip.BadGzipFile, lzma.LZMAError):
             # Corrupt compressed wrappers should still route to this scanner.
             return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/r_serialized_scanner.py` around lines 106 - 110, The
can_handle probe currently catches EOFError, OSError, gzip.BadGzipFile, and
lzma.LZMAError when calling cls._read_decompressed_prefix but omits MemoryError,
which can cause probe-time allocation failures to escape and stop fail-open
routing; update the except clause in ModelSerializedScanner.can_handle (the try
around cls._read_decompressed_prefix(path, compression,
cls._CAN_HANDLE_DECOMPRESSED_LIMIT)) to also catch MemoryError so
corrupt/compressed wrappers and allocation failures both return True and
preserve the intended routing behavior.

@mldangelo mldangelo merged commit 26d5b44 into main on Mar 18, 2026
25 checks passed
@mldangelo mldangelo deleted the codex/fix-unbounded-lzma-decompression-issue branch on March 18, 2026 at 14:00