Scan compressed pickle artifacts by massy-o · Pull Request #345 · protectai/modelscan

massy-o · 2026-05-14T05:52:34Z

Summary

recognize compressed pickle/joblib/dill extensions such as .joblib.gz, .pkl.xz, and .dill.bz2
decompress supported compressed pickle streams before running pickle opcode scanning
update compatibility checks to handle compound suffixes instead of only Path.suffix
avoid removed NumPy private header parsing helpers by using public NumPy header readers
add regression coverage for gzip-compressed .joblib.gz scanning

Tests

uv run --with-editable . --with pytest --with dill --with requests --with aiohttp --with torch --with tf-keras pytest tests/test_modelscan.py::test_scan_file_path tests/test_modelscan.py::test_scan_numpy
uv run --with black black --check modelscan/middlewares/format_via_extension.py modelscan/modelscan.py modelscan/settings.py modelscan/tools/picklescanner.py tests/test_modelscan.py
locally verified compressed .joblib.gz and object-array .npy samples are reported as CRITICAL with this branch

massy-o · 2026-05-14T05:55:12Z

Self-review notes:

The compressed pickle path currently decompresses the full stream into memory before passing it to pickletools.genops. I kept that approach to make the fix small and consistent with the existing byte-stream scanner, but a follow-up could replace this with streaming decompression if very large compressed artifacts are a concern.
Compound suffix handling is now based on str(source).endswith(...) so formats like .joblib.gz are recognized by the same settings lists used by scanner compatibility. This intentionally keeps extension policy in settings, but it does mean future compressed pickle suffixes should be added in both scanner and middleware settings entries.
I also removed the NumPy private _check_version / _read_array_header calls while touching this scanner path, because current NumPy no longer exposes them and the existing test_scan_numpy catches that regression.

massy-o · 2026-05-15T05:05:58Z

+    decompress = COMPRESSED_PICKLE_SUFFIXES.get(model.get_source().suffix)
+    if decompress is not None:
+        try:
+            stream = io.BytesIO(decompress(stream.read()))


Self-review: decompression is intentionally placed before _list_globals() so the existing unsafe-global detection remains unchanged. This keeps the patch small, but very large compressed artifacts could justify a streaming decompression follow-up.

massy-o · 2026-05-15T05:05:58Z

    """Disassemble a Pickle stream and report issues"""
    issues: List[Issue] = []
+    stream = model.get_stream(offset)
+    decompress = COMPRESSED_PICKLE_SUFFIXES.get(model.get_source().suffix)


Self-review: this uses the outer compound extension to decide whether to decompress before pickle opcode scanning. That directly covers valid joblib.dump(..., compress=...) artifacts such as .joblib.gz; the tradeoff is that compression policy stays extension-based rather than magic-byte based.

massy-o · 2026-05-15T05:05:58Z

+            ),
+        ),
+    }
+    results = compressed_joblib.scan(Path(f"{file_path}/data/malicious16.joblib.gz"))


Self-review: this regression asserts the formerly skipped .joblib.gz artifact now goes through the pickle scanner and reports the embedded posix.system payload as CRITICAL, matching the reported scanner-bypass path.

Scan compressed pickle artifacts

5f38596

massy-o commented May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scan compressed pickle artifacts#345

Scan compressed pickle artifacts#345
massy-o wants to merge 1 commit into
protectai:mainfrom
massy-o:scan-compressed-pickle-artifacts

massy-o commented May 14, 2026

Uh oh!

massy-o commented May 14, 2026

Uh oh!

massy-o May 15, 2026

Uh oh!

massy-o May 15, 2026

Uh oh!

massy-o May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

massy-o commented May 14, 2026

Summary

Tests

Uh oh!

massy-o commented May 14, 2026

Uh oh!

massy-o May 15, 2026

Choose a reason for hiding this comment

Uh oh!

massy-o May 15, 2026

Choose a reason for hiding this comment

Uh oh!

massy-o May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant