Skip to content

Scan compressed pickle artifacts#345

Open
massy-o wants to merge 1 commit into
protectai:mainfrom
massy-o:scan-compressed-pickle-artifacts
Open

Scan compressed pickle artifacts#345
massy-o wants to merge 1 commit into
protectai:mainfrom
massy-o:scan-compressed-pickle-artifacts

Conversation

@massy-o
Copy link
Copy Markdown

@massy-o massy-o commented May 14, 2026

Summary

  • recognize compressed pickle/joblib/dill extensions such as .joblib.gz, .pkl.xz, and .dill.bz2
  • decompress supported compressed pickle streams before running pickle opcode scanning
  • update compatibility checks to handle compound suffixes instead of only Path.suffix
  • avoid removed NumPy private header parsing helpers by using public NumPy header readers
  • add regression coverage for gzip-compressed .joblib.gz scanning

Tests

  • uv run --with-editable . --with pytest --with dill --with requests --with aiohttp --with torch --with tf-keras pytest tests/test_modelscan.py::test_scan_file_path tests/test_modelscan.py::test_scan_numpy
  • uv run --with black black --check modelscan/middlewares/format_via_extension.py modelscan/modelscan.py modelscan/settings.py modelscan/tools/picklescanner.py tests/test_modelscan.py
  • locally verified compressed .joblib.gz and object-array .npy samples are reported as CRITICAL with this branch

@massy-o
Copy link
Copy Markdown
Author

massy-o commented May 14, 2026

Self-review notes:

  • The compressed pickle path currently decompresses the full stream into memory before passing it to pickletools.genops. I kept that approach to make the fix small and consistent with the existing byte-stream scanner, but a follow-up could replace this with streaming decompression if very large compressed artifacts are a concern.
  • Compound suffix handling is now based on str(source).endswith(...) so formats like .joblib.gz are recognized by the same settings lists used by scanner compatibility. This intentionally keeps extension policy in settings, but it does mean future compressed pickle suffixes should be added in both scanner and middleware settings entries.
  • I also removed the NumPy private _check_version / _read_array_header calls while touching this scanner path, because current NumPy no longer exposes them and the existing test_scan_numpy catches that regression.

decompress = COMPRESSED_PICKLE_SUFFIXES.get(model.get_source().suffix)
if decompress is not None:
try:
stream = io.BytesIO(decompress(stream.read()))
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review: decompression is intentionally placed before _list_globals() so the existing unsafe-global detection remains unchanged. This keeps the patch small, but very large compressed artifacts could justify a streaming decompression follow-up.

"""Disassemble a Pickle stream and report issues"""
issues: List[Issue] = []
stream = model.get_stream(offset)
decompress = COMPRESSED_PICKLE_SUFFIXES.get(model.get_source().suffix)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review: this uses the outer compound extension to decide whether to decompress before pickle opcode scanning. That directly covers valid joblib.dump(..., compress=...) artifacts such as .joblib.gz; the tradeoff is that compression policy stays extension-based rather than magic-byte based.

Comment thread tests/test_modelscan.py
),
),
}
results = compressed_joblib.scan(Path(f"{file_path}/data/malicious16.joblib.gz"))
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review: this regression asserts the formerly skipped .joblib.gz artifact now goes through the pickle scanner and reports the embedded posix.system payload as CRITICAL, matching the reported scanner-bypass path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant