Skip to content

perf: use C-backed cryptography library for RC4 decryption#1255

Open
KRRT7 wants to merge 2 commits intopdfminer:masterfrom
KRRT7:arcfour-use-cryptography-backend
Open

perf: use C-backed cryptography library for RC4 decryption#1255
KRRT7 wants to merge 2 commits intopdfminer:masterfrom
KRRT7:arcfour-use-cryptography-backend

Conversation

@KRRT7
Copy link
Copy Markdown

@KRRT7 KRRT7 commented Mar 1, 2026

Summary

  • Replace the pure-Python byte-by-byte RC4 implementation with the C-backed ARC4 from the cryptography library (already a dependency)
  • The original used r += bytes((c ^ k,)) in a loop, allocating a new bytes object per byte of input

Replace the pure-Python byte-by-byte RC4 implementation with the
C-backed ARC4 from the cryptography library (already a dependency).

The original used `r += bytes((c ^ k,))` in a loop, allocating a
new bytes object per byte of input.
@KRRT7 KRRT7 force-pushed the arcfour-use-cryptography-backend branch from 5096279 to 06d7add Compare March 1, 2026 10:36
@KRRT7
Copy link
Copy Markdown
Author

KRRT7 commented Mar 1, 2026

Updated the test vectors in test_pdfminer_crypto.py — the original keys (b"Key", b"Wiki", b"Secret") are 3, 4, and 6 bytes which are not valid RC4 key sizes. The cryptography library only accepts keys in {40, 56, 64, 80, 128, 160, 192, 256} bits (i.e. 5, 7, 8, 10, 16, 20, 24, 32 bytes).

This doesn't affect real-world usage — PDF encryption key lengths (/Length parameter) are always a multiple of 8 in the range 40–256, and compute_encryption_key produces keys of length // 8 bytes, which all map to valid cryptography ARC4 key sizes. The per-object decrypt_rc4 path also lands on valid sizes (80 or 128 bits) since it truncates the MD5 digest to min(base_key_len + 5, 16) bytes.

@dhdaines
Copy link
Copy Markdown
Contributor

This seems reasonable, particularly since rolling one's own encryption code is usually a bad idea, and also as you say, the cryptography library is already a dependency (one that unfortunately doesn't build and install everywhere, but that's not a new problem).

Have you verified this on real-world "encrypted" (because RC4...) PDFs? I wouldn't trust them to actually follow the standard, but if compute_encryption_key is guaranteed to produce valid keys anyway, that's probably okay.

@dhdaines dhdaines requested a review from pietermarsman March 13, 2026 13:51
@KRRT7
Copy link
Copy Markdown
Author

KRRT7 commented Mar 23, 2026

Yes — verified on real-world PDFs.

A/B testing: Took 8 PDFs (English, Japanese, mixed-script, various sizes) and encrypted each with both RC4-40 and RC4-128 via qpdf --allow-weak-crypto. Extracted text on master and on this branch — SHA-256 hashes match for all 16 files. The existing rc4-40.pdf and rc4-128.pdf repo samples also decrypt identically on both branches.

We also ran a smoke test over all the PDFs in unstructured-inference/sample-docs and internally we've tested against a much larger and more varied corpus. No regressions.

This is part of an ongoing effort to optimize Unstructured — we identified the pure-Python RC4 implementation in the hot path and the cryptography backend is a straightforward drop-in.

Re: key sizes — compute_encryption_key always produces keys whose length is key_length // 8 where key_length is a multiple of 8 in [40, 256], and the per-object decrypt_rc4 path truncates the MD5 digest to min(base_key_len + 5, 16) bytes — both land on valid cryptography ARC4 key sizes. No code path can produce an invalid key from a real PDF.

@dhdaines
Copy link
Copy Markdown
Contributor

A/B testing: Took 8 PDFs (English, Japanese, mixed-script, various sizes) and encrypted each with both RC4-40 and RC4-128 via qpdf --allow-weak-crypto. Extracted text on master and on this branch — SHA-256 hashes match for all 16 files. The existing rc4-40.pdf and rc4-128.pdf repo samples also decrypt identically on both branches.

Sounds good to me - the only possible issue I can see with this is that cryptography has no support for some platforms in its latest release, so we'll need to pin the dependency in order to support Windows on ARM: pyca/cryptography#14249

This is part of an ongoing effort to optimize Unstructured — we identified the pure-Python RC4 implementation in the hot path and the cryptography backend is a straightforward drop-in.

Ah, okay! You could also use PLAYA-PDF which is a fork of pdfminer.six that implements its layout analysis algorithm, but contains numerous performance optimizations and has a much more ergonomic API, as I suggested a while back in a PR 😉

@KRRT7
Copy link
Copy Markdown
Author

KRRT7 commented Mar 24, 2026

Ah, okay! You could also use PLAYA-PDF which is a fork of pdfminer.six that implements its layout analysis algorithm, but contains numerous performance optimizations and has a much more ergonomic API, as I suggested a while back in a PR 😉

I'll take a look at it, thank you

@dhdaines
Copy link
Copy Markdown
Contributor

Ah, okay! You could also use PLAYA-PDF which is a fork of pdfminer.six that implements its layout analysis algorithm, but contains numerous performance optimizations and has a much more ergonomic API, as I suggested a while back in a PR 😉

I'll take a look at it, thank you

In the meantime I think we can merge this and some of the other performance PRs - what do you think @pietermarsman ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants