perf: use C-backed cryptography library for RC4 decryption by KRRT7 · Pull Request #1255 · pdfminer/pdfminer.six

KRRT7 · 2026-03-01T10:30:14Z

Summary

Replace the pure-Python byte-by-byte RC4 implementation with the C-backed ARC4 from the cryptography library (already a dependency)
The original used r += bytes((c ^ k,)) in a loop, allocating a new bytes object per byte of input

Replace the pure-Python byte-by-byte RC4 implementation with the C-backed ARC4 from the cryptography library (already a dependency). The original used `r += bytes((c ^ k,))` in a loop, allocating a new bytes object per byte of input.

KRRT7 · 2026-03-01T10:38:52Z

Updated the test vectors in test_pdfminer_crypto.py — the original keys (b"Key", b"Wiki", b"Secret") are 3, 4, and 6 bytes which are not valid RC4 key sizes. The cryptography library only accepts keys in {40, 56, 64, 80, 128, 160, 192, 256} bits (i.e. 5, 7, 8, 10, 16, 20, 24, 32 bytes).

This doesn't affect real-world usage — PDF encryption key lengths (/Length parameter) are always a multiple of 8 in the range 40–256, and compute_encryption_key produces keys of length // 8 bytes, which all map to valid cryptography ARC4 key sizes. The per-object decrypt_rc4 path also lands on valid sizes (80 or 128 bits) since it truncates the MD5 digest to min(base_key_len + 5, 16) bytes.

dhdaines · 2026-03-13T03:03:19Z

This seems reasonable, particularly since rolling one's own encryption code is usually a bad idea, and also as you say, the cryptography library is already a dependency (one that unfortunately doesn't build and install everywhere, but that's not a new problem).

Have you verified this on real-world "encrypted" (because RC4...) PDFs? I wouldn't trust them to actually follow the standard, but if compute_encryption_key is guaranteed to produce valid keys anyway, that's probably okay.

KRRT7 · 2026-03-23T02:57:57Z

Yes — verified on real-world PDFs.

A/B testing: Took 8 PDFs (English, Japanese, mixed-script, various sizes) and encrypted each with both RC4-40 and RC4-128 via qpdf --allow-weak-crypto. Extracted text on master and on this branch — SHA-256 hashes match for all 16 files. The existing rc4-40.pdf and rc4-128.pdf repo samples also decrypt identically on both branches.

We also ran a smoke test over all the PDFs in unstructured-inference/sample-docs and internally we've tested against a much larger and more varied corpus. No regressions.

This is part of an ongoing effort to optimize Unstructured — we identified the pure-Python RC4 implementation in the hot path and the cryptography backend is a straightforward drop-in.

Re: key sizes — compute_encryption_key always produces keys whose length is key_length // 8 where key_length is a multiple of 8 in [40, 256], and the per-object decrypt_rc4 path truncates the MD5 digest to min(base_key_len + 5, 16) bytes — both land on valid cryptography ARC4 key sizes. No code path can produce an invalid key from a real PDF.

dhdaines · 2026-03-23T20:44:06Z

A/B testing: Took 8 PDFs (English, Japanese, mixed-script, various sizes) and encrypted each with both RC4-40 and RC4-128 via qpdf --allow-weak-crypto. Extracted text on master and on this branch — SHA-256 hashes match for all 16 files. The existing rc4-40.pdf and rc4-128.pdf repo samples also decrypt identically on both branches.

Sounds good to me - the only possible issue I can see with this is that cryptography has no support for some platforms in its latest release, so we'll need to pin the dependency in order to support Windows on ARM: pyca/cryptography#14249

This is part of an ongoing effort to optimize Unstructured — we identified the pure-Python RC4 implementation in the hot path and the cryptography backend is a straightforward drop-in.

Ah, okay! You could also use PLAYA-PDF which is a fork of pdfminer.six that implements its layout analysis algorithm, but contains numerous performance optimizations and has a much more ergonomic API, as I suggested a while back in a PR 😉

KRRT7 · 2026-03-24T05:39:13Z

Ah, okay! You could also use PLAYA-PDF which is a fork of pdfminer.six that implements its layout analysis algorithm, but contains numerous performance optimizations and has a much more ergonomic API, as I suggested a while back in a PR 😉

I'll take a look at it, thank you

dhdaines · 2026-03-24T11:25:47Z

Ah, okay! You could also use PLAYA-PDF which is a fork of pdfminer.six that implements its layout analysis algorithm, but contains numerous performance optimizations and has a much more ergonomic API, as I suggested a while back in a PR 😉

I'll take a look at it, thank you

In the meantime I think we can merge this and some of the other performance PRs - what do you think @pietermarsman ?

KRRT7 force-pushed the arcfour-use-cryptography-backend branch from 5096279 to 06d7add Compare March 1, 2026 10:36

dhdaines requested a review from pietermarsman March 13, 2026 13:51

Merge branch 'master' into arcfour-use-cryptography-backend

8080bee

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: use C-backed cryptography library for RC4 decryption#1255

perf: use C-backed cryptography library for RC4 decryption#1255
KRRT7 wants to merge 2 commits intopdfminer:masterfrom
KRRT7:arcfour-use-cryptography-backend

KRRT7 commented Mar 1, 2026

Uh oh!

KRRT7 commented Mar 1, 2026

Uh oh!

dhdaines commented Mar 13, 2026

Uh oh!

KRRT7 commented Mar 23, 2026

Uh oh!

dhdaines commented Mar 23, 2026

Uh oh!

KRRT7 commented Mar 24, 2026

Uh oh!

dhdaines commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

KRRT7 commented Mar 1, 2026

Summary

Uh oh!

KRRT7 commented Mar 1, 2026

Uh oh!

dhdaines commented Mar 13, 2026

Uh oh!

KRRT7 commented Mar 23, 2026

Uh oh!

dhdaines commented Mar 23, 2026

Uh oh!

KRRT7 commented Mar 24, 2026

Uh oh!

dhdaines commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants