perf: use C-backed cryptography library for RC4 decryption #1255
KRRT7 wants to merge 2 commits into pdfminer:master
Replace the pure-Python byte-by-byte RC4 implementation with the C-backed ARC4 from the cryptography library (already a dependency). The original used `r += bytes((c ^ k,))` in a loop, allocating a new bytes object per byte of input.
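As a rough illustration of the change described above (a sketch, not the PR's actual diff), the snippet below puts a pure-Python RC4 next to a version that delegates to the `cryptography` library's ARC4. The function names `rc4_pure` and `rc4_c_backed` are hypothetical and do not come from pdfminer.six. The pure-Python version writes into a preallocated `bytearray` rather than growing a `bytes` object with `r += ...`, which sidesteps the per-byte allocation even without the C backend. The ARC4 import path is tried in two places because, as far as I know, newer `cryptography` releases moved the algorithm into a "decrepit" module.

```python
def rc4_pure(key: bytes, data: bytes) -> bytes:
    """Reference RC4 in pure Python (hypothetical helper, for comparison only)."""
    # Key-scheduling algorithm: permute the state array using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # Keystream generation: XOR each input byte with one keystream byte.
    # The output buffer is allocated once instead of being rebuilt per byte.
    out = bytearray(len(data))
    i = j = 0
    for n, c in enumerate(data):
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out[n] = c ^ s[(s[i] + s[j]) & 0xFF]
    return bytes(out)


def rc4_c_backed(key: bytes, data: bytes) -> bytes:
    """RC4 via the cryptography library's ARC4 (hypothetical wrapper)."""
    from cryptography.hazmat.primitives.ciphers import Cipher

    try:
        # Assumption: newer cryptography releases expose ARC4 here.
        from cryptography.hazmat.decrepit.ciphers.algorithms import ARC4
    except ImportError:
        # Older releases keep it alongside the other algorithms.
        from cryptography.hazmat.primitives.ciphers.algorithms import ARC4

    # ARC4 is a stream cipher, so no block mode is passed.
    encryptor = Cipher(ARC4(key), mode=None).encryptor()
    return encryptor.update(data) + encryptor.finalize()
```

RC4 is symmetric, so the same function serves for decryption. Note that the library's ARC4 only accepts a fixed set of key sizes, so a comparison between the two functions should use a key such as a 5-byte (40-bit) or 16-byte (128-bit) value.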
(force-pushed from 5096279 to 06d7add)
Updated the test vectors in [...] This doesn't affect real-world usage: PDF encryption key lengths ([...]
This seems reasonable, particularly since rolling one's own encryption code is usually a bad idea, and also as you say, the cryptography library is already a dependency (one that unfortunately doesn't build and install everywhere, but that's not a new problem). Have you verified this on real-world "encrypted" (because RC4...) PDFs? I wouldn't trust them to actually follow the standard, but if [...]
Yes, verified on real-world PDFs. A/B testing: Took 8 PDFs (English, Japanese, mixed-script, various sizes) and encrypted each with both RC4-40 and RC4-128 via [...] We also ran a smoke test over all the PDFs in unstructured-inference/sample-docs, and internally we've tested against a much larger and more varied corpus. No regressions. This is part of an ongoing effort to optimize Unstructured: we identified the pure-Python RC4 implementation in the hot path and the [...] Re: key sizes: [...]
Sounds good to me - the only possible issue I can see with this is that [...]
Ah, okay! You could also use PLAYA-PDF, which is a fork of pdfminer.six that implements its layout analysis algorithm, but contains numerous performance optimizations and has a much more ergonomic API, as I suggested a while back in a PR 😉
I'll take a look at it, thank you |
In the meantime I think we can merge this and some of the other performance PRs - what do you think @pietermarsman ? |
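Summary
- Replace the pure-Python byte-by-byte RC4 implementation with the C-backed ARC4 from the `cryptography` library (already a dependency)
- The original used `r += bytes((c ^ k,))` in a loop, allocating a new `bytes` object per byte of input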