Is your feature request related to a problem? Please describe.
First off, thank you for creating and maintaining XADMaster. It's an incredibly comprehensive and well-architected library for archive decompression.
While profiling the RAR5 decompression performance, especially with encrypted archives, I noticed that the AES and CRC32 computations are currently implemented purely in software. While these implementations are correct and robust, they don't leverage the specialized hardware instructions available in most modern CPUs. This can lead to a significant performance gap when decompressing large, encrypted RAR files or when verifying checksums on very fast storage systems.
Describe the solution you'd like
I would like to propose the addition of hardware acceleration for AES and CRC32 calculations within the RAR decompression modules (XADRARAESHandle and XADCRCHandle/CRC.m respectively).
This would involve:
- Runtime CPU feature detection to check for the presence of the required instruction sets.
- Implementing alternative code paths that use CPU intrinsics for acceleration when available.
- Falling back to the existing pure software implementation on older hardware that lacks these features.
This approach would maintain broad compatibility while unlocking significant performance gains on modern systems.
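The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: hw_features, detect_hw_features, aes_impl and choose_aes_impl are hypothetical names that do not exist in XADMaster, the x86 branch assumes GCC or Clang (for <cpuid.h>), and the Apple sysctl key names are my best understanding of what recent macOS versions expose. Any platform not covered simply reports no features and stays on the software path.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              /* GCC/Clang helper for the CPUID instruction */
#elif defined(__aarch64__) && defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#elif defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

/* Hypothetical feature flags; not part of XADMaster. */
typedef struct {
    int aes;        /* AES-NI or ARMv8 AES instructions */
    int sse42;      /* x86 SSE4.2 */
    int pclmulqdq;  /* x86 carry-less multiply */
    int arm_crc32;  /* ARMv8 CRC32 instructions */
} hw_features;

#if defined(__aarch64__) && defined(__APPLE__)
/* An unknown key makes sysctlbyname fail; treat that as "not supported". */
static int sysctl_flag(const char *name)
{
    int value = 0;
    size_t size = sizeof value;
    return sysctlbyname(name, &value, &size, NULL, 0) == 0 && value != 0;
}
#endif

hw_features detect_hw_features(void)
{
    hw_features f = {0, 0, 0, 0};
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.aes       = (ecx >> 25) & 1;  /* CPUID.1:ECX bit 25 */
        f.sse42     = (ecx >> 20) & 1;  /* CPUID.1:ECX bit 20 */
        f.pclmulqdq = (ecx >> 1) & 1;   /* CPUID.1:ECX bit 1  */
    }
#elif defined(__aarch64__) && defined(__APPLE__)
    /* Key names are assumptions about what the running OS exposes. */
    f.aes       = sysctl_flag("hw.optional.arm.FEAT_AES");
    f.arm_crc32 = sysctl_flag("hw.optional.armv8_crc32");
#elif defined(__aarch64__) && defined(__linux__)
    unsigned long caps = getauxval(AT_HWCAP);
    f.aes       = (caps & HWCAP_AES) != 0;
    f.arm_crc32 = (caps & HWCAP_CRC32) != 0;
#endif
    return f;   /* all zero on unknown platforms: software fallback */
}

/* Hypothetical dispatch: pick the code path once, then reuse the choice. */
typedef enum { AES_IMPL_SOFTWARE, AES_IMPL_HARDWARE } aes_impl;

aes_impl choose_aes_impl(hw_features f)
{
    return f.aes ? AES_IMPL_HARDWARE : AES_IMPL_SOFTWARE;
}
```

Detection would presumably run once (for example when the first handle is created) and the result would be cached, so the per-block cost of the check is zero.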
Technical Details & Implementation Suggestions:
1. AES Acceleration (AES-NI on x86 and ARM Cryptography Extensions)
- x86/x86-64 Architecture (Intel/AMD):
  - Instruction Set: AES-NI (Advanced Encryption Standard New Instructions).
  - Implementation: Use intrinsics from <wmmintrin.h> (the AES-NI and PCLMULQDQ header). Functions like _mm_aesdec_si128 and _mm_aesdeclast_si128 can replace the software-based decryption loops in XADRARAESHandle.m. CPU support can be detected at runtime via __cpuid (leaf 1, ECX bit 25).
- ARMv8-A Architecture (Apple Silicon, modern iOS devices, etc.):
  - Instruction Set: ARMv8 Cryptography Extensions (ARMCE).
  - Implementation: Use NEON intrinsics from <arm_neon.h>. Instructions like vaesdq_u8 (AES single-round decryption) and vaesimcq_u8 (AES inverse mix columns) can provide hardware-accelerated decryption. CPU feature detection can be done via sysctlbyname on macOS/iOS or getauxval on Linux.
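To make the x86 side concrete, here is a minimal sketch of decrypting a single AES-256 block with AES-NI. It assumes GCC or Clang, and it assumes the 15 encryption round keys (240 bytes) are still produced by the existing software key schedule. All function names are hypothetical and not taken from XADMaster. The encrypt helper is there only so the sketch can be checked with a round trip.

```c
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      /* __get_cpuid (GCC/Clang) */
#include <wmmintrin.h>  /* AES-NI intrinsics */

enum { AES256_ROUNDS = 14, AES_BLOCK = 16 };

/* Returns 1 when CPUID reports AES-NI (leaf 1, ECX bit 25). */
int aesni_available(void)
{
    unsigned int eax, ebx, ecx, edx;
    return __get_cpuid(1, &eax, &ebx, &ecx, &edx) && ((ecx >> 25) & 1);
}

/* Turn 15 encryption round keys into the form AESDEC expects:
   reversed order, with InvMixColumns applied to the 13 middle keys. */
__attribute__((target("sse2,aes")))
void aesni_make_decrypt_keys(const uint8_t *enc, uint8_t *dec)
{
    memcpy(dec, enc + AES256_ROUNDS * AES_BLOCK, AES_BLOCK);
    for (int i = 1; i < AES256_ROUNDS; i++) {
        const uint8_t *src = enc + (AES256_ROUNDS - i) * AES_BLOCK;
        __m128i k = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)(dec + i * AES_BLOCK), _mm_aesimc_si128(k));
    }
    memcpy(dec + AES256_ROUNDS * AES_BLOCK, enc, AES_BLOCK);
}

/* Decrypt one 16-byte block using keys from aesni_make_decrypt_keys. */
__attribute__((target("sse2,aes")))
void aesni_decrypt_block(const uint8_t *dec, const uint8_t *in, uint8_t *out)
{
    const __m128i *rk = (const __m128i *)dec;
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    x = _mm_xor_si128(x, _mm_loadu_si128(&rk[0]));
    for (int i = 1; i < AES256_ROUNDS; i++)
        x = _mm_aesdec_si128(x, _mm_loadu_si128(&rk[i]));
    x = _mm_aesdeclast_si128(x, _mm_loadu_si128(&rk[AES256_ROUNDS]));
    _mm_storeu_si128((__m128i *)out, x);
}

/* Forward direction, included only to verify a round trip. */
__attribute__((target("sse2,aes")))
void aesni_encrypt_block(const uint8_t *enc, const uint8_t *in, uint8_t *out)
{
    const __m128i *rk = (const __m128i *)enc;
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    x = _mm_xor_si128(x, _mm_loadu_si128(&rk[0]));
    for (int i = 1; i < AES256_ROUNDS; i++)
        x = _mm_aesenc_si128(x, _mm_loadu_si128(&rk[i]));
    x = _mm_aesenclast_si128(x, _mm_loadu_si128(&rk[AES256_ROUNDS]));
    _mm_storeu_si128((__m128i *)out, x);
}
#endif
```

The target attribute lets these functions be compiled without a global -maes flag, so the rest of the library keeps running on CPUs without AES-NI as long as callers check aesni_available() first. CBC chaining (XOR with the previous ciphertext block) would stay in the calling code.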
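2. CRC32 Acceleration (SSE4.2/PCLMULQDQ on x86 and ARM CRC32 Instructions)
XADMaster already uses a highly optimized Slicing-by-16 software implementation (XADCalculateCRCFast), which is great. Hardware acceleration could push this even further.
- x86/x86-64 Architecture (Intel/AMD):
  - Instruction Set: SSE4.2 (the crc32 instruction).
  - Implementation: Use the _mm_crc32_u8, _mm_crc32_u32, and _mm_crc32_u64 intrinsics from <nmmintrin.h>. One caveat: this instruction is hard-wired to the CRC-32C (Castagnoli) polynomial, so it cannot produce the standard CRC-32 that RAR uses. For that checksum, the x86 path would need carry-less multiplication instead (PCLMULQDQ, _mm_clmulepi64_si128 from <wmmintrin.h>). This can often outperform even the best software-based table methods, especially when the bottleneck is the CPU.
- ARMv8-A Architecture (Apple Silicon, etc.):
  - Instruction Set: ARM CRC32 instructions.
  - Implementation: Use __crc32d or __builtin_arm_crc32d to compute CRC32 on 64-bit chunks of data at a time. This is significantly faster than software methods on ARM hardware.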
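As a rough illustration of what the hardware CRC instructions compute, the sketch below compares a bitwise reference CRC-32 (reflected polynomial 0xEDB88320, the ZIP/RAR variant) with the ARM and SSE4.2 instructions. The function names are hypothetical. The ARM path is compiled only when the compiler already targets the CRC extension (__ARM_FEATURE_CRC32) and assumes a little-endian system. The x86 function is named crc32c on purpose: the SSE4.2 instruction uses the Castagnoli polynomial, so its result differs from the standard CRC-32.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>   /* __crc32b, __crc32d */
#endif
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      /* __get_cpuid (GCC/Clang) */
#include <nmmintrin.h>  /* _mm_crc32_u8 */
#endif

/* Bitwise reference for the standard CRC-32; slow, for comparison only. */
uint32_t crc32_reference(uint32_t crc, const uint8_t *p, size_t n)
{
    crc = ~crc;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#if defined(__ARM_FEATURE_CRC32)
/* ARMv8 CRC32 instructions: same polynomial as the reference above. */
uint32_t crc32_arm(uint32_t crc, const uint8_t *p, size_t n)
{
    crc = ~crc;
    while (n >= 8) {
        uint64_t v;
        memcpy(&v, p, 8);           /* unaligned-safe 64-bit load */
        crc = __crc32d(crc, v);
        p += 8;
        n -= 8;
    }
    while (n--)
        crc = __crc32b(crc, *p++);  /* remaining tail bytes */
    return ~crc;
}
#endif

#if defined(__x86_64__) || defined(__i386__)
/* Returns 1 when CPUID reports SSE4.2 (leaf 1, ECX bit 20). */
int sse42_available(void)
{
    unsigned int eax, ebx, ecx, edx;
    return __get_cpuid(1, &eax, &ebx, &ecx, &edx) && ((ecx >> 20) & 1);
}

/* SSE4.2 crc32 instruction: computes CRC-32C, not the ZIP/RAR CRC-32. */
__attribute__((target("sse4.2")))
uint32_t crc32c_sse42(uint32_t crc, const uint8_t *p, size_t n)
{
    crc = ~crc;
    while (n--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif
```

For the standard check string "123456789", the reference and ARM functions give 0xCBF43926, while the SSE4.2 function gives the CRC-32C check value 0xE3069283. A real x86 path for RAR's CRC would therefore have to be built on PCLMULQDQ folding, which is too long to sketch here.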
Describe alternatives you've considered
The current software implementation is a perfectly valid alternative for ensuring maximum portability. However, given that hardware support for these instructions is now nearly ubiquitous across all major platforms (macOS, iOS, Windows, Linux on both x86 and ARM), adding accelerated paths seems like a natural evolution that would greatly benefit users without sacrificing compatibility.
Additional context
The performance improvement would be most noticeable in the following scenarios:
- Extracting large files from password-protected RAR archives.
- Testing the integrity of large archives (-t command in unar).
- Operating on battery-powered devices like MacBooks and iPhones, where hardware-accelerated crypto is not only faster but also significantly more power-efficient.
Thank you for considering this feature request. I believe it would be a valuable enhancement to an already excellent library.