Hi FIDESlib Team,
I am a developer currently exploring the capabilities of FIDESlib. First of all, thank you for this excellent library; its performance on CKKS bootstrapping is truly impressive.
While applying FIDESlib to a large-scale iterative project (specifically, a multi-datacenter PageRank implementation), I encountered a performance challenge regarding data transfer. Currently, my workflow involves using OpenFHE on the CPU for Encoding and Encryption, and then transferring the expanded ciphertexts to the GPU. In scenarios with frequent synchronizations, the PCIe bandwidth seems to become a significant bottleneck due to the ciphertext expansion.
I am considering moving the "front-end" operations (encoding and encryption) directly to the GPU so that I only need to transfer the raw plaintext vectors (e.g., a std::vector of doubles) to the device. However, as I dive into the codebase, I see that the data structures (RNSPoly, LimbPartition, etc.) are highly optimized for compute efficiency, which makes them non-trivial to populate directly.
I would highly appreciate your guidance on a few points:
1. Does the current architecture support a way to load plaintext data "natively" into the RNS structures on the GPU, without going through CPU-side OpenFHE encryption first?
2. If I were to implement GPU-native sampling (using cuRAND for the error distribution) and encoding kernels, are there recommended entry points or best practices for accessing the underlying VRAM pointers while respecting the library's memory management (e.g., CudaStream usage and Limb alignment)?
3. Are there any plans on your roadmap to include GPU-side encryption or high-speed data-loading primitives to further mitigate the PCIe overhead?
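For concreteness on the second question, this is the error-sampling logic I have in mind, sketched as host-side C++ for readability (the CUDA version would use one thread per coefficient with curand_normal followed by rounding). Every name here, as well as the sigma value, is my own assumption and not FIDESlib API:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Rounded-Gaussian error sampler, mapped into [0, q) for one RNS limb.
// sigma = 3.19 is a common RLWE error width; q is a limb modulus.
std::vector<std::uint64_t> sample_error(std::size_t n, std::uint64_t q,
                                        double sigma, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, sigma);
    std::vector<std::uint64_t> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Sample a small signed error, then reduce it mod q so it fits
        // the unsigned limb representation.
        long long e = std::llround(gauss(rng));
        long long m = e % static_cast<long long>(q);
        if (m < 0) m += static_cast<long long>(q);
        out[i] = static_cast<std::uint64_t>(m);
    }
    return out;
}
```

My main uncertainty is not this logic itself but how to write its output into a LimbPartition safely, i.e., which stream and alignment invariants the kernel must preserve.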
Thank you for your time and for this great contribution to the FHE community!
Best regards,