Falcon: GPU-Based Floating-point Adaptive Lossless Compression

Falcon is a GPU-accelerated lossless compression framework for floating-point time series data. It improves both compression ratio and throughput over existing CPU and GPU compressors through three techniques: an asynchronous pipeline, precise float-to-integer conversion, and adaptive sparse bit-plane encoding.

📊 Performance Highlights

Compression Ratio: Average 0.299 (compressed size / original size, lower is better; 21% improvement over the best CPU competitors)
Compression Throughput: Average 10.82 GB/s (2.43× faster than the fastest GPU competitor)
Decompression Throughput: Average 12.32 GB/s (2.40× faster than the fastest GPU competitor)

🚀 Key Features

🎯 Asynchronous Pipeline

Event-Driven Scheduler: Hides I/O latency during CPU-GPU data transmission
Multi-stream Processing: Supports up to 16 concurrent streams
Bidirectional PCIe Utilization: Overlaps H2D and D2H communications
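
To see why overlapping H2D copies, kernels, and D2H copies matters, the following is a minimal timing model, not Falcon's scheduler. It assumes every batch has the same stage durations and that each stage runs on its own hardware engine; `StageTimes` and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Per-batch stage durations in microseconds (hypothetical, supplied by the caller).
struct StageTimes {
    std::uint64_t h2d_us;     // host-to-device copy
    std::uint64_t kernel_us;  // compression kernel
    std::uint64_t d2h_us;     // device-to-host copy
};

// No overlap: every batch runs copy-in, kernel, copy-out back to back.
std::uint64_t sequential_makespan_us(const StageTimes& t, std::uint64_t batches) {
    return batches * (t.h2d_us + t.kernel_us + t.d2h_us);
}

// Ideal overlap: the three stages run concurrently on different batches, so once
// the pipeline is full one batch completes every max(stage) microseconds.
std::uint64_t pipelined_makespan_us(const StageTimes& t, std::uint64_t batches) {
    assert(batches > 0);
    const std::uint64_t bottleneck = std::max({t.h2d_us, t.kernel_us, t.d2h_us});
    return (t.h2d_us + t.kernel_us + t.d2h_us) + (batches - 1) * bottleneck;
}
```

With hypothetical stage times of 300, 200, and 100 µs and 16 batches, the model gives 9600 µs without overlap and 5100 µs with full overlap. The slowest stage bounds the gain, which is why hiding transfer latency is worthwhile when copies dominate.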

🔢 Precision-Preserving Conversion

Theoretical Guarantees: Eliminates floating-point arithmetic errors
Adaptive Digit Transformation: Handles both normal (β≤15, α≤22) and exceptional cases
Lossless Recovery: Exact reconstruction of original floating-point values
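
The idea of a conversion that never loses bits can be sketched on the host as follows. This is an illustration under stated assumptions, not Falcon's algorithm: it assumes α is the number of decimal places, relies on the fact that 10^0 through 10^22 are exactly representable as doubles, and treats any value that fails a bit-exact round trip as an exceptional case. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// 10^alpha for 0 <= alpha <= 22; every such power is exactly representable as a double.
double pow10_exact(int alpha) {
    assert(alpha >= 0 && alpha <= 22);
    double p = 1.0;
    for (int i = 0; i < alpha; ++i) p *= 10.0;
    return p;
}

// Rebuild the double from its scaled integer. For |q| < 2^53 and alpha <= 22 both
// operands are exact, so IEEE 754 division rounds only once.
double from_scaled_int(std::int64_t q, int alpha) {
    return static_cast<double>(q) / pow10_exact(alpha);
}

// Bitwise comparison, so that -0.0 and +0.0 count as different values.
bool same_bits(double a, double b) {
    std::uint64_t x = 0, y = 0;
    std::memcpy(&x, &a, sizeof x);
    std::memcpy(&y, &b, sizeof y);
    return x == y;
}

// Try to express v as q / 10^alpha. Returns false when the round trip is not
// bit-exact; such values would need a separate exceptional-case path.
bool to_scaled_int(double v, int alpha, std::int64_t* q_out) {
    if (!std::isfinite(v) || alpha < 0 || alpha > 22) return false;
    const double scaled = v * pow10_exact(alpha);                 // may carry a rounding error
    if (std::fabs(scaled) >= 9007199254740992.0) return false;    // 2^53
    const std::int64_t q = std::llround(scaled);                  // snap to the nearest integer
    if (!same_bits(from_scaled_int(q, alpha), v)) return false;   // verify losslessness
    *q_out = q;
    return true;
}
```

Calling `to_scaled_int(3.14, 2, &q)` succeeds with `q == 314`, while `-0.0` and `1.0 / 3.0` (with 15 decimal places) are rejected because the reconstructed bits differ from the input.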

🎚️ Adaptive Sparse Bit-Plane Encoding

Dual Storage Schemes: Sparse storage for zero-dominated planes, dense storage for others
Outlier Resilience: Mitigates sparsity degradation caused by anomalies
Warp Divergence Minimization: Optimized for GPU parallel execution
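
A per-plane choice between sparse and dense storage might look like the sketch below. The layout is an assumption made for illustration (a one-bit-per-byte occupancy bitmap followed by the non-zero bytes); Falcon's actual on-disk format and selection rule may differ, and `extract_plane`, `PlaneCost`, and `choose_storage` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack bit `bit` of every value into bytes, LSB-first within each byte.
std::vector<std::uint8_t> extract_plane(const std::vector<std::uint64_t>& values, unsigned bit) {
    assert(bit < 64);
    std::vector<std::uint8_t> plane((values.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        if ((values[i] >> bit) & 1u) {
            plane[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
    }
    return plane;
}

struct PlaneCost {
    std::size_t dense_bytes;   // store every byte of the plane
    std::size_t sparse_bytes;  // occupancy bitmap + non-zero bytes only
    bool use_sparse;
};

// Pick whichever representation is smaller for this plane.
PlaneCost choose_storage(const std::vector<std::uint8_t>& plane) {
    std::size_t nonzero = 0;
    for (std::uint8_t b : plane) nonzero += (b != 0) ? 1 : 0;
    const std::size_t dense = plane.size();
    const std::size_t sparse = (plane.size() + 7) / 8 + nonzero;
    return {dense, sparse, sparse < dense};
}
```

For a 1025-element chunk, a plane is 129 bytes. If only one element has the bit set, the sparse form costs 18 bytes; if every element has it set, the sparse form costs 146 bytes and dense storage wins.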

🛠️ Prerequisites

Verified Environments

Base Environment 1 (WSL2)

OS: Ubuntu 22.04.5 LTS
Compiler: g++ 11.4
Build System: CMake 3.22.1
CUDA: nvcc 12.8/11.6
GPU: NVIDIA GeForce RTX 3050

Base Environment 2 (Native Ubuntu)

OS: Ubuntu 24.04.2 LTS
Compiler: g++ 11.4
Build System: CMake 3.28.1
CUDA: nvcc 12.0
GPU: NVIDIA GeForce RTX 5080

Required Dependencies

Essential Build Tools

# For Ubuntu 22.04/24.04
sudo apt update && sudo apt upgrade
sudo apt install -y git build-essential

CMake Installation

# Ubuntu 22.04 (CMake 3.22)
sudo apt install -y cmake

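# Ubuntu 24.04 (CMake 3.28)
sudo apt install -y cmake

# Optional: newer CMake from the Kitware APT repository
# (apt-key is deprecated, so the key goes into a dedicated keyring;
#  lsb_release -cs yields jammy on 22.04 and noble on 24.04)
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null
echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/kitware.list >/dev/null
sudo apt update
sudo apt install -y cmake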

CUDA Toolkit Installation

# For CUDA 12.x (compatible with RTX 3050/5080)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-0

# For CUDA 11.x (if needed for compatibility)
sudo apt install -y cuda-toolkit-11-8

Required Libraries

# Boost (program_options component)
sudo apt install -y libboost-all-dev

# Google Test (GTest)
sudo apt install -y libgtest-dev
cd /usr/src/gtest
sudo cmake .
sudo make
sudo cp lib/*.a /usr/lib

# Google Benchmark
sudo apt install -y libbenchmark-dev

# NVIDIA nvcomp (for baseline comparisons)
sudo apt-get -y install nvcomp-cuda-11
# or
sudo apt-get -y install nvcomp-cuda-12

Environment Verification

# Check compiler versions
g++ --version
cmake --version
nvcc --version

# Verify CUDA installation
nvidia-smi

🏗️ Code Architecture

Header Files Structure

GPU Base Version (1025 elements per thread)

Falcon_compressor.cuh - Optimized GPU compressor (1 thread processes 1025 elements)
Falcon_decompressor.cuh - Optimized GPU decompressor (1 thread processes 1025 elements)

GPU Single Precision Version

Falcon_float_compressor.cuh - Single precision floating-point GPU compressor
Falcon_float_decompressor.cuh - Single precision floating-point GPU decompressor

GPU Pipeline Version

Falcon_pipeline.cuh - Pipeline implementation with ablation interfaces
Falcon_float_pipeline.cuh - Single precision floating-point pipeline implementation

Source Implementation

src/
├── gpu/           # GPU kernel implementations
└── utils/         # Bit stream utilities and helper functions

Parallelism Design

Chunk Size: 1025 elements per GPU thread
Thread Mapping: Each thread processes one complete chunk
Warp Efficiency: Optimized for 32-thread warp execution
Memory Access: Coalesced global memory access patterns
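
The one-chunk-per-thread mapping can be written down as plain index arithmetic. The sketch below is host-side C++ rather than kernel code, and the helper names and the 256-thread block size used in the example are assumptions, not values taken from the Falcon sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr std::size_t kChunkSize = 1025;  // elements per thread, as stated above

struct ChunkRange {
    std::size_t begin;  // first element index (inclusive)
    std::size_t end;    // one past the last element index
};

// Number of chunks (and therefore threads) needed for n elements.
std::size_t chunk_count(std::size_t n) {
    return (n + kChunkSize - 1) / kChunkSize;
}

// Element range handled by global thread `tid`; the last chunk may be short.
ChunkRange chunk_for_thread(std::size_t tid, std::size_t n) {
    assert(tid < chunk_count(n));
    const std::size_t begin = tid * kChunkSize;
    return {begin, std::min(n, begin + kChunkSize)};
}

// Thread blocks needed when each block holds `threads_per_block` threads.
std::size_t blocks_needed(std::size_t n, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (chunk_count(n) + threads_per_block - 1) / threads_per_block;
}
```

A full batch of 1025 × 1024 × 4 elements maps to 4096 threads, which is 16 blocks at a hypothetical 256 threads per block.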

🔨 Building

Quick Build Script

#!/bin/bash
set -x
mkdir -p build
cd build
cmake ..
make -j$(nproc)

Manual Building

Clone the repository:
```
git clone <repository-url>
cd Falcon
```

Generate the CMake build system:

cmake -S . -B ./build -DCMAKE_BUILD_TYPE=Release

Build all targets:

cmake --build ./build --config Release -j$(nproc)

🧪 Testing

Test Structure

test/
├── baseline/          # Comparison algorithms (ALP, ndzip, elf, etc.)
├── data/             # Test datasets
├── Falcon_test_*.cu  # Main GPU test suites
└── test_*.cpp/cu     # Specific algorithm tests

Running Tests

Basic Usage for All Tests

./test/test_${test_name} --dir ../test/data/use/

Benchmark Tests (vs Baselines)

# Main GPU implementation 
./test/test_gpu --dir ../test/data/use/

# GPU without packing optimization
./test/test_gpu_nopack --dir ../test/data/use/

# GPU with bit-reduction optimization
./test/test_gpu_br --dir ../test/data/use/

# GPU with sparse optimization
./test/test_gpu_spare --dir ../test/data/use/

Multi-stream Performance Tests

# Multi-stream with 3-step blocking
./test/test_muti_3step_block --dir ../test/data/use/

# Multi-stream with 3-step non-blocking
./test/test_muti_3step_noblock --dir ../test/data/use/

# Optimized multi-stream
./test/test_muti_stream --dir ../test/data/use/

Ablation Studies

Encoding Strategy Ablation

Full Sparse: All bit-planes use sparse storage
Full Dense: All bit-planes use dense storage
Brute-force Error: Inaccurate decimal place calculation
Standard: Adaptive sparse/dense selection (default)

Pipeline Ablation

Single-stream: Sequential processing
Blocking: Synchronous multi-stream
Non-blocking: Asynchronous multi-stream
Standard: Event-driven scheduler (default)

Complete Test Script

#!/bin/bash
set -x
cd Falcon
mkdir -p build
cd build

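# Compile project (Release build, limited to the available cores)
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)

# Helper: run one test binary against the dataset directory
run_test() {
    local test_name="$1"
    echo "===== Running ${test_name} ====="
    "./test/test_${test_name}" --dir ../test/data/use/
}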

# Core GPU tests
run_test "gpu"
run_test "gpu_nopack"
run_test "gpu_br"
run_test "gpu_spare"

# Multi-stream tests
run_test "muti_3step_block"
run_test "muti_3step_noblock"
run_test "muti_stream_opt"

📊 Experimental Results

Compression Ratio Comparison

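| Method | Average Ratio (lower is better) | Relative to Falcon |
|--------|---------------------------------|--------------------|
| Falcon | 0.299 | - |
| ALP | 0.329 | 10.0% worse |
| Elf* | 0.339 | 13.4% worse |
| Elf | 0.380 | 27.1% worse |
| ndzip | 0.996 | 233% worse |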
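
A percentage comparison of two ratios depends on which ratio is taken as the base. The two conventions are sketched here so the figures can be recomputed; the function names are illustrative and not part of Falcon.

```cpp
#include <cassert>
#include <cmath>

// Extra compressed size of a competitor, as a percentage of Falcon's ratio.
double percent_larger_than(double other_ratio, double falcon_ratio) {
    assert(falcon_ratio > 0.0);
    return (other_ratio - falcon_ratio) / falcon_ratio * 100.0;
}

// Size saved by Falcon, as a percentage of the competitor's ratio.
double percent_saved_versus(double other_ratio, double falcon_ratio) {
    assert(other_ratio > 0.0);
    return (other_ratio - falcon_ratio) / other_ratio * 100.0;
}
```

For Elf (0.380) against Falcon (0.299), the first function gives about 27.1% and the second about 21.3%.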

Throughput Performance

| Operation | Falcon | Best Competitor | Speedup |
|-----------|--------|-----------------|---------|
| Compression | 10.82 GB/s | 4.46 GB/s (GDeflate) | 2.43× |
| Decompression | 12.32 GB/s | 5.13 GB/s (GPU:Elf*) | 2.40× |

🔧 Configuration

Default Parameters

Chunk Size: 1025 elements per thread
Batch Size: 1025 × 1024 × 4 elements
Pipeline Streams: 16
GPU Architecture: Compute Capability 7.0+
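
The defaults translate into concrete buffer sizes. The sketch below is only arithmetic on the listed values. It assumes each of the 16 streams holds one uncompressed input batch at a time, which the parameters above do not state, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kChunkSize = 1025;                       // elements per thread
constexpr std::uint64_t kBatchElements = kChunkSize * 1024 * 4;  // default batch size
constexpr std::uint64_t kStreams = 16;                           // default pipeline streams

static_assert(kBatchElements == 4198400, "1025 x 1024 x 4 elements per batch");

// Uncompressed bytes in one batch for a given element size (4 = float, 8 = double).
std::uint64_t batch_bytes(std::uint64_t element_size) {
    assert(element_size == 4 || element_size == 8);
    return kBatchElements * element_size;
}

// Input bytes resident at once if every stream holds one batch (assumption).
std::uint64_t in_flight_bytes(std::uint64_t element_size) {
    return kStreams * batch_bytes(element_size);
}
```

For doubles this is 33,587,200 bytes per batch and 537,395,200 bytes (about 512.5 MiB) across 16 streams, before counting output and scratch buffers.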

Chunk Size Considerations

1025 elements: Optimized for memory space utilization
Thread Mapping: Each GPU thread processes exactly one chunk

Build Options

-DCMAKE_BUILD_TYPE=Release for optimized performance
-DCMAKE_CUDA_ARCHITECTURES=70 for specific GPU architecture
