Skip to content

Latest commit

 

History

History
323 lines (230 loc) · 8.48 KB

File metadata and controls

323 lines (230 loc) · 8.48 KB

Falcon: GPU-Based Floating-point Adaptive Lossless Compression

Falcon is a high-performance GPU-accelerated lossless compression framework specifically designed for floating-point time series data. It achieves unprecedented compression ratios and throughput by leveraging modern GPU architectures through three key innovations: asynchronous pipeline, precise float-to-integer conversion, and adaptive sparse bit-plane encoding.

📊 Performance Highlights

  • Compression Ratio: Average 0.299 (21% improvement over best CPU competitors)
  • Compression Throughput: Average 10.82 GB/s (2.43× faster than fastest GPU competitors)
  • Decompression Throughput: Average 12.32 GB/s (2.4× faster than fastest GPU competitors)

🚀 Key Features

🎯 Asynchronous Pipeline

  • Event-Driven Scheduler: Hides I/O latency during CPU-GPU data transmission
  • Multi-stream Processing: Supports up to 16 concurrent streams
  • Bidirectional PCIe Utilization: Overlaps H2D and D2H communications

🔢 Precision-Preserving Conversion

  • Theoretical Guarantees: Eliminates floating-point arithmetic errors
  • Adaptive Digit Transformation: Handles both normal (β≤15, α≤22) and exceptional cases
  • Lossless Recovery: Exact reconstruction of original floating-point values

🎚️ Adaptive Sparse Bit-Plane Encoding

  • Dual Storage Schemes: Sparse storage for zero-dominated planes, dense storage for others
  • Outlier Resilience: Mitigates sparsity degradation caused by anomalies
  • Warp Divergence Minimization: Optimized for GPU parallel execution

🛠️ Prerequisite

Verified Environments

Base Environment 1 (WSL2)

  • OS: Ubuntu 22.04.5 LTS
  • Compiler: g++ 11.4
  • Build System: CMake 3.22.1
  • CUDA: nvcc 12.8/11.6
  • GPU: NVIDIA GeForce RTX 3050

Base Environment 2 (Native Ubuntu)

  • OS: Ubuntu 24.04.2 LTS
  • Compiler: g++ 11.4
  • Build System: CMake 3.28.1
  • CUDA: nvcc 12.0
  • GPU: NVIDIA GeForce RTX 5080

Required Dependencies

Essential Build Tools

# For Ubuntu 22.04/24.04
sudo apt update && sudo apt upgrade
sudo apt install -y git build-essential

CMake Installation

# Ubuntu 22.04 (CMake 3.22)
sudo apt install -y cmake

# Ubuntu 24.04 (CMake 3.28) or for newer version
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | sudo apt-key add -
sudo apt-add-repository 'deb https://apt.kitware.com/ubuntu/ jammy main'
sudo apt update
sudo apt install -y cmake

CUDA Toolkit Installation

# For CUDA 12.x (compatible with RTX 3050/5080)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-0

# For CUDA 11.x (if needed for compatibility)
sudo apt install -y cuda-toolkit-11-8

Required Libraries

# Boost (program_options component)
sudo apt install -y libboost-all-dev

# Google Test (GTest)
sudo apt install -y libgtest-dev
cd /usr/src/gtest
sudo cmake .
sudo make
sudo cp lib/*.a /usr/lib

# Google Benchmark
sudo apt install -y libbenchmark-dev

# NVIDIA nvcomp (for baseline comparisons)
sudo apt-get -y install nvcomp-cuda-11
# or
sudo apt-get -y install nvcomp-cuda-12

Environment Verification

# Check compiler versions
g++ --version
cmake --version
nvcc --version

# Verify CUDA installation
nvidia-smi

🏗️ Code Architecture

Header Files Structure

GPU Base Version (1025 elements per thread)

  • Falcon_compressor.cuh - Optimized GPU compressor (1 thread processes 1025 elements)
  • Falcon_decompressor.cuh - Optimized GPU decompressor (1 thread processes 1025 elements)

GPU Single Precision Version

  • Falcon_float_compressor.cuh - Single precision floating-point GPU compressor
  • Falcon_float_decompressor.cuh - Single precision floating-point GPU decompressor

GPU Pipeline Version

  • Falcon_pipeline.cuh - Pipeline implementation with ablation interfaces
  • Falcon_float_pipeline.cuh - Single precision floating-point pipeline implementation

Source Implementation

text

src/
├── gpu/           # GPU kernel implementations
└── utils/         # Bit stream utilities and helper functions

Parallelism Design

  • Chunk Size: 1025 elements per GPU thread
  • Thread Mapping: Each thread processes one complete chunk
  • Warp Efficiency: Optimized for 32-thread warp execution
  • Memory Access: Coalesced global memory access patterns

🔨 Building

Quick Build Script

#!/bin/bash
set -x
mkdir -p build
cd build
cmake ..
make -j$(nproc)

Manual Building

  1. Clone the repository:

    git clone <repository-url>
    cd Falcon
  2. Generate CMake building system:

    cmake -S . -B ./build -DCMAKE_BUILD_TYPE=Release
  3. Build all targets:

    cmake --build ./build --config Release -j$(nproc)

🧪 Testing

Test Structure

test/
├── baseline/          # Comparison algorithms (ALP, ndzip, elf, etc.)
├── data/             # Test datasets
├── Falcon_test_*.cu  # Main GPU test suites
└── test_*.cpp/cu     # Specific algorithm tests

Running Tests

Basic Usage for All Tests

./test/test_${test_name} --dir ../test/data/use/

Benchmark Tests (vs Baselines)

# Main GPU implementation 
./test/test_gpu --dir ../test/data/use/

# GPU without packing optimization
./test/test_gpu_nopack --dir ../test/data/use/

# GPU with bit-reduction optimization
./test/test_gpu_br --dir ../test/data/use/

# GPU with sparse optimization
./test/test_gpu_spare --dir ../test/data/use/

Multi-stream Performance Tests

# Multi-stream with 3-step blocking
./test/test_muti_3step_block --dir ../test/data/use/

# Multi-stream with 3-step non-blocking
./test/test_muti_3step_noblock --dir ../test/data/use/

# Optimized multi-stream
./test/test_muti_stream --dir ../test/data/use/

Ablation Studies

Encoding Strategy Ablation

  • Full Sparse: All bit-planes use sparse storage
  • Full Dense: All bit-planes use dense storage
  • Brute-force Error: Inaccurate decimal place calculation
  • Standard: Adaptive sparse/dense selection (default)

Pipeline Ablation

  • Single-stream: Sequential processing
  • Blocking: Synchronous multi-stream
  • Non-blocking: Asynchronous multi-stream
  • Standard: Event-driven scheduler (default)

Complete Test Script

#!/bin/bash
set -x
cd Falcon
mkdir -p build
cd build

# Compile project
cmake ..
make -j

# Run all tests
run_test() {
    local test_name=$1
    echo "===== Running ${test_name} ====="
    ./test/test_${test_name} --dir ../test/data/use/
}

# Core GPU tests
run_test "gpu"
run_test "gpu_nopack"
run_test "gpu_br"
run_test "gpu_spare"

# Multi-stream tests
run_test "muti_3step_block"
run_test "muti_3step_noblock"
run_test "muti_stream_opt"

📊 Experimental Results

Compression Ratio Comparison

Method Average Ratio Improvement vs Falcon
Falcon 0.299 -
ALP 0.329 9.1% worse
Elf* 0.339 13.4% worse
Elf 0.380 27.1% worse
ndzip 0.996 233% worse

Throughput Performance

Operation Falcon Best Competitor Speedup
Compression 10.82 GB/s 4.46 GB/s (GDeflate) 2.43×
Decompression 12.32 GB/s 5.13 GB/s (GPU:Elf*) 2.4×

🔧 Configuration

Default Parameters

  • Chunk Size: 1025 elements per thread
  • Batch Size: 1025 × 1024 × 4 elements
  • Pipeline Streams: 16
  • GPU Architecture: Compute Capability 7.0+

Chunk Size Considerations

  • 1025 elements: Optimized for memory space utilization
  • Thread Mapping: Each GPU thread processes exactly one chunk

Build Options

  • -DCMAKE_BUILD_TYPE=Release for optimized performance
  • -DCMAKE_CUDA_ARCHITECTURES=70 for specific GPU architecture