CuteChuanChuan/Dive-Into-Iceberg

🧊 Iceberg Deep Dive

In-depth exploration of Apache Iceberg features, performance optimizations, and best practices.

📖 About This Project

This project runs a series of experiments and analyses that investigate the following aspects of the Apache Iceberg table format:

  • Performance optimization techniques
  • Data organization strategies
  • Query optimization
  • Storage optimization
  • Best practices

Each experiment includes:

  • 📊 Complete code implementation
  • 🔬 Detailed performance analysis
  • 📝 Experimental results and data
  • 💡 Practical recommendations and insights

🗂️ Project Structure

IcebergDiveInto/
├── src/main/scala/com/example/icebergdiveinto/
│   ├── NullSortExperiment*.scala     # Null sorting experiments
│   └── (future experiments...)
├── profiling-results/                # Performance profiling results
│   ├── flamegraph-*.html            # Flame graphs
│   └── *.md                         # Analysis documents
├── articles/                         # Experiment write-ups
│   └── null-sort-experiment.md
└── README.md

🧪 Experiment Series

1️⃣ Null Sort Experiment: NULLS FIRST vs NULLS LAST

Research Question: How does Iceberg's WRITE ORDERED BY with NULLS FIRST vs NULLS LAST affect query performance?

📋 Experiment Overview

Apache Iceberg supports sorted writes:

ALTER TABLE table_name
WRITE ORDERED BY (column NULLS FIRST/LAST)
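As a minimal sketch, the two variants of this statement can be generated from one helper so that the experiment tables differ only in null ordering. The names `NullOrder`, `WriteOrderSql`, and `alterWriteOrder`, and the table name in the usage comment, are hypothetical and not taken from this repository. Running the statement requires a Spark session with the Iceberg SQL extensions enabled.

```scala
// Hypothetical helper: builds the ALTER TABLE statement for either null ordering.
sealed abstract class NullOrder(val sql: String)

object NullOrder {
  case object NullsFirst extends NullOrder("NULLS FIRST")
  case object NullsLast  extends NullOrder("NULLS LAST")
}

object WriteOrderSql {
  /** Returns the statement text only; it does not talk to Spark. */
  def alterWriteOrder(table: String, column: String, nullOrder: NullOrder): String =
    s"ALTER TABLE $table WRITE ORDERED BY ($column ${nullOrder.sql})"
}

// Usage, assuming an existing SparkSession named `spark` and a
// hypothetical Iceberg table `local.db.null_sort_last`:
//   spark.sql(WriteOrderSql.alterWriteOrder("local.db.null_sort_last", "value", NullOrder.NullsLast))
```

Keeping the statement text in one place makes it harder for the two tables to drift apart in anything other than the null ordering under test.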

But does this sorting strategy actually impact query performance? How much? And why?

🎯 Key Findings

Performance comparison on 500K rows (30% null values):

| Metric | Finding |
| --- | --- |
| Overall performance | NULLS LAST is 9-20% faster |
| WHERE IS NOT NULL | NULLS LAST is 9.4-27.5% faster |
| Filter operations | NULLS LAST is 29.6% faster |
| File reading | NULLS LAST is 52% faster |
| Parquet stats | Identical |
| Query plan | Identical |
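This README does not state how the "X% faster" figures were derived. One common convention, shown here purely as an assumption, is to take the median of repeated timings for each layout and report the reduction relative to the NULLS FIRST baseline. The object `Speedup` and its functions are hypothetical, and the numbers in the usage comment are placeholders rather than measured results.

```scala
// Hypothetical helper for comparing two sets of timings.
object Speedup {
  /** Median of a non-empty sequence of timings (milliseconds). */
  def median(samples: Seq[Double]): Double = {
    require(samples.nonEmpty, "need at least one sample")
    val sorted = samples.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  /** Percentage by which the candidate is faster than the baseline. */
  def percentFaster(baselineMs: Double, candidateMs: Double): Double = {
    require(baselineMs > 0, "baseline must be positive")
    (baselineMs - candidateMs) * 100.0 / baselineMs
  }
}

// Usage with placeholder timings (not results from this experiment):
//   val first = Speedup.median(Seq(105.0, 100.0, 98.0))
//   val last  = Speedup.median(Seq(82.0, 80.0, 79.0))
//   Speedup.percentFaster(first, last)
```

Using the median rather than the mean reduces the influence of JIT warm-up and garbage-collection outliers on a small number of runs.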

💡 Key Insights

  1. Physical data layout matters

    • Parquet statistics and query plans are identical for both layouts
    • Execution time still differs significantly between them
  2. The profiling analysis attributes the gains to CPU microarchitecture effects

    • Branch prediction: NULLS LAST improves CPU branch prediction accuracy
    • Cache locality: Contiguous non-null data improves the cache hit rate
    • Memory bandwidth: Fewer unnecessary data reads
  3. Usage recommendations for NULLS LAST

    • ✅ Recommended: OLAP queries + null ratio > 10% + frequent IS NOT NULL filtering
    • ⚠️ Caution: Write-performance-sensitive scenarios
    • ❌ Not needed: Null ratio < 5% or rarely queried columns
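The recommendations above can be sketched as a small decision function. The thresholds (10% and 5%) come from the list. Everything else is an assumption: the names `ColumnProfile`, `Advice`, and `NullsLastAdvisor` are hypothetical, and the `MeasureFirst` outcome is introduced here for cases the list leaves open, such as null ratios between 5% and 10% or write-sensitive tables.

```scala
// Hypothetical encoding of the usage recommendations.
sealed trait Advice
object Advice {
  case object Recommended  extends Advice // matches the "Recommended" bullet
  case object NotNeeded    extends Advice // matches the "Not needed" bullet
  case object MeasureFirst extends Advice // assumption: anything the list does not decide
}

final case class ColumnProfile(
  nullRatio: Double,                 // fraction of null values, 0.0 to 1.0
  olapWorkload: Boolean,             // analytical (scan-heavy) queries
  frequentIsNotNullFilters: Boolean, // IS NOT NULL predicates are common
  rarelyQueried: Boolean,            // column is seldom read
  writeSensitive: Boolean            // write latency or throughput is critical
)

object NullsLastAdvisor {
  def advise(p: ColumnProfile): Advice =
    if (p.nullRatio < 0.05 || p.rarelyQueried) Advice.NotNeeded
    else if (p.writeSensitive) Advice.MeasureFirst
    else if (p.nullRatio > 0.10 && p.olapWorkload && p.frequentIsNotNullFilters)
      Advice.Recommended
    else Advice.MeasureFirst
}
```

For the experiment's own setup (30% nulls, OLAP-style queries with IS NOT NULL filters), this sketch returns `Recommended`.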

📂 Related Files

src/main/scala/com/example/icebergdiveinto/
├── NullSortExperimentDataGenerator.scala  # Data generation
├── NullSortExperimentComparison.scala     # Performance comparison
└── SparkSessionCreator.scala              # Spark configuration

profiling-results/
├── flamegraph-first.html                  # NULLS FIRST flame graph
├── flamegraph-last.html                   # NULLS LAST flame graph
├── WHY-IT-MATTERS.md                      # Deep dive analysis
└── PROFILING-GUIDE.md                     # Profiling tutorial

articles/
└── null-sort-experiment.md                # Complete article

🚀 Quick Start

# 1. Generate test data
sbt "runMain com.example.icebergdiveinto.NullSortExperimentDataGenerator"

# 2. Run performance comparison
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison"

# 3. View results (`open` is the macOS command; on Linux use xdg-open)
open profiling-results/

🔥 Manual Profiling (Optional)

For detailed CPU-level performance analysis using async-profiler:

Terminal 1: Run with profiling mode

# Profile NULLS FIRST
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison first"

# The program will display:
# 🔍 Java PID: 12345
# 📍 Attach profiler now:
#    cd <async-profiler-folder>
#    ./bin/asprof -d 40 -f flamegraph-first.html 12345
# ⏳ Waiting 15 seconds for you to attach profiler...

Terminal 2: Attach profiler (within 15 seconds)

cd <async-profiler-folder>
./bin/asprof -d 40 -f <project-folder>/profiling-results/flamegraph-first.html <PID>

Repeat for NULLS LAST:

Terminal 1: Run with profiling mode

# Profile NULLS LAST
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison last"

# The program will display:
# 🔍 Java PID: 67890
# 📍 Attach profiler now:
#    cd <async-profiler-folder>
#    ./bin/asprof -d 40 -f flamegraph-last.html 67890
# ⏳ Waiting 15 seconds for you to attach profiler...

Terminal 2: Attach profiler (within 15 seconds)

cd <async-profiler-folder>
./bin/asprof -d 40 -f <project-folder>/profiling-results/flamegraph-last.html <PID>

Compare Results:

# Open both flame graphs for comparison
open profiling-results/flamegraph-first.html
open profiling-results/flamegraph-last.html

See the Profiling Guide (profiling-results/PROFILING-GUIDE.md) for detailed instructions.

📊 Experiment Details

  • Data Scale: 500,000 rows
  • Partitions: 10 categories
  • Null Ratio: 30%
  • Test Queries:
    • WHERE value IS NOT NULL (COUNT, SUM, AVG)
    • WHERE value IS NULL
    • Full table scan
  • Profiling Tool: async-profiler 4.2
  • Environment: Spark 4.0.1 + Iceberg 1.10.0
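To make the two layouts concrete, here is a minimal plain-Scala sketch (no Spark or Iceberg) that builds a dataset with the same shape as the one described above, 500,000 rows, 10 categories, and 30% nulls, and sorts it both ways. It is not the repository's `NullSortExperimentDataGenerator`. The schema `SampleRow`, the deterministic null pattern, and all function names are assumptions made for illustration.

```scala
// Hypothetical row shape; only the `value` column name appears in the test queries.
final case class SampleRow(id: Long, category: String, value: Option[Double])

object NullOrderSketch {
  val RowCount = 500000
  val CategoryCount = 10

  /** Deterministic data: 3 of every 10 consecutive rows have a null value (30%). */
  def generate(n: Int = RowCount): Vector[SampleRow] =
    Vector.tabulate(n) { i =>
      val value = if (i % 10 < 3) None else Some((i % 1000).toDouble)
      SampleRow(i.toLong, s"category_${(i / 10) % CategoryCount}", value)
    }

  /** Ascending order on the value, with nulls placed first or last. */
  def valueOrdering(nullsFirst: Boolean): Ordering[Option[Double]] =
    new Ordering[Option[Double]] {
      def compare(a: Option[Double], b: Option[Double]): Int = (a, b) match {
        case (None, None)       => 0
        case (None, Some(_))    => if (nullsFirst) -1 else 1
        case (Some(_), None)    => if (nullsFirst) 1 else -1
        case (Some(x), Some(y)) => java.lang.Double.compare(x, y)
      }
    }

  def sortByValue(rows: Vector[SampleRow], nullsFirst: Boolean): Vector[SampleRow] =
    rows.sortBy(_.value)(valueOrdering(nullsFirst))
}
```

With NULLS LAST, the first 350,000 sorted rows all carry a value and the nulls form one contiguous tail. With NULLS FIRST, the 150,000 nulls form the head. In Iceberg the ordering is applied when data files are written, so the real layout also depends on partitioning and file boundaries, which this sketch ignores.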

🔮 Future Experiment Plans


🛠️ Tech Stack

  • Apache Spark: 4.0.1
  • Apache Iceberg: 1.10.0
  • Scala: 2.13.16
  • Java: 17.0.16
  • async-profiler: 4.2


📄 License

MIT License


👤 Author

Raymond

If this project helps you, please give it a ⭐️!
