🧊 Iceberg Deep Dive

In-depth exploration of Apache Iceberg features, performance optimizations, and best practices.

📖 About This Project

This project runs a series of experiments that investigate the following aspects of the Apache Iceberg table format:

- Performance optimization techniques
- Data organization strategies
- Query optimization
- Storage optimization
- Best practices

Each experiment includes:

📊 Complete code implementation
🔬 Detailed performance analysis
📝 Experimental results and data
💡 Practical recommendations and insights

🗂️ Project Structure

IcebergDiveInto/
├── src/main/scala/com/example/icebergdiveinto/
│   ├── NullSortExperiment*.scala     # Null sorting experiments
│   └── (future experiments...)
├── profiling-results/                # Performance profiling results
│   ├── flamegraph-*.html            # Flame graphs
│   └── *.md                         # Analysis documents
├── articles/                         # Experiment write-ups
│   └── null-sort-experiment.md
└── README.md

🧪 Experiment Series

1️⃣ Null Sort Experiment: NULLS FIRST vs NULLS LAST

Research Question: How does Iceberg's WRITE ORDERED BY with NULLS FIRST vs NULLS LAST affect query performance?

📋 Experiment Overview

Apache Iceberg supports sorted writes:

ALTER TABLE table_name
WRITE ORDERED BY column_name NULLS LAST  -- or NULLS FIRST

But does this sorting strategy actually impact query performance? How much? And why?
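
To make the two orderings concrete, here is a minimal in-memory sketch in plain Scala, with `Option[Int]` standing in for a nullable column. It only illustrates where nulls land within a sorted run; it does not use Spark or Iceberg, and the names `NullOrderingSketch`, `sortNullsFirst`, and `sortNullsLast` are made up for this illustration.

```scala
object NullOrderingSketch {
  // None stands in for SQL NULL. Sorting on (isDefined, value) puts
  // None first because false sorts before true.
  def sortNullsFirst(values: Seq[Option[Int]]): Seq[Option[Int]] =
    values.sortBy(v => (v.isDefined, v.getOrElse(0)))

  // Sorting on (isEmpty, value) puts all Some values first, in ascending
  // order, and leaves the None entries in one contiguous run at the end.
  def sortNullsLast(values: Seq[Option[Int]]): Seq[Option[Int]] =
    values.sortBy(v => (v.isEmpty, v.getOrElse(0)))
}
```

For the input `Seq(Some(3), None, Some(1))`, `sortNullsFirst` yields `Seq(None, Some(1), Some(3))` and `sortNullsLast` yields `Seq(Some(1), Some(3), None)`.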

🎯 Key Findings

Performance comparison on 500K rows (30% null values):

| Metric | Finding |
|---|---|
| Overall Performance | NULLS LAST is 9-20% faster |
| WHERE IS NOT NULL | NULLS LAST is 9.4-27.5% faster |
| Filter Operations | NULLS LAST is 29.6% faster |
| File Reading | NULLS LAST is 52% faster |
| Parquet Stats | Identical |
| Query Plan | Identical |
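
The README does not state how the percentages above were derived. One common convention, assumed here, is to compare median wall-clock times and report the relative reduction against the NULLS FIRST baseline. The helper names below and the sample timings in the usage note are hypothetical.

```scala
object SpeedupSketch {
  // Median of a non-empty sequence of timings (milliseconds).
  def median(timingsMs: Seq[Double]): Double = {
    require(timingsMs.nonEmpty, "need at least one timing")
    val sorted = timingsMs.sorted
    val mid = sorted.length / 2
    if (sorted.length % 2 == 1) sorted(mid)
    else (sorted(mid - 1) + sorted(mid)) / 2.0
  }

  // Relative time reduction of `candidateMs` versus `baselineMs`, in percent.
  // Positive means the candidate took less time than the baseline.
  def percentFaster(baselineMs: Double, candidateMs: Double): Double = {
    require(baselineMs > 0, "baseline must be positive")
    (baselineMs - candidateMs) * 100.0 / baselineMs
  }
}
```

With hypothetical medians of 100 ms (baseline) and 80 ms (candidate), `SpeedupSketch.percentFaster(100.0, 80.0)` returns 20.0.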

💡 Key Insights

1. Physical data layout DOES matter
   - Even though Parquet stats and query plans are identical
   - Underlying execution efficiency shows significant differences
2. Performance gains come from CPU microarchitecture optimizations
   - Branch Prediction: NULLS LAST improves CPU branch prediction accuracy
   - Cache Locality: Contiguous non-null data improves cache hit rate
   - Memory Bandwidth: Reduces unnecessary data reads
3. Usage Recommendations
   - ✅ Recommended: OLAP queries + Null ratio > 10% + frequent IS NOT NULL filtering
   - ⚠️ Caution: Write-performance-sensitive scenarios
   - ❌ Not needed: Null ratio < 5% or rarely queried columns
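
The usage recommendations can be read as a small decision rule. The sketch below encodes them in plain Scala. The order in which the checks run and the `Unspecified` fallback (for example, a null ratio between 5% and 10%) are assumptions, since the README does not cover those cases. All names are hypothetical.

```scala
object NullSortAdvice {
  sealed trait Advice
  case object Recommended extends Advice
  case object Caution extends Advice
  case object NotNeeded extends Advice
  case object Unspecified extends Advice

  // Encodes the README's rules of thumb. The check order and the
  // Unspecified fallback are assumptions made for this sketch.
  def advise(
      nullRatio: Double,
      olapWorkload: Boolean,
      frequentIsNotNullFilter: Boolean,
      writeSensitive: Boolean,
      rarelyQueried: Boolean
  ): Advice =
    if (nullRatio < 0.05 || rarelyQueried) NotNeeded
    else if (writeSensitive) Caution
    else if (olapWorkload && nullRatio > 0.10 && frequentIsNotNullFilter) Recommended
    else Unspecified
}
```

For the experiment's own setting (30% nulls, OLAP-style queries, frequent IS NOT NULL filters, no write sensitivity), `advise` returns `Recommended`.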

📂 Related Files

src/main/scala/com/example/icebergdiveinto/
├── NullSortExperimentDataGenerator.scala  # Data generation
├── NullSortExperimentComparison.scala     # Performance comparison
└── SparkSessionCreator.scala              # Spark configuration

profiling-results/
├── flamegraph-first.html                  # NULLS FIRST flame graph
├── flamegraph-last.html                   # NULLS LAST flame graph
├── WHY-IT-MATTERS.md                      # Deep dive analysis
└── PROFILING-GUIDE.md                     # Profiling tutorial

articles/
└── null-sort-experiment.md                # Complete article

🚀 Quick Start

# 1. Generate test data
sbt "runMain com.example.icebergdiveinto.NullSortExperimentDataGenerator"

# 2. Run performance comparison
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison"

# 3. View results
open profiling-results/

🔥 Manual Profiling (Optional)

For detailed CPU-level performance analysis using async-profiler:

Terminal 1: Run in profiling mode

# Profile NULLS FIRST
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison first"

# The program will display:
# 🔍 Java PID: 12345
# 📍 Attach profiler now:
#    cd <async-profiler-folder>
#    ./bin/asprof -d 40 -f flamegraph-first.html 12345
# ⏳ Waiting 15 seconds for you to attach profiler...

Terminal 2: Attach profiler (within 15 seconds)

cd <async-profiler-folder>
./bin/asprof -d 40 -f <project-folder>/profiling-results/flamegraph-first.html <PID>

Repeat for NULLS LAST:

Terminal 1: Run in profiling mode

# Profile NULLS LAST
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison last"

# The program will display:
# 🔍 Java PID: 67890
# 📍 Attach profiler now:
#    cd <async-profiler-folder>
#    ./bin/asprof -d 40 -f flamegraph-last.html 67890
# ⏳ Waiting 15 seconds for you to attach profiler...

Terminal 2: Attach profiler (within 15 seconds)

cd <async-profiler-folder>
./bin/asprof -d 40 -f <project-folder>/profiling-results/flamegraph-last.html <PID>

Compare Results:

# Open both flame graphs for comparison
open profiling-results/flamegraph-first.html
open profiling-results/flamegraph-last.html

See the Profiling Guide (profiling-results/PROFILING-GUIDE.md) for detailed instructions.
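
The attach workflow above depends on the program printing its own PID and then pausing. How `NullSortExperimentComparison` does this is not shown in this README; the following is a minimal sketch of one way to get that behaviour on Java 9 or later, using the standard `ProcessHandle` API. The object and method names are hypothetical.

```scala
object ProfilerPauseSketch {
  // Prints the current JVM's PID and pauses so a profiler can be attached.
  // Returns the PID so callers can log or reuse it.
  def announceAndWait(waitSeconds: Int): Long = {
    val pid = ProcessHandle.current().pid() // available since Java 9
    println(s"Java PID: $pid")
    println(s"Waiting $waitSeconds seconds for you to attach the profiler...")
    Thread.sleep(waitSeconds * 1000L)
    pid
  }
}
```

Calling `ProfilerPauseSketch.announceAndWait(15)` before the measured queries would give the same 15-second window described above.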

📊 Experiment Details

Data Scale: 500,000 rows
Partitions: 10 categories
Null Ratio: 30%
Test Queries:
- WHERE value IS NOT NULL (COUNT, SUM, AVG)
- WHERE value IS NULL
- Full table scan
Profiling Tool: async-profiler 4.2
Environment: Spark 4.0.1 + Iceberg 1.10.0
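
To give a feel for the data shape described above, here is a small pure-Scala sketch that generates rows with a configurable null ratio and a fixed number of categories. It is not the project's `NullSortExperimentDataGenerator`; the `Row` fields, the `category_N` naming, and the value range are assumptions made for illustration.

```scala
import scala.util.Random

object SyntheticRowsSketch {
  // One synthetic row: `value` is None with probability `nullRatio`.
  final case class Row(id: Long, category: String, value: Option[Int])

  def generate(rows: Int, categories: Int, nullRatio: Double, seed: Long): Vector[Row] = {
    val rng = new Random(seed) // fixed seed keeps runs reproducible
    Vector.tabulate(rows) { i =>
      // Round-robin assignment spreads rows evenly across categories.
      val category = s"category_${i % categories}"
      val value = if (rng.nextDouble() < nullRatio) None else Some(rng.nextInt(1000000))
      Row(i.toLong, category, value)
    }
  }
}
```

`SyntheticRowsSketch.generate(500000, 10, 0.30, seed = 42L)` matches the scale, category count, and null ratio listed above.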

🔮 Future Experiment Plans

🛠️ Tech Stack

Apache Spark: 4.0.1
Apache Iceberg: 1.10.0
Scala: 2.13.16
Java: 17.0.16
async-profiler: 4.2

📄 License

MIT License

👤 Author

Raymond

If this project helps you, please give it a ⭐️!
