In-depth exploration of Apache Iceberg features, performance optimizations, and best practices.
This project conducts a series of experiments and analyses to deeply investigate various aspects of the Apache Iceberg table format:
- Performance optimization techniques
- Data organization strategies
- Query optimization
- Storage optimization
- Best practices
Each experiment includes:
- 📊 Complete code implementation
- 🔬 Detailed performance analysis
- 📝 Experimental results and data
- 💡 Practical recommendations and insights
IcebergDiveInto/
├── src/main/scala/com/example/icebergdiveinto/
│ ├── NullSortExperiment*.scala # Null sorting experiments
│ └── (future experiments...)
├── profiling-results/ # Performance profiling results
│ ├── flamegraph-*.html # Flame graphs
│ └── *.md # Analysis documents
├── articles/ # Experiment write-ups
│ └── null-sort-experiment.md
└── README.md
Research Question: How does Iceberg's WRITE ORDERED BY with NULLS FIRST vs NULLS LAST affect query performance?
Apache Iceberg supports sorted writes:
ALTER TABLE table_name
WRITE ORDERED BY (column NULLS FIRST/LAST)

But does this sorting strategy actually impact query performance? How much? And why?
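As a minimal sketch of how the two variants of the statement above might be produced from Scala, the helper below builds the `ALTER TABLE ... WRITE ORDERED BY` string for one column. `WriteOrderSql`, `local.db.events`, and `value` are hypothetical names used only for illustration; this is not the project's own code. Running the resulting statement would additionally need a SparkSession with the Iceberg SQL extensions enabled.

```scala
// Hypothetical helper, not part of this repository: builds the statement that
// sets an Iceberg table's write order for a single column.
object WriteOrderSql {
  sealed trait NullOrder { def sql: String }
  case object NullsFirst extends NullOrder { val sql = "NULLS FIRST" }
  case object NullsLast extends NullOrder { val sql = "NULLS LAST" }

  // Returns e.g. "ALTER TABLE db.t WRITE ORDERED BY value ASC NULLS LAST".
  def writeOrderedBy(table: String, column: String, nullOrder: NullOrder): String =
    s"ALTER TABLE $table WRITE ORDERED BY $column ASC ${nullOrder.sql}"
}

// Assumed usage (placeholder table and column names):
//   spark.sql(WriteOrderSql.writeOrderedBy("local.db.events", "value", WriteOrderSql.NullsLast))
```

Keeping the null order as a small sealed type rather than a free-form string makes it harder to pass an invalid clause when switching between the two experiment variants.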
Performance comparison on 500K rows (30% null values):
| Metric | Finding |
|---|---|
| Overall Performance | NULLS LAST is 9-20% faster |
| WHERE IS NOT NULL | NULLS LAST is 9.4-27.5% faster |
| Filter Operations | NULLS LAST is 29.6% faster |
| File Reading | NULLS LAST is 52% faster |
| Parquet Stats | Identical |
| Query Plan | Identical |
- Physical data layout DOES matter
  - Even though Parquet stats and query plans are identical
  - Underlying execution efficiency shows significant differences
- Performance gains come from CPU microarchitecture optimizations
  - Branch Prediction: NULLS LAST improves CPU branch prediction accuracy
  - Cache Locality: Contiguous non-null data improves cache hit rate
  - Memory Bandwidth: Reduces unnecessary data reads
- Usage Recommendations
  - ✅ Recommended: OLAP queries + Null ratio > 10% + frequent IS NOT NULL filtering
  - ⚠️ Caution: Write-performance-sensitive scenarios
  - ❌ Not needed: Null ratio < 5% or rarely queried columns
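To make "contiguous non-null data" concrete, here is a toy sketch that compares the null mask of an unsorted column with the same column laid out NULLS LAST. It is an illustration under simplifying assumptions: it models only one in-memory column, not Parquet encoding or Spark's readers, so it says nothing about the measured timings. `NullLayoutSketch` and all of its members are hypothetical names.

```scala
import scala.util.Random

// Toy model of a nullable column; not taken from the experiment code.
object NullLayoutSketch {
  type Column = Vector[Option[Int]]

  // Column of `n` values where each one is null (None) with probability `nullRatio`.
  def randomColumn(n: Int, nullRatio: Double, seed: Long): Column = {
    val rng = new Random(seed)
    Vector.fill(n)(if (rng.nextDouble() < nullRatio) None else Some(rng.nextInt(1000)))
  }

  // NULLS LAST layout: non-null values in ascending order, then all nulls.
  def nullsLast(col: Column): Column =
    col.sortBy(v => (v.isEmpty, v.getOrElse(0)))

  // Number of adjacent row pairs whose null flag differs. Each flip is a point
  // where an `if (isNull)` style branch changes direction.
  def nullFlips(col: Column): Int =
    col.zip(col.drop(1)).count { case (a, b) => a.isDefined != b.isDefined }

  // Length of the longest run of consecutive non-null values.
  def longestNonNullRun(col: Column): Int = {
    val (best, _) = col.foldLeft((0, 0)) { case ((bestSoFar, current), v) =>
      val next = if (v.isDefined) current + 1 else 0
      (math.max(bestSoFar, next), next)
    }
    best
  }
}
```

For a column with 30% nulls at random positions, `nullFlips` is large and `longestNonNullRun` is short; after `nullsLast` there is at most one flip and a single run containing every non-null value.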
src/main/scala/com/example/icebergdiveinto/
├── NullSortExperimentDataGenerator.scala # Data generation
├── NullSortExperimentComparison.scala # Performance comparison
└── SparkSessionCreator.scala # Spark configuration
profiling-results/
├── flamegraph-first.html # NULLS First flame graph
├── flamegraph-last.html # NULLS Last flame graph
├── WHY-IT-MATTERS.md # Deep dive analysis
└── PROFILING-GUIDE.md # Profiling tutorial
articles/
└── null-sort-experiment.md # Complete article
# 1. Generate test data
sbt "runMain com.example.icebergdiveinto.NullSortExperimentDataGenerator"
# 2. Run performance comparison
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison"
# 3. View results
open profiling-results/

For detailed CPU-level performance analysis using async-profiler:
Terminal 1: Run with profiling mode
# Profile NULLS First
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison first"
# The program will display:
# 🔍 Java PID: 12345
# 📍 Attach profiler now:
# cd <async-profiler-folder>
# ./bin/asprof -d 40 -f flamegraph-first.html 12345
# ⏳ Waiting 15 seconds for you to attach profiler...

Terminal 2: Attach profiler (within 15 seconds)
cd <async-profiler-folder>
./bin/asprof -d 40 -f <project-folder>/profiling-results/flamegraph-first.html <PID>

Repeat for NULLS Last:
Terminal 1: Run with profiling mode
# Profile NULLS Last
sbt "runMain com.example.icebergdiveinto.NullSortExperimentComparison last"
# The program will display:
# 🔍 Java PID: 67890
# 📍 Attach profiler now:
# cd <async-profiler-folder>
# ./bin/asprof -d 40 -f flamegraph-last.html 67890
# ⏳ Waiting 15 seconds for you to attach profiler...

Terminal 2: Attach profiler (within 15 seconds)
cd <async-profiler-folder>
./bin/asprof -d 40 -f <project-folder>/profiling-results/flamegraph-last.html <PID>

Compare Results:
# Open both flame graphs for comparison
open profiling-results/flamegraph-first.html
open profiling-results/flamegraph-last.html

See the Profiling Guide (profiling-results/PROFILING-GUIDE.md) for detailed instructions.
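The "print the PID, then wait" step shown in the profiling output can be sketched as follows. This is an assumption about how such a pause might be written, not the code in `NullSortExperimentComparison`; `ProfilerAttachSketch` and its method names are hypothetical. It relies only on `ProcessHandle` from the Java standard library (Java 9 and later).

```scala
// Hypothetical sketch of a "wait for profiler" pause; not the project's code.
object ProfilerAttachSketch {
  // PID of the current JVM.
  def currentPid(): Long = ProcessHandle.current().pid()

  // Formats the attach command in the same shape as the commands shown above.
  def attachCommand(pid: Long, seconds: Int, output: String): String =
    s"./bin/asprof -d $seconds -f $output $pid"

  // Prints attach instructions, then sleeps so there is time to attach.
  def announceAndWait(output: String, waitSeconds: Int): Unit = {
    val pid = currentPid()
    println(s"Java PID: $pid")
    println(s"Attach profiler now: ${attachCommand(pid, 40, output)}")
    println(s"Waiting $waitSeconds seconds for you to attach profiler...")
    Thread.sleep(waitSeconds * 1000L)
  }
}
```

Calling `ProfilerAttachSketch.announceAndWait("flamegraph-first.html", 15)` before the measured queries would leave a 15-second window in which to run `asprof` from the second terminal.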
- Data Scale: 500,000 rows
- Partitions: 10 categories
- Null Ratio: 30%
- Test Queries:
  - WHERE value IS NOT NULL (COUNT, SUM, AVG)
  - WHERE value IS NULL
  - Full table scan
- Profiling Tool: async-profiler 4.2
- Environment: Spark 4.0.1 + Iceberg 1.10.0
- Apache Spark: 4.0.1
- Apache Iceberg: 1.10.0
- Scala: 2.13.16
- Java: 17.0.16
- async-profiler: 4.2
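The experimental configuration listed above (500,000 rows, 10 categories, 30% nulls) and its test queries can be mimicked in plain Scala collections. This is a rough sketch only: the real generator presumably writes an Iceberg table through Spark, and the value range, the seed, and every name in `NullSortDataSketch` are assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical in-memory stand-in for the experiment's dataset and queries.
object NullSortDataSketch {
  // One row: a partition category and a nullable value.
  final case class Row(category: Int, value: Option[Double])

  // Generates `rows` rows spread over `categories` categories; each value is
  // null with probability `nullRatio` (value range 0 to 100 is an assumption).
  def generate(rows: Int, categories: Int, nullRatio: Double, seed: Long): Vector[Row] = {
    val rng = new Random(seed)
    Vector.fill(rows) {
      val category = rng.nextInt(categories)
      val value = if (rng.nextDouble() < nullRatio) None else Some(rng.nextDouble() * 100)
      Row(category, value)
    }
  }

  // Counterpart of COUNT, SUM, AVG over rows WHERE value IS NOT NULL.
  def notNullStats(data: Vector[Row]): (Int, Double, Double) = {
    val present = data.flatMap(_.value)
    val sum = present.sum
    val avg = if (present.isEmpty) Double.NaN else sum / present.size
    (present.size, sum, avg)
  }

  // Counterpart of COUNT(*) over rows WHERE value IS NULL.
  def nullCount(data: Vector[Row]): Int = data.count(_.value.isEmpty)
}
```

For example, `NullSortDataSketch.generate(500000, 10, 0.3, seed = 1L)` yields a dataset whose null share is close to 30%, against which the three query shapes can be sanity-checked before running them on the Iceberg table.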
MIT License
Raymond
If this project helps you, please give it a ⭐️!