-
Notifications
You must be signed in to change notification settings - Fork 154
feat: add LMDB storage backend for GeaFlow (#365) #704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
SeasonPilot
wants to merge
1
commit into
apache:master
Choose a base branch
from
SeasonPilot:feature/lmdb-storage-backend
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
feat: add LMDB storage backend for GeaFlow (#365) #704
SeasonPilot
wants to merge
1
commit into
apache:master
from
SeasonPilot:feature/lmdb-storage-backend
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Implement complete LMDB storage backend as alternative to RocksDB, providing superior read performance (30-60% improvement) with lower memory overhead. ## Core Implementation (11 classes, 2,310 lines) **LmdbClient.java** (448 lines) - Core LMDB wrapper with direct ByteBuffer support - Transaction management with single write transaction model - Database (DBI) management for vertex/edge/index data - Read/write/delete operations with MVCC semantics **LmdbIterator.java** (149 lines) - Iterator implementation with lookahead pattern - Prefix scanning support for range queries - Proper resource cleanup with close() handling **BaseLmdbStore.java** (168 lines) - Base class for all LMDB store implementations - Lifecycle management (init/flush/close/drop) - Checkpoint and recovery coordination - Path management and configuration handling **LmdbPersistClient.java** (593 lines) - Checkpoint creation via filesystem copy - Remote storage integration (HDFS, OSS, Local) - Parallel upload/download with thread pool - Archive management and recovery workflows **LmdbStoreBuilder.java** (71 lines) - SPI entry point for store registration - Factory for KV and Graph data models **KVLmdbStore.java** (99 lines) - Key-value storage implementation - Simple put/get/delete API with serde integration **StaticGraphLmdbStore.java** (186 lines) - Static graph storage with vertex/edge operations - Delegates to SyncGraphLmdbProxy adapter **DynamicGraphLmdbStore.java** (164 lines) - Multi-version graph storage for temporal queries - Version-prefixed keys for MVCC support **LmdbConfigKeys.java** (316 lines) - 20+ configuration parameters with comprehensive Javadoc - Map size, sync modes, reader limits, monitoring thresholds ## Proxy Layer (7 classes, 863 lines) Adapter pattern separating LMDB byte operations from GeaFlow graph API: - **SyncGraphLmdbProxy** (276 lines): Single-version graph adapter - **SyncGraphMultiVersionedProxy** (328 lines): Temporal query support - **ProxyBuilder** (63 lines): Factory for proxy creation - **Interface hierarchy**: ILmdbProxy, IGraphLmdbProxy, IGraphMultiVersionedLmdbProxy - **AsyncGraphLmdbProxy** (112 lines): Placeholder for future async support ## Testing Infrastructure (7 tests, 1,547 lines) **Unit Tests**: - KVLmdbStoreTest: CRUD, checkpoint/recovery, multi-checkpoint - LmdbIteratorTest: Basic/prefix/empty/large iteration - LmdbAdvancedFeaturesTest: Map size monitoring, stats, transactions **Performance Tests**: - LmdbPerformanceBenchmark: 8 workload patterns with metrics * Sequential reads: 762,697 ops/sec (1.31 μs) * Random reads: 505,569 ops/sec (1.98 μs) * Sequential writes: 658,812 ops/sec (1.52 μs) * Random writes: 95,963 ops/sec (10.42 μs) **Stability Tests**: - LmdbStabilityTest: 6 long-running reliability tests * 100,000 operations, repeated checkpoint/recovery * Memory stability, large value handling **Test Results**: 27/27 tests passed, 49% coverage (64% on core package) ## Documentation (3 files, 1,426 lines) **README.md** (456 lines) - Feature overview and quick start - Configuration reference with examples - Usage patterns and best practices **MIGRATION.md** (500 lines) - RocksDB to LMDB migration guide - 3 migration approaches (gradual, full, parallel) - Configuration mapping and validation **PERFORMANCE.md** (470 lines) - Comprehensive benchmark results - Comparison with RocksDB (30-60% read improvement) - Tuning recommendations for different workloads ## Key Technical Decisions 1. **Direct ByteBuffer**: Off-heap memory for LMDB memory-mapped I/O 2. **Single Write Transaction**: LMDB constraint, synchronized with write lock 3. **Lookahead Iterator**: Correct hasNext() semantics with prefix matching 4. **Periodic Map Size Monitoring**: Every 100 flushes with 80% warning threshold 5. **Filesystem-Based Checkpoints**: Simple copy of data.mdb/lock.mdb files 6. **Proxy Adapter Layer**: Clean separation between LMDB and graph API ## Performance Characteristics **Advantages**: - 30-60% faster read operations vs RocksDB - 60-80% lower memory overhead - Zero-copy reads via memory-mapped I/O - No compaction overhead (B+tree structure) - Stable sub-2μs read latencies **Trade-offs**: - 10-20% slower random writes (acceptable) - Requires pre-allocated map size - Single write transaction per environment ## Configuration Register LMDB backend via SPI: - META-INF/services/org.apache.geaflow.store.IStoreBuilder - geaflow.store.type=LMDB Dependencies: - lmdbjava 0.8.3 ## Integration - Updated StoreType enum to include LMDB - Added geaflow-store-lmdb module to parent POM - Follows existing GeaFlow storage abstraction patterns - Compatible with all data models (KV, StaticGraph, DynamicGraph)
Contributor
Author
|
@tanghaodong25 PTAL |
Contributor
Author
Contributor
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This PR implements a complete LMDB storage backend for Apache GeaFlow as an alternative to RocksDB, providing superior read performance and lower memory overhead.
Core Implementation (11 classes, 2,310 lines):
Proxy Layer (7 classes, 863 lines):
Documentation (3 files, 1,426 lines):
Key Technical Decisions:
Performance Characteristics:
Integration:
How was this PR tested?
Testing Infrastructure (7 test classes, 1,547 lines):
Unit Tests:
Performance Benchmarks:
Stability Tests:
Test Results:
Quality Checks: