Index Organization #98

adsharma · 2025-11-29T17:35:28Z

adsharma
Nov 29, 2025
Maintainer

The Node Table storage organization in LadybugDB is not very different from DuckDB and other columnar databases. The core concepts are the same (NodeGroup == RowGroup in DuckDB). There may be minor implementation differences.

But the primary key indexing strategy is very different. LadybugDB uses a hash index and DuckDB uses Adaptive Radix Tree (ART) index. Both of them use zonemaps. But DuckDB records min-max and handles range queries better. LadybugDB uses them only to record/skip NULLs.

LadybugDB is also very space inefficient with respect to index implementation, although in-theory a hash index should be more compact. Details in #97

Below is a poll on what we should do. Please vote.

What should our PK Indexing strategy be?

Make existing hash index more efficient

25%

Add ART Index like DuckDB

37%

Replace NodeTable implementation with DuckDB code

25%

Implement min-max zonemaps

12%

8 votes

adsharma · 2025-11-29T17:44:04Z

adsharma
Nov 29, 2025
Maintainer Author

Advantages of ART over Hash Indexes

Range query support: ART can run range queries, though not as fast as B+-Trees since it must perform tree traversals DuckDB, whereas hash tables cannot efficiently support range queries at all. This is crucial for DuckDB's analytical workload.

Sorted order: ART maintains data in sorted order, which allows efficient range queries Database of Databases, while hash tables are inherently unordered.

Cache locality with skewed workloads: When the access pattern is skewed, ART takes advantage of cache effects, whereas hash tables have no data locality when there is locality in the query set. With significant skew, ART performs nearly 50% faster than a hash table Database of Databases.

Single structure for multiple needs: Using one structure for both constraint checks and range queries instead of two is more code efficient and maintainable DuckDB.

Advantages of Hash Indexes over ART

Better random access performance: For uniformly random lookups on sparse key sets, hash tables generally outperform ART due to their O(1) average-case complexity.

Simpler implementation: Hash tables are conceptually simpler and require less complex memory management.
Lower memory overhead for sparse keys: ART can be wasteful with respect to memory consumption under sparse distributions DuckDB.

Performance Considerations

ART's lookup performance is comparable to that of hash tables, and in some cases can even beat them, particularly when keys are dense or access patterns are skewed. However, ART's L3 cache misses rapidly exceed those of hash tables as the index size increases DuckDB, because ART decomposes a key into bytes and traverses the tree, potentially incurring more cache misses.

For DuckDB's analytical use case, the ability to support range queries and maintain sorted order makes ART the better choice despite hash tables potentially having an edge in pure point-lookup scenarios.

0 replies

SevorisDoe · 2025-11-29T18:33:04Z

SevorisDoe
Nov 29, 2025

I think in terms of primary key indices, the hash index that is already present is without alternative; just for the reason that nodes must have unique identity, and that must be enforced, or you'll get problems with the rel tables.

That said, ART indices make sense to me for secondary index support of a kind (DuckDB just labels these as primary index-like to my understanding, because in an analytical relational DB that expects to run lots of aggregations on tables, ART indexing best support point access patterns you'd use for something like a PK index) and would possibly enable some neat filtering patterns (such as efficiently finding a collection of nodes that match a certain range or property equality) that the rest of the database would profit from - assuming the query planner can find them.

However in the direct terms, I suspect min-max statistics would have the most direct pay-off by letting the scan skip chunks that are irrelevant (and does not ask the query planner to identify the opportunity and optimize) and hence I put my vote behind that for now.

Side remark: having a DuckDB-like storage structure for node tables (but with the hash index enforcing PK uniqueness) could be interesting insofar as that it might support having a common embedded storage layer that can still support aggregation and other more ""classical"" relational workloads, given Ladybug's inherited data model already being coached in very relational-like term. That said, I am worried about compatibility implications with other forks.

9 replies

SevorisDoe Nov 29, 2025

I was thinking of queries of the form

match (a:user_session), (b:event_records)
where (b.ts >= a.session_start_ts) and (b.ts <= a.session_end_ts

or maybe to be a bit more cypher:

match (a:user_session {event_type: "login")-[:major_event*1..10)->(c:user_session {event_type: "logout"}),
(b:event_records)
where (b.ts >= a.ts) and (b.ts <= c.ts

(Classical way of having events slice a log)

Ladybug rn would handle such cases as a cartesian join-and-filter, because it has no inequality join operators at all.

DuckDB can handle such cases in a bunch of ways.

adsharma Nov 29, 2025
Maintainer Author

You can also do match and nested optional match. That's what a LLM did when I translated SQLite based Epstein GraphRAG to cypher.

But the end result at the physical operator layer is the same like you observe.

Instead of trying to fix all that, pushing it down to a query/storage layer that already does them well is appealing.

adsharma Nov 29, 2025
Maintainer Author

Re: does it help GraphRAG

Requires more market study. There are many benchmarks out there they don't accurately represent real world use cases that businesses have.

SevorisDoe Nov 29, 2025

You can also do match and nested optional match. That's what a LLM did when I translated SQLite based Epstein GraphRAG to cypher.

But the end result at the physical operator layer is the same like you observe.

Instead of trying to fix all that, pushing it down to a query/storage layer that already does them well is appealing.

The other interesting aspect, at least to me, is also kind of the inverse - being able to pre- or post-process graph pattern matching results using relational tools. There's a lot of temporal data problems out there that profit from this ability to interplay high-level structures with low-level streams of data.

You might also "glue together" two separate graph patterns using temporal range joins, for example.

adsharma Dec 1, 2025
Maintainer Author

You can use parquet or apache arrow in-memory to do those type of things already. But it requires python, not easy via lbug CLI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index Organization #98

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 9 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Index Organization #98

Uh oh!

adsharma Nov 29, 2025 Maintainer

Replies: 2 comments · 9 replies

Uh oh!

adsharma Nov 29, 2025 Maintainer Author

Advantages of ART over Hash Indexes

Advantages of Hash Indexes over ART

Performance Considerations

Uh oh!

SevorisDoe Nov 29, 2025

Uh oh!

SevorisDoe Nov 29, 2025

Uh oh!

adsharma Nov 29, 2025 Maintainer Author

Uh oh!

adsharma Nov 29, 2025 Maintainer Author

Uh oh!

SevorisDoe Nov 29, 2025

Uh oh!

adsharma Dec 1, 2025 Maintainer Author

adsharma
Nov 29, 2025
Maintainer

Replies: 2 comments 9 replies

adsharma
Nov 29, 2025
Maintainer Author

SevorisDoe
Nov 29, 2025

adsharma Nov 29, 2025
Maintainer Author

adsharma Nov 29, 2025
Maintainer Author

adsharma Dec 1, 2025
Maintainer Author