19 changes: 14 additions & 5 deletions bench/cross-system/README.md
@@ -14,8 +14,9 @@ across all (selected) systems.
|---|---|---|
| gqlite (lazy backend) | [`gqlite/`](gqlite/) | ✅ implemented |
| GraphQLite — colliery-io/graphqlite (Cypher, SQLite-backed) | [`graphqlite/`](graphqlite/) | ✅ implemented |
| GraphLite — GraphLite-AI/GraphLite (ISO GQL, Sled-backed) | — | not yet integrated |
| GQLite — webbery/gqlite (custom DSL, dead since April 2023) | — | not yet integrated |
| GQLite — auksys/gqlite, [gqlite.org](https://gqlite.org/) (OpenCypher, SQLite/Redb/Postgres backends; PyPI: `gqlitedb`) | [`auksys_gqlite/`](auksys_gqlite/) | ⚠ integration scaffolded; **fails to load LDBC SF0.1 in reasonable time** — see [`auksys_gqlite/DIVERGENCES.md`](auksys_gqlite/DIVERGENCES.md) |
| GraphLite — GraphLite-AI/GraphLite (ISO GQL, Sled-backed) | [`graphlite/`](graphlite/) | ⚠ integration scaffolded; **load hangs on the comments phase** — see [`graphlite/DIVERGENCES.md`](graphlite/DIVERGENCES.md) |
| GQLite — webbery/gqlite (custom DSL, dead since April 2023) | — | not integrated; `auksys/gqlite` above is the actively-maintained successor |

## Setup

@@ -86,20 +87,28 @@ the comment-link explains.

## Reading the results

`comparison.txt` has three sections:
`comparison.txt` has four sections:

0. **Errored param rows** — for each system, the count of param rows
where any iter returned `result_count = -1` (sentinel for runner-
level failure: SDK error, panic, etc.). A high tally means the
latency table below is over a partial sample. The integrated-but-
blocked systems (graphlite, auksys_gqlite) document their failure
modes in their per-system DIVERGENCES.md.
1. **Per-cell summary** — for each (params_row, system) pair, median
latency, p95, iter count, the result_count, and the result_shape
(per-row type signatures, deduped — e.g. `i,s,s,i,s,i|i,s,s,i,n,i`
for IC2 where `c.content` is sometimes null).
for IC2 where `c.content` is sometimes null). Errored rows are
excluded.
2. **Count + shape consistency** — for each params_row, do all systems
agree on row count AND per-row column types? Without ORDER BY the
actual row contents legitimately differ (each system picks a
different N rows from the full result), but the column count and
types must match. `WARN` flags disagreement, which means a
per-system query translation bug.
3. **Side-by-side latency** — one row per params_row, one column per
system, median ms.
system, median ms. Cells where a system errored out show as `--`
(no successful samples to median).
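
As a rough illustration of the conventions above, the sketch below derives a deduped `result_shape` string and a latency cell that falls back to `--`. The function names, the type codes other than `i`/`s`/`n`, the first-seen ordering of signatures, and the `(latency_ms, result_count)` input shape are all assumptions for illustration, not the runner's actual code.

```python
from statistics import median

ERROR_SENTINEL = -1  # result_count value that marks a runner-level failure

# One-letter type codes. Only i, s and n appear in the README; f and b are guesses.
TYPE_CODES = {int: "i", str: "s", float: "f", bool: "b", type(None): "n"}


def result_shape(rows):
    """Join the distinct per-row type signatures with '|', in first-seen order."""
    seen = []
    for row in rows:
        signature = ",".join(TYPE_CODES.get(type(value), "?") for value in row)
        if signature not in seen:
            seen.append(signature)
    return "|".join(seen)


def latency_cell(iterations):
    """Median latency in ms, or '--' when the param row counts as errored.

    `iterations` is a list of (latency_ms, result_count) pairs. Following
    section 0, a row is treated as errored if any iteration hit the sentinel.
    """
    if not iterations or any(count == ERROR_SENTINEL for _, count in iterations):
        return "--"
    return f"{median(ms for ms, _ in iterations):.2f}"


rows = [(1, "a", "b", 2, "text", 3), (1, "a", "b", 2, None, 3)]
print(result_shape(rows))
print(latency_cell([(1.0, 20), (3.0, 20), (2.0, 20)]))
```

Keeping the shape as a plain string makes the consistency check in section 2 a simple equality comparison across systems.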

## Out of scope

308 changes: 308 additions & 0 deletions bench/cross-system/auksys_gqlite/DIVERGENCES.md
@@ -0,0 +1,308 @@
# auksys/gqlite (gqlite.org) — divergences and integration notes

auksys/gqlite is included as the third external system in the cross-system
bench, after the teacher pointed at [gqlite.org](https://gqlite.org/).
It's distinct from the dead `webbery/gqlite` listed in the original plan
and from our own `gqlite/` — three different systems share the name.

This file documents:
1. The integration choices and divergences from spec-faithful execution
that affect the bench numbers.
2. **The integration journey itself**, including approaches that didn't
work and why. This is here on purpose: the teacher's framing applies
to `auksys/gqlite` as much as to GraphLite — bench-time friction is
itself a finding worth reporting, and the trail of "what we tried"
matters as much as the final code.

## What it is

- Rust core, OpenCypher subset, with C / C++ / Python / Ruby / Crystal / Rune bindings.
- Multiple backends: **SQLite** (default), Redb, PostgreSQL.
- Active project; latest changelog entry is v0.9 (2026-04-09) for the
engine and v1.5 for the Python distribution. Author: Cyrille Berger
/ auKsys org.
- PyPI distribution name: `gqlitedb`. Python module name: `gqlite`.
(There's a totally unrelated `gqlite` package on PyPI — a GraphQL
HTTP client — that name-squats the obvious slot. The wrong one
installs cleanly but exposes no `Connection` class. Smoke check:
`hasattr(gqlite, "Connection")`.)
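
A minimal sketch of that smoke check, assuming only that the right distribution exposes a `Connection` attribute on the `gqlite` module. The helper name is hypothetical, and the two stand-in modules exist only so the check runs without installing either package.

```python
import types


def looks_like_auksys_gqlite(module):
    """True if the imported module exposes a `Connection` attribute."""
    return hasattr(module, "Connection")


# Stand-ins for the two PyPI distributions that both import as `gqlite`.
right = types.ModuleType("gqlite")
right.Connection = type("Connection", (), {})  # placeholder class
wrong = types.ModuleType("gqlite")  # the GraphQL HTTP client: no Connection

print(looks_like_auksys_gqlite(right), looks_like_auksys_gqlite(wrong))
```

In a real setup script the same function would be called on the result of `import gqlite`, failing fast with a hint to install `gqlitedb` instead.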

## What's in the comparison

- Same LDBC SF0.1 dataset as every other system.
- Same 15 IC2 substitution-param rows.
- Same cross-system CSV schema, dispatched the same way as graphqlite
(Python runner, no shell wrapper).
- Native parameter binding via `Connection.execute_oc_query(query, bindings)`
with `$param` syntax in queries and `{"$param": value}` dicts.
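
The binding convention is easy to get wrong because the dict keys carry the `$` prefix. A small sketch of a helper that adds it; `to_bindings` is a hypothetical name, and the query and the id `933` are illustrative values rather than anything from the bench.

```python
def to_bindings(params):
    """Prefix each parameter name with '$', as the bindings dict expects."""
    return {
        name if name.startswith("$") else f"${name}": value
        for name, value in params.items()
    }


query = "MATCH (p:Person {id: $personId}) RETURN p.firstName"
bindings = to_bindings({"personId": 933})
print(bindings)

# With gqlitedb installed, the call described above would then be:
#   conn.execute_oc_query(query, bindings)
```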

## Documented divergences from spec-faithful execution

### 1. IC2 uses `WHERE c:Comment OR c:Post`, not `(c:Comment|Post)`

auksys/gqlite supports the cleaner `(c:Comment|Post)` Cypher 5+ union-
label syntax in standalone patterns, but in **multi-hop patterns
containing it** the planner errors with `CompileTime: UnknownFunction:
get_source` (gqlitedb 1.5.1). Reproducer: the exact shape of the IC2 query

```cypher
MATCH (a)-[:knows]-(b)<-[:hasCreator]-(c:Comment|Post)
```

triggers it, while replacing the terminal node with `(c)` and adding
`WHERE c:Comment OR c:Post` works fine. The OR-form is what graphqlite
uses for the same logical query (graphqlite's dialect rejects `:A|B`
entirely), so this matches the apples-to-apples shape across systems.

### 2. No `CREATE INDEX` in the parser

The Cypher parser does not accept any of the standard index-creation
syntaxes:

```cypher
CREATE INDEX FOR (n:Person) ON (n.id) -- rejected
CREATE INDEX ON :Person(id) -- rejected
CREATE INDEX person_id IF NOT EXISTS FOR (n:Person) ON (n.id) -- rejected
CREATE INDEX FOR (n:Person) ON n.id -- rejected
```

Each fails with the same `expected node_pattern or edge_pattern` error,
because the parser handles `CREATE` only as the node/edge-creation form.
There is no DDL surface for indexes in v1.5.1.

This is a real divergence from "Cypher" the spec — but it doesn't break
the bench because the bench's load idiom (see below) avoids needing one.

### 3. Properties stored as JSON in a TEXT column → no per-property index

Inspecting the underlying SQLite schema (from a freshly-loaded DB):

```
gqlite_default_nodes:
id INTEGER PRIMARY KEY -- internal node rowid
node_key BLOB -- 128-bit node key (returned by id(n))
labels TEXT -- '["Person"]'
properties TEXT -- '{"id":1,"firstName":"Alice"}'
indexes: only the auto sqlite_autoindex on metadata
```

Property values live JSON-encoded in a single `TEXT` column. Without a
JSON-extract expression index (which Cypher CREATE INDEX would have to
generate), `MATCH (n:Person {id: X})` is a full table scan with JSON
parsing per row. The bench's data-load idiom sidesteps this entirely
by using fresh node-variables in the same statement (see below). At
query time, since IC2 starts with a single `(p:Person {id: $personId})`,
the engine eats one full scan per query — that's reflected in the
latency numbers.
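
To make the cost concrete, here is a sketch using Python's standard `sqlite3` module against a table with the same columns as the schema above. It is not gqlite's code: `prop_id` is a hypothetical user-defined function standing in for a JSON-extract expression, and the 1,000-row table is made up. With no index on the property, the function runs once per row.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gqlite_default_nodes ("
    "id INTEGER PRIMARY KEY, node_key BLOB, labels TEXT, properties TEXT)"
)
conn.executemany(
    "INSERT INTO gqlite_default_nodes (node_key, labels, properties) VALUES (?, ?, ?)",
    [(i.to_bytes(16, "big"), '["Person"]', json.dumps({"id": i})) for i in range(1000)],
)

parse_calls = [0]  # mutable counter, bumped on every JSON parse


def prop_id(properties):
    """Stand-in for a JSON-extract expression; counts how often it runs."""
    parse_calls[0] += 1
    return json.loads(properties)["id"]


conn.create_function("prop_id", 1, prop_id)
hits = conn.execute(
    "SELECT id FROM gqlite_default_nodes WHERE prop_id(properties) = ?", (42,)
).fetchall()
print(len(hits), "match;", parse_calls[0], "rows parsed")
```

One matching node costs a JSON parse of every row in the table, which is the behaviour the latency numbers absorb once per IC2 query.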

## Setup — what we tried before settling on the canonical idiom

This is documented because the journey took ~2 hours, and the wrong
turns are themselves data points about the system's API ergonomics.

### Attempt 1: per-row UNWIND batches with property-MATCH for edges

```python
conn.execute_oc_query(
"UNWIND $rows AS r CREATE (:Person {id: r.id, ...})",
{"$rows": [...]},
)
# ... then for edges:
conn.execute_oc_query(
"UNWIND $rows AS r "
"MATCH (a:Person {id: r.s}), (b:Person {id: r.d}) "
"CREATE (a)-[:knows]->(b)",
{"$rows": [...]},
)
```

**Outcome:** nodes loaded in seconds (288K of them via the UNWIND
CREATE path; that part is fine). **Edges then ground for an hour
without making meaningful progress.** Each MATCH lookup is O(N) due
to the JSON-property storage (above), and the LDBC IC2 hasCreator
phase has 287K edges × 2 lookups each, with every lookup scanning up
to 288K nodes: roughly 165 billion property comparisons.

### Attempt 2: add `CREATE INDEX FOR (n:Person) ON (n.id)` before edges

The textbook fix. **Doesn't work** — see divergence 2. The parser
rejects every index-creation syntax we tried.

### Attempt 3: id_map dance — load nodes, RETURN id(n), use internal IDs

```python
result = conn.execute_oc_query(
"UNWIND $rows AS r CREATE (p:Person {id: r.id, ...}) RETURN id(p), p.id",
{"$rows": [...]},
)
# Build Python-side dict: ldbc_id → internal_id (128-bit)
# Then for edges:
conn.execute_oc_query(
"UNWIND $rows AS r "
"MATCH (a), (b) WHERE id(a) = r.aid AND id(b) = r.bid "
"CREATE (a)-[:knows]->(b)",
{"$rows": [{"aid": ..., "bid": ...}]},
)
```

The same trick the graphqlite Python API does internally (returns
external→rowid maps from its bulk loaders). **Killed before completion**
because we found a better way before the run finished — see Attempt 4.
Whether `MATCH (a) WHERE id(a) = $x` actually hits a fast path in this
engine is unclear; the `node_key` BLOB column has no SQLite index in
the schema we inspected, so it'd likely also be a scan, just without
JSON parsing. Faster than Attempt 1 but probably not by enough to
matter at SF0.1 scale.
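
The Python-side half of the id_map approach can be sketched as below. The row shape `(internal_id, ldbc_id)` mirrors the `RETURN id(p), p.id` order in the query above, but how `execute_oc_query` actually hands rows back is an assumption here, and both helper names are hypothetical.

```python
def build_id_map(returned_rows):
    """Map LDBC id -> engine-internal id from rows shaped (internal_id, ldbc_id)."""
    return {ldbc_id: internal_id for internal_id, ldbc_id in returned_rows}


def edge_params(edges, id_map):
    """Translate (src_ldbc_id, dst_ldbc_id) pairs into rows for the UNWIND batch."""
    return [{"aid": id_map[src], "bid": id_map[dst]} for src, dst in edges]


# Made-up internal ids standing in for the 128-bit node keys.
returned = [(9001, 1), (9002, 2), (9003, 3)]
id_map = build_id_map(returned)
print(edge_params([(1, 2), (2, 3)], id_map))
```

The resulting list is what would be bound to `$rows` in the second query of Attempt 3.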

### Attempt 4 (the one that works): single big CREATE with shared variables

This is the canonical idiom from auksys's own benchmarks. Their bench
crate (`crates/gqlitedb/benches/common/pokec.rs`) loads the Pokec
social-network dataset by reading the entire file
`pokec_*_import.cypher` into one string and passing it to
`execute_oc_query`. The file is shaped:

```cypher
CREATE
(user_4826:User {id: 4826, age: 22, ...}),
(user_3317:User {id: 3317, age: 21, ...}),
...
(user_4826)-[:Friend]->(user_3317),
(user_3317)-[:Friend]->(user_4502),
...
```

**One CREATE statement.** Variables defined when nodes are created
(`user_4826`, etc.) are reused for edges *within the same statement*,
so the engine binds them directly — no MATCH lookup, no property scan.
The pokec_small_import.cypher they ship is 132K lines.

This is what `setup.py` does today. For LDBC SF0.1:
- 1.5K Persons + 151K Comments + 136K Posts = 288K node patterns
- 14K knows × 2 + 287K hasCreator = 315K edge patterns
- ~600K total patterns, ~46 MB query string
- Python build time: ~2 seconds
- gqlite ingest time: TBD on this run, but should match their pokec
scale expectation since the shape is the same.
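
A minimal sketch of how such a single-statement string can be assembled. This is not the actual `setup.py`: the function name, the input tuple shapes, the `label_id` variable naming, and the use of `json.dumps` as a stand-in for Cypher literal quoting are all assumptions.

```python
import json


def build_create(nodes, edges):
    """Build one CREATE statement that reuses node variables for edges.

    nodes: iterable of (label, ldbc_id, props)
    edges: iterable of ((src_label, src_id), rel_type, (dst_label, dst_id))
    """
    def var(label, ldbc_id):
        return f"{label.lower()}_{ldbc_id}"

    patterns = []
    for label, ldbc_id, props in nodes:
        fields = {"id": ldbc_id, **props}
        body = ", ".join(f"{key}: {json.dumps(value)}" for key, value in fields.items())
        patterns.append(f"({var(label, ldbc_id)}:{label} {{{body}}})")
    for (src_label, src_id), rel, (dst_label, dst_id) in edges:
        patterns.append(f"({var(src_label, src_id)})-[:{rel}]->({var(dst_label, dst_id)})")
    return "CREATE\n  " + ",\n  ".join(patterns)


statement = build_create(
    nodes=[("Person", 1, {"firstName": "Alice"}), ("Comment", 7, {})],
    edges=[(("Comment", 7), "hasCreator", ("Person", 1))],
)
print(statement)
```

Because every edge pattern refers to a variable bound earlier in the same statement, no MATCH is needed, which is the whole point of the idiom.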

### Why this took two hours

Lazy debugging order. We probed the Python API surface, hit the
property-MATCH wall, tried index DDL, designed the id_map workaround,
ran half of it — and only THEN cloned the upstream repo to look at
how *they* benchmark. The upstream answer was sitting in
`crates/gqlitedb/benches/common/pokec.rs` the whole time. It would
have taken five minutes to read first.

The lesson, written here so it sticks: **for any external system
in the cross-system bench, look at how the upstream's own benchmarks
load data before designing your loader.** The integration shape they
ship is almost certainly faster (and certainly more idiomatic) than
whatever you'd reverse-engineer from their public API.

## Why even the canonical idiom doesn't scale to LDBC SF0.1

After Attempt 4 ran for ~10 minutes with the DB growing at ~440 KB/sec
and no end in sight, we did a deep source-and-issues review of the
upstream. Findings (with file:line citations, all paths relative to
`auksys/gqlite` repo at the dev/1 branch):

1. **All "alternative" code paths converge at `execute_oc_query`.**
- The CLI's `.read FILE` (`crates/gqlitecli/src/main.rs`) reads
lines and calls `execute_oc_query`. No fast-import path.
- `gqlitebrowser` (web UI), `gqb` (query builder), `gqls` (ORM)
all wrap `execute_oc_query`.
- All five language bindings (Python, Ruby, Crystal, C++, Rune)
expose only `new` / `execute_oc_query` / `close`.

2. **`Connection::builder().set_option(...)` accepts a fixed key set:**
`path`, `backend`, `url`, `host`, `user`, `password`. **No PRAGMA,
`journal_mode`, `synchronous`, cache-size, batch-mode, or
durability knobs.** (`crates/gqlitedb/src/connection.rs:225-310`)

3. **SQLite is opened with library defaults** —
`crates/gqlitedb/src/store/sqlite.rs:265` calls
`rusqlite::Connection::open(&path)`. That gives you rollback
journal, `synchronous=FULL`, 2 MB cache, no mmap. **Per-row fsync
per commit.**

4. **The schema has no index on the node key column.** The CREATE
TABLE template (`crates/gqlitedb/templates/sql/sqlite/graph_create.sql`)
has only `id INTEGER PRIMARY KEY` and a `node_key` BLOB. **No index
on `node_key`, none on `properties`, none on `labels`.** Every
`MATCH (n {id:X})` is a full-table scan with `json_extract` per
row.

5. **Redb has the same shape.** `crates/gqlitedb/src/store/redb.rs:540-555`:
when matching by a property (`{id: X}`), it does
`nodes_table.range::<PersistentKey>(..)` — a full scan, then a
linear filter. Switching backend doesn't help.

6. **The interpreter doesn't batch even when the storage trait would
allow it.** `crates/gqlitedb/src/interpreter/evaluators.rs:1275,1286`:
each CREATE pattern calls `store.create_nodes(&mut tx, &name, vec![&n])`
— a Vec of ONE. The Store trait accepts iterators but the
interpreter never gathers them.

7. **The missing pieces are on the published roadmap, not shipped.**
Open GitLab issues at gitlab.com/auksys/GQLite (none merged):
- #169 — custom indexes
- #196 — streaming / pipeline execution
- #198 — refactor interpreter into a stream pipeline
- #200 — introduce logical planner
- #202 — deterministic planner
The README itself states: *"Development effort has now slowed down."*
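
For context on findings 2 and 3, the sketch below shows the kind of SQLite settings a bulk loader might set if it opened the database file itself. It uses Python's standard `sqlite3` module on a throwaway file; nothing here is reachable through gqlite's connection options, and the choice of WAL with `synchronous=NORMAL` is an illustrative assumption, not a measured recommendation.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "bench.db"))

    # What a plain open() gives you on this build of SQLite.
    defaults = {
        "journal_mode": conn.execute("PRAGMA journal_mode").fetchone()[0],
        "synchronous": conn.execute("PRAGMA synchronous").fetchone()[0],
    }

    # The knobs that the fixed option set does not expose.
    tuned = {"journal_mode": conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]}
    conn.execute("PRAGMA synchronous=NORMAL")
    tuned["synchronous"] = conn.execute("PRAGMA synchronous").fetchone()[0]
    conn.close()

print("defaults:", defaults)
print("tuned:   ", tuned)
```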

## Designed-for scale, by the project's own benchmarks

The Pokec social-network benchmark in `crates/gqlitedb/benches/`:

| File | Nodes | Friend edges | Bytes | Patterns |
|---|---|---|---|---|
| pokec_micro (smallest) | 138 | 138 | 17 KB | ~280 |
| **pokec_tiny (largest in bench enum)** | **4,538** | **12,681** | **870 KB** | **~17K** |
| pokec_small (file exists, not in `PokecSize`) | 10,000 | 121,716 | 5.6 MB | ~132K |
| **LDBC SF0.1 (us)** | **~289K** | **~315K** | **46 MB** | **~604K** |

Their `PokecSize` enum is `{Micro, Tiny}` — `Small` exists as a file
but **isn't even part of their benchmark suite**. Their largest *run*
is 17K patterns. We're throwing **35× their benched scale at the
system, and 5× their largest data file**. There's no realistic
expectation this would work; it's past the cliff.
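
The two multipliers follow from the pattern counts in the table above. A quick check, treating "patterns" as nodes plus edges, which is an approximation of how the table's last column was derived:

```python
# Pattern counts taken from the table: nodes + edges per dataset.
ldbc_patterns = 289_000 + 315_000
pokec_tiny_patterns = 4_538 + 12_681
pokec_small_patterns = 10_000 + 121_716

vs_benched = ldbc_patterns / pokec_tiny_patterns
vs_largest_file = ldbc_patterns / pokec_small_patterns

print(f"{vs_benched:.1f}x their largest benched run")
print(f"{vs_largest_file:.1f}x their largest data file")
```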

## Verdict for the bench writeup

`auksys/gqlite` v1.5.1 cannot load LDBC SF0.1 in reasonable time.
This isn't a configuration miss or a missing idiom — the deep review
confirmed via source citations that the architecture lacks property
indexes, lacks batched store calls, lacks SQLite tuning, and the
upstream acknowledges these as roadmap items. LDBC SF0.1 is the
**smallest** scale factor LDBC publishes; this system can't ingest
it. That's the finding.

## Project signals

- Active project. Last engine release v0.9 (2026-04-09); changelog
entries are recent and meaningful (parser rewrite, postgres
backend, schema generation). Sub-1.0, but moving.
- README: *"still in its early stage"* and *"Development effort has
now slowed down."*
- Multiple language bindings, multiple backends, gqlitebrowser web UI.
Breadth over depth — they shipped lots of surface area before
shipping the load/index machinery underneath.

## What would change this story

1. **Property indexes (issue #169)** — would let MATCH-by-property
scale. Today only the shared-variable single-CREATE trick avoids
needing one, and that trick doesn't scale past their tested
pokec_tiny.
2. **Batched interpreter (issue #198)** — would let the engine
amortize parse/plan/execute over many patterns instead of paying
per-row.
3. **`CREATE INDEX` in the parser** — fixes divergence 2.
4. **Fix the `(c:A|B)` planner bug in multi-hop patterns** — fixes
divergence 1.

Until those land — none earlier than v0.10 per the roadmap —
auksys/gqlite is not a candidate for LDBC-scale benchmarks.
40 changes: 40 additions & 0 deletions bench/cross-system/auksys_gqlite/ic2.cypher
@@ -0,0 +1,40 @@
// OpenCypher translation of bench/ldbc-queries/ic2.toml for
// auksys/gqlite (gqlite.org, distribution `gqlitedb` on PyPI).
//
// Source-of-truth IC2 lives in that toml; this file is a language
// translation. Substitution placeholders use Cypher's $-prefixed
// parameter syntax — auksys/gqlite supports it natively via
// `Connection.execute_oc_query(query, bindings)` where `bindings`
// is a dict whose keys INCLUDE the `$` prefix (see DIVERGENCES.md).
//
// Divergences from spec, applied to every system for apples-to-apples:
// - no ORDER BY (gqlite parser doesn't support it; we drop it
// from this translation even though auksys/gqlite handles it
// fine — fairness with our own engine)
// - no `coalesce(c.content, c.imageFile)` — we return c.content
// directly (Comment+Post both have a `content` column in our
// loaded subset; the original LDBC spec uses imageFile only for
// image-only Posts which we don't represent)
// - lowercase :knows / :hasCreator (loader convention)
//
// Structural shape: one MATCH with a label-disjunction predicate.
// auksys/gqlite supports `(c:Comment|Post)` syntax in standalone
// patterns (Cypher 5+ feature) — but in multi-hop patterns
// containing it, the planner errors with
// `CompileTime: UnknownFunction: get_source` (gqlitedb 1.5.1).
// Reproducer is one MATCH with two named edges and a union-label
// terminal node:
//
// MATCH (a)-[:knows]-(b)<-[:hasCreator]-(c:Comment|Post)
//
// fails, while the same query with `(c)` and `WHERE c:Comment OR
// c:Post` works. We use the OR-form here, which is also what
// graphqlite uses for the same logical query (graphqlite's dialect
// doesn't accept `:A|B` at all). The structural shape — one MATCH
// then a label predicate — matches gqlite's `(c: Comment | Post)`.
MATCH (p:Person {id: $personId})-[:knows]-(friend:Person)<-[:hasCreator]-(c)
WHERE (c:Comment OR c:Post) AND c.creationDate <= $maxDate
RETURN friend.id AS friend_id, friend.firstName AS friend_firstName,
friend.lastName AS friend_lastName,
c.id AS c_id, c.content AS c_content, c.creationDate AS c_creationDate
LIMIT 20
1 change: 1 addition & 0 deletions bench/cross-system/auksys_gqlite/requirements.txt
@@ -0,0 +1 @@
gqlitedb>=1.5.1