pleiad/frogql

froGQL

A Rust graph database implementing ISO GQL path pattern matching, with single-file storage inspired by SQLite.

Install

Python (PyPI):

pip install frogql

Wheels ship for CPython 3.8+ on Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64).

CLI / library (build from source):

cargo build --release

Quick Start (REPL)

# Import a CSV dataset and open the REPL
./target/release/frogql movies.gdb --import-csv path/to/Spanner_Instance/

# Open an existing database
./target/release/frogql movies.gdb

# Skip the typechecker for the session
./target/release/frogql movies.gdb --no-typecheck

gql> .schema
GRAPH TYPE DEFAULT (active)
Node types:
    (:Movie  {released INT, title STRING, votes INT, *})
    (:Person {name STRING, *})
Edge types:
    (:Person)-[:ACTED_IN {roles LIST<STRING>}]->(:Movie)
    (:Person)-[:DIRECTED]->(:Movie)
    (:Person)-[:REVIEWED {rating INT, summary STRING}]->(:Movie)
    ...

gql> MATCH (p: Person) -[:ACTED_IN]-> (m: Movie) WHERE m.released = 1999 RETURN p.name, m.title
p.name | m.title
-------+--------
"Keanu Reeves" | "The Matrix"
"Carrie-Anne Moss" | "The Matrix"
...

gql> MATCH (a: Person) -[:ACTED_IN]-> (m: Movie), (d: Person) -[:DIRECTED]-> (m) RETURN a.name, d.name

REPL meta-commands

The REPL follows the SQLite convention: every meta-command starts with a dot (.).

Command | Effect
--------+-------
.schema | alias for SHOW GRAPH TYPE DEFAULT (the auto-inferred schema)
.schema simple | grouped by-label renderer of the inferred schema
.graph-types | alias for SHOW GRAPH TYPES
.indexes | alias for SHOW INDEXES
.save | atomically persist the in-RAM mutations to the open .gdb (tmp+rename)
.dump-json <path> | pg_dump-style JSON snapshot of the merged graph
.dump-gql <path> | pg_dump-style GQL script that re-creates the graph
.help | list meta-commands and quick query syntax
.quit / .exit | exit (bare quit / exit also work)
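
The tmp+rename pattern behind .save can be sketched as follows. This is a minimal illustration using only the Rust standard library, not froGQL's actual code: the helper name atomic_write and the ".tmp" suffix are assumptions made for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of froGQL): replace `target` with `bytes`
/// so that readers see either the old file or the new one, never a mix.
pub fn atomic_write(target: &Path, bytes: &[u8]) -> io::Result<()> {
    // Build "<target>.tmp" next to the target so both live on the same
    // filesystem; a rename across filesystems would not be atomic.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = File::create(&tmp)?;
        file.write_all(bytes)?;
        // Flush file contents to disk before the rename makes them visible.
        file.sync_all()?;
    }

    // Swap the finished temp file into place, replacing any old target.
    fs::rename(&tmp, target)
}
```

If the process dies before the rename, the original .gdb is untouched and only a stray temp file is left behind, which is the property the "atomically persist" wording relies on.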

Opening a path that does not exist creates an empty database (sqlite3 convention):

./target/release/frogql /tmp/fresh.gdb
creating new database: /tmp/fresh.gdb
gql> INSERT (a:Person {name: 'Alice'})
gql> .save

Example Databases

The examples/ directory contains ready-to-use .gdb files:

Database | Nodes | Edges | Domain
---------+-------+-------+-------
movies.gdb | 171 | 253 | Movies (Person, Movie, ACTED_IN, DIRECTED, ...)
fraud_detection.gdb | 14,550 | 31,564 | Financial fraud (Transaction, Account, TRIGGERED_ALERT, ...)
bom.gdb | 1,500 | 2,248 | Bill of Materials (Product, Assembly, ConsistsOf, ...)

Regenerate from CSV: ./target/release/frogql examples/movies.gdb --import-csv <csv_dir>

What you can do

Topic | One-liner | Details
------+-----------+--------
Load data | CSV (Spanner / Text2GQL / Cypher), LDBC SNB, or JSON; DEFAULT graph type is inferred at import. | docs/data-import.md
Query | MATCH, OPTIONAL MATCH, comma-joins, WHERE, RETURN, EXISTS, repetition {n,m}, label algebra. | docs/query-language.md
Mutate | INSERT, SET, REMOVE, [DETACH] DELETE (ISO §13 MVP-1). In-RAM overlay until .save. | docs/data-modification.md
Schemas | Persistent named graph types with CREATE / USE / DROP / SHOW / VALIDATE GRAPH TYPE; reserved DEFAULT is auto-inferred. | docs/graph-types.md
Indexes | Auto-built hash + btree on every uniquely-keyed (label, prop) (rebuilt on open); explicit CREATE [HASH | BTREE] INDEX for the rest, persisted in the .gdb header chain so they survive close/reopen. | docs/secondary-indexes.md

Storage

froGQL uses a single .gdb file with 4KB pages:

  • All node/edge IDs are u32 internally (no string overhead at query time)
  • Three storage backends: in-memory Graph, LazyGraphStore (topology in RAM, labels/props from disk via LRU page cache), DiskGraphStore (minimal RAM)
  • The REPL uses LazyGraphStore for efficient memory usage on large graphs
  • The GRAPH TYPE catalog persists in its own page chain, so CREATE / USE / DROP GRAPH TYPE survive close/reopen

See docs/internals/storage-architecture.md for the full format specification.
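
To make the "4KB pages behind an LRU page cache" idea concrete, here is a small sketch. It is not the froGQL implementation: PageCache, page_offset, and the assumption that pages sit back to back from byte 0 are all invented for illustration, and the real on-disk layout is defined in the format specification.

```rust
use std::collections::{HashMap, VecDeque};

pub const PAGE_SIZE: u64 = 4096;

/// Byte offset of a page, ASSUMING pages are stored back to back from
/// offset 0 (the real .gdb layout may reserve header space).
pub fn page_offset(page_no: u32) -> u64 {
    u64::from(page_no) * PAGE_SIZE
}

/// Hypothetical LRU cache keyed by page number.
pub struct PageCache {
    capacity: usize,
    pages: HashMap<u32, Vec<u8>>,
    order: VecDeque<u32>, // front = least recently used
}

impl PageCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "cache must hold at least one page");
        Self { capacity, pages: HashMap::new(), order: VecDeque::new() }
    }

    /// Returns the page bytes, calling `load` only on a cache miss and
    /// evicting the least recently used page when the cache is full.
    pub fn get<F>(&mut self, page_no: u32, load: F) -> &[u8]
    where
        F: FnOnce(u32) -> Vec<u8>,
    {
        if self.pages.contains_key(&page_no) {
            // Hit: drop the old position; it is re-added as most recent below.
            self.order.retain(|&p| p != page_no);
        } else {
            if self.pages.len() == self.capacity {
                if let Some(victim) = self.order.pop_front() {
                    self.pages.remove(&victim);
                }
            }
            self.pages.insert(page_no, load(page_no));
        }
        self.order.push_back(page_no);
        &self.pages[&page_no]
    }

    pub fn contains(&self, page_no: u32) -> bool {
        self.pages.contains_key(&page_no)
    }
}
```

The linear retain on a hit keeps the sketch short; a production cache would use an intrusive list or a clock algorithm to make hits O(1).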

Join strategy

Comma-joins and chains of directed/undirected edges are executed with Leapfrog Triejoin (LTJ), a worst-case-optimal multi-way join that binds variables one at a time across all participating patterns simultaneously, with no intermediate materialisation. Each directed edge is modelled as a triple (src, label, tgt) indexed in six sorted orderings. LTJ activates automatically when the pattern decomposes into triples; non-decomposable shapes (any-direction edges, repetitions with named variables, unbounded {n,}) fall back to pairwise hash-join. Speedups on soc-LiveJournal1-100k (limit 1000) range from 14× (3-clique) to 4097× (4-path), with a 4-clique going from "hung" to 43 ms. See docs/internals/JOIN_STRATEGY_NOTES.md and CLAUDE.md for the algorithm in detail.
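
The core step of a leapfrog-style join is intersecting several sorted ID lists by repeatedly seeking each cursor forward to the current largest value. The sketch below shows that step for plain sorted u32 slices. It is a simplified round-based variant written for illustration, not froGQL's LTJ code: the function names are hypothetical, and it assumes every input list is strictly ascending.

```rust
/// Smallest index `i >= from` with `list[i] >= key`, found by binary search.
fn seek(list: &[u32], from: usize, key: u32) -> usize {
    from + list[from..].partition_point(|&x| x < key)
}

/// Intersects strictly ascending lists without materialising pairwise results.
/// Hypothetical helper; a full LTJ applies this per variable over trie levels.
pub fn leapfrog_intersect(lists: &[&[u32]]) -> Vec<u32> {
    let mut out = Vec::new();
    if lists.is_empty() {
        return out;
    }
    let mut pos = vec![0usize; lists.len()];
    'outer: loop {
        // The candidate is the largest value under any cursor; nothing
        // smaller can appear in every list.
        let mut candidate = 0u32;
        for (list, &p) in lists.iter().zip(&pos) {
            match list.get(p) {
                Some(&v) => candidate = candidate.max(v),
                None => break 'outer, // one list is exhausted: done
            }
        }
        // Leap every cursor forward to the first value >= candidate.
        let mut all_match = true;
        for (list, p) in lists.iter().zip(pos.iter_mut()) {
            *p = seek(list, *p, candidate);
            match list.get(*p) {
                Some(&v) if v == candidate => {}
                Some(_) => all_match = false,
                None => break 'outer,
            }
        }
        if all_match {
            out.push(candidate);
            for p in pos.iter_mut() {
                *p += 1;
            }
        }
    }
    out
}
```

For a comma-join such as (a)-[:ACTED_IN]->(m), (d)-[:DIRECTED]->(m), binding m amounts to intersecting the sorted target IDs of both edge labels in this fashion, which is why no intermediate result per pattern has to be built.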

Optimizer

The compiler pipeline is parse → elaborate → typecheck → optimize → run. The optimizer is reserved for semantics-preserving rewrites aimed at performance; ISO syntactic sugar is handled in elaborate. Major passes:

  • WHERE pushdown into descriptor value_preds and value_filters (type ascriptions like is T, value comparisons like <, <=, =).
  • Index-driven constant folding: Eq predicates that hit a hash index resolve to a single NodeId, which is pre-bound in the result tuple and excluded from the VEO. Range predicates (<, <=, >, >=) that hit a btree precompute the matching sorted set and become an O(log n) membership test.
  • Bounded Repeat unrolling: (P){lb, ub} with single-edge inner and ub - lb + 1 ≤ 4 rewrites to Union(P^lb, …, P^ub) and distributes the Union out of any surrounding Concat — each arm becomes a flat chain that decomposes natively into triples.
  • ORDER BY alias resolution: ORDER BY <RETURN-alias> lowers to the underlying AttrLookup so the runtime routes the sort through the pre-projection top-k heap (and the btree-driven path) instead of fully projecting every row before truncating.
  • OPTIONAL MATCH bind-pushdown (runtime-side, in run_match_chain): for each outer row, the inner pattern runs as a small LTJ with the shared variables pre-pinned — SQLite's correlated nested-loop with index lookup, mapped onto the LTJ. Cuts a representative LDBC IS5 query from 52 s → 0.56 s on SF0.1 (~93×).
  • BTree-driven top-k: when the sort key has a btree (auto-built or declared) the runtime walks the btree in primary-attr order, runs an LTJ pin per id, and stops at LIMIT k (or at the first cohort boundary past k for multi-spec ORDER BY). Multi-label Or descriptors (Comment | Post) drive the search via a k-way merge over per-label btrees. Drops LDBC IC2 on SF0.1 from 3.0 s → 0.086 s with the right indexes declared.
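
The bounded Repeat unrolling pass in the list above can be illustrated on a toy pattern type. This is a sketch under stated assumptions, not the optimizer's real AST: Pat, unroll, and the representation of each union arm as a flat list of edge labels are hypothetical.

```rust
/// Toy pattern AST (hypothetical; the real optimizer's types differ).
#[derive(Debug, Clone, PartialEq)]
pub enum Pat {
    Edge(&'static str),
    Repeat { inner: Box<Pat>, lb: u32, ub: u32 },
}

/// Rewrites a concatenation of patterns into a union of flat edge chains,
/// or returns None when a Repeat does not qualify for unrolling.
pub fn unroll(concat: &[Pat]) -> Option<Vec<Vec<&'static str>>> {
    // Start with one empty chain and extend it pattern by pattern.
    let mut arms: Vec<Vec<&'static str>> = vec![Vec::new()];
    for pat in concat {
        let choices: Vec<Vec<&'static str>> = match pat {
            Pat::Edge(label) => vec![vec![*label]],
            Pat::Repeat { inner, lb, ub } => match inner.as_ref() {
                // Single-edge inner and ub - lb + 1 <= 4, written as
                // ub - lb < 4 to avoid overflow.
                Pat::Edge(label) if lb <= ub && ub - lb < 4 => (*lb..=*ub)
                    .map(|n| vec![*label; n as usize])
                    .collect(),
                _ => return None,
            },
        };
        // Distribute the union out of the surrounding concat:
        // every existing arm is extended by every choice.
        let mut next = Vec::new();
        for prefix in &arms {
            for choice in &choices {
                let mut chain = prefix.clone();
                chain.extend_from_slice(choice);
                next.push(chain);
            }
        }
        arms = next;
    }
    Some(arms)
}
```

For example, ACTED_IN followed by DIRECTED{1,3} yields three arms: ACTED_IN DIRECTED, ACTED_IN DIRECTED DIRECTED, and ACTED_IN DIRECTED DIRECTED DIRECTED. Each arm is a fixed-length chain, which is the shape that decomposes into triples.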

See docs/internals/implemented-optimizations.md for benchmark numbers and code-level detail.
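
As a rough picture of why a btree-driven top-k can stop early, the sketch below walks a std BTreeMap in key order and stops as soon as k matching rows are found. It is an illustration only: top_k_by_index is a hypothetical name, the BTreeMap stands in for froGQL's btree index, and the matches closure stands in for the per-id LTJ pin. Cohort handling for multi-spec ORDER BY is not modelled.

```rust
use std::collections::BTreeMap;

/// Walks `index` (sort key -> node ids) in ascending key order and returns
/// the first `k` ids accepted by `matches`, without visiting later keys.
pub fn top_k_by_index<F>(
    index: &BTreeMap<i64, Vec<u32>>,
    k: usize,
    mut matches: F,
) -> Vec<(i64, u32)>
where
    F: FnMut(u32) -> bool,
{
    let mut out = Vec::new();
    if k == 0 {
        return out;
    }
    for (&key, ids) in index {
        for &id in ids {
            // Stand-in for running the rest of the pattern with `id` pinned.
            if matches(id) {
                out.push((key, id));
                if out.len() == k {
                    return out; // early exit: remaining keys are never probed
                }
            }
        }
    }
    out
}
```

The work done is proportional to how far into the index the k-th match sits, rather than to the total number of rows, which is the effect the pass description attributes to walking the btree in primary-attr order.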

Building and Testing

cargo build --release                    # build all binaries

# Strict clippy (run before every commit)
cargo clippy --workspace --all-targets -- -D clippy::all

# Lib + integration sweep (bench_test is excluded — pre-existing failures)
cargo test --lib
cargo test --test parser_test --test runtime_test --test store_runtime_test \
           --test text2gql_test --test parse_and_run_test --test count_test \
           --test null_test --test record_test --test list_test \
           --test compile_diagnostics --test elaborate_test --test float_test \
           --test graph_type_test --test typecheck_smoke --test typecheck_test \
           --test optional_match_test --test multi_match_test \
           --test aggregates_proptest --test lattice_proptest --test multi_match_proptest \
           --test exists_fold_test --test exists_runtime_test \
           --test parser_dm_test --test lazy_mut_test --test dm_runtime_test \
           --test dm_persistence_test --test dm_schema_test --test dm_default_test \
           --test dump_test --test dm_set_test --test dm_remove_test \
           --test dm_label_test --test dm_delete_expr_test

# Single test
cargo test --test runtime_test test_join_star_any_label -- --exact

Other binaries in src/bin/:

  • bench_queries — generic benchmark runner
  • bench_setup — downloads + extracts LDBC datasets (bench/data/ is gitignored)
  • ldbc_bench — LDBC interactive-complete driver, queries in bench/ldbc-queries/*.toml
  • internal_bench — gqlite-only diagnostic bench (typechecker on/off, lazy/disk backend)
  • convert_edgelist — edge-list format converter

Documentation

User-facing:

Internals:

About

Embedded GQL graph database in Rust. ISO/IEC 39075:2024 path patterns, Leapfrog-Triejoin runtime, single-file .gdb storage. CLI (frogql, SQLite-style), Python (PyPI) and Node.js (npm) bindings.
