
Version the graph SQLite store so an incompatible cache is rebuilt, not crashed on #115

Merged: zzet merged 2 commits into main from feat/graph-store-schema-versioning on Jun 18, 2026

@zzet zzet commented Jun 18, 2026

What & why

The graph SQLite store (internal/graph/store_sqlite) had no schema-version mechanism: Open ran a single CREATE ... IF NOT EXISTS batch plus an ad-hoc ALTER retrofit for the promoted node columns, with no PRAGMA user_version and no version check. This is the same structural shape the sidecar store had before its migration runner landed.

The hazard is latent today but real: the moment a future schema gains a typed column (or a column-dependent index) without a backfill, opening an older on-disk store fails at prepare()/SELECT, store_sqlite.Open returns the error, and the daemon refuses to start — recoverable only by the user hand-deleting store.sqlite. Loud and early, never silent-wrong, but a footgun.

Approach — a version mechanism sized for a derived cache

Unlike the sidecar (irreplaceable user data → must migrate in place), the graph store is a rebuildable cache: every row is reconstructable by re-indexing the source. So this is not the sidecar's in-place-only runner. It's a forward-only registry keyed on PRAGMA user_version where each step declares one strategy:

  • rebuild — the change needs data only re-indexing can supply → drop the DB and let the daemon repopulate it (always correct; the common case for a cache).
  • inPlace — a mechanical change derivable from the existing store (a new index, a denormalisation) → applied in a transaction, sparing a large repo a multi-minute reindex.
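
A minimal sketch of what a forward-only registry with one strategy per step might look like. The names strategy, migration, pending, and needsRebuild, the example SQL, and the choice of version 1 as the baseline are all hypothetical illustrations, not the types this PR adds.

```go
package main

import "fmt"

// strategy is a hypothetical enum; each step declares exactly one.
// In this sketch the zero value is rebuild, the always-correct choice for a cache.
type strategy int

const (
	rebuild strategy = iota // needs data only re-indexing can supply: drop and repopulate
	inPlace                 // mechanical change derivable from the existing store
)

// migration is the step that brings a store up to version `to`.
type migration struct {
	to   int
	kind strategy
	sql  string // illustrative only; used by inPlace steps
}

// pending returns the steps a store at version `stored` has not crossed yet.
// The registry is assumed to be ordered by ascending version.
func pending(registry []migration, stored int) []migration {
	var out []migration
	for _, m := range registry {
		if m.to > stored {
			out = append(out, m)
		}
	}
	return out
}

// needsRebuild reports whether any pending step can only be satisfied by a rebuild.
func needsRebuild(steps []migration) bool {
	for _, m := range steps {
		if m.kind == rebuild {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical registry: version 1 is the baseline, so steps start at 2.
	registry := []migration{
		{to: 2, kind: inPlace, sql: "CREATE INDEX IF NOT EXISTS idx_example ON example(col)"},
		{to: 3, kind: rebuild},
	}
	for _, stored := range []int{1, 2, 3} {
		steps := pending(registry, stored)
		fmt.Printf("stored=%d pending=%d rebuild=%v\n", stored, len(steps), needsRebuild(steps))
	}
}
```

One rebuild step anywhere in the pending range makes the whole upgrade a rebuild, which is why the in-place path only pays off when every pending step is mechanical.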

Open now reads user_version before applying the schema and reconciles:

| Stored version | Action |
| --- | --- |
| == current | no-op |
| 0 (fresh / pre-versioning) | baseline-stamp (schemaSQL + the existing column retrofit reconcile it); wipe instead if a later migration would be skipped |
| > current (newer build) | drop + rebuild |
| 0 < v < current | run registered in-place steps, or rebuild if any pending step needs it |
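
The reconciliation table can be read as a pure decision function. The sketch below is hypothetical: the action and step types and the decide function are illustrative names, and the rule used for "a later migration would be skipped" (any registered step exists beyond the baseline) is an assumption, since the PR text does not spell out how that condition is detected.

```go
package main

import "fmt"

// action is a hypothetical name for the outcome of reconciling a stored version.
type action int

const (
	actionNoop action = iota
	actionBaselineStamp
	actionInPlace
	actionRebuild
)

func (a action) String() string {
	switch a {
	case actionNoop:
		return "no-op"
	case actionBaselineStamp:
		return "baseline-stamp"
	case actionInPlace:
		return "in-place"
	default:
		return "rebuild"
	}
}

// step is a hypothetical registry entry: the version it reaches and whether it needs a rebuild.
type step struct {
	to      int
	rebuild bool
}

// decide maps (stored, current) to an action, following the rows of the table.
func decide(stored, current int, steps []step) action {
	switch {
	case stored == current:
		return actionNoop
	case stored > current:
		return actionRebuild // written by a newer build
	case stored == 0:
		// Assumption: stamping straight to current would skip every registered
		// step, so any registered step forces a wipe instead of a stamp.
		if len(steps) > 0 {
			return actionRebuild
		}
		return actionBaselineStamp
	}
	// 0 < stored < current: inspect only the steps not yet applied.
	for _, s := range steps {
		if s.to > stored && s.rebuild {
			return actionRebuild
		}
	}
	return actionInPlace
}

func main() {
	steps := []step{{to: 2, rebuild: false}, {to: 3, rebuild: true}}
	for _, stored := range []int{0, 1, 3, 7} {
		fmt.Printf("stored=%d current=3 -> %s\n", stored, decide(stored, 3, steps))
	}
}
```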

Safety

  • No silent mis-stamp. A test asserts the registry reaches currentSchemaVersion, so bumping the version without a matching migration entry fails CI rather than silently stamping an un-migrated store at runtime.
  • Explicit rebuild signal. A wiped store reports the existing NeedsRebuild() capability (the hook cmd/gortex.storeNeedsRebuild already probes for and wires into warm restart), so the daemon forces a full re-index — instead of relying on the side effect that a total wipe also empties file_mtimes.
  • Wipe is race-free. The destructive rebuild runs only under the daemon's exclusive store.sqlite.lock, held around Open for the sole writable on-disk lifecycle.
  • Data preserved on baseline. Additive only — no DROP, no rebuild of existing tables on the non-wipe paths.
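
The explicit rebuild signal is an optional capability, so a caller would typically probe for it with a type assertion. The sketch below is an assumption-laden illustration: the signature NeedsRebuild() bool, the rebuildReporter interface, and both store types are hypothetical, and this storeNeedsRebuild is a stand-in written for the example rather than the real cmd/gortex function.

```go
package main

import "fmt"

// rebuildReporter is a hypothetical optional interface.
// The method name comes from the PR; its signature is assumed.
type rebuildReporter interface {
	NeedsRebuild() bool
}

// memStore stands in for a backend that does not implement the capability.
type memStore struct{}

// sqliteStore stands in for the on-disk store; wiped is set when Open dropped the DB.
type sqliteStore struct{ wiped bool }

func (s *sqliteStore) NeedsRebuild() bool { return s.wiped }

// storeNeedsRebuild probes for the capability and treats its absence as "no".
func storeNeedsRebuild(store any) bool {
	r, ok := store.(rebuildReporter)
	return ok && r.NeedsRebuild()
}

func main() {
	fmt.Println(storeNeedsRebuild(&sqliteStore{wiped: true}))
	fmt.Println(storeNeedsRebuild(&sqliteStore{wiped: false}))
	fmt.Println(storeNeedsRebuild(memStore{}))
}
```

Reporting the wipe directly means the forced re-index no longer depends on file_mtimes happening to be empty.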

Tests

internal/graph/store_sqlite/schema_version_test.go covers:

  • fresh-stamp
  • pre-versioning baseline (data preserved)
  • newer-DB rebuild (data wiped)
  • the steady-state stored == current no-op (highest-frequency path)
  • an in-place upgrade driven end-to-end through Open (effect visible, ordering after schemaSQL, data survives, version stamped)
  • in-place failure leaves the version unstamped
  • :memory: under a wipe plan
  • NeedsRebuild after a wipe
  • registry well-formedness (+ rejection of every dangerous misconfiguration)
  • a shared-transaction rollback assertion

133 store_sqlite tests pass under -race; go build ./..., go vet, and golangci-lint are clean.
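
A registry well-formedness check of the kind described could look roughly like the following. This is a sketch under stated assumptions: the step type, validateRegistry, and the explicit baseline parameter are hypothetical, and the real test may reject more misconfigurations than the three shown here.

```go
package main

import "fmt"

// step is a hypothetical registry entry. Exactly one of inPlace or rebuild must be set.
type step struct {
	to      int
	inPlace func() error
	rebuild bool
}

// validateRegistry checks that steps are contiguous from baseline+1,
// declare exactly one strategy each, and end at the current version.
func validateRegistry(steps []step, baseline, current int) error {
	want := baseline + 1
	for _, s := range steps {
		if s.to != want {
			return fmt.Errorf("step targets version %d, want %d", s.to, want)
		}
		if (s.inPlace != nil) == s.rebuild {
			return fmt.Errorf("step %d must declare exactly one strategy", s.to)
		}
		want++
	}
	if reached := want - 1; reached != current {
		return fmt.Errorf("registry reaches version %d, current is %d", reached, current)
	}
	return nil
}

func main() {
	noop := func() error { return nil }
	steps := []step{{to: 2, inPlace: noop}, {to: 3, rebuild: true}}

	fmt.Println(validateRegistry(steps, 1, 3)) // well-formed
	fmt.Println(validateRegistry(steps, 1, 4)) // version bumped without a matching entry
}
```

Running such a check in CI is what turns a forgotten migration entry into a test failure rather than a mis-stamped store at runtime.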

Reviewed

The implementation was put through an adversarial review (concurrency/wipe-safety, version-decision logic, data-loss/rebuild, test rigor); the confirmed findings are folded into this PR.

Wipe is gated on the store lock (second commit)

The destructive rebuild is now opt-in. Open refuses to wipe by default — it returns ErrSchemaRebuildRequired and leaves the file intact — and performs the drop-and-recreate only when the caller passes WithRebuild(). NewSharedServer threads that opt-in through OpenBackend only in the branch where it actually acquired the exclusive store flock, so the permission to wipe and the lock that makes wiping safe are derived from the same fact.

No behaviour change for the daemon (it holds the lock and rebuilds as before); a future non-locked on-disk sqlite caller fails safe with a clear error instead of unlinking a database another process may have open. This deliberately does not broaden the flock's acquisition condition (that would force exclusivity on a hypothetical shared read-only sqlite consumer and touches the fragile daemon-restart lock path) — the opt-in achieves the same guarantee without that risk.
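
The opt-in can be pictured as a functional option that only the lock-holding branch passes. In the sketch below, ErrSchemaRebuildRequired and WithRebuild are names from this PR, but their definitions here are assumed; openStore and openBackend are simplified hypothetical stand-ins for Open and OpenBackend that model the decision with booleans instead of touching a real database or flock.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSchemaRebuildRequired is named in the PR; the message text is made up.
var ErrSchemaRebuildRequired = errors.New("graph store schema requires a rebuild")

type openConfig struct{ allowRebuild bool }

// Option is a hypothetical functional option type.
type Option func(*openConfig)

// WithRebuild grants permission to drop and recreate the store.
func WithRebuild() Option {
	return func(c *openConfig) { c.allowRebuild = true }
}

// openStore is a stand-in for Open: rebuildNeeded models the version decision.
// It reports whether the store was wiped.
func openStore(rebuildNeeded bool, opts ...Option) (wiped bool, err error) {
	var cfg openConfig
	for _, o := range opts {
		o(&cfg)
	}
	if !rebuildNeeded {
		return false, nil
	}
	if !cfg.allowRebuild {
		// Fail safe: the file is left intact.
		return false, ErrSchemaRebuildRequired
	}
	return true, nil
}

// openBackend passes WithRebuild only when the exclusive lock was acquired,
// so permission to wipe and the lock are derived from the same fact.
func openBackend(lockAcquired, rebuildNeeded bool) (bool, error) {
	var opts []Option
	if lockAcquired {
		opts = append(opts, WithRebuild())
	}
	return openStore(rebuildNeeded, opts...)
}

func main() {
	wiped, err := openBackend(true, true)
	fmt.Println("locked caller:", wiped, err)

	wiped, err = openBackend(false, true)
	fmt.Println("unlocked caller:", wiped, errors.Is(err, ErrSchemaRebuildRequired))
}
```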

zzet added 2 commits June 18, 2026 22:07
…ot crashed on

The graph store had no schema-version mechanism: Open ran one CREATE ...
IF NOT EXISTS batch plus an ad-hoc ALTER retrofit for the promoted node
columns, with no PRAGMA user_version. The hazard is latent today but real —
the moment a future schema gains a typed column or a column-dependent index
without a backfill, opening an older on-disk store would fail at prepare()/
SELECT, store_sqlite.Open would return the error, and the daemon would refuse
to start, recoverable only by the user hand-deleting store.sqlite.

Add a version mechanism sized for a derived cache. Open now reads PRAGMA
user_version before applying the schema and reconciles via a forward-only
registry: a fresh or pre-versioning store baseline-stamps to the current
version (schemaSQL + the existing column retrofit already reconcile it); a
store from a newer build, or one crossing a migration that needs data only
re-indexing can supply, is dropped and rebuilt; a known prior version runs its
registered in-place steps (the cheap path that spares a large repo a full
reindex for mechanical changes like a new index). Each migration declares
exactly one strategy — in-place or rebuild — and a test asserts the registry
reaches the current version, so bumping the version without a matching entry
fails CI instead of silently mis-stamping a store at runtime.

A wiped store reports the existing NeedsRebuild capability so the daemon forces
a full re-index on warm restart rather than relying on the side effect that a
total wipe also empties file_mtimes. The destructive rebuild runs only under
the daemon's exclusive store lock, which is held around Open for the sole
writable on-disk lifecycle.

The destructive drop-and-recreate in store_sqlite.Open was unconditional: its
cross-process safety relied on the daemon's exclusive store flock, but that
lock is acquired in a separate place under a different condition
(Lifecycle.Writable() && sqlite). A future caller that opened an on-disk sqlite
store without that lock could reach the wipe and unlink a database another
process had open — silent split-brain rather than a clean failure. Latent today
(the only oneshot path resolves to the memory backend), but the invariant was
convention-only.

Make it intrinsic. Open now refuses to wipe by default, returning
ErrSchemaRebuildRequired and leaving the file intact; the destructive rebuild
happens only when the caller passes WithRebuild. NewSharedServer passes it
through OpenBackend solely in the branch where it actually acquired the
exclusive flock, so the permission to wipe and the lock that makes wiping safe
are now derived from the same fact. No behaviour change for the daemon (it holds
the lock and still rebuilds); a non-locked caller fails safe.
@zzet zzet merged commit a834727 into main Jun 18, 2026
10 checks passed
@zzet zzet deleted the feat/graph-store-schema-versioning branch June 18, 2026 20:55