Version the graph SQLite store so an incompatible cache is rebuilt, not crashed on#115
Merged
Conversation
…ot crashed on The graph store had no schema-version mechanism: Open ran one CREATE ... IF NOT EXISTS batch plus an ad-hoc ALTER retrofit for the promoted node columns, with no PRAGMA user_version. The hazard is latent today but real — the moment a future schema gains a typed column or a column-dependent index without a backfill, opening an older on-disk store would fail at prepare()/ SELECT, store_sqlite.Open would return the error, and the daemon would refuse to start, recoverable only by the user hand-deleting store.sqlite. Add a version mechanism sized for a derived cache. Open now reads PRAGMA user_version before applying the schema and reconciles via a forward-only registry: a fresh or pre-versioning store baseline-stamps to the current version (schemaSQL + the existing column retrofit already reconcile it); a store from a newer build, or one crossing a migration that needs data only re-indexing can supply, is dropped and rebuilt; a known prior version runs its registered in-place steps (the cheap path that spares a large repo a full reindex for mechanical changes like a new index). Each migration declares exactly one strategy — in-place or rebuild — and a test asserts the registry reaches the current version, so bumping the version without a matching entry fails CI instead of silently mis-stamping a store at runtime. A wiped store reports the existing NeedsRebuild capability so the daemon forces a full re-index on warm restart rather than relying on the side effect that a total wipe also empties file_mtimes. The destructive rebuild runs only under the daemon's exclusive store lock, which is held around Open for the sole writable on-disk lifecycle.
The destructive drop-and-recreate in store_sqlite.Open was unconditional: its cross-process safety relied on the daemon's exclusive store flock, but that lock is acquired in a separate place under a different condition (Lifecycle.Writable() && sqlite). A future caller that opened an on-disk sqlite store without that lock could reach the wipe and unlink a database another process had open — silent split-brain rather than a clean failure. Latent today (the only oneshot path resolves to the memory backend), but the invariant was convention-only. Make it intrinsic. Open now refuses to wipe by default, returning ErrSchemaRebuildRequired and leaving the file intact; the destructive rebuild happens only when the caller passes WithRebuild. NewSharedServer passes it through OpenBackend solely in the branch where it actually acquired the exclusive flock, so the permission to wipe and the lock that makes wiping safe are now derived from the same fact. No behaviour change for the daemon (it holds the lock and still rebuilds); a non-locked caller fails safe.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
The graph SQLite store (
internal/graph/store_sqlite) had no schema-version mechanism:Openran a singleCREATE ... IF NOT EXISTSbatch plus an ad-hocALTERretrofit for the promoted node columns, with noPRAGMA user_versionand no version check. Same structural shape the sidecar store had before its migration runner landed.The hazard is latent today but real: the moment a future schema gains a typed column (or a column-dependent index) without a backfill, opening an older on-disk store fails at
prepare()/SELECT,store_sqlite.Openreturns the error, and the daemon refuses to start — recoverable only by the user hand-deletingstore.sqlite. Loud and early, never silent-wrong, but a footgun.Approach — a version mechanism sized for a derived cache
Unlike the sidecar (irreplaceable user data → must migrate in place), the graph store is a rebuildable cache: every row is reconstructable by re-indexing the source. So this is not the sidecar's in-place-only runner. It's a forward-only registry keyed on
PRAGMA user_versionwhere each step declares one strategy:rebuild— the change needs data only re-indexing can supply → drop the DB and let the daemon repopulate it (always correct; the common case for a cache).inPlace— a mechanical change derivable from the existing store (a new index, a denormalisation) → applied in a transaction, sparing a large repo a multi-minute reindex.Opennow readsuser_versionbefore applying the schema and reconciles:== current0(fresh / pre-versioning)> current(newer build)0 < v < currentSafety
currentSchemaVersion, so bumping the version without a matching migration entry fails CI rather than silently stamping an un-migrated store at runtime.NeedsRebuild()capability (the hookcmd/gortex.storeNeedsRebuildalready probes for and wires into warm restart), so the daemon forces a full re-index — instead of relying on the side effect that a total wipe also emptiesfile_mtimes.store.sqlite.lock, held aroundOpenfor the sole writable on-disk lifecycle.DROP, no rebuild of existing tables on the non-wipe paths.Tests
internal/graph/store_sqlite/schema_version_test.go— fresh-stamp, pre-versioning baseline (data preserved), newer-DB rebuild (data wiped), the steady-statestored == currentno-op (highest-frequency path), an in-place upgrade driven end-to-end throughOpen(effect visible, ordering afterschemaSQL, data survives, version stamped), in-place failure leaves the version unstamped,:memory:under a wipe plan,NeedsRebuildafter a wipe, registry well-formedness (+ rejection of every dangerous misconfiguration), and a shared-transaction rollback assertion. 133 store_sqlite tests pass under-race;go build ./...,go vet, andgolangci-lintclean.Reviewed
Implementation was put through an adversarial review (concurrency/wipe-safety, version-decision logic, data-loss/rebuild, test rigor); the confirmed findings are folded into this PR.
Wipe is gated on the store lock (second commit)
The destructive rebuild is now opt-in.
Openrefuses to wipe by default — it returnsErrSchemaRebuildRequiredand leaves the file intact — and performs the drop-and-recreate only when the caller passesWithRebuild().NewSharedServerthreads that opt-in throughOpenBackendonly in the branch where it actually acquired the exclusive store flock, so the permission to wipe and the lock that makes wiping safe are derived from the same fact.No behaviour change for the daemon (it holds the lock and rebuilds as before); a future non-locked on-disk sqlite caller fails safe with a clear error instead of unlinking a database another process may have open. This deliberately does not broaden the flock's acquisition condition (that would force exclusivity on a hypothetical shared read-only sqlite consumer and touches the fragile daemon-restart lock path) — the opt-in achieves the same guarantee without that risk.