Commit 887c361

Merge pull request #366 from arabold/vector-dimension
feat(store): add embedding model change safety
2 parents 7c3c1fa + 5407360 commit 887c361

25 files changed

Lines changed: 1303 additions & 39 deletions

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+-- Migration: Create metadata table for tracking global configuration state.
+-- Used to persist the active embedding model and vector dimension so the system
+-- can detect configuration changes between startups and prevent silent data corruption.
+CREATE TABLE IF NOT EXISTS metadata (
+  key TEXT PRIMARY KEY,
+  value TEXT NOT NULL
+);
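
To illustrate how application code might read and write this table, here is a minimal sketch. `SqlRunner`, `readMeta`, `writeMeta`, and `createFakeDb` are hypothetical names used only for illustration; they are not the project's `DocumentStore` API. The upsert statement assumes SQLite 3.24 or newer.

```typescript
// Hypothetical minimal database interface; a real driver would bind these to SQLite.
interface SqlRunner {
  get(sql: string, params: string[]): { value: string } | undefined;
  run(sql: string, params: string[]): void;
}

const SELECT_META = "SELECT value FROM metadata WHERE key = ?";
// Upsert: insert the key, or overwrite its value if the key already exists.
const UPSERT_META =
  "INSERT INTO metadata (key, value) VALUES (?, ?) " +
  "ON CONFLICT(key) DO UPDATE SET value = excluded.value";

// Returns the stored value, or null when the key has never been written.
function readMeta(db: SqlRunner, key: string): string | null {
  const row = db.get(SELECT_META, [key]);
  return row ? row.value : null;
}

function writeMeta(db: SqlRunner, key: string, value: string): void {
  db.run(UPSERT_META, [key, value]);
}

// In-memory stand-in (assumption) used only to exercise the helpers without SQLite.
function createFakeDb(): SqlRunner {
  const rows: { [key: string]: string } = {};
  return {
    get: (_sql, params) => {
      const key = params[0];
      return Object.prototype.hasOwnProperty.call(rows, key)
        ? { value: rows[key] }
        : undefined;
    },
    run: (_sql, params) => {
      rows[params[0]] = params[1];
    },
  };
}
```

Because both values are stored as `TEXT`, the dimension would be written as a string such as `"1536"` and converted back by the caller.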

docs/concepts/data-storage.md

Lines changed: 22 additions & 2 deletions
@@ -2,11 +2,11 @@

 ## Overview

-The storage system uses SQLite with a normalized, four-table schema design for efficient document storage, retrieval, and version management. The schema supports page-level metadata tracking, ETag-based change detection, and hierarchical document chunking.
+The storage system uses SQLite with a normalized schema design for efficient document storage, retrieval, and version management. The schema supports page-level metadata tracking, ETag-based change detection, hierarchical document chunking, and embedding model identity tracking.

 ## Database Schema

-The database consists of four core tables with normalized relationships:
+The database consists of four core tables with normalized relationships, plus a metadata table for system-level configuration tracking:

 ```mermaid
 erDiagram
@@ -58,6 +58,11 @@ erDiagram
 blob embedding
 datetime created_at
 }
+
+metadata {
+text key PK
+text value
+}
 ```

 ### Libraries Table
@@ -137,6 +142,18 @@ Document chunks with embeddings and hierarchical metadata.

 **Code Reference:** `src/store/types.ts` lines 39-48 (DbChunk interface)

+### Metadata Table
+
+Key-value store for system-level configuration tracking, independent of library/version data.
+
+**Schema:**
+- `key` (TEXT PRIMARY KEY): Configuration key name
+- `value` (TEXT NOT NULL): Configuration value
+
+**Purpose:** Tracks the active embedding model identity (`embedding_model` and `embedding_dimension` keys) to detect incompatible configuration changes between server restarts. When a model or dimension change is detected, the server prompts the user to confirm vector invalidation before proceeding.
+
+**Code Reference:** `db/migrations/013-create-metadata-table.sql`, `src/store/DocumentStore.ts` - `getEmbeddingMetadata()`, `setEmbeddingMetadata()`, `checkEmbeddingModelChange()`
+
 ## Schema Evolution

 ### Migration System
@@ -154,6 +171,9 @@ Sequential SQL migrations in `db/migrations/`:
 9. `008-case-insensitive-names.sql` - Case-insensitive library name handling
 10. `009-add-pages-table.sql` - Page-level metadata normalization
 11. `010-add-depth-to-pages.sql` - Crawl depth tracking for refresh operations
+12. `011-add-vector-triggers.sql` - FTS and vector table trigger maintenance
+13. `012-add-source-content-type.sql` - Source content type tracking on pages
+14. `013-create-metadata-table.sql` - Key-value metadata table for embedding model tracking

 **Code Reference:** All migration files in `db/migrations/` directory

docs/guides/embedding-models.md

Lines changed: 33 additions & 0 deletions
@@ -113,3 +113,36 @@ AZURE_OPENAI_API_VERSION="2024-02-01" \
 DOCS_MCP_EMBEDDING_MODEL="microsoft:text-embedding-ada-002" \
 npx @arabold/docs-mcp-server@latest
 ```
+
+## Changing the Embedding Model
+
+When you change the embedding model or vector dimension after initial setup, existing embedding vectors become semantically incompatible with the new configuration. The server detects this automatically by tracking the active model identity in a metadata table.
+
+### What Happens on Model Change
+
+**Interactive mode (TTY connected):** The server displays a warning and prompts for confirmation before proceeding. Rejecting the prompt aborts startup with no changes made.
+
+```
+⚠️ Embedding model change detected:
+Previous: openai:text-embedding-3-small (1536 dimensions)
+Current: openai:text-embedding-ada-002 (1536 dimensions)
+
+All existing embedding vectors will be invalidated.
+Libraries must be re-scraped to restore vector search.
+Full-text search will continue working for all existing documents.
+
+Proceed with model change? (y/N)
+```
+
+**Non-interactive mode (MCP/stdio, CI/CD):** The server fails startup entirely with a descriptive error message. To resolve this, start the server once interactively (with a TTY connected) and confirm the model change at the prompt.
+
+### After Confirming a Model Change
+
+- All stored embedding vectors are set to NULL
+- The vector search index (`documents_vec`) is recreated empty with the new dimension
+- Full-text search continues working for all existing documents
+- Libraries must be re-scraped to regenerate embeddings with the new model
+
+### Vector Dimension Override
+
+The vector dimension defaults to the model's native dimension (e.g., 1536 for `text-embedding-3-small`). You can override it with `embeddings.vectorDimension` in the config file or `DOCS_MCP_EMBEDDINGS_VECTOR_DIMENSION` as an environment variable. The value must be a positive integer (minimum 1).
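
The override rule can be sketched as a small, dependency-free function. This is an illustration only: `resolveVectorDimension` is a hypothetical helper, and the project itself is described as using Zod validation rather than a hand-written check.

```typescript
// Hypothetical helper: resolves the vector dimension from an optional override string.
// Falls back to the model's native dimension when no override is set.
function resolveVectorDimension(
  nativeDimension: number,
  override: string | undefined,
): number {
  if (override === undefined || override.trim() === "") {
    return nativeDimension;
  }
  const trimmed = override.trim();
  // Accept only plain positive integers such as "768"; reject "0", "-1", "1.5", "12abc".
  if (!/^[1-9]\d*$/.test(trimmed)) {
    throw new Error(
      `Invalid vector dimension "${override}": expected a positive integer (minimum 1)`,
    );
  }
  return parseInt(trimmed, 10);
}
```

A caller would pass the raw value of `DOCS_MCP_EMBEDDINGS_VECTOR_DIMENSION` as the second argument; leaving it unset keeps the native dimension.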

docs/setup/configuration.md

Lines changed: 1 addition & 1 deletion
@@ -223,7 +223,7 @@ Settings for the vector embedding generation.
 | `batchChars` | `50000` | Maximum total characters per embedding batch. |
 | `requestTimeoutMs` | `30000` | Timeout for each embedding API request (ms). |
 | `initTimeoutMs` | `30000` | Timeout for the initial test embedding during model initialization (ms). |
-| `vectorDimension` | `1536` | Dimension of the vector space (must match model). |
+| `vectorDimension` | `1536` | Dimension of the vector space. Must be a positive integer (minimum 1). Override with `DOCS_MCP_EMBEDDINGS_VECTOR_DIMENSION`. Changing this value triggers a model change confirmation on next startup. |

 ### Search (`search`)

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-03-17
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
+## Context
+
+The system uses embedding vectors for semantic search via `sqlite-vec`. Vectors are produced by a configured embedding model (e.g., `openai:text-embedding-3-small`) and stored in `documents.embedding` (JSON blob) alongside an indexed `documents_vec` virtual table. The vector table is currently created by a SQL migration with a hard-coded dimension of 1536.
+
+PR 330 introduces runtime-configurable vector dimensions so different embedding providers (e.g., 1536d for OpenAI, 3584d for Gemini) can work without migration changes. However, it has critical gaps:
+
+1. **Silent backfill of incompatible vectors**: When the dimension changes, `ensureVectorTable()` drops and recreates the vec table, then backfills old vectors that were padded to the previous dimension. These old vectors are dimensionally and semantically incompatible with the new configuration.
+2. **No model change detection**: Switching between models with the same dimension (e.g., `text-embedding-3-small` to `text-embedding-ada-002`, both 1536d) is completely undetected. Old vectors remain, producing silently degraded search results.
+3. **No user feedback**: There is no warning, confirmation prompt, or guidance when the embedding configuration changes.
+
+The existing specs document this as a "known gap" in `embedding-generation/spec.md:154` but provide no solution.
+
+**Stakeholders**: All users who configure embedding models, especially those running the server in MCP/stdio mode where there is no TTY for interactive prompts.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Detect embedding model and/or vector dimension changes at startup by comparing the current configuration against persisted metadata in the database.
+- Require explicit user confirmation (interactive TTY) before proceeding with a destructive model change.
+- Fail startup entirely in non-interactive mode (no TTY) when a model change is detected, with a clear error message explaining what happened and how to resolve it.
+- On confirmed change, invalidate all existing vectors (set to NULL) and recreate the vec table empty -- no backfill of incompatible data.
+- Silently initialize metadata on first startup or upgrade from a pre-metadata database.
+- Adopt PR 330's `ensureVectorTable()` approach for runtime-configurable dimensions, but fix the broken backfill.
+
+**Non-Goals:**
+- Per-library or per-version model tracking. A global metadata record is sufficient; per-vector model tracking would bloat table sizes with no practical benefit since you cannot mix models within a search anyway.
+- Automatic re-scraping after model change. The system invalidates vectors and continues in FTS-only mode; the user must manually re-scrape to regenerate vectors.
+- Supporting multiple embedding models simultaneously. The system uses one model globally.
+- Migrating or converting vectors between models. This is mathematically impossible for different embedding spaces.
+
+## Decisions
+
+### Decision 1: Global Metadata Table for Model Tracking
+
+**Choice**: Store embedding model identity and dimension in a new `metadata` key-value table (`CREATE TABLE metadata (key TEXT PRIMARY KEY, value TEXT NOT NULL)`).
+
+**Alternatives considered**:
+- *Per-vector model column in `documents`*: Rejected -- massive storage bloat (one string per row) for information that is always the same globally. The system enforces a single model across the entire database.
+- *Per-version model tracking*: Rejected -- adds complexity without benefit. Even if tracked per-version, you cannot search across versions with mixed models. A global record is sufficient to detect changes.
+- *File-based metadata (e.g., `.embedding-meta.json`)*: Rejected -- metadata should live alongside the data it describes. A separate file can get out of sync with the database.
+
+**Keys stored**:
+- `embedding_model`: The full model spec string (e.g., `openai:text-embedding-3-small`). Compared against `config.app.embeddingModel`.
+- `embedding_dimension`: The configured vector dimension as a string (e.g., `"1536"`). Compared against `config.embeddings.vectorDimension`.
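
The comparison of these two keys against the current configuration could look roughly like the sketch below. `checkEmbeddingChange` and its types are hypothetical names, not the project's actual `checkEmbeddingModelChange()` implementation; a missing key is treated as a first run, following Decision 5 in this document.

```typescript
interface EmbeddingIdentity {
  model: string;
  dimension: number;
}

type ChangeCheck =
  | { kind: "first-run" } // no baseline stored yet
  | { kind: "unchanged" }
  | { kind: "changed"; previous: { model: string; dimension: string } };

// `stored` holds raw values read from the metadata table (null when a key is absent).
function checkEmbeddingChange(
  stored: { model: string | null; dimension: string | null },
  current: EmbeddingIdentity,
): ChangeCheck {
  if (stored.model === null || stored.dimension === null) {
    return { kind: "first-run" };
  }
  // The dimension is persisted as text, so compare against the stringified number.
  const sameModel = stored.model === current.model;
  const sameDimension = stored.dimension === String(current.dimension);
  if (sameModel && sameDimension) {
    return { kind: "unchanged" };
  }
  return {
    kind: "changed",
    previous: { model: stored.model, dimension: stored.dimension },
  };
}
```

Comparing the model string as well as the dimension is what catches a switch between two models that share a dimension, such as two 1536d models.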
+
+### Decision 2: Fail Non-Interactive Startup on Model Change
+
+**Choice**: When a model or dimension change is detected and `process.stdout.isTTY` is falsy, throw a fatal `EmbeddingModelChangedError` that prevents startup entirely.
+
+**Rationale**: In MCP/stdio mode, there is no user present to see warnings. Silently degrading search quality or silently dropping vectors could lead to hours of confusion. A hard failure with a descriptive error message is the safest option for automated environments.
+
+**Error message format**:
+```
+Embedding model change detected:
+Previous: openai:text-embedding-3-small (1536 dimensions)
+Current: gemini:embedding-001 (768 dimensions)
+
+All existing vectors are incompatible and must be invalidated.
+To confirm this change, start the server interactively (with a TTY connected)
+and follow the prompts.
+```
+
+**Alternative considered**:
+- *Escape hatch env var (`DOCS_MCP_CONFIRM_MODEL_CHANGE=true`)*: Not adopted initially to keep the surface area small. Can be added later if automated deployments need it.
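
A structured error carrying the old and new identity might be shaped as follows. The class name comes from this document; the constructor signature, the `previous`/`current` fields, and the `ModelIdentity` type are assumptions made for illustration.

```typescript
interface ModelIdentity {
  model: string;
  dimension: number;
}

class EmbeddingModelChangedError extends Error {
  readonly previous: ModelIdentity;
  readonly current: ModelIdentity;

  constructor(previous: ModelIdentity, current: ModelIdentity) {
    // Build the multi-line message in the format documented above.
    super(
      [
        "Embedding model change detected:",
        `Previous: ${previous.model} (${previous.dimension} dimensions)`,
        `Current: ${current.model} (${current.dimension} dimensions)`,
        "",
        "All existing vectors are incompatible and must be invalidated.",
        "To confirm this change, start the server interactively (with a TTY connected)",
        "and follow the prompts.",
      ].join("\n"),
    );
    this.name = "EmbeddingModelChangedError";
    this.previous = previous;
    this.current = current;
  }
}
```

Keeping both identities on the error object lets a caller render its own prompt without parsing the message text.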
+
+### Decision 3: Interactive Confirmation Flow
+
+**Choice**: When a model change is detected and `process.stdout.isTTY` is truthy, display a warning and prompt the user for explicit confirmation before proceeding.
+
+**Flow**:
+1. Display warning with previous vs. current model/dimension
+2. Explain consequences: "All existing embedding vectors will be invalidated. Libraries must be re-scraped to restore vector search."
+3. Prompt: `Proceed with model change? (y/N)`
+4. Default is `N` (abort). Only `y` or `Y` confirms.
+5. On confirm: invalidate vectors, update metadata, continue startup.
+6. On abort: throw error, exit.
+
+**Implementation**: The confirmation prompt is handled at the CLI layer (not in `DocumentStore`). `DocumentStore.initialize()` detects the mismatch and throws a structured error (`EmbeddingModelChangedError`) containing old/new model info. The CLI command handler catches this error and either prompts (TTY) or re-throws (non-TTY).
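
The catch-then-prompt-or-rethrow behaviour described here can be sketched with injected I/O so it runs without a terminal. Everything named below (`PromptIO`, `startWithModelChangeCheck`, `isModelChangeError`, `isConfirmed`) is hypothetical. The prompt is synchronous for brevity, whereas a real CLI prompt would be asynchronous, and retrying `initialize()` after confirmation is an assumption about how "continue startup" might be implemented.

```typescript
// Hypothetical I/O surface so the flow can be exercised without a real terminal.
interface PromptIO {
  isTTY: boolean;
  ask(question: string): string; // synchronous for brevity
  warn(message: string): void;
}

// Identifies the structured error by name, which also works across module boundaries.
function isModelChangeError(err: unknown): err is Error {
  return err instanceof Error && err.name === "EmbeddingModelChangedError";
}

// Only "y" or "Y" confirms; anything else, including empty input, aborts.
function isConfirmed(answer: string): boolean {
  const trimmed = answer.trim();
  return trimmed === "y" || trimmed === "Y";
}

// Runs `initialize`; on a detected model change it prompts (TTY) or re-throws (non-TTY).
function startWithModelChangeCheck(
  initialize: () => void,
  confirmAndInvalidate: () => void,
  io: PromptIO,
): void {
  try {
    initialize();
  } catch (err) {
    if (!isModelChangeError(err) || !io.isTTY) {
      throw err; // unrelated failure, or no TTY available to ask
    }
    io.warn(err.message);
    io.warn(
      "All existing embedding vectors will be invalidated. " +
        "Libraries must be re-scraped to restore vector search.",
    );
    if (!isConfirmed(io.ask("Proceed with model change? (y/N) "))) {
      throw new Error("Model change rejected; startup aborted with no changes made");
    }
    confirmAndInvalidate(); // invalidate vectors and update metadata
    initialize(); // assumption: retry startup against the new baseline
  }
}
```

Passing `isTTY` in as a plain boolean keeps the decision testable; a real handler would read it from the process streams.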
+
+### Decision 4: Vector Invalidation Strategy
+
+**Choice**: On confirmed model change, execute two operations in a transaction:
+1. `UPDATE documents SET embedding = NULL` -- clears all stored embedding blobs
+2. Drop and recreate `documents_vec` as empty (no backfill)
+
+**Rationale**: Setting embeddings to NULL rather than deleting documents preserves all indexed content. FTS search continues working immediately. The user can re-scrape at their convenience to regenerate vectors.
+
+**Alternative considered**:
+- *Delete only from `documents_vec`, keep `documents.embedding` blobs*: Rejected -- stale blobs would be backfilled into the vec table on next dimension reconciliation, recreating the incompatibility problem.
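
A minimal sketch of the two-step invalidation, written against a hypothetical `SqlExecutor` interface rather than a specific SQLite driver. The `vec0` declaration uses `sqlite-vec`'s `float[N]` column syntax, but the single-column layout is an assumption; the project's real `documents_vec` table may declare additional columns.

```typescript
// Hypothetical minimal driver interface; `transaction` is expected to roll back if `work` throws.
interface SqlExecutor {
  exec(sql: string): void;
  transaction(work: () => void): void;
}

function invalidateVectors(db: SqlExecutor, dimension: number): void {
  // Guard the interpolated value: it must be a finite integer of at least 1.
  if (!isFinite(dimension) || dimension < 1 || Math.floor(dimension) !== dimension) {
    throw new Error(`Invalid vector dimension: ${dimension}`);
  }
  db.transaction(() => {
    // 1. Clear stored embedding blobs; documents and FTS content stay intact.
    db.exec("UPDATE documents SET embedding = NULL");
    // 2. Recreate the vector index empty with the new dimension (no backfill).
    db.exec("DROP TABLE IF EXISTS documents_vec");
    // Assumed single-column layout, for illustration only.
    db.exec(
      `CREATE VIRTUAL TABLE documents_vec USING vec0(embedding float[${dimension}])`,
    );
  });
}
```

Running both steps in one transaction means an interruption cannot leave NULLed blobs alongside an old-dimension index.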
+
+### Decision 5: First-Run / Upgrade Behavior
+
+**Choice**: If the `metadata` table does not contain `embedding_model` or `embedding_dimension` keys, silently store the current configuration values without prompting.
+
+**Rationale**: Existing deployments upgrading to this version have no stored metadata. Requiring confirmation on upgrade would break all existing automated deployments. The first run establishes the baseline; subsequent changes are detected against it.
+
+### Decision 6: Initialization Order in DocumentStore
+
+**Choice**: The startup sequence in `DocumentStore.initialize()` becomes:
+1. Load `sqlite-vec` extension
+2. Apply migrations (including metadata table creation)
+3. Check embedding model/dimension metadata (detect changes, throw if mismatch)
+4. If no mismatch (or after confirmed invalidation): `ensureVectorTable()` with current dimension
+5. Prepare SQL statements
+6. Initialize embeddings client
+
+This ensures the metadata check happens before any table manipulation, preventing the table from being left in an inconsistent state if the user aborts.
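
The ordering guarantee can be shown with a small sketch in which each step is an injected hook. `StartupSteps` and `initializeStore` are hypothetical; the point is only that a throw from the metadata check stops the sequence before the vector table is touched.

```typescript
// Hypothetical step hooks; each corresponds to one numbered step in the sequence.
interface StartupSteps {
  loadVecExtension(): void;
  applyMigrations(): void;
  checkEmbeddingMetadata(): void; // throws on a model/dimension mismatch
  ensureVectorTable(): void;
  prepareStatements(): void;
  initEmbeddingsClient(): void;
}

function initializeStore(steps: StartupSteps): void {
  steps.loadVecExtension();
  steps.applyMigrations();
  // A throw here propagates immediately, so documents_vec is never modified.
  steps.checkEmbeddingMetadata();
  steps.ensureVectorTable();
  steps.prepareStatements();
  steps.initEmbeddingsClient();
}
```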
+
+### Decision 7: Adopt PR 330's ensureVectorTable() With Fixes
+
+**Choice**: Keep PR 330's approach of creating `documents_vec` at runtime with configurable dimensions, but modify the backfill behavior:
+- On **first creation** (table doesn't exist): Create empty. Vectors will be populated during scraping.
+- On **dimension match** (table exists, same dimension): No-op. Existing vectors are valid.
+- On **dimension mismatch**: This case is now handled by the model change detection in step 3 above. By the time `ensureVectorTable()` runs, vectors have already been invalidated if needed.
+
+The original PR 330 backfill (`INSERT OR REPLACE ... FROM documents WHERE embedding IS NOT NULL`) is removed entirely.
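
The three cases can be expressed as a pure planning function. `planVectorTable` and `VectorTablePlan` are hypothetical names. Treating a leftover mismatch as "recreate empty" is an assumption: the text only states that vectors have already been invalidated by that point, so the sketch never backfills.

```typescript
type VectorTablePlan = "create-empty" | "no-op" | "recreate-empty";

// `existingDimension` is null when documents_vec does not exist yet.
function planVectorTable(
  existingDimension: number | null,
  configuredDimension: number,
): VectorTablePlan {
  if (existingDimension === null) {
    return "create-empty"; // first creation; vectors arrive during scraping
  }
  if (existingDimension === configuredDimension) {
    return "no-op"; // existing vectors remain valid
  }
  // Assumed to be reached only after the startup check has invalidated stored
  // vectors, so the table is rebuilt empty rather than backfilled.
  return "recreate-empty";
}
```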
+
+## Risks / Trade-offs
+
+**[Risk] Non-interactive failure blocks automated deployments that intentionally change models** → Mitigation: The error message clearly explains how to resolve it (interactive startup). A future enhancement could add a `DOCS_MCP_CONFIRM_MODEL_CHANGE=true` env var escape hatch if demand warrants it.
+
+**[Risk] First-run silent initialization stores wrong baseline if config is misconfigured** → Mitigation: The baseline is only stored after successful embedding initialization. If the model fails to initialize (bad credentials, invalid model), no metadata is stored, so the next attempt with correct config becomes the new "first run."
+
+**[Risk] Setting all embeddings to NULL is expensive on large databases** → Mitigation: This is a single UPDATE statement, so the cost is one pass over the `documents` table inside one transaction. The alternative (deleting and re-inserting documents) would be far more expensive and would break FTS indexes.
+
+**[Risk] Users may not realize they need to re-scrape after model change** → Mitigation: The confirmation prompt and post-invalidation log message explicitly state that re-scraping is required. The system continues in FTS-only mode as a safety net.
+
+**[Trade-off] Global model tracking vs per-version tracking** → We chose global for simplicity. This means the system cannot detect if individual versions were scraped with different models (e.g., if a user changed models between scraping two libraries). This is acceptable because: (a) search is always performed within a single version, and (b) the model change detection catches the transition point.
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+## Why
+
+The system currently has no mechanism to detect when the configured embedding model or vector dimension changes between startups. Vectors produced by different embedding models are semantically incompatible -- cosine similarity between them is meaningless -- yet the system silently continues serving degraded search results. PR 330 introduces configurable vector dimensions but compounds this problem by silently backfilling old vectors into a potentially incompatible table. This change adds safety rails to prevent silent data corruption and enforce explicit user acknowledgment before invalidating existing vectors.
+
+## What Changes
+
+- **BREAKING**: Non-interactive startup (MCP/stdio) SHALL fail with a descriptive error when an embedding model or dimension change is detected, requiring manual interactive confirmation.
+- Store the active embedding model identifier and vector dimension as global metadata in the database to enable change detection across restarts.
+- On first startup (or upgrade from a database without metadata), silently initialize the metadata record with the current configuration -- no user action required.
+- When a model or dimension change is detected during interactive startup, display a warning explaining that all existing vectors will be invalidated, and require explicit user confirmation before proceeding.
+- On confirmed model/dimension change, set all `documents.embedding` values to `NULL` and recreate the `documents_vec` table empty (no backfill of incompatible vectors).
+- Adopt PR 330's runtime-configurable `documents_vec` creation via `ensureVectorTable()`, but remove the broken backfill that inserts old-dimension vectors into a new-dimension table.
+- Add Zod validation ensuring `vectorDimension >= 1`.
+
+## Capabilities
+
+### New Capabilities
+- `embedding-model-change-safety`: Defines startup-time detection of embedding model/dimension changes, interactive confirmation flow, non-interactive failure behavior, vector invalidation on confirmed change, and metadata persistence for tracking the active embedding configuration.
+
+### Modified Capabilities
+- `embedding-generation`: Close the documented "known gap" about model change tracking. Add requirements for vector invalidation when the model changes, and update dimension normalization to cover runtime-configurable vector table creation.
+- `embedding-resolution`: Add requirements for persisting resolved model identity to the database and detecting mismatches on subsequent startups.
+- `configuration`: Add `embeddings.vectorDimension` as a documented configurable parameter with Zod validation (`>= 1`).
+
+## Impact
+
+- **`src/store/DocumentStore.ts`**: Major changes to `initialize()`, new `ensureVectorTable()` method (from PR 330, modified), new metadata read/write methods, new vector invalidation method.
+- **`src/store/embeddings/`**: No structural changes, but `EmbeddingConfig` output is now persisted to the database.
+- **`src/utils/config.ts`**: Minor -- add `.min(1)` validation on `vectorDimension` (from PR 330).
+- **`db/migrations/`**: New migration to create `metadata` table. Migration 012 from PR 330 (drop `documents_vec`) is adopted.
+- **`src/cli/`**: Startup commands need to handle interactive confirmation prompts and non-interactive failure for model changes.
+- **User-facing docs**: `docs/guides/embedding-models.md` and `docs/setup/configuration.md` need updates explaining model change behavior and `vectorDimension` configuration.
