
Commit 6250f94

devwhodevs and claude authored
v0.6.0: Write pipeline with sqlite-vec migration (#7)
* feat: add vecstore module with sqlite-vec integration

  Wrap sqlite-vec for vector search, replacing HNSW-based approach. Provides init, insert, delete, search (with tombstone filtering), and clear operations on a vec0 virtual table. Includes 5 unit tests.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: integrate sqlite-vec into Store with transaction helpers

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: replace HNSW semantic lane with sqlite-vec in search

  All search code paths now use store.search_vec() instead of HnswIndex::search(). The hnsw module remains but is unused — deletion is deferred to Task 5.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: replace HNSW rebuild with sqlite-vec inserts in indexer

  - Remove HnswIndex import and HNSW rebuild steps (11-12)
  - Insert vectors into vec0 table during chunk write loop
  - Delete from vec0 when files are deleted or changed
  - Clear vec0 on full rebuild
  - Use store.next_vector_id() instead of scanning all vectors
  - Add folder centroid computation and storage after indexing
  - Add folder_centroids table migration and upsert/get methods in Store

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: remove hnsw_rs dependency, delete hnsw.rs — vectors now in sqlite-vec

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: auto-migrate existing BLOB vectors to sqlite-vec on startup

  Adds `migrate_vectors_to_vec0()` which copies BLOB vectors from `chunks.vector` into the `chunks_vec` vec0 virtual table. Called from `init()` after `init_vec_table()` so the virtual table is guaranteed to exist. No-ops when vec0 is already populated or no BLOBs are present.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add tag registry with fuzzy resolution

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add link discovery module for auto-wikilinks

  Scans note content for potential wikilink targets using exact filename and alias matching. Supports case-insensitive search, word boundary checking, existing wikilink skipping, and longest-match-first priority.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add folder placement module with type rules and semantic centroids

  Three-strategy cascade: type-based rules (person/daily/workout + content pattern detection) → semantic centroid matching against precomputed folder embeddings → inbox fallback. 12 tests covering all strategies.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add write pipeline orchestrator with create, append, update, and move

  Implements the writer module that ties together content analysis, tag resolution, link discovery, folder placement, and atomic write+index.

  - CreateNoteInput: 5-step pipeline (filename, tags, links, placement, write)
  - AppendInput: append content with mtime conflict detection
  - UpdateMetadataInput: frontmatter-only updates without re-chunking
  - move_note: relocate files with store record updates
  - All writes use temp+rename for atomicity with transaction rollback
  - Pre-computes embeddings before holding DB lock
  - Adds Store::resolve_file() for path/basename/#docid resolution
  - Adds time crate for date formatting

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve clippy warnings in writer, links, and placement modules

* feat: add create, append, update_metadata, and move_note MCP write tools

  Extends the MCP server with 4 write tools that expose the writer module pipeline to Claude Code clients, completing the read-write tool surface.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add write CLI subcommands (create, append)

  Adds `engraph write create` and `engraph write append` subcommands backed by the writer module pipeline. Both support --content flag or stdin for content input, with --json output mode.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add crash recovery — cleanup orphan .tmp files on startup

  Scans the vault for leftover `.md.tmp` files on both `engraph index` and `engraph serve` startup, removing any that survived a previous crash mid-write. Logs the count if any are removed.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: add write pipeline integration tests

  Three #[ignore] tests covering create_note searchability, append index update, and mtime conflict detection. Run with: cargo test --test write_pipeline -- --ignored

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: v0.6.0 — write pipeline, sqlite-vec migration, tombstone removal

  Remove redundant tombstone writes from indexer (delete_vec handles it). Replace tombstone loading in search with empty set. Fix clippy warning in writer.rs. Apply cargo fmt across all modules. Bump version to 0.6.0. Update CLAUDE.md with 19 modules, 190 tests, write pipeline docs, and sqlite-vec architecture.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update stored mtime after rename to prevent false conflict detection

* feat: add archive/unarchive for soft-delete with index exclusion

  - archive: moves note to 04-Archive/, adds archived frontmatter, removes from index
  - unarchive: restores to original location (via archived_from), re-indexes
  - indexer auto-excludes archive folder during walks
  - MCP tools: archive, unarchive (13 total tools now)
  - CLI: engraph write archive/unarchive

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: apply cargo fmt to archive/unarchive code

* feat: complete v0.6 spec coverage — content analysis, suggested_folder, incremental centroids, orphan cleanup, tag queries

  - Gap 1: Add suggestion field to PlacementResult; add ticket ID detection (BRE-XXXX/DRIFT-XXX), meeting note detection, decision type_hint
  - Gap 2: Inject suggested_folder frontmatter when semantic placement finds a below-threshold match during inbox fallback
  - Gap 3: Incrementally update folder centroids after each note creation (weighted merge with existing centroid)
  - Gap 4: Add verify_index_integrity() to clean orphan DB entries for files that no longer exist on disk; called on index and serve startup
  - Gap 5: Add agent_created_tags(), low_usage_tags(), stale_tags() queries to store for tag hygiene tooling

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0be5225 commit 6250f94

18 files changed

Lines changed: 3503 additions & 637 deletions

CLAUDE.md

Lines changed: 24 additions & 14 deletions
@@ -4,7 +4,7 @@ Local hybrid search CLI for Obsidian vaults. Rust, MIT licensed.
44

55
## Architecture
66

7-
Single binary with 14 modules behind a lib crate:
7+
Single binary with 19 modules behind a lib crate:
88

99
- `config.rs` — loads `~/.engraph/config.toml` and `vault.toml`, merges CLI args, provides `data_dir()`
1010
- `chunker.rs` — smart chunking with break-point scoring algorithm. Finds optimal split points considering headings, code fences, blank lines, and thematic breaks. `split_oversized_chunks()` handles token-aware secondary splitting with overlap
@@ -14,46 +14,56 @@ Single binary with 14 modules behind a lib crate:
1414
- `fts.rs` — FTS5 full-text search support. Re-exports `FtsResult` from store. BM25-ranked keyword search
1515
- `fusion.rs` — Reciprocal Rank Fusion (RRF) engine. Merges semantic + FTS5 + graph results. Supports lane weighting, `--explain` output with per-lane detail
1616
- `context.rs` — context engine. Six functions: `read` (full note content + metadata), `list` (filtered note listing), `vault_map` (structure overview), `who` (person context bundle), `project` (project context bundle), `context_topic` (rich topic context with budget trimming). Pure functions taking `ContextParams` — no model loading except `context_topic` which reuses `search_internal`
17-
- `serve.rs` — MCP stdio server via rmcp SDK. Exposes 7 read-only tools (search, read, list, vault_map, who, project, context). EngraphServer struct with Arc+Mutex wrapping for async handlers. Loads all resources at startup.
17+
- `vecstore.rs` — sqlite-vec virtual table integration. Manages the `vec_chunks` vec0 table for vector storage and KNN search. Handles insert, delete, and search operations against the virtual table
18+
- `tags.rs` — tag registry module. Maintains a `tag_registry` table tracking known tags with source attribution. Supports fuzzy matching for tag suggestions during note creation
19+
- `links.rs` — link discovery module. Scans note content for potential wikilink targets using fuzzy basename matching and heading detection. Suggests links that could be added to improve vault connectivity
20+
- `placement.rs` — folder placement engine. Uses folder centroids (average embeddings per folder) to suggest the best folder for new notes. Falls back to inbox when confidence is low
21+
- `writer.rs` — write pipeline orchestrator. 5-step pipeline: resolve tags (fuzzy match + register new), discover links, place in folder, atomic file write (temp + rename), and index update. Supports create, append, update_metadata, and move_note operations with mtime-based conflict detection and crash recovery via temp file cleanup
22+
- `serve.rs` — MCP stdio server via rmcp SDK. Exposes 11 tools: 7 read (search, read, list, vault_map, who, project, context) + 4 write (create, append, update_metadata, move_note). EngraphServer struct with Arc+Mutex wrapping for async handlers. Loads all resources at startup
1823
- `graph.rs` — vault graph agent. Extracts wikilink targets, expands search results by following graph connections 1-2 hops. Relevance filtering via FTS5 term check and shared tags
1924
- `profile.rs` — vault profile detection. Auto-detects PARA/Folders/Flat structure, vault type (Obsidian/Logseq/Plain), wikilinks, frontmatter, tags. Writes/loads `vault.toml`
20-
- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5), `edges` (vault graph), `tombstones`. Handles incremental diffing via content hashes
21-
- `hnsw.rs` — thin wrapper around `hnsw_rs`. **Important:** `hnsw_rs` does not support inserting after `load_hnsw()`. The index is rebuilt from vectors stored in SQLite on every index run
22-
- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), serial writes to store + HNSW + FTS5, and vault graph edge building (wikilinks + people detection)
25+
- `store.rs` — SQLite persistence. Tables: `meta`, `files` (with docid), `chunks` (with vector BLOBs), `chunks_fts` (FTS5), `edges` (vault graph), `tombstones`, `tag_registry`, `folder_centroids`. `vec_chunks` virtual table (sqlite-vec) for KNN search. Handles incremental diffing via content hashes
26+
- `indexer.rs` — orchestrates vault walking (via `ignore` crate for `.gitignore` support), diffing, chunking, embedding (Rayon for parallel chunking, serial embedding since `Embedder` is not `Send`), serial writes to store + sqlite-vec + FTS5, vault graph edge building (wikilinks + people detection), and folder centroid computation
27+
- `search.rs` — hybrid search orchestrator. Runs semantic (sqlite-vec KNN), keyword (FTS5 BM25), and graph expansion lanes, then fuses via RRF
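
The `placement.rs` entry above describes folder centroids (average embeddings per folder) with an inbox fallback when confidence is low. A minimal sketch of that one step, using only the standard library, could look like this. The function names, the threshold value, and the folder names are hypothetical and are not taken from engraph's source; the type-based rules mentioned in the commit message are left out.

```rust
/// Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Average embedding of a folder's notes; `None` when the folder has no vectors.
fn centroid(vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let first = vectors.first()?;
    let mut sum = vec![0.0f32; first.len()];
    for v in vectors {
        for (s, x) in sum.iter_mut().zip(v) {
            *s += x;
        }
    }
    let n = vectors.len() as f32;
    Some(sum.into_iter().map(|s| s / n).collect())
}

/// Pick the folder whose centroid is most similar to `note`.
/// Falls back to `inbox` when the best similarity is below `threshold`
/// (the threshold is an assumed tuning parameter).
fn place<'a>(
    note: &[f32],
    folders: &'a [(&'a str, Vec<f32>)],
    threshold: f32,
    inbox: &'a str,
) -> &'a str {
    folders
        .iter()
        .map(|(name, c)| (*name, cosine(note, c)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .filter(|(_, sim)| *sim >= threshold)
        .map_or(inbox, |(name, _)| name)
}
```

Calling `place(&embedding, &folders, 0.5, "Inbox")` would return the best-matching folder name, or `"Inbox"` when no centroid clears the assumed 0.5 threshold.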
2328

24-
`main.rs` is a thin clap CLI (async via `#[tokio::main]`). Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`, `graph` (show/stats), `context` (read/list/vault-map/who/project/topic), `serve` (MCP stdio server).
29+
`main.rs` is a thin clap CLI (async via `#[tokio::main]`). Subcommands: `index`, `search` (with `--explain`), `status`, `clear`, `init`, `configure`, `models`, `graph` (show/stats), `context` (read/list/vault-map/who/project/topic), `write` (create/append/update-metadata/move), `serve` (MCP stdio server).
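
The `write` subcommands are backed by the writer module, which the module list describes as doing an atomic file write (temp + rename) with crash recovery via temp file cleanup. A minimal standard-library sketch of that pattern might look like the following. The names `atomic_write`, `tmp_path`, and `cleanup_orphan_tmp` are hypothetical and are not engraph's actual API; per the commit message the real cleanup scans the whole vault, while this sketch only handles a single directory.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// `note.md` -> `note.md.tmp` (hypothetical naming helper).
fn tmp_path(target: &Path) -> PathBuf {
    let mut name = target.as_os_str().to_owned();
    name.push(".tmp");
    PathBuf::from(name)
}

/// Write `content` to a sibling `.tmp` file, flush it to disk,
/// then rename it over the final path.
fn atomic_write(target: &Path, content: &str) -> io::Result<()> {
    let tmp = tmp_path(target);
    let mut file = File::create(&tmp)?;
    file.write_all(content.as_bytes())?;
    file.sync_all()?; // make sure the bytes reach disk before the rename
    // On POSIX systems a rename within one filesystem replaces the target atomically.
    fs::rename(&tmp, target)
}

/// Remove leftover `.md.tmp` files in one directory (non-recursive).
/// Returns how many files were removed.
fn cleanup_orphan_tmp(dir: &Path) -> io::Result<usize> {
    let mut removed = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_orphan = path
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |n| n.ends_with(".md.tmp"));
        if is_orphan && path.is_file() {
            fs::remove_file(&path)?;
            removed += 1;
        }
    }
    Ok(removed)
}
```

Because readers only ever see the old file or the fully written new one, a crash mid-write leaves at most a stray `.md.tmp` file, which the cleanup pass can delete on the next startup.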
2530

2631
## Key patterns
2732

28-
- **3-lane hybrid search:** Queries run through three lanes — semantic (HNSW embeddings), keyword (FTS5 BM25), and graph (wikilink expansion). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights (semantic 1.0, FTS 1.0, graph 0.8)
33+
- **3-lane hybrid search:** Queries run through three lanes — semantic (sqlite-vec KNN embeddings), keyword (FTS5 BM25), and graph (wikilink expansion). Results are fused via Reciprocal Rank Fusion (RRF) with configurable lane weights (semantic 1.0, FTS 1.0, graph 0.8)
2934
- **Vault graph:** `edges` table stores bidirectional wikilink edges and mention edges. Built during indexing after all files are written. People detection scans for person name/alias mentions using notes from the configured People folder
30-
- **Graph agent:** Expands seed results by following wikilinks 1-2 hops. Decay: 0. for 1-hop, 0. for 2-hop. Relevance filter: must contain query term (FTS5) or share tags with seed. Multi-parent merge takes highest score
35+
- **Graph agent:** Expands seed results by following wikilinks 1-2 hops. Decay: 0.8x for 1-hop, 0.5x for 2-hop. Relevance filter: must contain query term (FTS5) or share tags with seed. Multi-parent merge takes highest score
3136
- **Smart chunking:** Break-point scoring algorithm assigns scores to potential split points (headings 50-100, code fences 80, thematic breaks 60, blank lines 20). Code fence protection prevents splitting inside code blocks
32-
- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks and edges deleted, then are re-processed. FTS5 entries cleaned up alongside vector entries
33-
- **HNSW rebuild on every run:** Vectors stored as BLOBs. Full HNSW index rebuilt from `store.get_all_vectors()` after SQLite update (hnsw_rs limitation)
37+
- **Incremental indexing:** `diff_vault()` compares file content hashes in SQLite against disk. Changed files have their old chunks, vectors, and edges deleted, then are re-processed. FTS5 and sqlite-vec entries cleaned up alongside store entries
38+
- **sqlite-vec for vector search:** Vectors stored in a `vec_chunks` virtual table (vec0). KNN search via `vec_distance_cosine()`. Real deletes — no tombstone filtering needed during search
39+
- **Write pipeline:** 5-step process for creating/modifying notes: (1) resolve tags via fuzzy matching against tag registry, (2) discover potential wikilinks via basename matching, (3) suggest folder placement via centroid similarity, (4) atomic file write (temp + rename for crash safety), (5) immediate index update (embed + insert into sqlite-vec + FTS5 + edges)
3440
- **Docids:** Each file gets a deterministic 6-char hex ID. Displayed in search results
3541
- **Vault profiles:** `engraph init` auto-detects vault structure and writes `vault.toml`
3642
- **Pluggable models:** `ModelBackend` trait enables future model swapping
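
The 3-lane hybrid search pattern above fuses lane results via Reciprocal Rank Fusion with lane weights (semantic 1.0, FTS 1.0, graph 0.8). A minimal sketch of weighted RRF follows, assuming the common formula `weight / (k + rank)` summed over lanes. The `Lane` struct, the `rrf_fuse` function, and the constant `k` are illustrative assumptions; the document does not state the value of `k` that engraph uses.

```rust
use std::collections::HashMap;

/// One ranked result list from a single search lane (hypothetical type).
struct Lane<'a> {
    weight: f64,
    ranked_ids: Vec<&'a str>, // best result first
}

/// Weighted Reciprocal Rank Fusion:
/// score(d) = sum over lanes of weight / (k + rank), with rank starting at 1.
fn rrf_fuse(lanes: &[Lane<'_>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for lane in lanes {
        for (i, id) in lane.ranked_ids.iter().enumerate() {
            let rank = i as f64 + 1.0;
            *scores.entry(*id).or_insert(0.0) += lane.weight / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest fused score first; ties broken by id for deterministic output.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A document that appears in several lanes accumulates score from each, so a note ranked moderately by both the semantic and keyword lanes can outrank one that only a single lane ranks first.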
3743

3844
## Data directory
3945

40-
`~/.engraph/` — hardcoded via `Config::data_dir()`. Contains `engraph.db` (SQLite with FTS5 + edges), `hnsw/` (index files), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).
46+
`~/.engraph/` — hardcoded via `Config::data_dir()`. Contains `engraph.db` (SQLite with FTS5 + sqlite-vec + edges), `models/` (ONNX model + tokenizer), `vault.toml` (vault profile), `config.toml` (user config).
4147

4248
Single vault only. Re-indexing a different vault path triggers a confirmation prompt.
4349

4450
## Dependencies to be aware of
4551

4652
- `ort` (2.0.0-rc.12) — ONNX Runtime Rust bindings. Pre-release API. Does not provide prebuilt binaries for all targets
47-
- `hnsw_rs` (0.3) — pure Rust HNSW. `Box::leak` in `load()`. Read-only after load
53+
- `sqlite-vec` (0.1.8-alpha.1) — SQLite extension for vector search. Provides vec0 virtual tables with KNN via `vec_distance_cosine()`
54+
- `zerocopy` (0.7) — zero-copy serialization for vector data passed to sqlite-vec
55+
- `strsim` (0.11) — string similarity for fuzzy tag matching in the write pipeline
56+
- `time` (0.3) — date/time handling for frontmatter timestamps
4857
- `tokenizers` (0.22) — HuggingFace tokenizer. Needs `fancy-regex` feature
4958
- `ignore` (0.4) — vault walking with `.gitignore` support
5059
- `rusqlite` (0.32) — bundled SQLite with FTS5 support
60+
- `rmcp` (1.2) — MCP server SDK for stdio transport
5161

5262
## Testing
5363

54-
- Unit tests in each module (`cargo test --lib`) — 146 tests, no network required
64+
- Unit tests in each module (`cargo test --lib`) — 190 tests, no network required
5565
- 1 ignored smoke test (`test_embed_smoke`) — downloads ONNX model, verifies embedding
56-
- Integration tests (`cargo test --test integration -- --ignored`) — 8 tests, require model download
66+
- Integration tests (`cargo test --test integration -- --ignored`) — require model download
5767

5868
## CI/CD
5969
