Upgrade embedder + reranker to 8K context, raise chunk size by lupuletic · Pull Request #5 · lupuletic/code-recall

lupuletic · 2026-05-19T21:20:02Z

Summary

The bi-encoder and cross-encoder both predate 2024, and the chunker was tuned to the old 512-token ceiling. Raising the context window everywhere, including chunk size, is a bigger quality lever than swapping either model for a same-class peer that scores +0.3 higher on MTEB.

Changes

Embedder: `BAAI/bge-small-en-v1.5` (Sep 2023, 384d, 512 tok) → `nomic-ai/nomic-embed-text-v1.5-Q` (130MB, 768d, 8K tokens).
Reranker: `Xenova/ms-marco-MiniLM-L-6-v2` (2021) → `jinaai/jina-reranker-v1-tiny-en` (130MB, 8K tokens). Chose tiny over `bge-reranker-base` because base is 1GB — too heavy for a local-first tool.
Chunks: `MAX_CHUNK_CHARS` 2000 → 8000. Previously a 5-turn conversation was truncated mid-stream to fit bge-small's 512-token limit; now an entire 5-turn window fits in one chunk.
Asymmetric prefixes: `embed()` now uses `passage_embed()`, `embed_single()` uses `query_embed()`. The new model differentiates passages from queries (nomic prefixes), so this matters for retrieval quality.
Dim binding: `chunks_vec` dim is now sourced from `Embedder.DIM` rather than a hardcoded `float[384]`. Existing local indexes with the wrong dim get their `chunks_vec` dropped and rebuilt automatically — no manual reindex required.
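The asymmetric-prefix item above can be illustrated with a minimal sketch. It assumes the `search_document: ` and `search_query: ` task prefixes described on the nomic-embed-text-v1.5 model card. The helper names `as_passages` and `as_query` are hypothetical and are not code-recall's API, and this PR does not state whether `passage_embed()` / `query_embed()` apply such prefixes internally.

```python
# Assumed task prefixes, taken from the nomic-embed-text-v1.5 model card.
PASSAGE_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def as_passages(texts: list) -> list:
    """Prefix texts that will be stored in the index (hypothetical helper)."""
    return [PASSAGE_PREFIX + t for t in texts]


def as_query(text: str) -> str:
    """Prefix a single search query (hypothetical helper)."""
    return QUERY_PREFIX + text


if __name__ == "__main__":
    print(as_passages(["fixed the slow Postgres query"]))
    print(as_query("that slow query I fixed"))
```

The point of the split is that indexed chunks and search queries are encoded differently, so mixing the two paths would likely hurt retrieval quality.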

Subtle bug fixed

nomic returns `float64` while sqlite-vec stores `float[N]` as `float32`. Without a cast, sqlite-vec reads the 6144-byte buffer (768 values × 8 bytes) as 1536 four-byte floats and throws "Dimension mismatch". The fix is a cast to `float32` at the Embedder boundary.
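The byte arithmetic behind this bug can be checked with a small standard-library sketch. It uses `struct` rather than the real embedder output, and the function names are hypothetical. In the actual code the vectors are presumably NumPy arrays, where the equivalent fix would be an `astype(np.float32)` call.

```python
import struct

DIM = 768  # embedding width of the new model, per the PR description


def pack_float64(vec):
    """Serialize values as 8-byte little-endian doubles."""
    return struct.pack(f"<{len(vec)}d", *vec)


def pack_float32(vec):
    """Serialize values as 4-byte little-endian floats."""
    return struct.pack(f"<{len(vec)}f", *vec)


def float32_slots(buf: bytes) -> int:
    """Number of values a reader sees if it assumes 4 bytes per float."""
    return len(buf) // 4


if __name__ == "__main__":
    vec = [0.0] * DIM
    wide = pack_float64(vec)
    narrow = pack_float32(vec)
    print(len(wide), float32_slots(wide))      # 6144 1536
    print(len(narrow), float32_slots(narrow))  # 3072 768
```

A reader that expects `float32` sees twice the declared dimension in the uncast buffer, which matches the 1536 figure in the error.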

Validation

331 tests pass (the existing suite already covers the indexer, vec writes, and search pipeline end-to-end).
Smoke eval on 7 synthetic vague-memory queries: 6/7, including the two "you remember the shape not the title" examples the README highlights (`"the stripe thing"` → correct session, `"that slow query I fixed"` → correct session).

Migration impact for existing users

On next `code-recall` run, the dim-mismatch check drops `chunks_vec` and the indexer re-embeds all existing chunks with the new model. One-time cost (~minutes depending on session count).
Existing chunks stay at the old 2000-char size; only newly indexed sessions benefit from the 8000-char chunk size. Over time the index migrates organically. Users who want immediate uplift can run `code-recall index --force` to rechunk and re-embed everything.
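The dim-mismatch check described above might look roughly like the following sketch. It is not the PR's implementation: it assumes the table was created with a `vec0` column declared as `float[N]`, that the declaration can be read back from `sqlite_master`, and that the column is named `embedding`. All function names are hypothetical.

```python
import re
import sqlite3
from typing import Optional

# Matches the dimension in a column declared like `embedding float[384]`.
_DIM_RE = re.compile(r"float\[(\d+)\]", re.IGNORECASE)


def parse_dim(create_sql: Optional[str]) -> Optional[int]:
    """Extract N from a `float[N]` declaration, or None if absent."""
    match = _DIM_RE.search(create_sql or "")
    return int(match.group(1)) if match else None


def stored_dim(conn: sqlite3.Connection, table: str = "chunks_vec") -> Optional[int]:
    """Read the declared dimension of an existing table, if it exists."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table,)
    ).fetchone()
    return parse_dim(row[0]) if row else None


def needs_rebuild(conn: sqlite3.Connection, expected_dim: int,
                  table: str = "chunks_vec") -> bool:
    """True when the table exists but was built for a different dimension."""
    dim = stored_dim(conn, table)
    return dim is not None and dim != expected_dim
```

A caller would pass `Embedder.DIM` as `expected_dim`, then drop and recreate the table when `needs_rebuild` returns true. A missing table is treated as a fresh index rather than a mismatch.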

Size impact

Old: bge-small ~67MB + ms-marco-MiniLM ~80MB ≈ 150MB
New: nomic-Q ~130MB + jina-tiny ~130MB ≈ 260MB

Both still under `pip install 'code-recall[all]'`. No new dependencies.

Test plan

`uv run pytest -q` → 331 passing
On a populated index: `code-recall index` triggers dim-mismatch migration cleanly
Spot-check a few real "vague memory" queries against personal history

Two stale models with a context-window mismatch were the real ceiling on recall, not the +0.3 MTEB delta between same-class peers:

- Bi-encoder: BAAI/bge-small-en-v1.5 (Sep 2023, 384d, 512 tok limit) -> nomic-ai/nomic-embed-text-v1.5-Q (130MB, 768d, 8K tokens)
- Cross-encoder: Xenova/ms-marco-MiniLM-L-6-v2 (2021, ~512 tok) -> jinaai/jina-reranker-v1-tiny-en (130MB, 8K tokens)
- MAX_CHUNK_CHARS: 2000 -> 8000 (chunks were truncating mid-conversation at the old 512-token ceiling; an entire 5-turn window now fits).

The new embedder uses asymmetric prefixes, so embed()/embed_single() now go through passage_embed()/query_embed() rather than the bare embed().

Dim is sourced from Embedder.DIM and chunks_vec auto-migrates: if an existing table has the wrong dim, it's dropped and the indexer re-embeds all chunks on the next run.

Subtle bug fixed along the way: nomic returns float64 while sqlite-vec expects float32, so raw embeddings tripped "Dimension mismatch (received 1536)". Cast to float32 at the Embedder boundary.

331 tests pass. Smoke eval on 7 synthetic vague-memory queries: 6/7, including the two queries the README highlights.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 058622e2da


Two follow-on UX fixes surfaced when testing on a real 1695-session index:

1. TUI crashed with MarkupError on session data containing unbalanced square-bracket sequences (e.g. a prompt with literal "[/dim]" in it). Even with escape() applied at every visible interpolation site, a missed call anywhere in the render path takes down the whole TUI. Added _safe_markup(), which validates Rich markup with Text.from_markup and falls back to fully-escaped text on MarkupError, so the worst case is "tags render as literal text" rather than "TUI dies". Wrapped the three composite render sites that mix user data with markup.
2. _generate_embeddings was one giant batched embedder.embed(texts) call followed by a per-50-chunks counter inside the insert loop. The actual embedding compute (where time is spent) was silent. Now embeds in batches of 64 with a tqdm progress bar showing count / rate / ETA, so users see steady feedback instead of a frozen "Generating embeddings for N chunks..." line. Falls back to the old line-print when stderr isn't a tty.

331 tests still pass.
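The batching change in the second fix can be sketched as follows. This is an illustration under assumptions, not the code in `_generate_embeddings`: the names `batches`, `embed_in_batches`, and `line_report` are hypothetical, `embed_batch` stands in for the real embedder call, and a plain callback replaces the tqdm progress bar so the example stays standard-library only.

```python
import sys
from typing import Callable, Iterator, List, Optional, Sequence

BATCH_SIZE = 64  # batch size named in the commit message


def batches(items: Sequence, size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def line_report(done: int, total: int) -> None:
    """Plain line-print progress, e.g. for when stderr is not a tty."""
    print(f"Embedded {done}/{total} chunks", file=sys.stderr)


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    report: Optional[Callable[[int, int], None]] = None,
) -> List[List[float]]:
    """Embed texts batch by batch, reporting progress after each batch."""
    vectors: List[List[float]] = []
    for batch in batches(texts):
        vectors.extend(embed_batch(batch))
        if report is not None:
            report(len(vectors), len(texts))
    return vectors
```

Reporting after each batch, rather than after each insert, ties the feedback to the embedding compute, which is where the commit message says the time is spent.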

chatgpt-codex-connector Bot reviewed May 19, 2026

Comment thread src/code_recall/db.py

lupuletic added 2 commits May 19, 2026 23:48

Merge main and guard vector dimension upgrades

5811110

lupuletic merged commit 1635e6d into main May 20, 2026
10 checks passed

lupuletic deleted the upgrade/embedder-context branch May 20, 2026 13:25

lupuletic merged 3 commits into main from upgrade/embedder-context
