Upgrade embedder + reranker to 8K context, raise chunk size#5

Merged
lupuletic merged 3 commits into main from upgrade/embedder-context
May 20, 2026

@lupuletic
Owner

Summary

The bi-encoder and cross-encoder both predate 2024, and the chunker was tuned to the old 512-token ceiling. Raising the context window everywhere, chunk size included, is a bigger quality lever than swapping either model for a same-class peer that scores about 0.3 MTEB points higher.

Changes

  • Embedder: `BAAI/bge-small-en-v1.5` (Sep 2023, 384d, 512 tok) → `nomic-ai/nomic-embed-text-v1.5-Q` (130MB, 768d, 8K tokens).
  • Reranker: `Xenova/ms-marco-MiniLM-L-6-v2` (2021) → `jinaai/jina-reranker-v1-tiny-en` (130MB, 8K tokens). Chose tiny over `bge-reranker-base` because base is 1GB — too heavy for a local-first tool.
  • Chunks: `MAX_CHUNK_CHARS` 2000 → 8000. Previously a 5-turn conversation was truncated mid-stream to fit bge-small's 512-token limit; now an entire window fits in one chunk.
  • Asymmetric prefixes: `embed()` now uses `passage_embed()`, `embed_single()` uses `query_embed()`. The new model differentiates passages from queries (nomic prefixes), so this matters for retrieval quality.
  • Dim binding: `chunks_vec` dim is now sourced from `Embedder.DIM` rather than a hardcoded `float[384]`. Existing local indexes with the wrong dim get their `chunks_vec` dropped and rebuilt automatically — no manual reindex required.
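
The asymmetric-prefix item can be illustrated without loading a model. The sketch below assumes the task prefixes documented for nomic-embed-text-v1.5 (`search_document: ` for indexed text, `search_query: ` for search strings). The helper names `as_passage` and `as_query` are hypothetical and not taken from this PR, and whether the embedding library applies the prefixes itself inside `passage_embed()` / `query_embed()` is not stated here.

```python
# Task prefixes as documented for nomic-embed-text-v1.5 (assumption: the
# quantized variant used in this PR follows the same convention).
PASSAGE_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def as_passage(text: str) -> str:
    """Mark text as an indexed document (hypothetical helper)."""
    return PASSAGE_PREFIX + text


def as_query(text: str) -> str:
    """Mark text as a search query (hypothetical helper)."""
    return QUERY_PREFIX + text


# Chunks take the passage prefix at index time;
# the user's search string takes the query prefix at search time.
indexed = [as_passage(c) for c in ["fixed the slow query by adding an index"]]
query = as_query("that slow query I fixed")
```

Embedding both sides with the same prefix (or none) would still produce vectors, so a mistake here shows up as weaker retrieval rather than as an error.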

Subtle bug fixed

nomic returns `float64` while sqlite-vec stores `float[N]` as `float32`. Without a cast, sqlite-vec reads the 6144-byte buffer (768 values × 8 bytes) as 1536 four-byte floats and throws "Dimension mismatch". The fix casts to `float32` at the Embedder boundary.
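
The byte arithmetic behind that error can be reproduced with the standard library alone. This is an illustrative sketch, not the PR's code: `pack_f64` and `pack_f32` are hypothetical helpers standing in for whatever serializes the vector before it reaches sqlite-vec.

```python
import struct

DIM = 768  # embedding width of the new model, per the PR description


def pack_f64(values):
    """Serialize as little-endian float64 (8 bytes per value)."""
    return struct.pack(f"<{len(values)}d", *values)


def pack_f32(values):
    """Serialize as little-endian float32 (4 bytes per value)."""
    return struct.pack(f"<{len(values)}f", *values)


vector = [0.5] * DIM
wide = pack_f64(vector)    # what an uncast float64 embedding looks like
narrow = pack_f32(vector)  # what a float[768] column expects

# A reader that assumes 4-byte floats sees twice as many values.
misread_dim = len(wide) // 4
correct_dim = len(narrow) // 4
```

Here `len(wide)` is 6144 and `misread_dim` is 1536, matching the numbers in the error. With NumPy arrays the equivalent cast is `vec.astype(np.float32)`.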

Validation

  • 331 tests pass (the existing suite already covers the indexer, vec writes, and search pipeline end-to-end).
  • Smoke eval on 7 synthetic vague-memory queries: 6/7, including the two "you remember the shape not the title" examples the README highlights (`"the stripe thing"` → correct session, `"that slow query I fixed"` → correct session).

Migration impact for existing users

  • On next `code-recall` run, the dim-mismatch check drops `chunks_vec` and the indexer re-embeds all existing chunks with the new model. One-time cost (~minutes depending on session count).
  • Existing chunks stay at the old 2000-char size — only newly-indexed sessions benefit from the 8K chunk size. Over time the index migrates organically. Users who want immediate uplift can `code-recall index --force` to rechunk + re-embed everything.
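
One way such a dim-mismatch check could work is to read the declared width back out of the table's stored `CREATE` statement. The sketch below is an assumption about the approach, not the code in `db.py`: `parse_dim`, `declared_dim`, and `needs_rebuild` are hypothetical names, and it relies only on SQLite keeping the original DDL text in `sqlite_master`.

```python
import re
import sqlite3
from typing import Optional

_DIM_RE = re.compile(r"float\[(\d+)\]")


def parse_dim(ddl: str) -> Optional[int]:
    """Extract N from a 'float[N]' column declaration, if present."""
    match = _DIM_RE.search(ddl)
    return int(match.group(1)) if match else None


def declared_dim(conn: sqlite3.Connection, table: str = "chunks_vec") -> Optional[int]:
    """Return the vector width the table was created with, or None if absent."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table,)
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return parse_dim(row[0])


def needs_rebuild(conn: sqlite3.Connection, expected_dim: int) -> bool:
    """True only when the table exists and its width differs from expected_dim."""
    dim = declared_dim(conn)
    return dim is not None and dim != expected_dim


old_ddl = "CREATE VIRTUAL TABLE chunks_vec USING vec0(embedding float[384])"
fresh = sqlite3.connect(":memory:")  # no chunks_vec yet, so nothing to drop
```

With the old DDL above, `parse_dim(old_ddl)` yields 384, which differs from a 768-wide embedder and would trigger the drop-and-rebuild path. A fresh database has no table to compare, so `needs_rebuild` returns `False` and the table is simply created at the current width.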

Size impact

  • Old: bge-small ~67MB + ms-marco-MiniLM ~80MB ≈ 150MB
  • New: nomic-Q ~130MB + jina-tiny ~130MB ≈ 260MB

Both are still covered by the existing `pip install 'code-recall[all]'` extra. No new dependencies.

Test plan

  • `uv run pytest -q` → 331 passing
  • On a populated index: `code-recall index` triggers dim-mismatch migration cleanly
  • Spot-check a few real "vague memory" queries against personal history

Two stale models with a context-window mismatch were the real ceiling on
recall, not the +0.3 MTEB delta between same-class peers:

- Bi-encoder: BAAI/bge-small-en-v1.5 (Sep 2023, 384d, 512 tok limit)
  -> nomic-ai/nomic-embed-text-v1.5-Q (130MB, 768d, 8K tokens)
- Cross-encoder: Xenova/ms-marco-MiniLM-L-6-v2 (2021, ~512 tok)
  -> jinaai/jina-reranker-v1-tiny-en (130MB, 8K tokens)
- MAX_CHUNK_CHARS: 2000 -> 8000 (chunks were truncating mid-conversation
  at the old 512-token ceiling; an entire 5-turn window now fits).

The new embedder uses asymmetric prefixes, so embed()/embed_single() now
go through passage_embed()/query_embed() rather than the bare embed().

Dim is sourced from Embedder.DIM and chunks_vec auto-migrates: if an
existing table has the wrong dim, it's dropped and the indexer
re-embeds all chunks on the next run.

Subtle bug fixed along the way: nomic returns float64 while sqlite-vec
expects float32, so raw embeddings tripped "Dimension mismatch
(received 1536)" — cast to float32 at the Embedder boundary.

331 tests pass. Smoke eval on 7 synthetic vague-memory queries: 6/7,
including the two queries the README highlights.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 058622e2da

Comment thread src/code_recall/db.py
lupuletic added 2 commits May 19, 2026 23:48
Two follow-on UX fixes surfaced when testing on a real 1695-session
index:

1. TUI crashed with MarkupError on session data containing unbalanced
   square-bracket sequences (e.g. a prompt with literal "[/dim]" in it).
   Even with escape() applied at every visible interpolation site, a
   missed call anywhere in the render path takes down the whole TUI.
   Added _safe_markup() — validates Rich markup with Text.from_markup,
   falls back to fully-escaped on MarkupError so the worst case is
   "tags render as literal text" rather than "TUI dies". Wrapped the
   three composite render sites that mix user data with markup.

2. _generate_embeddings was one giant batched embedder.embed(texts)
   call followed by a per-50-chunks counter inside the insert loop.
   The actual embedding compute (where time is spent) was silent.
   Now embeds in batches of 64 with a tqdm progress bar showing
   count / rate / ETA, so users see steady feedback instead of a
   frozen "Generating embeddings for N chunks..." line. Falls back to
   the old line-print when stderr isn't a tty.

331 tests still pass.
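
The batching change in item 2 can be sketched as follows. This is a minimal illustration, not the body of `_generate_embeddings`: `embed_in_batches`, `batched`, and the `on_progress` callback are hypothetical names, and a plain callback stands in for the tqdm bar so the example needs only the standard library.

```python
from typing import Callable, Iterator, List, Optional, Sequence

Vector = List[float]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 64,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> List[Vector]:
    """Embed texts batch by batch, reporting (done, total) after each batch."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
        if on_progress is not None:
            on_progress(len(vectors), len(texts))
    return vectors


# Usage with a stand-in embedder that returns one number per text.
progress = []
result = embed_in_batches(
    [f"chunk {i}" for i in range(130)],
    embed_fn=lambda batch: [[float(len(t))] for t in batch],
    on_progress=lambda done, total: progress.append((done, total)),
)
```

Because progress is reported after each embedding call rather than inside the insert loop, feedback tracks the step where the time is actually spent.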
@lupuletic lupuletic merged commit 1635e6d into main May 20, 2026
10 checks passed
@lupuletic lupuletic deleted the upgrade/embedder-context branch May 20, 2026 13:25