
feat(embedding): authenticated, size-robust API embedder + vector persistence + warm-restart fix#109

Open
avfirsov wants to merge 4 commits into zzet:main from avfirsov:pr/embedder-robustness



@avfirsov

Contributor

Makes the OpenAI-compatible api embedder usable against real hosted backends and large repos, fixes vector persistence under the bulk loader, and fixes a warm-restart re-index bug. Four focused commits; CI-green; tests included.

What & why

  1. feat(embedding) — authenticated + size-robust embedder

    • Send Authorization: Bearer from GORTEX_EMBEDDINGS_API_KEY (falls back to OPENAI_API_KEY only for *openai.com* URLs). The embedder was Ollama-oriented and keyless → OpenAI returned 401.
    • Head-truncate each embedding input to 8000 bytes. OpenAI rejects >8192-token inputs with a 400 that aborts the whole vector index. A byte-level BPE tokenizer never emits more tokens than input bytes, so an 8000-byte head always stays under the limit.
    • GORTEX_EMBEDDINGS_MAX_SYMBOLS env override for the vector-index size cap — embedding.max_symbols config did not reach the indexer via the flag/env embedder path.
  2. feat(embedding,indexer,daemon) — vector persistence + warm-restart prefix fix

    • Vectors never persisted under the bulk loader (bulkVectorSink): during a bulk index idx.graph is the in-memory shadow (no VectorSearcher), so buildSearchIndex skipped BulkUpsertEmbeddings — the sqlite vectors table stayed empty and a restart had no vectors to restore. Fix: capture the disk store at the shadow swap and persist there.
    • Warm-restart prefix bug: single-repo daemons persist file_mtimes under prefix "" but priorMtimesFromStore looked them up under the path basename → 0 rows → every restart did a full cold re-index (and a paid re-embed). Fix: single-repo lookup under "".
  3. feat(daemon) — expose --embeddings-url / --embeddings-model on daemon start.

  4. fix(embedding) — probe API embedder dims at startup + tolerate a /v1 base URL.
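
A minimal sketch of the key-resolution rule from item 1, not the code in this PR. `resolveAPIKey` and `authHeader` are hypothetical names, the environment lookup is injected so the logic can be tested without touching the process environment, and the host check (`openai.com` or a subdomain) is one assumed reading of "*openai.com* URLs".

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveAPIKey (hypothetical) prefers GORTEX_EMBEDDINGS_API_KEY and falls
// back to OPENAI_API_KEY only when the endpoint host is openai.com or one of
// its subdomains. The host-based match is an assumption of this sketch.
func resolveAPIKey(baseURL string, getenv func(string) string) string {
	if k := getenv("GORTEX_EMBEDDINGS_API_KEY"); k != "" {
		return k
	}
	u, err := url.Parse(baseURL)
	if err != nil {
		return ""
	}
	host := strings.ToLower(u.Hostname())
	if host == "openai.com" || strings.HasSuffix(host, ".openai.com") {
		return getenv("OPENAI_API_KEY")
	}
	return ""
}

// authHeader returns the Authorization header value, or "" for keyless
// backends (for example a local Ollama), in which case no header is sent.
func authHeader(key string) string {
	if key == "" {
		return ""
	}
	return "Bearer " + key
}

func main() {
	env := map[string]string{"OPENAI_API_KEY": "sk-example"}
	getenv := func(k string) string { return env[k] }

	fmt.Println(authHeader(resolveAPIKey("https://api.openai.com/v1", getenv)))
	fmt.Println(authHeader(resolveAPIKey("http://localhost:11434", getenv)) == "")
}
```

Restricting the fallback to OpenAI hosts keeps an ambient `OPENAI_API_KEY` from being sent to an unrelated endpoint.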

Tests

internal/embedding/api_test.go (auth header, no-key case, truncation, dims probe / /v1), internal/indexer/vector_persist_test.go, cmd/gortex/daemon_state_test.go. go build ./... clean; embedding / indexer / serverstack / cmd/gortex test packages pass.

These were developed in a downstream fork and split out here as a self-contained, generic bundle. Happy to split further or adjust per your preference.

🤖 Generated with Claude Code

avfirsov and others added 4 commits June 18, 2026 10:32
Make the OpenAI-compatible api embedder usable for real hosted backends and
large repos.

- api.go: send Authorization: Bearer from GORTEX_EMBEDDINGS_API_KEY (falling
  back to OPENAI_API_KEY only for *openai.com* URLs). The embedder was
  Ollama-oriented and keyless → OpenAI returned 401.
- api.go: head-truncate each embedding input to 8000 bytes. OpenAI rejects
  >8192-token inputs with a 400 that aborts the WHOLE vector index; tokens ≤
  bytes for ASCII source, so an 8000-byte head is provably safe.
- indexer.go: GORTEX_EMBEDDINGS_MAX_SYMBOLS env override for the vector-index
  size cap — embedding.max_symbols config did not reach the indexer via the
  flag/env embedder path.
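
The head-truncation rule above can be sketched as follows. This is an illustration under stated assumptions, not the code in api.go: `truncateHead` and `maxInputBytes` are hypothetical names, and backing the cut up to a UTF-8 rune boundary is an addition of this sketch that the commit message does not mention.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxInputBytes mirrors the 8000-byte cap named in the commit message.
const maxInputBytes = 8000

// truncateHead keeps at most limit bytes from the start of s. If the cut
// would land inside a multi-byte UTF-8 sequence, it backs up to the start of
// that rune so the result stays valid UTF-8 (an assumption of this sketch).
func truncateHead(s string, limit int) string {
	if len(s) <= limit {
		return s
	}
	cut := limit
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	long := strings.Repeat("x", 10000)
	fmt.Println(len(truncateHead(long, maxInputBytes))) // 8000

	// "é" occupies bytes 1-2, so a 2-byte cap backs up to keep only "h".
	fmt.Println(truncateHead("héllo", 2)) // h
}
```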

Tests in api_test.go cover the auth header, the no-key case, and truncation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit e31b66e)
…tence, warm-restart prefix fix

Four robustness fixes surfaced while indexing a large external repo (Apache
Drools, ~10.5k files) with an OpenAI embedder on a memory-constrained host,
plus a running operational playbook (AGENTS.md).

internal/embedding (api.go, api_test.go):
- Send `Authorization: Bearer` from GORTEX_EMBEDDINGS_API_KEY (fallback to
  OPENAI_API_KEY only for *.openai.com) — the api-embedder was Ollama-oriented
  and keyless, so OpenAI returned 401.
- Head-truncate each input to 8000 bytes: OpenAI rejects >8192-token inputs
  with a 400 that aborts the WHOLE vector index; an 8000-byte head guarantees
  <=8000 tokens (BPE never emits more tokens than chars; ASCII char==byte).
- Accumulate usage.total_tokens (atomic) and expose TokensUsed(); the indexer
  logs `embed_tokens` on "vector index built" so a paid pass reports its spend.
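
A rough sketch of the token accounting described in the last bullet, assuming the response carries the usual `usage.total_tokens` field of an OpenAI-style embeddings reply. `usageCounter`, `record`, and `embedResponse` are hypothetical names; only `TokensUsed()` is named in the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
	"sync/atomic"
)

// embedResponse decodes only the usage block of an embeddings reply.
type embedResponse struct {
	Usage struct {
		TotalTokens int64 `json:"total_tokens"`
	} `json:"usage"`
}

// usageCounter (hypothetical) accumulates tokens across concurrent batches.
type usageCounter struct {
	tokens atomic.Int64
}

// record parses one response body and adds its total_tokens to the counter.
func (c *usageCounter) record(body []byte) error {
	var r embedResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return err
	}
	c.tokens.Add(r.Usage.TotalTokens)
	return nil
}

// TokensUsed reports the running total, e.g. for an embed_tokens log field.
func (c *usageCounter) TokensUsed() int64 { return c.tokens.Load() }

func main() {
	var c usageCounter
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.record([]byte(`{"data":[],"usage":{"prompt_tokens":10,"total_tokens":10}}`))
		}()
	}
	wg.Wait()
	fmt.Println("embed_tokens:", c.TokensUsed()) // embed_tokens: 40
}
```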

internal/indexer (indexer.go, vector_persist_test.go):
- GORTEX_EMBEDDINGS_MAX_SYMBOLS env override for the vector-index size cap that
  config plumbing didn't reach.
- Persist the vector index under the bulk loader. During a bulk index idx.graph
  is the in-memory shadow (no graph.VectorSearcher), so buildSearchIndex never
  called BulkUpsertEmbeddings — vectors lived only in the in-process HNSW and
  the sqlite `vectors` table stayed empty, lost on restart. Capture the disk
  store at the shadow swap (bulkVectorSink) and persist against it (the vectors
  table has no FK to nodes, so upsert before FlushBulk is safe).
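
For the `GORTEX_EMBEDDINGS_MAX_SYMBOLS` override in the first bullet of this list, one plausible shape is shown below. `maxSymbols` is a hypothetical helper, and treating empty, non-numeric, or non-positive values as "keep the configured cap" is an assumption; the actual indexer may interpret such values differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxSymbols (hypothetical) returns the vector-index size cap: the value of
// GORTEX_EMBEDDINGS_MAX_SYMBOLS when it parses as a positive integer,
// otherwise the configured default. Ignoring invalid values is an assumption.
func maxSymbols(configured int, getenv func(string) string) int {
	raw := strings.TrimSpace(getenv("GORTEX_EMBEDDINGS_MAX_SYMBOLS"))
	if raw == "" {
		return configured
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return configured
	}
	return n
}

func main() {
	env := map[string]string{"GORTEX_EMBEDDINGS_MAX_SYMBOLS": "250000"}
	getenv := func(k string) string { return env[k] }
	fmt.Println(maxSymbols(50000, getenv)) // 250000
}
```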

cmd/gortex/daemon (daemon_state.go, daemon_state_test.go):
- Warm-restart prefix fix (warmMtimePrefix). Single-repo daemons index
  unprefixed (file_mtimes rows keyed by ""), but priorMtimesFromStore looked
  them up under the path basename, so 0 rows matched and every restart did a
  full cold re-index (+ a paid re-embed). Look up under "" in single-repo mode.
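
The prefix mismatch can be illustrated with a toy store. Only the name `warmMtimePrefix` comes from the commit message; its signature here, the `multiRepo` flag, the in-memory map standing in for the `file_mtimes` table, and the basename prefix for multi-repo mode are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// warmMtimePrefix returns the key prefix under which file_mtimes rows are
// looked up. Single-repo daemons index unprefixed, so the lookup must use "".
// Using the path basename in multi-repo mode is an assumption of this sketch.
func warmMtimePrefix(repoPath string, multiRepo bool) string {
	if !multiRepo {
		return ""
	}
	return filepath.Base(repoPath)
}

func main() {
	// Toy stand-in for the persisted table: prefix -> file -> mtime.
	store := map[string]map[string]int64{
		"": {"src/Main.java": 1718700000},
	}

	// Old behaviour: look up under the basename, which matches 0 rows.
	fmt.Println(len(store[filepath.Base("/src/drools")])) // 0

	// Fixed behaviour: single-repo mode looks up under "".
	fmt.Println(len(store[warmMtimePrefix("/src/drools", false)])) // 1
}
```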

Together the last two enable a two-pass index on a RAM-tight box: embed-only
first (vectors persist to sqlite), then a warm restart restores graph+vectors
with no re-parse/re-embed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 7a38cd9)
…tart`

Mirror `gortex mcp`'s embedding-API flags onto `gortex daemon start` so the long-lived daemon can use an explicit OpenAI-compatible (or Ollama) embedding endpoint instead of only the built-in GloVe/transformer providers. The flags thread into the existing serverstack.EmbedderRequest{FlagURL,FlagModel} -> ResolveEmbedder path (a non-empty URL forces the api provider; key via $GORTEX_EMBEDDINGS_API_KEY or $OPENAI_API_KEY). No new embedding code — the OpenAI APIProvider already existed; this just makes the daemon flag-drivable like mcp.
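
As a rough illustration of the flag threading described above, using the standard library `flag` package rather than whatever CLI framework gortex actually uses. `embedderRequest` only mirrors the two fields named in the commit message; `parseDaemonStartFlags`, `providerFor`, and the `"builtin"` label are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// embedderRequest mirrors the FlagURL/FlagModel fields named above.
type embedderRequest struct {
	FlagURL   string
	FlagModel string
}

// parseDaemonStartFlags (hypothetical) reads the two embedding flags.
func parseDaemonStartFlags(args []string) (embedderRequest, error) {
	fs := flag.NewFlagSet("daemon start", flag.ContinueOnError)
	var req embedderRequest
	fs.StringVar(&req.FlagURL, "embeddings-url", "", "OpenAI-compatible or Ollama embedding endpoint")
	fs.StringVar(&req.FlagModel, "embeddings-model", "", "embedding model name")
	if err := fs.Parse(args); err != nil {
		return embedderRequest{}, err
	}
	return req, nil
}

// providerFor (hypothetical) encodes the stated rule: a non-empty URL forces
// the api provider; otherwise a built-in provider is used.
func providerFor(req embedderRequest) string {
	if req.FlagURL != "" {
		return "api"
	}
	return "builtin"
}

func main() {
	req, err := parseDaemonStartFlags([]string{
		"--embeddings-url", "https://api.openai.com/v1",
		"--embeddings-model", "text-embedding-3-small",
	})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Println(providerFor(req), req.FlagModel)
}
```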

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit ee712fa)
An APIProvider reported Dimensions()==0 until its first embed, so the
daemon logged dim:0 and the snapshot-vector reload gate
(daemon_state.go: vec.Dims == EmbedderDims) rejected a correctly-sized
cached index, re-embedding the whole graph on every restart.

Add APIProvider.ProbeDimensions(ctx): one tiny embed call that caches the
true width up front — idempotent, best-effort (a failure only warns and
the lazy path still fills it in), and doubles as an early key/URL
connectivity check. NewSharedServer probes any API-backed provider before
logging "embeddings enabled", so the width is truthful from the start.
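
A minimal sketch of an idempotent, best-effort dimension probe along the lines described above. The `provider` type, the injected `embed` function, `probeAtStartup`, `fakeProvider`, and the five-second timeout are hypothetical; only the `ProbeDimensions(ctx)` and `Dimensions()` names come from the commit message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"sync/atomic"
	"time"
)

type embedFunc func(ctx context.Context, text string) ([]float32, error)

// provider (hypothetical) caches the embedding width once it is known.
type provider struct {
	embed embedFunc
	dims  atomic.Int64
}

func (p *provider) Dimensions() int { return int(p.dims.Load()) }

// ProbeDimensions issues one tiny embed call and caches the vector width.
// It is idempotent: once a width is cached, no further call is made.
func (p *provider) ProbeDimensions(ctx context.Context) (int, error) {
	if d := p.Dimensions(); d > 0 {
		return d, nil
	}
	vec, err := p.embed(ctx, "probe")
	if err != nil {
		return 0, fmt.Errorf("probe dimensions: %w", err)
	}
	if len(vec) == 0 {
		return 0, errors.New("probe dimensions: empty embedding")
	}
	p.dims.Store(int64(len(vec)))
	return len(vec), nil
}

const probeTimeout = 5 * time.Second // assumed value

// probeAtStartup is best-effort: a failure only warns and returns 0, leaving
// the width to be filled in lazily by the first real embed.
func probeAtStartup(p *provider) int {
	ctx, cancel := context.WithTimeout(context.Background(), probeTimeout)
	defer cancel()
	d, err := p.ProbeDimensions(ctx)
	if err != nil {
		log.Printf("warning: %v", err)
		return 0
	}
	return d
}

// fakeProvider stands in for a real backend; width <= 0 simulates an outage.
func fakeProvider(width int, calls *int) *provider {
	return &provider{embed: func(ctx context.Context, text string) ([]float32, error) {
		*calls++
		if width <= 0 {
			return nil, errors.New("backend unavailable")
		}
		return make([]float32, width), nil
	}}
}

func main() {
	calls := 0
	p := fakeProvider(1536, &calls)
	fmt.Println(probeAtStartup(p), probeAtStartup(p), calls) // 1536 1536 1
}
```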

Also fix a double-/v1 bug: NewAPIProvider("…/v1") + embedOpenAI appending
"/v1/embeddings" produced "…/v1/v1/embeddings" → 404 → silent fallback to
BM25. OpenAI-compatible bases are conventionally given with /v1 (OpenAI,
OpenRouter), so append it only when absent.
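
The "append /v1 only when absent" rule could look like the following. `embeddingsURL` is a hypothetical helper, and trimming trailing slashes before the check is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// embeddingsURL (hypothetical) builds the embeddings endpoint from a base
// URL, adding the /v1 segment only when the base does not already end in it.
func embeddingsURL(base string) string {
	base = strings.TrimRight(base, "/")
	if strings.HasSuffix(base, "/v1") {
		return base + "/embeddings"
	}
	return base + "/v1/embeddings"
}

func main() {
	fmt.Println(embeddingsURL("https://api.openai.com/v1")) // https://api.openai.com/v1/embeddings
	fmt.Println(embeddingsURL("http://localhost:11434"))    // http://localhost:11434/v1/embeddings
}
```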

Tests: probe unit/error/URL-variant tests + a live OpenAI integration test
(skipped without a key) asserting a 1536-d width and token accounting.
Verified live: daemon now logs "embedding dimension probed dim:1536".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit d4b6a41)