feat: embedding-based entry point resolution #130
Adds semantic resolution to bridge the vocabulary gap in the lexical-only resolver. When query terms don't appear literally in file paths/symbols/headings, embeddings find related files via cosine similarity.
- buildIdentity(): compact semantic fingerprint per file
- embedTexts(): batch OpenAI text-embedding-3-small (512 dims)
- resolveHybridEntryPoints(): merge lexical + semantic scores
- Serializable cache, only re-embeds on content hash change

- Export embedding resolver functions from public API
- Add embeddingCache to ContextGraphEngine
- Add queryHybrid() method: semantic+lexical → graph traversal → packed
- Add buildEmbeddings(), loadEmbeddingCache(), saveEmbeddingCache()
- Falls back to lexical-only when no cache available
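The commit notes above say the cache only re-embeds a file when its content hash changes. A minimal sketch of that check follows. The `CachedEmbedding` shape and the `findStaleFileIds` helper are hypothetical names introduced for illustration, not taken from the PR.

```typescript
// Hypothetical cache entry shape: the hash the embedding was computed from, plus the vector.
interface CachedEmbedding {
  contentHash: string;
  embedding: number[];
}

// Returns the file IDs that need (re-)embedding: files missing from the cache
// and files whose current content hash differs from the cached one.
function findStaleFileIds(
  currentHashes: Map<string, string>, // fileId -> current content hash
  cache: Map<string, CachedEmbedding>,
): string[] {
  const stale: string[] = [];
  for (const [fileId, contentHash] of currentHashes) {
    const cached = cache.get(fileId);
    if (!cached || cached.contentHash !== contentHash) {
      stale.push(fileId);
    }
  }
  return stale;
}
```

Only the returned IDs would then be sent to the embedding API, so an unchanged repository costs no API calls on refresh.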
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 47b65a2f63
    for (const [fileId, entry] of cache.entries) {
      if (entry.embedding?.length > 0) {
        semanticScores.set(fileId, cosineSimilarity(queryEmbedding, entry.embedding));
      }
Ignore cache entries not present in current graph
resolveHybridEntryPoints scores every cached embedding without checking whether that fileId still exists in graph.nodes. If a stale or wrong cache is loaded (for example after switching repos or loading an old cache file), unrelated IDs can occupy the top-K results, and traverseGraph later drops them as missing nodes, which can leave queryHybrid() with little or no usable context even when lexical matches exist. Filter semantic scoring/merging to IDs that are present in the current graph before ranking.
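One way the suggested filter could look is sketched below. The `Graph`, `EmbeddingCache`, and `CacheEntry` shapes are simplified assumptions (in particular, `graph.nodes` is assumed to be a `Map` keyed by file ID), and `scoreCachedEmbeddings` is a hypothetical helper name. The PR's actual types may differ.

```typescript
// Hypothetical minimal shapes; the real types in the PR may differ.
interface CacheEntry {
  embedding?: number[];
}

interface EmbeddingCache {
  entries: Map<string, CacheEntry>;
}

interface Graph {
  nodes: Map<string, unknown>; // assumption: nodes are keyed by file ID
}

// Plain cosine similarity; returns 0 for mismatched, empty, or zero-norm vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scores only cached embeddings whose file ID exists in the current graph,
// so stale IDs from an old or foreign cache cannot occupy top-K slots.
function scoreCachedEmbeddings(
  graph: Graph,
  cache: EmbeddingCache,
  queryEmbedding: number[],
): Map<string, number> {
  const semanticScores = new Map<string, number>();
  for (const [fileId, entry] of cache.entries) {
    if (!graph.nodes.has(fileId)) continue; // skip IDs unknown to this graph
    const embedding = entry.embedding;
    if (embedding && embedding.length > 0) {
      semanticScores.set(fileId, cosineSimilarity(queryEmbedding, embedding));
    }
  }
  return semanticScores;
}
```

Filtering before ranking, rather than after, keeps the top-K budget for files that graph traversal can actually use.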
    if (root.firstSentence) {
      parts.push(`Purpose: ${root.firstSentence}`);
Read purpose sentence from tree metadata
buildIdentity reads root.firstSentence, but TreeNode stores that value under root.meta.firstSentence. As written, this branch never adds the file purpose text, so identity strings lose a key semantic signal and hybrid retrieval quality drops for files that rely on prose context. This should reference root.meta?.firstSentence.
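A sketch of the corrected read is shown below. Only `meta.firstSentence` and the `Purpose:` label come from the review and the diff. The `FileNodeLike` shape and the `Path:`, `Exports:`, and `Headings:` labels are assumptions made for illustration.

```typescript
// Hypothetical, trimmed-down shapes; only `meta.firstSentence` is taken from the review comment.
interface TreeNode {
  meta?: { firstSentence?: string };
}

interface FileNodeLike {
  path: string;
  exports: string[];
  headings: string[];
  root: TreeNode;
}

// Builds a compact identity string: path + exports + headings + first sentence.
function buildIdentity(file: FileNodeLike): string {
  const parts: string[] = [`Path: ${file.path}`];
  if (file.exports.length > 0) {
    parts.push(`Exports: ${file.exports.join(", ")}`);
  }
  if (file.headings.length > 0) {
    parts.push(`Headings: ${file.headings.join(", ")}`);
  }
  // Read from tree metadata; optional chaining tolerates nodes without `meta`.
  const firstSentence = file.root.meta?.firstSentence;
  if (firstSentence) {
    parts.push(`Purpose: ${firstSentence}`);
  }
  return parts.join("\n");
}
```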
Problem
The context graph resolver is purely lexical. When the user queries "how does authentication work?" but no file contains "auth" in its path, symbol names, or headings, the resolver returns 0 entry points. The graph traversal never starts.
This is the #1 limitation documented in the architecture analysis.
Solution
Add an embedding-based resolution layer that bridges the vocabulary gap.
New file: embeddingResolver.ts
- buildIdentity(): compact semantic fingerprint per FileNode (~100 tokens): path + exports + headings + first sentence
- buildEmbeddingCache(): batch embed identities via OpenAI text-embedding-3-small (512 dims). Only re-embeds when content hash changes.
- resolveHybridEntryPoints(): merge lexical + semantic scores. Drop-in replacement for resolveEntryPoints()
- serializeCache() / deserializeCache(): persist cache between sessions

Updated: index.ts
ContextGraphEngine gains:
- queryHybrid(): semantic+lexical entry points → graph traversal → packed context
- buildEmbeddings(): build/refresh embedding cache
- loadEmbeddingCache() / saveEmbeddingCache(): persistence

Updated: types.ts
- HybridEntryPoint extends EntryPoint with lexicalScore + semanticScore
- EmbeddingCacheData for serialization

Hybrid scoring
The 0.6 semantic weight ensures vocabulary-gap queries get resolved, while lexical still contributes for exact matches.
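A weighted merge of this kind could be sketched as follows. Only the 0.6 semantic weight comes from the description above. The complementary 0.4 lexical weight, the assumption that both scores are normalized to [0, 1], the `HybridEntryPointSketch` shape, and the `mergeScores` helper are all illustrative assumptions.

```typescript
const SEMANTIC_WEIGHT = 0.6; // stated in the PR description
const LEXICAL_WEIGHT = 1 - SEMANTIC_WEIGHT; // assumption: the two weights sum to 1

// Hypothetical stand-in for HybridEntryPoint; the real EntryPoint fields are not shown in the PR text.
interface HybridEntryPointSketch {
  fileId: string;
  score: number;
  lexicalScore: number;
  semanticScore: number;
}

// Merges per-file lexical and semantic scores (assumed normalized to [0, 1])
// into one ranked list and keeps the top K.
function mergeScores(
  lexical: Map<string, number>,
  semantic: Map<string, number>,
  topK: number,
): HybridEntryPointSketch[] {
  const ids = new Set<string>([...lexical.keys(), ...semantic.keys()]);
  const merged: HybridEntryPointSketch[] = [];
  for (const fileId of ids) {
    const lexicalScore = lexical.get(fileId) ?? 0;
    const semanticScore = semantic.get(fileId) ?? 0;
    merged.push({
      fileId,
      lexicalScore,
      semanticScore,
      score: LEXICAL_WEIGHT * lexicalScore + SEMANTIC_WEIGHT * semanticScore,
    });
  }
  merged.sort((a, b) => b.score - a.score);
  return merged.slice(0, topK);
}
```

Under these assumptions, a file with no lexical match at all can still rank first when its semantic score is high, which is the vocabulary-gap case the PR targets.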
Usage
Cost
Related
Also pushed to agent-skills repo: Python equivalent (embed_resolve.py) + updated pack_context.py with --semantic flag.