
feat: embedding-based entry point resolution#130

Open
VictorGjn wants to merge 3 commits into master from feat/embedding-resolver

Conversation

@VictorGjn
Owner

Problem

The context graph resolver is purely lexical. When the user queries "how does authentication work?" but no file contains "auth" in its path, symbol names, or headings, the resolver returns 0 entry points. The graph traversal never starts.

This is the #1 limitation documented in the architecture analysis.

Solution

Add an embedding-based resolution layer that bridges the vocabulary gap.

New file: embeddingResolver.ts

  • buildIdentity() — compact semantic fingerprint per FileNode (~100 tokens): path + exports + headings + first sentence
  • buildEmbeddingCache() — batch embed identities via OpenAI text-embedding-3-small (512 dims). Only re-embeds when content hash changes.
  • resolveHybridEntryPoints() — merge lexical + semantic scores. Drop-in replacement for resolveEntryPoints()
  • serializeCache() / deserializeCache() — persist cache between sessions
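A minimal sketch of what the identity fingerprint could look like (the `FileNode` field names beyond those listed above, and the truncation limits, are assumptions, not the PR's actual implementation):

```typescript
// Hypothetical FileNode shape for illustration only.
interface FileNode {
  id: string;
  path: string;
  exports: string[];
  headings: string[];
  firstSentence?: string;
}

// Build a compact (~100 token) semantic fingerprint per file:
// path + exports + headings + first sentence, one field per line.
function buildIdentity(node: FileNode): string {
  const parts: string[] = [`Path: ${node.path}`];
  if (node.exports.length > 0) {
    parts.push(`Exports: ${node.exports.slice(0, 20).join(", ")}`);
  }
  if (node.headings.length > 0) {
    parts.push(`Headings: ${node.headings.slice(0, 10).join(" | ")}`);
  }
  if (node.firstSentence) {
    parts.push(`Purpose: ${node.firstSentence}`);
  }
  return parts.join("\n");
}
```

Keeping the fingerprint small is what makes batch embedding cheap: the identity, not the full file content, is what gets sent to the embedding API.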

Updated: index.ts

  • New exports for embedding resolver
  • ContextGraphEngine gains:
    • queryHybrid(): semantic+lexical entry points → graph traversal → packed context
    • buildEmbeddings(): build/refresh embedding cache
    • loadEmbeddingCache() / saveEmbeddingCache(): persistence
    • Falls back to lexical-only when no embedding cache available

Updated: types.ts

  • HybridEntryPoint extends EntryPoint with lexicalScore + semanticScore
  • EmbeddingCacheData for serialization
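The type additions might look like the following (the base `EntryPoint` fields shown here are an assumption; only the extension with the two score fields is stated in the PR):

```typescript
// Assumed base type for illustration.
interface EntryPoint {
  fileId: string;
  score: number;
}

// HybridEntryPoint carries both component scores so callers can
// inspect how lexical vs. semantic matching contributed.
interface HybridEntryPoint extends EntryPoint {
  lexicalScore: number;
  semanticScore: number;
}
```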

Hybrid scoring

combined = lexical * 0.4 + semantic * 0.6

The 0.6 semantic weight ensures vocabulary-gap queries get resolved, while lexical still contributes for exact matches.
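A sketch of the merge step, using the weights above (the score-map shape is an assumption; a file missing from one side simply scores 0 there):

```typescript
// Combine per-file lexical and semantic scores with fixed weights.
// Files present in only one map still get a combined score.
function mergeScores(
  lexical: Map<string, number>,
  semantic: Map<string, number>,
  wLex = 0.4,
  wSem = 0.6,
): Map<string, number> {
  const combined = new Map<string, number>();
  const ids = new Set([...lexical.keys(), ...semantic.keys()]);
  for (const id of ids) {
    combined.set(id, (lexical.get(id) ?? 0) * wLex + (semantic.get(id) ?? 0) * wSem);
  }
  return combined;
}
```

Note that a file with a perfect semantic match but no lexical match caps at 0.6, so a strong lexical match (0.4 + any semantic score) can still outrank it.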

Usage

```ts
const engine = new ContextGraphEngine();
engine.scan(rootPath, files);

// Build embeddings once (persists, only re-embeds changed files)
await engine.buildEmbeddings(apiKey);

// Hybrid query
const packed = await engine.queryHybrid("how does auth work?", apiKey, 8000);
```

Cost

  • ~$0.01 per 500 files indexed
  • ~$0.0001 per query
  • Cache is content-hash-aware: incremental updates only re-embed changed files

Related

Also pushed to agent-skills repo: Python equivalent (embed_resolve.py) + updated pack_context.py with --semantic flag.

Commits

Adds semantic resolution to bridge the vocabulary gap in the lexical-only resolver. When query terms don't appear literally in file paths/symbols/headings, embeddings find related files via cosine similarity.

- buildIdentity(): compact semantic fingerprint per file
- embedTexts(): batch OpenAI text-embedding-3-small (512 dims)
- resolveHybridEntryPoints(): merge lexical + semantic scores
- Serializable cache, only re-embeds on content hash change
- Export embedding resolver functions from public API
- Add embeddingCache to ContextGraphEngine
- Add queryHybrid() method: semantic+lexical → graph traversal → packed
- Add buildEmbeddings(), loadEmbeddingCache(), saveEmbeddingCache()
- Falls back to lexical-only when no cache available

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 47b65a2f63


Comment on lines +293 to +296
```ts
for (const [fileId, entry] of cache.entries) {
  if (entry.embedding?.length > 0) {
    semanticScores.set(fileId, cosineSimilarity(queryEmbedding, entry.embedding));
  }
```
P1: Ignore cache entries not present in current graph

resolveHybridEntryPoints scores every cached embedding without checking whether that fileId still exists in graph.nodes. If a stale or wrong cache is loaded (for example after switching repos or loading an old cache file), unrelated IDs can occupy the top-K results, and traverseGraph later drops them as missing nodes, which can leave queryHybrid() with little or no usable context even when lexical matches exist. Filter semantic scoring/merging to IDs that are present in the current graph before ranking.
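A possible shape for the suggested fix, as a self-contained sketch (the `graphNodeIds` parameter and the inlined `cosineSimilarity` helper are stand-ins for the real `graph.nodes` lookup and the helper in embeddingResolver.ts):

```typescript
// Inlined for the sketch; the real helper lives in embeddingResolver.ts.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na > 0 && nb > 0 ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// Score only cache entries whose fileId exists in the current graph,
// so a stale or foreign cache cannot crowd out real matches.
function scoreSemantic(
  cacheEntries: Map<string, { embedding?: number[] }>,
  graphNodeIds: Set<string>,
  queryEmbedding: number[],
): Map<string, number> {
  const semanticScores = new Map<string, number>();
  for (const [fileId, entry] of cacheEntries) {
    if (!graphNodeIds.has(fileId)) continue; // drop stale/foreign entries
    if (entry.embedding && entry.embedding.length > 0) {
      semanticScores.set(fileId, cosineSimilarity(queryEmbedding, entry.embedding));
    }
  }
  return semanticScores;
}
```

Filtering before ranking (rather than letting traverseGraph drop missing nodes later) keeps the top-K slots for files that can actually contribute context.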


Comment on lines +84 to +85
```ts
if (root.firstSentence) {
  parts.push(`Purpose: ${root.firstSentence}`);
```
P2: Read purpose sentence from tree metadata

buildIdentity reads root.firstSentence, but TreeNode stores that value under root.meta.firstSentence. As written, this branch never adds the file purpose text, so identity strings lose a key semantic signal and hybrid retrieval quality drops for files that rely on prose context. This should reference root.meta?.firstSentence.
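The fix reduces to one optional-chained read; a minimal sketch (the `TreeNode` shape here is inferred from the comment, not copied from the codebase):

```typescript
// TreeNode stores the first sentence under meta, per the review comment.
interface TreeNode {
  meta?: { firstSentence?: string };
}

// Push the purpose line only when the metadata actually carries one.
function pushPurpose(root: TreeNode, parts: string[]): void {
  if (root.meta?.firstSentence) {
    parts.push(`Purpose: ${root.meta.firstSentence}`);
  }
}
```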

