
Hybrid Retrieval

Lisa edited this page Apr 13, 2026 · 5 revisions

Hybrid Retrieval (v7.4 / LIP v2.0)

CKB v7.4 introduced hybrid retrieval that combines graph-based ranking with traditional text search to dramatically improve search quality. CKB v8.x extends this with LIP v2.0 semantic embeddings for deeper integration: novelty detection in PRs, semantic test discovery, file boundary analysis, and architecture coupling signals.

Overview

Traditional code search relies on text matching (FTS), which finds symbols by name but doesn't understand code relationships. Hybrid retrieval adds Personalized PageRank (PPR) over the symbol graph to boost results that are structurally related to your query.

Results

| Metric | Before | After | Improvement |
|-----------|--------|--------|-------------|
| Recall@10 | 62.1% | 100% | +61% |
| MRR | 0.546 | 0.914 | +67% |
| Latency | 29.4ms | 29.0ms | ~0% |

Note: Latency remains similar because PPR computation is cheap. The improvement is in search quality, not speed.

How It Works

1. Initial Search (FTS)

When you search for a symbol, CKB first uses SQLite FTS5 for fast text matching:

Query: "Engine"
FTS Results: Engine, Engine#logger, Engine#config, EngineMock, ...

2. Graph-Based Re-ranking (PPR)

CKB then builds a symbol graph from SCIP edges and runs Personalized PageRank:

Seeds: Top FTS hits (Engine, Engine#logger, ...)
Graph: Call edges, reference edges, type edges
PPR: Propagate importance through graph
Output: Re-ranked by graph proximity + FTS score

Seed Expansion

When FTS returns struct fields (e.g., Engine#logger), seed expansion automatically includes related methods:

FTS seeds: Engine#logger, Engine#config, Engine#db
Expanded: + Engine#SearchSymbols(), Engine#GetCallGraph(), ...

This helps PPR discover cross-module dependencies through method calls, not just field references.
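A minimal sketch of this expansion, assuming symbols use `Type#member` naming with a trailing `()` marking methods. The symbol names and the flat symbol list are illustrative; CKB's real lookup goes through its internal symbol table.

```go
package main

import (
	"fmt"
	"strings"
)

// expandSeeds adds a type's methods to the seed set whenever one of
// its fields appears as an FTS seed, so PPR can follow call edges.
func expandSeeds(seeds, allSymbols []string) []string {
	out := append([]string(nil), seeds...)
	seen := map[string]bool{}
	for _, s := range seeds {
		seen[s] = true
	}
	for _, s := range seeds {
		owner, _, ok := strings.Cut(s, "#")
		if !ok {
			continue // not a field seed, nothing to expand
		}
		for _, sym := range allSymbols {
			// methods carry "()", fields do not
			if strings.HasPrefix(sym, owner+"#") &&
				strings.HasSuffix(sym, "()") && !seen[sym] {
				seen[sym] = true
				out = append(out, sym)
			}
		}
	}
	return out
}

func main() {
	seeds := []string{"Engine#logger", "Engine#config"}
	all := []string{"Engine#SearchSymbols()", "Engine#GetCallGraph()", "Engine#db"}
	fmt.Println(expandSeeds(seeds, all))
}
```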

3. Combined Scoring

The final ranking combines FTS position with PPR scores:

combined = 0.6 * position_score + 0.4 * ppr_score

Where:

  • position_score = 1 / (rank + 1) — Original FTS ranking bonus
  • ppr_score — Normalized PPR importance from graph traversal

This simple approach achieves 100% Recall@10 on the built-in eval suite without the complexity of multi-signal fusion.
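The formula can be sketched directly. The 0.6/0.4 weights are the values stated above; note how a lower-ranked FTS hit with a strong PPR score can overtake the top lexical hit:

```go
package main

import "fmt"

// combinedScore blends the FTS position bonus 1/(rank+1) (rank is
// 0-based) with a normalized PPR score, using the 0.6/0.4 weights
// from the text.
func combinedScore(rank int, pprScore float64) float64 {
	positionScore := 1.0 / float64(rank+1)
	return 0.6*positionScore + 0.4*pprScore
}

func main() {
	// A graph-isolated top hit scores 0.6; a rank-1 hit that is
	// strongly connected to the seeds (ppr 0.9) scores 0.66 and wins.
	fmt.Printf("%.3f\n", combinedScore(0, 0.0))
	fmt.Printf("%.3f\n", combinedScore(1, 0.9))
}
```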

Eval Suite

CKB includes an evaluation framework to measure retrieval quality.

Running Eval

# Run built-in tests
ckb eval

# Custom fixtures
ckb eval --fixtures=./my-tests.json

# JSON output
ckb eval --format=json

Test Types

Needle tests - Find at least one expected symbol in top-K:

{
  "id": "find-engine",
  "type": "needle",
  "query": "Engine",
  "expectedSymbols": ["Engine", "query.Engine"],
  "topK": 10
}

Ranking tests - Verify expected symbol is highly ranked:

{
  "id": "engine-first",
  "type": "ranking",
  "query": "query engine",
  "expectedSymbols": ["Engine"],
  "topK": 3
}

Expansion tests - Check graph connectivity:

{
  "id": "engine-connects-backends",
  "type": "expansion",
  "query": "Engine",
  "expectedSymbols": ["Engine", "Orchestrator", "SCIPAdapter"],
  "topK": 20
}

Metrics

  • Recall@K - % of tests where expected symbol was in top-K
  • MRR - Mean Reciprocal Rank (higher = expected found earlier)
  • Latency - Average query time
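The per-test pieces of these metrics can be sketched as follows (averaging across all tests is omitted; result and symbol names are illustrative):

```go
package main

import "fmt"

// recallAtK reports whether any expected symbol appears in the top-K
// results, as used by needle tests.
func recallAtK(results, expected []string, k int) bool {
	if k > len(results) {
		k = len(results)
	}
	for _, r := range results[:k] {
		for _, e := range expected {
			if r == e {
				return true
			}
		}
	}
	return false
}

// reciprocalRank returns 1/(1-based position of the first expected
// symbol), or 0 if none is found; MRR is the mean over all tests.
func reciprocalRank(results, expected []string) float64 {
	for i, r := range results {
		for _, e := range expected {
			if r == e {
				return 1.0 / float64(i+1)
			}
		}
	}
	return 0
}

func main() {
	results := []string{"EngineMock", "Engine", "Engine#config"}
	fmt.Println(recallAtK(results, []string{"Engine"}, 10))
	fmt.Println(reciprocalRank(results, []string{"Engine"}))
}
```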

PPR Algorithm

Personalized PageRank computes importance scores relative to seed nodes.

Algorithm

Input:
  - seeds: FTS hit symbol IDs
  - graph: SCIP call/reference edges
  - damping: 0.85 (probability of following edge)
  - iterations: 20 (max power iterations)

Process:
  1. Initialize scores: each seed gets 1/(number of seeds), others get 0
  2. Iterate: score[i] = damping * Σ(edge_weight * score[neighbor])
                        + (1-damping) * teleport[i]
  3. Stop when converged or max iterations

Output:
  - Ranked nodes with scores
  - Backtracked paths explaining "why"
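The power iteration can be sketched as below. The three-node graph, node IDs, and edge weights are toy values; normalising each node's outgoing weight before propagating is an implementation choice not spelled out in the pseudocode above.

```go
package main

import "fmt"

type edge struct {
	to     int
	weight float64
}

// ppr runs power iteration: seeds get 1/len(seeds) teleport mass,
// each step propagates damped score along weighted out-edges and
// mixes the (1-damping) teleport term back in.
func ppr(adj [][]edge, seeds []int, damping float64, iters int) []float64 {
	n := len(adj)
	teleport := make([]float64, n)
	for _, s := range seeds {
		teleport[s] = 1.0 / float64(len(seeds))
	}
	score := append([]float64(nil), teleport...) // step 1
	for it := 0; it < iters; it++ {
		next := make([]float64, n)
		for i := range adj {
			var total float64
			for _, e := range adj[i] {
				total += e.weight
			}
			if total == 0 {
				continue // sink node: mass leaks, as in simple PPR variants
			}
			for _, e := range adj[i] {
				next[e.to] += damping * score[i] * e.weight / total
			}
		}
		for i := range next { // step 2: teleport term
			next[i] += (1 - damping) * teleport[i]
		}
		score = next
	}
	return score
}

func main() {
	// 0=Engine (seed) calls 1 (weight 1.0); 1 references 2 (weight 0.8)
	adj := [][]edge{{{1, 1.0}}, {{2, 0.8}}, nil}
	s := ppr(adj, []int{0}, 0.85, 20)
	fmt.Printf("%.3f %.3f %.3f\n", s[0], s[1], s[2])
}
```

Scores decay with graph distance from the seed, which is exactly the "proximity" signal blended into the combined score.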

Edge Weights

| Edge Type | Weight | Meaning |
|-------------|--------|--------------------------|
| Call | 1.0 | Function calls function |
| Definition | 0.9 | Reference to definition |
| Reference | 0.8 | General reference |
| Implements | 0.7 | Type implements interface |
| Type-of | 0.6 | Instance of type |
| Same-module | 0.3 | Co-located symbols |

Export Organizer

The exportForLLM tool now includes an organizer step that structures output for better LLM comprehension.

Before (v7.3)

## internal/query/
  ! engine.go
    $ Engine
    # SearchSymbols()
    # GetSymbol()
  ! symbols.go
    # rankSearchResults()

After (v7.4)

## Module Map

| Module | Symbols | Files | Key Exports |
|--------|---------|-------|-------------|
| internal/query | 150 | 12 | Engine, SearchSymbols |
| internal/backends | 80 | 8 | Orchestrator, SCIPAdapter |

## Cross-Module Connections

- internal/query → internal/backends
- internal/mcp → internal/query

## Module Details

### internal/query/

**engine.go**
  $ Engine
  # SearchSymbols() [c=12] ★★
  # GetSymbol() [c=5]

Benefits

  • Module Map - Overview of codebase structure at a glance
  • Cross-Module Bridges - Key integration points highlighted
  • Importance Ordering - Most important symbols first
  • Context Efficiency - LLMs understand structure before details

Fast Tier: LIP Embedding Re-ranking

When no SCIP index is available (LSP-only setup), PPR can't run — there's no symbol graph. CKB uses LIP v2.0 semantic similarity at two levels.

Level 1 — Re-ranking (FTS results exist)

FTS5 symbol search → lexical ranking
                              ↓
    LIP GetEmbeddingsBatch(all candidate file URIs)   ← single round-trip
                              ↓
         centroid of top-5 seed embeddings (L2-normalised)
                              ↓
  score = 0.6 × lexical_position + 0.4 × dot_product(vec, centroid)
                              ↓
                     re-sorted results

The centroid of the top lexically-ranked files acts as a "query neighbourhood" — results from semantically similar files rise, exactly as PPR lifts graph-adjacent symbols.
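A toy sketch of this centroid re-rank, using 2-D stand-ins for real LIP embeddings; the 0.6/0.4 blend matches the combined-scoring formula above, and the vectors are purely illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns the L2-normalised copy of v.
func normalize(v []float64) []float64 {
	var n float64
	for _, x := range v {
		n += x * x
	}
	n = math.Sqrt(n)
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / n
	}
	return out
}

// centroid averages the seed vectors and re-normalises, forming the
// "query neighbourhood" vector.
func centroid(vecs [][]float64) []float64 {
	c := make([]float64, len(vecs[0]))
	for _, v := range vecs {
		for i, x := range v {
			c[i] += x
		}
	}
	return normalize(c)
}

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	// candidate file embeddings, in lexical rank order; the last one
	// matched lexically but is semantically distant
	embs := [][]float64{
		normalize([]float64{1, 0}),
		normalize([]float64{0.9, 0.1}),
		normalize([]float64{0, 1}),
	}
	c := centroid(embs[:2]) // top seeds form the query neighbourhood
	for rank, e := range embs {
		score := 0.6*(1.0/float64(rank+1)) + 0.4*dot(e, c)
		fmt.Printf("rank %d → %.3f\n", rank, score)
	}
}
```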

Level 2 — Semantic fallback (FTS found nothing)

When no symbol name matches the query literally, CKB falls back to LIP's nearest-neighbour index:

LIP NearestByTextFiltered(query, top_k=20, filter="", min_score=0)
                              ↓                  ← HNSW index on all repo files
  for each matching file URI:
      resolve symbols from FTS content table
                              ↓
  deduplicated symbol results, scored by LIP similarity
                              ↓
  LIP re-ranking pass (same centroid algorithm)

This is the path for queries like "connection pool with retry" or "rate limiter token bucket", where no symbol name contains those words literally. It works without SCIP and without LSP — LIP alone is sufficient.

The filter parameter accepts glob patterns (e.g. "internal/api/**") to restrict results to specific directory subtrees.

Graceful degradation

All LIP calls degrade silently. If LIP is not running:

  • Re-ranking: lexical order is preserved (0.6 × position still dominates)
  • Semantic fallback: skipped, empty results returned
  • No errors, no configuration required

Activating LIP

LIP is a separate persistent daemon. When it's running and has indexed your repo, CKB picks it up automatically — no configuration required for the code graph. Semantic embeddings require an extra env var.

Install (requires Rust/cargo):

cargo install lip-cli          # installs lip-cli v2.0 from crates.io

Start (code graph only — no embeddings required):

lip daemon --socket ~/.local/share/lip/lip.sock
lip index .                    # index the current repo (tree-sitter, no HTTP)

Enable semantic embeddings (optional — required for nearest-by-text, novelty, boundaries):

# Set before starting the daemon. Use any OpenAI-compatible endpoint.
# Ollama example (nomic-embed-text handles files up to ~8k tokens):
export LIP_EMBEDDING_URL=http://localhost:11434/v1/embeddings
export LIP_EMBEDDING_MODEL=nomic-embed-text    # matches your model name
lip daemon --socket ~/.local/share/lip/lip.sock
lip index .

Large files: models with small context windows (8k tokens) will silently skip files over that limit. You'll see N pending in ckb doctor. Use a model with 32k+ context or a truncating proxy to reach 100% coverage.

Verify:

ckb doctor
# ✓ lip: LIP daemon running — 486 files indexed; 99% embedded (483/486 files)
# ✗ lip: LIP daemon not running → cargo install lip-cli

LIP v2.0 is published as two crates. See the LIP website and documentation for full details.

| Crate | Purpose | Install |
|----------|--------------------|------------------------|
| lip-core | Library (Rust API) | add to Cargo.toml |
| lip-cli | CLI + daemon | cargo install lip-cli |

LIP v2.0 Features Integrated in CKB

Beyond basic embedding lookup, CKB v8.x uses the following LIP v2.0 functions:

| LIP Function | CKB Integration |
|------------------------|-----------------|
| NearestByTextFiltered | searchSymbols — semantic fallback + re-ranking with optional file-glob filter |
| NearestByFileFiltered | getAffectedTests — finds semantically proximate *_test.go files |
| FindBoundaries | explainFile — appends semantic_boundaries (per-region shift magnitude) |
| NoveltyScore | reviewPR — semantic-novelty check flags files with score ≥ 0.7 |
| SimilarityMatrix | getArchitecture — emits semantic_coupling matrix across modules |
| GetCentroid | getArchitecture — computes repo-level embedding centroid |
| Coverage | doctor — reports % of repo files embedded |
| StaleEmbeddings | doctor — counts files with outdated embeddings |
| ExplainMatch | Available directly via lip.ExplainMatch for custom tooling |
| BatchNearestByText | Available for batch semantic search |
| Cluster | Available for grouping semantically similar files |
| SemanticDiff | Available for comparing text blobs by embedding distance |
| NearestByContrast | Available for contrast-based retrieval (like X but not Y) |
| Outliers | Available for finding semantically isolated files |
| ExtractTerminology | Available for domain term extraction |

All calls degrade silently — CKB never errors if LIP is unavailable.


Configuration

No configuration required. Hybrid retrieval is automatic:

  • Standard tier (SCIP available): PPR over symbol graph, activates when result count > 3
  • Fast tier (LSP-only): LIP embedding re-ranking, activates when result count > 3 and LIP is running

Disabling PPR

If you need to disable PPR re-ranking (not recommended):

// .ckb/config.json
{
  "queryPolicy": {
    "enablePPR": false
  }
}

Research Basis

Hybrid retrieval is based on 2024-2025 research:

| Paper | Key Insight |
|------------------------|-------------|
| HippoRAG 2 (ICML 2025) | PPR over knowledge graphs improves associative retrieval |
| CodeRAG (Sep 2025) | Multi-path retrieval + reranking beats single-path |
| GraphCoder (Jun 2024) | Code context graphs for repo-level retrieval |
| GraphRAG surveys | Explicit organizer step improves context packing |

What's NOT Included

Per CKB's "structured over semantic" principle:

| Feature | Status | Rationale |
|----------------------------|--------------|-----------|
| Embeddings (Standard tier) | Not used | PPR over SCIP graph is sufficient and deterministic |
| Embeddings (Fast tier) | Used via LIP | Graph unavailable; LIP similarity fills the gap |
| Learned reranker | Not used | Deterministic scoring is auditable and reproducible |
| External vector DB | Not used | Violates single-binary principle; LIP is a local daemon |

Troubleshooting

Low Recall@K

  1. Index freshness - Run ckb index to rebuild
  2. FTS population - Check ckb status for FTS symbol count
  3. Query specificity - More specific queries work better

Slow Queries

  1. Graph size - Very large codebases may need graph pruning
  2. PPR iterations - Default 20 is usually sufficient
  3. Cache - Subsequent queries benefit from caching

Debugging

# Check index status
ckb status

# Run diagnostics
ckb doctor

# Verbose eval output
ckb eval --verbose
