
Hybrid Retrieval

Lisa edited this page Apr 13, 2026 · 5 revisions

Hybrid Retrieval (v7.4 / LIP v2.0)

CKB v7.4 introduced hybrid retrieval that combines graph-based ranking with traditional text search to dramatically improve search quality. CKB v8.x extends this with LIP v2.0 semantic embeddings for deeper integration: novelty detection in PRs, semantic test discovery, file boundary analysis, and architecture coupling signals.

Overview

Traditional code search relies on text matching (FTS), which finds symbols by name but doesn't understand code relationships. Hybrid retrieval adds Personalized PageRank (PPR) over the symbol graph to boost results that are structurally related to your query.

Results

| Metric | Before | After | Improvement |
|-----------|--------|--------|-------------|
| Recall@10 | 62.1% | 100% | +61% |
| MRR | 0.546 | 0.914 | +67% |
| Latency | 29.4ms | 29.0ms | ~0% |

Note: Latency remains similar because PPR computation is cheap. The improvement is in search quality, not speed.

How It Works

1. Initial Search (FTS)

When you search for a symbol, CKB first uses SQLite FTS5 for fast text matching:

Query: "Engine"
FTS Results: Engine, Engine#logger, Engine#config, EngineMock, ...

2. Graph-Based Re-ranking (PPR)

CKB then builds a symbol graph from SCIP edges and runs Personalized PageRank:

Seeds: Top FTS hits (Engine, Engine#logger, ...)
Graph: Call edges, reference edges, type edges
PPR: Propagate importance through graph
Output: Re-ranked by graph proximity + FTS score

Seed Expansion

When FTS returns struct fields (e.g., Engine#logger), seed expansion automatically includes related methods:

FTS seeds: Engine#logger, Engine#config, Engine#db
Expanded: + Engine#SearchSymbols(), Engine#GetCallGraph(), ...

This helps PPR discover cross-module dependencies through method calls, not just field references.
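A minimal sketch of this expansion, assuming symbols use `Type#member` naming with a trailing `()` marking methods. The symbol names and the flat symbol list are illustrative; CKB's real lookup goes through its internal symbol table.

```go
package main

import (
	"fmt"
	"strings"
)

// expandSeeds adds a type's methods to the seed set whenever one of
// its fields appears as an FTS seed, so PPR can follow call edges.
func expandSeeds(seeds, allSymbols []string) []string {
	out := append([]string(nil), seeds...)
	seen := map[string]bool{}
	for _, s := range seeds {
		seen[s] = true
	}
	for _, s := range seeds {
		owner, _, ok := strings.Cut(s, "#")
		if !ok {
			continue // not a field seed, nothing to expand
		}
		for _, sym := range allSymbols {
			// methods carry "()", fields do not
			if strings.HasPrefix(sym, owner+"#") &&
				strings.HasSuffix(sym, "()") && !seen[sym] {
				seen[sym] = true
				out = append(out, sym)
			}
		}
	}
	return out
}

func main() {
	seeds := []string{"Engine#logger", "Engine#config"}
	all := []string{"Engine#SearchSymbols()", "Engine#GetCallGraph()", "Engine#db"}
	fmt.Println(expandSeeds(seeds, all))
}
```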

3. Combined Scoring

The final ranking combines FTS position with PPR scores:

combined = 0.6 * position_score + 0.4 * ppr_score

Where:

  • position_score = 1 / (rank + 1) — Original FTS ranking bonus
  • ppr_score — Normalized PPR importance from graph traversal

This simple approach achieves 100% Recall@10 on the built-in eval suite without the complexity of multi-signal fusion.
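The formula can be sketched directly. The 0.6/0.4 weights are the values stated above; note how a lower-ranked FTS hit with a strong PPR score can overtake the top lexical hit:

```go
package main

import "fmt"

// combinedScore blends the FTS position bonus 1/(rank+1) (rank is
// 0-based) with a normalized PPR score, using the 0.6/0.4 weights
// from the text.
func combinedScore(rank int, pprScore float64) float64 {
	positionScore := 1.0 / float64(rank+1)
	return 0.6*positionScore + 0.4*pprScore
}

func main() {
	// A graph-isolated top hit scores 0.6; a rank-1 hit that is
	// strongly connected to the seeds (ppr 0.9) scores 0.66 and wins.
	fmt.Printf("%.3f\n", combinedScore(0, 0.0))
	fmt.Printf("%.3f\n", combinedScore(1, 0.9))
}
```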

Eval Suite

CKB includes an evaluation framework to measure retrieval quality.

Running Eval

# Run built-in tests
ckb eval

# Custom fixtures
ckb eval --fixtures=./my-tests.json

# JSON output
ckb eval --format=json

Test Types

Needle tests - Find at least one expected symbol in top-K:

{
  "id": "find-engine",
  "type": "needle",
  "query": "Engine",
  "expectedSymbols": ["Engine", "query.Engine"],
  "topK": 10
}

Ranking tests - Verify expected symbol is highly ranked:

{
  "id": "engine-first",
  "type": "ranking",
  "query": "query engine",
  "expectedSymbols": ["Engine"],
  "topK": 3
}

Expansion tests - Check graph connectivity:

{
  "id": "engine-connects-backends",
  "type": "expansion",
  "query": "Engine",
  "expectedSymbols": ["Engine", "Orchestrator", "SCIPAdapter"],
  "topK": 20
}

Metrics

  • Recall@K - % of tests where expected symbol was in top-K
  • MRR - Mean Reciprocal Rank (higher = expected found earlier)
  • Latency - Average query time
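The per-test pieces of these metrics can be sketched as follows (averaging across all tests is omitted; result and symbol names are illustrative):

```go
package main

import "fmt"

// recallAtK reports whether any expected symbol appears in the top-K
// results, as used by needle tests.
func recallAtK(results, expected []string, k int) bool {
	if k > len(results) {
		k = len(results)
	}
	for _, r := range results[:k] {
		for _, e := range expected {
			if r == e {
				return true
			}
		}
	}
	return false
}

// reciprocalRank returns 1/(1-based position of the first expected
// symbol), or 0 if none is found; MRR is the mean over all tests.
func reciprocalRank(results, expected []string) float64 {
	for i, r := range results {
		for _, e := range expected {
			if r == e {
				return 1.0 / float64(i+1)
			}
		}
	}
	return 0
}

func main() {
	results := []string{"EngineMock", "Engine", "Engine#config"}
	fmt.Println(recallAtK(results, []string{"Engine"}, 10))
	fmt.Println(reciprocalRank(results, []string{"Engine"}))
}
```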

PPR Algorithm

Personalized PageRank computes importance scores relative to seed nodes.

Algorithm

Input:
  - seeds: FTS hit symbol IDs
  - graph: SCIP call/reference edges
  - damping: 0.85 (probability of following edge)
  - iterations: 20 (max power iterations)

Process:
  1. Initialize scores: each seed gets 1/(number of seeds), others get 0
  2. Iterate: score[i] = damping * Σ(edge_weight * score[neighbor])
                        + (1-damping) * teleport[i]
  3. Stop when converged or max iterations

Output:
  - Ranked nodes with scores
  - Backtracked paths explaining "why"
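The power iteration can be sketched as below. The three-node graph, node IDs, and edge weights are toy values; normalising each node's outgoing weight before propagating is an implementation choice not spelled out in the pseudocode above.

```go
package main

import "fmt"

type edge struct {
	to     int
	weight float64
}

// ppr runs power iteration: seeds get 1/len(seeds) teleport mass,
// each step propagates damped score along weighted out-edges and
// mixes the (1-damping) teleport term back in.
func ppr(adj [][]edge, seeds []int, damping float64, iters int) []float64 {
	n := len(adj)
	teleport := make([]float64, n)
	for _, s := range seeds {
		teleport[s] = 1.0 / float64(len(seeds))
	}
	score := append([]float64(nil), teleport...) // step 1
	for it := 0; it < iters; it++ {
		next := make([]float64, n)
		for i := range adj {
			var total float64
			for _, e := range adj[i] {
				total += e.weight
			}
			if total == 0 {
				continue // sink node: mass leaks, as in simple PPR variants
			}
			for _, e := range adj[i] {
				next[e.to] += damping * score[i] * e.weight / total
			}
		}
		for i := range next { // step 2: teleport term
			next[i] += (1 - damping) * teleport[i]
		}
		score = next
	}
	return score
}

func main() {
	// 0=Engine (seed) calls 1 (weight 1.0); 1 references 2 (weight 0.8)
	adj := [][]edge{{{1, 1.0}}, {{2, 0.8}}, nil}
	s := ppr(adj, []int{0}, 0.85, 20)
	fmt.Printf("%.3f %.3f %.3f\n", s[0], s[1], s[2])
}
```

Scores decay with graph distance from the seed, which is exactly the "proximity" signal blended into the combined score.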

Edge Weights

| Edge Type | Weight | Meaning |
|-------------|--------|--------------------------|
| Call | 1.0 | Function calls function |
| Definition | 0.9 | Reference to definition |
| Reference | 0.8 | General reference |
| Implements | 0.7 | Type implements interface |
| Type-of | 0.6 | Instance of type |
| Same-module | 0.3 | Co-located symbols |

Export Organizer

The exportForLLM tool now includes an organizer step that structures output for better LLM comprehension.

Before (v7.3)

## internal/query/
  ! engine.go
    $ Engine
    # SearchSymbols()
    # GetSymbol()
  ! symbols.go
    # rankSearchResults()

After (v7.4)

## Module Map

| Module | Symbols | Files | Key Exports |
|--------|---------|-------|-------------|
| internal/query | 150 | 12 | Engine, SearchSymbols |
| internal/backends | 80 | 8 | Orchestrator, SCIPAdapter |

## Cross-Module Connections

- internal/query → internal/backends
- internal/mcp → internal/query

## Module Details

### internal/query/

**engine.go**
  $ Engine
  # SearchSymbols() [c=12] ★★
  # GetSymbol() [c=5]

Benefits

  • Module Map - Overview of codebase structure at a glance
  • Cross-Module Bridges - Key integration points highlighted
  • Importance Ordering - Most important symbols first
  • Context Efficiency - LLMs understand structure before details

Fast Tier: LIP Embedding Re-ranking

When no SCIP index is available (LSP-only setup), PPR can't run — there's no symbol graph. CKB uses LIP v2.0 semantic similarity at two levels.

Level 1 — Re-ranking (FTS results exist)

FTS5 symbol search → lexical ranking
                              ↓
    LIP GetEmbeddingsBatch(all candidate file URIs)   ← single round-trip
                              ↓
         centroid of top-5 seed embeddings (L2-normalised)
                              ↓
  score = 0.6 × lexical_position + 0.4 × dot_product(vec, centroid)
                              ↓
                     re-sorted results

The centroid of the top lexically-ranked files acts as a "query neighbourhood" — results from semantically similar files rise, exactly as PPR lifts graph-adjacent symbols.
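A toy sketch of this centroid re-rank, using 2-D stand-ins for real LIP embeddings; the 0.6/0.4 blend matches the combined-scoring formula above, and the vectors are purely illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns the L2-normalised copy of v.
func normalize(v []float64) []float64 {
	var n float64
	for _, x := range v {
		n += x * x
	}
	n = math.Sqrt(n)
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / n
	}
	return out
}

// centroid averages the seed vectors and re-normalises, forming the
// "query neighbourhood" vector.
func centroid(vecs [][]float64) []float64 {
	c := make([]float64, len(vecs[0]))
	for _, v := range vecs {
		for i, x := range v {
			c[i] += x
		}
	}
	return normalize(c)
}

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	// candidate file embeddings, in lexical rank order; the last one
	// matched lexically but is semantically distant
	embs := [][]float64{
		normalize([]float64{1, 0}),
		normalize([]float64{0.9, 0.1}),
		normalize([]float64{0, 1}),
	}
	c := centroid(embs[:2]) // top seeds form the query neighbourhood
	for rank, e := range embs {
		score := 0.6*(1.0/float64(rank+1)) + 0.4*dot(e, c)
		fmt.Printf("rank %d → %.3f\n", rank, score)
	}
}
```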

Level 2 — Semantic fallback (FTS found nothing)

When no symbol name matches the query literally, CKB falls back to LIP's nearest-neighbour index:

LIP NearestByTextFiltered(query, top_k=20, filter="", min_score=0)
                              ↓                  ← HNSW index on all repo files
  for each matching file URI:
      resolve symbols from FTS content table
                              ↓
  deduplicated symbol results, scored by LIP similarity
                              ↓
  LIP re-ranking pass (same centroid algorithm)

This is the path for queries like "connection pool with retry" or "rate limiter token bucket", where no symbol name contains those words literally. It works without SCIP and without LSP — LIP alone is sufficient.

The filter parameter accepts glob patterns (e.g. "internal/api/**") to restrict results to specific directory subtrees.

Graceful degradation

All LIP calls degrade silently. If LIP is not running:

  • Re-ranking: lexical order is preserved (0.6 × position still dominates)
  • Semantic fallback: skipped, empty results returned
  • No errors, no configuration required

Activating LIP

LIP is a separate persistent daemon. When it's running and has indexed your repo, CKB picks it up automatically — no configuration required for the code graph. Semantic embeddings require an extra env var.

Install (requires Rust/cargo):

cargo install lip-cli          # installs lip-cli v2.0 from crates.io

Start (code graph only — no embeddings required):

lip daemon --socket ~/.local/share/lip/lip.sock
lip index .                    # index the current repo (tree-sitter, no HTTP)

Enable semantic embeddings (optional — required for nearest-by-text, novelty, boundaries):

# Set before starting the daemon. Use any OpenAI-compatible endpoint.
# Ollama example (nomic-embed-text handles files up to ~8k tokens):
export LIP_EMBEDDING_URL=http://localhost:11434/v1/embeddings
export LIP_EMBEDDING_MODEL=nomic-embed-text    # matches your model name
lip daemon --socket ~/.local/share/lip/lip.sock
lip index .

Large files: models with small context windows (8k tokens) will silently skip files over that limit. You'll see N pending in ckb doctor. Use a model with 32k+ context or a truncating proxy to reach 100% coverage.

Verify:

ckb doctor
# ✓ lip: LIP daemon running — 486 files indexed; 99% embedded (483/486 files)
# ✗ lip: LIP daemon not running → cargo install lip-cli

LIP v2.0 is published as two crates. See the LIP website and documentation for full details.

| Crate | Purpose | Install |
|----------|--------------------|------------------------|
| lip-core | Library (Rust API) | add to Cargo.toml |
| lip-cli | CLI + daemon | cargo install lip-cli |

LIP v2.0 Features Integrated in CKB

Beyond basic embedding lookup, CKB v8.x uses the following LIP v2.0 functions:

| LIP Function | CKB Integration |
|------------------------|-----------------|
| NearestByTextFiltered | searchSymbols — semantic fallback + re-ranking with optional file-glob filter |
| NearestByFileFiltered | getAffectedTests — finds semantically proximate *_test.go files |
| FindBoundaries | explainFile — appends semantic_boundaries (per-region shift magnitude) |
| NoveltyScore | reviewPR — semantic-novelty check flags files with score ≥ 0.7 |
| SimilarityMatrix | getArchitecture — emits semantic_coupling matrix across modules |
| GetCentroid | getArchitecture — computes repo-level embedding centroid |
| Coverage | doctor — reports % of repo files embedded |
| StaleEmbeddings | doctor — counts files with outdated embeddings |
| ExplainMatch | Available directly via lip.ExplainMatch for custom tooling |
| BatchNearestByText | Available for batch semantic search |
| Cluster | Available for grouping semantically similar files |
| SemanticDiff | Available for comparing text blobs by embedding distance |
| NearestByContrast | Available for contrast-based retrieval (like X but not Y) |
| Outliers | Available for finding semantically isolated files |
| ExtractTerminology | Available for domain term extraction |

All calls degrade silently — CKB never errors if LIP is unavailable.


Configuration

No configuration required. Hybrid retrieval is automatic:

  • Standard tier (SCIP available): PPR over symbol graph, activates when result count > 3
  • Fast tier (LSP-only): LIP embedding re-ranking, activates when result count > 3 and LIP is running

Disabling PPR

If you need to disable PPR re-ranking (not recommended):

// .ckb/config.json
{
  "queryPolicy": {
    "enablePPR": false
  }
}

Research Basis

Hybrid retrieval is based on 2024-2025 research:

| Paper | Key Insight |
|------------------------|-------------|
| HippoRAG 2 (ICML 2025) | PPR over knowledge graphs improves associative retrieval |
| CodeRAG (Sep 2025) | Multi-path retrieval + reranking beats single-path |
| GraphCoder (Jun 2024) | Code context graphs for repo-level retrieval |
| GraphRAG surveys | Explicit organizer step improves context packing |

What's NOT Included

Per CKB's "structured over semantic" principle:

| Feature | Status | Rationale |
|----------------------------|--------------|-----------|
| Embeddings (Standard tier) | Not used | PPR over SCIP graph is sufficient and deterministic |
| Embeddings (Fast tier) | Used via LIP | Graph unavailable; LIP similarity fills the gap |
| Learned reranker | Not used | Deterministic scoring is auditable and reproducible |
| External vector DB | Not used | Violates single-binary principle; LIP is a local daemon |

Troubleshooting

Low Recall@K

  1. Index freshness - Run ckb index to rebuild
  2. FTS population - Check ckb status for FTS symbol count
  3. Query specificity - More specific queries work better

Slow Queries

  1. Graph size - Very large codebases may need graph pruning
  2. PPR iterations - Default 20 is usually sufficient
  3. Cache - Subsequent queries benefit from caching

Debugging

# Check index status
ckb status

# Run diagnostics
ckb doctor

# Verbose eval output
ckb eval --verbose
