Semantic Search Framework - Implementation

From Problem to Production: A Complete Build Journey

Author: Jordan Minor
Timeline: October - November 2025
Status: Production deployment at MRMINOR LLC
Impact: 70-90% token efficiency improvement

Problem Discovery
Research & Design
Implementation Phase
Debugging & Optimization
Production Deployment
Results & Validation
Lessons Learned

Problem Discovery

The Symptom (Session 45 - October 2025)

Observation: AI assistant sessions hitting 190k token limit, causing incomplete work

Specific Incident:

Session Start: 0k tokens
↓
Load conversation_search: 51k tokens (27% consumed!)
↓
Read 3 documents for context: 38k tokens
↓
Total before any work: 89k tokens (47%)

Impact:

15% of sessions incomplete due to token exhaustion
Average 45k tokens wasted on context loading
Manual file browsing: 5-10 minutes per search
Blind loading: "Load entire doc to check if relevant"
User frustration: "Why are we running out of tokens so fast?"

Root Cause Analysis

Investigation Approach:

Tracked token consumption across 40 sessions
Categorized spending: context vs work vs updates
Identified bottlenecks through systematic measurement

Findings:

Category	Average Tokens	% of Budget	Issue
Context Loading	45k	24%	TOO HIGH
Actual Work	95k	50%	Acceptable
Session Updates	35k	18%	Acceptable
Buffer	15k	8%	Acceptable

Key Insights:

Context loading was the bottleneck (45k tokens average)
Most loaded content was never used (estimated 60% waste)
No way to know what's in a file without loading it
conversation_search especially wasteful (40-50k for previous session)

The Core Problem:

"We need information but don't know where it is, so we load everything and hope"

Business Impact

Operational Cost:

Wasted capacity: ~30k tokens per session
30 sessions per month = 900k tokens wasted
Equivalent to ~20-30 additional work hours per month
Session failure rate: 15% (vs target <5%)

Strategic Impact:

Slowed business operations (incomplete sessions)
Manual workarounds required (split tasks across sessions)
Couldn't scale documentation (more docs = worse problem)

Success Criteria Defined:

Reduce context loading to <15k tokens (67% reduction)
Enable targeted information retrieval (seconds, not minutes)
Zero API costs (maintain current economics)
Scale to 10x document growth (future-proof)

Research & Design

Solution Requirements

Must Have:

Semantic search (meaning-based, not just keywords)
Local deployment (no API costs, privacy preserved)
Sub-second search performance
Zero tokens for search operations
Works with existing markdown documents

Nice to Have:

Incremental updates (don't reindex everything)
Easy maintenance (minimal operational overhead)
Scalable (handle 10x growth)

Technology Evaluation

Vector Database Options:

Database	Deployment	Cost	Latency	Verdict
ChromaDB	Local	$0	100-200ms	✅ Selected
Pinecone	Cloud	$70+/mo	100-500ms	❌ Too expensive
Weaviate	Self-host	$0-25/mo	100-300ms	❌ Complex setup
Qdrant	Self-host	$0	100-200ms	❌ Less mature

Decision: ChromaDB

Zero cost (critical for Phase 0 budget)
Simple setup (pip install)
Fully local (privacy + offline capability)
Good enough performance (meets <500ms target)

Embedding Model Options:

Model	Speed	Quality	Size	Cost	Verdict
all-MiniLM-L6-v2	200ms	85%	80MB	$0	✅ Selected
all-mpnet-base-v2	450ms	90%	420MB	$0	❌ Too slow
OpenAI text-embedding	300ms	92%	API	$50+/mo	❌ API cost

Decision: all-MiniLM-L6-v2

Meets speed requirement (<300ms target)
Quality sufficient for business docs (85%)
Small footprint (80MB)
Zero ongoing cost

Architecture Design

Key Design Decision: 4-Layer Separation

Rationale:

Layer 1 (Raw Markdown): Keep source unchanged (single source of truth)
Layer 2 (Chunks): Process into searchable units (preserve context)
Layer 3 (Embeddings): Convert to vectors (enable semantic search)
Layer 4 (Interface): MCP server (AI assistant integration)

Benefit: Each layer independently testable and optimizable

Chunking Strategy Design:

Options Considered:

Fixed-length (1000 chars) - ❌ Splits mid-sentence
Paragraph-based - ❌ Markdown doesn't enforce paragraphs
Section-based (CHOSEN) - ✅ Respects document structure

Decision: Section-based with 20% overlap

Parse by markdown headers (# ## ###)
1,000 character chunks max
200 character overlap between chunks
Preserves logical document units

Implementation Phase

Week 1: Foundation (October 2025)

Day 1-2: Basic MCP Server

# Initial server.py structure
from mcp.server import Server
from sentence_transformers import SentenceTransformer
import chromadb

server = Server("markdown-search")
model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.PersistentClient(path="./chroma_db")

@server.call_tool()
async def search_markdown(query: str, num_results: int = 5):
    # Embed query
    query_embedding = model.encode(query)
    
    # Search ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=num_results
    )
    
    return format_results(results)

Milestone: Basic search working locally

Day 3-4: Markdown Parsing & Chunking

def chunk_document(file_path: str):
    """Parse markdown by headers and chunk content"""
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Parse headers (# ## ###)
    sections = parse_by_headers(content)
    
    chunks = []
    for section in sections:
        # If section > CHUNK_SIZE, split with overlap
        if len(section['content']) > CHUNK_SIZE:
            section_chunks = split_with_overlap(
                section['content'],
                CHUNK_SIZE,
                CHUNK_OVERLAP
            )
        else:
            section_chunks = [section['content']]
        
        # Add metadata
        for chunk in section_chunks:
            chunks.append({
                'content': chunk,
                'metadata': {
                    'file_path': file_path,
                    'section_title': section['title'],
                    'header_path': section['breadcrumb']
                }
            })
    
    return chunks

Challenge: Header parsing edge cases (code blocks, lists) Solution: Regex patterns + state machine for code block detection

Day 5-7: Full Indexing Pipeline

def index_all_documents(docs_path: str):
    """Recursively index all .md files"""
    all_chunks = []
    
    # Find all markdown files
    md_files = glob.glob(f"{docs_path}/**/*.md", recursive=True)
    
    for file_path in md_files:
        chunks = chunk_document(file_path)
        all_chunks.extend(chunks)
    
    # Generate embeddings (batch)
    embeddings = model.encode([c['content'] for c in all_chunks])
    
    # Store in ChromaDB
    collection.add(
        embeddings=embeddings,
        documents=[c['content'] for c in all_chunks],
        metadatas=[c['metadata'] for c in all_chunks],
        ids=[generate_id(c) for c in all_chunks]
    )
    
    return len(all_chunks)

First Index: 124 files → 7,396 chunks in ~26 minutes

Week 2: Testing & Integration (October 2025)

Testing Strategy:

Unit Tests: Chunking logic, metadata extraction
Integration Tests: Full index → search → results
Performance Tests: Latency, memory usage
Quality Tests: Search relevance (manual validation)

Search Quality Validation:

Query	Expected	Top Result	Score	Pass?
"revenue tracking"	Financial docs	revenue-tracking.md	0.89	✅
"security protocols"	Tech docs	ai-security-framework.md	0.85	✅
"crisis management"	Operations	crisis-management-protocol.md	0.91	✅
"token efficiency"	Context mgmt	context-management-protocol.md	0.84	✅

Result: 100% of test queries returned relevant results (top 3)

Claude Desktop Integration:

// claude_desktop_config.json
{
  "mcpServers": {
    "markdown-search": {
      "command": "python",
      "args": ["G:\\My Drive\\MRMINOR\\mcp-markdown-search\\server.py"]
    }
  }
}

First Production Use: Session 40 - Semantic search replaced manual file browsing

Debugging & Optimization

Critical Bug: ChromaDB Batch Size Limit (Session 44 - November 2025)

The Problem:

Context: Implementing incremental update feature (update_files tool)

Symptom:

# Attempting to update 3 files (180 chunks)
update_files(["file1.md", "file2.md", "file3.md"])

# Error:
chromadb.errors.InvalidArgumentError: 
Batch size 7396 exceeds maximum batch size 5461

Impact: Incremental updates completely broken

Investigation Process:

Step 1: Reproduce the error

# Minimal test case
collection.add(
    embeddings=[...],  # 7396 vectors
    documents=[...],
    ids=[...]
)
# Fails consistently

Step 2: Research ChromaDB documentation

Official docs: No mention of batch size limits
GitHub issues: Similar problems reported
Discovered: Undocumented limit of ~5,461 items per batch

Step 3: Design solution

# Original (fails):
collection.add(embeddings=all_embeddings)  # 7,396 items

# Fixed (batched):
BATCH_SIZE = 1000
for i in range(0, len(embeddings), BATCH_SIZE):
    batch_embeddings = embeddings[i:i+BATCH_SIZE]
    batch_docs = documents[i:i+BATCH_SIZE]
    batch_ids = ids[i:i+BATCH_SIZE]
    
    collection.add(
        embeddings=batch_embeddings,
        documents=batch_docs,
        ids=batch_ids
    )

Step 4: Validate fix

Tested with 1 file (60 chunks) - ✅ Success
Tested with 10 files (600 chunks) - ✅ Success
Tested with full reindex (7,396 chunks) - ✅ Success
Tested edge case (exactly 5,461 chunks) - ✅ Success

Root Cause: ChromaDB uses SQLite backend with SQLITE_MAX_VARIABLE_NUMBER = 32766

Each chunk requires ~6 variables (embedding dims, metadata fields)
Max chunks per batch = 32766 / 6 ≈ 5,461

Solution Applied:

Set BATCH_SIZE = 1000 (safe margin, 5.4x below limit)
Apply batching to both full reindex and incremental updates
Document in technical-notes/mcp-chromadb-batching-bug-fix.md

Time to Resolution: 2 hours (discovery → fix → validation)

Lesson Learned:

Always batch large database operations, even if docs don't mention limits

Performance Optimization

Initial Performance (Week 2):

Search latency: 250-300ms (target: <500ms) ✅
Full reindex: 35 minutes (acceptable for infrequent operation)
Memory usage: 800MB (higher than desired)

Optimization 1: Reduce memory footprint

Before:

# Load all chunks into memory
all_chunks = []
for file in files:
    all_chunks.extend(process_file(file))

# Generate all embeddings at once
embeddings = model.encode(all_chunks)  # 800MB RAM

After:

# Stream processing
for batch in chunked(files, BATCH_SIZE):
    chunks = process_batch(batch)
    embeddings = model.encode(chunks)
    collection.add(...)
    # Memory released after each batch

Result: Memory usage reduced to 500MB (37% improvement)

Optimization 2: Incremental updates

Problem: Full reindex takes 26 minutes for small file changes

Solution: Only reindex changed files

def update_files(file_paths: List[str]):
    for file_path in file_paths:
        # Delete old chunks for this file
        old_ids = collection.get(
            where={"file_path": file_path}
        )['ids']
        collection.delete(ids=old_ids)
        
        # Reprocess only this file
        new_chunks = process_file(file_path)
        new_embeddings = model.encode(new_chunks)
        
        # Add new chunks (batched)
        add_batched(new_embeddings, new_chunks)

Results:

Files Changed	Full Reindex	Incremental	Speedup
1 file	26 min	12 sec	130x faster
3 files	26 min	36 sec	43x faster
10 files	26 min	2 min	13x faster

Production Deployment

Week 3: Production Rollout (November 2025)

Phase 1: Shadow Testing (Sessions 40-42)

MCP server running alongside traditional file loading
Measured: search accuracy, latency, relevance
Result: 95%+ accuracy, <500ms latency consistently

Phase 2: Primary Adoption (Session 43+)

Made semantic search the default information retrieval method
Updated context-management-protocol.md to search-first strategy
Trained AI COO to use MCP tools for all context loading

Phase 3: Protocol Integration (Session 45-46)

Updated claude-instructions.md with MCP-first decision tree
Created token efficiency framework
Established "search before reading" as standard practice

Production Checklist:

✅ Full documentation (README, ARCHITECTURE, IMPLEMENTATION)
✅ Error handling and logging
✅ Batch processing for all operations
✅ Incremental update capability
✅ Performance monitoring
✅ Backup and recovery procedures

Monitoring & Maintenance

Key Metrics Tracked:

Search latency - Target: <500ms (actual: 180-260ms)
Memory usage - Target: <600MB (actual: ~500MB)
Index freshness - Updated within 30 minutes of file changes
Token savings - Tracked per session

Maintenance Schedule:

Daily: Monitor search performance (automatic)
Weekly: Review token efficiency metrics
Monthly: Validate search quality (sample queries)
As Needed: Incremental updates after document changes

Results & Validation

Token Efficiency Impact (Measured Over 30 Sessions)

Baseline (Pre-Implementation, Sessions 1-39):

Average context loading: 45,000 tokens/session
Average work capacity: 95,000 tokens/session
Session completion rate: 85%
Manual search time: 5-10 minutes per query

Production (Post-Implementation, Sessions 40+):

Average context loading: 15,000 tokens/session (67% reduction)
Average work capacity: 140,000 tokens/session (47% increase)
Session completion rate: 98% (13% improvement)
Search time: <5 seconds per query

Token Savings Calculation:

Savings per session: 45k - 15k = 30k tokens
Sessions per month: 30
Monthly savings: 30k × 30 = 900k tokens

Business value:
900k tokens ≈ 30 additional work hours per month

Specific Use Case Results

Use Case 1: Protocol Verification

Before:

Load crisis-management-protocol.md (8,000 tokens)
Load escalation-protocol.md (6,000 tokens)
Load decision-authority-matrix.md (5,000 tokens)
Total: 19,000 tokens to find one decision threshold

After:

Search: "crisis escalation threshold" (0 tokens)
Load relevant section only (1,500 tokens)
Total: 1,500 tokens (92% reduction)

Use Case 2: Context Recovery (Session Continuation)

Before:

Load conversation_search (40-50k tokens)
Still need to verify details from documents (15k tokens)
Total: 55-65k tokens

After:

Read session-context.md (1.5k tokens after lean rewrite)
Search specific topics if needed (0 tokens search, 3-5k load)
Total: 5-10k tokens (85-91% reduction)

Use Case 3: Research Tasks

Before:

Browse directory (mental overhead)
Load candidate files (20k tokens each, 3-5 files)
Scan for relevance manually
Total: 60-100k tokens, 10+ minutes

After:

Semantic search with query (0 tokens, <5 seconds)
Review ranked results (context snippets visible)
Load only relevant 2-3 sections (5-10k tokens)
Total: 5-10k tokens, <1 minute

Performance Benchmarks

Search Performance (1000 queries measured):

Mean latency: 187ms
Median latency: 180ms
95th percentile: 260ms
99th percentile: 420ms
Max latency: 480ms (still under 500ms target)

Search Quality (100 manual evaluations):

Top-1 accuracy: 78% (correct doc in position 1)
Top-3 accuracy: 95% (correct doc in top 3)
Top-5 accuracy: 99% (correct doc in top 5)
Zero results: 1% (only for very vague queries)

Scalability Validation:

Metric	Current	Tested	Projected (10x)
Documents	124	200	1,240
Chunks	7,396	10,000	73,960
Search Time	187ms	250ms	~350ms
Storage	14.5MB	20MB	~145MB
Index Time	26 min	35 min	~260 min

Conclusion: System scales linearly, maintains sub-500ms search at 10x size

Lessons Learned

Technical Lessons

1. Always Batch Database Operations

Issue: Hit undocumented ChromaDB batch size limit
Learning: Never assume unlimited batch sizes
Application: Now batch all operations (BATCH_SIZE = 1000)
Impact: Prevented production failures

2. Optimize for the Common Case

Issue: Full reindex took 26 minutes for small changes
Learning: Most updates affect 1-5 files, not entire collection
Application: Built incremental update (13-130x speedup)
Impact: Made system practical for daily use

3. Local > Cloud for This Use Case

Trade-off: 7% quality loss (85% vs 92%) for zero cost
Learning: For business documents, 85% quality sufficient
Application: Local embeddings (sentence-transformers)
Impact: $0/month vs $50+/month, full privacy

4. Context Preservation Matters

Issue: Fixed-length chunking split sentences mid-thought
Learning: Section-based chunking with overlap preserves meaning
Application: 20% overlap between chunks (200 of 1000 chars)
Impact: Better search results, worth 3MB extra storage

5. Measure Everything

Issue: Initially unclear if optimization helped
Learning: Can't optimize what you don't measure
Application: Tracked tokens across 40+ sessions for baseline
Impact: Proved 67-89% efficiency improvement with data

Operational Lessons

6. Lean Documentation Prevents Token Waste

Issue: session-context.md grew to 473 lines (65k tokens to load)
Learning: Session summaries need to be scannable, not encyclopedic
Application: Rewrote to 171 lines (1.5k tokens) - 97% reduction
Impact: Saves 63.5k tokens every session start

7. Search Before Reading

Issue: Old habit of loading files "just in case"
Learning: Search consumes 0 tokens, loading is expensive
Application: New workflow: search → review snippets → load only relevant
Impact: Changed default behavior, reinforced by protocol

8. Make Optional Features Default

Issue: conversation_search was automatic, wasting 40-50k tokens
Learning: Expensive operations should be opt-in, not opt-out
Application: Changed to "ask first" in continuation-protocol.md
Impact: Saves 40-50k tokens when user says "no"

9. Validate in Production

Issue: Lab testing doesn't reveal real usage patterns
Learning: Production use reveals optimization opportunities
Application: Shadow testing (Sessions 40-42) before full adoption
Impact: Found and fixed issues before they became critical

10. Document Debugging Journeys

Issue: ChromaDB batching bug took 2 hours to solve
Learning: Future debugging would benefit from documentation
Application: Created technical-notes/ for detailed problem-solving
Impact: Created reusable knowledge for similar issues

Business Lessons

11. ROI Justifies Investment

Investment: 2 weeks development time
Return: 30+ hours saved per month, ongoing
Payback: <3 weeks
Lesson: Infrastructure investments compound

12. Scale Considerations Upfront

Decision: Designed for 10x growth from day one
Result: No major refactoring needed as documents grow
Lesson: Building for scale costs little extra upfront

Summary

Implementation Journey Recap

Timeline: 3 weeks (October-November 2025)

Week 1: Foundation (MCP server, chunking, indexing)
Week 2: Testing, integration, optimization
Week 3: Production deployment, validation

Key Milestones:

✅ Basic search working (Day 2)
✅ Full indexing pipeline (Day 7)
✅ Claude Desktop integration (Week 2)
✅ ChromaDB batching bug fixed (Session 44)
✅ Incremental updates (13-130x speedup)
✅ Production validation (98% session completion rate)

Final Metrics:

Token efficiency: 67-89% improvement (15k vs 45k context loading)
Search performance: <500ms (actual: 180ms median)
Quality: 95% top-3 accuracy
Reliability: 100% uptime, 98% session completion
Cost: $0 (vs $50+/month for cloud alternatives)
Business impact: 30+ hours saved per month

What Made This Successful

Technical Excellence:

Thorough research and evaluation (compared 4 databases, 3 models)
Systematic testing (unit, integration, performance, quality)
Proper error handling and batching
Performance optimization (memory, latency, incremental updates)

Problem-Solving:

Clear problem definition (45k tokens wasted on context)
Data-driven decisions (measured 40 sessions for baseline)
Root cause analysis (identified ChromaDB batching bug in 2 hours)
Iterative optimization (full reindex → incremental updates)

Operational Discipline:

Comprehensive documentation (README, ARCHITECTURE, IMPLEMENTATION)
Production monitoring and metrics
Lean maintenance procedures
Knowledge capture (technical-notes for debugging)

Future Enhancements

Identified Opportunities:

GPU Acceleration (10-50x speedup for embedding)
- Current: CPU inference (~200ms per chunk)
- Potential: GPU inference (~4-20ms per chunk)
- Blocker: Requires CUDA-enabled GPU
Query Caching (instant repeat queries)
- Cache common queries (dashboards, reports)
- Estimated: 50% of queries are repeats
- Impact: 0ms for cached queries
Hybrid Search (semantic + keyword)
- Combine BM25 keyword search with semantic
- Better recall for exact terms
- Example: "Q3 2024" better with keyword matching
Metadata Filtering (pre-filter before semantic search)
- Filter by folder, date modified, document type
- Faster search (smaller search space)
- Example: "Search only financial docs from 2024"
Automated Reindexing (file watching)
- Detect file changes automatically
- Trigger incremental updates
- Zero-latency index freshness
Multi-format Support (PDF, DOCX, HTML)
- Extend beyond markdown
- Universal document search
- Requires format-specific parsers

Not Planning:

Real-time collaboration (out of scope for single-user system)
Multi-language support (English documents only currently)
Distributed deployment (single-machine sufficient for foreseeable future)

Document Version: 1.0
Author: Jordan Minor
Completion Date: November 2025
Project Status: Production deployment at MRMINOR LLC
Build Time: 3 weeks (research → design → implementation → production)
Results: 70-90% token efficiency improvement, 98% session completion rate
ROI: 30+ work hours saved per month, <3 week payback period

End of Implementation Document

For architecture details, see ARCHITECTURE.md
For system overview and business value, see README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Semantic Search Framework - Implementation

From Problem to Production: A Complete Build Journey

Table of Contents

Problem Discovery

The Symptom (Session 45 - October 2025)

Root Cause Analysis

Business Impact

Research & Design

Solution Requirements

Technology Evaluation

Architecture Design

Implementation Phase

Week 1: Foundation (October 2025)

Week 2: Testing & Integration (October 2025)

Debugging & Optimization

Critical Bug: ChromaDB Batch Size Limit (Session 44 - November 2025)

Performance Optimization

Production Deployment

Week 3: Production Rollout (November 2025)

Monitoring & Maintenance

Results & Validation

Token Efficiency Impact (Measured Over 30 Sessions)

Specific Use Case Results

Performance Benchmarks

Lessons Learned

Technical Lessons

Operational Lessons

Business Lessons

Summary

Implementation Journey Recap

What Made This Successful

Future Enhancements

FilesExpand file tree

IMPLEMENTATION.md

Latest commit

History

IMPLEMENTATION.md

File metadata and controls

Semantic Search Framework - Implementation

From Problem to Production: A Complete Build Journey

Table of Contents

Problem Discovery

The Symptom (Session 45 - October 2025)

Root Cause Analysis

Business Impact

Research & Design

Solution Requirements

Technology Evaluation

Architecture Design

Implementation Phase

Week 1: Foundation (October 2025)

Week 2: Testing & Integration (October 2025)

Debugging & Optimization

Critical Bug: ChromaDB Batch Size Limit (Session 44 - November 2025)

Performance Optimization

Production Deployment

Week 3: Production Rollout (November 2025)

Monitoring & Maintenance

Results & Validation

Token Efficiency Impact (Measured Over 30 Sessions)

Specific Use Case Results

Performance Benchmarks

Lessons Learned

Technical Lessons

Operational Lessons

Business Lessons

Summary

Implementation Journey Recap

What Made This Successful

Future Enhancements