Intelligent Context Management for AI Development Tools
Omn1-ACE prevents wasteful API calls by finding only relevant context through semantic search and smart cachingβsaving 85% on API costs.
Current Stage: Prototype / Early Development
- β Architecture designed and documented
- β Infrastructure setup (Docker, databases)
β οΈ Core API endpoints are placeholders (not yet implemented)β οΈ Not production-readyFor production-ready microservices, see OmniMemory
The Problem: AI coding assistants send ALL potentially relevant files to expensive APIsβeven when 90% are irrelevant.
The Solution: Smart retrieval finds only what's needed BEFORE hitting paid APIs.
| Aspect | Without Omn1-ACE | With Omn1-ACE |
|---|---|---|
| Files Searched | 50+ files keyword search | 50+ files semantic search (local) |
| Files Sent to API | All 50 files | Only 3 relevant files |
| Cache Check | None (re-send everything) | L1/L2/L3 (skip 2 already sent) |
| API Tokens | 60,000 tokens | 950 tokens |
| Cost per Query | $0.90 | $0.014 |
| Monthly Cost (500 queries) | ~$450 | ~$68 |
How Savings Break Down:
| Optimization | Impact | Savings |
|---|---|---|
| Smart Retrieval | Finds 3 of 50 files | 80% ($340/mo) |
| Cache Hits | Skips 2 already sent | 13% ($55/mo) |
| Compression | Reduces remaining size | 5% ($22/mo) |
| Context Pruning | Trims conversation history | 2% ($8/mo) |
Total Savings: $382/month per developer (85% reduction)
You: "Find the authentication bug"
AI Tool:
1. Searches all files for "auth" β 50 files
2. Sends all 50 files β Anthropic API
3. You pay: 60,000 tokens ($0.90)
Result: 47 files were completely irrelevant (wasted money)
You: "Find the authentication bug"
Omn1-ACE intercepts (before API):
1. Semantic search (local, free) β Finds 3 relevant of 50 files
2. Cache check (local, free) β 2 already sent, skip them
3. Sends 1 new file β Anthropic API
4. You pay: 950 tokens ($0.014)
Result: 59,050 tokens never hit paid API = $0.886 saved
|
Prevents sending irrelevant files Find only what's relevant using three methods:
Impact: 80% cost reduction |
Prevents re-sending files Three-layer cache avoids redundant API calls:
Impact: 13% cost reduction |
Anticipates what you'll need Learns from patterns to prefetch:
Impact: Faster responses |
- Code-Aware Compression: Further reduces tokens while preserving semantic meaning (5% additional savings)
- Model-Specific Optimization: Context tailored for Claude, GPT, or Gemini
- Team Intelligence: L2 cache learns from what your teammates already sent
- LSP Integration: Enhanced code intelligence via Language Server Protocol
- Python 3.11+
- Docker & Docker Compose (recommended)
- 4GB+ RAM
Get started in 5 minutes using convenience scripts:
# Clone the repository
git clone https://github.com/mrtozner/omn1-ace.git
cd omn1-ace
# Start all services (auto-creates .env from template)
./start.sh
# Check service status
./status.sh
# View logs
./logs.sh # All services
./logs.sh postgres # Specific service
# Restart services
./restart.sh
# Stop services
./stop.shAvailable scripts:
start.sh- Start all Docker services with health checksstop.sh- Stop all servicesrestart.sh- Restart all serviceslogs.sh- View service logs (all or specific service)status.sh- Check service health and status
Manual Docker commands (if you prefer):
# Copy environment template
cp .env.example .env && nano .env
# Start services
docker-compose -f deploy/docker-compose.yml up -d
# Verify
curl http://localhost:8000/healthOmn1-ACE implements a 4-layer anticipatory system:
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β AI Development Tools β
β (Claude Code, Cursor, Continue, etc.) β
βββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ
β
βββββββββββββΌβββββββββββββ
β Interception Layer β β MCP Protocol
β (Before API call) β
βββββββββββββ¬βββββββββββββ
β
βββββββββββββΌβββββββββββββ
β Tri-Index Search β β Find 3 of 50 relevant files
β (LOCAL, <100ms, FREE) β (Dense + Sparse + Structural)
βββββββββββββ¬βββββββββββββ
β
βββββββββββββΌβββββββββββββ
β Multi-Tier Cache β β Check L1/L2/L3: Already sent?
β (LOCAL, <5ms, FREE) β Skip cached files
βββββββββββββ¬βββββββββββββ
β
βββββββββββββΌβββββββββββββ
β Send to API β β Only 1 new file (950 tokens)
β (PAID, Anthropic/ β Instead of 50 files (60K tokens)
β OpenAI) β
ββββββββββββββββββββββββββ
Result: 85% cost reduction ($0.014 vs $0.90 per query)
| Endpoint | Method | Description |
|---|---|---|
/health |
GET | Health check |
/api/v1/search |
POST | Semantic search (find relevant files, not all files) |
/api/v1/cache/check |
POST | Cache lookup (skip files already sent to API) |
/api/v1/embeddings |
POST | Generate vector embeddings for semantic search |
/api/v1/predict |
POST | Predict likely context needs (prefetching) |
/api/v1/compress |
POST | Compress context (optional secondary optimization) |
/api/v1/cache/stats |
GET | Cache performance statistics |
Interactive Docs: http://localhost:8000/docs (OpenAPI)
WITHOUT Omn1-ACE:
Files sent to API: 50 files
- auth.ts β
- auth-middleware.ts β
- auth.test.ts β
- database-config.ts β (irrelevant)
- logging-utils.ts β (irrelevant)
- email-templates.ts β (irrelevant)
- ...44 more irrelevant files β
Tokens sent: 60,000
Cost: $0.90
Waste: 47 files (78%) completely irrelevant
WITH Omn1-ACE:
Semantic search (local): Finds 3 relevant of 50
- auth.ts β (similarity: 0.94)
- auth-middleware.ts β (similarity: 0.89)
- auth.test.ts β (similarity: 0.86)
Cache check (local):
- auth.ts: In L1 cache (sent 2 queries ago) β SKIP
- auth-middleware.ts: In L2 cache (teammate sent) β SKIP
- auth.test.ts: Not cached β SEND
Files sent to API: 1 file
Tokens sent: 950 (optionally compressed)
Cost: $0.014
Savings: $0.886 (98.5%)
Different AI models have different token limits:
| Model | Context Window | Configuration |
|---|---|---|
| Claude 3.5 Sonnet | 200,000 tokens | CLAUDE_CONTEXT_WINDOW=200000 |
| GPT-4 Turbo | 128,000 tokens | GPT_CONTEXT_WINDOW=128000 |
| Gemini 1.5 Pro | 1,000,000 tokens | GEMINI_CONTEXT_WINDOW=1000000 |
| GPT-3.5 Turbo | 16,000 tokens | GPT_CONTEXT_WINDOW=16000 |
Why this matters: Even with smart retrieval, you need to ensure your target model can handle the optimized context.
Set your target model in .env:
DEFAULT_TARGET_MODEL=claude # or gpt, gemini
CLAUDE_CONTEXT_WINDOW=200000
GPT_CONTEXT_WINDOW=128000
GEMINI_CONTEXT_WINDOW=1000000Claude (Anthropic):
- β Best with structured, detailed context
- β Excellent at following complex instructions
- β‘ Prefers explicit task breakdowns
GPT (OpenAI):
- β Works well with conversational context
β οΈ May need more explicit formatting- β‘ Better with shorter, focused context
Gemini (Google):
- β Handles very large context windows
- β Good with multimodal content
β οΈ May need different prompt engineering
Recommendation: Standardize on one model per team for consistent cache sharing (L2).
| Component | Requirements |
|---|---|
| API Server | 2+ CPU cores, 4GB+ RAM |
| PostgreSQL | 4GB+ RAM, SSD storage |
| Qdrant | 8GB+ RAM (scales with corpus) |
| Redis | 2GB+ RAM (scales with cache) |
| Operation | Time | Cost |
|---|---|---|
| Semantic search | <100ms | $0 (local) |
| Cache lookup | <5ms | $0 (local) |
| Vector embedding | <50ms | $0 (local) |
| API call (prevented) | N/A | $0.90 saved |
| API call (optimized) | 1-3s | $0.014 |
- Horizontal: API servers behind load balancer
- PostgreSQL: Read replicas for read-heavy workloads
- Qdrant: Clustering for large-scale vector search
- Redis: Clustering for high-availability caching
Before production deployment:
- β
Change all default passwords in
docker-compose.yml - β Use environment variables for sensitive configuration
- β Enable TLS/SSL for all service connections
- β Configure authentication for API endpoints
- β Use network policies to restrict service access
- β Regular security updates for all dependencies
Contributions are welcome!
Before submitting a PR:
- All tests pass
- Code follows style guidelines (black, isort, pylint)
- New features include tests
- Documentation is updated
This project is licensed under the MIT License - see LICENSE for details.
- OmniMemory: Production-ready microservices (13 independent services)
- Extensions: LSP integration for enhanced code intelligence (docs)
Built with:
- FastAPI - Modern web framework
- Qdrant - Vector similarity search
- PostgreSQL - Relational database
- Redis - In-memory data store
- NetworkX - Graph analysis
β Star this repo if you find it useful!
π¬ Discussions β’ π Report Bug
Made with β€οΈ by Mert Ozoner