A lightweight, OpenAI-compatible API gateway with semantic caching, RAG (Retrieval-Augmented Generation), and token-based authentication. Drop-in replacement for the OpenAI API endpoint — just point your existing client to this gateway.
- OpenAI API compatible — works with any client that supports the OpenAI Chat Completions API (`/v1/chat/completions`)
- Semantic caching — uses vector similarity search (Qdrant) to serve cached responses for semantically equivalent prompts, significantly reducing upstream API calls and latency
- RAG (Retrieval-Augmented Generation) — optional knowledge-base injection: upload document chunks via the admin API and the gateway automatically retrieves relevant context and prepends it to every prompt
- Token-based auth — `sk_xxx`-style API keys stored in Redis, with format validation (CRC32 checksum) before any network lookup
- Streaming support — full SSE streaming passthrough and cached-response streaming simulation
- Microservice architecture — each concern (embedding, cache, completion, auth, rag) is a separate gRPC service, independently deployable and scalable
- CORS ready — built-in CORS middleware for browser-based clients
To run the gateway you need:

- Docker / Podman + Compose
- An OpenAI-compatible API key and endpoint (for embedding and completion)
```bash
git clone https://github.com/lxbme/llm_gateway.git
cd llm_gateway
cp .env.example .env
# Edit .env with your API keys and endpoints

podman compose -f docker-compose.prod.yml up -d
# or
docker compose -f docker-compose.prod.yml up -d
```

Create an API token via the admin endpoint:

```bash
bash -lc 'set -a; source test/cli/.env; set +a; curl -sS -X POST http://127.0.0.1:8081/admin/create -H "X-Admin-Secret: $ADMIN_SECRET" -H "Content-Type: application/json" -d "{\"alias\":\"manual-test\"}"'
# Response: {"token":"sk_xxx","alias":"manual-test"}
```

Use the sk_xxx token returned above to call the gateway:
```bash
curl -s http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer sk_xxx" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"stream": true,
"messages": [{"role": "user", "content": "Hello! llm_gateway"}]
}'
```

A complete example `.env`:

```env
# Embedding service
EMBED_PROVIDER=openai
EMBED_API_KEY=sk-your-embedding-api-key
EMBED_ENDPOINT=https://api.openai.com/v1/embeddings
EMBED_MODEL=text-embedding-3-small
EMBED_DIMENSIONS=1536
# Cache service
CACHE_MODE=semantic
CACHE_STORE_PROVIDER=qdrant
CACHE_BUFFER_SIZE=1000
CACHE_WORKER_COUNT=5
QDRANT_SIMILARITY_THRESHOLD=0.95
# QDRANT_COLLECTION_NAME defaults to llm_semantic_cache for the cache service
# Completion service
COMPL_API_KEY=sk-your-completion-api-key
COMPL_ENDPOINT=https://api.openai.com/v1/chat/completions
# RAG service (optional — leave RAG_ADDR empty to disable RAG entirely)
# RAG_ADDR=localhost:50055
# RAG_SIMILARITY_THRESHOLD=0.6 # default 0.6
# RAG_DEFAULT_TOP_K=3 # default 3
# QDRANT_COLLECTION_NAME defaults to llm_rag_documents for the rag service
# Redis auth DB index
REDIS_DB=0
# Gateway log level: DEBUG | INFO | ERROR
LOG_LEVEL=ERROR
# Enable pprof profiling endpoint
DEBUG_MODE=false
# Admin API secret — REQUIRED to use /admin/* endpoints.
# If unset, all admin requests will be rejected with 403 Forbidden.
# Pass this value via the X-Admin-Secret request header.
ADMIN_SECRET=change-me-to-a-strong-random-secret
```

Gateway:

| Variable | Default | Description |
|---|---|---|
| `CACHE_ADDR` | `localhost:50052` | Cache service gRPC address |
| `COMPL_ADDR` | `localhost:50053` | Completion service gRPC address |
| `AUTH_ADDR` | `localhost:50054` | Auth service gRPC address |
| `RAG_ADDR` | `""` | RAG service gRPC address. Leave empty to disable RAG. |
| `LOG_LEVEL` | `ERROR` | Log verbosity: `DEBUG`, `INFO`, `ERROR` |
| `DEBUG_MODE` | `false` | Set `true` to enable `/debug/pprof/*` endpoints |
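A minimal sketch of overriding these addresses when running the gateway directly (assuming the `go run ./cmd/gateway` entry point shown in the RAG section below; the addresses are illustrative):

```bash
# Point the gateway at the backend gRPC services and raise log verbosity
CACHE_ADDR=localhost:50052 \
COMPL_ADDR=localhost:50053 \
AUTH_ADDR=localhost:50054 \
LOG_LEVEL=INFO \
go run ./cmd/gateway
```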
Embedding service:

| Variable | Default | Description |
|---|---|---|
| `EMBED_PROVIDER` | — | Required. Embedding provider name, currently `openai` |
| `EMBED_API_KEY` | — | API key for the embedding provider |
| `EMBED_ENDPOINT` | — | Required. Embedding API endpoint URL |
| `EMBED_MODEL` | — | Required. Embedding model name |
| `EMBED_DIMENSIONS` | — | Required. Embedding vector dimensions |
| `SERVE_PORT` | `50051` | gRPC listen port |
Cache service:

| Variable | Default | Description |
|---|---|---|
| `CACHE_MODE` | `semantic` | Cache mode. `semantic` requires an embedding service; `exact` is reserved for future exact-match stores |
| `CACHE_STORE_PROVIDER` | — | Required. Store provider name, currently `qdrant` |
| `CACHE_PROVIDER` | — | Backward-compatible alias for `CACHE_STORE_PROVIDER` |
| `CACHE_BUFFER_SIZE` | `1000` | Async cache write queue capacity |
| `CACHE_WORKER_COUNT` | `5` | Number of async cache workers |
| `EMBED_ADDR` | `localhost:50051` | Embedding service gRPC address |
| `QDRANT_HOST` | `localhost` | Qdrant hostname |
| `QDRANT_PORT` | `6334` | Qdrant gRPC port |
| `QDRANT_COLLECTION_NAME` | `llm_semantic_cache` | Qdrant collection used by the semantic cache |
| `QDRANT_SIMILARITY_THRESHOLD` | `0.95` | Minimum cosine similarity score required for a cache hit |
| `SERVE_PORT` | `50052` | gRPC listen port |
cache-service now selects its backend using `CACHE_MODE` + `CACHE_STORE_PROVIDER`. In the current implementation only `semantic` + `qdrant` is available. When `CACHE_MODE=semantic`, `EMBED_ADDR` must point to a reachable embedding service so the cache can generate vectors before searching or writing.
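As a concrete example, a minimal semantic-cache configuration (values mirror the defaults above):

```env
# semantic + qdrant is the only backend combination currently implemented
CACHE_MODE=semantic
CACHE_STORE_PROVIDER=qdrant
# required in semantic mode: the cache calls the embedding service to vectorize prompts
EMBED_ADDR=localhost:50051
QDRANT_HOST=localhost
QDRANT_PORT=6334
QDRANT_SIMILARITY_THRESHOLD=0.95
```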
Completion service:

| Variable | Default | Description |
|---|---|---|
| `COMPL_API_KEY` | — | API key for the completion provider |
| `COMPL_ENDPOINT` | — | Required. Chat completions API endpoint URL |
| `SERVE_PORT` | `50053` | gRPC listen port |
RAG service:

| Variable | Default | Description |
|---|---|---|
| `EMBED_ADDR` | `localhost:50051` | Embedding service gRPC address |
| `QDRANT_HOST` | `localhost` | Qdrant hostname |
| `QDRANT_PORT` | `6334` | Qdrant gRPC port |
| `QDRANT_COLLECTION_NAME` | `llm_rag_documents` | Qdrant collection for RAG chunks (separate from the cache collection) |
| `RAG_SIMILARITY_THRESHOLD` | `0.6` | Minimum cosine similarity for a chunk to be retrieved. Set to `0` to disable score filtering entirely (every chunk passing the collection filter is returned, up to `RAG_DEFAULT_TOP_K`) |
| `RAG_DEFAULT_TOP_K` | `3` | Maximum number of chunks returned per query |
| `SERVE_PORT` | `50055` | gRPC listen port |
Auth service:

| Variable | Default | Description |
|---|---|---|
| `REDIS_ADDR` | — | Required. Redis address (`host:port`) |
| `REDIS_PASSWORD` | `""` | Redis password (leave empty if none) |
| `REDIS_DB` | — | Required. Redis database index |
| `SERVE_PORT` | `50054` | gRPC listen port |
In the default Compose files, `REDIS_ADDR` points at the bundled Redis container.
The gateway ships an optional rag-service that lets you upload a knowledge base and have relevant document chunks automatically injected into every LLM prompt.
```
User request
└─ [Gateway: StageBeforeUpstream]
   ├─ auth_validate_handler          ← verify token, resolve alias
   ├─ rag_retrieve_handler           ← query rag-service for top-K chunks
   │    └─ augment PromptText with retrieved context
   ├─ cache_lookup_handler           ← operate on the augmented prompt
   └─ upstream_request_build_handler
```
The RAG handler is a soft dependency: if RAG_ADDR is not set, or if retrieval returns no results, the request proceeds normally without any modification.
Document chunks are stored in a single Qdrant collection (llm_rag_documents) with a collection payload field used as a filter. Each request selects its knowledge base via:
- `X-RAG-Collection` request header — explicit collection name (takes priority)
- Token alias — falls back to the alias of the authenticated token (per-user knowledge base)
Run the rag-service:

```bash
# environment variables
EMBED_ADDR=localhost:50051
QDRANT_HOST=localhost
QDRANT_PORT=6334
QDRANT_COLLECTION_NAME=llm_rag_documents
RAG_SIMILARITY_THRESHOLD=0.6
RAG_DEFAULT_TOP_K=3
SERVE_PORT=50055

go run ./cmd/rag
```

Then tell the gateway where to find it:

```bash
RAG_ADDR=localhost:50055 go run ./cmd/gateway
```

Next, ingest documents through the admin API.

Option A — Raw text (recommended): POST plain text or Markdown. The gateway auto-chunks it in the background and returns immediately.
```bash
curl -X POST http://localhost:8081/admin/rag/ingest/text \
-H "X-Admin-Secret: your-secret" \
-H "Content-Type: application/json" \
-d '{
"collection": "alice",
"source": "docs/product-faq.md",
"text": "# Refund Policy\n\nOur refund policy is 30 days...\n\nTo request a refund, email support@example.com.\n\nRefunds are processed within 5 business days.",
"chunk_size": 500,
"chunk_overlap": 50
}'
# Response (202 Accepted, immediate):
# {"job_id":"<uuid>","collection":"alice","status":"queued","chunk_count":3}The chunk_size (default 500) and chunk_overlap (default 50) fields are optional. Chunking is Markdown-aware: headings always start a new chunk, paragraphs are merged up to chunk_size runes, and sentence boundaries are preferred for splits. The actual embedding + vector storage happens asynchronously in the background.
Option B — Pre-chunked (advanced): Supply chunks you have already split yourself. Ingestion is synchronous and returns only after all chunks are stored.
```bash
curl -X POST http://localhost:8081/admin/rag/ingest \
-H "X-Admin-Secret: your-secret" \
-H "Content-Type: application/json" \
-d '{
"collection": "alice",
"source": "docs/product-faq.md",
"chunks": [
{"content": "Our refund policy is 30 days...", "chunk_index": 0, "total_chunks": 3},
{"content": "To request a refund, email...", "chunk_index": 1, "total_chunks": 3},
{"content": "Refunds are processed within...", "chunk_index": 2, "total_chunks": 3}
]
}'
# Response: {"doc_id":"<uuid>","ingested_count":3}curl -s http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer sk_xxx" \
-H "X-RAG-Collection: alice" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"stream": true,
"messages": [{"role": "user", "content": "What is your refund policy?"}]
}'
```

The gateway retrieves the top-3 matching chunks from the `alice` collection and prepends them to the prompt before calling the LLM.
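Omitting the `X-RAG-Collection` header makes the gateway fall back to the token's alias, so a token created with alias `alice` queries the `alice` collection automatically:

```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer sk_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is your refund policy?"}]
  }'
# No X-RAG-Collection header: the collection is resolved from the token alias
```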
To delete a document's chunks:

```bash
curl -X DELETE http://localhost:8081/admin/rag/doc \
-H "X-Admin-Secret: your-secret" \
-H "Content-Type: application/json" \
-d '{"doc_id": "<uuid>", "collection": "alice"}'
# Response: 204 No Content
```

If RAG context never seems to appear in the LLM's answers, set `LOG_LEVEL=DEBUG` on the gateway. Each request will emit:
- `RAG: resolved collection="<name>" from <source>` — the collection name the gateway will query, plus its source (`X-RAG-Collection header` or `Auth.Subject (token alias)`).
- `RAG: retrieve returned 0 chunks for collection="<name>"` — emitted when the query found nothing. The most common cause is a mismatch between the `collection` field used at ingest time and the name resolved at query time.
When the gateway has RAG configured but cannot resolve any collection (no X-RAG-Collection header and no token alias), it emits a warning and skips retrieval.
Worked example: if you ingested with collection: "mr" but the request's token alias is test_token, the gateway will look up collection test_token and find nothing. Either pass -H "X-RAG-Collection: mr" on the request, or re-ingest using the token alias as the collection name.
To verify with no score filtering at all, restart rag-service with `RAG_SIMILARITY_THRESHOLD=0`. The startup log will show `threshold=0.0000` and the Qdrant query will skip the score filter entirely — useful to confirm that the collection filter is the actual blocker.
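A quick way to run both services in this diagnostic mode (assuming the `go run` entry points shown earlier):

```bash
# rag-service with score filtering disabled
RAG_SIMILARITY_THRESHOLD=0 go run ./cmd/rag

# gateway with debug logging, so the RAG: resolved / retrieve lines are emitted
LOG_LEVEL=DEBUG RAG_ADDR=localhost:50055 go run ./cmd/gateway
```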
The admin API listens on :8081 (bound to 127.0.0.1 only).
Authentication is required. Every request must include the X-Admin-Secret header matching the ADMIN_SECRET environment variable set on the gateway. If ADMIN_SECRET is empty or the header is missing/wrong, the request is rejected with 403 Forbidden — all admin endpoints are effectively disabled until the secret is configured.
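For example, a request with a missing or wrong `X-Admin-Secret` header is rejected:

```bash
curl -i -X POST http://localhost:8081/admin/create \
  -H "Content-Type: application/json" \
  -d '{"alias": "alice"}'
# HTTP/1.1 403 Forbidden
```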
With the correct secret:

```bash
curl -s -X POST http://localhost:8081/admin/create \
-H "X-Admin-Secret: your-secret" \
-H "Content-Type: application/json" \
-d '{"alias": "alice"}'Token management
| Method | Path | Body | Description |
|---|---|---|---|
| POST | `/admin/create` | `{"alias": "name"}` | Generate a new `sk_xxx` token |
| POST | `/admin/get` | `{"token": "sk_xxx"}` | Look up a token's validity and alias |
| POST | `/admin/delete` | `{"token": "sk_xxx"}` | Revoke a token |
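For example, looking up and then revoking a token (request shapes from the table above; response bodies omitted):

```bash
# Check a token's validity and alias
curl -s -X POST http://localhost:8081/admin/get \
  -H "X-Admin-Secret: your-secret" \
  -H "Content-Type: application/json" \
  -d '{"token": "sk_xxx"}'

# Revoke it
curl -s -X POST http://localhost:8081/admin/delete \
  -H "X-Admin-Secret: your-secret" \
  -H "Content-Type: application/json" \
  -d '{"token": "sk_xxx"}'
```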
RAG knowledge-base management
| Method | Path | Body | Description |
|---|---|---|---|
| POST | `/admin/rag/ingest/text` | `{"collection","source","text","chunk_size","chunk_overlap"}` | Ingest raw text/Markdown — auto-chunked, async (202) |
| POST | `/admin/rag/ingest` | `{"collection","source","chunks":[...]}` | Ingest pre-chunked content, synchronous |
| DELETE | `/admin/rag/doc` | `{"doc_id","collection"}` | Delete all chunks of a document |
| Variable | Default | Description |
|---|---|---|
| `ADMIN_SECRET` | — | Required. Shared secret for the `X-Admin-Secret` header. Admin API is disabled when empty. |
See LICENSE.

