Moved out of the top-level README to keep it scannable; this is the full runtime-configuration reference.
Copy `.env.example` to `.env`, then adjust only what your deployment needs.

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Base URL for explicit local-first Ollama mode or GraceKelly fallback |
| `OLLAMA_MODEL_NAME` | `qwen2.5:7b` | Primary Ollama model when `LLM_PROVIDER_PROFILE=local-first` |
| `OLLAMA_FAST_MODEL_NAME` | `llama3.2:3b` | Faster Ollama model for explicit local helper/tool flows |
| `MODEL_ROUTING_ENABLED` | `false` | Enable simple/complex/global model routing |
| `OLLAMA_REQUEST_TIMEOUT_SEC` | `60` | Timeout for a single Ollama HTTP request |
| `REQUIRE_OLLAMA` | `false` | Fail fast at startup if explicit Ollama mode/fallback validation requires Ollama |
| `LANGFUSE_PUBLIC_KEY` | - | Optional Langfuse public key |
| `LANGFUSE_SECRET_KEY` | - | Optional Langfuse secret key |
| `LANGFUSE_HOST` | `https://cloud.langfuse.com` | Langfuse host for LLM observability |

| Variable | Default | Description |
|---|---|---|
| `PROVIDER_REGISTRY_PATH` | `config/providers.yml` | YAML registry with providers, pricing, capabilities, and routing profiles |
| `LLM_PROVIDER_PROFILE` | `gracekelly-primary` | Active routing profile; defaults to the local GraceKelly orchestrator |
| `LLM_BENCHMARK_ALLOW_PAID_APIS` | `false` | Backward-compatible flag that allows live external-provider calls in provider benchmarks |
| `DAILY_COST_LIMIT_USD` | `5.0` | Fail fast when tracked direct-provider spend for the current UTC day reaches this limit |
| `MISTRAL_API_KEY` | `changeme` | Direct Mistral API key; placeholder values are treated as missing |
| `GRACEKELLY_BASE_URL` | `http://127.0.0.1:8011` | Base URL for the local GraceKelly orchestrator |
| `GRACEKELLY_API_KEY` | - | Optional GraceKelly bearer token for non-public endpoints |
| `GRACEKELLY_API_KEY_ENV` | `GRACEKELLY_API_KEY` | Env var name used by the runtime to look up the optional GraceKelly API key |
| `GRACEKELLY_HEALTH_CHECK_TIMEOUT_SEC` | `2.0` | Readiness-probe timeout before GraceKelly is considered unavailable |
| `GRACEKELLY_REQUEST_TIMEOUT_SEC` | `30.0` | Timeout for a single GraceKelly `/api/v1/smart` call |
| `FAILOVER_CHAIN_ENABLED` | `true` | Enable GraceKelly -> Ollama automatic failover for profiles that declare a local fallback |
| `FAILOVER_FALLBACK_CACHE_SECONDS` | `300` | Cache a successful local fallback decision for this many seconds |

| Variable | Default | Description |
|---|---|---|
| `RAG_EMBEDDING_MODEL` | `BAAI/bge-m3` | Embedding model used for documents and queries (local backend) |
| `RAG_EMBEDDING_BACKEND` | `local` | `local` = SentenceTransformer on `RAG_DEVICE`; `remote` = OpenAI/Mistral-compatible embeddings API. Remote frees ingest/search from loading the heavy local model (e.g. unblocks Windows under the 1-GiB/process rule). Remote vectors are L2-normalized to match the local path |
| `RAG_EMBEDDING_REMOTE_URL` | `https://api.mistral.ai/v1/embeddings` | Remote embeddings endpoint (OpenAI-compatible `{model, input:[...]}`) |
| `RAG_EMBEDDING_REMOTE_MODEL` | `mistral-embed` | Remote embedding model name |
| `RAG_EMBEDDING_REMOTE_API_KEY_ENV` | `MISTRAL_API_KEY` | Name of the env var holding the remote API key (the key itself is never stored in settings/logs) |
| `RAG_EMBEDDING_REMOTE_BATCH` | `32` | Inputs per remote embeddings request |
| `RAG_EMBEDDING_REMOTE_TIMEOUT_SEC` | `60` | Timeout for a single remote embeddings request |
| `RAG_RERANKER_MODEL` | `BAAI/bge-reranker-v2-m3` | Multilingual cross-encoder reranker (pairs with BGE-M3) |
| `RAG_HYBRID_SEARCH` | `true` | Combine BM25 with vector retrieval |
| `RAG_RETRIEVAL_STRATEGY` | `hybrid` | Retrieval strategy: `vector`, `hybrid`, `graph`, or `factcard`; `graph` and `factcard` fall back to `hybrid` when their store is absent. `factcard` (opt-in) serves whole fact-cards for enumeration queries (fields/documents/conditions) and closes the customs-clearance-fields recall gap; build the collection with `scripts/build_factcards.py`. Auto-routing into `factcard` is intentionally NOT default (NO-SHIP pending Phase-5 offline-delta; see `docs/operations/2026-06-14-adaptive-retrieval-closure.md`) |
| `RAG_RETRIEVAL_TOP_K` | `20` | Candidate documents fetched before reranking |
| `RAG_RERANK_TOP_K` | `5` | Final document count after reranking |
| `RRF_K` | `60` | Reciprocal Rank Fusion smoothing constant |
| `RRF_DOC_KEY_CHARS` | `200` | Prefix length used to deduplicate RRF document keys |
| `QUALITY_THRESHOLD` | `80` | Default quality threshold used by routing/evaluation logic |
| `CHUNK_SIZE` | `800` | Default chunk size for ingestion |
| `CHUNK_OVERLAP` | `200` | Default chunk overlap for ingestion |
| `API_DEFAULT_PAGE_SIZE` | `50` | Default page size for list-style admin endpoints |
| `RAG_SEMANTIC_CHUNKING` | `true` | Enable semantic chunking |
| `RAG_CONTEXTUAL_HEADERS` | `true` | Prepend contextual headers during ingestion. Cheap by default (`build_vector_store` derives headers from chunk metadata, with no LLM or network call). The LLM-generated variant runs only when `INGESTION_BATCH_ENABLED=true`, and then per document, not per chunk |
| `INGESTION_CONTEXTUAL_CONCURRENCY` | `4` | Bounded concurrency for the LLM contextual-header fallback (providers without a native batch API). `1` = strictly serial. Caps in-flight requests so a full-corpus ingest cannot fan out unbounded provider calls. Progress is logged as `[contextual_headers] i/N` |
| `RAG_AGENTIC_MODE` | `false` | Enable the tool-calling agent graph |
| `RAG_HYDE` | `false` | Enable Hypothetical Document Embeddings |
| `RAG_PARENT_CHILD` | `false` | Enable parent-child chunking |
| `RAG_STRUCTURAL_CHUNKING` | `true` | Split markdown by headers (sections), cap to `CHUNK_SIZE` |
| `RAG_PARENT_EXPANSION` | `true` | Post-rerank: supplement final chunks with neighbouring sections of their source |
| `RAG_PARENT_EXPANSION_WINDOW` | `2` | Sections taken from each side of a selected chunk |
| `RAG_PARENT_EXPANSION_MAX_CHARS` | `3600` | Cap on expanded chunk text (core + neighbours) |
| `RAG_GRAPH_RETRIEVAL` | `off` | Graph-lane activation gate: `off`/`on`/`auto`; condition evaluated and logged at ingestion (lane itself = Phase 2, not built) |
| `RAG_GRAPH_MIN_CHUNKS` | `20000` | `auto`: minimum chunk count to consider the graph lane |
| `RAG_GRAPH_MIN_CROSSDOC_SHARE` | `0.15` | `auto`: minimum cross-doc entity share (connectivity gate) |
| `RAG_GRAPH_CROSSDOC_SHARE` | unset | Measured probe value (`scripts/graph_probe.py`; 2026-06-06 corpus: 0.296, gate passed); unset = probe not run, `auto` stays off |
| `RAG_ASK_BUDGET_SEC` | `0` | Optional wall-clock budget for a single `ConversationSession.ask()` outside the HTTP path (which already has `request_timeout_sec`). `0` = off (blocking). When >0 and exceeded, `ask()` returns a graceful degraded result (`route="timeout"`) instead of hanging on a flapping provider; the background run is not cancellable |
| `RAG_SELF_RAG_MAX_ITER` | `2` | Maximum Self-RAG iterations |
| `RAG_SELF_RAG_MIN_QUALITY` | `70` | Minimum quality score to avoid retry/escalation |
| `STREAMING_QUALITY_EVAL` | `true` | Streaming `/api/ask/stream` runs one cheap Self-RAG self-eval so streamed answers are quality-routed on par with non-streaming; set `false` to roll back to the legacy synthetic-score streaming path |
| `FACT_VERIFICATION_ENABLED` | `true` | Run fact verification after generation |
| `FACT_VERIFICATION_MIN_SCORE` | `70` | Minimum factuality score threshold |
| `FACT_VERIFY_CONTEXT_MAX_DOCS` | `5` | Max retrieved docs used as evidence when verifying answer facts |
| `FACT_VERIFY_CONTEXT_CHARS_PER_DOC` | `3600` | Chars per doc used as fact-verification evidence; aligned with `RAG_PARENT_EXPANSION_MAX_CHARS` so verification sees full parent-expanded chunks |
| `SLOW_TRACE_THRESHOLD_MS` | `10000` | Trace-duration threshold for review queue collection |
| `THRESHOLD_ANALYSIS_MIN_LABELS` | `20` | Minimum labeled traces required before suggesting a new threshold |
| `REVIEW_QUEUE_ENABLED` | `true` | Enable review queue builder and admin endpoints |
| `ONLINE_EVALUATORS_ENABLED` | `true` | Enable lightweight per-trace online evaluators, persistence, and admin views. When persistence fails (e.g. Postgres unreachable in a standalone graph run), the first failure logs at WARNING and identical repeats drop to DEBUG (one signal per process, not one per request); answers are unaffected |
| `ONLINE_EVALUATORS_TIMEOUT_SEC` | `1.0` | Per-trace online-evaluator wall-clock budget; runs that exceed it are dropped and counted in `rag_online_evaluators_dropped_total{reason}` |
| `REGRESSION_GATE_MAX_REGRESSIONS` | `2` | Maximum allowed curated regressions before the gate fails |
| `REGRESSION_GATE_MIN_PASS_RATE` | `0.85` | Minimum candidate pass rate required by the regression gate |
| `RAG_VECTOR_BACKEND` | `chroma` | Vector store backend |
| `VECTORDB_COLLECTION_PREFIX` | `rag_docs` | Chroma collection prefix; full name is `{prefix}_{tenant_id}` |
| `CATEGORIES_CONFIG_PATH` | `config/categories.yml` | Taxonomy file for upload auto-categorization |
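To make `RRF_K` and `RRF_DOC_KEY_CHARS` concrete, here is a minimal sketch of Reciprocal Rank Fusion, where each ranking contributes `1 / (k + rank)` to a document's fused score. The function name `rrf_fuse` and the idea of keying documents by a text prefix are illustrative assumptions based on the variable descriptions; the project's actual fusion code may differ.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, key_chars=200):
    """Fuse ranked lists of document texts with Reciprocal Rank Fusion.

    Hypothetical helper: documents are keyed by their first `key_chars`
    characters, mirroring the RRF_DOC_KEY_CHARS description.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        seen = set()
        for rank, text in enumerate(ranking, start=1):
            key = text[:key_chars]
            if key in seen:  # count each document once per ranking
                continue
            seen.add(key)
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by key for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy rankings standing in for BM25 and vector retrieval results
bm25 = ["doc A", "doc B", "doc C"]
vector = ["doc B", "doc A", "doc D"]
fused = rrf_fuse([bm25, vector])
```

A larger `k` flattens the difference between top and lower ranks, so the default of `60` gives documents that appear in both lists a clear edge over documents that rank first in only one.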

Resilience layers apply in this order: timeout -> retry -> circuit breaker -> bounded concurrency -> request wall-time.

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_RETRY_MAX_ATTEMPTS` | `3` | Retry attempts including the first call; `1` disables retries |
| `OLLAMA_RETRY_BASE_DELAY_SEC` | `0.5` | Base retry delay |
| `OLLAMA_RETRY_MAX_DELAY_SEC` | `5.0` | Maximum retry delay |
| `OLLAMA_RETRY_JITTER` | `true` | Apply jitter to retry delays |
| `CIRCUIT_BREAKER_ENABLED` | `true` | Enable circuit-breaker protection for Ollama |
| `CIRCUIT_BREAKER_FAILURE_THRESHOLD` | `5` | Consecutive failures before the breaker opens |
| `CIRCUIT_BREAKER_RESET_TIMEOUT_SEC` | `30` | Delay before half-open probing |
| `REQUEST_TIMEOUT_SEC` | `30` | Wall-time limit for one `/api/ask` request |
| `STREAMING_TIMEOUT_SEC` | `120` | Wall-clock budget for the SSE token loop in `/api/ask/stream` (separate from `REQUEST_TIMEOUT_SEC`) |
| `DB_PERSIST_TIMEOUT_SEC` | `2.0` | Timeout for persisting one conversation message to Postgres before the write is dropped and counted in `rag_message_persist_failures_total{operation}` |
| `MAX_CONCURRENT_PIPELINES` | `8` | Maximum concurrent `/api/ask` pipelines |
| `PIPELINE_ACQUIRE_TIMEOUT_SEC` | `0.5` | How long to wait for a pipeline slot before returning 503 |
| `SESSION_TTL_SECONDS` | `7200` | Session idle timeout in seconds |
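The `OLLAMA_RETRY_*` settings can be read as a capped backoff schedule. The sketch below assumes exponential growth from the base delay and "full jitter" (a uniform random fraction of the capped delay); this reference does not state the exact formula, so treat both choices and the `retry_delays` helper as hypothetical.

```python
import random


def retry_delays(max_attempts=3, base_delay=0.5, max_delay=5.0,
                 jitter=True, rng=random.random):
    """Return the sleep (seconds) before each retry.

    Assumed schedule: base_delay * 2**n, capped at max_delay.
    """
    delays = []
    # max_attempts counts the first call, so there are max_attempts - 1 retries
    for retry_index in range(max(max_attempts - 1, 0)):
        delay = min(base_delay * (2 ** retry_index), max_delay)
        if jitter:
            delay *= rng()  # full jitter: uniform in [0, delay)
        delays.append(delay)
    return delays


no_jitter = retry_delays(jitter=False)                # defaults from the table
single_attempt = retry_delays(max_attempts=1)         # retries disabled
capped = retry_delays(max_attempts=6, jitter=False)   # shows the max-delay cap
jittered = retry_delays(max_attempts=4)
```

With the defaults there are two retries after the first call, and setting `OLLAMA_RETRY_MAX_ATTEMPTS=1` yields an empty schedule, which matches the "1 disables retries" note in the table.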

| Variable | Default | Description |
|---|---|---|
| `API_KEY` | - | Legacy `X-API-Key` protection for API endpoints; JWT is preferred |
| `ADMIN_USERNAME` | `admin` | Username for `/api/auth/login` |
| `ADMIN_PASSWORD_HASH` | - | Bcrypt password hash; if empty, dev mode accepts `admin`/`admin` |
| `JWT_SECRET` | `dev-secret-change-in-production!` | Secret for access/refresh tokens |
| `JWT_ACCESS_TTL` | `3600` | Access-token TTL in seconds |
| `JWT_REFRESH_TTL` | `604800` | Refresh-token TTL in seconds |
| `SESSION_SECRET_KEY` | `JWT_SECRET` fallback | Secret used by SessionMiddleware and OIDC state cookies |
| `GOOGLE_OIDC_CLIENT_ID` | - | Google OIDC client ID |
| `GOOGLE_OIDC_CLIENT_SECRET` | - | Google OIDC client secret |
| `AZURE_OIDC_TENANT` | - | Azure AD tenant used for issuer discovery |
| `AZURE_OIDC_CLIENT_ID` | - | Azure AD OIDC client ID |
| `AZURE_OIDC_CLIENT_SECRET` | - | Azure AD OIDC client secret |
| `TENANT_EMAIL_DOMAINS` | `""` | Domain-to-tenant mapping, for example `acme.com:tenant-acme,beta.io:tenant-beta` |
| `RAG_ENV` | `development` | `development`, `staging`, or `production` |
| `CORS_ORIGINS` | `*` | Comma-separated allowed origins; `*` is forbidden in production |
| `CORS_MAX_AGE_SEC` | `600` | Preflight cache TTL |
| `MAX_REQUEST_BODY_BYTES` | `1048576` | 1 MiB request-body limit for non-upload endpoints |
| `MAX_UPLOAD_BYTES` | `52428800` | 50 MiB upload limit for `/api/upload` |
| `ALLOW_ANONYMOUS_ADMIN` | - | Opt-in escape hatch when `API_KEY` is empty: set to `1`/`true` to permit anonymous admin (otherwise endpoints return HTTP 503). Local-dev only. Added 2026-04-26 audit. |
| `HOST` | `127.0.0.1` (bare run) | Used only when launching via `python main.py`. Default Docker Compose is local-dev only and binds host ports to `127.0.0.1`. |
| `PORT` | `8000` | Same as `HOST`: bare run only. |
| `UVICORN_RELOAD` | `false` | `python main.py` only: enable uvicorn auto-reload. Default off is headless-safe: auto-reload restarts the API on any write under `data/`/`demo/`, which flaps headless ingest/eval runs. Set `true` for the local dev loop. |
| `AUTO_MIGRATE` | `true` | Run `alembic upgrade head` in startup lifespan. In production, errors abort startup unless `AUTO_MIGRATE_FAIL_OPEN=true` is explicitly set. |
| `AUTO_MIGRATE_FAIL_OPEN` | `false` | Production escape hatch for temporarily logging migration failures instead of aborting startup. |

| Variable | Default | Description |
|---|---|---|
| `POSTGRES_PASSWORD` | `rag_dev_password` | Local Compose password for the Postgres container |
| `DATABASE_URL` | `postgresql://rag:rag_dev_password@localhost:5432/rag_assistant` | Postgres DSN for sessions, audit, analytics, and copilot data |
| `DB_ENCRYPTION_KEY` | dev fallback | Key used by pgcrypto; required in production and for migration 008 |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis cache URL |
| `LLM_CACHE_ENABLED` | `false` | Enable tenant-scoped response caching for `/api/ask` |
| `LLM_CACHE_TTL_SECONDS` | `3600` | TTL for cached LLM responses |
| `OTEL_ENABLED` | `false` | Enable OpenTelemetry SDK + instrumentation |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint for Jaeger/Tempo/collectors |
| `OTEL_SERVICE_NAME` | `rag-support-assistant` | `service.name` resource attribute |
| `LLM_INPUT_PRICE_PER_1M_TOKENS` | `0.0` | Fallback input-token price when a model is not listed in the provider registry |
| `LLM_OUTPUT_PRICE_PER_1M_TOKENS` | `0.0` | Fallback output-token price when a model is not listed in the provider registry |
| `LLM_MODEL_PRICES` | - | Optional JSON override for legacy analytics or unregistered models |
| `LLM_COST_PER_1M_TOKENS` | legacy fallback | Backward-compatible legacy pricing format kept for old local setups |
| `TRACE_RETENTION_DAYS` | `90` | Retention window for SQLite traces; `0` disables purge |
| `TRACE_PURGE_INTERVAL_SEC` | `86400` | Background trace-purge interval |
| `AUDIT_RETENTION_DAYS` | `180` | Retention window for `audit_log` |
| `AUDIT_PURGE_INTERVAL_SEC` | `86400` | Background audit purge interval |
| `SHUTDOWN_READY_DELAY_SEC` | `5` | Drain delay between readiness flip and shutdown |
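The two per-1M-token fallback prices imply simple arithmetic for a single call. The sketch below is an assumption about how such a fallback could be applied; `fallback_cost_usd` is a hypothetical helper, and the prices in the example are made-up illustrative numbers, not real provider pricing.

```python
def fallback_cost_usd(prompt_tokens, completion_tokens,
                      input_price_per_1m=0.0, output_price_per_1m=0.0):
    """Cost of one call from per-1M-token prices (defaults mirror the table)."""
    total = (prompt_tokens * input_price_per_1m
             + completion_tokens * output_price_per_1m)
    return total / 1_000_000


# Illustrative prices only: 0.2 USD per 1M input tokens, 0.6 USD per 1M output tokens
example = fallback_cost_usd(12_000, 3_000,
                            input_price_per_1m=0.2, output_price_per_1m=0.6)
free_by_default = fallback_cost_usd(12_000, 3_000)
```

With both defaults at `0.0`, an unregistered model is tracked at zero cost until a price is configured.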

| Variable | Default | Description |
|---|---|---|
| `SUPPORT_SINK_BACKEND` | `local` | Escalation backend: `local` or `bitrix` |
| `BITRIX_WEBHOOK_URL` | - | Bitrix24 webhook URL |
| `TELEGRAM_BOT_TOKEN` | - | Optional Telegram bot token |
| `ALERT_WEBHOOK_URL` | - | Webhook used by `scripts/check_alerts.py` |
| `ALERT_ESCALATION_PCT` | `35` | Escalation-rate alert threshold over 24h |
| `ALERT_QUALITY_MIN` | `65` | Minimum 7-day average quality |
| `ALERT_LOW_QUALITY_PCT` | `30` | Threshold for low-quality answer share |
| `ALERT_P95_LATENCY_SEC` | `12` | 24h p95 latency alert threshold |
| `ALERT_THUMBS_DOWN_PCT` | `20` | 7-day thumbs-down threshold |
| `ALERT_THUMBS_DOWN_MIN_N` | `50` | Minimum feedback volume before thumbs-down alerts trigger |
| `REPORT_SLACK_WEBHOOK` | - | Slack webhook for weekly reports |
| `REPORT_EMAIL_RECIPIENTS` | `""` | Comma-separated email list for weekly reports |
| `REPORT_SMTP_HOST` | `SMTP_HOST` fallback | SMTP host override for weekly reports |
| `REPORT_SMTP_PORT` | `SMTP_PORT` fallback or `587` | SMTP port override for weekly reports |
| `REPORT_SMTP_USER` | `SMTP_USER` fallback | SMTP user override for weekly reports |
| `REPORT_SMTP_PASS` | `SMTP_PASS` fallback | SMTP password override for weekly reports |
| `BACKLOG_WEIGHT_REVIEW_BAD` | `3.0` | Impact weight for confirmed-bad review backlog items |
| `BACKLOG_WEIGHT_THUMBS_DOWN` | `2.0` | Impact weight for thumbs-down backlog items |
| `BACKLOG_WEIGHT_SLOW` | `1.5` | Impact weight for slow-endpoint backlog items |
| `BACKLOG_WEIGHT_FRESHNESS` | `1.0` | Impact weight for stale-document backlog items |
| `BACKLOG_WEIGHT_EVALUATOR_DRIFT` | `2.5` | Impact weight for evaluator drift backlog items |
| `BACKLOG_MAX_ITEMS` | `30` | Maximum number of improvement backlog items kept after ranking |
| `BACKLOG_FRESHNESS_MAX_DAYS` | `90` | Freshness cutoff for stale-doc backlog items |
| `BACKLOG_EMAIL_ENABLED` | `false` | Email the generated backlog to `TENANT_ADMIN_EMAIL` after each run |
| `TENANT_ADMIN_EMAIL` | `""` | Optional recipient for backlog email delivery |
| `EMAIL_CHANNEL_MODE` | `disabled` | Email channel mode: `disabled`, `imap`, or `webhook` |
| `IMAP_HOST` | `""` | IMAP server hostname |
| `IMAP_PORT` | `993` | IMAP server port |
| `IMAP_USER` | `""` | IMAP username |
| `IMAP_PASS` | - | IMAP password (`IMAP_PASSWORD` is also accepted) |
| `IMAP_FOLDER` | `INBOX` | IMAP folder polled by `scripts/email_poller.py` |
| `IMAP_POLL_INTERVAL_SEC` | `60` | Delay between IMAP polling cycles |
| `SMTP_HOST` | `""` | SMTP hostname for email replies |
| `SMTP_PORT` | `587` | SMTP port for email replies |
| `SMTP_USER` | `""` | SMTP username |
| `SMTP_PASS` | - | SMTP password (`SMTP_PASSWORD` is also accepted) |
| `SMTP_FROM_ADDRESS` | `support@example.com` | Default sender address for outbound replies |
| `EMAIL_WEBHOOK_SIGNING_SECRET` | - | Shared secret used to verify inbound email webhooks (`EMAIL_WEBHOOK_SECRET` remains a legacy fallback) |

- IMAP mode runs through `scripts/email_poller.py` and polls `IMAP_FOLDER` every `IMAP_POLL_INTERVAL_SEC` seconds.
- `python scripts/email_poller.py --once` is the easiest dev-mode smoke check for one poll cycle.
- Webhook mode supports SendGrid Inbound Parse style payloads with `from`, `to`, `subject`, `text`, optional `html`, and optional raw `headers`.
- The webhook accepts `POST /webhook/email`; `/api/channels/email/inbound` remains as a compatibility alias.
- Signatures use `HMAC-SHA256(body, EMAIL_WEBHOOK_SIGNING_SECRET)` in the `X-Signature` header.
- Tenant routing uses the sender email domain from `TENANT_EMAIL_DOMAINS`, for example `TENANT_EMAIL_DOMAINS=acme.com:acme,*:default`.
- Low-quality email answers are persisted into `escalated_tickets` with `status="pending_response"` for the existing operator flow.
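The signature and tenant-routing rules above can be sketched as follows. This is a minimal illustration, not the project's handler: the hex encoding of the digest, the `*` wildcard lookup order, and all function names are assumptions.

```python
import hashlib
import hmac


def sign_body(body: bytes, secret: str) -> str:
    """Hex-encoded HMAC-SHA256 of the raw body (hex encoding is an assumption)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(body: bytes, header_value: str, secret: str) -> bool:
    """Compare the X-Signature header to the expected value in constant time."""
    expected = sign_body(body, secret).encode("utf-8")
    return hmac.compare_digest(expected, header_value.encode("utf-8"))


def parse_tenant_domains(raw: str) -> dict:
    """Parse 'acme.com:acme,*:default' into a domain -> tenant mapping."""
    mapping = {}
    for pair in raw.split(","):
        if ":" not in pair:
            continue  # skip empty or malformed entries
        domain, tenant = pair.split(":", 1)
        mapping[domain.strip().lower()] = tenant.strip()
    return mapping


def tenant_for_sender(sender: str, mapping: dict):
    """Resolve a tenant from the sender's domain, falling back to '*' if present."""
    domain = sender.rsplit("@", 1)[-1].strip().lower()
    return mapping.get(domain, mapping.get("*"))


secret = "example-secret"  # stand-in for EMAIL_WEBHOOK_SIGNING_SECRET
body = b'{"from": "user@acme.com", "subject": "Help"}'
signature = sign_body(body, secret)
domains = parse_tenant_domains("acme.com:acme,*:default")
```

Signing the raw request bytes, rather than a re-serialized payload, avoids mismatches caused by whitespace or key ordering, and `hmac.compare_digest` avoids leaking timing information during the comparison.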
- The final `/api/ask` response is cached for `(tenant, normalized_question)`, where normalization is `.strip().lower()`.
- Keys look like `llm_resp:{tenant}:{sha256(question)[:16]}`, so the raw question is not stored in Redis.
- Uploads invalidate the tenant namespace `llm_resp:{tenant}:*`.
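A minimal sketch of the key scheme described above, assuming the hash is the SHA-256 hex digest of the UTF-8 encoded, normalized question; the helper names are hypothetical.

```python
import hashlib


def cache_key(tenant: str, question: str) -> str:
    """Build the response-cache key for a tenant and question."""
    normalized = question.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"llm_resp:{tenant}:{digest[:16]}"


def invalidation_pattern(tenant: str) -> str:
    """Glob pattern covering every cached response of one tenant."""
    return f"llm_resp:{tenant}:*"


key_a = cache_key("acme", "  How do I reset my password? ")
key_b = cache_key("acme", "how do i reset my password?")
key_other_tenant = cache_key("beta", "how do i reset my password?")
```

Because normalization runs before hashing, `key_a` and `key_b` resolve to the same entry, while the tenant prefix keeps identical questions from different tenants apart.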

Provider routing is configured through `config/providers.yml`, which defines:
- enabled providers (`ollama`, `gracekelly`, `mistral`)
- model aliases such as `ollama-small`, `gk-fast`, and `mistral-small-latest`
- per-model input/output pricing, rate limits, and capability flags
- routing profiles `local-first`, `gracekelly-primary`, `gracekelly-mixed`, and `external-mistral`

Runtime behavior:
- `gracekelly-primary` is the default profile and routes both tiers through the local GraceKelly orchestrator.
- `local-first` is the explicit Ollama-only profile and keeps both fast/strong lanes on Ollama.
- `gracekelly-primary` falls back only to the declared Ollama fallback when GraceKelly is unavailable and failover is enabled.
- `gracekelly-mixed` keeps browser-backed strong answer generation on GraceKelly while routing fast helper/evaluator calls through direct Mistral; use it only for explicit live benchmark runs.
- `external-mistral` uses the direct Mistral API and is the intended non-local deployment option when GraceKelly is not present.
- Startup validation loads the registry, verifies `LLM_PROVIDER_PROFILE`, and treats placeholder credentials such as `changeme` as missing.
- Each traced LLM step now records `provider_name`, `model_name`, token usage, and cost; Prometheus exports `llm_cost_usd_total{provider,model,tenant}`.
- Automatic failover events are exported as `llm_provider_fallback_total{from_provider,to_provider,reason}`.
- `mistral-small` is GraceKelly's local fast-lane model name; use `mistral-small-latest` when you want the direct Mistral alias.
- The admin UI exposes a Providers tab backed by `GET /api/admin/providers`, including active profile, configured providers, 1-minute usage, 24-hour cost, and the last successful call timestamp.

- `gracekelly-primary` is intended for local setups where `D:\GraceKelly\` runs on `http://127.0.0.1:8011`.
- The provider uses `GET /healthz/ready` before the first request and calls `POST /api/v1/smart` with `reliability_level=quick`.
- If GraceKelly is down or times out, the runtime switches only to the declared local fallback (`ollama`) and caches that decision for `FAILOVER_FALLBACK_CACHE_SECONDS`. Ollama is not otherwise required by the default health path.
- GraceKelly calls are treated as proxy/orchestrator traffic, so `cost_usd` remains `0.0` in local traces.
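The cached failover decision can be pictured with a small router. This is a simplified sketch under stated assumptions: `FailoverRouter` and its injected readiness probe and clock are hypothetical, and it caches the decision as soon as the primary is found unavailable, whereas the runtime is described as caching a successful fallback.

```python
import time


class FailoverRouter:
    """Hypothetical router: prefer the primary, cache a fallback decision."""

    def __init__(self, is_primary_ready, cache_seconds=300, clock=time.monotonic):
        self._is_primary_ready = is_primary_ready  # callable returning bool
        self._cache_seconds = cache_seconds        # FAILOVER_FALLBACK_CACHE_SECONDS
        self._clock = clock
        self._fallback_until = 0.0

    def choose(self) -> str:
        now = self._clock()
        if now < self._fallback_until:
            return "ollama"  # cached fallback decision is still valid
        if self._is_primary_ready():
            return "gracekelly"
        self._fallback_until = now + self._cache_seconds
        return "ollama"


# Simulated clock and readiness probe so the behaviour is deterministic
fake_now = [0.0]
health = {"ready": False, "probes": 0}


def probe() -> bool:
    health["probes"] += 1
    return health["ready"]


router = FailoverRouter(probe, cache_seconds=300, clock=lambda: fake_now[0])
first = router.choose()    # primary down: fall back and cache the decision
health["ready"] = True
second = router.choose()   # inside the cache window: no new probe
fake_now[0] = 301.0
third = router.choose()    # cache expired: the primary is probed again
```

Caching the decision keeps a flapping primary from adding a readiness-probe timeout to every request during an outage, at the cost of staying on the fallback for up to the cache window after recovery.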

- `external-mistral` is the direct Mistral fallback for deployments where GraceKelly is unavailable.
- The provider uses `POST https://api.mistral.ai/v1/chat/completions` with OpenAI-compatible chat payloads and reads token usage from `usage.prompt_tokens` / `usage.completion_tokens`.
- Placeholder `MISTRAL_API_KEY=changeme` is treated as missing both in startup validation and in the provider constructor.
- `DAILY_COST_LIMIT_USD` applies to the direct Mistral profile and blocks new runtime creation after the current UTC-day spend is exhausted.
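The placeholder and budget checks above amount to two small guards. The sketch below is illustrative only: the helper names, the exception type, and the set of placeholder values (the source names only `changeme`) are assumptions.

```python
PLACEHOLDER_VALUES = {"", "changeme"}  # assumed set; only "changeme" is documented


def is_missing_credential(value) -> bool:
    """Treat unset, empty, and placeholder keys as missing."""
    return value is None or value.strip().lower() in PLACEHOLDER_VALUES


class DailyCostLimitExceeded(RuntimeError):
    """Raised when the UTC-day spend has reached DAILY_COST_LIMIT_USD."""


def ensure_budget(spent_today_usd: float, limit_usd: float = 5.0) -> None:
    """Fail fast once tracked spend reaches the limit (default from the table)."""
    if spent_today_usd >= limit_usd:
        raise DailyCostLimitExceeded(
            f"spent {spent_today_usd:.2f} USD of {limit_usd:.2f} USD today"
        )


placeholder_missing = is_missing_credential("changeme")
real_key_missing = is_missing_credential("real-key-value")

ensure_budget(4.99)  # under the limit: returns without raising
try:
    ensure_budget(5.0)
    limit_hit = False
except DailyCostLimitExceeded:
    limit_hit = True
```

Checking the budget before a runtime is created means in-flight requests finish normally and only new work is refused.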