Full technical reference: problem statement, solution design, architecture, data models, and setup requirements.
Modern customer support systems face three compounding challenges:
- Static routing logic: Hard-coded `if/else` trees break as user language evolves. They cannot handle paraphrasing, mixed languages, or new intent types without redeployment.
- Uncontrolled LLM outputs: When language models are used to generate responses, they routinely fabricate facts ("hallucinate"), inventing order numbers, policies, or prices that do not exist in the system. This destroys user trust.
- No observability: Existing tools provide no structured way to audit what the agent decided, why it decided it, which tool it called, and how long each step took. Debugging or improving the system is difficult.
The result: Support teams manually handle queries that should be automated, users receive incorrect information, and improving the system requires guesswork.
AgentOS is a self-contained orchestration layer that sits between a user's message and backend tools. It solves all three problems:
| Problem | Solution |
|---|---|
| Static routing | Rule-based intent classification (weighted keyword + regex patterns) across 10 intent types and 2 languages |
| Hallucination | Tiered hallucination guard: balanced mode allows template responses; strict mode passes only FAQ- or tool-backed responses |
| No observability | Every request logs intent, action, tool call, latency breakdown, and safety flags to a persistent SQLite database |
The system is fully database-driven — agents, FAQs, tool configs, and safety settings are stored in the DB and can be updated through the dashboard UI without any code changes.
- 10 intent types: greeting, order_status, refund_request, complaint, ticket_creation, faq_query, shipping_query, product_query, payment_query, abusive_language
- Bilingual: English and Hinglish (Hindi written in Latin script)
- Mechanism: Weighted keyword matching + regex patterns + language detection, normalized confidence score 0–1
- Evaluated accuracy: 84.4% on 32-query bilingual benchmark
- 3 modes: strict, balanced, permissive
- Checks: PII (SSN, credit card, email), SQL injection, command injection, template injection, spam, abusive language (English + Hinglish)
- Severity scoring: Each flag has a weight; total risk score > threshold → block
- Strict mode extra: Any response without an FAQ or tool source is flagged as hallucination
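The severity scoring described above can be sketched as follows. This is a minimal illustration, not the contents of `safety-guard.ts`: the flag weights, per-mode thresholds, detector patterns, and the name `assessMessage` are all assumptions chosen for the example.

```typescript
type SafetyMode = "strict" | "balanced" | "permissive";
type SafetyFlag = "pii" | "injection" | "spam" | "abusive";

// Assumed weights and thresholds; the real values are not stated in this document.
const FLAG_WEIGHTS: Record<SafetyFlag, number> = { pii: 0.4, injection: 1.0, spam: 0.3, abusive: 0.6 };
const BLOCK_THRESHOLD: Record<SafetyMode, number> = { strict: 0.3, balanced: 0.6, permissive: 0.9 };

// Deliberately small detectors: SSN / card-like digits / email, common injection
// shapes, one character repeated 10+ times, and a tiny English + Hinglish word list.
const DETECTORS: Record<SafetyFlag, RegExp> = {
  pii: /\b\d{3}-\d{2}-\d{4}\b|\b(?:\d[ -]?){13,16}\b|[\w.+-]+@[\w-]+\.[\w.]+/,
  injection: /['"]\s*(?:or|and)\s+\d+\s*=\s*\d+|;\s*drop\s+table|\{\{.*?\}\}|\$\(.*?\)/i,
  spam: /(.)\1{9,}/,
  abusive: /\b(?:idiot|stupid|bewakoof)\b/i,
};

function assessMessage(message: string, mode: SafetyMode) {
  const flags = (Object.keys(DETECTORS) as SafetyFlag[]).filter((flag) => DETECTORS[flag].test(message));
  // Risk score is the sum of the weights of every flag that fired.
  const riskScore = flags.reduce((sum, flag) => sum + FLAG_WEIGHTS[flag], 0);
  // A message is blocked when the total risk score exceeds the mode's threshold.
  return { isSafe: !(riskScore > BLOCK_THRESHOLD[mode]), riskScore, flags };
}
```

With these assumed numbers, a message containing only an email address is blocked in strict mode but passes in balanced mode, which is the kind of tiering the three modes are meant to provide.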
- `order_lookup`: Looks up order status, product, and estimated delivery by order ID. Mock data is deterministic per order ID (same input → same output every time)
- `create_ticket`: Creates a support ticket with category and description. Returns a ticket ID
- Retry logic: Up to 3 retries with exponential backoff on failure
- DB-gated: Each tool can be enabled/disabled via `ToolConfig` in the database without code changes
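The retry behaviour in the list above might look roughly like this. It is a sketch under assumptions: the 100 ms base delay, the doubling schedule, and the names `backoffDelayMs` and `executeWithRetry` are hypothetical and are not taken from `tools.ts`.

```typescript
type ToolResult<T> =
  | { success: true; data: T; attempts: number }
  | { success: false; error: string; attempts: number };

// Delay before retry number `retry` (1-based): base, 2x base, 4x base, ...
function backoffDelayMs(retry: number, baseMs: number = 100): number {
  return baseMs * Math.pow(2, retry - 1);
}

const sleep = (ms: number): Promise<void> => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function executeWithRetry<T>(
  run: () => Promise<T>,
  maxRetries: number = 3,
  wait: (ms: number) => Promise<void> = sleep, // injectable so tests need not really sleep
): Promise<ToolResult<T>> {
  let lastError = "unknown error";
  // One initial attempt plus up to `maxRetries` retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (attempt > 0) await wait(backoffDelayMs(attempt));
    try {
      return { success: true, data: await run(), attempts: attempt + 1 };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { success: false, error: lastError, attempts: maxRetries + 1 };
}
```

Returning a result object instead of throwing keeps the caller's escalation logic simple: a failed tool call is just another branch, and the attempt count can be logged alongside `toolSuccess`.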
- Runs on every final response before delivery
- Cross-references response against the loaded FAQ database using string similarity (Levenshtein distance)
- In strict mode: any non-FAQ, non-tool response is flagged with `isHallucination: true` and `confidence: 0.9`
- Result included in every API response as `hallucinationCheck`
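A minimal sketch of the FAQ cross-check, assuming a normalized Levenshtein similarity and a 0.8 match cutoff. The cutoff, the `bestFaqSimilarity` field, the `confidence: 0` value for unflagged responses, and the function names are assumptions. Only the strict-mode result (`isHallucination: true`, `confidence: 0.9`) comes from the description above.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping two rows at a time.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 means identical (ignoring case), 0 means nothing in common.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a.toLowerCase(), b.toLowerCase()) / longest;
}

interface HallucinationCheck {
  isHallucination: boolean;
  confidence: number;
  bestFaqSimilarity: number; // hypothetical extra field, useful when tuning the cutoff
}

function checkHallucination(
  response: string,
  faqAnswers: string[],
  opts: { strict: boolean; toolBacked: boolean },
  minSimilarity: number = 0.8, // assumed cutoff
): HallucinationCheck {
  const best = faqAnswers.reduce((max, answer) => Math.max(max, similarity(response, answer)), 0);
  const grounded = opts.toolBacked || best >= minSimilarity;
  if (opts.strict && !grounded) {
    return { isHallucination: true, confidence: 0.9, bestFaqSimilarity: best };
  }
  return { isHallucination: false, confidence: 0, bestFaqSimilarity: best };
}
```

Levenshtein distance is quadratic in string length, so for a large FAQ set a cheaper prefilter (for example, token overlap) before the full comparison may be worth considering.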
- Every request is logged to `ExecutionLog` in SQLite via a queue-based async logger (`MAX_QUEUE_SIZE`, exponential backoff retries, graceful flush on shutdown)
- Logs: intent, confidence, action, tool used, tool success, safety flags, latency breakdown (per stage), session ID, agent ID
- 32-query bilingual dataset (16 English + 16 Hinglish)
- 3-threshold ablation study (0.6 / 0.7 / 0.8)
- Metrics: intent accuracy, escalation rate, tool success rate, hallucination block count, avg latency
- Strict-mode hallucination probe to confirm guard is active
- Exports: `results.json` (full structured data) + `results.csv`
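One way the threshold ablation could be computed from per-query results is sketched below. The `EvalRecord` shape and function names are hypothetical, and the sketch assumes a query is escalated exactly when its confidence falls below the threshold. It is not taken from `scripts/eval.js`.

```typescript
interface EvalRecord {
  expected: string;   // labelled intent from the dataset
  predicted: string;  // intent returned by the classifier
  confidence: number; // classifier confidence in [0, 1]
}

function evaluateAtThreshold(records: EvalRecord[], threshold: number) {
  const correct = records.filter((r) => r.predicted === r.expected).length;
  const escalated = records.filter((r) => r.confidence < threshold).length;
  return {
    threshold,
    intentAccuracy: correct / records.length,
    escalationRate: escalated / records.length,
  };
}

// Run the same records through each threshold in the ablation.
function runAblation(records: EvalRecord[], thresholds: number[] = [0.6, 0.7, 0.8]) {
  return thresholds.map((t) => evaluateAtThreshold(records, t));
}
```

In this formulation intent accuracy does not depend on the threshold, while the escalation rate can only rise as the threshold increases. That makes the trade-off between automation and caution easy to read off the three rows.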
graph TB
subgraph Client
UI[Next.js Dashboard\nTest Console · Agent Builder\nLogs · Tools + Knowledge]
end
subgraph API["API Layer (Next.js App Router)"]
RUN[POST /api/run\nMain Orchestrator]
AGENTS[/api/agents\nCRUD]
CONFIG[/api/agent-config\nUpsert]
TOOLS[/api/tools\nConfig]
FAQ[/api/knowledge\nFAQ CRUD]
LOGS[/api/logs\nLog Viewer]
end
subgraph Core["Core Libraries"]
IR[intent-router.ts\nClassifyIntent · DetermineAction]
SG[safety-guard.ts\nCheckSafety · HallucinationGuard]
TL[tools.ts\nOrderLookup · CreateTicket]
DL[db-logger.ts\nQueuedLogger · Stats]
end
subgraph Data["Data Layer (Prisma + SQLite)"]
AG[(Agent)]
TC[(ToolConfig)]
FK[(FAQItem)]
EL[(ExecutionLog)]
SS[(Session)]
end
UI --> API
RUN --> IR
RUN --> SG
RUN --> TL
RUN --> DL
DL --> EL
RUN --> AG
RUN --> TC
RUN --> FK
erDiagram
Agent {
string id PK
string name
string description
string systemPrompt
string languageMode
string safetyMode
float confidenceThreshold
string status
datetime createdAt
datetime updatedAt
}
AgentTool {
string id PK
string agentId FK
string toolId
boolean enabled
}
ToolConfig {
string id PK
boolean orderLookupEnabled
boolean createTicketEnabled
datetime updatedAt
}
FAQItem {
string id PK
string question
string answer
string category
boolean active
datetime createdAt
}
ExecutionLog {
string id PK
string agentId
string sessionId
string message
string intent
float confidence
string action
string toolUsed
boolean toolSuccess
string finalResponse
boolean isSafe
string safetyFlags
int latencyMs
datetime createdAt
}
Session {
string id PK
string agentId
int messageCount
datetime lastActivity
datetime expiresAt
}
Agent ||--o{ AgentTool : has
Agent ||--o{ ExecutionLog : generates
Agent ||--o{ Session : tracks
1. POST /api/run
├── Zod validation (message, agentId, sessionId, confidenceThreshold)
├── validateInputLength (1–5000 chars)
├── sanitizeMessage (strip HTML, trim)
│
2. Agent config loading
├── Load agent by agentId from DB (include tools)
├── Fallback: first active agent in DB
├── Fallback: hardcoded defaults (balanced, 0.7 threshold)
├── X-Safety-Mode header can override safetyMode
│
3. DB parallel load
├── FAQItem.findMany() → faqDatabase array
└── ToolConfig.findFirst() → enabled tool set
│
4. Intent classification (lib/intent-router.ts)
├── detectLanguage() → english | hinglish
├── Score each of 10 intents by keyword matches + regex
├── Normalize to [0,1] confidence
└── Return: intent, confidence, keywords, category, language
│
5. Safety check (lib/safety-guard.ts)
├── checkMessageSafety(message, safetyMode)
├── Flag checks: harmful, pii, injection, spam, abusive
├── Risk score = sum(flag weights)
└── Return: isSafe, riskScore, flags
│
6. Action decision
├── If not safe → escalate
├── If confidence < threshold → escalate
└── Else → intent.actionType (answer | tool_call | escalate)
│
7. Execution branch
├── answer → generateAnswerResponse() (FAQ match → template fallback)
├── tool_call
│ ├── isToolEnabled(toolId) check (ToolConfig gate)
│ ├── extractToolParameters(message, tool)
│ └── executeToolWithRetry(tool, params, maxRetries=3)
└── escalate → escalation template
│
8. Hallucination guard
└── applySafetyGuard(response, context) → hallucinationCheck
│
9. DB logging (async queue)
└── logExecutionToDatabase({ agentId, intent, action, latency, ... })
│
10. Return RunResponse JSON
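Step 6 of the flow above reduces to a small pure function. The sketch below assumes the inputs have already been computed by steps 4 and 5. The `DecisionInput` shape and the name `decideAction` are hypothetical.

```typescript
type Action = "answer" | "tool_call" | "escalate";

interface DecisionInput {
  isSafe: boolean;      // from the safety check (step 5)
  confidence: number;   // from intent classification (step 4)
  threshold: number;    // agent's confidenceThreshold
  intentAction: Action; // default action attached to the classified intent
}

function decideAction(input: DecisionInput): Action {
  if (!input.isSafe) return "escalate";                          // unsafe input never reaches a tool
  if (input.confidence < input.threshold) return "escalate";     // not confident enough to automate
  return input.intentAction;
}
```

Checking safety before confidence means an unsafe message is escalated even when the classifier is certain about its intent.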
| ID | Requirement | Status |
|---|---|---|
| F1 | Classify user message into one of 10 intents | ✅ |
| F2 | Support English and Hinglish input | ✅ |
| F3 | Apply 3-tier safety filtering before processing | ✅ |
| F4 | Execute `order_lookup` tool with order ID extraction | ✅ |
| F5 | Execute `create_ticket` tool with category + description | ✅ |
| F6 | Retry failed tool calls up to 3 times with backoff | ✅ |
| F7 | Gate tool availability via database ToolConfig | ✅ |
| F8 | Validate responses against FAQ database (hallucination guard) | ✅ |
| F9 | Persist all executions to database with full metadata | ✅ |
| F10 | Support multiple agent configurations via agentId routing | ✅ |
| F11 | Provide a web dashboard for managing agents, FAQs, and tools | ✅ |
| F12 | Run ablation evaluation across 3 confidence thresholds | ✅ |
| ID | Requirement | Target | Actual |
|---|---|---|---|
| NF1 | Average response latency | < 200ms | ~25ms |
| NF2 | Intent classification accuracy (32-query English + Hinglish benchmark) | > 80% | 84.4% |
| NF3 | Input validation on all API endpoints | 100% | 100% (Zod) |
| NF4 | Safe field filtering on PUT (no id/createdAt overwrite) | Required | ✅ |
| NF5 | Deterministic tool mock data | Required | ✅ (hashed) |
| NF6 | Database logging reliability | No log loss | Queue + retry |
| Dependency | Minimum Version |
|---|---|
| Node.js | 18.x |
| pnpm (or npm) | any |
| SQLite | bundled via Prisma |
# Clone
git clone <repo-url>
cd AgentOS
# Install dependencies
pnpm install
# Generate Prisma client
npx prisma generate
# Create database + apply schema
npx prisma db push
# Seed: default agent + FAQs + tool config
node prisma/seed.js
# Start development server
pnpm run dev
# → http://localhost:3000

# .env.local (required)
DATABASE_URL="file:./dev.db"

| Script | Command | Description |
|---|---|---|
| Dev server | `pnpm run dev` | Start Next.js with hot reload |
| Evaluation | `pnpm run eval` | Run full ablation eval |
| Seed DB | `node prisma/seed.js` | Seed default data |
| DB Studio | `npx prisma studio` | Visual DB browser |
The classifier uses weighted keyword + regex matching rather than an LLM call. This gives sub-millisecond classification, fully deterministic and auditable behavior, zero token cost at runtime, and easy extensibility via the INTENT_PATTERNS object.
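To make the mechanism concrete, here is a minimal sketch of weighted keyword + regex scoring. The three-intent table, its weights, and the normalization (winning score divided by the sum of all intent scores) are assumptions for illustration. The actual `INTENT_PATTERNS` object covers 10 intents and may normalize differently.

```typescript
type Intent = "greeting" | "order_status" | "refund_request";

interface IntentPattern {
  keywords: Record<string, number>; // single-word keyword -> weight
  regexes: RegExp[];
  regexWeight: number;              // score added per matching regex
}

// Hypothetical subset of an INTENT_PATTERNS-style table (English + Hinglish keywords).
const INTENT_PATTERNS: Record<Intent, IntentPattern> = {
  greeting: { keywords: { hello: 1, hi: 1, namaste: 1 }, regexes: [/^(?:hi|hello|hey)\b/i], regexWeight: 2 },
  order_status: { keywords: { order: 1, status: 1, track: 1.5, kahan: 1 }, regexes: [/\bord-?\d+\b/i], regexWeight: 2 },
  refund_request: { keywords: { refund: 2, wapas: 2, return: 1 }, regexes: [/money\s+back/i], regexWeight: 2 },
};

function classifyIntent(message: string): { intent: Intent | null; confidence: number } {
  // Tokenize so that "hi" does not match inside "shipping".
  const tokens = new Set(message.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  let best: Intent | null = null;
  let bestScore = 0;
  let total = 0;
  for (const intent of Object.keys(INTENT_PATTERNS) as Intent[]) {
    const pattern = INTENT_PATTERNS[intent];
    let score = 0;
    for (const keyword of Object.keys(pattern.keywords)) {
      if (tokens.has(keyword)) score += pattern.keywords[keyword];
    }
    for (const regex of pattern.regexes) {
      if (regex.test(message)) score += pattern.regexWeight;
    }
    total += score;
    if (score > bestScore) { bestScore = score; best = intent; }
  }
  // Confidence is the winner's share of all evidence; 0 when nothing matched.
  return { intent: best, confidence: total === 0 ? 0 : bestScore / total };
}
```

Adding an intent under this scheme is a data change: a new key in the table with its keywords and patterns, with no change to the scoring loop.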
Writes never block the API response. Each log is pushed to an in-memory queue and processed asynchronously with retries. On server shutdown, flushLogs() drains the queue. This prevents log loss under load without adding response latency.
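The queue-and-drain pattern described here can be sketched as below. This is an illustration under assumptions, not the code in `db-logger.ts`: the class shape, the default limits, and the choice to reject new entries when the queue is full are all hypothetical.

```typescript
interface LogEntry { intent: string; action: string; latencyMs: number; }
type Writer = (entry: LogEntry) => Promise<void>;

class QueuedLogger {
  private queue: LogEntry[] = [];
  private draining: Promise<void> | null = null;
  private write: Writer;
  private maxQueueSize: number;
  private maxRetries: number;
  private baseDelayMs: number;

  constructor(write: Writer, maxQueueSize: number = 1000, maxRetries: number = 3, baseDelayMs: number = 50) {
    this.write = write;
    this.maxQueueSize = maxQueueSize;
    this.maxRetries = maxRetries;
    this.baseDelayMs = baseDelayMs;
  }

  // Called from the request path: synchronous, never waits for the database.
  enqueue(entry: LogEntry): boolean {
    if (this.queue.length >= this.maxQueueSize) return false; // assumption: reject when full
    this.queue.push(entry);
    this.kick();
    return true;
  }

  // Called on shutdown: resolves once every queued entry has been processed.
  async flush(): Promise<void> {
    while (this.draining) await this.draining;
  }

  private kick(): void {
    if (this.draining || this.queue.length === 0) return;
    this.draining = this.drain().then(() => {
      this.draining = null;
      this.kick(); // pick up entries that arrived while the drain was finishing
    });
  }

  private async drain(): Promise<void> {
    while (this.queue.length > 0) {
      const entry = this.queue.shift()!;
      for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
        try {
          await this.write(entry);
          break;
        } catch {
          if (attempt === this.maxRetries) break; // give up on this entry after the last retry
          const delay = this.baseDelayMs * Math.pow(2, attempt);
          await new Promise<void>((resolve) => setTimeout(resolve, delay));
        }
      }
    }
  }
}
```

A single drain loop keeps writes ordered and avoids concurrent SQLite writers. The trade-off is that one slow write delays everything behind it, which is acceptable here because nothing on the request path waits for the queue.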
order_lookup uses a simple modular hash on the order ID character codes to derive product name, status, and price. The same order ID always returns the same data, making eval runs reproducible and tool tests consistent.
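A sketch of how a modular hash over character codes can drive deterministic mock data. The product and status lists, the multiplier 31, the modulus 9973, and the price formula are assumptions for illustration; the actual tables in `tools.ts` may differ.

```typescript
// Hypothetical catalogues; only their lengths matter for the indexing below.
const PRODUCTS = ["Wireless Mouse", "USB-C Cable", "Laptop Stand", "Headphones"];
const STATUSES = ["processing", "shipped", "out_for_delivery", "delivered"];

// Fold the character codes into a small non-negative integer.
function hashOrderId(orderId: string, modulus: number = 9973): number {
  let hash = 0;
  for (let i = 0; i < orderId.length; i++) {
    hash = (hash * 31 + orderId.charCodeAt(i)) % modulus;
  }
  return hash;
}

function mockOrderLookup(orderId: string) {
  const h = hashOrderId(orderId);
  return {
    orderId,
    product: PRODUCTS[h % PRODUCTS.length],
    // Use a different slice of the hash so status is not locked to product.
    status: STATUSES[Math.floor(h / PRODUCTS.length) % STATUSES.length],
    price: 199 + (h % 50) * 10,
  };
}
```

Because there is no randomness or clock involved, two eval runs over the same dataset see identical tool outputs, so differences between runs can be attributed to the classifier or thresholds rather than to the mock.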
POST /api/agent-config checks for an existing agentId in the request body. If found, it updates; if not, it creates. This prevents duplicate agents when the config form is submitted multiple times.
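The update-or-create decision can be illustrated with an in-memory stand-in for the `Agent` table. The `Map` store, the generated ID format, and the behaviour when an unknown `agentId` is supplied (create with that ID) are assumptions. The real route goes through Prisma.

```typescript
interface AgentConfig { id: string; name: string; safetyMode: string; confidenceThreshold: number; }
type AgentInput = Omit<AgentConfig, "id"> & { agentId?: string };

const agents = new Map<string, AgentConfig>(); // stand-in for the Agent table
let nextId = 1;

function upsertAgentConfig(input: AgentInput): AgentConfig {
  const { agentId, ...fields } = input;
  const existing = agentId ? agents.get(agentId) : undefined;
  const saved: AgentConfig = existing
    ? { ...existing, ...fields }                         // known agentId: update in place
    : { id: agentId ?? `agent_${nextId++}`, ...fields }; // otherwise: create a new row
  agents.set(saved.id, saved);
  return saved;
}
```

Resubmitting the form with the ID returned by the first call updates the same row, so the number of agents stays at one.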
PUT /api/agents/[id] only allows updating: name, description, systemPrompt, languageMode, safetyMode, confidenceThreshold, status. Fields like id, createdAt, updatedAt are silently ignored so that a request cannot accidentally overwrite them.
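An allow-list filter of this kind can be sketched as follows. The field names come from the paragraph above. The helper name `pickUpdatableFields` is hypothetical.

```typescript
const UPDATABLE_FIELDS = [
  "name", "description", "systemPrompt", "languageMode",
  "safetyMode", "confidenceThreshold", "status",
] as const;
type UpdatableField = (typeof UPDATABLE_FIELDS)[number];

// Copy only allow-listed keys from the request body; everything else is dropped.
function pickUpdatableFields(body: Record<string, unknown>): Partial<Record<UpdatableField, unknown>> {
  const update: Partial<Record<UpdatableField, unknown>> = {};
  for (const field of UPDATABLE_FIELDS) {
    if (Object.prototype.hasOwnProperty.call(body, field)) update[field] = body[field];
  }
  return update;
}
```

An allow-list is safer than a deny-list here: a column added to the model later stays read-only through this endpoint until someone deliberately adds it to the list.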
AgentOS/
├── app/
│ ├── api/ # All API route handlers
│ │ ├── run/route.ts # Main 9-stage orchestrator
│ │ ├── agents/ # Agent list + create
│ │ ├── agents/[id]/ # Agent detail + update + delete
│ │ ├── agent-config/ # Agent settings upsert + fetch
│ │ ├── tools/ # ToolConfig read + update
│ │ ├── knowledge/ # FAQ CRUD
│ │ ├── knowledge/[id]/ # FAQ delete + update
│ │ ├── logs/ # ExecutionLog read + stats
│ │ └── test-logs/ # Health check endpoint
│ └── (dashboard)/
│ ├── page.tsx # Dashboard home
│ ├── test-console/ # Live agent chat
│ ├── agent-builder/ # Agent config form
│ ├── tools-knowledge/ # FAQ + tool management
│ └── logs/ # Log viewer
├── components/ # Shared UI components
├── lib/
│ ├── intent-router.ts # Intent classification
│ ├── safety-guard.ts # Safety + hallucination
│ ├── tools.ts # Tool implementations
│ ├── db-logger.ts # Async queue logger
│ ├── prisma.ts # Prisma client singleton
│ └── logger.ts # Console + perf timer
├── prisma/
│ ├── schema.prisma # All data models
│ └── seed.js # Default data seeder
├── scripts/
│ └── eval.js # Evaluation pipeline
└── eval/
├── dataset.json # 32-query test dataset
├── results.json # Latest eval results
└── results.csv # CSV export
AgentOS — Built with Next.js, Prisma, TypeScript