Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/llm-gateway/_category_.json
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,6 @@
"position": 3,
"link": {
"type": "generated-index",
"description": "Unified API gateway for LLM providers route, secure, monitor, and optimize all your LLM traffic through a single endpoint."
"description": "Unified API gateway for LLM providers - route, secure, monitor, and optimize all your LLM traffic through a single endpoint."
}
}
81 changes: 81 additions & 0 deletions docs/llm-gateway/architecture.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
---
sidebar_position: 2
---

# Architecture

How the QuilrAI LLM Gateway processes every request - from your application to the LLM provider and back.

<ArchitectureDiagram
source={{
label: "Your Application",
code: `client = OpenAI(
base_url='https://guardrails.quilr.ai/openai_compatible/',
api_key='sk-quilr-xxx'
)
client.chat.completions.create(
model='gpt-4o',
messages=[{'role': 'user', 'content': 'Hello!'}]
)`,
}}
gateway={{
label: "QuilrAI LLM Gateway",
phases: [
{
label: "Validate",
stages: [
{ label: "Identity & Auth", items: ["JWT / header validation", "Domain allowlist", "Per-user tracking"] },
{ label: "Rate Limits", items: ["Req/min, hr, day limits", "Token budgets", "Key expiration"] },
],
},
{
label: "Scan",
stages: [
{ label: "PII / PHI / PCI", items: ["Contextual detection", "Exact data matching", "Block / redact / anonymize"] },
{ label: "Adversarial Detection", items: ["Prompt injection", "Jailbreak detection", "Social engineering"] },
{ label: "Custom Intents", items: ["User-defined categories", "Example-trained classifier"] },
],
},
{
label: "Transform",
stages: [
{ label: "Prompt Store", items: ["Centralized prompts", "Template variables", "Enforce prompt-only mode"] },
{ label: "Token Saving", items: ["JSON compression", "HTML/MD stripping", "Input-only, same accuracy"] },
],
},
{
label: "Route",
stages: [
{ label: "Request Routing", items: ["Weighted load balancing", "Automatic failover", "Multi-provider groups"] },
],
},
],
footer: "Logging · Cost Tracking · Analytics · Red Team Testing",
}}
destination={{
label: "LLM Providers",
items: ["OpenAI", "Anthropic", "Azure OpenAI", "AWS Bedrock", "Vertex AI", "Custom Endpoints"],
}}
/>

## Pipeline Stages

Every API request flows through these stages in order. Each stage is independently configurable per API key from the dashboard.

| Stage | Description | Details |
|-------|-------------|---------|
| **Identity & Auth** | Validates request identity via JWT, JWKS, or header. Enforces domain restrictions. | [Identity Aware →](./features/identity-aware) |
| **Rate Limits** | Enforces request rates, token budgets, and key expiration before reaching the provider. | [Rate Limits →](./features/rate-limits) |
| **Security Guardrails** | Detects PII, PHI, PCI, and financial data. Catches prompt injection, jailbreak, and social engineering. | [Security Guardrails →](./features/security-guardrails) |
| **Custom Intents** | User-defined detection categories trained with positive and negative examples. | [Custom Intents →](./features/custom-intents) |
| **Prompt Store** | Resolves centralized system prompts by ID with template variable substitution. | [Prompt Store →](./features/prompt-store) |
| **Token Saving** | Compresses input tokens - JSON to TOON, HTML/Markdown to plain text. Responses unchanged. | [Token Saving →](./features/token-saving) |
| **Request Routing** | Routes to the optimal provider using weighted load balancing with automatic failover. | [Request Routing →](./features/request-routing) |

## Response Path

Responses from the LLM provider pass back through the **security guardrails** for output scanning before being returned to your application. The same detection categories and configurable actions (block, redact, anonymize, monitor) apply to both requests and responses.

## Observability

Every request is logged with cost, latency, token counts, and guardrail actions. Use the **Logs** tab to review request history and the **Red Team Testing** tool to [validate your guardrail configuration](./features/red-team-testing) against adversarial prompts.
2 changes: 1 addition & 1 deletion docs/llm-gateway/features/_category_.json
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,6 @@
"position": 3,
"link": {
"type": "generated-index",
"description": "Detailed documentation for each LLM Gateway capability Routing, Token Saving, Guardrails, Prompt Store, Identity Aware, and more."
"description": "Detailed documentation for each LLM Gateway capability - Routing, Token Saving, Guardrails, Prompt Store, Identity Aware, and more."
}
}
6 changes: 3 additions & 3 deletions docs/llm-gateway/features/custom-intents.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,8 @@ Custom intents extend the guardrails system with your own detection logic. Provi

1. **Name** your intent (e.g., `competitor-mentions`)
2. **Describe** what the intent should detect
3. **Add positive examples** prompts that should trigger the intent
4. **Add negative examples** prompts that should not trigger the intent
5. **Assign an action** block, monitor, or redact
3. **Add positive examples** - prompts that should trigger the intent
4. **Add negative examples** - prompts that should not trigger the intent
5. **Assign an action** - block, monitor, or redact

The classifier learns from your examples and applies the configured action when a match is detected.
38 changes: 31 additions & 7 deletions docs/llm-gateway/features/identity-aware.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,21 +8,45 @@ Authenticate and track users behind each API key.

## How It Works

1. **Request Arrives** — App sends an API call with identity info
2. **Gateway Identifies User** — Extracts identity via header or JWT token
3. **Per-User Tracking** — Usage tracked per user with rate limits and analytics
<StepFlow steps={[
{
label: "Request Arrives",
items: [
"Authorization: Bearer sk-quilr-•••",
"X-User-Email: alice@acme.com",
],
},
{
label: "QuilrAI Identifies",
items: [
"User: alice@acme.com",
"Domain: acme.com ✓",
],
},
{
label: "Per-User Tracking",
items: [
"Requests today: 142",
"Rate limit: 80% used",
],
},
]} />

1. **Request Arrives** - App sends an API call with identity info
2. **Gateway Identifies User** - Extracts identity via header or JWT token
3. **Per-User Tracking** - Usage tracked per user with rate limits and analytics

## Authentication Modes

### Header Based Recommended for trusted clients
### Header Based - Recommended for trusted clients

Uses the `X-User-Email` header to identify users. If your app handles user login and makes LLM calls from your own backend, this is the easiest and recommended approach just pass the logged-in user's email as a header.
Uses the `X-User-Email` header to identify users. If your app handles user login and makes LLM calls from your own backend, this is the easiest and recommended approach - just pass the logged-in user's email as a header.

```
X-User-Email: user@company.com
```

### JWKS Endpoint For untrusted clients
### JWKS Endpoint - For untrusted clients

Validates JWT tokens using a JWKS URL for dynamic key rotation. Ideal for production OAuth/OIDC flows with providers like Auth0, Okta, or Google.

Expand Down Expand Up @@ -51,7 +75,7 @@ MIIBIjANBgkqh...

### Enforce Identity

When enabled, requests without valid identity (header or JWT) are **rejected at the gateway** bare API key access is blocked.
When enabled, requests without valid identity (header or JWT) are **rejected at the gateway** - bare API key access is blocked.

### Allowed User Domains

Expand Down
30 changes: 27 additions & 3 deletions docs/llm-gateway/features/prompt-store.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,9 +8,33 @@ Manage and version system prompts centrally.

## How It Works

1. **Create** — Store a prompt with a unique ID (e.g., `code-reviewer`)
2. **Reference** — Use it as the system message content: `quilrai-prompt-store-code-reviewer`
3. **Gateway Resolves** — The gateway resolves the prompt and sends the full text to the LLM
<StepFlow steps={[
{
label: "Prompt Stored",
items: [
"ID: code-reviewer",
'"You are a {{tone}} reviewer"',
],
},
{
label: "API References It",
items: [
"system: quilrai-prompt-store-code-reviewer",
'vars: {tone: "formal"}',
],
},
{
label: "QuilrAI Resolves",
items: [
'"You are a formal reviewer"',
"Sent to LLM ✓",
],
},
]} />

1. **Create** - Store a prompt with a unique ID (e.g., `code-reviewer`)
2. **Reference** - Use it as the system message content: `quilrai-prompt-store-code-reviewer`
3. **Gateway Resolves** - The gateway resolves the prompt and sends the full text to the LLM

## Template Variables

Expand Down
8 changes: 4 additions & 4 deletions docs/llm-gateway/features/rate-limits.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,10 @@ Rate limits protect your LLM spend and availability. All limits are enforced at

## Key Features

- **Per-key rate limits** Requests per minute, hour, or day
- **Token limits** Input and output token budgets per request or over time
- **API key expiration** Configurable epoch time for automatic key expiry
- **Response timeout** Maximum wait time to prevent hung requests from consuming resources
- **Per-key rate limits** - Requests per minute, hour, or day
- **Token limits** - Input and output token budgets per request or over time
- **API key expiration** - Configurable epoch time for automatic key expiry
- **Response timeout** - Maximum wait time to prevent hung requests from consuming resources

## Configuration

Expand Down
33 changes: 29 additions & 4 deletions docs/llm-gateway/features/request-routing.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,9 +8,34 @@ Multi-provider load balancing and failover behind a single API key.

## How It Works

1. **Create Group** — Define a named routing group (e.g., `Group1`)
2. **Add Models** — Add providers with traffic weights (e.g., `gpt-4o 60%`, `claude 40%`)
3. **Use as Model** — Pass the group name as the `model` parameter in your API call
<StepFlow steps={[
{
label: "API Request",
items: [
'model: "Group1"',
'content: "Hello!"',
],
},
{
label: "QuilrAI Routes",
items: [
"Group1 found ✓",
"gpt-4o → 60% weight",
"claude-sonnet → 40% weight",
],
},
{
label: "Provider Selected",
items: [
"→ gpt-4o (weighted)",
"Response returned ✓",
],
},
]} />

1. **Create Group** - Define a named routing group (e.g., `Group1`)
2. **Add Models** - Add providers with traffic weights (e.g., `gpt-4o 60%`, `claude 40%`)
3. **Use as Model** - Pass the group name as the `model` parameter in your API call

## Weight-Based Routing

Expand Down Expand Up @@ -64,7 +89,7 @@ Group names can match actual model names. Your application keeps sending request
|-----------|-----------|
| `gpt-4.1` | `gpt-4.1-nano` (70%), `gpt-4.1-mini` (30%) |

Your code still sends `model="gpt-4.1"` zero code changes, but requests get routed to cheaper or faster models behind the scenes.
Your code still sends `model="gpt-4.1"` - zero code changes, but requests get routed to cheaper or faster models behind the scenes.

## Code Examples

Expand Down
14 changes: 7 additions & 7 deletions docs/llm-gateway/features/security-guardrails.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,10 @@ Contextual detection identifies sensitive data categories and applies the config

### Supported Categories

- **PII** Personally Identifiable Information
- **PHI** Protected Health Information
- **PCI** Payment Card Industry data
- **Financial data** Financial records and account information
- **PII** - Personally Identifiable Information
- **PHI** - Protected Health Information
- **PCI** - Payment Card Industry data
- **Financial data** - Financial records and account information

### Exact Data Matching (EDM)

Expand All @@ -29,9 +29,9 @@ Pattern matching with custom EDM rules for specific data formats.

Catches adversarial attack patterns in requests:

- **Prompt injection** Attempts to override system instructions
- **Jailbreak** Attempts to bypass safety controls
- **Social engineering** Manipulation attempts targeting the AI model
- **Prompt injection** - Attempts to override system instructions
- **Jailbreak** - Attempts to bypass safety controls
- **Social engineering** - Manipulation attempts targeting the AI model

## Configurable Actions

Expand Down
18 changes: 0 additions & 18 deletions docs/llm-gateway/features/tags.md

This file was deleted.

38 changes: 31 additions & 7 deletions docs/llm-gateway/features/token-saving.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,23 +8,47 @@ Reduce token usage by compressing input content automatically.

## How It Works

1. **Request Arrives** — Your app sends a normal API call
2. **Gateway Compresses** — Content is transformed to use fewer tokens
3. **Forwarded to LLM** — Optimized content sent — same accuracy, lower cost
<StepFlow steps={[
{
label: "Request Arrives",
items: [
'{"name": "John", "age": 30}',
"14 input tokens",
],
},
{
label: "QuilrAI Compresses",
items: [
"name:John|age:30",
"8 input tokens",
],
},
{
label: "Sent to LLM",
items: [
"43% tokens saved",
"Same response quality ✓",
],
},
]} />

1. **Request Arrives** - Your app sends a normal API call
2. **Gateway Compresses** - Content is transformed to use fewer tokens
3. **Forwarded to LLM** - Optimized content sent - same accuracy, lower cost

## Compression Methods

### Smart JSON Compression Up to 20% savings
### Smart JSON Compression - Up to 20% savings

Converts JSON objects in LLM inputs to TOON format ideal for tool call responses and structured data.
Converts JSON objects in LLM inputs to TOON format - ideal for tool call responses and structured data.

| Before | After |
|--------|-------|
| `{"name": "John", "age": 30}` | `name:John\|age:30` |

### HTML to Text

Strips HTML tags and extracts clean text removes markup overhead from scraped pages or rich content.
Strips HTML tags and extracts clean text - removes markup overhead from scraped pages or rich content.

| Before | After |
|--------|-------|
Expand All @@ -40,4 +64,4 @@ Removes Markdown syntax characters that consume tokens without adding meaning fo

## Seamless and Input-Only

Compression is applied **only to input tokens** before they reach the LLM. Responses are returned untouched. Your application code stays exactly the same no SDK changes, no prompt rewrites, just lower costs.
Compression is applied **only to input tokens** before they reach the LLM. Responses are returned untouched. Your application code stays exactly the same - no SDK changes, no prompt rewrites, just lower costs.
Loading
Loading