# On Premise Deployment
Nick edited this page Nov 21, 2025 · 1 revision
PATAS supports fully on-premise deployment with local LLM and embedding models.
PATAS uses two separate engines:
- Embedding Engine - for semantic similarity and clustering
- LLM Engine - for pattern explanation and rule generation
Both engines support `openai` (cloud) and `local` (on-premise) providers.
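The provider split can be sketched as a small config resolver. This is a hypothetical illustration of the idea, not PATAS's actual internals; the names `EngineConfig` and `resolve_base_url` are invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class EngineConfig:
    provider: str          # "openai" or "local"
    model: str
    base_url: str = ""     # required for the local provider
    api_key: str = ""      # required for openai, optional for local

def resolve_base_url(cfg: EngineConfig) -> str:
    """Return the endpoint an engine should call for this config."""
    if cfg.provider == "openai":
        return "https://api.openai.com/v1"
    if cfg.provider == "local":
        if not cfg.base_url:
            raise ValueError("local provider requires base_url")
        return cfg.base_url
    raise ValueError(f"unknown provider: {cfg.provider}")
```

Both engines accept the same two provider values, so the same resolution logic applies to the embedding and LLM configuration independently.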
## Embedding Models

Recommended local embedding models:

- BAAI/bge-m3 - multilingual, 568M parameters, good for mixed languages
- intfloat/e5-large-v2 - English-focused, 335M parameters
- BAAI/bge-large-en-v1.5 - English, 335M parameters
Requirements:
- 2-4 GB GPU memory
- HTTP endpoint (vLLM, TGI, or custom server)
## LLM Models

Recommended local LLM models:

- mistralai/Mistral-7B-Instruct-v0.2 - 7B parameters, good balance
- meta-llama/Llama-3.1-8B-Instruct - 8B parameters, strong performance
- mistralai/Mistral-7B-Instruct-v0.1 - alternative Mistral version
Requirements:
- 8-16 GB GPU memory (quantized: 4-8 GB)
- HTTP endpoint (vLLM, TGI, Ollama)
## Configuration

```yaml
# Embedding Engine
embedding_provider: local
embedding_model: "BAAI/bge-m3"
embedding_base_url: "http://localhost:8000/v1"  # Your local embedding service
embedding_api_key: ""  # Optional, not required for local

# LLM Engine
llm_provider: local
llm_model: "mistralai/Mistral-7B-Instruct-v0.2"
llm_base_url: "http://localhost:8000/v1"  # Your local LLM service
llm_api_key: ""  # Optional, not required for local
```

## API Requirements

Local models must expose an OpenAI-compatible HTTP API:
Embeddings:

```
POST /v1/embeddings
{
  "model": "BAAI/bge-m3",
  "input": ["text1", "text2", ...]
}
```

LLM:

```
POST /v1/chat/completions
{
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "messages": [...],
  "response_format": {"type": "json_object"}
}
```
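The request bodies above can be assembled as plain dictionaries before sending. A minimal sketch — the function names are hypothetical; only the wire format comes from the endpoint shapes above:

```python
def embedding_request(model: str, texts: list[str]) -> dict:
    """Body for POST /v1/embeddings (OpenAI-compatible)."""
    return {"model": model, "input": texts}

def chat_request(model: str, messages: list[dict],
                 json_output: bool = True) -> dict:
    """Body for POST /v1/chat/completions (OpenAI-compatible)."""
    body = {"model": model, "messages": messages}
    if json_output:
        # Ask the server to constrain output to valid JSON, as PATAS
        # does for rule generation.
        body["response_format"] = {"type": "json_object"}
    return body
```

Because the API is OpenAI-compatible, the same payloads work unchanged against both the cloud and local providers; only the base URL differs.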
## vLLM

```bash
# Start embedding server
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-m3 \
  --port 8000

# Start LLM server
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8001
```

Config:

```yaml
embedding_base_url: "http://localhost:8000/v1"
llm_base_url: "http://localhost:8001/v1"
```
## TGI

```bash
# Start TGI server
docker run -p 8000:80 \
  -v /path/to/models:/models \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2
```

## Ollama

```bash
# Install models
ollama pull mistral:7b
ollama pull bge-m3

# Ollama exposes an OpenAI-compatible API on port 11434
```

Config:

```yaml
embedding_base_url: "http://localhost:11434/v1"
llm_base_url: "http://localhost:11434/v1"
```

## Air-Gapped Deployment

For completely isolated environments:
- Download models to local storage
- Deploy model servers within air-gapped network
- Configure PATAS with internal endpoints
- Set `privacy_mode: STRICT` to disable all external calls
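As an illustration of what "disable all external calls" could mean in practice, here is a hedged sketch of a network guard. The function and its policy are hypothetical, not PATAS's actual implementation: under STRICT mode it only allows loopback and private-network hosts.

```python
import ipaddress
from urllib.parse import urlparse

def call_allowed(url: str, privacy_mode: str) -> bool:
    """Reject non-internal endpoints when privacy_mode is STRICT."""
    if privacy_mode != "STRICT":
        return True
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # public hostnames (e.g. api.openai.com) are rejected
    return addr.is_private or addr.is_loopback
```

With a guard like this in the request path, a misconfigured `openai` provider would fail fast instead of silently sending data outside the network.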
## Cost Comparison

OpenAI API (cloud):

- 500K messages/week: ~$364/month
- 1M messages/week: ~$728/month

Local deployment:

- Hardware: 1x GPU (16 GB), ~$500-1000/month (cloud rental) or a one-time purchase
- Electricity: ~$50-100/month
- API costs: $0

Break-even: ~2-3 months for 500K messages/week, ~1-2 months for 1M messages/week.
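The break-even arithmetic can be made explicit. The monthly figures come from the text above; the $800 one-time GPU price used below is an assumption for illustration only:

```python
def break_even_months(api_per_month: float, hardware_once: float,
                      electricity_per_month: float) -> float:
    """Months until the one-time hardware cost is recovered by API savings."""
    monthly_savings = api_per_month - electricity_per_month
    return hardware_once / monthly_savings

# 500K messages/week: ~$364/month API vs ~$100/month electricity
months = break_even_months(364.0, 800.0, 100.0)  # about 3 months
```

Higher volume shortens the payback: at ~$728/month of avoided API spend, the same hardware pays for itself in well under two months.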
## Performance

Local models are typically 2-5x slower than the OpenAI API, but:
- No API rate limits
- No data leaves your network
- Predictable costs
- Full control over models and versions
## Privacy

With the `local` provider:
- No external API calls - all processing happens on your infrastructure
- No data transmission - messages never leave your network
- Full control - you own the models and data
Set `privacy_mode: STRICT` for additional safeguards.