Nick edited this page Nov 21, 2025 · 1 revision

On-Premise Deployment

PATAS supports fully on-premise deployment with local LLM and embedding models.

Overview

PATAS uses two separate engines:

  1. Embedding Engine - for semantic similarity and clustering
  2. LLM Engine - for pattern explanation and rule generation

Both engines support openai (cloud) and local (on-premise) providers.

Local Models

Recommended Embedding Models

  • BAAI/bge-m3 - Multilingual, 568M parameters, good for mixed languages
  • intfloat/e5-large-v2 - English-focused, 335M parameters
  • BAAI/bge-large-en-v1.5 - English, 335M parameters

Requirements:

  • 2-4 GB GPU memory
  • HTTP endpoint (vLLM, TGI, or custom server)

Recommended LLM Models

  • mistralai/Mistral-7B-Instruct-v0.2 - 7B parameters, good balance
  • meta-llama/Llama-3.1-8B-Instruct - 8B parameters, strong performance
  • mistralai/Mistral-7B-Instruct-v0.1 - Alternative Mistral version

Requirements:

  • 8-16 GB GPU memory (quantized: 4-8 GB)
  • HTTP endpoint (vLLM, TGI, Ollama)

Configuration

Example: Local Models

# Embedding Engine
embedding_provider: local
embedding_model: "BAAI/bge-m3"
embedding_base_url: "http://localhost:8000/v1"  # Your local embedding service
embedding_api_key: ""  # Optional, not required for local

# LLM Engine
llm_provider: local
llm_model: "mistralai/Mistral-7B-Instruct-v0.2"
llm_base_url: "http://localhost:8000/v1"  # Your local LLM service
llm_api_key: ""  # Optional, not required for local

HTTP Endpoint Format

Local model servers must expose an OpenAI-compatible HTTP API:

Embeddings:

POST /v1/embeddings
{
  "model": "BAAI/bge-m3",
  "input": ["text1", "text2", ...]
}

LLM:

POST /v1/chat/completions
{
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "messages": [...],
  "response_format": {"type": "json_object"}
}
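
Any server that implements this wire format works. As a sanity check, the two request bodies above can be constructed and POSTed with nothing but the standard library; the base URL below is an example matching the configuration section:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # example local endpoint

def embedding_payload(model: str, texts: list[str]) -> dict:
    # Body for POST /v1/embeddings
    return {"model": model, "input": texts}

def chat_payload(model: str, messages: list[dict]) -> dict:
    # Body for POST /v1/chat/completions; JSON mode as used for rule generation
    return {
        "model": model,
        "messages": messages,
        "response_format": {"type": "json_object"},
    }

def post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With a server running:
# post("/embeddings", embedding_payload("BAAI/bge-m3", ["text1", "text2"]))
```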

Deployment Options

Option 1: vLLM Server

# Start embedding server
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-m3 \
  --port 8000

# Start LLM server
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8001

Config:

embedding_base_url: "http://localhost:8000/v1"
llm_base_url: "http://localhost:8001/v1"
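
Before pointing PATAS at these ports, it is worth confirming each server reports the expected model via GET /v1/models. A small sketch; the response shape is the standard OpenAI model list, and the URLs match the config above:

```python
import json
import urllib.request

def model_ids(listing: dict) -> list[str]:
    # /v1/models returns {"object": "list", "data": [{"id": "..."}, ...]}
    return [entry["id"] for entry in listing["data"]]

def served_models(base_url: str) -> list[str]:
    with urllib.request.urlopen(base_url + "/models") as resp:
        return model_ids(json.load(resp))

# With the servers running:
# served_models("http://localhost:8000/v1")  # expect ["BAAI/bge-m3"]
# served_models("http://localhost:8001/v1")
```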

Option 2: Text Generation Inference (TGI)

# Start TGI server
docker run -p 8000:80 \
  -v /path/to/models:/models \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2

Option 3: Ollama

# Install models
ollama pull mistral:7b
ollama pull bge-m3

# Ollama exposes OpenAI-compatible API on port 11434

Config (note that the model names are the Ollama tags pulled above, not Hugging Face IDs):

embedding_model: "bge-m3"
embedding_base_url: "http://localhost:11434/v1"
llm_model: "mistral:7b"
llm_base_url: "http://localhost:11434/v1"
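
A quick way to verify Ollama's endpoint is a direct embeddings call; the model name is the Ollama tag pulled above, and the parsing assumes the standard OpenAI embeddings response shape:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1"

def extract_vectors(response: dict) -> list[list[float]]:
    # /v1/embeddings returns {"data": [{"index": 0, "embedding": [...]}, ...]}
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

def embed(texts: list[str]) -> list[list[float]]:
    body = json.dumps({"model": "bge-m3", "input": texts}).encode()
    req = urllib.request.Request(
        OLLAMA_URL + "/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_vectors(json.load(resp))

# With Ollama running:
# vectors = embed(["hello world"])
```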

Air-Gapped Deployment

For completely isolated environments:

  1. Download models to local storage
  2. Deploy model servers within air-gapped network
  3. Configure PATAS with internal endpoints
  4. Set privacy_mode: STRICT to disable all external calls
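
Step 1 can be scripted on a connected machine, then the resulting directory copied into the air-gapped network. A sketch assuming Hugging Face-hosted models and the huggingface_hub package; the mirror path is an example:

```python
from pathlib import Path

MODELS = ["BAAI/bge-m3", "mistralai/Mistral-7B-Instruct-v0.2"]
MIRROR_ROOT = Path("/srv/models")  # example local storage path

def local_dir(repo_id: str) -> Path:
    # One directory per repo, e.g. /srv/models/BAAI__bge-m3
    return MIRROR_ROOT / repo_id.replace("/", "__")

# On the connected machine (requires the huggingface_hub package):
# from huggingface_hub import snapshot_download
# for repo in MODELS:
#     snapshot_download(repo_id=repo, local_dir=str(local_dir(repo)))
```

Model servers can then load these paths directly (e.g. vLLM's --model accepts a local directory), and the internal endpoints go into the PATAS config in step 3.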

Cost Comparison

OpenAI (Cloud)

  • 500K messages/week: ~$364/month
  • 1M messages/week: ~$728/month

Local Models (On-Premise)

  • Hardware: 1x GPU (16GB) ~$500-1000/month (cloud) or one-time purchase
  • Electricity: ~$50-100/month
  • API costs: $0

Break-even: ~2-3 months for 500K messages/week, ~1-2 months for 1M messages/week
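
The break-even figures follow from the numbers above: the only recurring local cost is electricity, so each month on-premise saves roughly the cloud bill minus power. A sketch with an assumed one-time hardware price (~$800 for a 16 GB GPU) and ~$75/month electricity, both within the ranges listed:

```python
def breakeven_months(hardware_cost: float, cloud_monthly: float,
                     electricity_monthly: float) -> float:
    # Months until the one-time hardware cost is recovered by monthly savings
    monthly_saving = cloud_monthly - electricity_monthly
    return hardware_cost / monthly_saving

# 500K messages/week: ~$364/month cloud
print(round(breakeven_months(800, 364, 75), 1))  # 2.8 months

# 1M messages/week: ~$728/month cloud
print(round(breakeven_months(800, 728, 75), 1))  # 1.2 months
```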

Performance

Local models are typically 2-5x slower than OpenAI API, but:

  • No API rate limits
  • No data leaves your network
  • Predictable costs
  • Full control over models and versions

Privacy Guarantees

With local provider:

  • No external API calls - all processing happens on your infrastructure
  • No data transmission - messages never leave your network
  • Full control - you own the models and data

Set privacy_mode: STRICT for additional safeguards.
