# On Premise Deployment
Nick edited this page Nov 21, 2025 · 1 revision
PATAS supports fully on-premise deployment with local LLM and embedding models.
PATAS uses two separate engines:
- Embedding Engine - for semantic similarity and clustering
- LLM Engine - for pattern explanation and rule generation
Both engines support `openai` (cloud) and `local` (on-premise) providers.
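The provider split can be sketched as a small config resolver. This is a hypothetical illustration of the idea, not PATAS's actual internals; the names `EngineConfig` and `resolve_base_url` are invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class EngineConfig:
    provider: str          # "openai" or "local"
    model: str
    base_url: str = ""     # required for the local provider
    api_key: str = ""      # required for openai, optional for local

def resolve_base_url(cfg: EngineConfig) -> str:
    """Return the endpoint an engine should call for this config."""
    if cfg.provider == "openai":
        return "https://api.openai.com/v1"
    if cfg.provider == "local":
        if not cfg.base_url:
            raise ValueError("local provider requires base_url")
        return cfg.base_url
    raise ValueError(f"unknown provider: {cfg.provider}")
```

Both engines accept the same two provider values, so the same resolution logic applies to the embedding and LLM configuration independently.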
## Embedding Models

Recommended local embedding models:

- BAAI/bge-m3 - multilingual, 568M parameters, good for mixed languages
- intfloat/e5-large-v2 - English-focused, 335M parameters
- BAAI/bge-large-en-v1.5 - English, 335M parameters
Requirements:
- 2-4 GB GPU memory
- HTTP endpoint (vLLM, TGI, or custom server)
## LLM Models

Recommended local LLM models:

- mistralai/Mistral-7B-Instruct-v0.2 - 7B parameters, good balance
- meta-llama/Llama-3.1-8B-Instruct - 8B parameters, strong performance
- mistralai/Mistral-7B-Instruct-v0.1 - alternative Mistral version
Requirements:
- 8-16 GB GPU memory (quantized: 4-8 GB)
- HTTP endpoint (vLLM, TGI, Ollama)
## Configuration

```yaml
# Embedding Engine
embedding_provider: local
embedding_model: "BAAI/bge-m3"
embedding_base_url: "http://localhost:8000/v1"  # Your local embedding service
embedding_api_key: ""  # Optional, not required for local

# LLM Engine
llm_provider: local
llm_model: "mistralai/Mistral-7B-Instruct-v0.2"
llm_base_url: "http://localhost:8000/v1"  # Your local LLM service
llm_api_key: ""  # Optional, not required for local
```

## API Requirements

Local models must expose an OpenAI-compatible HTTP API:
Embeddings:

```
POST /v1/embeddings
{
  "model": "BAAI/bge-m3",
  "input": ["text1", "text2", ...]
}
```

LLM:

```
POST /v1/chat/completions
{
  "model": "mistralai/Mistral-7B-Instruct-v0.2",
  "messages": [...],
  "response_format": {"type": "json_object"}
}
```
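The request bodies above can be assembled as plain dictionaries before sending. A minimal sketch — the function names are hypothetical; only the wire format comes from the endpoint shapes above:

```python
def embedding_request(model: str, texts: list[str]) -> dict:
    """Body for POST /v1/embeddings (OpenAI-compatible)."""
    return {"model": model, "input": texts}

def chat_request(model: str, messages: list[dict],
                 json_output: bool = True) -> dict:
    """Body for POST /v1/chat/completions (OpenAI-compatible)."""
    body = {"model": model, "messages": messages}
    if json_output:
        # Ask the server to constrain output to valid JSON, as PATAS
        # does for rule generation.
        body["response_format"] = {"type": "json_object"}
    return body
```

Because the API is OpenAI-compatible, the same payloads work unchanged against both the cloud and local providers; only the base URL differs.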
## vLLM

```bash
# Start embedding server
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-m3 \
  --port 8000

# Start LLM server
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8001
```

Config:

```yaml
embedding_base_url: "http://localhost:8000/v1"
llm_base_url: "http://localhost:8001/v1"
```
## TGI

```bash
# Start TGI server
docker run -p 8000:80 \
  -v /path/to/models:/models \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2
```

## Ollama

```bash
# Install models
ollama pull mistral:7b
ollama pull bge-m3

# Ollama exposes an OpenAI-compatible API on port 11434
```

Config:

```yaml
embedding_base_url: "http://localhost:11434/v1"
llm_base_url: "http://localhost:11434/v1"
```

## Air-Gapped Deployment

For completely isolated environments:
- Download models to local storage
- Deploy model servers within air-gapped network
- Configure PATAS with internal endpoints
- Set `privacy_mode: STRICT` to disable all external calls
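As an illustration of what "disable all external calls" could mean in practice, here is a hedged sketch of a network guard. The function and its policy are hypothetical, not PATAS's actual implementation: under STRICT mode it only allows loopback and private-network hosts.

```python
import ipaddress
from urllib.parse import urlparse

def call_allowed(url: str, privacy_mode: str) -> bool:
    """Reject non-internal endpoints when privacy_mode is STRICT."""
    if privacy_mode != "STRICT":
        return True
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # public hostnames (e.g. api.openai.com) are rejected
    return addr.is_private or addr.is_loopback
```

With a guard like this in the request path, a misconfigured `openai` provider would fail fast instead of silently sending data outside the network.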
## Cost Comparison

OpenAI API (cloud):

- 500K messages/week: ~$364/month
- 1M messages/week: ~$728/month

Local deployment:

- Hardware: 1x GPU (16 GB), ~$500-1000/month (cloud rental) or a one-time purchase
- Electricity: ~$50-100/month
- API costs: $0

Break-even: ~2-3 months for 500K messages/week, ~1-2 months for 1M messages/week.
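The break-even arithmetic can be made explicit. The monthly figures come from the text above; the $800 one-time GPU price used below is an assumption for illustration only:

```python
def break_even_months(api_per_month: float, hardware_once: float,
                      electricity_per_month: float) -> float:
    """Months until the one-time hardware cost is recovered by API savings."""
    monthly_savings = api_per_month - electricity_per_month
    return hardware_once / monthly_savings

# 500K messages/week: ~$364/month API vs ~$100/month electricity
months = break_even_months(364.0, 800.0, 100.0)  # about 3 months
```

Higher volume shortens the payback: at ~$728/month of avoided API spend, the same hardware pays for itself in well under two months.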
## Performance

Local models are typically 2-5x slower than the OpenAI API, but:
- No API rate limits
- No data leaves your network
- Predictable costs
- Full control over models and versions
## Privacy

With the `local` provider:
- No external API calls - all processing happens on your infrastructure
- No data transmission - messages never leave your network
- Full control - you own the models and data
Set `privacy_mode: STRICT` for additional safeguards.