
Local Model Integration

Nick edited this page Mar 10, 2026 · 2 revisions

PATAS now supports local HTTP-based engines for both embeddings and LLM inference, enabling on-premise deployments without pulling heavy ML frameworks into the PATAS core.

Architecture

PATAS uses two separate engines that can be configured independently:

  1. Embedding Engine – generates semantic embeddings for message similarity analysis, clustering, and pattern discovery
  2. LLM Engine – performs pattern explanation, rule generation, and LLM-based validation

Each engine supports two provider modes:

  • openai (default) – uses OpenAI's managed API services
  • local – uses on-premise models via HTTP endpoints

Local HTTP Embedding Engine

The LocalHttpEmbeddingEngine calls a local/self-hosted HTTP endpoint for embedding generation.

Endpoint Contract

  • URL: {base_url}/embeddings
  • Method: POST
  • Request:
    {
      "model": "<model_identifier>",
      "inputs": ["text1", "text2", ...]
    }
  • Response:
    {
      "embeddings": [
        [0.1, 0.2, ...],
        [0.3, 0.4, ...]
      ]
    }
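The contract above can be sketched as a pair of small helpers: one builds the request body, the other parses a response. This is an illustrative sketch, not PATAS code; the function names are hypothetical.

```python
import json

def build_embedding_request(model, texts):
    # Request body matching the /embeddings contract: a model identifier
    # plus a list of input texts.
    return {"model": model, "inputs": list(texts)}

def parse_embedding_response(body):
    # The response carries one vector per input text, in the same order.
    return json.loads(body)["embeddings"]

payload = build_embedding_request("BAAI/bge-m3", ["text1", "text2"])
vectors = parse_embedding_response('{"embeddings": [[0.1, 0.2], [0.3, 0.4]]}')
```

Because the order of `embeddings` mirrors the order of `inputs`, callers can zip the two lists back together without extra bookkeeping.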

Configuration

embedding_provider = "local"
embedding_model = "BAAI/bge-m3"  # or your model identifier
embedding_base_url = "http://localhost:8080/v1"
embedding_api_key = ""  # optional
embedding_timeout_seconds = 30.0

Features

  • Automatic batching (default: 512 texts per batch)
  • Embedding cache support (same as OpenAI engine)
  • Error handling with logging
  • Optional API key authentication
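The automatic batching behavior can be illustrated with a minimal chunking helper (a sketch, assuming the documented default of 512 texts per batch; not the engine's actual implementation):

```python
def batches(texts, batch_size=512):
    # Split the input list into consecutive chunks of at most batch_size
    # texts, matching the engine's default batch size of 512.
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]
```

For example, 1025 texts would be sent as three requests: two full batches of 512 and a final batch of 1.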

Local HTTP LLM Engine

The LocalHttpPatternMiningEngine calls a local/self-hosted HTTP endpoint for LLM inference.

Endpoint Contract

  • URL: {base_url}/chat/completions
  • Method: POST
  • Request (OpenAI-compatible):
    {
      "model": "<model_identifier>",
      "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."}
      ],
      "max_tokens": 1500,
      "temperature": 0.0
    }
  • Response (OpenAI-compatible):
    {
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "{...}"
          }
        }
      ]
    }
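The request and response shapes above can likewise be sketched as helpers that assemble an OpenAI-compatible body and extract the assistant's reply. The function names are illustrative, not part of the PATAS API.

```python
import json

def build_chat_request(model, system_prompt, user_prompt, max_tokens=1500):
    # OpenAI-compatible chat body; temperature is pinned to 0.0, as the
    # engine does, to keep outputs as reproducible as possible.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }

def extract_content(response_body):
    # Pull the assistant message text out of an OpenAI-compatible response.
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

The `content` field typically carries a JSON string (pattern explanation, generated rule, etc.), which the caller parses in a second step.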

Configuration

llm_provider = "local"
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"  # or your model identifier
llm_base_url = "http://localhost:8000/v1"
llm_api_key = ""  # optional
llm_timeout_seconds = 30.0

Features

  • OpenAI-compatible API format
  • Fixed temperature of 0.0 (greedy decoding) for reproducible outputs
  • Same prompt builder as OpenAI engine
  • Error handling with logging
  • Optional API key authentication

Recommended Local Models

For on-premise deployments, the following models are recommended as good defaults:

  • Embeddings: BAAI/bge-m3 – multilingual embedding model, strong performance for RU/UK/EN spam and abuse logs, well-suited for semantic clustering tasks.

  • LLM: mistralai/Mistral-7B-Instruct-v0.2 – compact 7B parameter model, Apache 2.0 license, performs well at structured JSON generation and SQL-style filter generation at low temperature settings.

These models are not required by PATAS, but they provide a solid starting point for on-premise deployments.

Integration with Inference Stacks

The local HTTP engines are designed to work with common inference stacks:

  • vLLM: OpenAI-compatible API server
  • TGI (Text Generation Inference): HuggingFace's inference server
  • Ollama: Local model serving
  • Custom endpoints: Any OpenAI-compatible HTTP endpoint

The exact wiring to your inference stack is intentionally left to the integrator. PATAS only requires that the endpoints follow the contract described above.
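Since every stack listed above speaks the same HTTP contract, a client only needs to build a POST against `{base_url}` plus the endpoint path, attaching the key header when one is configured. A minimal stdlib sketch (the helper name is hypothetical):

```python
import json
import urllib.request

def make_request(base_url, path, payload, api_key=""):
    # Build an HTTP POST against a local OpenAI-compatible endpoint.
    # The Authorization header is only attached when an API key is set,
    # mirroring the optional-authentication behavior described above.
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )

req = make_request("http://localhost:8000/v1", "/chat/completions", {"model": "m"})
```

The same helper works for both engines; only the path (`/embeddings` vs. `/chat/completions`) and the payload differ.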

Example Configuration

Environment Variables (.env)

# Embedding Engine (Local)
EMBEDDING_PROVIDER=local
EMBEDDING_MODEL=BAAI/bge-m3
EMBEDDING_BASE_URL=http://localhost:8080/v1
EMBEDDING_TIMEOUT_SECONDS=30.0

# LLM Engine (Local)
LLM_PROVIDER=local
LLM_MODEL=mistralai/Mistral-7B-Instruct-v0.2
LLM_BASE_URL=http://localhost:8000/v1
LLM_TIMEOUT_SECONDS=30.0

Python Config

from app.config import settings

settings.embedding_provider = "local"
settings.embedding_model = "BAAI/bge-m3"
settings.embedding_base_url = "http://localhost:8080/v1"

settings.llm_provider = "local"
settings.llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
settings.llm_base_url = "http://localhost:8000/v1"

Fallback Behavior

  • If provider="local" and base_url is provided → uses HTTP-based local engine
  • If provider="local" and base_url is not provided:
    • Embedding engine falls back to LocalEmbeddingEngine (sentence-transformers)
    • LLM engine returns None with a warning
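The fallback rules above amount to a small selection function. This sketch uses illustrative string names for the engines; only `LocalHttpEmbeddingEngine` and `LocalEmbeddingEngine` come from this page, and the non-local branch is an assumption.

```python
def select_embedding_engine(provider, base_url):
    # Mirrors the documented fallback: "local" with a base_url uses the
    # HTTP engine; "local" without one falls back to the in-process
    # sentence-transformers engine.
    if provider == "local":
        return "LocalHttpEmbeddingEngine" if base_url else "LocalEmbeddingEngine"
    return "OpenAIEmbeddingEngine"  # assumed default for provider="openai"
```

Note the asymmetry with the LLM side: the embedding engine always has an in-process fallback, while the LLM engine has none and returns None with a warning.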

Security Notes

  • Never commit API keys or secrets to version control
  • Use environment variables or secret management systems
  • For on-premise deployments, ensure inference endpoints are properly secured and authenticated
  • In PRIVACY_MODE=STRICT, external LLM providers are disabled by default unless explicitly configured to internal endpoints

Testing

Comprehensive tests are available:

  • tests/test_v2_embedding_engine_local_http.py – Local HTTP embedding engine tests
  • tests/test_v2_llm_engine_local_http.py – Local HTTP LLM engine tests

Tests cover:

  • Successful embedding/pattern generation
  • Batching for large inputs
  • HTTP error handling
  • Malformed response handling
  • Authentication with API keys
