Technology Decisions & Extension Options

Architectural decisions discussed so far, with their rationale.

T-1: Vision Language Models (VLM) alongside Ollama

Question

Can we connect a local VLM (Vision Language Model) in addition to Ollama for multimodal content (images, diagrams, PDFs with graphics)?

Answer: Yes — Ollama itself supports VLMs

Ollama can natively host vision models. No separate service needed:

Model	Strength	RAM requirement
`llava:7b`	General purpose, good balance	~6 GB
`llama3.2-vision`	Text recognition in images	~8 GB (11B variant)
`moondream`	Very fast, compact (moondream2)	~2 GB
`llava:34b`	High quality, slow	~25 GB

Conceptual distinction (important)

Image/Diagram
    ↓
VLM (llava)          → generates text description: "Architecture diagram with 3 services..."
    ↓
Embedding (nomic-embed-text)  → Vector [0.12, -0.34, ...]
    ↓
Qdrant

VLM ≠ multimodal embedding. The VLM describes the image in text, which is then embedded normally. You search semantically over the description, not directly over pixels.

Alternative: CLIP-based image embeddings (image directly → vector). Advantage: faster. Disadvantage: separate embedding service, different vector dimension, separate Qdrant collection. Not recommended for this project.

Implementation

Ingestion pipeline (new VLM module):

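import base64
import os

import httpx

OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434")  # Ollama's default port
VLM_MODEL = os.environ.get("VLM_MODEL", "llava:7b")   # env: VLM_MODEL=llava:7b
http = httpx.AsyncClient(timeout=120.0)                # shared client; vision calls can be slow

async def describe_image(image_bytes: bytes, mime_type: str = "") -> str:
    """Sends image to Ollama VLM, returns text description (mime_type is not needed by /api/generate)."""
    resp = await http.post(f"{OLLAMA_URL}/api/generate", json={
        "model": VLM_MODEL,
        "prompt": "Describe this image precisely in English...",
        "images": [base64.b64encode(image_bytes).decode()],  # plain base64, no data: prefix
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["response"]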

Integrated in the adapter layer (→ T-4) when processing:

PDF pages with graphics (via pymupdf)
Standalone image files
DOCX/PPTX with embedded diagrams

Configuration (.env):

VLM_ENABLED=false          # default off, opt-in
VLM_MODEL=llava:7b
VLM_MAX_IMAGE_SIZE_MB=10
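
A minimal sketch of how the three variables above could be read and enforced before any VLM call. The names `VLMSettings`, `load_vlm_settings`, and `should_describe` are hypothetical helpers, not part of the existing codebase; the defaults mirror the values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VLMSettings:
    enabled: bool
    model: str
    max_image_bytes: int


def load_vlm_settings(env: Optional[Mapping[str, str]] = None) -> VLMSettings:
    """Read the VLM_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    enabled = env.get("VLM_ENABLED", "false").strip().lower() in {"1", "true", "yes"}
    max_mb = float(env.get("VLM_MAX_IMAGE_SIZE_MB", "10"))
    return VLMSettings(
        enabled=enabled,
        model=env.get("VLM_MODEL", "llava:7b"),
        max_image_bytes=int(max_mb * 1024 * 1024),
    )


def should_describe(image_bytes: bytes, settings: VLMSettings) -> bool:
    """Skip the VLM call when it is disabled, the image is empty, or it is too large."""
    return settings.enabled and 0 < len(image_bytes) <= settings.max_image_bytes
```

Checking the size before base64-encoding keeps oversized scans from ever reaching the Ollama request body.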

Ollama container adjustment

No additional container. Model needs to be pulled once:

docker exec pb-ollama ollama pull llava:7b

T-2: Git Server Support (Forgejo as an example)

Question

Is Forgejo a fixed requirement, or do all common Git-based servers work?

Answer: Forgejo is an example — all common Git servers work

The project has two integration points with Git servers:

Integration point 1: OPA bundle polling

OPA polls policies as an HTTP bundle. This is purely URL-based — any server that serves a .tar.gz over HTTP works:

# OPA config — server-agnostic:
services:
  git-server:
    url: ${GIT_SERVER_URL}
    credentials:
      bearer:
        token: ${GIT_TOKEN}
bundles:
  pb:
    service: git-server
    resource: /api/v1/repos/org/pb-policies/raw/bundle.tar.gz # adjust path
    polling:
      min_delay_seconds: 10

Server	Bundle URL schema
Forgejo/Gitea	`/api/v1/repos/{org}/{repo}/raw/{file}`
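GitHub	`/{org}/{repo}/{branch}/{file}` on raw.githubusercontent.com, or `/repos/{org}/{repo}/contents/{file}` on api.github.com with a raw media type in the `Accept` header
GitLab	`/api/v4/projects/{id}/repository/files/{file}/raw?ref={branch}` (`{file}` URL-encoded)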
Bitbucket Cloud	`/2.0/repositories/{ws}/{repo}/src/{branch}/{file}`
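
The URL schemas in the table can be folded into one helper that produces the `resource` value for the OPA config. This is a sketch under assumptions: `bundle_resource` is a hypothetical function, the GitHub branch assumes raw.githubusercontent.com as the service URL (path `/{org}/{repo}/{branch}/{file}`), and the GitLab branch assumes the project is addressed by its URL-encoded `org/repo` path when no numeric ID is given.

```python
from urllib.parse import quote


def bundle_resource(server_type: str, org: str, repo: str, file: str,
                    branch: str = "main", project_id: str = "") -> str:
    """Return the path for OPA's `bundles.<name>.resource` for a given server type."""
    path = quote(file)  # keeps "/" so nested paths stay readable
    if server_type in ("forgejo", "gitea"):
        return f"/api/v1/repos/{org}/{repo}/raw/{path}"
    if server_type == "github":
        # Assumption: the service URL points at raw.githubusercontent.com.
        return f"/{org}/{repo}/{branch}/{path}"
    if server_type == "gitlab":
        # GitLab expects project and file path each URL-encoded as one segment.
        project = quote(project_id or f"{org}/{repo}", safe="")
        return f"/api/v4/projects/{project}/repository/files/{quote(file, safe='')}/raw?ref={branch}"
    if server_type == "bitbucket":
        return f"/2.0/repositories/{org}/{repo}/src/{branch}/{path}"
    raise ValueError(f"unsupported server type: {server_type!r}")
```

Failing loudly on an unknown server type avoids OPA silently polling a path that can never resolve.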

Integration point 2: Code ingestion (reading repo contents)

Currently uses Forgejo API paths. With the adapter layer (→ T-4) this is abstracted.

Recommended configuration

# Instead of FORGEJO_URL/FORGEJO_TOKEN:
GIT_SERVER_TYPE=forgejo   # forgejo | github | gitlab | bitbucket
GIT_SERVER_URL=https://git.intern.example.com
GIT_TOKEN=...
GIT_ORG=pb-org

The Git adapter (→ T-4) translates to the respective API dialect.
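
A sketch of how the settings above might be loaded and validated. `GitSettings` and `load_git_settings` are hypothetical names, and the fallback to the legacy `FORGEJO_URL`/`FORGEJO_TOKEN` variables is an assumption about a possible migration path, not documented behavior.

```python
from dataclasses import dataclass
from typing import Mapping

SUPPORTED_TYPES = ("forgejo", "gitea", "github", "gitlab", "bitbucket")


@dataclass(frozen=True)
class GitSettings:
    server_type: str
    url: str
    token: str
    org: str


def load_git_settings(env: Mapping[str, str]) -> GitSettings:
    """Build GitSettings from GIT_* variables, accepting legacy FORGEJO_* names."""
    server_type = env.get("GIT_SERVER_TYPE", "forgejo").strip().lower()
    if server_type not in SUPPORTED_TYPES:
        raise ValueError(f"GIT_SERVER_TYPE must be one of {SUPPORTED_TYPES}, got {server_type!r}")
    # Assumption: old deployments may still set FORGEJO_URL / FORGEJO_TOKEN.
    url = env.get("GIT_SERVER_URL") or env.get("FORGEJO_URL", "")
    token = env.get("GIT_TOKEN") or env.get("FORGEJO_TOKEN", "")
    if not url:
        raise ValueError("GIT_SERVER_URL is not set")
    return GitSettings(server_type, url.rstrip("/"), token, env.get("GIT_ORG", ""))
```

Validating the server type at startup surfaces a typo in `.env` before the first ingestion run rather than during it.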

T-3: Monitoring — External OTel Collector, optional Grafana stack

Question

Can monitoring be offered via an external OpenTelemetry Collector, making the local Grafana stack optional?

Answer: Yes — this is actually the recommended architecture

The OpenTelemetry Collector is the standardized routing layer between services and observability backends:

Services (MCP, Reranker, Ingestion)
    │ OTLP (gRPC :4317)
    ▼
┌─────────────────────────────┐
│  OTel Collector             │
│  ├─ Receiver: OTLP          │
│  ├─ Processor: batch, attrs │
│  └─ Exporters:              │
│     ├─ Tempo (local)        │  ← optional
│     ├─ Grafana Cloud        │  ← or external
│     ├─ Datadog              │  ← or external
│     └─ Jaeger               │  ← or external
└─────────────────────────────┘

Services only know OTLP_ENDPOINT, which points at the collector; they are backend-agnostic.

Docker Compose profiles

# docker-compose.yml
services:
  otel-collector:          # always active (lightweight)
    image: otel/opentelemetry-collector-contrib:latest
    profiles: []           # no profile = always started
    ...

  prometheus:              # optional
    profiles: ["monitoring-local"]
    ...

  grafana:                 # optional
    profiles: ["monitoring-local"]
    ...

  tempo:                   # optional
    profiles: ["monitoring-local"]
    ...

Operation with local stack:

docker compose --profile monitoring-local up -d

Operation with external backend (e.g. Grafana Cloud):

# .env (read by the collector's otlp/external exporter):
OTEL_EXTERNAL_ENDPOINT=https://otlp-gateway.grafana.net/otlp
OTEL_EXTERNAL_AUTH=Basic ...
docker compose up -d   # only otel-collector, no local stack

OTel Collector configuration (`monitoring/otel-collector.yml`)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 1s
  resource:
    attributes:
      - key: service.namespace
        value: pb
        action: upsert

exporters:
  # Local Tempo (when monitoring-local profile is active)
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

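  # Optional external backend. It only takes effect once otlp/external is
  # listed under a pipeline's exporters. HTTP-only gateways need the
  # otlphttp exporter instead of otlp (which speaks gRPC).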
  otlp/external:
    endpoint: ${OTEL_EXTERNAL_ENDPOINT:-}
    headers:
      authorization: ${OTEL_EXTERNAL_AUTH:-}

  # Prometheus-compatible metrics (for scraping)
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
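
The collector only activates components that a pipeline references, so a quick consistency check on the configuration can be useful. The sketch below mirrors the component names from the configuration above in a plain dict (no YAML parser needed); `undefined_references` and `unused_exporters` are hypothetical helper names.

```python
COLLECTOR_CONFIG = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}, "resource": {}},
    "exporters": {"otlp/tempo": {}, "otlp/external": {}, "prometheus": {}},
    "service": {"pipelines": {
        "traces": {"receivers": ["otlp"], "processors": ["batch", "resource"],
                   "exporters": ["otlp/tempo"]},
        "metrics": {"receivers": ["otlp"], "processors": ["batch"],
                    "exporters": ["prometheus"]},
    }},
}


def undefined_references(config: dict) -> list:
    """List components that a pipeline names but the config never defines."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline, spec in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            for name in spec.get(kind, []):
                if name not in config.get(kind, {}):
                    problems.append(f"{pipeline}: {kind} entry '{name}' is not defined")
    return problems


def unused_exporters(config: dict) -> list:
    """List exporters that are defined but not attached to any pipeline."""
    pipelines = config.get("service", {}).get("pipelines", {})
    used = {name for spec in pipelines.values() for name in spec.get("exporters", [])}
    return sorted(set(config.get("exporters", {})) - used)
```

Run against the configuration as shown, the second check reports `otlp/external`: the exporter is declared but no pipeline sends data to it yet.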

Prometheus metrics: scraping vs. push

Currently: Prometheus actively scrapes services (/metrics endpoint). Alternative: Services push metrics via OTLP to the collector.

	Scraping (current)	OTLP Push
Prometheus required	Yes	No (collector is sufficient)
Pull model	Yes	No
Firewall-friendly	Less	Yes (collector outbound)
Recommendation	Simpler for local operation	Better for external backends

Recommendation: Offer both in parallel — scraping remains as fallback, OTLP push for external backends. Services don't need to know both methods: The collector is the only endpoint.

T-4: Adapter layer for additional data sources

Question

Should an adapter layer for additional data sources be implemented? Where should it be positioned — before or after the privacy layer?

Answer: Yes — mandatory before the privacy layer

Rationale: Why BEFORE the PII scanner

External source (PDF, Git, XLSX, API, ...)
    ↓
[ADAPTER LAYER]            ← Normalization: Binary → Text + Metadata
    ↓
[PII Scanner (Presidio)]   ← first layer that sees text
    ↓
[OPA Policy]               ← Classification + purpose check
    ↓
[Embedding (Ollama)]
    ↓
[Qdrant]

The PII scanner can only process text. If PDFs/DOCX were passed directly into the privacy layer, document parsing would have to occur there — which violates the single-responsibility principle. The adapter layer always delivers NormalizedDocument, regardless of the source.
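
The ordering argument can be illustrated with a toy pipeline in which only the adapter ever sees bytes. Everything here is a stand-in: `Doc`, `pdf_adapter`, `pii_scan`, and `policy_check` are hypothetical, the PII stage is a simple regex rather than Presidio, and the classification label is invented for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Doc:
    content: str
    source_type: str
    metadata: dict = field(default_factory=dict)


def pdf_adapter(raw: bytes) -> Doc:
    """Stand-in adapter: a real one would parse the PDF; here the bytes are UTF-8 text."""
    return Doc(content=raw.decode("utf-8"), source_type="pdf")


def pii_scan(doc: Doc) -> Doc:
    """Toy PII stage (regex, not Presidio): masks e-mail addresses."""
    doc.content = EMAIL.sub("<EMAIL>", doc.content)
    return doc


def policy_check(doc: Doc) -> Doc:
    """Toy policy stage: attaches a hypothetical classification label."""
    doc.metadata["classification"] = "internal"
    return doc


def ingest(raw: bytes, adapter: Callable[[bytes], Doc],
           stages: Iterable[Callable[[Doc], Doc]]) -> Doc:
    doc = adapter(raw)        # binary -> text + metadata
    for stage in stages:      # every later stage works on text only
        doc = stage(doc)
    return doc
```

Because the stages take and return the same document type, adding a new source format means writing one adapter and leaves the privacy stages untouched.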

Common abstraction

# ingestion/adapters/base.py

from dataclasses import dataclass, field
from typing import AsyncIterator

@dataclass
class NormalizedDocument:
    content: str                    # Extracted full text
    content_type: str               # "text", "code", "table", "image_description"
    source_ref: str                 # Original URI/path
    source_type: str                # "git", "pdf", "xlsx", "api", ...
    language: str | None = None     # "de", "en", None = unknown
    metadata: dict = field(default_factory=dict)
    # Contains: title, author, created_at, repo, path, etc.
    chunks: list[str] | None = None  # Optional: pre-cut chunks

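class SourceAdapter:
    """Abstract base for all data source adapters."""

    def fetch(self) -> AsyncIterator[NormalizedDocument]:
        # Declared without `async` so that subclasses can override it with an
        # `async def` generator (as GitAdapter does) and still match this type.
        raise NotImplementedError

    async def health_check(self) -> bool:
        raise NotImplementedError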
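
As an illustration of the contract, here is a minimal concrete adapter and a consumer. `InMemoryAdapter` and `collect` are hypothetical and exist only for this example; the dataclass is a trimmed copy of `NormalizedDocument` so the snippet runs on its own.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Dict, List, Optional


@dataclass
class NormalizedDocument:  # trimmed copy of the dataclass above
    content: str
    content_type: str
    source_ref: str
    source_type: str
    language: Optional[str] = None
    metadata: dict = field(default_factory=dict)


class InMemoryAdapter:
    """Hypothetical adapter that serves documents from a dict of path -> text."""

    def __init__(self, files: Dict[str, str]):
        self.files = files

    async def fetch(self) -> AsyncIterator[NormalizedDocument]:
        for path, text in self.files.items():
            yield NormalizedDocument(
                content=text,
                content_type="text",
                source_ref=f"memory://{path}",
                source_type="memory",
                metadata={"path": path},
            )

    async def health_check(self) -> bool:
        return True


async def collect(adapter) -> List[NormalizedDocument]:
    """Drain an adapter; downstream stages would process each document instead."""
    return [doc async for doc in adapter.fetch()]
```

An adapter like this is also handy in tests, since the PII and policy stages can be exercised without any external source.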

Planned adapters

Adapter	Sources	Priority
`GitAdapter`	Forgejo, GitHub, GitLab, Bitbucket, Gitea	High
`FileAdapter`	PDF, DOCX, XLSX, Markdown, TXT	High
`DatabaseAdapter`	PostgreSQL dump, CSV, JSON	Medium
`ConfluenceAdapter`	Confluence REST API	Medium
`WebAdapter`	HTTP/HTML scraping, sitemaps	Low
`KafkaAdapter`	Streaming content	Low
`EmailAdapter`	IMAP, EML files	Low

Git adapter (also resolves T-2)

# ingestion/adapters/git_adapter.py

class GitAdapter(SourceAdapter):
    """Supports Forgejo, Gitea, GitHub, GitLab, Bitbucket."""

    PROVIDERS = {
        "forgejo": ForgejoProvider,
        "gitea":   ForgejoProvider,   # API-compatible
        "github":  GitHubProvider,
        "gitlab":  GitLabProvider,
        "bitbucket": BitbucketProvider,
    }

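    def __init__(self, server_type: str, url: str, token: str,
                 org: str, repo: str, branch: str = "main"):
        self.provider = self.PROVIDERS[server_type](url, token)
        # fetch() below reads these attributes
        self.url, self.org, self.repo, self.branch = url.rstrip("/"), org, repo, branch
        ...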

    async def fetch(self) -> AsyncIterator[NormalizedDocument]:
        async for file in self.provider.list_files(self.org, self.repo):
            content = await self.provider.get_file_content(file.path)
            yield NormalizedDocument(
                content=content,
                content_type=_detect_content_type(file.path),
                source_ref=f"{self.url}/{self.org}/{self.repo}/blob/{file.sha}",
                source_type="git",
                language=_detect_language(file.path),
                metadata={"repo": self.repo, "path": file.path, "sha": file.sha},
            )

Directory structure

ingestion/
├── ingestion_api.py         ← FastAPI app (to be implemented, P0-2)
├── adapters/
│   ├── base.py              ← NormalizedDocument, SourceAdapter
│   ├── git_adapter.py       ← Git server (all providers)
│   ├── file_adapter.py      ← PDF/DOCX/XLSX via pymupdf/python-docx
│   ├── database_adapter.py  ← CSV/JSON/PG dumps
│   └── providers/
│       ├── forgejo.py
│       ├── github.py
│       ├── gitlab.py
│       └── bitbucket.py
├── pii_scanner.py           ← unchanged
├── retention_cleanup.py     ← unchanged
└── snapshot_service.py      ← unchanged

Interaction with VLM (T-1)

The FileAdapter automatically calls the VLM for image content:

# In file_adapter.py:
if page.has_images and VLM_ENABLED:
    for image in page.images:
        description = await vlm.describe_image(image.bytes)
        # description is added to the chunk text
        chunk_text += f"\n[Figure: {description}]"

The VLM is thus transparently integrated into the ingestion pipeline — no separate MCP tool call needed.
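
Pages with several figures would trigger one VLM call per image, so limiting concurrency may help on a single Ollama instance. The sketch below assumes an injected `describe` coroutine (for example a wrapper around `describe_image`); `describe_all`, `append_figures`, and the default limit of 2 are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def describe_all(images: Sequence[bytes],
                       describe: Callable[[bytes], Awaitable[str]],
                       max_concurrency: int = 2) -> List[str]:
    """Describe images with at most `max_concurrency` VLM calls in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(img: bytes) -> str:
        async with sem:
            return await describe(img)

    # gather() returns results in input order, so figures stay in page order.
    return list(await asyncio.gather(*(one(img) for img in images)))


def append_figures(chunk_text: str, descriptions: Sequence[str]) -> str:
    """Append each description in the [Figure: ...] form used above."""
    for description in descriptions:
        chunk_text += f"\n[Figure: {description}]"
    return chunk_text
```

Passing `describe` in as a parameter keeps the helper testable with a fake and independent of whether Ollama or another backend produces the descriptions.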

T-5: vLLM as an alternative to Ollama

Question

Can vLLM replace Ollama, and when does that make sense?

Comparison

	Ollama	vLLM
Primary purpose	Developer experience, local use	Production-grade LLM serving
Throughput	Low under concurrent load	Continuous batching → up to 10–50× higher under concurrent load (workload-dependent)
GPU memory	Standard allocation	PagedAttention → significantly more efficient
Parallelism	Limited (bounded by `OLLAMA_NUM_PARALLEL`)	Very good (dozens simultaneous)
CPU-only	✅ Good	⚠️ Experimental, very slow
API	Ollama-specific + OpenAI-compat.	OpenAI-compatible (native)
VLM support	LLaVA, moondream2, llama3.2-vision	LLaVA, InternVL, Qwen-VL — better batched
Embedding models	Broad (nomic, mxbai, all-minilm ...)	Limited
Multi-GPU	Model split across GPUs, no tensor parallelism	Tensor parallelism, yes
Setup	`ollama pull model`	CUDA + Docker + model configuration

The problem with a direct replacement

vLLM is not a full embedding service. It supports /v1/embeddings, but the embedding model selection is narrower than with Ollama. nomic-embed-text is not guaranteed to be available. For high-performance embeddings there are better dedicated alternatives:

Service	Strength
HuggingFace Text Embeddings Inference (TEI)	Many models, very fast, OpenAI-compat.
infinity-embedding	Lightweight, OpenAI-compat., simple setup
Ollama	Simple, broad model support, CPU-capable

Recommended split

Embeddings:  infinity (Prod/GPU)
LLM/VLM:     vLLM (Prod/GPU)

Both roles configurable via separate endpoints:

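EMBEDDING_PROVIDER_URL=http://infinity:80
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1   # Hugging Face ID; Ollama's short name is nomic-embed-text

LLM_PROVIDER_URL=http://vllm:8000
LLM_MODEL=llava-hf/llava-1.5-7b-hf             # for VLM; vLLM expects a Hugging Face model ID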

Provider abstraction (LLMProvider interface)

Since Ollama (OpenAI-compatible mode), vLLM, HF TEI, and external services all offer an OpenAI-compatible API, a thin abstraction suffices:

# mcp-server/llm_provider.py  and  ingestion/llm_provider.py

import httpx

class LLMProvider:
    """
    Thin abstraction over OpenAI-compatible LLM/embedding endpoints.
    Supports: Ollama, vLLM, HF TEI, infinity, OpenAI (with DPA!).
    """

    def __init__(self, base_url: str, api_key: str = ""):
        self.base_url = base_url.rstrip("/")
        self.headers  = {"Authorization": f"Bearer {api_key}"} if api_key else {}

    async def embed(self, http: httpx.AsyncClient, text: str, model: str) -> list[float]:
        resp = await http.post(
            f"{self.base_url}/v1/embeddings",
            headers=self.headers,
            json={"model": model, "input": text},
        )
        resp.raise_for_status()
        return resp.json()["data"][0]["embedding"]

    async def generate(self, http: httpx.AsyncClient, prompt: str,
                       model: str, images: list[str] | None = None) -> str:
        """Text generation, optionally with images (VLM)."""
        messages = [{"role": "user", "content": prompt}]
        if images:
            # OpenAI Vision format
            messages = [{"role": "user", "content": [
                {"type": "text", "text": prompt},
                *[{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}}
                  for img in images],
            ]}]

        resp = await http.post(
            f"{self.base_url}/v1/chat/completions",
            headers=self.headers,
            json={"model": model, "messages": messages, "stream": False},
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

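Ollama compatibility: Ollama has exposed an OpenAI-compatible chat API under /v1/ since v0.1.24, so generate() works with it without changes. The /v1/embeddings route arrived in a later release, so embed() needs a correspondingly recent Ollama version.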
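
The `generate()` method labels every image as `image/jpeg`. A small helper could derive the MIME type from the file signature instead. This is a sketch, not part of the provider: `sniff_mime` and `build_vision_messages` are hypothetical, and unlike `generate()` they take raw bytes rather than base64 strings.

```python
import base64
from typing import Dict, List

_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def sniff_mime(data: bytes, default: str = "image/jpeg") -> str:
    """Guess the image MIME type from its leading bytes."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return default


def build_vision_messages(prompt: str, images: List[bytes]) -> List[Dict]:
    """Build an OpenAI-style chat message with the prompt and inline images."""
    parts: List[Dict] = [{"type": "text", "text": prompt}]
    for data in images:
        encoded = base64.b64encode(data).decode("ascii")
        url = f"data:{sniff_mime(data)};base64,{encoded}"
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return [{"role": "user", "content": parts}]
```

The returned list can be passed as `messages` in the `/v1/chat/completions` payload; whether a given backend rejects a mislabelled image is backend-specific, so treat the sniffing as a precaution.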

When to choose vLLM

More than 3–5 concurrent agents (parallelism bottleneck with Ollama)
GPU available (vLLM is optimized for CUDA, CPU is very slow)
VLM under load (multiple concurrent image descriptions in ingestion batch)
Models > 13B parameters (PagedAttention saves critical GPU RAM)

Docker Compose extension

# ── vLLM (optional, replaces Ollama for LLM/VLM) ──────────
vllm:
  image: vllm/vllm-openai:latest
  container_name: pb-vllm
  profiles: ["gpu"]
  ports:
    - "8000:8000"
  volumes:
    - vllm_models:/root/.cache/huggingface
  environment:
    HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN:-}
  command:
    - "--model"
    - "${VLLM_MODEL:-llava-hf/llava-1.5-7b-hf}"
    - "--dtype"
    - "bfloat16"
    - "--max-model-len"
    - "4096"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  networks:
    - pb-net
  restart: unless-stopped

# ── HF Text Embeddings Inference (optional) ──────────────
tei:
  image: ghcr.io/huggingface/text-embeddings-inference:latest
  container_name: pb-tei
  profiles: ["gpu"]
  ports:
    - "8010:80"
  volumes:
    - tei_models:/data
  command:
    - "--model-id"
    - "nomic-ai/nomic-embed-text-v1"
    - "--port"
    - "80"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  networks:
    - pb-net
  restart: unless-stopped

Operation with GPU stack:

docker compose --profile gpu up -d
# .env:
# EMBEDDING_PROVIDER_URL=http://tei:80
# LLM_PROVIDER_URL=http://vllm:8000
# LLM_MODEL=llava-hf/llava-1.5-7b-hf

Summary of recommendations

Topic	Dev (CPU)	Prod (GPU)	Effort
Embeddings	Ollama	HF TEI or infinity	Small (provider abstraction)
LLM/VLM	Ollama	vLLM	Medium (Docker profile)
VLM integration	Ollama `llava:7b`	vLLM with LLaVA/InternVL	Small
Git server	Adapter layer with provider abstraction	← same	Medium
Monitoring	OTel Collector + local Grafana stack	OTel Collector → external backend	Medium
Adapter layer	Before PII scanner, `NormalizedDocument`	← same	Large

The provider abstraction (T-5) and the adapter layer (T-4) are the two strategically most important extensions: T-5 decouples the inference backend choice from the code, T-4 decouples the data sources from the privacy core. Both enable the dev environment (CPU, Ollama) and prod (GPU, vLLM/TEI) to use the same codebase.
