🔧 Development Guide

Advanced reference for AI Cortex contributors and maintainers. This guide covers the internal architecture, testing strategy, CI/CD pipeline, and everything else you need to understand the codebase deeply and contribute effectively.

📋 Table of Contents

Architecture Overview
Internal API Design
Streaming Architecture
Model Management
Tool System
Type System
Testing Strategy
Performance Optimization
Error Handling
Build and Distribution
CI/CD Pipeline
Security Considerations
Future Enhancements

🏗️ Architecture Overview

Package Structure

aicortex/
├── __init__.py              # Public API: chat(), families(), models(), etc.
├── api.py                   # Internal _OllamaAPI wrapper class
├── chat.py                  # chat() function, Stream, StreamEvent
├── models/                  # Model metadata — one JSON file per family
│   ├── llama.json
│   ├── mistral.json
│   ├── deepseek.json
│   ├── qwen.json
│   └── gemma.json
├── tools/                   # Administrative utilities
│   ├── __init__.py
│   ├── check_models.py      # Step 1: validate server endpoints
│   ├── fetch_models.py      # Step 2: fetch current model lists
│   ├── resolve_models.py    # Step 3: merge and deduplicate metadata
│   ├── apply_valid_models.py # Step 4: write updated JSON files
│   └── server.py            # FastAPI OpenAI-compatible server
└── stubs/                   # Type stubs for IDE autocomplete
    ├── __init__.pyi
    ├── chat.pyi
    ├── tools.pyi
    └── tools/
        ├── __init__.pyi
        ├── server.pyi
        ├── check_models.pyi
        ├── fetch_models.pyi
        ├── resolve_models.pyi
        └── apply_valid_models.pyi

Layer Responsibilities

Layer	Module	Responsibility
Public API	`__init__.py`	Clean, stable surface — thin wrappers only
Internal API	`api.py`	All Ollama interactions; `_OllamaAPI` class
Chat Interface	`chat.py`	`chat()` dispatch, `Stream`, `StreamEvent`
Model Metadata	`models/*.json`	Offline-available model info and server lists
Tools	`tools/`	Administrative pipeline for keeping metadata current
Server	`tools/server.py`	FastAPI proxy that exposes an OpenAI-compatible REST API
Type Stubs	`stubs/`	IDE support — mirrors the public surface in `.pyi`

Design Principles

Single responsibility — each module and class does one thing well
Layered access — public callers never touch _OllamaAPI directly
Fail gracefully — server errors cascade to failover, not crashes
Type safety everywhere — mypy --strict must pass with zero errors
No mutable state in modules — mutable state lives in instances or is passed explicitly (the read-only metadata cache loaded at import is the one exception)

🔌 Internal API Design

`_OllamaAPI` Class

api.py contains the single internal class that owns all Ollama communication. Public functions in __init__.py create instances of this class and delegate to it; they never call ollama directly.

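from __future__ import annotations  # keeps `str | None` and `dict[str, Any]` annotations valid on Python 3.8

from typing import Any, Iterator


class _OllamaAPI: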
    """Internal wrapper around the Ollama Python client.

    Not part of the public API — subject to change without notice.
    All public functions should go through this class for Ollama access.
    """

    def __init__(self, base_url: str = "http://localhost:11434") -> None: ...

    # Chat
    def _chat(self, model: str, prompt: str, **kwargs: Any) -> dict[str, Any]: ...
    def _stream_chat(self, model: str, prompt: str, **kwargs: Any) -> Iterator[dict[str, Any]]: ...

    # Model discovery
    def list_families(self) -> list[str]: ...
    def list_models(self, family: str | None = None) -> list[str]: ...
    def get_model_info(self, model: str) -> dict[str, Any]: ...

    # Server discovery
    def list_model_servers(self, model: str) -> list[dict[str, Any]]: ...
    def get_server_info(self, model: str, server_url: str | None = None) -> dict[str, Any]: ...
    def build_api_request(self, model: str, prompt: str, **kwargs: Any) -> dict[str, Any]: ...
    def get_llm_params(self, model: str) -> dict[str, Any]: ...
    def get_random_llm_params(self, model: str) -> dict[str, Any]: ...
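The delegation pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `_FakeOllamaAPI` is a hypothetical stand-in for `_OllamaAPI` that returns canned data so the example runs without a server, and the body of `models()` is only a guess at what a thin public wrapper might look like.

```python
from __future__ import annotations


class _FakeOllamaAPI:
    """Hypothetical stand-in for _OllamaAPI; returns canned data, never calls Ollama."""

    def __init__(self, base_url: str = "http://localhost:11434") -> None:
        self.base_url = base_url

    def list_models(self, family: str | None = None) -> list[str]:
        # Canned metadata for illustration only.
        known = {"llama": ["llama3.2:3b"], "mistral": ["mistral:7b"]}
        if family is None:
            return [name for names in known.values() for name in names]
        return list(known.get(family, []))


def models(family: str | None = None) -> list[str]:
    """Public-style thin wrapper: create an API instance and delegate to it."""
    return _FakeOllamaAPI().list_models(family)
```

The wrapper holds no logic of its own, which is what keeps the public surface stable while the internal class changes.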

Server Selection Strategy

When a function needs to talk to an Ollama server, _OllamaAPI follows this selection order:

Explicit URL — if the caller passes server_url, use it directly
Metadata servers — try servers listed in the model's JSON entry, in order
Default localhost — fall back to http://localhost:11434 if all else fails

Each candidate is health-checked before use. A server is considered healthy if it responds to the model list endpoint within the configured timeout. Failed servers are skipped with a warning log; they do not raise exceptions unless all candidates are exhausted.
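The selection order above could look roughly like this. The function names, the injected `is_healthy` callable, and the use of `RuntimeError` (standing in for the project's own exception) are assumptions made for the sketch; it also assumes that an explicit URL is used on its own with no fallback.

```python
from __future__ import annotations

import logging
from typing import Callable

logger = logging.getLogger(__name__)

DEFAULT_URL = "http://localhost:11434"


def candidate_servers(explicit_url: str | None, metadata_urls: list[str]) -> list[str]:
    """Return candidate URLs in the documented selection order."""
    if explicit_url is not None:
        # Assumption: an explicit URL is used directly, with no fallback.
        return [explicit_url]
    ordered: list[str] = []
    for url in [*metadata_urls, DEFAULT_URL]:
        if url not in ordered:  # drop duplicates, keep the first position
            ordered.append(url)
    return ordered


def pick_server(candidates: list[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy candidate; raise only when every one has failed."""
    for url in candidates:
        if is_healthy(url):
            return url
        logger.warning("Server %s failed its health check; trying the next one", url)
    raise RuntimeError("No healthy Ollama server among: " + ", ".join(candidates))
```

Passing the health check in as a callable keeps the ordering logic testable without any network access.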

📡 Streaming Architecture

Event System

Streaming is modeled as a sequence of typed events rather than a raw byte stream. This makes it easy to filter, transform, and compose stream consumers.

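from __future__ import annotations  # keeps `str | None` and `list[StreamEvent]` valid on Python 3.8 and 3.9

from dataclasses import dataclass, field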
from enum import Enum
from typing import Iterator


class EventType(str, Enum):
    START = "start"     # Generation has begun
    TOKEN = "token"     # One text chunk has arrived
    END = "end"         # Generation completed successfully
    ERROR = "error"     # An error occurred during generation


@dataclass
class StreamEvent:
    type: EventType
    content: str | None = None   # Token text (only on TOKEN events)
    index: int | None = None     # Token position in the sequence
    model: str | None = None     # Which model generated this event
    done: bool = False           # True on the final event


@dataclass
class Stream:
    """Iterable container for a streamed model response."""
    events: list[StreamEvent] = field(default_factory=list)

    def __iter__(self) -> Iterator[StreamEvent]: ...
    def add(self, event: StreamEvent) -> None: ...
    def text(self) -> str:
        """Concatenate all TOKEN event content into a single string."""
        ...
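One plausible way to fill in the elided `Stream` methods is shown below. The method bodies are assumptions based only on the signatures and docstrings above, and `StreamEvent` is trimmed to the fields the sketch needs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator


class EventType(str, Enum):
    START = "start"
    TOKEN = "token"
    END = "end"
    ERROR = "error"


@dataclass
class StreamEvent:
    type: EventType
    content: str | None = None
    done: bool = False


@dataclass
class Stream:
    events: list[StreamEvent] = field(default_factory=list)

    def __iter__(self) -> Iterator[StreamEvent]:
        return iter(self.events)

    def add(self, event: StreamEvent) -> None:
        self.events.append(event)

    def text(self) -> str:
        # Only TOKEN events carry text; START, END and ERROR are skipped.
        return "".join(
            e.content or "" for e in self.events if e.type is EventType.TOKEN
        )
```

This version stores events in a plain list for simplicity; it does not model lazy production of events.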

Event Flow Diagram

chat("...", stream=True)
        │
        ▼
  _OllamaAPI._stream_chat()
        │
        │  yields raw Ollama dicts
        ▼
  _build_stream_events()          ← converts raw → StreamEvent
        │
        │  yields StreamEvents:
        │    StreamEvent(type=START)
        │    StreamEvent(type=TOKEN, content="Hello")
        │    StreamEvent(type=TOKEN, content=" world")
        │    ...
        │    StreamEvent(type=END, done=True)
        ▼
  Stream object returned to caller
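The conversion step in the diagram might be sketched like this. The chunk shape (`{"message": {"content": ...}, "done": ...}`) follows the general form of Ollama chat streaming responses but should be treated as an assumption here, as should the function body; plain strings replace `EventType` to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass
class StreamEvent:
    type: str
    content: str | None = None
    index: int | None = None
    done: bool = False


def build_stream_events(chunks: Iterable[dict[str, Any]]) -> Iterator[StreamEvent]:
    """Lazily convert raw chunk dicts into START / TOKEN / END events."""
    yield StreamEvent(type="start")
    index = 0
    for chunk in chunks:
        text = (chunk.get("message") or {}).get("content", "")
        if text:
            yield StreamEvent(type="token", content=text, index=index)
            index += 1
        if chunk.get("done"):
            break  # the final chunk marks the end of generation
    yield StreamEvent(type="end", done=True)
```

Because this is a generator, nothing is read from `chunks` until the caller starts iterating.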

Consuming a Stream

# Option 1: iterate events
stream = chat("Tell me a story", stream=True)
for event in stream:
    if event.type == EventType.TOKEN:
        print(event.content, end="", flush=True)

# Option 2: collect full text after completion
text = stream.text()

📦 Model Management

JSON Metadata Schema

Each model family has a JSON file in aicortex/models/. The schema:

{
  "family": "llama",
  "models": [
    {
      "name": "llama3.2:3b",
      "family": "llama",
      "size": "2.0 GB",
      "parameters": "3B",
      "quantization": "Q4_K_M",
      "context_length": 131072,
      "description": "Compact, fast Llama 3.2 variant for everyday tasks.",
      "tags": ["chat", "fast", "lightweight"],
      "servers": [
        {
          "url": "http://localhost:11434",
          "status": "unknown"
        }
      ]
    }
  ]
}

Model Loading

Model metadata is loaded from the bundled JSON files at import time. The files are included in the wheel via package_data in setup.py, so they are always available — no network required to get model info.

The status field in each server entry is not authoritative at load time; it reflects the last-known state from when the tools pipeline was run. Live health checks are done lazily at call time.
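A loader along these lines would match the behavior described, including skipping a malformed family file instead of failing the import. The function name `load_families` and the directory argument are hypothetical; the real package may locate its bundled files differently.

```python
from __future__ import annotations

import json
import logging
from pathlib import Path
from typing import Any

logger = logging.getLogger(__name__)


def load_families(models_dir: Path) -> dict[str, dict[str, Any]]:
    """Load every family JSON file in models_dir, keyed by its "family" field."""
    families: dict[str, dict[str, Any]] = {}
    for path in sorted(models_dir.glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            families[data["family"]] = data
        except (ValueError, KeyError, TypeError) as exc:
            # json.JSONDecodeError is a ValueError; KeyError and TypeError
            # cover files that parse but do not have the expected shape.
            logger.error("Skipping malformed metadata file %s: %s", path.name, exc)
    return families
```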

Keeping Metadata Current

The four-tool pipeline in aicortex/tools/ is the mechanism for refreshing model metadata:

check_models → fetch_models → resolve_models → apply_valid_models

Run it periodically (e.g., as a cron job or pre-release step) to keep the bundled JSON files accurate. See docs/tools.md for the full pipeline reference.

🔨 Tool System

Tool Categories

Tool	Module	Purpose
check_models	`tools/check_models.py`	Validate that server URLs are reachable and serving models
fetch_models	`tools/fetch_models.py`	Fetch the current model list from each live server
resolve_models	`tools/resolve_models.py`	Merge fetched data with existing metadata; deduplicate
apply_valid_models	`tools/apply_valid_models.py`	Write the resolved data back to `aicortex/models/*.json`
server	`tools/server.py`	Run the FastAPI OpenAI-compatible proxy

Tool Design Constraints

CLI-runnable — every tool exposes a main() function and a __main__ guard so it can be invoked directly: python -m aicortex.tools.check_models
Composable — each tool's output is suitable input for the next step; they can be chained in shell pipelines or called programmatically
Error-resilient — network failures for individual servers are logged and skipped; they do not abort the whole pipeline
Concurrent — check_models and fetch_models use asyncio / ThreadPoolExecutor to probe multiple servers in parallel
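The error-resilient and concurrent constraints can be illustrated together. This sketch uses `ThreadPoolExecutor` with an injected `probe` callable; the function name `check_servers` and the worker count are assumptions, not the tools' actual interface.

```python
from __future__ import annotations

import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

logger = logging.getLogger(__name__)


def check_servers(urls: list[str], probe: Callable[[str], bool]) -> dict[str, bool]:
    """Probe all servers in parallel; a failing server never aborts the run."""

    def safe_probe(url: str) -> bool:
        try:
            return probe(url)
        except Exception as exc:  # network errors are logged and skipped
            logger.warning("Probe failed for %s: %s", url, exc)
            return False

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(safe_probe, urls))  # map preserves input order
    return dict(zip(urls, results))
```

A tool built this way returns a plain dict, which is easy to hand to the next pipeline step or to serialize for a shell pipeline.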

🔷 Type System

Stub Files

Every public symbol has a corresponding .pyi stub. Stubs live in aicortex/stubs/ and are included in the wheel so IDEs get autocomplete without needing to read the implementation.

The stub for chat() uses @overload to express the conditional return type:

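# aicortex/stubs/chat.pyi
from typing import Any, Literal, overload

# Stream and StreamEvent are declared earlier in this same stub (elided here).

@overload
def chat(prompt: str, *, stream: Literal[False] = ..., **kwargs: Any) -> str: ...

@overload
def chat(prompt: str, *, stream: Literal[True], **kwargs: Any) -> Stream: ...

# Fallback for callers that pass a plain bool. A stub file contains only the
# overloads; the undecorated implementation signature belongs in chat.py.
@overload
def chat(prompt: str, *, stream: bool, **kwargs: Any) -> str | Stream: ...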
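To see how such overloads behave in a regular module, here is a small runnable sketch. `fake_chat` and this toy `Stream` class are hypothetical and return canned values; only the overload pattern itself is the point.

```python
from __future__ import annotations

from typing import Literal, overload


class Stream:
    """Toy stand-in for the real Stream class."""

    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks


@overload
def fake_chat(prompt: str, *, stream: Literal[False] = ...) -> str: ...
@overload
def fake_chat(prompt: str, *, stream: Literal[True]) -> Stream: ...
@overload
def fake_chat(prompt: str, *, stream: bool) -> str | Stream: ...


def fake_chat(prompt: str, *, stream: bool = False) -> str | Stream:
    # In a .py module the implementation follows the overloads.
    words = prompt.split()
    if stream:
        return Stream(words)
    return " ".join(words)
```

With these overloads a type checker infers `str` for `fake_chat("hi")` and `Stream` for `fake_chat("hi", stream=True)`, so callers need no casts or `isinstance` checks.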

Type Checking

mypy aicortex must exit 0 with strict mode enabled
--no-implicit-optional and --disallow-untyped-defs are both on
All Any uses must be justified with a # type: ignore[...] comment

Run:

mypy aicortex --strict

🧪 Testing Strategy

Test Structure

tests/
├── __init__.py
├── conftest.py               # Fixtures: mock_ollama_client, sample_model_data
├── test_chat.py              # chat(), Stream, StreamEvent behavior
├── test_api.py               # _OllamaAPI methods and error paths
├── test_models.py            # JSON loading, model lookup, family listing
├── test_tools.py             # check → fetch → resolve → apply pipeline
├── test_server.py            # FastAPI endpoints, request/response shapes
└── fixtures/
    ├── mock_responses.json   # Canned Ollama API responses
    └── test_models.json      # Minimal model JSON for unit tests

Test Categories

Unit tests — test one function or method in isolation, all external I/O mocked:

def test_get_model_info_returns_correct_family(mock_ollama_client):
    info = get_model_info("llama3.2:3b")
    assert info["family"] == "llama"

Integration tests — test multi-step workflows, still mocked at the network boundary:

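import json
from pathlib import Path

from aicortex.tools import apply_valid_models, check_models, fetch_models, resolve_models


def test_tool_pipeline_produces_valid_json(mock_server_responses):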
    check_models.run()
    fetch_models.run()
    resolve_models.run()
    apply_valid_models.run()
    data = json.loads(Path("aicortex/models/llama.json").read_text())
    assert "models" in data

Server tests — test FastAPI endpoints using httpx.AsyncClient with the app mounted in-process (no real network):

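import pytest
from httpx import ASGITransport, AsyncClient

from aicortex.tools.server import app  # assumes server.py exposes the FastAPI instance as `app`


@pytest.mark.asyncio
async def test_chat_completions_endpoint():
    # Recent httpx releases removed the `app=` shortcut; mount the app via ASGITransport.
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        response = await client.post("/v1/chat/completions", json={
            "model": "llama3.2:3b",
            "messages": [{"role": "user", "content": "Hello"}]
        })
    assert response.status_code == 200
    assert "choices" in response.json()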

Tox Environments

tox.ini defines these environments:

Environment	Command	Purpose
`py38` – `py312`	`pytest + mypy + black + flake8`	Full suite on each Python version
`docs`	`sphinx-build -b html docs docs/_build/html`	Build and check docs
`build`	`python -m build && twine check dist/*`	Verify the package builds cleanly

Run all environments:

tox

Run a single environment:

tox -e py311
tox -e docs

⚡ Performance Optimization

Caching

Model metadata — JSON files are read once at import time and held in module-level dicts; subsequent calls hit the in-memory cache
Server health — health check results are cached for a configurable TTL (default: 60 seconds) to avoid re-checking on every call
Client instances — ollama.Client instances are reused per base URL rather than created per call
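The health-check TTL cache could be as simple as the sketch below. The class name `HealthCache` and the injectable `clock` parameter are assumptions introduced for illustration and testability; the 60-second default comes from the text above.

```python
from __future__ import annotations

import time
from typing import Callable


class HealthCache:
    """Cache health-check results per URL for a fixed time-to-live."""

    def __init__(
        self,
        check: Callable[[str], bool],
        ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[float, bool]] = {}

    def is_healthy(self, url: str) -> bool:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: skip the real check
        result = self._check(url)
        self._entries[url] = (now, result)
        return result
```

Using a monotonic clock avoids surprises when the system wall clock is adjusted while the process is running.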

Concurrency

check_models and fetch_models use asyncio.gather() to probe all servers concurrently, so wall time is roughly that of the slowest server (bounded by the request timeout) rather than the sum over all servers
apply_valid_models writes each family JSON file atomically (write to temp, then rename) to prevent partial writes
Streaming events are yielded lazily — no buffering of the full response before returning to the caller
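The write-to-temp-then-rename technique mentioned above can be sketched with the standard library. The helper name `write_json_atomically` is hypothetical; `os.replace` is used because it overwrites an existing destination on both POSIX and Windows.

```python
from __future__ import annotations

import contextlib
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomically(path: Path, data: dict[str, Any]) -> None:
    """Write data to path so readers never observe a half-written file."""
    # The temp file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.write("\n")
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```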

Memory

Model JSON files are loaded into dict objects, not dataclass instances, to minimize overhead for large model lists
Streaming yields one event at a time; the full token sequence is only materialized if the caller calls .text()

❗ Error Handling

Exception Hierarchy

class AICortexError(Exception):
    """Base exception for all AI Cortex errors."""

class ModelNotFoundError(AICortexError):
    """Raised when the requested model is not available on any server."""

class ServerError(AICortexError):
    """Raised when all configured servers are unreachable."""

class ValidationError(AICortexError):
    """Raised when input parameters fail validation."""

class StreamError(AICortexError):
    """Raised when an error occurs during streaming."""

Recovery Strategies

Failure	Strategy
One server unreachable	Log warning, try next server in list
All servers unreachable	Raise `ServerError`
Model not in metadata	Raise `ModelNotFoundError` with suggestions
Stream interrupted mid-response	Emit `StreamEvent(type=ERROR)`, raise `StreamError`
Malformed model JSON	Log error, skip that family; do not crash the import
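The first two rows of the table might translate into code like the following. The helper `call_with_failover` and the injected `send` callable are hypothetical; the two exception classes mirror the hierarchy above, and catching `ConnectionError` is an assumption about what "unreachable" means.

```python
from __future__ import annotations

import logging
from typing import Callable

logger = logging.getLogger(__name__)


class AICortexError(Exception):
    """Base exception for all AI Cortex errors."""


class ServerError(AICortexError):
    """Raised when all configured servers are unreachable."""


def call_with_failover(servers: list[str], send: Callable[[str], str]) -> str:
    """Try each server in order; raise ServerError only if every one fails."""
    failures: list[str] = []
    for url in servers:
        try:
            return send(url)
        except ConnectionError as exc:
            logger.warning("Server %s unreachable (%s); trying next", url, exc)
            failures.append(f"{url}: {exc}")
    raise ServerError("All servers unreachable: " + "; ".join(failures))
```

Callers can then catch `AICortexError` to handle every library failure in one place, or `ServerError` for connectivity problems specifically.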

📦 Build and Distribution

Building

# Install build tools
pip install build twine

# Build source distribution and wheel
python -m build

# Verify the built package
twine check dist/*

This produces:

dist/aicortex_core-1.0.3.tar.gz — source distribution
dist/aicortex_core-1.0.3-py3-none-any.whl — pure-Python wheel

What's in the Wheel

All aicortex/ Python source files
aicortex/models/*.json — bundled model metadata
aicortex/stubs/**/*.pyi — type stubs for IDE support
README.md, LICENSE

Versioning

AI Cortex follows Semantic Versioning:

MAJOR — breaking changes to the public API
MINOR — new features, backward-compatible
PATCH — bug fixes, backward-compatible

The version is defined in setup.py and should be updated before every release.

🔁 CI/CD Pipeline

The GitHub Actions workflow (.github/workflows/ci.yml) runs on every push and pull request to main and develop.

Jobs

test — matrix over Python 3.8, 3.9, 3.10, 3.11, 3.12:

1. Checkout code
2. Install: pip install -e .[dev,server]
3. Lint: flake8 aicortex tests
4. Format check: black --check aicortex tests
5. Type check: mypy aicortex --strict
6. Test: pytest --cov=aicortex --cov-report=xml
7. Upload coverage to Codecov

build — runs after test passes:

1. Build package: python -m build
2. Store wheel and sdist as workflow artifacts

release — runs on push to main only, after test and build:

1. Build package
2. Publish to PyPI via pypa/gh-action-pypi-publish
   (requires PYPI_API_TOKEN secret in repo settings)

Release Process

Update version in setup.py
Update CHANGELOG.md with release notes
Commit: git commit -m "chore: release v1.1.0"
Tag: git tag v1.1.0 && git push --tags
Merge to main — CI publishes to PyPI automatically

🔒 Security Considerations

Input Validation

All model identifiers are validated against the known model list before use
Server URLs are validated as valid HTTP/HTTPS URLs before connection attempts
Prompt strings are passed through as-is to Ollama — sanitization is the caller's responsibility
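The first two checks could be implemented roughly as below. Both function names are hypothetical, and `ValueError` and `LookupError` stand in for the project's own exception types; the close-match suggestions use `difflib` from the standard library.

```python
from __future__ import annotations

import difflib
from urllib.parse import urlparse


def validate_server_url(url: str) -> str:
    """Accept only absolute http:// or https:// URLs with a host part."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a valid HTTP/HTTPS server URL: {url!r}")
    return url


def validate_model(name: str, known: set[str]) -> str:
    """Reject unknown model names, suggesting close matches when there are any."""
    if name in known:
        return name
    suggestions = difflib.get_close_matches(name, sorted(known), n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise LookupError(f"Unknown model {name!r}.{hint}")
```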

Network

HTTPS is supported and recommended for remote servers
All HTTP requests have an explicit timeout (default: 30 seconds)
No credentials or tokens are logged, even at debug level

Dependencies

Core dependencies are minimal: ollama and pydantic only
Server extras (fastapi, uvicorn) are optional
Dependencies are declared with minimum versions (lower bounds) in setup.py; there are no upper bounds, to avoid false conflicts

Known Limitations

The server mode has no built-in authentication — do not expose it on a public network without a reverse proxy that adds auth
Model outputs are not filtered; responsibility for downstream content handling lies with the application
HTTP (not HTTPS) is the default for localhost Ollama connections — this is intentional for zero-config local use

🚀 Future Enhancements

Planned

Feature	Description	Priority
Async API	Full `async`/`await` support for `chat()`	High
Plugin system	Extensible tool architecture for third-party additions	Medium
Metrics export	Prometheus-compatible metrics endpoint on the server	Medium
Configuration files	YAML/TOML config for server URLs, defaults, timeouts	Medium
Caching layer	Optional Redis backend for response caching	Low
Multi-modal	Image and audio input support (pending Ollama support)	Low

Architecture Notes for Future Contributors

The _OllamaAPI class is intentionally not async to keep the public API simple. When async support is added, it should be a parallel _AsyncOllamaAPI class, not a modification of the existing one.
The model metadata JSON format is considered stable. New fields may be added; existing fields must not be removed without a major version bump.
The tool pipeline is designed to be run by maintainers, not end users. If a use case arises for user-facing model management, it should be a new public API function, not a thin wrapper around the tools.
