Self-Hosted Local LLM Fallback (Ollama Integration) #674

@hackobi

Description

Run local LLMs for privacy-sensitive tasks and as a cost-free fallback when hosted APIs are unavailable. This ties into Jay's existing work on local LLM integrations.

What to Build

  • Integration with Ollama or llama.cpp for local model serving
  • Model management: download, serve, health check via skill
  • Routing integration: cost tracker routes privacy-sensitive or simple tasks to local models
  • Models to support: Llama 3.x (8B for fast tasks), Qwen 2.5 Coder (for code tasks), Phi-3 (for lightweight reasoning)
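The health-check and serving pieces could follow Ollama's documented local HTTP API (default port 11434): `GET /api/tags` lists installed models and doubles as a liveness probe, and `POST /api/generate` runs a completion. A minimal sketch of what `providers/ollama.js` might look like (function names and structure here are illustrative, not a committed interface):

```javascript
// Sketch of providers/ollama.js against Ollama's local HTTP API.
// OLLAMA_URL default matches Ollama's standard port.
const OLLAMA_URL = process.env.OLLAMA_URL || "http://localhost:11434";

// GET /api/tags lists installed models; an OK response doubles as a health check.
async function healthCheck() {
  try {
    const res = await fetch(`${OLLAMA_URL}/api/tags`);
    return res.ok;
  } catch {
    return false; // server not running or unreachable
  }
}

// Non-streaming completion via POST /api/generate.
async function generate(model, prompt) {
  const res = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`Ollama error: ${res.status}`);
  const data = await res.json();
  return data.response;
}

module.exports = { healthCheck, generate };
```

Model download could shell out to `ollama pull <tag>`; a failing `healthCheck()` is also the signal for the router to fall back to API providers.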

Use Cases

  • Embedding generation (local = free, unlimited)
  • Simple classification tasks (intent detection, sentiment)
  • Privacy-sensitive document processing
  • Offline fallback when API providers are down
  • Reducing monthly API costs
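The routing rule implied by these use cases can be sketched as a pure function the cost tracker calls per task (the task shape and names below are hypothetical, just to pin down the decision order: privacy first, then availability, then capability):

```javascript
// Hypothetical routing sketch: prefer local when the task is privacy-sensitive,
// when the remote API is down, or when a cheap local model is good enough.
const LOCAL_CAPABLE = new Set(["embedding", "classification", "intent", "sentiment"]);

function routeTask(task, { apiHealthy = true } = {}) {
  if (task.privacySensitive) return "local"; // never send sensitive docs out
  if (!apiHealthy) return "local";           // offline fallback
  if (LOCAL_CAPABLE.has(task.kind)) return "local"; // free + unlimited for simple tasks
  return "api";                              // complex tasks still go to the API
}
```

For example, `routeTask({ kind: "embedding" })` returns `"local"`, while a complex reasoning task only routes locally when `apiHealthy` is false.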

Architecture

skills/local-llm/
├── SKILL.md
├── index.js
├── providers/
│   ├── ollama.js
│   └── llamacpp.js
├── models.json       # Available models + capabilities
└── data/
    └── benchmark.json # Local vs API quality comparison
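One possible shape for `models.json`, mapping the models listed above to capabilities and rough memory budgets (tags and RAM figures are illustrative; actual installed tags should be checked with `ollama list`):

```json
{
  "models": [
    {
      "name": "llama3.1:8b-instruct-q4_K_M",
      "capabilities": ["chat", "fast-tasks"],
      "approxRamGB": 6
    },
    {
      "name": "qwen2.5-coder:7b",
      "capabilities": ["code"],
      "approxRamGB": 6
    },
    {
      "name": "phi3:mini",
      "capabilities": ["lightweight-reasoning", "classification"],
      "approxRamGB": 3
    }
  ]
}
```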

Hardware Constraint

Running on a MacBook Air, so be conscious of RAM/CPU: 8B-parameter models max, quantized (Q4_K_M). On Apple Silicon, Ollama and llama.cpp accelerate inference on the GPU via Metal, which keeps local inference usably fast.
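As a sanity check on that constraint: Q4_K_M averages roughly 4.85 bits per weight (an assumed average; actual GGUF file size varies slightly by architecture), so an 8B model's weights alone come to around 4.85 GB:

```javascript
// Back-of-envelope model memory estimate. bitsPerWeight ≈ 4.85 is an assumed
// average for Q4_K_M quantization; real file sizes vary a little by model.
function modelSizeGB(paramCount, bitsPerWeight = 4.85) {
  return (paramCount * bitsPerWeight) / 8 / 1e9;
}

const weightsGB = modelSizeGB(8e9); // ≈ 4.85 GB, before KV cache and runtime overhead
```

On an 8 GB Air that leaves little headroom once the KV cache and the OS are counted; 16 GB of unified memory is comfortable for a single 8B Q4_K_M model.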
