DevAIFlow supports running Claude Code with alternative AI model providers. This allows you to use local models (Ollama, llama.cpp) or cloud providers (OpenRouter, Vertex AI, etc.) instead of the default Anthropic API.
✨ New: Ollama is now fully supported through native integration! Use ollama launch claude for the simplest local model setup.
- Why Use Alternative Providers?
- Quick Start Guides - Start here!
- Using Profiles - How to switch between providers
- Configuration - Profile structure and settings
- Provider Setup Guides - Detailed setup instructions
- Troubleshooting - Common issues and solutions
- Performance Comparison - Benchmarks and costs
- Decision Matrix - Which provider to choose
- Best Practices - Tips and recommendations
- Cost savings: Up to 98% cheaper than Claude Opus 4.6 ($15/M tokens → $0.28/M tokens)
- Privacy: Run models completely locally (no internet needed)
- Flexibility: Test different models for different use cases
- No vendor lock-in: Switch providers anytime
- Simplicity: Ollama integration requires zero configuration
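The headline cost figure is simple arithmetic on per-token prices. A quick check, using the document's example prices (prices change over time; verify current provider pricing before relying on this):

```shell
# Compare ~$15/M input tokens (Claude Opus) against ~$0.28/M tokens
# (DeepSeek via OpenRouter) and compute the percentage saving.
opus=15.00
deepseek=0.28
saving=$(awk -v a="$opus" -v b="$deepseek" 'BEGIN { printf "%.1f", (1 - b / a) * 100 }')
echo "saving vs Opus: ${saving}%"   # ≈ 98.1%
```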
Choose your path based on your needs:
Best for: Simplest local setup, zero configuration, quick start
Time: 2-5 minutes | Cost: FREE | Status: ✅ Fully integrated & tested
# 1. Install Ollama (one-time)
# macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh
# Or download from: https://ollama.com
# 2. Pull a coding model
ollama pull qwen3-coder # Recommended: Qwen3-Coder (25B - excellent for coding)
# OR: ollama pull llama3.3 # Alternative: Llama 3.3 (70B - slower but powerful)
# 3. Configure daf
daf config edit
# Set "AI Agent Backend" to "Ollama (local models)"
# Optionally set "Default Model" under "Ollama Configuration"
# OR manually edit ~/.daf-sessions/config.json:
# {
# "agent_backend": "ollama",
# "ollama": {
# "default_model": "qwen3-coder"
# }
# }
# 4. Use with daf
daf open PROJ-123
# Ollama will automatically launch Claude Code with your local model!

Model Selection Priority:
1. Model provider profile (if configured)
2. `OLLAMA_MODEL` environment variable
3. Ollama's default from `~/.ollama/config.json`
4. Ollama's built-in default
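The fallback chain can be sketched as a small shell function (a hypothetical illustration; daf's real implementation may differ):

```shell
# Resolve which Ollama model to use, mirroring the documented priority:
# profile setting > OLLAMA_MODEL env var > config default > built-in default.
pick_ollama_model() {
  profile_model="$1"     # from the model provider profile, may be empty
  config_default="$2"    # from ~/.ollama/config.json, may be empty
  if [ -n "$profile_model" ]; then
    echo "$profile_model"
  elif [ -n "${OLLAMA_MODEL:-}" ]; then
    echo "$OLLAMA_MODEL"
  elif [ -n "$config_default" ]; then
    echo "$config_default"
  else
    echo "built-in-default"
  fi
}

unset OLLAMA_MODEL
pick_ollama_model "" "qwen3-coder"   # prints: qwen3-coder
```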
Popular Ollama Models for Coding:
- `qwen3-coder` - 25B parameters, excellent for coding (recommended)
- `llama3.3` - 70B parameters, powerful but slower
- `codellama` - Meta's coding-specific model
- `mistral` - Fast and capable
Advantages over llama.cpp:
- ✅ Zero configuration - works out of the box
- ✅ Automatic server management - no manual server start needed
- ✅ Model management - `ollama pull`, `ollama list`, etc.
- ✅ Native integration - uses `ollama launch claude` command
- ✅ Simpler setup - one install command
See detailed guide below for model recommendations and troubleshooting.
Best for: Privacy, offline work, zero cost, full IDE integration
Time: 15-20 minutes | Cost: FREE | Status: ✅ Tested & Working
# 1. Build llama.cpp (one-time)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_METAL=ON # macOS
# OR: cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON # Linux with GPU
cmake --build llama.cpp/build --config Release -j
# 2. Start server with a coding model
cd llama.cpp
./llama-server -hf bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M \
--alias "Qwen3-Coder" --port 8000 --jinja --ctx-size 64000
# 3. Configure daf (choose one method)
# Method A: CLI commands (recommended)
daf model add llama-cpp
# Select "3. Local llama.cpp server" when prompted
# Base URL: http://localhost:8000
# Model name: Qwen3-Coder
# Set as default: Yes
# Method B: Interactive TUI
daf config edit # Navigate to "Model Providers" → Add Custom Provider
# Method C: Manual config
# Add to ~/.daf-sessions/config.json:
# {
# "model_provider": {
# "default_profile": "llama-cpp",
# "profiles": {
# "llama-cpp": {
# "name": "llama-cpp",
# "base_url": "http://localhost:8000",
# "auth_token": "llama-cpp",
# "api_key": "",
# "model_name": "Qwen3-Coder"
# }
# }
# }
# }
# 4. Use with daf
daf open PROJ-123
# Claude Code will now use your local Qwen3-Coder model!

First Response Time: 30-60 seconds (normal - processing 35k tokens of tool definitions)
See detailed guide below for hardware requirements, model recommendations, and troubleshooting.
Best for: Cloud convenience, 100+ model options, very low cost
Time: 2 minutes | Cost: $0.28-3/M tokens (98% cheaper than Claude Opus) | Status: ⚠️ Untested
# 1. Get API key
# Visit https://openrouter.ai
# Sign up and generate API key
# Add credits to account
# 2. Configure daf (choose one method)
# Method A: CLI commands (recommended)
daf model add openrouter-deepseek
# Select "4. Custom" when prompted
# Base URL: https://openrouter.ai/api/v1
# Auth token: (leave empty)
# API key: or-YOUR-KEY-HERE
# Model name: deepseek/deepseek-coder
# Set as default: Yes
# Method B: Interactive TUI
daf config edit # Navigate to "Model Providers" → Add Custom Provider
# Method C: Manual config
# Manually edit ~/.daf-sessions/config.json:
# {
# "model_provider": {
# "default_profile": "openrouter-deepseek",
# "profiles": {
# "openrouter-deepseek": {
# "name": "openrouter-deepseek",
# "base_url": "https://openrouter.ai/api",
# "auth_token": "YOUR_OPENROUTER_KEY",
# "api_key": "",
# "model_name": "deepseek/deepseek-v3"
# }
# }
# }
# }
# 3. Use with daf
daf open PROJ-123 --model-profile openrouter-deepseek

Popular OpenRouter Models:
- `deepseek/deepseek-v3` - $0.28/M tokens (best value)
- `openai/gpt-oss-120b:free` - FREE tier
- `anthropic/claude-3.5-sonnet` - $3/M tokens (80% cheaper than direct Anthropic)
See detailed guide below for more model options and configuration.
Best for: Enterprise GCP users, compliance requirements
Time: 5 minutes | Cost: ~$3/M tokens | Status: ✅ Tested & Working
# 1. Set up GCP authentication
gcloud auth application-default login
# 2. Configure daf
daf config edit # Navigate to "Model Providers" → Add Vertex AI
# Or manually edit ~/.daf-sessions/config.json:
# {
# "model_provider": {
# "default_profile": "vertex",
# "profiles": {
# "vertex": {
# "name": "vertex",
# "use_vertex": true,
# "vertex_project_id": "your-gcp-project-id",
# "vertex_region": "us-east5",
# "model_name": "claude-3-5-sonnet-v2@20250929"
# }
# }
# }
# }
# 3. Use with daf
daf open PROJ-123

See detailed guide below for Vertex AI setup and configuration.
Pointing Claude Code directly at Ollama's HTTP API does NOT work. This is due to fundamental API incompatibility:
- Ollama uses OpenAI-compatible API format
- Claude Code requires Anthropic Messages API format
- These formats are incompatible (like USB-A vs USB-C)
Use the native `ollama launch claude` integration (see the Quick Start above) or llama.cpp instead - both provide the local model experience with full Claude Code compatibility.
DevAIFlow provides multiple ways to select which model provider profile to use:
Use --model-profile flag to specify a profile for a single command:
# Create session with specific profile
daf new --name feature-123 --goal "Add feature" --model-profile vertex
# Open session with specific profile (overrides session default)
daf open feature-123 --model-profile llama-cpp
# Investigate with local model
daf investigate --goal "Research options" --model-profile llama-cpp
# Session remembers last used profile
daf open feature-123 # Uses llama-cpp from previous command

Session Persistence: When you use --model-profile, the profile is stored in the session. Future daf open commands for that session will use the stored profile unless overridden.
Use MODEL_PROVIDER_PROFILE environment variable:
# One-time override
MODEL_PROVIDER_PROFILE=anthropic daf open PROJ-123
# Set for entire terminal session
export MODEL_PROVIDER_PROFILE=vertex
daf new --name task-456 --goal "Debug issue"
daf open task-456

Set default_profile in your config:
{
"model_provider": {
"default_profile": "llama-cpp",
"profiles": { ... }
}
}

All commands use this profile unless overridden.
When model provider is NOT enforced by enterprise/team:
Profile selection follows this priority (highest to lowest):
1. `session.model_profile` (stored in session from previous `--model-profile`)
2. `MODEL_PROVIDER_PROFILE` env var (terminal session override)
3. `config.model_provider.default_profile` (persistent default)
4. Anthropic API (fallback)
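As a sketch (hypothetical, not daf's actual code), the non-enforced resolution order behaves like this:

```shell
# Resolve the active profile: session-stored profile first, then the
# MODEL_PROVIDER_PROFILE env var, then the config default, then Anthropic.
resolve_profile() {
  session_profile="$1"   # session.model_profile, may be empty
  config_default="$2"    # config.model_provider.default_profile, may be empty
  if [ -n "$session_profile" ]; then
    echo "$session_profile"
  elif [ -n "${MODEL_PROVIDER_PROFILE:-}" ]; then
    echo "$MODEL_PROVIDER_PROFILE"
  elif [ -n "$config_default" ]; then
    echo "$config_default"
  else
    echo "anthropic"
  fi
}

unset MODEL_PROVIDER_PROFILE
resolve_profile "" ""                  # prints: anthropic
resolve_profile "llama-cpp" "vertex"   # prints: llama-cpp
```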
When model provider IS enforced by enterprise/team:
The enforced configuration takes absolute precedence. Users cannot override it via:
- ❌ `--model-profile` flag (command will use enforced profile)
- ❌ `MODEL_PROVIDER_PROFILE` env var (ignored)
- ❌ `config.model_provider.default_profile` (overridden by enforcement)
- ❌ TUI configuration editor (UI disabled)
Example Workflow (No Enforcement):
# Set work profile as default in config
# default_profile: "vertex"
# Create session - uses Vertex AI (config default)
daf new --name PROJ-123 --goal "Fix bug"
# Test with local model - overrides config default
daf open PROJ-123 --model-profile llama-cpp
# Next open uses last profile (llama-cpp stored in session)
daf open PROJ-123
# Force back to Vertex for deployment testing
daf open PROJ-123 --model-profile vertex

Example Workflow (Enterprise Enforcement):
# Enterprise has enforced vertex-prod profile in enterprise.json
# Create session - uses vertex-prod (enforced)
daf new --name PROJ-123 --goal "Fix bug"
# Attempt to use local model - STILL uses vertex-prod (enforced)
daf open PROJ-123 --model-profile llama-cpp
# TUI config editor shows warning and disables profile management
daf config edit
# "⚠ Model provider configuration is enforced by enterprise configuration"

Each profile contains:
| Field | Type | Description | Example |
|---|---|---|---|
| `name` | string | Profile name | `"llama-cpp"` |
| `base_url` | string (optional) | ANTHROPIC_BASE_URL override | `"http://localhost:8000"` |
| `auth_token` | string (optional) | ANTHROPIC_AUTH_TOKEN override | `"llama-cpp"` |
| `api_key` | string (optional) | ANTHROPIC_API_KEY override | `""` (empty string to disable) |
| `model_name` | string (optional) | Model for `--model` flag | `"devstral-small-2"` |
| `use_vertex` | boolean | Use Google Vertex AI | `true` |
| `vertex_project_id` | string (optional) | GCP project ID | `"my-project-123"` |
| `vertex_region` | string (optional) | GCP region | `"us-east5"` |
| `env_vars` | object (optional) | Additional env vars | `{"CUSTOM_VAR": "value"}` |
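A hypothetical profile combining these fields, including the optional `env_vars` object (all values are placeholders):

```json
{
  "model_provider": {
    "profiles": {
      "llama-cpp": {
        "name": "llama-cpp",
        "base_url": "http://localhost:8000",
        "auth_token": "llama-cpp",
        "api_key": "",
        "model_name": "Qwen3-Coder",
        "env_vars": {
          "CUSTOM_VAR": "value"
        }
      }
    }
  }
}
```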
Profiles can be defined at multiple levels:
- Enterprise (`enterprise.json`) - Company-wide enforcement
- Team (`team.json`) - Team defaults
- User (`config.json`) - Personal profiles
Config Merge Priority (Enforcement): Enterprise > Team > User
⚠ Important: The model_provider configuration follows an enforcement hierarchy (Enterprise > Team > User), not a preference hierarchy. This means:
- If enterprise.json defines model_provider, it enforces those profiles company-wide
- Users cannot override enterprise-enforced profiles
- Teams can enforce profiles if enterprise hasn't
- Users can only choose profiles if neither enterprise nor team has enforced them
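The enforcement rule can be sketched as follows (hypothetical, not daf's actual merge code): the highest level that defines `model_provider` wins outright, rather than being merged field-by-field:

```shell
# Given flags for whether each config level defines model_provider
# (1 = defines it, 0 = does not), report which level is in control.
provider_source() {
  enterprise="$1"; team="$2"; user="$3"
  if [ "$enterprise" = "1" ]; then echo "enterprise"
  elif [ "$team" = "1" ]; then echo "team"
  elif [ "$user" = "1" ]; then echo "user"
  else echo "none (Anthropic fallback)"
  fi
}

provider_source 1 1 1   # prints: enterprise
provider_source 0 1 1   # prints: team
provider_source 0 0 1   # prints: user
```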
Runtime Profile Selection Priority (highest to lowest):
1. `session.model_profile` (stored in session from previous `--model-profile`)
2. `MODEL_PROVIDER_PROFILE` env var (temporary override - only works if not enforced)
3. `config.model_provider.default_profile` (persistent default - only works if not enforced)
4. Anthropic API (fallback)
Note: The `--model-profile` CLI flag now persists its selection as `session.model_profile`. Use `daf open --model-profile <name>` to switch profiles; the chosen profile is stored in the session.
Method 1: CLI Commands (Recommended):
# List all profiles
daf model list
# Add a new profile (interactive wizard)
daf model add llama-cpp
# - Choose provider type (Anthropic, Vertex AI, llama.cpp, Custom)
# - Configure settings interactively
# - Optionally set as default
# Show profile configuration
daf model show llama-cpp
# Set default profile
daf model set-default llama-cpp
# Test profile configuration
daf model test llama-cpp
# Remove a profile
daf model remove old-profile

Benefits:
- Interactive wizard guides you through profile setup
- Validates configuration before saving
- No need to remember JSON structure
- Supports all profile types (Anthropic, Vertex AI, llama.cpp, Custom)
- JSON output available with `--json` flag for automation
Method 2: Interactive TUI:
daf config edit
# Navigate to "Model Providers" tab
# Add/Edit/Delete profiles with visual interface
# Press Ctrl+S to save

The TUI provides:
- Visual profile editor with form validation
- Add new profiles (Anthropic, Vertex AI, or Custom providers)
- Edit existing profiles (change settings, update credentials)
- Set default profile (marked with ⭐)
- Delete profiles (except base `anthropic` profile)
- Preview profile settings before saving
Method 3: Manual JSON Editing:
Add to ~/.daf-sessions/config.json:
{
"model_provider": {
"default_profile": "llama-cpp",
"profiles": {
"llama-cpp": {
"name": "llama-cpp",
"base_url": "http://localhost:8000",
"auth_token": "llama-cpp",
"api_key": "",
"model_name": "Qwen3-Coder"
}
}
}
}

Critical: Ollama's HTTP API cannot be used directly with Claude Code due to API incompatibility (the native `ollama launch claude` integration is the supported path).
Think of it like this: Claude Code speaks "Anthropic language" while Ollama only speaks "OpenAI language" - they can't communicate! 🗣️
The technical breakdown:
- ❌ Ollama provides OpenAI-compatible API
- ❌ Claude Code requires Anthropic Messages API format
- ❌ These APIs are fundamentally different and incompatible
- ❌ Like trying to plug USB-A into USB-C - won't work!
Test results confirm incompatibility:
# Ollama + Claude Code (❌ FAILS)
claude --model kimi-k2.5:cloud
# Result: 500 {"type":"error","error":{"type":"api_error",...}}
claude --model devstral-small-2
# Result: Hangs forever with "Deliberating..."

✅ Use llama.cpp instead (see Option 1 below) for local models with full Claude Code IDE integration.
Why articles claim Ollama works:
- Articles actually use llama.cpp (not Ollama) but title says "Ollama"
- Articles use API translation layers (litellm, OpenRouter) that convert between formats
- Information is outdated from older Claude Code versions
For terminal-based Ollama chat without Claude Code IDE: Ollama works fine for basic CLI chat, but it does NOT integrate with Claude Code's file editing, tool calling, and IDE features.
Time: 15-20 minutes | Cost: Free | Best for: Local/offline usage with full Claude Code IDE integration | Status: ✅ Tested and confirmed working
Aside from the native Ollama integration, llama.cpp is the only tested local solution that provides full Claude Code compatibility with file editing, multi-file changes, and tool calling.
The Core Issue: API Incompatibility
- Ollama → Provides OpenAI-compatible API
- Claude Code → Requires Anthropic Messages API
- Result → ❌ Incompatible (like trying to plug USB-A into USB-C)
Simple Analogy:
- Claude Code speaks "Anthropic language" 🇫🇷
- Ollama only speaks "OpenAI language" 🇩🇪
- llama.cpp is bilingual and can speak "Anthropic language" 🇫🇷 (with `--jinja` flag)
You can't have a conversation if you don't speak the same language!
Technical Differences:
| Feature | llama.cpp | Ollama | Impact |
|---|---|---|---|
| `--jinja` flag | ✅ Available | ❌ Not available | Required for tool calling |
| API customization | ✅ Flexible | ❌ Fixed OpenAI format | Allows Anthropic compatibility |
| Response format | ✅ Configurable | ❌ Standard only | Matches Claude expectations |
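To make the difference concrete, here are simplified sketches of the two request payloads (field sets reduced; real requests also need auth and version headers, and the model names are placeholders):

```shell
# Anthropic Messages API: POST /v1/messages; max_tokens is required.
anthropic_payload='{"model":"some-claude-model","max_tokens":1024,"messages":[{"role":"user","content":"Hi"}]}'

# OpenAI-style chat API (what Ollama serves): POST /v1/chat/completions.
openai_payload='{"model":"qwen3-coder","messages":[{"role":"user","content":"Hi"}]}'

# Both are valid JSON, but the endpoints, required fields, headers, and
# response schemas differ - so one server cannot answer the other's client.
echo "$anthropic_payload" | python3 -m json.tool > /dev/null && echo "anthropic payload: valid JSON"
echo "$openai_payload"    | python3 -m json.tool > /dev/null && echo "openai payload: valid JSON"
```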
The Critical --jinja Flag:
This is the most important difference:
# llama.cpp (WORKS) ✅
./llama-server -hf model --port 8000 --jinja # ← This flag is CRITICAL
# Ollama (FAILS) ❌
ollama serve # ← No --jinja flag available

What --jinja does:
- Enables proper tool calling / function calling support
- Formats responses in a way Claude Code can understand
- Without it: Claude Code hangs forever with "Deliberating..." or gets 500 errors
What Actually Happens:
# With Ollama ❌
claude --model kimi-k2.5:cloud
# Result: 500 {"type":"error","error":{"type":"api_error",...}}
claude --model devstral-small-2
# Result: Hangs forever with "Deliberating..."
# With llama.cpp ✅
claude --model Qwen3-Coder
# Result: SUCCESS - Full working response!

Why Articles Claim "Ollama Works":
This confuses users because:
- Misleading titles: Articles say "Run Claude Code with Ollama" but actually use llama.cpp
- API translation layers: Some use litellm or OpenRouter to translate between APIs
- Outdated info: Older Claude Code versions had different requirements
What Each Tool is Designed For:
Ollama is designed for:
- ✅ OpenAI API compatibility
- ✅ Easy local chat in terminal
- ✅ Simple model management with `ollama pull`
But NOT for:
- ❌ Claude Code's Anthropic API format
- ❌ Claude Code's tool calling requirements
- ❌ IDE integration with file editing
llama.cpp is designed for:
- ✅ Maximum flexibility and customization
- ✅ Custom API formats (can mimic Anthropic)
- ✅ Advanced features like `--jinja` for tool calling
- ✅ Works with Claude Code's requirements
- macOS with Apple Silicon OR Linux with NVIDIA GPU
- 16GB+ RAM (32GB+ recommended for larger models)
- Git, CMake installed
macOS (Apple Silicon):
# Install dependencies
brew install cmake
# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_METAL=ON
cmake --build llama.cpp/build --config Release -j

Linux (NVIDIA GPU):
# Install dependencies
sudo apt-get update && sudo apt-get install build-essential cmake git -y
# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j

cd llama.cpp
# Start server with CRITICAL FLAGS (--jinja is required for tool calling;
# a comment cannot follow a trailing backslash, so it lives up here)
./llama-server -hf bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M \
--alias "Qwen3-Coder" \
--port 8000 \
--jinja \
--kv-unified \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--batch-size 4096 --ubatch-size 1024 \
--ctx-size 64000Important flags explained:
- `--jinja` - REQUIRED for Claude Code tool calling to work
- `-hf` - Download model directly from HuggingFace
- `--alias` - Model name to use in Claude Code
- `--ctx-size 64000` - Large context for Claude Code's tool definitions (~35k tokens)
Keep this terminal running - the server must stay active.
Method 1: Interactive TUI (Recommended)
# Open configuration TUI
daf config edit
# Navigate to "Model Providers" tab
# Click "Add Profile" → "Custom Provider"
# Fill in:
# Name: llama-cpp
# Base URL: http://localhost:8000
# Auth Token: llama-cpp
# API Key: (leave empty)
# Model Name: Qwen3-Coder
#
# Click "Set as Default" (optional)
# Press Ctrl+S to save

Method 2: Manual JSON Edit
Edit ~/.daf-sessions/config.json:
{
"model_provider": {
"default_profile": "llama-cpp",
"profiles": {
"llama-cpp": {
"name": "llama-cpp",
"base_url": "http://localhost:8000",
"auth_token": "llama-cpp",
"api_key": "",
"model_name": "Qwen3-Coder"
}
}
}
}

Method 3: Environment Variables (Temporary)
export ANTHROPIC_BASE_URL="http://localhost:8000"
export ANTHROPIC_AUTH_TOKEN="llama-cpp"
export ANTHROPIC_API_KEY=""

# Uses default profile (llama-cpp)
daf open PROJ-123
# Or override per session
daf open PROJ-123 --model-profile llama-cpp
# Or use environment variable
MODEL_PROVIDER_PROFILE=llama-cpp daf open PROJ-456

Session starts with Claude Code using your local llama.cpp model!
In Claude Code, type: hi
Expected:
- First prompt takes 30-60 seconds (processing 35k tokens of tool definitions)
- You get a response from the model
- Subsequent prompts are much faster
If it hangs forever:
- Check llama.cpp server logs
- Verify `--jinja` flag was included
- Verify model supports tool calling
Initial Prompt:
- Claude Code sends ~35,140 tokens on first prompt (tool definitions, context)
- llama.cpp processes at ~2048 tokens/batch
- Expect 30-60 seconds for first response
- This is normal!
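The 30-60 second figure follows from simple arithmetic, assuming prompt-processing throughput somewhere around 600-1200 tokens/s (a hardware-dependent guess; measure your own setup):

```shell
# ~35,140 prompt tokens divided by processing speed gives the wait time.
prompt_tokens=35140
for tps in 600 1200; do
  awk -v n="$prompt_tokens" -v r="$tps" \
    'BEGIN { printf "%d tok/s -> ~%.0f s to first token\n", r, n / r }'
done
```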
Subsequent Prompts:
- Much faster (context already loaded)
- Response time depends on hardware and model size
Hardware Recommendations:
- 16GB RAM: Use Q4_K_M quantized models (24B parameters max)
- 32GB RAM: Use Q4_K_M or Q5_K_M quantized models (30B parameters comfortable)
- 64GB+ RAM: Larger models and higher quantization
For 32GB RAM:
# Qwen3-Coder (25B) - Excellent for coding
./llama-server -hf bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M \
--alias "Qwen3-Coder" --port 8000 --jinja \
--ctx-size 64000 --batch-size 4096 --ubatch-size 1024
# DeepSeek-Coder V2 (16B) - Fast and capable
./llama-server -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q5_K_M \
--alias "DeepSeek-Coder" --port 8000 --jinja \
--ctx-size 64000 --batch-size 4096 --ubatch-size 1024For 16GB RAM:
# Qwen2.5-Coder (14B) - Good balance
./llama-server -hf bartowski/Qwen2.5-Coder-14B-Instruct-GGUF:Q4_K_M \
--alias "Qwen2.5-Coder" --port 8000 --jinja \
--ctx-size 64000 --batch-size 4096 --ubatch-size 1024

You can configure multiple llama.cpp profiles for different models:
{
"model_provider": {
"default_profile": "llama-coding",
"profiles": {
"llama-coding": {
"name": "llama-coding",
"base_url": "http://localhost:8000",
"auth_token": "llama-cpp",
"api_key": "",
"model_name": "Qwen3-Coder"
},
"llama-fast": {
"name": "llama-fast",
"base_url": "http://localhost:8001",
"auth_token": "llama-cpp",
"api_key": "",
"model_name": "Qwen2.5-7B"
}
}
}
}

Start multiple servers on different ports:
# Terminal 1: Larger model for complex tasks
./llama-server -hf bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q4_K_M \
--port 8000 --alias "Qwen3-Coder" --jinja --ctx-size 64000
# Terminal 2: Smaller model for quick tasks
./llama-server -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M \
--port 8001 --alias "Qwen2.5-7B" --jinja --ctx-size 64000

Switch between them:
daf open PROJ-123 --model-profile llama-coding # Use 25B model
daf open PROJ-456 --model-profile llama-fast # Use 7B model

Pros:
- ✅ Full Claude Code IDE integration (file editing, multi-file changes)
- ✅ Works with any GGUF model from HuggingFace
- ✅ Completely offline
- ✅ Zero cost
- ✅ Tested and confirmed working
- ✅ Control over model size, quantization, hardware usage
Cons:
- ⚠️ Complex initial setup (build from source)
- ⚠️ Slow first prompt (30-60 seconds)
- ⚠️ Requires manual server management
- ⚠️ Need to keep terminal running
Time: 2 minutes | Cost: Pay-per-use | Best for: Access to many models with one API key | Status: ⚠️ Untested
OpenRouter provides a universal adapter for AI APIs.
- Go to openrouter.ai
- Create account and generate API key
- Add credits to account
{
"model_provider": {
"profiles": {
"openrouter-free": {
"name": "openrouter-free",
"base_url": "https://openrouter.ai/api",
"auth_token": "YOUR_OPENROUTER_KEY",
"api_key": "",
"model_name": "openai/gpt-oss-120b:free"
},
"openrouter-deepseek": {
"name": "openrouter-deepseek",
"base_url": "https://openrouter.ai/api",
"auth_token": "YOUR_OPENROUTER_KEY",
"api_key": "",
"model_name": "deepseek/deepseek-v3.2"
}
}
}
}

Popular Models:
- `openai/gpt-oss-120b:free` - Free tier
- `deepseek/deepseek-v3.2` - Cheapest ($0.28/M tokens)
- `anthropic/claude-3.5-sonnet` - High quality
MODEL_PROVIDER_PROFILE=openrouter-deepseek daf open PROJ-123

Time: 5 minutes | Cost: Free | Best for: GUI model management | Status: ⚠️ Untested
Download from lmstudio.ai/download
Or for servers:
curl -fsSL https://lmstudio.ai/install.sh | bash

Using GUI: Browse and download models directly

Using CLI:
lms chat
# Then use /download command to search and download models

lms server start --port 1234

{
"model_provider": {
"profiles": {
"lmstudio": {
"name": "lmstudio",
"base_url": "http://localhost:1234",
"auth_token": "lmstudio",
"api_key": "",
"model_name": "qwen/qwen3-coder-30b"
}
}
}
}

Time: 5 minutes | Cost: Pay-per-use | Best for: Enterprise GCP users | Status: ✅ Tested and working
- Enable Vertex AI API in your GCP project
- Set up authentication (Application Default Credentials)
gcloud auth application-default login

{
"model_provider": {
"default_profile": "vertex",
"profiles": {
"vertex": {
"name": "vertex",
"use_vertex": true,
"vertex_project_id": "your-gcp-project-id",
"vertex_region": "us-east5",
"model_name": "claude-3-5-sonnet-v2@20250929"
}
}
}
}

Edit ~/.daf-sessions/config.json:
{
"model_provider": {
"default_profile": "llama-cpp" // Changed from "anthropic"
}
}

Use environment variable:
# Use llama.cpp for this session
MODEL_PROVIDER_PROFILE=llama-cpp daf open PROJ-123
# Use Vertex AI for this session
MODEL_PROVIDER_PROFILE=vertex daf open PROJ-456
# Use Anthropic API (override llama.cpp default)
MODEL_PROVIDER_PROFILE=anthropic daf open PROJ-789

{
"model_provider": {
"default_profile": "vertex",
"profiles": {
"vertex": {
"name": "vertex",
"use_vertex": true,
"vertex_project_id": "work-project",
"vertex_region": "us-east5"
},
"llama-cpp": {
"name": "llama-cpp",
"base_url": "http://localhost:8000",
"auth_token": "llama-cpp",
"api_key": "",
"model_name": "Qwen3-Coder"
},
"anthropic": {
"name": "anthropic"
}
}
}
}

Usage:
- Work: Uses Vertex AI (default)
- Testing locally: `MODEL_PROVIDER_PROFILE=llama-cpp daf open`
- Emergency (llama.cpp server down): `MODEL_PROVIDER_PROFILE=anthropic daf open`
Enterprises can enforce model provider usage across all users for compliance, cost control, and security.
~/.daf-sessions/enterprise.json:
{
"model_provider": {
"default_profile": "vertex-prod",
"profiles": {
"vertex-prod": {
"name": "Vertex AI Production",
"use_vertex": true,
"vertex_project_id": "company-gcp-project",
"vertex_region": "us-east5",
"cost_per_million_input_tokens": 3.00,
"cost_per_million_output_tokens": 15.00,
"monthly_budget_usd": 5000.00,
"cost_center": "ENG-PLATFORM"
}
}
}
}

Enforcement Behavior:
- ✅ Users cannot add, edit, or delete profiles in the UI (buttons disabled)
- ✅ Users cannot change the default profile (dropdown disabled)
- ✅ Config file saves do not persist user model_provider overrides
- ✅ Warning message displayed: "Model provider configuration is enforced by enterprise configuration"
- ✅ Audit logs track all model provider usage with enforcement source
Cost Tracking:
- `cost_per_million_input_tokens`: Estimated cost per million input tokens in USD
- `cost_per_million_output_tokens`: Estimated cost per million output tokens in USD
- `monthly_budget_usd`: Monthly budget limit for alerts and tracking
- `cost_center`: Department or cost center code for accounting/chargeback
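As an illustration of how these fields could feed a budget check (a hypothetical helper; daf's actual tracking lives in the audit log):

```shell
# Estimate spend in USD from token counts, using the example profile's
# rates: $3.00/M input tokens and $15.00/M output tokens.
estimate_cost() {
  in_tokens="$1"; out_tokens="$2"
  awk -v i="$in_tokens" -v o="$out_tokens" \
    'BEGIN { printf "%.2f", i / 1e6 * 3.00 + o / 1e6 * 15.00 }'
}

estimate_cost 200000000 20000000   # 200M in + 20M out -> 900.00, under the 5000.00 budget
```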
All model provider usage is logged to ~/.daf-sessions/audit.log with cost tracking metadata
for enterprise budget management and compliance reporting.
~/.daf-sessions/organization.json:
{
"model_provider": {
"profiles": {
"llama-cpp-shared": {
"name": "llama-cpp-shared",
"base_url": "http://llama-cpp.internal.company.com:8000",
"auth_token": "llama-cpp",
"api_key": "",
"model_name": "Qwen3-Coder"
}
}
}
}

Teams can use the shared server without individual setup.
Problem: "I followed Ollama setup but it doesn't work"
Symptoms:
- 500 errors: `{"type":"error","error":{"type":"api_error",...}}`
- Infinite hangs: `✽ Deliberating…` never completes
- Tool calling fails or file editing doesn't work
Root Cause: Ollama uses OpenAI-compatible API, but Claude Code requires Anthropic's API format. These are fundamentally incompatible (like trying to plug USB-A into USB-C).
Solution: ✅ Use llama.cpp instead:
- Follow llama.cpp setup guide above
- Key difference: llama.cpp has `--jinja` flag for tool calling compatibility
- llama.cpp server can be configured to match Anthropic API expectations
Why articles claim Ollama works:
- They actually use llama.cpp (not Ollama) but title says "Ollama"
- They use API translation layers (litellm, OpenRouter)
- Information is outdated from older Claude Code versions
Problem: llama.cpp hangs forever / very slow first response
Expected Behavior:
- First prompt: 30-60 seconds (processing ~35k tokens of tool definitions)
- Subsequent prompts: Much faster (context already loaded)
If stuck after 2+ minutes:
1. Check Hardware Requirements:
   - Model too large for available RAM?
   - Try smaller quantization (Q4_K_M instead of Q6_K)
   - Try smaller model (14B instead of 25B)

2. Reduce Resource Usage:
   # Reduce context size
   --ctx-size 32000 # instead of 64000
   # Reduce batch size (slower but less memory)
   --batch-size 2048 --ubatch-size 512

3. Check Server Logs:
   - Look for out-of-memory errors
   - Look for model loading failures
   - Verify model supports instruct format
Performance Tuning:
- 16GB RAM: Use Q4_K_M quantized models, 14B-16B parameters max
- 32GB RAM: Use Q4_K_M or Q5_K_M, 25B-30B parameters comfortable
- 64GB+ RAM: Larger models and higher quantization
Problem: Profile not found
Error Message:
Warning: MODEL_PROVIDER_PROFILE=llama-cpp not found in configuration
Solutions:
- Check profile name in `~/.daf-sessions/config.json`
- Profile names are case-sensitive (`llama-cpp` ≠ `Llama-Cpp`)
- Verify JSON syntax is valid (use `daf config validate`)
- Try using TUI: `daf config edit` to visually verify profiles
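A small helper can catch both bad JSON and misspelled profile names (a hypothetical script; adjust the path if your config lives elsewhere):

```shell
# Print profile names exactly as spelled in a daf config file; lookup is
# case-sensitive, so compare these against the name you passed.
list_profiles() {
  python3 - "$1" <<'EOF'
import json, sys

try:
    with open(sys.argv[1]) as f:
        cfg = json.load(f)
except (OSError, json.JSONDecodeError) as exc:
    print(f"config problem: {exc}")
    sys.exit(0)

profiles = cfg.get("model_provider", {}).get("profiles", {})
print(", ".join(sorted(profiles)) or "(none)")
EOF
}

list_profiles "$HOME/.daf-sessions/config.json"
```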
Problem: Missing --jinja flag
Symptoms:
- Server starts fine
- Claude Code connects
- Hangs on first prompt or tool use fails
- File editing doesn't work
Solution:
# ❌ WRONG (missing --jinja)
./llama-server -hf model --port 8000
# ✅ CORRECT (includes --jinja)
./llama-server -hf model --port 8000 --jinja

The --jinja flag is REQUIRED for Claude Code tool calling to work!
Problem: Connection refused
Error Message:
Error: Failed to connect to http://localhost:8000
Debugging Steps:
1. Verify server is running:
   curl http://localhost:8000/health
   # OR
   curl http://localhost:8000/v1/models

2. Check port number:
   - Does base_url in profile match server port?
   - Is another process using the port? (lsof -i :8000)

3. For llama.cpp:
   - Check terminal where llama-server is running
   - Look for startup errors or crashes

4. Firewall/Network:
   - Verify no firewall blocking localhost connections
   - Try 127.0.0.1 instead of localhost
Problem: Model not found
Error Message:
Error: model 'Qwen3-Coder' not found
Solutions:
1. Verify alias matches:
   - Server --alias flag must match profile model_name
   - Example: --alias "Qwen3-Coder" → "model_name": "Qwen3-Coder"

2. Check server logs:
   - Look for model loading errors
   - Verify model downloaded successfully from HuggingFace

3. Try different model:
   - Test with known-working model first
   - Example: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M
Problem: Vertex AI authentication failed
Solutions:
1. Re-authenticate:
   gcloud auth application-default login

2. Verify project ID:
   - Check vertex_project_id is correct
   - Run gcloud projects list to see available projects

3. Enable API:
   - Vertex AI API must be enabled: gcloud services enable aiplatform.googleapis.com

4. Check region:
   - Verify vertex_region is valid (e.g., us-east5, us-central1)
   - Not all regions support all models
Problem: OpenRouter API errors
Common Issues:
1. Invalid API key:
   - Verify key is correct in profile auth_token
   - Check key hasn't been revoked at openrouter.ai

2. Insufficient credits:
   - Add credits to your OpenRouter account

3. Model not available:
   - Check model name is exact (case-sensitive)
   - Verify model is available at openrouter.ai/models
Enable verbose logging:
# For llama.cpp
./llama-server --log-enable --log-file server.log ...
# For daf commands
daf open PROJ-123 --verbose

Test with curl:
# Test if server responds
curl http://localhost:8000/v1/models
# Test basic completion (OpenAI format)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen3-Coder","messages":[{"role":"user","content":"Hi"}]}'

Check daf configuration:
# Validate config syntax
daf config validate
# Show active configuration
daf config show
# Show all profiles
daf config show-profiles

Test step-by-step:
- ✅ Server running? (`curl http://localhost:8000/health`)
- ✅ Profile configured? (`daf config show-profiles`)
- ✅ Profile selected? (`daf open --model-profile llama-cpp`)
- ✅ Claude Code launches? (check for errors)
- ✅ First response? (wait 30-60 seconds)
Based on testing with MacBook Pro M1 (32GB) and Nvidia DGX Spark (120GB, GB10):
| Provider | Model | Hardware | Speed | Quality | Cost | Status |
|---|---|---|---|---|---|---|
| Anthropic | Claude Opus 4.6 | Cloud | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | $$$$ | ✅ Tested |
| Vertex AI | Claude 3.5 Sonnet v2 | Cloud | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | $$$ | ✅ Tested |
| llama.cpp | Qwen3-Coder (25B Q4) | Mac M1 32GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Free | ✅ Tested |
| llama.cpp | DeepSeek-Coder (16B Q5) | DGX Spark | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Free | ✅ Tested |
| OpenRouter | deepseek-v3.2 | Cloud | ? | ? | $ (98% cheaper) | ⚠️ Untested |
| LM Studio | Various models | Local | ? | ? | Free | ⚠️ Untested |
| Ollama (direct API) | Any | Local | ❌ | ❌ | - | ❌ Incompatible |
Note: Only Anthropic API, Vertex AI, and llama.cpp have been tested and confirmed working as model provider profiles. OpenRouter and LM Studio are theoretically compatible but need testing. Pointing Claude Code directly at Ollama's API does NOT work due to API incompatibility; use daf's native Ollama integration instead (`agent_backend: "ollama"`, see the Quick Start).
| Solution | Setup | Cost | Offline | IDE Integration | Model Choice | Status |
|---|---|---|---|---|---|---|
| Anthropic API | ⭐⭐⭐⭐⭐ Instant | $$$$ High | ❌ No | ✅ Full | Claude only | ✅ Tested |
| Vertex AI | ⭐⭐⭐⭐ Easy | $$$ Medium | ❌ No | ✅ Full | Claude only | ✅ Tested |
| llama.cpp | ⭐⭐ Complex | Free | ✅ Yes | ✅ Full | Any GGUF | ✅ Tested |
| OpenRouter | ⭐⭐⭐⭐⭐ Instant | $ Very low | ❌ No | ? | 100+ models | ❓ Untested |
| LM Studio | ⭐⭐⭐⭐ Easy | Free | ✅ Yes | ? | Any GGUF | ❓ Untested |
| Ollama (direct API) | ❌ Not compatible | - | - | ❌ No | - | ❌ Incompatible |
Use Anthropic API when:
- ✅ Want best quality (Claude Opus 4.6)
- ✅ Don't mind cost
- ✅ Need instant setup
- ✅ Have internet connection
- ✅ Need production reliability
Use Vertex AI when:
- ✅ Enterprise GCP user
- ✅ Need Claude models
- ✅ Want enterprise billing/security
- ✅ Already using GCP infrastructure
- ✅ Need compliance/audit trails
Use OpenRouter when:
- ✅ Want cloud convenience
- ✅ Want very low cost ($0.28/M tokens)
- ✅ Want access to many models
- ✅ Need instant setup
- ✅ Willing to pay (but cheaply)
- ✅ Want to test different models easily
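The "98% cheaper" figure follows directly from the per-token prices quoted in this guide ($15/M tokens for Claude Opus vs $0.28/M tokens for deepseek via OpenRouter):

```shell
# Savings = 1 - (cheap price / expensive price), using prices from this guide.
awk 'BEGIN {
  opus = 15.00   # USD per million tokens (Claude Opus, per this guide)
  ds   = 0.28    # USD per million tokens (deepseek via OpenRouter)
  printf "savings: %.1f%%\n", (1 - ds / opus) * 100   # → savings: 98.1%
}'
```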
Use llama.cpp when:
- ✅ Want completely offline/local
- ✅ Want zero cost
- ✅ Want full IDE integration
- ✅ Want control over model selection
- ✅ Don't mind complex setup
- ✅ Don't mind slow initial prompt (30-60s)
- ✅ Have sufficient hardware (16GB+ RAM)
Use LM Studio when:
- ✅ Want local models with GUI
- ✅ Prefer visual model management
- ✅ Want zero cost
- ✅ Don't mind slower performance vs llama.cpp
- ✅ Want easier setup than llama.cpp
Do NOT point Claude Code directly at Ollama's API:
- ❌ The direct API connection is incompatible with Claude Code
- ❌ All models fail (500 errors or hangs)
- ✅ Use daf's native Ollama integration (`agent_backend: "ollama"`) instead, or llama.cpp / LM Studio
Want to try a different LLM server? Here's what you need to know before testing:
For an LLM server to work with Claude Code, it MUST support:
- ✅ Anthropic Messages API format (not just OpenAI-compatible)
- ✅ Tool calling / function calling in Anthropic format
- ✅ Response streaming in the format Claude Code expects
Most LLM servers WON'T work because they're designed for OpenAI API compatibility, which is incompatible with Claude Code.
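To make the format gap concrete, here are minimal request bodies in each style (field names per the public Anthropic and OpenAI API references; the model names are placeholders and nothing is sent anywhere):

```shell
# Write a minimal request body in each API style, then diff the top-level keys.
cat > /tmp/anthropic.json <<'EOF'
{"model": "claude-sonnet", "max_tokens": 64,
 "messages": [{"role": "user", "content": "Hi"}]}
EOF
cat > /tmp/openai.json <<'EOF'
{"model": "some-model",
 "messages": [{"role": "user", "content": "Hi"}]}
EOF
# Anthropic requires max_tokens and puts system prompts in a separate
# top-level `system` field; tool definitions and tool-result blocks
# differ even more between the two formats.
python3 - <<'EOF'
import json
a = json.load(open("/tmp/anthropic.json"))
o = json.load(open("/tmp/openai.json"))
print("anthropic-only top-level keys:", sorted(set(a) - set(o)))
EOF
```

Even this trivial case diverges; once tool calling is involved, an OpenAI-only server cannot satisfy Claude Code's requests.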
| Server | Status | Reason |
|---|---|---|
| llama.cpp | ✅ Works | Flexible API, --jinja flag for tool calling |
| LM Studio | ✅ Works | GUI wrapper around llama.cpp |
| OpenRouter | ✅ Works | Cloud service with multi-API support |
| Vertex AI | ✅ Works | Native Claude models from Google Cloud |
| Ollama (direct API) | ❌ Incompatible | OpenAI-only format, no `--jinja` equivalent |
| vLLM | ❓ Untested | OpenAI-compatible only |
| Text Gen Inference | ❓ Untested | OpenAI-compatible only |
| LocalAI | ❓ Untested | Explicitly OpenAI-compatible |
| FastChat | ❓ Untested | OpenAI-compatible API |
| Koboldcpp | ❓ Untested | llama.cpp fork, might work |
| Jan | ❓ Untested | Unclear API format |
Before adding a new server to the documentation, please test:
- Basic connectivity: Can you start a session?
- Simple prompts: Does it respond to "hi" or simple questions?
- Tool calling: Can it edit files when you ask? This is the critical test!
- Multi-turn conversation: Does context work across multiple prompts?
If it fails at tool calling (step 3), it's incompatible - don't waste more time.
```bash
# 1. Start your LLM server
# 2. Configure daf profile pointing to it

# 3. Open a test session
daf new --name test-server --goal "Test compatibility"

# 4. In Claude Code, try a file operation:
# Type: "Create a file called test.txt with the word 'hello' in it"
# Expected: ✅ File is created
# Failure: ❌ Hangs, errors, or just responds without creating file
```

If tool calling works, congrats! 🎉 Please report your success:
- Open an issue at: https://github.com/itdove/devaiflow/issues
- Title: "Confirmed working: [Server Name] with Claude Code"
- Include: Server version, configuration, test results
We learned from the Ollama situation:
- ❌ Misleading documentation wastes users' time
- ❌ Untested claims damage credibility
- ✅ Only tested, verified configurations should be documented
Help us expand this list! Test servers and report your results. Community-verified configurations will be added to the official documentation.
- Test locally first: Use llama.cpp or LM Studio to test workflow before committing to paid API
- Have a fallback: Configure multiple profiles (local + cloud)
- Match model to task: Use smaller models for simple tasks, larger for complex
- Monitor costs: Track API usage for cloud providers
- Keep profiles updated: Document which models work well for your use cases
- Use Ollama only via the native integration: pointing Claude Code directly at Ollama's API is incompatible; set `agent_backend` to `ollama` instead (see the Quick Start)
- AI Agent Support Matrix - Compare different AI agents
- Configuration Guide - Full configuration reference
- Article: Run Claude Code on Local/Cloud Models