A hands-on guide to orchestrating AI agents entirely on your local machine
What if you could run a team of AI agents—planning, researching, critiquing, and writing—entirely on your laptop? No cloud API keys. No per-token billing. No data leaving your machine.
This demo shows you exactly how to do that using two powerful tools from Microsoft:
- Foundry Local: An on-device inference runtime that runs models like Qwen, Phi, and Llama locally
- Microsoft Agent Framework (MAF): A Python library for building and orchestrating AI agents
The result is a "Local Research & Synthesis Desk"—four AI agents that collaborate to answer research questions using your local documents.
Single LLM calls are powerful, but they have limitations:
- Context window constraints — Complex tasks may exceed the model's context limit
- Lack of specialization — One prompt trying to do everything often underperforms
- No iterative refinement — Single-shot responses miss review and improvement cycles
- Hard to debug — A monolithic prompt is harder to trace than discrete agent steps
Multi-agent orchestration solves these by breaking work into specialized roles:
```
User Question
     │
     ▼
  Planner   → Breaks question into sub-tasks
     │
     ▼
 Retriever  → Finds relevant content from local docs
     │
     ▼
  Critic    → Reviews for gaps and contradictions
     │
     ├─ GAPS FOUND ─► Retriever (gap-fill) → Re-retrieves, then back
     │                to the Critic (up to 2 iterations)
     │
     │ NO GAPS
     ▼
  Writer    → Produces final report with citations
```
Each agent has a focused system prompt and receives structured input from previous agents. The Critic–Retriever feedback loop is a key innovation: rather than accepting whatever the Retriever produces on the first pass, the Critic can request targeted gap-filling, resulting in higher quality reports.
Here's how Foundry Local and MAF work together:
```
┌────────────────────────────────────────────────────────────┐
│                        Your Machine                        │
│                                                            │
│  ┌──────────────┐    Control Plane     ┌───────────────┐   │
│  │  Python App  │─(foundry-local-sdk)─►│ Foundry Local │   │
│  │ (MAF agents) │                      │    Service    │   │
│  │              │      Data Plane      │               │   │
│  │OpenAIChatClient─────(OpenAI API)───►│  Model (LLM)  │   │
│  └──────────────┘                      └───────────────┘   │
└────────────────────────────────────────────────────────────┘
```
Control Plane: The foundry-local-sdk manages the runtime—starting the service, downloading models, and returning the endpoint URL.
Data Plane: MAF's OpenAIChatClient sends chat completions to Foundry Local's OpenAI-compatible API. The port is assigned dynamically, so you never hardcode it.
The first challenge is starting Foundry Local and getting the endpoint URL. Here's how:
```python
from foundry_local import FoundryLocalManager

async def boot_foundry(model_alias: str = "qwen2.5-0.5b"):
    manager = FoundryLocalManager()
    # Download the model if needed (cached after the first download)
    await manager.download_model(model_alias)
    # Start the service and get the endpoint
    endpoint = await manager.start()
    return endpoint  # e.g., "http://localhost:54321/v1"
```

The SDK handles hardware detection—if you have a compatible GPU, it uses CUDA/DirectML. Otherwise, it falls back to CPU inference.
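Before wiring up agents, you can sanity-check the data plane with a direct OpenAI-client call. A minimal sketch, assuming the `openai` package is installed (the demo itself goes through MAF instead), and noting that the model name must match an ID Foundry Local actually reports, which can differ from the alias:

```python
import asyncio
from openai import OpenAI

endpoint = asyncio.run(boot_foundry())  # boot_foundry from the snippet above
client = OpenAI(base_url=endpoint, api_key="foundry-local")  # any non-empty key
resp = client.chat.completions.create(
    model="qwen2.5-0.5b",  # must match a model loaded in Foundry Local
    messages=[{"role": "user", "content": "Reply with the word 'ready'."}],
)
print(resp.choices[0].message.content)
```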
Each agent is a ChatAgent with a specialized system prompt:
```python
from agent_framework import ChatAgent, OpenAIChatClient

def create_agents(endpoint: str, model: str):
    client = OpenAIChatClient(
        base_url=endpoint,
        model=model,
        api_key="foundry-local",  # any non-empty string works
    )

    planner = ChatAgent(
        name="Planner",
        instructions="""You are a planning assistant.
Break the user's question into 2-4 concrete sub-tasks.
Output a numbered list of tasks.""",
        client=client,
    )

    retriever = ChatAgent(
        name="Retriever",
        instructions="""You are a research assistant.
Given sub-tasks and document snippets, extract relevant information.
Always cite the source document.""",
        client=client,
    )

    # ... similarly for Critic and Writer

    return {"planner": planner, "retriever": retriever, ...}
```

The demo showcases three orchestration patterns:
Sequential Pipeline — Each agent waits for the previous one:
```python
async def run_sequential(agents, question, docs):
    # Step 1: Planner breaks down the question
    plan = await agents["planner"].run(question)

    # Step 2: Retriever finds relevant content
    context = await agents["retriever"].run(
        f"Plan: {plan}\n\nDocuments: {docs}"
    )

    # Step 3: Critic reviews — may loop back to the Retriever
    for iteration in range(MAX_CRITIC_LOOPS):
        critique = await agents["critic"].run(
            f"Plan: {plan}\n\nContext: {context}"
        )
        if not critic_found_gaps(critique):
            break  # No gaps — proceed to Writer

        # Gap-fill: Retriever fetches additional content
        new_context = await agents["retriever"].run(
            f"Gaps: {critique}\n\nPrevious: {context}\n\nDocs: {docs}"
        )
        context = f"{context}\n\n{new_context}"

    # Step 4: Writer produces the final report
    report = await agents["writer"].run(
        f"Plan: {plan}\n\nContext: {context}\n\nCritique: {critique}"
    )
    return report
```

The Critic is instructed to output GAPS FOUND or NO GAPS at the start of its response, making it easy to parse programmatically. When gaps are found, the orchestrator sends the specific gaps back to the Retriever along with the original documents, merges the new snippets with the existing ones, and re-runs the Critic. This loop runs up to 2 times (MAX_CRITIC_LOOPS) before handing off to the Writer.
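The `critic_found_gaps` helper referenced above can be a simple prefix check. A minimal sketch, assuming the Critic follows its output convention (the demo's actual parser may be more defensive):

```python
def critic_found_gaps(critique: str) -> bool:
    """Return True if the Critic's response opens with the GAPS FOUND marker."""
    return critique.strip().upper().startswith("GAPS FOUND")
```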
Concurrent Fan-Out — Independent tasks run in parallel:
```python
import asyncio

async def run_concurrent_retrieval(agents, plan, docs):
    # Retriever and ToolAgent don't depend on each other
    results = await asyncio.gather(
        agents["retriever"].run(f"Plan: {plan}\n\nDocs: {docs}"),
        agents["tool_agent"].run(f"Analyze: {plan}"),
    )
    return merge_results(results)
```

The concurrent approach saves time when agents are independent—the Retriever searches documents while the ToolAgent extracts keywords, and both complete in the time of the slower task.
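The `merge_results` helper is left abstract above. One plausible shape, with hypothetical section labels, keeps each agent's output attributable in the merged context:

```python
def merge_results(results: list[str]) -> str:
    """Combine parallel agent outputs into one labeled context block."""
    retrieval, analysis = results  # asyncio.gather preserves argument order
    return f"[Retrieved context]\n{retrieval}\n\n[Keyword analysis]\n{analysis}"
```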
MAF supports tool calling where the LLM can invoke Python functions. Here's a simple example:
```python
from typing import Annotated
from pydantic import Field

def word_count(
    text: Annotated[str, Field(description="The text to count words in")],
) -> int:
    """Count the number of words in the given text."""
    return len(text.split())

# Register the tool with the agent
tool_agent = ChatAgent(
    name="ToolAgent",
    instructions="Use tools to analyze text when asked.",
    client=client,
    tools=[word_count],
)
```

When the user asks "How many words are in this paragraph?", the model emits a tool call, MAF executes the function, and the result is returned to the model for final response generation.
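For example, a run might look like this (hypothetical prompt; the exact wording of the reply varies by model):

```python
reply = await tool_agent.run(
    "How many words are in this sentence: 'Local inference keeps your data private.'?"
)
print(reply)  # e.g., "The sentence contains 6 words."
```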
The web UI uses Server-Sent Events (SSE) to stream agent progress:
```python
import json

from flask import Response, request

@app.route("/api/workflow")  # GET, since the browser's EventSource API can't POST
def run_workflow():
    question = request.args.get("q", "")

    def generate():
        # orchestrator is created at app startup (not shown here)
        for agent_name, output in orchestrator.run_stream(question):
            data = {"agent": agent_name, "text": output}
            yield f"data: {json.dumps(data)}\n\n"

    return Response(generate(), mimetype="text/event-stream")
```

The frontend listens to these events and updates the UI in real time:
```javascript
const source = new EventSource("/api/workflow?q=" + encodeURIComponent(question));
source.onmessage = (event) => {
  const data = JSON.parse(event.data);
  updateAgentOutput(data.agent, data.text);
};
```

The Critic–Retriever feedback loop demonstrates a powerful pattern: agents that evaluate their own pipeline's output and request corrections. Rather than a single pass that may miss important context, the system iteratively refines its retrieval until the Critic is satisfied (or a maximum iteration count is reached). This is similar to how human researchers work—reviewing their sources, identifying what's missing, and going back to find more.
Every computation—model inference, document retrieval, agent orchestration—happens on your machine. This means:
- Privacy: Your documents never leave your device
- Cost: No per-token API charges
- Latency: No network round-trips
- Offline: Works without internet (after initial model download)
Foundry Local automatically selects the best backend for your hardware:
- NVIDIA GPU: CUDA acceleration
- AMD/Intel GPU: DirectML acceleration
- NPU (Neural Processing Unit): Hardware-specific optimizations
- CPU: Works everywhere, just slower
The demo defaults to qwen2.5-0.5b (500M parameters) for fast iteration. In production, you might use:
| Model | Parameters | Speed | Quality |
|---|---|---|---|
| qwen2.5-0.5b | 500M | ⚡⚡⚡ | Good for demos |
| qwen2.5-3b | 3B | ⚡⚡ | Better reasoning |
| qwen2.5-7b | 7B | ⚡ | Production quality |
| phi-3.5-mini | 3.8B | ⚡⚡ | Excellent reasoning |
Switch models with a single flag: --model qwen2.5-7b
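One way that flag can be wired, assuming a plain argparse CLI (the demo's actual entry point may differ):

```python
import argparse
import asyncio

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="qwen2.5-0.5b",
                    help="Foundry Local model alias")
args = parser.parse_args()

endpoint = asyncio.run(boot_foundry(args.model))  # boot_foundry from earlier
agents = create_agents(endpoint, args.model)      # create_agents from earlier
```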
Smaller models sometimes emit raw `<tool_call>` XML in their responses. The fix is to strip it:

```python
import re

def clean_response(text: str) -> str:
    return re.sub(r"<tool_call>.*?</tool_call>\s*", "", text, flags=re.DOTALL).strip()
```

Not all models support function calling. Stick to models that explicitly support it:
- Qwen 2.5 family ✅
- Phi-3.5 family ✅
- Llama 3.2 family ✅
Model download can take several minutes depending on size and connection. The model is cached in ~/.foundry/ for future runs.
Larger models need more RAM/VRAM:
- 0.5B model: ~2GB
- 3B model: ~6GB
- 7B model: ~14GB
Check your available memory before selecting a model.
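A quick pre-flight check, assuming psutil is available (it is not part of the demo's requirements):

```python
import psutil

available_gb = psutil.virtual_memory().available / 1024**3
print(f"~{available_gb:.1f} GB RAM free")  # compare against the sizes above
```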
```bash
# 1. Install Foundry Local
#    Follow: https://github.com/microsoft/Foundry-Local

# 2. Clone and set up
git clone <this-repo>
cd agentframework-foundrylocal
python -m venv .venv && .venv\Scripts\activate  # Windows; on macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

# 3. Run the CLI demo
python -m src.app "What are the benefits of multi-agent AI systems?"

# 4. Launch the web UI
python -m src.app.web
# Open http://localhost:5000
```

This demo is a starting point. Here are ideas for extending it:
- Add more agents: A "Fact Checker" agent that verifies claims against external sources
- Implement memory: Let agents remember context across sessions using vector databases
- Add human-in-the-loop: Pause for user approval before the Writer produces the final report
- Build evaluation pipelines: Measure agent quality with automated metrics
- Deploy as a service: Package the orchestrator as a REST API for team use
- Extend the feedback loop: Add more sophisticated gap detection, or let the Critic suggest entirely new sub-tasks for the Planner
- Foundry Local: foundrylocal.ai
- Microsoft Agent Framework: learn.microsoft.com/en-us/agent-framework
- MAF GitHub Samples: github.com/microsoft/Agent-Framework-Samples
- Foundry Local SDK Reference: Microsoft Learn
Built with Microsoft Agent Framework and Foundry Local. All inference runs on your machine.