Merged
4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,9 @@ rename / release-pipeline wiring.
| Agent runtime | `graph/agent.py`, `server.py` | LangGraph `create_agent()` wired to the A2A handler, with streaming token capture for cost-v1 |
| LLM gateway | `graph/llm.py` | OpenAI-compatible client pointed at LiteLLM — swap models by editing the gateway config, not the fork |
| Subagents | `graph/subagents/config.py` | DeerFlow-pattern delegation via a `task()` tool; one placeholder `worker` ships |
| Starter tools | `tools/lg_tools.py` | Free, keyless tools so a fresh fork can demo real behaviour: `echo`, `current_time`, `calculator` (safe AST eval), `web_search` (DuckDuckGo), `fetch_url` |
| Starter tools | `tools/lg_tools.py` | Keyless general tools (`current_time`, `calculator` safe AST eval, `web_search` via DuckDuckGo, `fetch_url`) plus memory tools (`memory_ingest`, `memory_recall`, `memory_list`, `memory_stats`, `daily_log`) bound to the bundled store |
| Knowledge store | `knowledge/store.py` | sqlite + FTS5 (LIKE fallback). One `chunks` table for operator notes, daily-log entries, and conversation findings. Default-on; turn off with `middleware.knowledge: false` |
| Eval harness | `evals/` | Side-effect-verified A2A test harness — audit log + reply text + KB state. `python -m evals.runner` against a running agent. See [Eval your fork](./docs/guides/evals.md) |
| Tracing | `tracing.py` | Langfuse trace_session with distributed `a2a.trace` propagation and the OTel cross-context-detach filter |
| Observability | `metrics.py`, `audit.py` | Prometheus metrics with per-agent prefix, JSONL audit log with trace IDs |
| Output protocol | `graph/output_format.py` | `<scratch_pad>` / `<output>` parsing so the model can think without it leaking to users |
21 changes: 17 additions & 4 deletions TEMPLATE.md
Expand Up @@ -72,10 +72,10 @@ handler's output extraction depends on it.
## 4. Add your real tools

`tools/lg_tools.py` ships with a small keyless starter set so a
fresh clone can demonstrate a real research loop: `echo`,
`current_time`, `calculator` (safe AST eval — no `eval()`),
`web_search` (DuckDuckGo via `ddgs`), and `fetch_url`. Keep the
ones you want, drop the rest, and add your own:
fresh clone can demonstrate a real research loop: `current_time`,
`calculator` (safe AST eval — no `eval()`), `web_search` (DuckDuckGo
via `ddgs`), and `fetch_url`. Keep the ones you want, drop the rest,
and add your own:

```python
from langchain_core.tools import tool
Expand Up @@ -167,6 +167,19 @@ your fork. A useful pattern:
- Extend `tests/test_a2a_integration.py` with assertions for
your declared skills + extensions on the agent card

For end-to-end behaviour testing — "when the operator asks X, does
the right tool actually fire and the right row land in the KB?" —
the template ships an eval harness under `evals/`:

```bash
python -m evals.runner # against a running agent
python -m evals.runner --category tool
```

See [Eval your fork](./docs/guides/evals.md) for what each case
asserts, how the three assertion channels work, and how to add
cases for your fork's new tools.

## 9a. Understand the skill loop

protoAgent's skill loop lets your agent learn from experience automatically.
20 changes: 15 additions & 5 deletions config/langgraph-config.yaml
Expand Up @@ -22,14 +22,24 @@ model:
subagents:
worker:
enabled: true
tools: [echo, current_time, calculator, web_search, fetch_url]
tools:
- current_time
- calculator
- web_search
- fetch_url
- memory_ingest
- memory_recall
- memory_list
- memory_stats
- daily_log
max_turns: 20

middleware:
# The knowledge middleware requires a knowledge store. Leave false
# until you add one. Memory persistence is enabled by default and
# writes session summaries to /sandbox/memory/ without a store.
knowledge: false
# All three middlewares default ON. The knowledge middleware needs a
# store; the template constructs one automatically (see
# ``server.py::_build_knowledge_store``). Set ``knowledge: false`` if
# your fork is purely stateless.
knowledge: true
audit: true
memory: true

2 changes: 1 addition & 1 deletion docs/guides/customize-and-deploy.md
Expand Up @@ -66,7 +66,7 @@ Replace with the skills your agent actually advertises over A2A. The `name` and

## 5. (Optional) Add domain tools

`tools/lg_tools.py` ships with `echo`, `current_time`, `calculator`, `web_search`, `fetch_url`. Keep the ones you want, drop the rest, add your own. Update `get_all_tools()` at the bottom. Any tool returned from there becomes a checkbox in the wizard and drawer automatically.
`tools/lg_tools.py` ships with `current_time`, `calculator`, `web_search`, `fetch_url`. Keep the ones you want, drop the rest, add your own. Update `get_all_tools()` at the bottom. Any tool returned from there becomes a checkbox in the wizard and drawer automatically.

## 6. (Optional) Configure subagents

151 changes: 151 additions & 0 deletions docs/guides/evals.md
@@ -0,0 +1,151 @@
# Eval your fork

The template ships an eval harness under `evals/` so a fresh fork has
a working test suite for its tools, memory, and A2A protocol surface
on day one. Cases assert across three independent channels — audit
log, reply text, and knowledge-store side effects — so a model that
hallucinates a tool result still gets caught.

## When to read this

- You forked the template and want a baseline pass-rate before you
ship.
- You added a new tool and want to lock in its intent — "when the
operator says X, fire tool Y".
- You changed a prompt or model and want to measure regression.

## Run the suite

```bash
# Agent running at $EVAL_BASE_URL (default http://localhost:7870)
# with the relevant auth env (A2A_AUTH_TOKEN and/or <AGENT>_API_KEY).

python -m evals.runner
python -m evals.runner --category tool
python -m evals.runner --tasks current_time_intent,daily_log_intent
```

Reports land in `evals/results/run-<ts>.json`. The CLI prints a
pass/fail board; the JSON report carries reply previews and timing
for post-hoc inspection.

## The three assertion channels

```
prompt → A2A → audit log        (1) tools fired with expected outcome
             → reply text       (2) substrings present in reply
             → KB chunks table  (3) side effects landed correctly
```

A case passes only when every configured assertion holds. Most cases
should opt in to channels 1 and 3: text patterns alone break when the
model paraphrases, and they miss hallucinated tool results entirely.
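
As a rough illustration of "every configured assertion holds", the sketch below evaluates only the channels a case opts in to. The `Observed` container and `case_passes` helper are hypothetical names for this example; the case keys (`expected_tools`, `expected_patterns`, `verify_kb`) are taken from the case shape documented in this guide, and the real runner may combine channels differently.

```python
from dataclasses import dataclass, field


@dataclass
class Observed:
    """What one run produced. Field names are illustrative, not the harness API."""
    tools_fired: list[str] = field(default_factory=list)   # from the audit log
    reply: str = ""                                         # from the A2A reply
    kb_contents: list[str] = field(default_factory=list)   # from the chunks table


def case_passes(case: dict, seen: Observed) -> bool:
    """Return True only if every channel the case configures holds."""
    checks = []
    if "expected_tools" in case:        # channel 1: audit log
        checks.append(all(t in seen.tools_fired for t in case["expected_tools"]))
    if "expected_patterns" in case:     # channel 2: reply text
        checks.append(all(p in seen.reply for p in case["expected_patterns"]))
    if "verify_kb" in case:             # channel 3: KB side effects
        needle = case["verify_kb"]["find_chunk_containing"]
        checks.append(any(needle in chunk for chunk in seen.kb_contents))
    return all(checks)


case = {"expected_tools": ["calculator"], "expected_patterns": ["392"]}
right_answer_no_tool = Observed(reply="That comes to 392.")
right_answer_with_tool = Observed(tools_fired=["calculator"], reply="That comes to 392.")
```

Here `right_answer_no_tool` fails even though the reply text is correct, because the audit channel is configured and no tool fired.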

### Why side-effect verification beats text-only

A model can produce "Logged: ..." in its reply without actually
calling `daily_log`. Substring matching passes, the DB stays empty,
and the bug ships. Reading `audit.jsonl` and the `chunks` table
afterward catches it.
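
The failure mode above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the audit record key `"tool"` and the `chunks(content, domain, heading)` columns are guesses made for illustration, not the schema the template actually writes.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def tools_in_audit(audit_path: Path) -> set:
    """Collect tool names from a JSONL audit log (assumed record key: "tool")."""
    names = set()
    for line in audit_path.read_text().splitlines():
        if line.strip():
            names.add(json.loads(line).get("tool"))
    return names


def chunk_exists(db: sqlite3.Connection, needle: str) -> bool:
    """True if any row in the assumed chunks.content column contains needle."""
    row = db.execute(
        "SELECT 1 FROM chunks WHERE instr(content, ?) > 0 LIMIT 1", (needle,)
    ).fetchone()
    return row is not None


# Simulate a run where the model claims to have logged but never called the tool.
reply = "Logged: deployed v2 to staging"
audit_file = Path(tempfile.mkdtemp()) / "audit.jsonl"
audit_file.write_text(json.dumps({"tool": "current_time"}) + "\n")
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (content TEXT, domain TEXT, heading TEXT)")

text_ok = "Logged" in reply                          # channel 2 passes
tool_ok = "daily_log" in tools_in_audit(audit_file)  # channel 1 fails
kb_ok = chunk_exists(db, "deployed v2")              # channel 3 fails
print(text_ok, tool_ok, kb_ok)
```

Only the text check passes, which is exactly the case a substring-only harness would wave through.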

## The shape of a case

```json
{
  "id": "unique-id",
  "category": "tool",
  "kind": "ask",
  "name": "Asks for arithmetic → calculator",
  "prompt": "How much is 17 times 23, plus 1?",
  "expected_tools": ["calculator"],
  "expected_patterns": ["392"],
  "verify_kb": {
    "find_chunk_containing": "EVAL-MARK-XYZ",
    "domain": "context"
  },
  "setup": [{"kb_ingest": {"content": "...", "domain": "...", "heading": "..."}}],
  "teardown": [{"kb_delete_by_content": {"contains": "..."}}]
}
```

Three case `kind`s ship:

- `agent_card` — fetch `/.well-known/agent-card.json` and assert on
the card's name, skill count, and declared extensions.
- `auth_check` — send a request with a deliberately bad bearer and
assert the server returns the expected status (401 by default).
- `ask` — the main shape. Sends `prompt`, then asserts on tool firing,
reply patterns, and KB state.
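
A small pre-flight check for case files might look like the sketch below. The `validate_case` helper and its rules are hypothetical: they assume only that every case has an `id` and a `kind`, and that `ask` cases need a `prompt`, which is all this guide states. The real harness may require more.

```python
KNOWN_KINDS = {"agent_card", "auth_check", "ask"}


def validate_case(case: dict) -> list:
    """Return a list of human-readable problems; an empty list means the case looks usable."""
    problems = []
    for key in ("id", "kind"):
        if key not in case:
            problems.append(f"missing required key: {key}")
    kind = case.get("kind")
    if kind is not None and kind not in KNOWN_KINDS:
        problems.append(f"unknown kind: {kind}")
    # Only `ask` cases send a prompt, so only they need one.
    if kind == "ask" and not case.get("prompt"):
        problems.append("ask case has no prompt")
    return problems


good = {"id": "calc-1", "kind": "ask", "prompt": "How much is 17 times 23, plus 1?"}
bad = {"id": "calc-2", "kind": "ask"}
print(validate_case(good), validate_case(bad))
```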

## Prompt rule

**The tool name never appears in the prompt.** Every prompt must be
something a real user would plausibly type. "Use `daily_log` to
record..." tests instruction-following, not tool selection. Making the
agent infer the tool from intent *is* the test.
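
The rule is easy to enforce mechanically. The following is a hedged sketch of a lint that could run over a case file before the suite starts; `leaked_tool_names` is a hypothetical helper, not part of the shipped harness.

```python
import re


def leaked_tool_names(prompt: str, tool_names: list) -> list:
    """Return the tool names that appear as whole identifiers in the prompt."""
    found = []
    for name in tool_names:
        # Match the name only when it is not embedded in a longer identifier.
        pattern = rf"(?<![A-Za-z0-9_]){re.escape(name)}(?![A-Za-z0-9_])"
        if re.search(pattern, prompt, re.IGNORECASE):
            found.append(name)
    return found


tools = ["daily_log", "memory_recall", "fetch_url"]
print(leaked_tool_names("Use daily_log to record that we shipped v2", tools))
print(leaked_tool_names("Note for today: we shipped v2", tools))
```

Tool names that are also ordinary words (`calculator`, for example) would be flagged by this check too, so a fork may want an allowlist for them.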

## Setup and teardown — start clean every time

Each `ask` case can pre-seed state via `setup` blocks (BFCL's
`initial_config` pattern: direct DB writes the model never sees) and
clean up after itself with `teardown`. The fixture is invisible to
the agent — it discovers the seeded state via tools, exactly as a
real user would.

`teardown` runs even when assertions fail, so case order doesn't
matter and a noisy failure doesn't poison the next run.
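
One way to get the "teardown always runs" guarantee is a `try`/`finally` around the case body. This is an illustrative sketch, not the runner's actual code; `run_case` and its callable-based arguments are assumptions made for the example.

```python
from typing import Callable, Iterable


def run_case(
    setup: Iterable[Callable[[], None]],
    checks: Iterable[Callable[[], bool]],
    teardown: Iterable[Callable[[], None]],
) -> bool:
    """Run setup, evaluate checks, and always run teardown afterwards."""
    try:
        for step in setup:
            step()
        return all(check() for check in checks)
    except Exception:
        # A crashing setup step or check counts as a failed case.
        return False
    finally:
        # Runs on pass, on fail, and on error, so the next case starts clean.
        for step in teardown:
            step()


state = []
passed = run_case(
    setup=[lambda: state.append("seeded row")],
    checks=[lambda: False],          # simulate a failing assertion
    teardown=[state.clear],
)
print(passed, state)
```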

Supported setup/teardown step kinds (extend `evals/verify.py` to add
more):

| Step kind | Args | What it does |
|---|---|---|
| `kb_ingest` | `content`, `domain`, `heading?` | Insert a chunk |
| `kb_delete_by_content` | `contains` | Delete chunks where content LIKE `%contains%` |
| `kb_delete_by_heading` | `domain`, `heading` | Delete chunks matching (domain, heading) |
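
For orientation, the three step kinds map naturally onto plain SQL. The sketch below assumes a `chunks(content, domain, heading)` table and a hypothetical `apply_step` dispatcher; the real `evals/verify.py` may be organised differently and the real table likely has more columns.

```python
import sqlite3


def apply_step(db: sqlite3.Connection, step: dict) -> None:
    """Apply one setup/teardown step of the form {kind: {args}}."""
    (kind, args), = step.items()  # each step dict carries exactly one kind
    if kind == "kb_ingest":
        db.execute(
            "INSERT INTO chunks (content, domain, heading) VALUES (?, ?, ?)",
            (args["content"], args["domain"], args.get("heading")),
        )
    elif kind == "kb_delete_by_content":
        db.execute("DELETE FROM chunks WHERE content LIKE ?", (f"%{args['contains']}%",))
    elif kind == "kb_delete_by_heading":
        db.execute(
            "DELETE FROM chunks WHERE domain = ? AND heading = ?",
            (args["domain"], args["heading"]),
        )
    else:
        raise ValueError(f"unknown step kind: {kind}")
    db.commit()


def count(db: sqlite3.Connection) -> int:
    return db.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (content TEXT, domain TEXT, heading TEXT)")
apply_step(db, {"kb_ingest": {"content": "EVAL-MARK-XYZ note", "domain": "context"}})
apply_step(db, {"kb_delete_by_content": {"contains": "EVAL-MARK-XYZ"}})
print(count(db))
```

Because `LIKE` treats `%` and `_` as wildcards, unique markers such as `EVAL-MARK-XYZ` keep a content-based teardown from deleting unrelated rows.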

## What forks should test by default

The starter `tasks.json` covers:

- Agent card discovery (name, skill count, `cost-v1` extension)
- Bearer auth gating
- Each shipped tool fires from a plausible operator prompt
- Memory ingest → recall round-trip
- KB-driven middleware injection (no tool call needed)
- A chained two-tool case (`daily_log` then `memory_recall`)

When you add a tool, add at least one case for it. When you add a
skill to the agent card, extend the `card_discovery` case to assert
the new skill is advertised.

## Running in CI

The runner exits non-zero when any case fails, so it drops in cleanly:

```yaml
- name: Boot agent
  run: docker compose up -d agent

- name: Wait for the agent card endpoint
  run: ./scripts/wait-for-it.sh http://localhost:7870/.well-known/agent-card.json

- name: Run evals
  run: python -m evals.runner
  env:
    EVAL_BASE_URL: http://localhost:7870
    A2A_AUTH_TOKEN: ${{ secrets.AGENT_BEARER }}
```

For non-deterministic categories (any `tool` or `chained` case), aim
for an N-of-M majority threshold rather than 100% — the reference
implementation runs 3 attempts and gates at 2 passes for those
categories. Deterministic ones (`a2a-protocol`, `subsystem` with
seeded state) gate at 100%.
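
The N-of-M gate can be sketched as a small wrapper around a single attempt. `passes_majority` is a hypothetical helper written for this example; the defaults mirror the 3-attempt, 2-pass numbers quoted above, and the early exits are an optimisation the reference implementation may or may not have.

```python
from typing import Callable


def passes_majority(attempt: Callable[[], bool], attempts: int = 3, required: int = 2) -> bool:
    """Run up to `attempts` tries and pass once `required` of them succeed."""
    passes = 0
    for i in range(attempts):
        if attempt():
            passes += 1
        if passes >= required:
            return True           # threshold reached, no need to keep going
        remaining = attempts - i - 1
        if passes + remaining < required:
            return False          # threshold can no longer be reached
    return False


flaky = iter([False, True, True])
print(passes_majority(lambda: next(flaky)))
```

Deterministic categories can use the same helper with `attempts=1, required=1`, which is equivalent to gating at 100%.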

## References

- [`evals/README.md`](https://github.com/protoLabsAI/protoAgent/blob/main/evals/README.md) — quick reference for case authors
- Anthropic — [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
- BFCL V3 — [Multi-Turn](https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html)
- [ToolSandbox](https://arxiv.org/html/2408.04682v1) — user simulator + milestones / minefields
2 changes: 1 addition & 1 deletion docs/guides/fork-the-template.md
Expand Up @@ -43,7 +43,7 @@ Keep the `<scratch_pad>` / `<output>` protocol block in `prompts.py` — the A2A

## 4. Replace the starter tools

`tools/lg_tools.py` ships with `echo`, `current_time`, `calculator`, `web_search`, `fetch_url`. Keep what you want, drop the rest, add your own. Update `get_all_tools()` at the bottom of the file.
`tools/lg_tools.py` ships with `current_time`, `calculator`, `web_search`, `fetch_url`. Keep what you want, drop the rest, add your own. Update `get_all_tools()` at the bottom of the file.

See the [starter tools reference](/reference/starter-tools) for the shapes of the shipped ones.

1 change: 1 addition & 0 deletions docs/guides/index.md
Expand Up @@ -9,4 +9,5 @@ Task-oriented procedures. Assumes you already have a running agent (see [Tutoria
| [Add a custom skill](/guides/add-a-skill) | Your agent does new things and callers need to dispatch to them |
| [Configure subagents](/guides/subagents) | You want specialized delegates beyond the placeholder `worker` |
| [Wire Langfuse + Prometheus](/guides/observability) | You need traces and metrics in production |
| [Eval your fork](/guides/evals) | You want a baseline pass-rate for the tools / memory / A2A surface in your fork |
| [Deploy via GHCR](/guides/deploy) | You're ready to ship and want auto-deploy wired up |
21 changes: 17 additions & 4 deletions docs/guides/subagents.md
Expand Up @@ -56,7 +56,11 @@ The template's `LangGraphConfig` (in `graph/config.py`) has a `worker` field. Ad
class LangGraphConfig:
# ... existing fields ...
worker: SubagentDef = field(default_factory=lambda: SubagentDef(
tools=["echo", "current_time", "calculator", "web_search", "fetch_url"],
tools=[
"current_time", "calculator", "web_search", "fetch_url",
"memory_ingest", "memory_recall", "memory_list", "memory_stats",
"daily_log",
],
max_turns=20,
))
researcher: SubagentDef = field(default_factory=lambda: SubagentDef(
Expand Up @@ -86,7 +90,16 @@ for name in ("worker", "researcher"): # ← add new names
subagents:
worker:
enabled: true
tools: [echo, current_time, calculator, web_search, fetch_url]
tools:
- current_time
- calculator
- web_search
- fetch_url
- memory_ingest
- memory_recall
- memory_list
- memory_stats
- daily_log
max_turns: 20
researcher:
enabled: true
Expand Up @@ -117,8 +130,8 @@ If your agent is simple enough that subagents are pure overhead, flip `include_s
```python
_graph = create_agent_graph(
_graph_config,
knowledge_store=None,
include_subagents=False, # ← skip the task() tool and subagent machinery
knowledge_store=knowledge_store, # keep the bundled store wired up
include_subagents=False, # ← skip the task() tool and subagent machinery
)
```

25 changes: 17 additions & 8 deletions docs/reference/configuration.md
Expand Up @@ -17,13 +17,22 @@ model:
subagents:
worker:
enabled: true
tools: [echo, current_time, calculator, web_search, fetch_url]
tools:
- current_time
- calculator
- web_search
- fetch_url
- memory_ingest
- memory_recall
- memory_list
- memory_stats
- daily_log
max_turns: 20

middleware:
knowledge: false
knowledge: true
audit: true
memory: false
memory: true

knowledge:
db_path: /sandbox/knowledge/agent.db
Expand Up @@ -59,18 +68,18 @@ Adding a new subagent name to the YAML requires matching entries in `graph/subag

| Key | Default | What |
|---|---|---|
| `knowledge` | `false` | Inject retrieved knowledge into state before LLM calls. Requires a knowledge store — leave off until you add one. |
| `knowledge` | `true` | Inject retrieved knowledge into state before LLM calls. Backed by the bundled `KnowledgeStore` (sqlite + FTS5). Set `false` for a stateless agent. |
| `audit` | `true` | Append every tool call to `/sandbox/audit/audit.jsonl`. |
| `memory` | `false` | Memory middleware (experimental). Requires a knowledge store. |
| `memory` | `true` | Persist a session summary on terminal turn and asynchronously index conversation findings under `domain='finding'`. |

## `knowledge`

Only read when `middleware.knowledge` is `true`.

| Key | Default | What |
|---|---|---|
| `db_path` | `/sandbox/knowledge/agent.db` | SQLite file path. |
| `embed_model` | `nomic-embed-text` | Embedding model. |
| `db_path` | `/sandbox/knowledge/agent.db` | SQLite file path. Falls back to `~/.protoagent/knowledge/agent.db` automatically when the configured path isn't writable (e.g. running locally without `/sandbox`). Override at runtime with `KNOWLEDGE_DB_PATH`. |
| `embed_model` | `nomic-embed-text` | Reserved for forks that bolt embeddings on top of the FTS5 baseline. The bundled store ignores it. |
| `top_k` | `5` | Results per query fed into state. |

The template does not ship a knowledge store — the config keys are kept so a fork can flip the switch without rewiring every call site.
The bundled store is sqlite + FTS5 (with an automatic LIKE fallback when FTS5 isn't available). One `chunks` table; the `domain` column distinguishes operator-set notes (`memory_ingest`), daily-log entries (`daily_log`), and conversation findings extracted by `MemoryMiddleware` (`domain='finding'`).
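
To make the "FTS5 with a LIKE fallback" idea concrete, here is a minimal sketch using only the standard library. It is not the code in `knowledge/store.py`: the `chunks_fts` table name, the column set, and the `open_store` / `ingest` / `search` helpers are all assumptions made for illustration.

```python
import sqlite3


def open_store(path: str = ":memory:"):
    """Open the DB and report whether this SQLite build supports FTS5."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS chunks "
        "(id INTEGER PRIMARY KEY, content TEXT, domain TEXT, heading TEXT)"
    )
    try:
        db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5(content)")
        has_fts = True
    except sqlite3.OperationalError:
        has_fts = False  # FTS5 not compiled in; searches fall back to LIKE
    return db, has_fts


def ingest(db, has_fts: bool, content: str, domain: str, heading=None) -> None:
    cur = db.execute(
        "INSERT INTO chunks (content, domain, heading) VALUES (?, ?, ?)",
        (content, domain, heading),
    )
    if has_fts:
        # Keep the full-text index aligned with the chunks row id.
        db.execute(
            "INSERT INTO chunks_fts (rowid, content) VALUES (?, ?)",
            (cur.lastrowid, content),
        )
    db.commit()


def search(db, has_fts: bool, term: str, top_k: int = 5) -> list:
    if has_fts:
        phrase = '"' + term.replace('"', '""') + '"'  # quote as an FTS5 phrase
        sql = (
            "SELECT content FROM chunks WHERE id IN "
            "(SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ?) LIMIT ?"
        )
        rows = db.execute(sql, (phrase, top_k))
    else:
        rows = db.execute(
            "SELECT content FROM chunks WHERE content LIKE ? LIMIT ?",
            (f"%{term}%", top_k),
        )
    return [r[0] for r in rows]


db, has_fts = open_store()
ingest(db, has_fts, "deployed v2 to staging", "daily_log")
ingest(db, has_fts, "operator prefers metric units", "context")
print(search(db, has_fts, "staging"))
```

The two paths are not equivalent: FTS5 matches whole tokens while `LIKE` matches substrings, so results can differ for partial words.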