protoLabsAI · mabry1985 · Apr 27, 2026
diff --git a/README.md b/README.md
@@ -30,7 +30,9 @@ rename / release-pipeline wiring.
 | Agent runtime | `graph/agent.py`, `server.py` | LangGraph `create_agent()` wired to the A2A handler, with streaming token capture for cost-v1 |
 | LLM gateway | `graph/llm.py` | OpenAI-compatible client pointed at LiteLLM — swap models by editing the gateway config, not the fork |
 | Subagents | `graph/subagents/config.py` | DeerFlow-pattern delegation via a `task()` tool; one placeholder `worker` ships |
-| Starter tools | `tools/lg_tools.py` | Free, keyless tools so a fresh fork can demo real behaviour: `echo`, `current_time`, `calculator` (safe AST eval), `web_search` (DuckDuckGo), `fetch_url` |
+| Starter tools | `tools/lg_tools.py` | Keyless general tools (`current_time`, `calculator` safe AST eval, `web_search` via DuckDuckGo, `fetch_url`) plus memory tools (`memory_ingest`, `memory_recall`, `memory_list`, `memory_stats`, `daily_log`) bound to the bundled store |
+| Knowledge store | `knowledge/store.py` | sqlite + FTS5 (LIKE fallback). One `chunks` table for operator notes, daily-log entries, and conversation findings. Default-on; turn off with `middleware.knowledge: false` |
+| Eval harness | `evals/` | Side-effect-verified A2A test harness — audit log + reply text + KB state. `python -m evals.runner` against a running agent. See [Eval your fork](./docs/guides/evals.md) |
 | Tracing | `tracing.py` | Langfuse trace_session with distributed `a2a.trace` propagation and the OTel cross-context-detach filter |
 | Observability | `metrics.py`, `audit.py` | Prometheus metrics with per-agent prefix, JSONL audit log with trace IDs |
 | Output protocol | `graph/output_format.py` | `<scratch_pad>` / `<output>` parsing so the model can think without it leaking to users |
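
Note on the `calculator` row: the README describes it as a "safe AST eval". The sketch below shows one common way to do that in Python, assuming a whitelist of arithmetic operators. The names `safe_eval` and `_eval_node` are hypothetical and are not taken from `tools/lg_tools.py`; the shipped tool may differ.

```python
import ast
import operator

# Whitelisted operators (an assumption for this sketch); anything else is rejected.
# Exponentiation is left out on purpose because inputs like 9**9**9 are expensive.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed node, allowing only numbers and whitelisted ops."""
    if isinstance(node, ast.Constant):
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_eval(expression: str):
    """Evaluate an arithmetic expression by walking its AST, never calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)
```

Function calls, attribute access, and names all parse to node types outside the whitelist, so an input such as `__import__('os')` raises `ValueError` instead of executing.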

diff --git a/TEMPLATE.md b/TEMPLATE.md
@@ -72,10 +72,10 @@ handler's output extraction depends on it.
 ## 4. Add your real tools
 
 `tools/lg_tools.py` ships with a small keyless starter set so a
-fresh clone can demonstrate a real research loop: `echo`,
-`current_time`, `calculator` (safe AST eval — no `eval()`),
-`web_search` (DuckDuckGo via `ddgs`), and `fetch_url`. Keep the
-ones you want, drop the rest, and add your own:
+fresh clone can demonstrate a real research loop: `current_time`,
+`calculator` (safe AST eval — no `eval()`), `web_search` (DuckDuckGo
+via `ddgs`), and `fetch_url`. Keep the ones you want, drop the rest,
+and add your own:
 
 ```python
 from langchain_core.tools import tool
@@ -167,6 +167,19 @@ your fork. A useful pattern:
 - Extend `tests/test_a2a_integration.py` with assertions for
   your declared skills + extensions on the agent card
 
+For end-to-end behaviour testing — "when the operator asks X, does
+the right tool actually fire and the right row land in the KB?" —
+the template ships an eval harness under `evals/`:
+
+```bash
+python -m evals.runner             # against a running agent
+python -m evals.runner --category tool
+```
+
+See [Eval your fork](./docs/guides/evals.md) for what each case
+asserts, how the three assertion channels work, and how to add
+cases for your fork's new tools.
+
 ## 9a. Understand the skill loop
 
 protoAgent's skill loop lets your agent learn from experience automatically.
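
Note on the eval paragraph added in this hunk: the question "does the right tool actually fire and the right row land in the KB?" can be sketched roughly as below. This is an illustration under stated assumptions, not the harness code: the `tool` key in each audit entry and the helper names are hypothetical, and the `chunks` table is assumed to expose `content` and `domain` columns.

```python
from __future__ import annotations

import json
import sqlite3
from typing import Iterable


def tools_fired(audit_lines: Iterable[str]) -> set[str]:
    """Collect tool names from JSONL audit lines (the "tool" key is an assumption)."""
    fired = set()
    for line in audit_lines:
        if line.strip():
            entry = json.loads(line)
            if "tool" in entry:
                fired.add(entry["tool"])
    return fired


def chunk_exists(conn: sqlite3.Connection, marker: str, domain: str) -> bool:
    """Return True when a chunk in `domain` contains `marker` (assumed schema)."""
    row = conn.execute(
        "SELECT 1 FROM chunks WHERE domain = ? AND instr(content, ?) > 0 LIMIT 1",
        (domain, marker),
    ).fetchone()
    return row is not None


def case_passes(audit_lines, conn, reply, expected_tools, expected_patterns, marker, domain):
    """A case passes only when all three channels agree."""
    return (
        set(expected_tools) <= tools_fired(audit_lines)      # (1) audit log
        and all(p in reply for p in expected_patterns)       # (2) reply text
        and chunk_exists(conn, marker, domain)               # (3) KB side effect
    )
```

In practice `audit_lines` could be an open file handle on the audit log. The point of the third check is that a reply containing "Logged: ..." still fails the case when no matching row exists.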

diff --git a/config/langgraph-config.yaml b/config/langgraph-config.yaml
@@ -22,14 +22,24 @@ model:
 subagents:
   worker:
     enabled: true
-    tools: [echo, current_time, calculator, web_search, fetch_url]
+    tools:
+      - current_time
+      - calculator
+      - web_search
+      - fetch_url
+      - memory_ingest
+      - memory_recall
+      - memory_list
+      - memory_stats
+      - daily_log
     max_turns: 20
 
 middleware:
-  # The knowledge middleware requires a knowledge store. Leave false
-  # until you add one. Memory persistence is enabled by default and
-  # writes session summaries to /sandbox/memory/ without a store.
-  knowledge: false
+  # All three middlewares default ON. The knowledge middleware needs a
+  # store; the template constructs one automatically (see
+  # ``server.py::_build_knowledge_store``). Set ``knowledge: false`` if
+  # your fork is purely stateless.
+  knowledge: true
   audit: true
   memory: true
 

diff --git a/docs/guides/customize-and-deploy.md b/docs/guides/customize-and-deploy.md
@@ -66,7 +66,7 @@ Replace with the skills your agent actually advertises over A2A. The `name` and
 
 ## 5. (Optional) Add domain tools
 
-`tools/lg_tools.py` ships with `echo`, `current_time`, `calculator`, `web_search`, `fetch_url`. Keep the ones you want, drop the rest, add your own. Update `get_all_tools()` at the bottom. Any tool returned from there becomes a checkbox in the wizard and drawer automatically.
+`tools/lg_tools.py` ships with `current_time`, `calculator`, `web_search`, `fetch_url`. Keep the ones you want, drop the rest, add your own. Update `get_all_tools()` at the bottom. Any tool returned from there becomes a checkbox in the wizard and drawer automatically.
 
 ## 6. (Optional) Configure subagents
 

diff --git a/docs/guides/evals.md b/docs/guides/evals.md
@@ -0,0 +1,151 @@
+# Eval your fork
+
+The template ships an eval harness under `evals/` so a fresh fork has
+a working test suite for its tools, memory, and A2A protocol surface
+on day one. Cases assert across three independent channels — audit
+log, reply text, and knowledge-store side effects — so a model that
+hallucinates a tool result still gets caught.
+
+## When to read this
+
+- You forked the template and want a baseline pass-rate before you
+  ship.
+- You added a new tool and want to lock in its intent — "when the
+  operator says X, fire tool Y".
+- You changed a prompt or model and want to measure regression.
+
+## Run the suite
+
+```bash
+# Agent running at $EVAL_BASE_URL (default http://localhost:7870)
+# with the relevant auth env (A2A_AUTH_TOKEN and/or <AGENT>_API_KEY).
+
+python -m evals.runner
+python -m evals.runner --category tool
+python -m evals.runner --tasks current_time_intent,daily_log_intent
+```
+
+Reports land in `evals/results/run-<ts>.json`. The CLI prints a
+pass/fail board; the JSON report carries reply previews and timing
+for post-hoc inspection.
+
+## The three assertion channels
+
+```
+prompt → A2A → audit log         (1) tools fired with expected outcome
+            → reply text         (2) substrings present in reply
+            → KB chunks table    (3) side effects landed correctly
+```
+
+A case passes only when every configured assertion holds. Most cases
+should opt in to channels 1 and 3 — text patterns alone are brittle
+to model paraphrasing and miss hallucinated tool results entirely.
+
+### Why side-effect verification beats text-only
+
+A model can produce "Logged: ..." in its reply without actually
+calling `daily_log`. Substring matching passes, the DB stays empty,
+and the bug ships. Reading `audit.jsonl` and the `chunks` table
+afterward catches it.
+
+## The shape of a case
+
+```json
+{
+  "id": "unique-id",
+  "category": "tool",
+  "kind": "ask",
+  "name": "Asks for arithmetic → calculator",
+  "prompt": "How much is 17 times 23, plus 1?",
+  "expected_tools": ["calculator"],
+  "expected_patterns": ["392"],
+  "verify_kb": {
+    "find_chunk_containing": "EVAL-MARK-XYZ",
+    "domain": "context"
+  },
+  "setup":    [{"kb_ingest": {"content": "...", "domain": "...", "heading": "..."}}],
+  "teardown": [{"kb_delete_by_content": {"contains": "..."}}]
+}
+```
+
+Three case `kind`s ship:
+
+- `agent_card` — fetch `/.well-known/agent-card.json` and assert on
+  the card's name, skill count, and declared extensions.
+- `auth_check` — send a request with a deliberately bad bearer and
+  assert the server returns the expected status (401 by default).
+- `ask` — the main shape. Sends `prompt`, then asserts on tool firing,
+  reply patterns, and KB state.
+
+## Prompt rule
+
+**The tool name never appears in the prompt.** Every prompt must be
+plausibly typed by a real user. "Use `daily_log` to record..." tests
+instruction-following, not tool selection. If the agent needs to
+infer the tool from intent, that *is* the test.
+
+## Setup and teardown — start clean every time
+
+Each `ask` case can pre-seed state via `setup` blocks (BFCL's
+`initial_config` pattern: direct DB writes the model never sees) and
+clean up after itself with `teardown`. The fixture is invisible to
+the agent — it discovers the seeded state via tools, exactly as a
+real user would.
+
+`teardown` runs even when assertions fail, so case order doesn't
+matter and a noisy failure doesn't poison the next run.
+
+Supported setup/teardown step kinds (extend `evals/verify.py` to add
+more):
+
+| Step kind | Args | What it does |
+|---|---|---|
+| `kb_ingest` | `content`, `domain`, `heading?` | Insert a chunk |
+| `kb_delete_by_content` | `contains` | Delete chunks where content LIKE `%contains%` |
+| `kb_delete_by_heading` | `domain`, `heading` | Delete chunks matching (domain, heading) |
+
+## What forks should test by default
+
+The starter `tasks.json` covers:
+
+- Agent card discovery (name, skill count, `cost-v1` extension)
+- Bearer auth gating
+- Each shipped tool fires from a plausible operator prompt
+- Memory ingest → recall round-trip
+- KB-driven middleware injection (no tool call needed)
+- A chained two-tool case (`daily_log` then `memory_recall`)
+
+When you add a tool, add at least one case for it. When you add a
+skill to the agent card, extend the `card_discovery` case to assert
+the new skill is advertised.
+
+## Running in CI
+
+The runner exits non-zero when any case fails, so it drops in cleanly:
+
+```yaml
+- name: Boot agent
+  run: docker compose up -d agent
+
+- name: Wait for the agent card
+  run: ./scripts/wait-for-it.sh http://localhost:7870/.well-known/agent-card.json
+
+- name: Run evals
+  run: python -m evals.runner
+  env:
+    EVAL_BASE_URL: http://localhost:7870
+    A2A_AUTH_TOKEN: ${{ secrets.AGENT_BEARER }}
+```
+
+For non-deterministic categories (any `tool` or `chained` case), aim
+for an N-of-M majority threshold rather than 100% — the reference
+implementation runs 3 attempts and gates at 2 passes for those
+categories. Deterministic ones (`a2a-protocol`, `subsystem` with
+seeded state) gate at 100%.
+
+## References
+
+- [`evals/README.md`](https://github.com/protoLabsAI/protoAgent/blob/main/evals/README.md) — quick reference for case authors
+- Anthropic — [Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
+- BFCL V3 — [Multi-Turn](https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html)
+- [ToolSandbox](https://arxiv.org/html/2408.04682v1) — user simulator + milestones / minefields
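
Note on `docs/guides/evals.md`: two behaviours described above are that `teardown` runs even when assertions fail, and that non-deterministic categories gate on an N-of-M majority (3 attempts, 2 passes). A minimal sketch of how those could fit together follows. `run_attempt` and `passes_gate` are hypothetical names, and the callables stand in for whatever `evals/runner.py` actually does.

```python
from typing import Callable


def run_attempt(
    setup: Callable[[], None],
    check: Callable[[], bool],
    teardown: Callable[[], None],
) -> bool:
    """Run one attempt; teardown executes whether check passes, fails, or raises."""
    setup()
    try:
        return bool(check())
    except AssertionError:
        # A failed assertion counts as a failed attempt, not a crashed run.
        return False
    finally:
        teardown()


def passes_gate(
    setup: Callable[[], None],
    check: Callable[[], bool],
    teardown: Callable[[], None],
    attempts: int = 3,
    required: int = 2,
) -> bool:
    """N-of-M gate: the case passes when at least `required` attempts succeed."""
    passes = sum(run_attempt(setup, check, teardown) for _ in range(attempts))
    return passes >= required
```

For deterministic categories the same gate would be called with `attempts=1, required=1`, which is equivalent to gating at 100%.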
diff --git a/docs/guides/fork-the-template.md b/docs/guides/fork-the-template.md
@@ -43,7 +43,7 @@ Keep the `<scratch_pad>` / `<output>` protocol block in `prompts.py` — the A2A
 
 ## 4. Replace the starter tools
 
-`tools/lg_tools.py` ships with `echo`, `current_time`, `calculator`, `web_search`, `fetch_url`. Keep what you want, drop the rest, add your own. Update `get_all_tools()` at the bottom of the file.
+`tools/lg_tools.py` ships with `current_time`, `calculator`, `web_search`, `fetch_url`. Keep what you want, drop the rest, add your own. Update `get_all_tools()` at the bottom of the file.
 
 See the [starter tools reference](/reference/starter-tools) for the shapes of the shipped ones.
 

diff --git a/docs/guides/index.md b/docs/guides/index.md
@@ -9,4 +9,5 @@ Task-oriented procedures. Assumes you already have a running agent (see [Tutoria
 | [Add a custom skill](/guides/add-a-skill) | Your agent does new things and callers need to dispatch to them |
 | [Configure subagents](/guides/subagents) | You want specialized delegates beyond the placeholder `worker` |
 | [Wire Langfuse + Prometheus](/guides/observability) | You need traces and metrics in production |
+| [Eval your fork](/guides/evals) | You want a baseline pass-rate for the tools / memory / A2A surface in your fork |
 | [Deploy via GHCR](/guides/deploy) | You're ready to ship and want auto-deploy wired up |
diff --git a/docs/guides/subagents.md b/docs/guides/subagents.md
@@ -56,7 +56,11 @@ The template's `LangGraphConfig` (in `graph/config.py`) has a `worker` field. Ad
 class LangGraphConfig:
     # ... existing fields ...
     worker: SubagentDef = field(default_factory=lambda: SubagentDef(
-        tools=["echo", "current_time", "calculator", "web_search", "fetch_url"],
+        tools=[
+            "current_time", "calculator", "web_search", "fetch_url",
+            "memory_ingest", "memory_recall", "memory_list", "memory_stats",
+            "daily_log",
+        ],
         max_turns=20,
     ))
     researcher: SubagentDef = field(default_factory=lambda: SubagentDef(
@@ -86,7 +90,16 @@ for name in ("worker", "researcher"):  # ← add new names
 subagents:
   worker:
     enabled: true
-    tools: [echo, current_time, calculator, web_search, fetch_url]
+    tools:
+      - current_time
+      - calculator
+      - web_search
+      - fetch_url
+      - memory_ingest
+      - memory_recall
+      - memory_list
+      - memory_stats
+      - daily_log
     max_turns: 20
   researcher:
     enabled: true
@@ -117,8 +130,8 @@ If your agent is simple enough that subagents are pure overhead, flip `include_s
 ```python
 _graph = create_agent_graph(
     _graph_config,
-    knowledge_store=None,
-    include_subagents=False,   # ← skip the task() tool and subagent machinery
+    knowledge_store=knowledge_store,  # keep the bundled store wired up
+    include_subagents=False,           # ← skip the task() tool and subagent machinery
 )
 ```
 

diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
@@ -17,13 +17,22 @@ model:
 subagents:
   worker:
     enabled: true
-    tools: [echo, current_time, calculator, web_search, fetch_url]
+    tools:
+      - current_time
+      - calculator
+      - web_search
+      - fetch_url
+      - memory_ingest
+      - memory_recall
+      - memory_list
+      - memory_stats
+      - daily_log
     max_turns: 20
 
 middleware:
-  knowledge: false
+  knowledge: true
   audit: true
-  memory: false
+  memory: true
 
 knowledge:
   db_path: /sandbox/knowledge/agent.db
@@ -59,18 +68,18 @@ Adding a new subagent name to the YAML requires matching entries in `graph/subag
 
 | Key | Default | What |
 |---|---|---|
-| `knowledge` | `false` | Inject retrieved knowledge into state before LLM calls. Requires a knowledge store — leave off until you add one. |
+| `knowledge` | `true` | Inject retrieved knowledge into state before LLM calls. Backed by the bundled `KnowledgeStore` (sqlite + FTS5). Set `false` for a stateless agent. |
 | `audit` | `true` | Append every tool call to `/sandbox/audit/audit.jsonl`. |
-| `memory` | `false` | Memory middleware (experimental). Requires a knowledge store. |
+| `memory` | `true` | Persist a session summary on the terminal turn and asynchronously index conversation findings under `domain='finding'`. |
 
 ## `knowledge`
 
 Only read when `middleware.knowledge` is `true`.
 
 | Key | Default | What |
 |---|---|---|
-| `db_path` | `/sandbox/knowledge/agent.db` | SQLite file path. |
-| `embed_model` | `nomic-embed-text` | Embedding model. |
+| `db_path` | `/sandbox/knowledge/agent.db` | SQLite file path. Falls back to `~/.protoagent/knowledge/agent.db` automatically when the configured path isn't writable (e.g. running locally without `/sandbox`). Override at runtime with `KNOWLEDGE_DB_PATH`. |
+| `embed_model` | `nomic-embed-text` | Reserved for forks that bolt embeddings on top of the FTS5 baseline. The bundled store ignores it. |
 | `top_k` | `5` | Results per query fed into state. |
 
-The template does not ship a knowledge store — the config keys are kept so a fork can flip the switch without rewiring every call site.
+The bundled store is sqlite + FTS5 (with an automatic LIKE fallback when FTS5 isn't available). One `chunks` table; the `domain` column distinguishes operator-set notes (`memory_ingest`), daily-log entries (`daily_log`), and conversation findings extracted by `MemoryMiddleware` (`domain='finding'`).
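
Note on the store description above: the "FTS5 with an automatic LIKE fallback" pattern can be sketched with the standard `sqlite3` module as shown below. This is an assumption-laden illustration, not the contents of `knowledge/store.py`: the column set, the `chunks_fts` table name, and the function names are all hypothetical.

```python
from __future__ import annotations

import sqlite3


def open_store(path: str = ":memory:") -> tuple[sqlite3.Connection, bool]:
    """Open the DB and report whether FTS5 is usable in this SQLite build."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, domain TEXT NOT NULL, heading TEXT, content TEXT NOT NULL)"
    )
    try:
        conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5(content)")
        has_fts = True
    except sqlite3.OperationalError:
        has_fts = False  # FTS5 not compiled in; queries fall back to LIKE
    return conn, has_fts


def ingest(conn, has_fts: bool, content: str, domain: str, heading: str | None = None) -> int:
    """Insert one chunk and mirror it into the FTS index when available."""
    cur = conn.execute(
        "INSERT INTO chunks (domain, heading, content) VALUES (?, ?, ?)",
        (domain, heading, content),
    )
    if has_fts:
        conn.execute(
            "INSERT INTO chunks_fts (rowid, content) VALUES (?, ?)",
            (cur.lastrowid, content),
        )
    conn.commit()
    return cur.lastrowid


def recall(conn, has_fts: bool, query: str, top_k: int = 5) -> list[str]:
    """Return up to top_k chunk contents matching the query."""
    if has_fts:
        # Quote the query as a single phrase so punctuation is not parsed as FTS syntax.
        phrase = '"' + query.replace('"', '""') + '"'
        rows = conn.execute(
            "SELECT chunks.content FROM chunks_fts "
            "JOIN chunks ON chunks.id = chunks_fts.rowid "
            "WHERE chunks_fts MATCH ? ORDER BY rank LIMIT ?",
            (phrase, top_k),
        )
    else:
        # Escape LIKE wildcards so the query is treated as a literal substring.
        literal = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
        rows = conn.execute(
            "SELECT content FROM chunks WHERE content LIKE ? ESCAPE '\\' LIMIT ?",
            ("%" + literal + "%", top_k),
        )
    return [row[0] for row in rows]
```

Probing for FTS5 once at open time keeps the query path simple: callers pass the same arguments either way, and only ranking quality differs between the two branches.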