Skip to content

feat: ship default knowledge store + eval harness for forks#155

Merged
mabry1985 merged 6 commits into
mainfrom
feat/default-kb-and-eval-harness
Apr 27, 2026
Merged

feat: ship default knowledge store + eval harness for forks#155
mabry1985 merged 6 commits into
mainfrom
feat/default-kb-and-eval-harness

Conversation

@mabry1985
Copy link
Copy Markdown
Contributor

@mabry1985 mabry1985 commented Apr 27, 2026

Closes #154.

Summary

  • Default knowledge storeknowledge/store.py (sqlite + FTS5, LIKE fallback). One chunks table backs operator notes, daily-log entries, and conversation findings extracted by MemoryMiddleware. Default-on; middleware.knowledge: false to opt out.
  • Five memory tools in tools/lg_tools.pymemory_ingest, memory_recall, memory_list, memory_stats, daily_log — bound to the bundled store. echo removed.
  • Eval harness under evals/ — A2A client + runner + side-effect-verified tasks.json (15 starter cases: agent card, auth, abstention, every shipped tool, KB recall, chained, KnowledgeMiddleware). Audit log + reply text + KB state all checked independently — hallucinated tool results get caught.
  • Docsdocs/guides/evals.md how-to + every reference page updated to match the new defaults / nine-tool starter set.

Why side-effect verification

When the model produces "Logged: ..." without actually calling daily_log, text-only checks pass while the DB stays empty. The harness reads audit.jsonl and the chunks table after every turn so that bug doesn't ship.

Default flip — what existing forks will notice

knowledge_middleware and memory_middleware now default to true in LangGraphConfig. On first run:

  • The store opens at /sandbox/knowledge/agent.db (Docker default).
  • If that path is read-only or absent (running locally outside the container), it falls back to ~/.protoagent/knowledge/agent.db automatically — no permission errors.
  • Forks that want a stateless agent can flip middleware.knowledge: false in config/langgraph-config.yaml.

Test plan

  • python -m evals.runner against a wizard-set-up local agent — expect ≥13/15 cases passing on a small/fast model (text-pattern noise on a couple).
  • python -m evals.runner --category a2a-protocol — agent card + auth gating gate at 100% (deterministic).
  • Existing test suite — pytest tests/test_starter_tools.py tests/test_config_io.py tests/test_skill_emission.py tests/test_a2a_handler.py — should be all green (was 202/203 with one unrelated origin test failing on main).
  • Drawer Save & Reload — confirm the rebuilt graph still has the memory tools wired up (server.py reload path also constructs a store).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added persistent knowledge store with full‑text search, new memory tools (ingest, recall, list, stats) and daily logging; new evaluation harness and runner to validate agent/tool/KB behavior.
  • Configuration

    • Knowledge and memory middleware enabled by default; worker subagent default tool list updated and the simple "echo" starter tool removed.
  • Documentation

    • Expanded guides and references covering evals, configuration, starter tools, and tutorials.
  • Tests

    • Updated tests to reflect the new starter-tools set and tool-tracking changes.

The template now ships a working memory loop and end-to-end eval suite
on day one so forks have a green baseline before they touch a single
line of code. Closes #154.

What lands:

- knowledge/store.py — sqlite + FTS5 (LIKE fallback). One ``chunks``
  table backs operator notes (memory_ingest), daily-log entries
  (daily_log), and conversation findings extracted by
  MemoryMiddleware (domain='finding'). Path resolves env >
  config > default with an automatic ~/.protoagent/ fallback when
  /sandbox isn't writable.

- tools/lg_tools.py — five new memory tools (memory_ingest,
  memory_recall, memory_list, memory_stats, daily_log) bound to the
  store via a closure factory so tests get a fresh store per run.
  ``echo`` removed; ``get_all_tools(knowledge_store)`` actually uses
  its parameter now.

- server.py — _build_knowledge_store() constructs the store and
  threads it through both initial init and the drawer reload path.
  Defaults flipped: knowledge_middleware + memory_middleware now
  ON by default (config/langgraph-config.yaml + graph/config.py).

- evals/ — A2A client + runner + verify helpers + 15 starter cases
  (tasks.json) covering agent card discovery, bearer auth gating,
  abstention, every shipped tool, KB recall, a chained two-tool
  case, and KnowledgeMiddleware injection. Side-effect-verified:
  audit log + reply text + KB chunks all checked independently so
  hallucinated tool results get caught.

- docs/guides/evals.md — full how-to. README/TEMPLATE/configuration/
  starter-tools/first-agent updated to reflect the new defaults and
  the additional five memory tools.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaces the inline echo starter tool with five KnowledgeStore-bound memory tools plus daily_log, enables knowledge and memory middleware by default, adds a sqlite+FTS5 KnowledgeStore, wires it into server startup, and ships an eval harness (client, runner, verify, tasks, README) with docs and test updates.

Changes

Cohort / File(s) Summary
Knowledge Store
knowledge/__init__.py, knowledge/store.py
Add SQLite-backed KnowledgeStore and Chunk dataclass with FTS5 probe/fallback, search/list/stats/find/delete APIs, DB path resolution/fallback, safe error handling, and chunk/finding write APIs.
Memory Tools & Registry
tools/lg_tools.py
Remove echo; add _build_memory_tools(knowledge_store) producing memory_ingest, memory_recall, memory_list, memory_stats, daily_log; update get_all_tools(knowledge_store=None) to include memory tools only when a store is provided.
Server Integration & Config
server.py, graph/config.py, graph/subagents/config.py, config/langgraph-config.yaml
Enable knowledge/memory middleware by default; expand worker subagent tool allowlist to include memory/daily_log tools; add _build_knowledge_store(config) and pass knowledge_store into graph creation/reload; update inline docs and defaults.
Eval Harness
evals/client.py, evals/runner.py, evals/verify.py, evals/tasks.json, evals/README.md
Add async AgentClient/TaskResult, runner executing tasks.json kinds (agent_card, auth_check, ask, stream), verification helpers for audit/KB checks, setup/teardown, 13 starter tasks, and README describing usage and CI guidance.
Docs & Guides
docs/guides/evals.md, docs/guides/subagents.md, docs/guides/*, docs/reference/*, docs/tutorials/*
Add eval how‑to guide and update guides, tutorials, and reference docs to reflect memory tools, knowledge middleware defaults, updated starter-tools API (get_all_tools(knowledge_store)), and config examples.
Tests
tests/test_config_io.py, tests/test_skill_curator.py, tests/test_skill_emission.py, tests/test_starter_tools.py
Adjust fixtures and assertions to remove/replace echo with current_time, add expectations for web_search/fetch_url, and remove the echo-specific starter-tools test.
Starter tooling docs
docs/reference/starter-tools.md
Document new memory tools, update get_all_tools(knowledge_store) example, describe behavior when knowledge_store is disabled, and add examples for each memory tool.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Eval Runner
    participant Client as AgentClient
    participant Agent as Agent (A2A)
    participant Audit as Audit Log
    participant KB as Knowledge Store

    rect rgba(100, 150, 200, 0.5)
    Note over Runner,Client: Case setup & execution
    end

    Runner->>Client: ask(prompt, timeout_s=90)
    Client->>Agent: JSON-RPC /message/send
    Agent->>Audit: emit audit entries (tool calls)
    Agent->>KB: memory_* / daily_log operations
    Agent-->>Client: final TaskResult / status

    rect rgba(100, 200, 150, 0.5)
    Note over Runner,KB: Assertion validation
    end

    Runner->>Audit: audit_entries_since(ts)
    Audit-->>Runner: tool firing records
    Runner->>KB: search / find_chunk_containing / list_chunks
    KB-->>Runner: stored facts & side effects
    Runner->>Client: parse TaskResult.text
    Client-->>Runner: extracted reply text

    rect rgba(200, 150, 100, 0.5)
    Note over Runner,KB: Teardown
    end

    Runner->>KB: delete_by_content / delete_by_heading
    KB-->>Runner: deletion confirmations
Loading
sequenceDiagram
    participant Tool as Memory Tool
    participant KStore as KnowledgeStore
    participant DB as SQLite + FTS5

    rect rgba(150, 150, 200, 0.5)
    Note over Tool,DB: Tool invocation & persistence
    end

    Tool->>KStore: add_chunk(content, domain, heading)
    KStore->>DB: INSERT into chunks
    KStore->>DB: maintain chunks_fts via triggers (if FTS5)
    DB-->>KStore: row id | OperationalError -> None
    KStore-->>Tool: chunk_id | None

    rect rgba(150, 200, 150, 0.5)
    Note over Tool,DB: Retrieval with FTS
    end

    Tool->>KStore: search(query, domain)
    KStore->>DB: SELECT via FTS5 MATCH (or LIKE fallback)
    DB-->>KStore: scored results
    KStore-->>Tool: standardized result dicts

    rect rgba(200, 150, 150, 0.5)
    Note over Tool,DB: Cleanup
    end

    Tool->>KStore: delete_by_content / delete_by_heading
    KStore->>DB: DELETE + cascade FTS triggers
    DB-->>KStore: affected row count
    KStore-->>Tool: count
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changeset: shipping a default knowledge store and eval harness.
Description check ✅ Passed The description covers the Summary, Closes reference, and detailed Test plan. It explains what changed (knowledge store, memory tools, eval harness, docs), why (side-effect verification), and what users will notice (default middleware flip). All template sections are addressed.
Linked Issues check ✅ Passed The PR fully implements the objectives from issue #154: ships a how-to guide (docs/guides/evals.md), three-channel assertion harness (audit-log, reply text, KB state), reference layout (evals/{client,runner,verify}.py + tasks.json), and 15 starter cases covering agent card, auth, tools, chained calls, and KnowledgeMiddleware.
Out of Scope Changes check ✅ Passed All changes align with the stated objectives: knowledge store implementation, memory tools, eval harness, documentation updates, and test updates. No extraneous changes detected; even the middleware default flip is part of the intended design to enable the new features by default.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/default-kb-and-eval-harness

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/test_config_io.py (1)

321-331: 🧹 Nitpick | 🔵 Trivial

Tighten starter-tool contract coverage in this test.

Line 328–331 now verifies only part of the core starter set; a regression dropping web_search would no longer fail this test. Consider explicitly asserting it too.

✅ Small test hardening diff
 def test_list_available_tools_returns_starter_set():
@@
     assert "current_time" in names
     assert "calculator" in names
+    assert "web_search" in names
     assert "fetch_url" in names
     assert all(isinstance(n, str) for n in names)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_config_io.py` around lines 321 - 331, The test
test_list_available_tools_returns_starter_set currently asserts presence of
"current_time", "calculator", and "fetch_url" but misses "web_search"; update
the test in tests/test_config_io.py to also assert "web_search" is in the names
returned by list_available_tools (function list_available_tools in
graph.config_io), keeping the existing type check (all elements are str) so the
starter-tool contract is fully covered.
docs/reference/starter-tools.md (1)

183-188: ⚠️ Potential issue | 🟡 Minor

Preserve the bundled memory tools in the extension example.

Copying this snippet verbatim replaces the new conditional registry and drops all KnowledgeStore-backed tools when a store is configured. Show my_tool being appended to the existing pattern instead of returning a fresh list.

Suggested doc fix
 def get_all_tools(knowledge_store=None):
-    return [current_time, calculator, web_search, fetch_url, my_tool]
+    tools = [current_time, calculator, web_search, fetch_url, my_tool]
+    if knowledge_store is not None:
+        tools.extend(_build_memory_tools(knowledge_store))
+    return tools
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/reference/starter-tools.md` around lines 183 - 188, The snippet replaces
the existing conditional registry and drops KnowledgeStore-backed tools; update
get_all_tools to obtain the existing tools list (keeping bundled tools like
current_time, calculator, web_search, fetch_url and any KnowledgeStore-backed
tools when knowledge_store is provided) and then append my_tool instead of
returning a fresh list. Concretely, inside get_all_tools(knowledge_store=None)
call or build the existing tools list used elsewhere (respecting the conditional
registry when knowledge_store is set), then tools.append(my_tool) and return
tools; reference get_all_tools, my_tool, and the knowledge_store parameter when
making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/guides/subagents.md`:
- Around line 89-99: The docs updated the tools allowlist (current_time,
calculator, web_search, fetch_url, memory_ingest, memory_recall, memory_list,
memory_stats, daily_log) but the Step 2 LangGraphConfig snippet still includes
the legacy "echo" tool, causing a mismatch; update the LangGraphConfig example
to replace or remove "echo" so it exactly matches the new tools list and
max_turns setting (e.g., ensure the LangGraphConfig.tools array contains the
same tool names and that LangGraphConfig.max_turns is set to 20) so both
snippets are consistent.

In `@docs/tutorials/first-tool.md`:
- Around line 53-56: The example redefines WORKER_CONFIG.tools and accidentally
omits the default knowledge-backed tools; instead of replacing the whole
allowlist in the SubagentConfig example, update the snippet to append "git_sha"
to the existing default tools (or explicitly include the default entries:
"memory_ingest", "memory_recall", "memory_list", "memory_stats", "daily_log"
alongside "current_time", "calculator", "web_search", "fetch_url", "git_sha") so
the example matches the template behavior; locate WORKER_CONFIG and
SubagentConfig in the doc and change only the tools list accordingly.

In `@evals/runner.py`:
- Around line 205-209: The _RUNNERS dispatch dict only maps "agent_card",
"auth_check", and "ask" (functions _run_agent_card, _run_auth_check, _run_ask)
so evals/client.py::AgentClient.stream() is never reachable from tasks.json; add
a runner entry that invokes the streaming path (e.g., "agent_stream":
_run_agent_stream) or modify an existing runner (e.g., enhance _run_agent_card)
to detect streaming tasks and call AgentClient.stream(), then register that new
function in _RUNNERS and ensure tasks.json uses the new key so A2A
streaming/push-notification cases are testable.
- Around line 133-149: The teardown must run in a finally block so it always
executes even if client.ask or the verifier raises: wrap the call sequence in
run_one so that after verify.apply_setup(case["setup"]) you set a flag (e.g.,
setup_applied = True) and perform client.ask / any verifier calls inside try,
and then in finally check if setup_applied and "teardown" in case and call
verify.apply_teardown(case["teardown"]); apply the same pattern for the later
teardown usage around the code near the 185-186 region to ensure seeded KB rows
are always cleaned up.
- Around line 158-168: The code currently uses expected_tools =
case.get("expected_tools") or [] which conflates "no key" and "empty list";
change to expected_tools = case.get("expected_tools", None) and then run the
audit check only when expected_tools is not None so that an explicit empty list
asserts that no tool entries were emitted. Keep the rest of the logic (await
asyncio.sleep, entries = verify.audit_entries_since(since), require_success
calculation, and calling verify.assert_tools_fired) but wrap it in if
expected_tools is not None: ... so assert_tools_fired is invoked for [] as well
and add any failure detail to problems as before.

In `@knowledge/store.py`:
- Around line 394-423: The helpers must reject empty or whitespace-only match
strings before constructing LIKE queries: in find_chunk_containing (parameter
text) and in delete_by_content / any verifier helper using contains, call
strip() on the incoming string and if the result is empty immediately return
None (or no-op for delete) instead of building a "%...%" pattern; do this check
at the start of the functions (e.g., in find_chunk_containing, check if not text
or not text.strip() then return None) to prevent LIKE '%%' queries that match
everything.
- Around line 171-174: After installing the FTS schema, ensure existing rows in
the external-content table get indexed: when self._fts_available is true and
after db.executescript(_FTS_SCHEMA) completes, run an FTS5 rebuild for the
chunks_fts virtual table (i.e. perform the FTS5 "rebuild" insert into
chunks_fts) so preexisting rows in chunks are indexed; add this rebuild call
alongside the existing db.executescript usage and keep the guard using
_has_fts5(db)/self._fts_available.
- Around line 162-166: The PRAGMA journal_mode=WAL call in _connect currently
makes WAL mandatory and can raise sqlite3.OperationalError for read-only DBs;
modify _connect (or move the PRAGMA to the writable init path) so WAL is
best-effort: open the sqlite3.Connection and set row_factory as now, then
attempt db.execute("PRAGMA journal_mode=WAL") inside a try/except that catches
sqlite3.OperationalError (and logs/debugs if desired) but does not close or
return None on failure, so read-only connections remain usable; alternatively
ensure only the writable initialization path (e.g., the method that prepares a
writable DB) performs the WAL PRAGMA while _connect always returns a usable
connection for reads.

In `@tools/lg_tools.py`:
- Around line 312-329: Clamp tool inputs before calling the store: cap the
incoming k (and any limit arg in the other memory tools on lines ~332-348) to a
small positive max and enforce a minimum of 1, then pass the clamped value to
knowledge_store.search instead of the raw argument. Concretely, add a MAX_K
constant (e.g. 20) and replace uses of k with something like clamped_k = max(1,
min(k, MAX_K)) (and do the same for any limit parameter in the other memory_*
function(s)), then call knowledge_store.search(query, k=clamped_k) so the tool
cannot request arbitrarily large slices of the KB; apply the same clamping logic
to all memory_* entry points in this file.

---

Outside diff comments:
In `@docs/reference/starter-tools.md`:
- Around line 183-188: The snippet replaces the existing conditional registry
and drops KnowledgeStore-backed tools; update get_all_tools to obtain the
existing tools list (keeping bundled tools like current_time, calculator,
web_search, fetch_url and any KnowledgeStore-backed tools when knowledge_store
is provided) and then append my_tool instead of returning a fresh list.
Concretely, inside get_all_tools(knowledge_store=None) call or build the
existing tools list used elsewhere (respecting the conditional registry when
knowledge_store is set), then tools.append(my_tool) and return tools; reference
get_all_tools, my_tool, and the knowledge_store parameter when making the
change.

In `@tests/test_config_io.py`:
- Around line 321-331: The test test_list_available_tools_returns_starter_set
currently asserts presence of "current_time", "calculator", and "fetch_url" but
misses "web_search"; update the test in tests/test_config_io.py to also assert
"web_search" is in the names returned by list_available_tools (function
list_available_tools in graph.config_io), keeping the existing type check (all
elements are str) so the starter-tool contract is fully covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e658a3ad-718e-464a-a26c-f5dc1e6ba884

📥 Commits

Reviewing files that changed from the base of the PR and between 3c25c41 and 433b44e.

⛔ Files ignored due to path filters (2)
  • README.md is excluded by !*.md
  • TEMPLATE.md is excluded by !*.md
📒 Files selected for processing (27)
  • config/langgraph-config.yaml
  • docs/guides/customize-and-deploy.md
  • docs/guides/evals.md
  • docs/guides/fork-the-template.md
  • docs/guides/index.md
  • docs/guides/subagents.md
  • docs/reference/configuration.md
  • docs/reference/starter-tools.md
  • docs/tutorials/first-agent.md
  • docs/tutorials/first-tool.md
  • evals/README.md
  • evals/__init__.py
  • evals/client.py
  • evals/results/.gitkeep
  • evals/runner.py
  • evals/tasks.json
  • evals/verify.py
  • graph/config.py
  • graph/subagents/config.py
  • knowledge/__init__.py
  • knowledge/store.py
  • server.py
  • tests/test_config_io.py
  • tests/test_skill_curator.py
  • tests/test_skill_emission.py
  • tests/test_starter_tools.py
  • tools/lg_tools.py
💤 Files with no reviewable changes (1)
  • tests/test_starter_tools.py

Comment thread docs/guides/subagents.md
Comment thread docs/tutorials/first-tool.md
Comment thread evals/runner.py Outdated
Comment thread evals/runner.py Outdated
Comment thread evals/runner.py
Comment thread knowledge/store.py
Comment thread knowledge/store.py
Comment thread knowledge/store.py
Comment thread tools/lg_tools.py
Real bugs:
- evals/runner.py: teardown now runs in a finally block so seeded KB
  rows get cleaned up even when the verifier or client.ask() raises.
  expected_tools=[] now means "assert no tools fired" (was conflated
  with "no key" via the `or []` short-circuit, making the abstention
  case a no-op).
- evals/runner.py + tasks.json: added a `stream` runner kind so
  AgentClient.stream() is reachable from tasks.json — new
  streaming_status_updates case asserts the SSE event sequence.
- knowledge/store.py: PRAGMA journal_mode=WAL is now best-effort
  (read-only DBs no longer break _connect). FTS5 rebuild after
  schema install so an existing chunks table populated before FTS
  was added gets indexed. find_chunk_containing/delete_by_content
  reject empty/whitespace-only inputs to prevent LIKE '%%' wildcards
  from matching every row.

Hardening:
- tools/lg_tools.py: clamp memory_recall(k) to [1, 20] and
  memory_list(limit) to [1, 200] so the agent can't request
  arbitrarily large slices of the KB.

Doc cleanup:
- docs/guides/subagents.md: LangGraphConfig snippet had a stale
  "echo" reference; replaced with the new memory-tool list.
- docs/tutorials/first-tool.md: WORKER_CONFIG example now appends
  git_sha alongside the bundled defaults instead of replacing
  them and dropping the memory tools.
- docs/reference/starter-tools.md: "adding your own" snippet now
  preserves the conditional _build_memory_tools(knowledge_store)
  extension.
- tests/test_config_io.py: starter-tool contract assertion now
  also covers web_search.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mabry1985
Copy link
Copy Markdown
Contributor Author

Round 1 of CodeRabbit feedback addressed in 4d0d428. Real bugs first, hardening after, docs last.

Real bugs

  • evals/runner.py — teardown moved into a finally block so seeded KB rows always clean up, even when client.ask() or the verifier raises.
  • evals/runner.pyexpected_tools is not None so [] now correctly asserts no tools fired (the abstention case was a no-op before).
  • evals/runner.py + tasks.json — added stream runner kind + a streaming_status_updates case so AgentClient.stream() is reachable from task definitions.
  • knowledge/store.pyPRAGMA journal_mode=WAL is now best-effort (read-only DBs stay usable). FTS5 rebuild runs after schema install so a chunks table populated before FTS was added gets indexed. find_chunk_containing and delete_by_content reject empty / whitespace-only inputs (prevents LIKE '%%' from matching every row).

Hardening

  • tools/lg_tools.pymemory_recall(k) clamped to [1, 20], memory_list(limit) to [1, 200].

Docs

  • docs/guides/subagents.md — stale echo reference removed from the LangGraphConfig snippet.
  • docs/tutorials/first-tool.mdWORKER_CONFIG example now appends git_sha alongside the bundled defaults rather than replacing them and dropping the memory tools.
  • docs/reference/starter-tools.md — "adding your own" snippet preserves the conditional _build_memory_tools(knowledge_store) extension.
  • tests/test_config_io.py — starter-tool contract now also asserts web_search.

Tests in scope (203 across the affected files): same 1 pre-existing baseline failure (test_subscribe_endpoint_origin_rejected) that reproduces on main and is unrelated to this PR.

🤖 Generated with Claude Code

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/runner.py`:
- Around line 243-244: The nested condition checking setup_applied before
testing for a teardown key is redundant; update the block that currently uses
`if setup_applied or "teardown" in case:` followed by `if "teardown" in case:`
to a single guard that directly checks `if "teardown" in case:` (remove the
`setup_applied` portion and the inner duplicate check) so the teardown branch
runs only when the `"teardown"` key is present; adjust any dependent logic that
relied on `setup_applied` being true inside that branch if necessary (look for
references to `setup_applied` and `case` in the surrounding code).

In `@knowledge/store.py`:
- Around line 91-92: Update the _now_iso function to use the Python 3.11+
datetime.UTC alias instead of timezone.utc; specifically modify the _now_iso()
implementation so it calls datetime.now(datetime.UTC).isoformat(); keep the
function name _now_iso and its return type, and only apply this change if the
project targets Python 3.11+ (otherwise leave as timezone.utc).

In `@tools/lg_tools.py`:
- Around line 331-334: Replace the explicit loop that builds lines with a list
comprehension for clarity and conciseness: instead of initializing lines and
looping over results, create lines with a comprehension like the one that
produces f"[{r.get('domain', '?')}] {r['preview']}" for each r in results, then
return "\n".join(lines); update the block containing variables results, lines,
and r accordingly to remove the loop and keep the same behavior.
- Around line 284-285: Update the _build_memory_tools function signature to
include a return type annotation (e.g. -> list[Tool] or -> list[Any]) to reflect
that it returns a list of tool objects; import or reference the appropriate type
(Tool or Any) from typing or the module that defines the tool class and adjust
the signature from def _build_memory_tools(knowledge_store): to include the
chosen return annotation so IDEs and type checkers pick up the correct return
type.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c6e241b4-d596-4c12-9a86-9665470c476a

📥 Commits

Reviewing files that changed from the base of the PR and between 433b44e and 4d0d428.

📒 Files selected for processing (8)
  • docs/guides/subagents.md
  • docs/reference/starter-tools.md
  • docs/tutorials/first-tool.md
  • evals/runner.py
  • evals/tasks.json
  • knowledge/store.py
  • tests/test_config_io.py
  • tools/lg_tools.py

Comment thread evals/runner.py Outdated
Comment thread knowledge/store.py Outdated
Comment thread tools/lg_tools.py Outdated
Comment thread tools/lg_tools.py Outdated
- evals/runner.py: collapse redundant nested teardown guard into a
  single `if "teardown" in case:` (SIM102); remove now-unused
  `setup_applied` flag
- knowledge/store.py: use `datetime.UTC` alias (Python 3.11+, UP017)
- tools/lg_tools.py: add `-> list` return annotation to
  `_build_memory_tools` (ANN202); replace explicit loop with list
  comprehension in `memory_recall` (PERF401)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

https://claude.ai/code/session_01148o8ppbuQwuZBsVGTQWwQ
Copy link
Copy Markdown
Contributor Author

Round 2 review — addressed in commit cab3bd8

CodeRabbit's second review (submitted 2026-04-27T21:51Z, after the round-1 push at 21:47Z) posted 4 new inline comments, all labelled Trivial / Nitpick. All four have been applied:

File Finding Action
evals/runner.py L243-244 Redundant nested if setup_applied or "teardown" in case: wrapping if "teardown" in case: (SIM102) Collapsed to single if "teardown" in case:; removed the now-unused setup_applied flag.
knowledge/store.py L92 Use datetime.UTC alias instead of timezone.utc (UP017) Changed import to from datetime import UTC, datetime; updated call-site. Valid: project requires Python ≥ 3.11.
tools/lg_tools.py L284 Missing return type annotation on _build_memory_tools (ANN202) Added -> list to the function signature.
tools/lg_tools.py L331-334 Explicit loop building lines — should be a list comprehension (PERF401) Replaced with one-liner comprehension.

All three touched files were spot-checked against the pre-existing test run: same 11 failures (asyncio/import issues) with and without these changes — confirmed pre-existing baseline.


Generated by Claude Code

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (1)
evals/runner.py (1)

150-157: ⚠️ Potential issue | 🟠 Major

Setup failures can still leak seeded KB rows.

evals/verify.py::apply_setup() applies steps incrementally and returns on the first failure. Because this return happens before Line 162 enters the try, a partial setup failure skips the finally on Lines 237-242 and any configured teardown never runs, so later cases can inherit leftover fixture data. Move setup inside the try so cleanup also happens on setup errors.

🧹 Suggested fix
-    # Pre-seed state via direct DB writes (model never sees this).
-    if "setup" in case:
-        err = verify.apply_setup(case["setup"])
-        if err:
-            return CaseResult(
-                case["id"], case["category"], case["name"], False,
-                f"setup failed: {err}",
-            )
-
     events: list[dict] = []
     result: TaskResult | None = None
 
     try:
+        # Pre-seed state via direct DB writes (model never sees this).
+        if "setup" in case:
+            err = verify.apply_setup(case["setup"])
+            if err:
+                return CaseResult(
+                    case["id"], case["category"], case["name"], False,
+                    f"setup failed: {err}",
+                )
         since = verify.audit_now()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/runner.py` around lines 150 - 157, The setup path currently calls
verify.apply_setup(...) before entering the surrounding try/finally, so a
partial failure can return early and skip the cleanup; move the call to
verify.apply_setup(case["setup"]) inside the same try block that contains the
finally cleanup, and on error still produce the same CaseResult (using
CaseResult(case["id"], case["category"], case["name"], False, f"setup failed:
{err}")) but only after the finally has run; reference verify.apply_setup and
the existing try/finally/CaseResult flow to locate where to relocate the setup
call and ensure teardown always executes even when setup fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/runner.py`:
- Around line 97-103: The current code builds a bearer-only headers dict
(variables expected_status, bad, headers), which prevents tests from exercising
X-API-Key scenarios; modify the request header construction to accept headers
from the test case (e.g., case.get("headers")) and merge them with sensible
defaults like "Content-Type": "application/json" and a default Authorization
using case.get("bad_token",""), so tasks can specify missing/invalid X-API-Key
or other auth combos; ensure case-supplied headers override defaults when
present so auth scenarios in tasks.json are honored.
- Around line 192-199: The fixed 0.3s sleep is brittle; replace it with a short
polling loop that repeatedly calls verify.audit_entries_since(since) and
verify.assert_tools_fired(entries, expected_tools, require_success=...) until
either the assertion passes or a deadline (e.g. 1–3 seconds) is reached; on
timeout fail the test with the last detail. Locate the block that uses
expected_tools, since, verify.audit_entries_since and verify.assert_tools_fired
in runner.py and swap the single asyncio.sleep for an async loop that awaits
small intervals (e.g. 50–200ms) between polls, updates entries each iteration,
and breaks early when passed is True. Ensure the final behavior preserves
require_success computed from case.get("tool_outcome", "success") and returns or
raises the same failure path when timed out.

In `@knowledge/store.py`:
- Around line 311-316: The FTS5 query builds MATCH from raw tokens in
_search_fts (tokens = [...] and match = " OR ".join(tokens)), which lets
apostrophes and reserved words break the SQL; instead quote/escape each token
with SQLite's FTS5 quoting helper (use fts5_quote() or an equivalent escape
function) before joining so the MATCH expression is safe (e.g., mapped_token =
fts5_quote(t) for t in tokens), then join mapped tokens with " OR " and use that
for the MATCH; keep the early empty-token check and existing fallback-to-LIKE
behavior intact.
- Around line 176-199: The exception handlers in the store code only catch
sqlite3.OperationalError but must catch sqlite3.DatabaseError to honor the
corrupt-database/read-only contract; update every except block that currently
uses "except sqlite3.OperationalError" to "except sqlite3.DatabaseError" (or the
broader sqlite3.Error/DatabaseError as appropriate) so operations like the
schema init (the try around self._connect(), db.executescript(_SCHEMA), the FTS
rebuild block that calls db.execute(...), and the outer commit/close) and any
other error-handling paths that reference sqlite3.OperationalError will
correctly handle corrupt or read-only DBs (search for the other handlers in this
module and replace them similarly).
- Around line 429-458: The LIKE patterns in find_chunk_containing (the SQL using
"WHERE (content LIKE ? OR heading LIKE ?)") and delete_by_content (the "DELETE
FROM chunks WHERE content LIKE ?") must treat input literally: escape any '%'
and '_' in the input by replacing '%' -> '\%' and '_' -> '\_' (taking care to
escape the backslash in Python strings), build the parameter as f"%{escaped}%",
and add " ESCAPE '\'" to the SQL statements so SQLite treats '\' as the escape
character; update the params passed to db.execute accordingly and keep the
domain handling as-is.

---

Duplicate comments:
In `@evals/runner.py`:
- Around line 150-157: The setup path currently calls verify.apply_setup(...)
before entering the surrounding try/finally, so a partial failure can return
early and skip the cleanup; move the call to verify.apply_setup(case["setup"])
inside the same try block that contains the finally cleanup, and on error still
produce the same CaseResult (using CaseResult(case["id"], case["category"],
case["name"], False, f"setup failed: {err}")) but only after the finally has
run; reference verify.apply_setup and the existing try/finally/CaseResult flow
to locate where to relocate the setup call and ensure teardown always executes
even when setup fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: de696887-9338-4c91-8268-f0dc82e83016

📥 Commits

Reviewing files that changed from the base of the PR and between 4d0d428 and cab3bd8.

📒 Files selected for processing (3)
  • evals/runner.py
  • knowledge/store.py
  • tools/lg_tools.py

Comment thread evals/runner.py
Comment thread evals/runner.py
Comment thread knowledge/store.py Outdated
Comment thread knowledge/store.py Outdated
Comment thread knowledge/store.py Outdated
Real bugs:
- evals/runner.py: setup is now inside the try block so a partial
  setup failure (e.g. step 2 of 3 errors) still triggers the
  finally teardown — rows from steps that did succeed no longer
  leak into the next case. Was flagged as a duplicate from round 2.
- knowledge/store.py: LIKE patterns now escape % and _ via
  ESCAPE '\' on every clause that takes user input
  (find_chunk_containing, delete_by_content, _search_like). A
  query for "100%" or "hello_world" no longer silently matches
  every row containing "100" or any single character between
  "hello" and "world".
- knowledge/store.py: FTS5 MATCH tokens are now double-quoted via
  _fts_quote() so user-supplied query terms can't smuggle FTS5
  operators (column filters, prefix wildcards, NEAR, AND/OR/NOT).
  Defence in depth — the [\w']+ tokenizer already filters most
  special chars.

Hardening:
- evals/runner.py: the fixed 0.3s asyncio.sleep waiting for the
  audit log to flush is gone. _await_audit_assertion now polls
  every 50ms up to a 2s deadline and returns as soon as the
  assertion passes — exits early on success, only burns the full
  deadline when the tool genuinely never fired.
- evals/runner.py: _run_auth_check accepts case["headers"] so cases
  can override the default bearer-only header set and exercise
  X-API-Key auth scenarios (or both auths together).
- knowledge/store.py: per-method exception handlers broadened from
  sqlite3.OperationalError to sqlite3.DatabaseError. Catches
  IntegrityError, ProgrammingError, and corruption variants too
  without crashing the agent loop. _has_fts5 (probe) and _connect
  (connection-time errors only) keep the narrower OperationalError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mabry1985
Copy link
Copy Markdown
Contributor Author

Round 3 of CodeRabbit feedback addressed in b3a9f1d. All 5 actionable comments + the duplicate Major from round 2 are applied.

Real bugs

  • evals/runner.pyapply_setup moved inside the try block so a partial setup failure (e.g. step 2 of 3 errors) still triggers the finally teardown. Was the round-2 duplicate Major.
  • knowledge/store.py — LIKE patterns now escape % and _ with ESCAPE '\' in find_chunk_containing, delete_by_content, and _search_like. Verified: a search for "100%" matches only rows containing literal 100%, not every row starting with 100.
  • knowledge/store.py — FTS5 MATCH tokens are double-quoted via _fts_quote() so user-supplied query terms can't smuggle FTS5 operators (column filters, prefix wildcards, NEAR, AND/OR/NOT). Defence in depth — the [\w']+ tokenizer already filtered most special chars, but the quote pass closes the rest.

Hardening

  • evals/runner.py — the fixed await asyncio.sleep(0.3) audit-log wait is gone. _await_audit_assertion polls every 50ms up to a 2s deadline and exits as soon as the assertion passes — early exit on success, deadline only used when the tool genuinely never fired.
  • evals/runner.py_run_auth_check now accepts case["headers"] so cases can exercise X-API-Key auth too. Default behaviour is unchanged.
  • knowledge/store.py — per-method exception handlers broadened from sqlite3.OperationalError to sqlite3.DatabaseError. Catches IntegrityError, ProgrammingError, and corruption variants too. _has_fts5 (probe) and _connect (connection-time only) keep the narrower OperationalError.

Smoke tests pass:

  • _escape_like("100%")"100\\%"; _fts_quote('he"llo')'"he""llo"'.
  • End-to-end: find_chunk_containing("100%") matches only the literal-percent row; find_chunk_containing("hello_world") matches only the underscore row; search("AND OR NOT") is treated as three literal tokens, no FTS5 syntax error.
  • Scoped pytest (168 tests across test_starter_tools / test_config_io / test_skill_emission / test_a2a_handler / test_a2a_integration): all pass except the same single pre-existing test_subscribe_endpoint_origin_rejected failure that reproduces on main.

🤖 Generated with Claude Code

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
knowledge/store.py (1)

234-239: ⚠️ Potential issue | 🟠 Major

Widen exception scope in _get_db to honor the no-crash contract.

The module docstring (lines 20–22) states that the store "swallows sqlite3.DatabaseError" so it "never crashes the agent loop on a corrupt or read-only DB." However, _get_db only catches OperationalError. Corrupt databases raise sqlite3.DatabaseError (the parent class), which would propagate and crash the caller.

🔧 Suggested fix
     def _get_db(self) -> sqlite3.Connection | None:
         try:
             return self._connect()
-        except sqlite3.OperationalError as exc:
+        except sqlite3.DatabaseError as exc:
             log.error("[knowledge] connect failed: %s", exc)
             return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@knowledge/store.py` around lines 234 - 239, In _get_db, broaden the exception
catch to sqlite3.DatabaseError (the parent of OperationalError) so corrupt or
read-only DB errors are swallowed as the module docstring requires; change the
except clause in the _get_db method (currently "except sqlite3.OperationalError
as exc") to catch sqlite3.DatabaseError instead and keep the existing log.error
call and return None behavior.
evals/runner.py (1)

237-239: 🧹 Nitpick | 🔵 Trivial

Prefix unused entries variable with underscore.

The entries value from _await_audit_assertion is unpacked but never used within _run_prompt_case. Prefix it with _ to indicate it's intentionally ignored and silence the static analysis warning.

♻️ Suggested fix
-            entries, passed, detail = await _await_audit_assertion(
+            _entries, passed, detail = await _await_audit_assertion(
                 since, expected_tools, require_success=require_success,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/runner.py` around lines 237 - 239, In _run_prompt_case, the first value
returned by _await_audit_assertion is unpacked to entries but never used; change
the unpacking to prefix that variable with an underscore (e.g., _entries) so it
becomes _entries, passed, detail = await _await_audit_assertion(...) to signal
intentional unused value and silence static-analysis warnings referencing
_await_audit_assertion and _run_prompt_case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/runner.py`:
- Around line 170-179: Replace uses of asyncio.get_event_loop().time() with
asyncio.get_running_loop().time() when computing deadline and checking time in
the loop: update the two call sites around the deadline assignment and the loop
condition where deadline is compared, keeping the same logic with variables
'deadline', 'since', and the verify.audit_entries_since /
verify.assert_tools_fired calls unchanged so the behavior and comparisons remain
identical.
- Around line 339-340: The timestamp in the payload uses
datetime.now(timezone.utc).isoformat() which is inconsistent with the codebase
alias; update the payload construction to use the datetime.UTC alias (i.e.,
datetime.now(datetime.UTC).isoformat()) so it matches other files like
knowledge/store.py and keeps style consistent; locate the payload assignment in
evals/runner.py (the payload variable and the datetime.now(...) call) and
replace timezone.utc with datetime.UTC.

In `@knowledge/store.py`:
- Around line 229-230: Replace the log.error(...) calls inside the exception
handlers that catch sqlite3.DatabaseError (the blocks that currently read
"except sqlite3.DatabaseError as exc: log.error(..., exc)") with
log.exception(...) so the error message for schema/init/database failures
includes the full traceback; keep the same descriptive message (including
self.path or other context) when calling log.exception so callers see both the
message and the exception stack. Ensure you update every occurrence of that
pattern (the sqlite3.DatabaseError except blocks where log.error is used) rather
than adding new try/except logic.

---

Duplicate comments:
In `@evals/runner.py`:
- Around line 237-239: In _run_prompt_case, the first value returned by
_await_audit_assertion is unpacked to entries but never used; change the
unpacking to prefix that variable with an underscore (e.g., _entries) so it
becomes _entries, passed, detail = await _await_audit_assertion(...) to signal
intentional unused value and silence static-analysis warnings referencing
_await_audit_assertion and _run_prompt_case.

In `@knowledge/store.py`:
- Around line 234-239: In _get_db, broaden the exception catch to
sqlite3.DatabaseError (the parent of OperationalError) so corrupt or read-only
DB errors are swallowed as the module docstring requires; change the except
clause in the _get_db method (currently "except sqlite3.OperationalError as
exc") to catch sqlite3.DatabaseError instead and keep the existing log.error
call and return None behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 547c1280-742e-4685-bc51-8c02425b92dd

📥 Commits

Reviewing files that changed from the base of the PR and between cab3bd8 and b3a9f1d.

📒 Files selected for processing (2)
  • evals/runner.py
  • knowledge/store.py

Comment thread evals/runner.py Outdated
Comment thread evals/runner.py Outdated
Comment thread knowledge/store.py Outdated
- evals/runner.py: use asyncio.get_running_loop() instead of the
  deprecated get_event_loop() inside the _await_audit_assertion coroutine
- evals/runner.py: prefix unused _entries return value with underscore
- evals/runner.py: use datetime.UTC alias (consistent with store.py),
  drop now-unused timezone import
- knowledge/store.py: broaden _get_db exception catch from
  OperationalError to DatabaseError so corrupt-DB errors are swallowed
  per the module's no-crash contract
- knowledge/store.py: replace log.error with log.exception in all three
  DatabaseError handlers (schema init, _get_db, add_chunk) so
  tracebacks appear in error logs

Co-Authored-By: Claude <claude@anthropic.com>

https://claude.ai/code/session_01YW5U6mtpLy4rzKmqd4trkH
Copy link
Copy Markdown
Contributor Author

Round 4 — fix(review-4) (commit b713d8d)

CodeRabbit's round-4 review (submitted 2026-04-27T22:30:33Z, covering commit b3a9f1d) posted 3 new actionable items plus 2 duplicate carryovers from round 3. All 5 are addressed below.


✅ Fixed

# File Finding Action
1 evals/runner.py L170, L179 asyncio.get_event_loop().time() used inside a coroutine — deprecated since Python 3.10, raises DeprecationWarning Changed to asyncio.get_running_loop().time() in both call sites inside _await_audit_assertion
2 knowledge/store.py _get_db Catches sqlite3.OperationalError instead of the broader sqlite3.DatabaseError, violating the module's stated no-crash contract (corrupt-DB errors would propagate) Changed except sqlite3.OperationalErrorexcept sqlite3.DatabaseError
3 knowledge/store.py L230, L238, L270 log.error(...) in three DatabaseError handlers omits tracebacks Replaced with log.exception(...) (no exc arg needed — captured automatically) across schema-init, _get_db, and add_chunk
4 evals/runner.py L237 entries unpacked from _await_audit_assertion but never used (duplicate carryover) Renamed to _entries
5 evals/runner.py L340 datetime.now(timezone.utc) inconsistent with knowledge/store.py style (duplicate carryover) Changed to datetime.now(UTC); removed now-unused timezone import

ℹ️ Not addressed (out of scope / already correct)

  • knowledge/store.py L133 (_has_fts5) and L202 (WAL PRAGMA) — both still catch sqlite3.OperationalError. These are intentional: _has_fts5 is probing for FTS5 support, and the WAL PRAGMA is explicitly best-effort for read-only mounts. Neither is a contract-violation.
  • knowledge/store.py L319, L422, L439, L479, L504, L522 — these DatabaseError handlers use log.warning, not log.error. CodeRabbit's finding targeted the log.error pattern specifically; warning-level handlers are deliberately lower-severity for expected operational failures (search miss, stats failure, etc.).

Round 4 has no new blockers. Ready to merge when the team is.


Generated by Claude Code

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
evals/runner.py (1)

231-241: ⚠️ Potential issue | 🟠 Major

Regression: expected_tools: [] no longer asserts abstention.

The logic at Line 235 still treats empty lists like normal expected-tool checks, and verify.assert_tools_fired(..., expected=[]) passes even when tools fired. This contradicts the Line 231-233 contract.

💡 Suggested fix
         expected_tools = case.get("expected_tools")
         if expected_tools is not None:
             require_success = case.get("tool_outcome", "success") == "success"
-            _entries, passed, detail = await _await_audit_assertion(
-                since, expected_tools, require_success=require_success,
-            )
-            if not passed:
-                problems.append(detail)
+            if not expected_tools:
+                loop = asyncio.get_running_loop()
+                deadline = loop.time() + _AUDIT_POLL_DEADLINE_S
+                while True:
+                    entries = verify.audit_entries_since(since)
+                    fired = [e.get("tool", "?") for e in entries if e.get("tool")]
+                    if fired:
+                        problems.append(f"unexpected tools fired: {fired}")
+                        break
+                    if loop.time() >= deadline:
+                        break
+                    await asyncio.sleep(_AUDIT_POLL_INTERVAL_S)
+            else:
+                _entries, passed, detail = await _await_audit_assertion(
+                    since, expected_tools, require_success=require_success,
+                )
+                if not passed:
+                    problems.append(detail)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evals/runner.py` around lines 231 - 241, The test treats an explicit empty
list for expected_tools as an abstention assertion but the current branch treats
[] like a normal expectation; add an explicit branch for expected_tools == []
that enforces “no tools fired”: detect if expected_tools is a list and
len(expected_tools) == 0, then call the audit check that verifies zero tool
entries (either by adding an assert_none/expect_empty flag to
_await_audit_assertion or by calling a new helper like
_await_no_tools_assertion(since)) and append detail to problems if that check
fails; keep the existing path for non-empty expected_tools and continue to use
require_success derived from case.get("tool_outcome").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@evals/runner.py`:
- Around line 231-241: The test treats an explicit empty list for expected_tools
as an abstention assertion but the current branch treats [] like a normal
expectation; add an explicit branch for expected_tools == [] that enforces “no
tools fired”: detect if expected_tools is a list and len(expected_tools) == 0,
then call the audit check that verifies zero tool entries (either by adding an
assert_none/expect_empty flag to _await_audit_assertion or by calling a new
helper like _await_no_tools_assertion(since)) and append detail to problems if
that check fails; keep the existing path for non-empty expected_tools and
continue to use require_success derived from case.get("tool_outcome").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d06a7a4-c90c-4424-9401-2d0aeb53f4d3

📥 Commits

Reviewing files that changed from the base of the PR and between b3a9f1d and ae752a5.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • evals/runner.py
  • knowledge/store.py

@mabry1985 mabry1985 merged commit 3749914 into main Apr 27, 2026
1 check passed
@mabry1985 mabry1985 deleted the feat/default-kb-and-eval-harness branch April 27, 2026 22:54
mabry1985 added a commit that referenced this pull request Apr 28, 2026
…uler

docs: sync after KB store + scheduler PRs (#155 / #156)
mabry1985 added a commit that referenced this pull request May 25, 2026
* chore: release v0.2.0

First tagged release. Contents of community-improvements project:

M1 — Security Hardening (A2A bearer auth, audit redaction, origin verification)
M2 — Memory On By Default (session persistence + load-on-start)
M3 — Skill Loop (skill-v1 emission + SQLite FTS5 index + curator)

Plus: .gitignore cleanup for .automaker-lock + .worktrees, docs coverage of
security layer, skill-loop architecture, and new env vars.

Manual bump because prepare-release.yml requires GH_PAT secret (not configured).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: release v0.2.1

Bug fixes from v0.2.0 smoke testing:
- Agent card now advertises bearer scheme when A2A_AUTH_TOKEN is set
- Session memory persistence actually fires (moved from unreachable on_session_end to after_agent)
- Test suite collects cleanly in fresh Docker env
- MemoryMiddleware activates standalone (without knowledge_store)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(ci): update repo homepage after docs deploy (#149)

Writes the deployed GitHub Pages URL back to the repo's `homepage`
field so it renders in the About sidebar on the repo page.

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ui): live-edit config drawer with model discovery + SOUL.md editor

Elevates langgraph-config.yaml + SOUL.md into a typed form inside the
Gradio sidebar so forks can iterate on model / temperature / tools /
middleware / persona without a code edit + restart. Save rebuilds the
compiled graph in place; in-flight turns finish on the prior graph.

The model dropdown is populated from the connected gateway's
`/v1/models` endpoint so forks always see what's actually available
through the configured api_base + api_key, no hardcoded list to drift.

graph/config_io.py is the new I/O layer: YAML round-trip preserves
comments and unknown top-level sections (the shipped YAML's
memory/skills blocks that the dataclass doesn't model), dual-location
SOUL.md handling writes to both /sandbox/SOUL.md (runtime) and
config/SOUL.md (source), and gateway model discovery returns a
readable error string instead of raising when the endpoint is down.

Also exposes GET/POST /api/config + GET /api/config/models for
external control, and falls SOUL back to config/SOUL.md in
graph/prompts.py so local dev without a /sandbox mount still picks
up drawer edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: first-run setup wizard + autostart + SOUL presets

Turns the "fork a template and edit code" onboarding into a
download-and-run flow. A fresh clone boots without any env vars,
lands in a 4-step wizard (Connect / Identity / Tools / Profile),
and writes out config + SOUL.md + a .setup-complete marker on
Launch — the chat UI then appears on the same page, drawer
pre-populated with the wizard's values.

Key pieces:

- Wizard UI in chat_ui.py: visibility-toggled wizard pane vs
  chat pane, populated from the live config so re-runs pre-fill.
  4 ship-with presets in config/soul-presets/ (generic-assistant,
  research, coding, blank) power the persona dropdown.
- Lazy graph init in server.py: no model required at boot. The
  chat endpoints return a friendly "setup required" message
  until the wizard completes. After wizard save, the marker is
  flipped BEFORE the graph reload so the rebuild actually
  compiles (this order matters — earlier iteration reloaded
  before marking complete and left _graph=None).
- Identity/auth/runtime sections added to LangGraphConfig so
  the wizard-captured name, operator, A2A token, and autostart
  flag round-trip through the existing YAML infrastructure.
  agent_name() resolver prefers YAML identity.name → env →
  "protoagent" so the agent card + OpenAI-compat model id
  reflect the wizard value without a process restart.
- autostart.py: macOS LaunchAgent install/uninstall with
  Linux/Windows stubs. Captures sys.executable at install time
  so venv-based runs survive a reboot. Opt-in via wizard
  checkbox; toggle from drawer anytime.
- Dockerfile: config volume declared so setup persistence
  survives docker run without a -v flag.
- Docs: first-agent.md rewritten for clone→pip→run→wizard
  flow; old fork/sed/docker content moved to new
  customize-and-deploy.md guide.

Tests: 29 passing (7 new — setup marker lifecycle, preset
discovery, preset content shape, shipped starter presence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ui): sync wizard/chat visibility on every page load

Without this, a browser refresh after the marker was written
externally (POST /api/config/setup, or /api/config/reset-setup
from another tab) kept Gradio serving its initial visibility
snapshot — wizard visible even though setup is done, or vice
versa. app.load runs per page visit so visibility tracks
is_setup_complete() live.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(llm): override OpenAI SDK User-Agent to bypass gateway WAF

Cloudflare's managed WAF in front of api.proto-labs.ai (and likely
other gateways behind a default WAF config) blocks the OpenAI
Python SDK's `OpenAI/Python <ver>` User-Agent with a 403 "Your
request was blocked". /v1/models went through fine because the
gateway's model-list handler doesn't gate on UA the same way —
only /v1/chat/completions 403'd, which made this look like a key
or model-alias problem rather than what it actually was.

tools/lg_tools.py already sets a custom UA on its outbound httpx
fetches for exactly this reason; graph/llm.py had no equivalent,
so ChatOpenAI fell back to the SDK default. Threading the same
identifier through default_headers makes every protoAgent egress
present a consistent allowlisted UA.

Verified: product-director wizard → chat turn → 200 OK from
api.proto-labs.ai with the groq-llama-70b alias.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): address PR #150 review feedback

Critical — path traversal in preset loader (graph/config_io.py:
read_soul_preset):
  Inputs like "../secret" escaped config/soul-presets/ and read
  arbitrary .md files anywhere on disk. Resolve both the preset
  root and the candidate path and require the latter live inside
  the former before reading. 7 parametrized tests cover the
  malicious inputs I could think of (../, ../../, absolute paths,
  bare "..", mid-path ../../).

Major — YAML auth.token was non-functional for A2A bearer:
  register_a2a_routes captured _a2a_token at register time, so
  wizard-set tokens were ignored until process restart. Promoted
  _a2a_token to a module-level mutable holder (_A2A_TOKEN: list)
  that the closure reads on every request, added set_a2a_token()
  as the public mutator, and a new auth_token= arg to
  register_a2a_routes as the seed source (env still the fallback).
  server.py's reload path now calls set_a2a_token on every YAML
  change so the wizard → live bearer enforcement flow works with
  no restart — verified: fresh boot open → wizard token set →
  401 on wrong token / 200 on right → drawer clears token → open
  again.

Major — plist XML injection in autostart.py:
  Agent names containing <, >, & produced malformed plists (and
  could theoretically inject nodes). xml.sax.saxutils.escape()
  every interpolated string field before embedding.

Major — install_autostart defaulted to port 7870 regardless of
--port flag (autostart.py / server.py):
  Captured the active port in a module-level _active_port at
  _main() time and threaded it through both finish_setup's
  autostart sync and the drawer's toggle_autostart callback. The
  generated LaunchAgent now reboots on whatever port the operator
  launched with.

Minor — chat_ui polish:
  * Numeric fields (max_tokens, max_iterations, worker_max_turns)
    fall back to sensible defaults (4096/50/20) instead of 0 when
    cleared — validate_config_dict rejects zero, so "or 0" blocked
    legitimate saves with a confusing validation error.
  * _sync_visibility no longer aliases the sidebar output slot to
    wizard_pane when the sidebar is absent; split into two closures
    with matching output arities so Gradio doesn't receive duplicate
    updates to the same component.
  * Legacy load_provider_choices handler guards get_current_provider
    existence — KeyError risk when a fork provides get_provider_choices
    alone.

Nitpicks:
  * Remove unused _FIELD_MAP from config_io.py.
  * ASCII hyphen (U+002D) instead of en-dash (U+2013) in the
    temperature validation error.
  * Pin ruamel.yaml>=0.18 in Dockerfile to match requirements.txt.
  * Document the VOLUME anonymous-volume lifecycle and named-volume
    recommendation in the Dockerfile comment.

Not addressed (deliberate):
  * CodeRabbit flagged test_list_gateway_models_http_error as
    expecting httpx.ConnectError to be caught by except httpx.HTTPError
    — false positive, ConnectError → NetworkError → TransportError →
    RequestError → HTTPError, test already passes.
  * "Reuse config_io.read_soul() in graph/prompts.py" — kept the
    inline check to avoid introducing an import dependency from
    prompts.py (loaded early, widely used) into config_io.py.
  * "Use tuple form for @pytest.parametrize" — stylistic; comma-
    separated string works identically.

Test surface: 36 passing (7 new — the path-traversal parametrize set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review-2): address round-2 PR #150 feedback

Major — atomic graph reload (server.py::_reload_langgraph_agent):
  Previously swapped _graph_config + set_a2a_token BEFORE calling
  create_agent_graph, so a failed build would leave the running
  _graph pinned to the OLD agent but reporting the NEW config and
  rotated bearer token. Now build first; commit config/auth/graph
  state only on success.

Major — rollback marker on failed first-run reload (finish_setup):
  mark_setup_complete fires before the reload so the graph compiles.
  If the reload fails, the marker stays and the next page load
  drops the user into chat with _graph=None and no obvious recovery
  path. finish_setup now reset_setup()s on reload failure, so the
  wizard returns for a retry.

Major — sanitize agent_name for plist path (autostart.py::_macos_label):
  Prior sanitization only lowercased + replaced spaces. `/` and `..`
  survived, so an agent name with path-traversal chars could target
  arbitrary paths relative to ~/Library/LaunchAgents/ on install /
  status / uninstall. Strip input to [a-z0-9_.-] and fall back to
  "protoagent" when the result is empty/dots-only. Verified resolved
  plist path stays inside LaunchAgents/ for every path-traversal
  payload I could think of.

Major — gateway api_key off query string (POST /api/config/models):
  GET with ?api_key=... leaks credentials into browser history,
  reverse-proxy access logs, and uvicorn's own request log. Switched
  to POST taking a small ModelsProbeRequest body. Empty body still
  falls back to stored config so the drawer's initial render works.

Major — round-trip identity/security/autostart through the drawer
(chat_ui.py + server.py):
  Drawer previously only edited model/worker/middleware/knowledge/
  SOUL, leaving the wizard's agent name, operator, bearer token,
  and autostart flag with no post-setup edit path. Added three new
  accordion sections (Identity / Security / Autostart), wired them
  into _config_components, _load_all, and _save.

  Extracted _sync_autostart_with_config so the wizard's finish_setup
  and the drawer's save_all both drive the LaunchAgent install/
  uninstall from the same code path — flipping the drawer's
  Autostart checkbox + Save & Reload now installs/removes the plist
  the same way the wizard does.

Verified end-to-end on product-director:
  * Fresh clone → wizard → "pd-renamed" via drawer → agent card
    says pd-renamed, old bearer token → 401, new bearer token → 200
  * Invalid temperature (99) → rejected at validation; YAML +
    marker untouched
  * Path traversal via /api/config/presets/../secret → 404

Tests: 36 passing (no new cases — existing coverage was already
sufficient for these fixes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: ship default knowledge store + side-effect-verified eval harness

The template now ships a working memory loop and end-to-end eval suite
on day one so forks have a green baseline before they touch a single
line of code. Closes #154.

What lands:

- knowledge/store.py — sqlite + FTS5 (LIKE fallback). One ``chunks``
  table backs operator notes (memory_ingest), daily-log entries
  (daily_log), and conversation findings extracted by
  MemoryMiddleware (domain='finding'). Path resolves env >
  config > default with an automatic ~/.protoagent/ fallback when
  /sandbox isn't writable.

- tools/lg_tools.py — five new memory tools (memory_ingest,
  memory_recall, memory_list, memory_stats, daily_log) bound to the
  store via a closure factory so tests get a fresh store per run.
  ``echo`` removed; ``get_all_tools(knowledge_store)`` actually uses
  its parameter now.

- server.py — _build_knowledge_store() constructs the store and
  threads it through both initial init and the drawer reload path.
  Defaults flipped: knowledge_middleware + memory_middleware now
  ON by default (config/langgraph-config.yaml + graph/config.py).

- evals/ — A2A client + runner + verify helpers + 15 starter cases
  (tasks.json) covering agent card discovery, bearer auth gating,
  abstention, every shipped tool, KB recall, a chained two-tool
  case, and KnowledgeMiddleware injection. Side-effect-verified:
  audit log + reply text + KB chunks all checked independently so
  hallucinated tool results get caught.

- docs/guides/evals.md — full how-to. README/TEMPLATE/configuration/
  starter-tools/first-agent updated to reflect the new defaults and
  the additional five memory tools.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): address PR #155 CodeRabbit feedback

Real bugs:
- evals/runner.py: teardown now runs in a finally block so seeded KB
  rows get cleaned up even when the verifier or client.ask() raises.
  expected_tools=[] now means "assert no tools fired" (was conflated
  with "no key" via the `or []` short-circuit, making the abstention
  case a no-op).
- evals/runner.py + tasks.json: added a `stream` runner kind so
  AgentClient.stream() is reachable from tasks.json — new
  streaming_status_updates case asserts the SSE event sequence.
- knowledge/store.py: PRAGMA journal_mode=WAL is now best-effort
  (read-only DBs no longer break _connect). FTS5 rebuild after
  schema install so an existing chunks table populated before FTS
  was added gets indexed. find_chunk_containing/delete_by_content
  reject empty/whitespace-only inputs to prevent LIKE '%%' wildcards
  from matching every row.

Hardening:
- tools/lg_tools.py: clamp memory_recall(k) to [1, 20] and
  memory_list(limit) to [1, 200] so the agent can't request
  arbitrarily large slices of the KB.

Doc cleanup:
- docs/guides/subagents.md: LangGraphConfig snippet had a stale
  "echo" reference; replaced with the new memory-tool list.
- docs/tutorials/first-tool.md: WORKER_CONFIG example now appends
  git_sha alongside the bundled defaults instead of replacing
  them and dropping the memory tools.
- docs/reference/starter-tools.md: "adding your own" snippet now
  preserves the conditional _build_memory_tools(knowledge_store)
  extension.
- tests/test_config_io.py: starter-tool contract assertion now
  also covers web_search.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review-2): address round-2 PR #155 CodeRabbit feedback

- evals/runner.py: collapse redundant nested teardown guard into a
  single `if "teardown" in case:` (SIM102); remove now-unused
  `setup_applied` flag
- knowledge/store.py: use `datetime.UTC` alias (Python 3.11+, UP017)
- tools/lg_tools.py: add `-> list` return annotation to
  `_build_memory_tools` (ANN202); replace explicit loop with list
  comprehension in `memory_recall` (PERF401)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

https://claude.ai/code/session_01148o8ppbuQwuZBsVGTQWwQ

* fix(review-3): address round-3 PR #155 CodeRabbit feedback

Real bugs:
- evals/runner.py: setup is now inside the try block so a partial
  setup failure (e.g. step 2 of 3 errors) still triggers the
  finally teardown — rows from steps that did succeed no longer
  leak into the next case. Was flagged as a duplicate from round 2.
- knowledge/store.py: LIKE patterns now escape % and _ via
  ESCAPE '\' on every clause that takes user input
  (find_chunk_containing, delete_by_content, _search_like). A
  query for "100%" or "hello_world" no longer silently matches
  every row containing "100" or any single character between
  "hello" and "world".
- knowledge/store.py: FTS5 MATCH tokens are now double-quoted via
  _fts_quote() so user-supplied query terms can't smuggle FTS5
  operators (column filters, prefix wildcards, NEAR, AND/OR/NOT).
  Defence in depth — the [\w']+ tokenizer already filters most
  special chars.

Hardening:
- evals/runner.py: the fixed 0.3s asyncio.sleep waiting for the
  audit log to flush is gone. _await_audit_assertion now polls
  every 50ms up to a 2s deadline and returns as soon as the
  assertion passes — exits early on success, only burns the full
  deadline when the tool genuinely never fired.
- evals/runner.py: _run_auth_check accepts case["headers"] so cases
  can override the default bearer-only header set and exercise
  X-API-Key auth scenarios (or both auths together).
- knowledge/store.py: per-method exception handlers broadened from
  sqlite3.OperationalError to sqlite3.DatabaseError. Catches
  IntegrityError, ProgrammingError, and corruption variants too
  without crashing the agent loop. _has_fts5 (probe) and _connect
  (connection-time errors only) keep the narrower OperationalError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review-4): address round-4 PR #155 CodeRabbit feedback

- evals/runner.py: use asyncio.get_running_loop() instead of the
  deprecated get_event_loop() inside the _await_audit_assertion coroutine
- evals/runner.py: prefix unused _entries return value with underscore
- evals/runner.py: use datetime.UTC alias (consistent with store.py),
  drop now-unused timezone import
- knowledge/store.py: broaden _get_db exception catch from
  OperationalError to DatabaseError so corrupt-DB errors are swallowed
  per the module's no-crash contract
- knowledge/store.py: replace log.error with log.exception in all three
  DatabaseError handlers (schema init, _get_db, add_chunk) so
  tracebacks appear in error logs

Co-Authored-By: Claude <claude@anthropic.com>

https://claude.ai/code/session_01YW5U6mtpLy4rzKmqd4trkH

* chore: add uv.lock generated during round-4 review session

https://claude.ai/code/session_01YW5U6mtpLy4rzKmqd4trkH

* feat: ship pluggable scheduler (local sqlite + Workstacean adapter)

Adds a default scheduler so agents can defer work to themselves —
"remind me tomorrow", recurring sweeps, deadline check-ins. Three
new tools land in get_all_tools() when a backend is wired up:
schedule_task, list_schedules, cancel_schedule.

Two backends ship behind a single SchedulerBackend protocol:

- LocalScheduler (default): sqlite + asyncio polling. Per-agent
  jobs.db at /sandbox/scheduler/<agent_name>/ with a
  ~/.protoagent/scheduler/<agent_name>/ fallback. Fires by POSTing
  message/send to the running agent's own /a2a endpoint, going
  through bearer + X-API-Key auth like a real caller (audit log +
  cost-v1 capture work the same). Cron expressions reschedule via
  croniter; ISO datetimes are one-shot. Missed-fire recovery:
  within 24h fires immediately, older fires roll forward without
  firing.

- WorkstaceanScheduler: HTTP adapter to a Workstacean install's
  POST /publish. Activated automatically when
  WORKSTACEAN_API_BASE and WORKSTACEAN_API_KEY env vars are set.
  Topic and job IDs are namespaced cron.<agent_name>.<job_id>
  so a single Workstacean can serve N protoAgent forks safely.

Multi-agent isolation is the headline architectural property —
spinning up gina-personal alongside gina-work on the same box (or
sharing one Workstacean) won't cross-fire scheduled prompts.
Verified with explicit tests in test_scheduler_local.py.

Wiring:
- scheduler/{__init__,interface,local,workstacean}.py — module
- tools/lg_tools.py — _build_scheduler_tools factory; get_all_tools
  takes a new optional scheduler= kwarg
- graph/agent.py — create_agent_graph and create_simple_agent
  accept scheduler=
- server.py — _build_scheduler() picks backend at boot,
  on_event("startup"/"shutdown") drives the polling task lifecycle,
  reload path reuses the running scheduler instance
- config/langgraph-config.yaml + graph/{config,subagents/config}.py
  — worker subagent gets the three new tools in its allowlist
- requirements.txt — croniter>=2.0

Tests: 48 new (test_scheduler_local.py covers add/list/cancel,
multi-agent isolation, reschedule-vs-delete, missed-fire recovery,
and an end-to-end fire path with httpx mocked; test_scheduler_workstacean.py
covers all the publish payload assertions, namespacing,
custom topic prefix, and HTTP error handling).

Docs: docs/guides/scheduler.md (Diataxis how-to with the firing
model, multi-agent story, env reference, and notes on the
Workstacean A2A-bridge gap), plus index/configuration/README/
TEMPLATE updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): address round-1 CodeRabbit findings on scheduler PR

- docs/guides/scheduler.md: replace jargon "multiple ginas" with "multiple agents"
- scheduler/__init__.py: sort __all__ lexicographically (RUF022)
- scheduler/local.py: log.error → log.exception in _init_db to preserve traceback (TRY400)
- scheduler/workstacean.py: correct stale module docstring that claimed list_jobs()
  issues a list command — it returns [] unconditionally
- server.py: add -> None return annotations to _scheduler_startup/_scheduler_shutdown (ANN202)
- tests: add match= to two bare pytest.raises(ValueError) calls (PT011)
- tools/lg_tools.py: wrap blocking scheduler calls in asyncio.to_thread() to avoid
  blocking the event loop under concurrent load; fix cancel_schedule error message to
  not conflate transport/DB failures with "no such job"

Co-Authored-By: claude-code <claude@anthropic.com>

https://claude.ai/code/session_01JmFYJSYRMRndZ43g3AYW2q

* fix(review-2): address round-2 PR #156 CodeRabbit feedback

Real bugs:
- scheduler/local.py: _fire() now returns bool (True on 2xx, False
  on HTTP error or network exception). _tick() only reschedules /
  deletes when _fire() succeeds, so a transient failure leaves the
  job in place for the next tick to retry. Previously a one-shot
  job hit by a 5xx silently vanished.
- server.py: the API key env var name now uses AGENT_NAME_ENV.upper()
  to match the auth handler at L893. The previous code read
  agent_name() (which returns the wizard-set identity.name when set),
  so a wizard rename pointed the scheduler at <RENAMED>_API_KEY while
  the auth handler still expected <ENV>_API_KEY → self-invocation
  401'd silently after every wizard rename.
- server.py: reload path now constructs a scheduler when _scheduler
  is None (first-run case: server boots pre-setup, wizard finishes,
  drawer triggers reload — this is when we *first* construct the
  scheduler). Existing instances are still reused — drawer saves
  don't tear down the polling loop.

Surface:
- tools/lg_tools.py: exported SCHEDULER_TOOL_NAMES and MEMORY_TOOL_NAMES
  as module constants.
- graph/config_io.py::list_available_tools: now exposes scheduler +
  memory tool names to the wizard's checkbox group even when the
  runtime hasn't yet constructed the underlying backends. Otherwise
  the wizard would hide tools that the runtime registers as soon as
  the user finishes setup.

Declined:
- scheduler/local.py L141-149: CodeRabbit asked to re-raise
  sqlite3.DatabaseError from _init_db. The store is intentionally
  fail-soft (matches knowledge/store.py + audit.py): _resolve_db_path
  already falls back to ~/.protoagent/scheduler/ when /sandbox is
  unwritable, and re-raising would crash boot in exactly the scenario
  the fallback is designed to handle. The graceful degradation
  contract is "scheduler tools return errors when storage is broken,
  agent keeps serving everything else".

Tests:
- tests/test_scheduler_local.py: new test_fire_failure_leaves_job_in_place
  regression guard + test_fire_returns_bool contract test.
- tests/test_config_io.py: list_available_tools assertions now check
  for memory + scheduler tools and no duplicates.

86 scheduler-scope tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(scheduler): YAML opt-out via middleware.scheduler (symmetric with knowledge/memory)

Scheduler had asymmetric opt-out — only env-based (SCHEDULER_DISABLED=1).
The knowledge and memory subsystems already exposed YAML toggles
(middleware.knowledge, middleware.memory) so forks could flip them
through the drawer or wizard. Scheduler now matches:

- LangGraphConfig.scheduler_enabled: bool = True (default-on)
- from_yaml() reads middleware.scheduler
- config_to_dict() emits it for the drawer round-trip
- config/langgraph-config.yaml has middleware.scheduler: true
- server.py::_build_scheduler honors the YAML toggle first, env second

Both subsystems are now genuinely opt-out:

  middleware:
    knowledge: true       # was already so
    memory: true          # was already so
    scheduler: true       # NEW — was env-only
    audit: true

Drawer/wizard can flip any of them without restart (the existing
reload path already rebuilds on config change). The env opt-out
(SCHEDULER_DISABLED=1) stays as a runtime escape hatch for fleet
operators who can't edit YAML in the moment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review-3+4): address rounds 3 + 4 PR #156 CodeRabbit feedback

Real bugs:
- scheduler/local.py: stop() now suppresses only asyncio.CancelledError
  (the expected outcome of cancelling our own task) and logs any
  other exception via log.exception. Previously every exception
  was silently swallowed, so a polling-loop crash during shutdown
  would vanish without a trace.
- server.py: reload path now honors the new middleware.scheduler
  toggle. Three states:
  - flipped OFF (was on)  → stop + drop the running scheduler;
    new graph builds with scheduler=None.
  - flipped ON (was off / first run) → construct + start.
  - unchanged              → reuse the running instance.
  Helpers _start_scheduler_async / _stop_scheduler_async fire
  start()/stop() onto the active loop without forcing the entire
  reload chain to become async.

Type / nits:
- server.py: added `-> "SchedulerBackend | None"` return type to
  _build_scheduler, with a TYPE_CHECKING import to avoid runtime
  cycles.
- tests/test_scheduler_local.py: raw-string regex for `|`
  alternation (test_malformed_raises); added match= to the two
  bare ValueError tests (test_empty_prompt_rejected,
  test_malformed_schedule_rejected) so they only pass for the
  intended error message.
- tests/test_config_io.py: assert list_schedules in names alongside
  schedule_task / cancel_schedule.
- docs/reference/configuration.md: clarified the scheduler opt-out
  description — middleware.scheduler is canonical, SCHEDULER_DISABLED
  is a runtime escape hatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review-5): address round-5 PR #156 CodeRabbit feedback

Real bugs:
- scheduler/local.py::_fire(): metadata moved from
  params.message.metadata to params.metadata. The A2A handler
  reads custom metadata from params.metadata
  (a2a_handler.py L1244 — `msg_metadata = params.get("metadata")`),
  so the previous nesting silently dropped the
  scheduler_job_id / scheduler_kind breadcrumb. Observers now get
  it as intended.
- server.py reload path: scheduler swap is now planned before the
  graph rebuild and only committed after rebuild succeeds. A
  failed graph rebuild used to leave the scheduler torn down or a
  fresh one already started, dis-aligning runtime state. The new
  ordering: build candidate, rebuild graph (rollback-safe on
  failure), commit graph + scheduler atomically.
- scheduler/local.py: _resolve_db_path now sanitizes agent_name
  via a new _safe_segment() helper. Strips path separators, ``..``,
  and absolute-path prefixes; falls back to "default" when nothing
  usable remains. Defence in depth — the value comes from operator-
  controlled env / YAML, but a typo or copy-paste shouldn't be able
  to put a sqlite file outside the configured scheduler dir.

Tests:
- tests/test_scheduler_local.py::test_cron_rescheduled_after_fire:
  pinned to a fixed fired_at timestamp so the assertion is exact
  (next_fire == "2026-04-29T09:00:00+00:00") instead of a
  "different from original" near-tautology that depends on
  datetime.now().

Docs:
- docs/reference/configuration.md: clarified that the scheduler's
  enable/disable lives in YAML (middleware.scheduler), while
  backend selection and runtime knobs are env-driven. Repositioned
  SCHEDULER_DISABLED as the runtime escape hatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync surface counts and add scheduler/memory tool coverage

After PRs #155 (default KB store + memory tools) and #156 (default
scheduler), the docs claimed nine tools, missed scheduler tools
entirely in the reference, and skipped scheduler env vars. This pass
syncs every stale claim flagged by the audit.

Updates:

- docs/reference/starter-tools.md
  - Corrected count: nine → twelve
  - New tool sections: schedule_task, list_schedules, cancel_schedule
    (signatures, output formats, multi-agent isolation notes)
  - "adding your own" snippet now threads scheduler= through
    get_all_tools alongside knowledge_store=
  - Related links include the scheduler guide

- docs/reference/environment-variables.md
  - New "Knowledge store" section: KNOWLEDGE_DB_PATH override + the
    ~/.protoagent fallback
  - New "Audit log" section: AUDIT_PATH (used by evals/verify.py)
  - New "Scheduler" section: WORKSTACEAN_API_BASE/KEY/TOPIC_PREFIX,
    SCHEDULER_DB_DIR/INVOKE_URL/DISABLED, plus the protoLabs
    operators callout pointing at the ava node + secrets manager
    for the actual key

- docs/tutorials/first-agent.md
  - Wizard description now mentions all twelve tools and the four
    middleware toggles (added Scheduler alongside Audit/Memory/
    Knowledge)

- docs/tutorials/first-tool.md
  - "Where to go next" link copy: five → twelve

- docs/guides/fork-the-template.md
  - Tool list paragraph corrected to all twelve, with the
    binding-by-backend split called out

- docs/guides/customize-and-deploy.md
  - "Add domain tools" section now mentions memory + scheduler tool
    binding and the middleware.* toggles for opt-out

- README.md
  - Starter tools row now lists all twelve, grouped 4+5+3 with
    backend bindings shown

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: conform to workspace-config standard (.beads + .automaker baseline)

Scaffolds .beads/issues.jsonl + .automaker/settings.json + standard .gitignore
(narrowing any blanket ignore) via release-tools init-workspace-config. Any
remaining runner-rule errors need per-job migration to
namespace-profile-protolabs-linux.

* chore(ci): migrate node/git workflows to org-owned runner

Migrate plain-Linux jobs to namespace-profile-protolabs-linux; annotate
jobs that genuinely require GitHub-hosted infra.

MIGRATED:
- docs.yml build (vitepress docs build + upload-pages-artifact)
- prepare-release.yml prepare (python version bump + gh PR ops)

ANNOTATED (allow-hosted-runner):
- docker-publish.yml build-and-push — docker buildx + registry push
- docs.yml deploy — GitHub Pages deploy needs hosted Pages environment
- release.yml release — docker buildx + registry push (+ OIDC attestation)

Also finishes the .beads scaffold that the conform commit referenced but
never landed: narrow the blanket *.jsonl ignore so .beads/issues.jsonl is
committed (workspace-config requires the git-friendly export tracked).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Ava <ava@protolabs.ai>
Co-authored-by: Automaker <josh@protolabs.studio>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

docs: add eval-setup guide for forks

2 participants