feat: ship default knowledge store + eval harness for forks#155
Conversation
The template now ships a working memory loop and end-to-end eval suite on day one so forks have a green baseline before they touch a single line of code. Closes #154. What lands: - knowledge/store.py — sqlite + FTS5 (LIKE fallback). One ``chunks`` table backs operator notes (memory_ingest), daily-log entries (daily_log), and conversation findings extracted by MemoryMiddleware (domain='finding'). Path resolves env > config > default with an automatic ~/.protoagent/ fallback when /sandbox isn't writable. - tools/lg_tools.py — five new memory tools (memory_ingest, memory_recall, memory_list, memory_stats, daily_log) bound to the store via a closure factory so tests get a fresh store per run. ``echo`` removed; ``get_all_tools(knowledge_store)`` actually uses its parameter now. - server.py — _build_knowledge_store() constructs the store and threads it through both initial init and the drawer reload path. Defaults flipped: knowledge_middleware + memory_middleware now ON by default (config/langgraph-config.yaml + graph/config.py). - evals/ — A2A client + runner + verify helpers + 15 starter cases (tasks.json) covering agent card discovery, bearer auth gating, abstention, every shipped tool, KB recall, a chained two-tool case, and KnowledgeMiddleware injection. Side-effect-verified: audit log + reply text + KB chunks all checked independently so hallucinated tool results get caught. - docs/guides/evals.md — full how-to. README/TEMPLATE/configuration/ starter-tools/first-agent updated to reflect the new defaults and the additional five memory tools. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces the inline Changes
Sequence Diagram(s)sequenceDiagram
participant Runner as Eval Runner
participant Client as AgentClient
participant Agent as Agent (A2A)
participant Audit as Audit Log
participant KB as Knowledge Store
rect rgba(100, 150, 200, 0.5)
Note over Runner,Client: Case setup & execution
end
Runner->>Client: ask(prompt, timeout_s=90)
Client->>Agent: JSON-RPC /message/send
Agent->>Audit: emit audit entries (tool calls)
Agent->>KB: memory_* / daily_log operations
Agent-->>Client: final TaskResult / status
rect rgba(100, 200, 150, 0.5)
Note over Runner,KB: Assertion validation
end
Runner->>Audit: audit_entries_since(ts)
Audit-->>Runner: tool firing records
Runner->>KB: search / find_chunk_containing / list_chunks
KB-->>Runner: stored facts & side effects
Runner->>Client: parse TaskResult.text
Client-->>Runner: extracted reply text
rect rgba(200, 150, 100, 0.5)
Note over Runner,KB: Teardown
end
Runner->>KB: delete_by_content / delete_by_heading
KB-->>Runner: deletion confirmations
sequenceDiagram
participant Tool as Memory Tool
participant KStore as KnowledgeStore
participant DB as SQLite + FTS5
rect rgba(150, 150, 200, 0.5)
Note over Tool,DB: Tool invocation & persistence
end
Tool->>KStore: add_chunk(content, domain, heading)
KStore->>DB: INSERT into chunks
KStore->>DB: maintain chunks_fts via triggers (if FTS5)
DB-->>KStore: row id | OperationalError -> None
KStore-->>Tool: chunk_id | None
rect rgba(150, 200, 150, 0.5)
Note over Tool,DB: Retrieval with FTS
end
Tool->>KStore: search(query, domain)
KStore->>DB: SELECT via FTS5 MATCH (or LIKE fallback)
DB-->>KStore: scored results
KStore-->>Tool: standardized result dicts
rect rgba(200, 150, 150, 0.5)
Note over Tool,DB: Cleanup
end
Tool->>KStore: delete_by_content / delete_by_heading
KStore->>DB: DELETE + cascade FTS triggers
DB-->>KStore: affected row count
KStore-->>Tool: count
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/test_config_io.py (1)
321-331: 🧹 Nitpick | 🔵 TrivialTighten starter-tool contract coverage in this test.
Line 328–331 now verifies only part of the core starter set; a regression dropping
web_searchwould no longer fail this test. Consider explicitly asserting it too.✅ Small test hardening diff
def test_list_available_tools_returns_starter_set(): @@ assert "current_time" in names assert "calculator" in names + assert "web_search" in names assert "fetch_url" in names assert all(isinstance(n, str) for n in names)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_config_io.py` around lines 321 - 331, The test test_list_available_tools_returns_starter_set currently asserts presence of "current_time", "calculator", and "fetch_url" but misses "web_search"; update the test in tests/test_config_io.py to also assert "web_search" is in the names returned by list_available_tools (function list_available_tools in graph.config_io), keeping the existing type check (all elements are str) so the starter-tool contract is fully covered.docs/reference/starter-tools.md (1)
183-188:⚠️ Potential issue | 🟡 MinorPreserve the bundled memory tools in the extension example.
Copying this snippet verbatim replaces the new conditional registry and drops all
KnowledgeStore-backed tools when a store is configured. Showmy_toolbeing appended to the existing pattern instead of returning a fresh list.Suggested doc fix
def get_all_tools(knowledge_store=None): - return [current_time, calculator, web_search, fetch_url, my_tool] + tools = [current_time, calculator, web_search, fetch_url, my_tool] + if knowledge_store is not None: + tools.extend(_build_memory_tools(knowledge_store)) + return tools🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/reference/starter-tools.md` around lines 183 - 188, The snippet replaces the existing conditional registry and drops KnowledgeStore-backed tools; update get_all_tools to obtain the existing tools list (keeping bundled tools like current_time, calculator, web_search, fetch_url and any KnowledgeStore-backed tools when knowledge_store is provided) and then append my_tool instead of returning a fresh list. Concretely, inside get_all_tools(knowledge_store=None) call or build the existing tools list used elsewhere (respecting the conditional registry when knowledge_store is set), then tools.append(my_tool) and return tools; reference get_all_tools, my_tool, and the knowledge_store parameter when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/guides/subagents.md`:
- Around line 89-99: The docs updated the tools allowlist (current_time,
calculator, web_search, fetch_url, memory_ingest, memory_recall, memory_list,
memory_stats, daily_log) but the Step 2 LangGraphConfig snippet still includes
the legacy "echo" tool, causing a mismatch; update the LangGraphConfig example
to replace or remove "echo" so it exactly matches the new tools list and
max_turns setting (e.g., ensure the LangGraphConfig.tools array contains the
same tool names and that LangGraphConfig.max_turns is set to 20) so both
snippets are consistent.
In `@docs/tutorials/first-tool.md`:
- Around line 53-56: The example redefines WORKER_CONFIG.tools and accidentally
omits the default knowledge-backed tools; instead of replacing the whole
allowlist in the SubagentConfig example, update the snippet to append "git_sha"
to the existing default tools (or explicitly include the default entries:
"memory_ingest", "memory_recall", "memory_list", "memory_stats", "daily_log"
alongside "current_time", "calculator", "web_search", "fetch_url", "git_sha") so
the example matches the template behavior; locate WORKER_CONFIG and
SubagentConfig in the doc and change only the tools list accordingly.
In `@evals/runner.py`:
- Around line 205-209: The _RUNNERS dispatch dict only maps "agent_card",
"auth_check", and "ask" (functions _run_agent_card, _run_auth_check, _run_ask)
so evals/client.py::AgentClient.stream() is never reachable from tasks.json; add
a runner entry that invokes the streaming path (e.g., "agent_stream":
_run_agent_stream) or modify an existing runner (e.g., enhance _run_agent_card)
to detect streaming tasks and call AgentClient.stream(), then register that new
function in _RUNNERS and ensure tasks.json uses the new key so A2A
streaming/push-notification cases are testable.
- Around line 133-149: The teardown must run in a finally block so it always
executes even if client.ask or the verifier raises: wrap the call sequence in
run_one so that after verify.apply_setup(case["setup"]) you set a flag (e.g.,
setup_applied = True) and perform client.ask / any verifier calls inside try,
and then in finally check if setup_applied and "teardown" in case and call
verify.apply_teardown(case["teardown"]); apply the same pattern for the later
teardown usage around the code near the 185-186 region to ensure seeded KB rows
are always cleaned up.
- Around line 158-168: The code currently uses expected_tools =
case.get("expected_tools") or [] which conflates "no key" and "empty list";
change to expected_tools = case.get("expected_tools", None) and then run the
audit check only when expected_tools is not None so that an explicit empty list
asserts that no tool entries were emitted. Keep the rest of the logic (await
asyncio.sleep, entries = verify.audit_entries_since(since), require_success
calculation, and calling verify.assert_tools_fired) but wrap it in if
expected_tools is not None: ... so assert_tools_fired is invoked for [] as well
and add any failure detail to problems as before.
In `@knowledge/store.py`:
- Around line 394-423: The helpers must reject empty or whitespace-only match
strings before constructing LIKE queries: in find_chunk_containing (parameter
text) and in delete_by_content / any verifier helper using contains, call
strip() on the incoming string and if the result is empty immediately return
None (or no-op for delete) instead of building a "%...%" pattern; do this check
at the start of the functions (e.g., in find_chunk_containing, check if not text
or not text.strip() then return None) to prevent LIKE '%%' queries that match
everything.
- Around line 171-174: After installing the FTS schema, ensure existing rows in
the external-content table get indexed: when self._fts_available is true and
after db.executescript(_FTS_SCHEMA) completes, run an FTS5 rebuild for the
chunks_fts virtual table (i.e. perform the FTS5 "rebuild" insert into
chunks_fts) so preexisting rows in chunks are indexed; add this rebuild call
alongside the existing db.executescript usage and keep the guard using
_has_fts5(db)/self._fts_available.
- Around line 162-166: The PRAGMA journal_mode=WAL call in _connect currently
makes WAL mandatory and can raise sqlite3.OperationalError for read-only DBs;
modify _connect (or move the PRAGMA to the writable init path) so WAL is
best-effort: open the sqlite3.Connection and set row_factory as now, then
attempt db.execute("PRAGMA journal_mode=WAL") inside a try/except that catches
sqlite3.OperationalError (and logs/debugs if desired) but does not close or
return None on failure, so read-only connections remain usable; alternatively
ensure only the writable initialization path (e.g., the method that prepares a
writable DB) performs the WAL PRAGMA while _connect always returns a usable
connection for reads.
In `@tools/lg_tools.py`:
- Around line 312-329: Clamp tool inputs before calling the store: cap the
incoming k (and any limit arg in the other memory tools on lines ~332-348) to a
small positive max and enforce a minimum of 1, then pass the clamped value to
knowledge_store.search instead of the raw argument. Concretely, add a MAX_K
constant (e.g. 20) and replace uses of k with something like clamped_k = max(1,
min(k, MAX_K)) (and do the same for any limit parameter in the other memory_*
function(s)), then call knowledge_store.search(query, k=clamped_k) so the tool
cannot request arbitrarily large slices of the KB; apply the same clamping logic
to all memory_* entry points in this file.
---
Outside diff comments:
In `@docs/reference/starter-tools.md`:
- Around line 183-188: The snippet replaces the existing conditional registry
and drops KnowledgeStore-backed tools; update get_all_tools to obtain the
existing tools list (keeping bundled tools like current_time, calculator,
web_search, fetch_url and any KnowledgeStore-backed tools when knowledge_store
is provided) and then append my_tool instead of returning a fresh list.
Concretely, inside get_all_tools(knowledge_store=None) call or build the
existing tools list used elsewhere (respecting the conditional registry when
knowledge_store is set), then tools.append(my_tool) and return tools; reference
get_all_tools, my_tool, and the knowledge_store parameter when making the
change.
In `@tests/test_config_io.py`:
- Around line 321-331: The test test_list_available_tools_returns_starter_set
currently asserts presence of "current_time", "calculator", and "fetch_url" but
misses "web_search"; update the test in tests/test_config_io.py to also assert
"web_search" is in the names returned by list_available_tools (function
list_available_tools in graph.config_io), keeping the existing type check (all
elements are str) so the starter-tool contract is fully covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e658a3ad-718e-464a-a26c-f5dc1e6ba884
⛔ Files ignored due to path filters (2)
README.mdis excluded by!*.mdTEMPLATE.mdis excluded by!*.md
📒 Files selected for processing (27)
config/langgraph-config.yamldocs/guides/customize-and-deploy.mddocs/guides/evals.mddocs/guides/fork-the-template.mddocs/guides/index.mddocs/guides/subagents.mddocs/reference/configuration.mddocs/reference/starter-tools.mddocs/tutorials/first-agent.mddocs/tutorials/first-tool.mdevals/README.mdevals/__init__.pyevals/client.pyevals/results/.gitkeepevals/runner.pyevals/tasks.jsonevals/verify.pygraph/config.pygraph/subagents/config.pyknowledge/__init__.pyknowledge/store.pyserver.pytests/test_config_io.pytests/test_skill_curator.pytests/test_skill_emission.pytests/test_starter_tools.pytools/lg_tools.py
💤 Files with no reviewable changes (1)
- tests/test_starter_tools.py
Real bugs: - evals/runner.py: teardown now runs in a finally block so seeded KB rows get cleaned up even when the verifier or client.ask() raises. expected_tools=[] now means "assert no tools fired" (was conflated with "no key" via the `or []` short-circuit, making the abstention case a no-op). - evals/runner.py + tasks.json: added a `stream` runner kind so AgentClient.stream() is reachable from tasks.json — new streaming_status_updates case asserts the SSE event sequence. - knowledge/store.py: PRAGMA journal_mode=WAL is now best-effort (read-only DBs no longer break _connect). FTS5 rebuild after schema install so an existing chunks table populated before FTS was added gets indexed. find_chunk_containing/delete_by_content reject empty/whitespace-only inputs to prevent LIKE '%%' wildcards from matching every row. Hardening: - tools/lg_tools.py: clamp memory_recall(k) to [1, 20] and memory_list(limit) to [1, 200] so the agent can't request arbitrarily large slices of the KB. Doc cleanup: - docs/guides/subagents.md: LangGraphConfig snippet had a stale "echo" reference; replaced with the new memory-tool list. - docs/tutorials/first-tool.md: WORKER_CONFIG example now appends git_sha alongside the bundled defaults instead of replacing them and dropping the memory tools. - docs/reference/starter-tools.md: "adding your own" snippet now preserves the conditional _build_memory_tools(knowledge_store) extension. - tests/test_config_io.py: starter-tool contract assertion now also covers web_search. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Round 1 of CodeRabbit feedback addressed in 4d0d428. Real bugs first, hardening after, docs last. Real bugs
Hardening
Docs
Tests in scope (203 across the affected files): same 1 pre-existing baseline failure ( 🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/runner.py`:
- Around line 243-244: The nested condition checking setup_applied before
testing for a teardown key is redundant; update the block that currently uses
`if setup_applied or "teardown" in case:` followed by `if "teardown" in case:`
to a single guard that directly checks `if "teardown" in case:` (remove the
`setup_applied` portion and the inner duplicate check) so the teardown branch
runs only when the `"teardown"` key is present; adjust any dependent logic that
relied on `setup_applied` being true inside that branch if necessary (look for
references to `setup_applied` and `case` in the surrounding code).
In `@knowledge/store.py`:
- Around line 91-92: Update the _now_iso function to use the Python 3.11+
datetime.UTC alias instead of timezone.utc; specifically modify the _now_iso()
implementation so it calls datetime.now(datetime.UTC).isoformat(); keep the
function name _now_iso and its return type, and only apply this change if the
project targets Python 3.11+ (otherwise leave as timezone.utc).
In `@tools/lg_tools.py`:
- Around line 331-334: Replace the explicit loop that builds lines with a list
comprehension for clarity and conciseness: instead of initializing lines and
looping over results, create lines with a comprehension like the one that
produces f"[{r.get('domain', '?')}] {r['preview']}" for each r in results, then
return "\n".join(lines); update the block containing variables results, lines,
and r accordingly to remove the loop and keep the same behavior.
- Around line 284-285: Update the _build_memory_tools function signature to
include a return type annotation (e.g. -> list[Tool] or -> list[Any]) to reflect
that it returns a list of tool objects; import or reference the appropriate type
(Tool or Any) from typing or the module that defines the tool class and adjust
the signature from def _build_memory_tools(knowledge_store): to include the
chosen return annotation so IDEs and type checkers pick up the correct return
type.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c6e241b4-d596-4c12-9a86-9665470c476a
📒 Files selected for processing (8)
docs/guides/subagents.mddocs/reference/starter-tools.mddocs/tutorials/first-tool.mdevals/runner.pyevals/tasks.jsonknowledge/store.pytests/test_config_io.pytools/lg_tools.py
- evals/runner.py: collapse redundant nested teardown guard into a single `if "teardown" in case:` (SIM102); remove now-unused `setup_applied` flag - knowledge/store.py: use `datetime.UTC` alias (Python 3.11+, UP017) - tools/lg_tools.py: add `-> list` return annotation to `_build_memory_tools` (ANN202); replace explicit loop with list comprehension in `memory_recall` (PERF401) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> https://claude.ai/code/session_01148o8ppbuQwuZBsVGTQWwQ
Round 2 review — addressed in commit cab3bd8CodeRabbit's second review (submitted 2026-04-27T21:51Z, after the round-1 push at 21:47Z) posted 4 new inline comments, all labelled Trivial / Nitpick. All four have been applied:
All three touched files were spot-checked against the pre-existing test run: same 11 failures (asyncio/import issues) with and without these changes — confirmed pre-existing baseline. Generated by Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (1)
evals/runner.py (1)
150-157:⚠️ Potential issue | 🟠 MajorSetup failures can still leak seeded KB rows.
evals/verify.py::apply_setup()applies steps incrementally and returns on the first failure. Because this return happens before Line 162 enters thetry, a partial setup failure skips thefinallyon Lines 237-242 and any configured teardown never runs, so later cases can inherit leftover fixture data. Move setup inside thetryso cleanup also happens on setup errors.🧹 Suggested fix
- # Pre-seed state via direct DB writes (model never sees this). - if "setup" in case: - err = verify.apply_setup(case["setup"]) - if err: - return CaseResult( - case["id"], case["category"], case["name"], False, - f"setup failed: {err}", - ) - events: list[dict] = [] result: TaskResult | None = None try: + # Pre-seed state via direct DB writes (model never sees this). + if "setup" in case: + err = verify.apply_setup(case["setup"]) + if err: + return CaseResult( + case["id"], case["category"], case["name"], False, + f"setup failed: {err}", + ) since = verify.audit_now()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/runner.py` around lines 150 - 157, The setup path currently calls verify.apply_setup(...) before entering the surrounding try/finally, so a partial failure can return early and skip the cleanup; move the call to verify.apply_setup(case["setup"]) inside the same try block that contains the finally cleanup, and on error still produce the same CaseResult (using CaseResult(case["id"], case["category"], case["name"], False, f"setup failed: {err}")) but only after the finally has run; reference verify.apply_setup and the existing try/finally/CaseResult flow to locate where to relocate the setup call and ensure teardown always executes even when setup fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/runner.py`:
- Around line 97-103: The current code builds a bearer-only headers dict
(variables expected_status, bad, headers), which prevents tests from exercising
X-API-Key scenarios; modify the request header construction to accept headers
from the test case (e.g., case.get("headers")) and merge them with sensible
defaults like "Content-Type": "application/json" and a default Authorization
using case.get("bad_token",""), so tasks can specify missing/invalid X-API-Key
or other auth combos; ensure case-supplied headers override defaults when
present so auth scenarios in tasks.json are honored.
- Around line 192-199: The fixed 0.3s sleep is brittle; replace it with a short
polling loop that repeatedly calls verify.audit_entries_since(since) and
verify.assert_tools_fired(entries, expected_tools, require_success=...) until
either the assertion passes or a deadline (e.g. 1–3 seconds) is reached; on
timeout fail the test with the last detail. Locate the block that uses
expected_tools, since, verify.audit_entries_since and verify.assert_tools_fired
in runner.py and swap the single asyncio.sleep for an async loop that awaits
small intervals (e.g. 50–200ms) between polls, updates entries each iteration,
and breaks early when passed is True. Ensure the final behavior preserves
require_success computed from case.get("tool_outcome", "success") and returns or
raises the same failure path when timed out.
In `@knowledge/store.py`:
- Around line 311-316: The FTS5 query builds MATCH from raw tokens in
_search_fts (tokens = [...] and match = " OR ".join(tokens)), which lets
apostrophes and reserved words break the SQL; instead quote/escape each token
with SQLite's FTS5 quoting helper (use fts5_quote() or an equivalent escape
function) before joining so the MATCH expression is safe (e.g., mapped_token =
fts5_quote(t) for t in tokens), then join mapped tokens with " OR " and use that
for the MATCH; keep the early empty-token check and existing fallback-to-LIKE
behavior intact.
- Around line 176-199: The exception handlers in the store code only catch
sqlite3.OperationalError but must catch sqlite3.DatabaseError to honor the
corrupt-database/read-only contract; update every except block that currently
uses "except sqlite3.OperationalError" to "except sqlite3.DatabaseError" (or the
broader sqlite3.Error/DatabaseError as appropriate) so operations like the
schema init (the try around self._connect(), db.executescript(_SCHEMA), the FTS
rebuild block that calls db.execute(...), and the outer commit/close) and any
other error-handling paths that reference sqlite3.OperationalError will
correctly handle corrupt or read-only DBs (search for the other handlers in this
module and replace them similarly).
- Around line 429-458: The LIKE patterns in find_chunk_containing (the SQL using
"WHERE (content LIKE ? OR heading LIKE ?)") and delete_by_content (the "DELETE
FROM chunks WHERE content LIKE ?") must treat input literally: escape any '%'
and '_' in the input by replacing '%' -> '\%' and '_' -> '\_' (taking care to
escape the backslash in Python strings), build the parameter as f"%{escaped}%",
and add " ESCAPE '\'" to the SQL statements so SQLite treats '\' as the escape
character; update the params passed to db.execute accordingly and keep the
domain handling as-is.
---
Duplicate comments:
In `@evals/runner.py`:
- Around line 150-157: The setup path currently calls verify.apply_setup(...)
before entering the surrounding try/finally, so a partial failure can return
early and skip the cleanup; move the call to verify.apply_setup(case["setup"])
inside the same try block that contains the finally cleanup, and on error still
produce the same CaseResult (using CaseResult(case["id"], case["category"],
case["name"], False, f"setup failed: {err}")) but only after the finally has
run; reference verify.apply_setup and the existing try/finally/CaseResult flow
to locate where to relocate the setup call and ensure teardown always executes
even when setup fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: de696887-9338-4c91-8268-f0dc82e83016
📒 Files selected for processing (3)
evals/runner.pyknowledge/store.pytools/lg_tools.py
Real bugs: - evals/runner.py: setup is now inside the try block so a partial setup failure (e.g. step 2 of 3 errors) still triggers the finally teardown — rows from steps that did succeed no longer leak into the next case. Was flagged as a duplicate from round 2. - knowledge/store.py: LIKE patterns now escape % and _ via ESCAPE '\' on every clause that takes user input (find_chunk_containing, delete_by_content, _search_like). A query for "100%" or "hello_world" no longer silently matches every row containing "100" or any single character between "hello" and "world". - knowledge/store.py: FTS5 MATCH tokens are now double-quoted via _fts_quote() so user-supplied query terms can't smuggle FTS5 operators (column filters, prefix wildcards, NEAR, AND/OR/NOT). Defence in depth — the [\w']+ tokenizer already filters most special chars. Hardening: - evals/runner.py: the fixed 0.3s asyncio.sleep waiting for the audit log to flush is gone. _await_audit_assertion now polls every 50ms up to a 2s deadline and returns as soon as the assertion passes — exits early on success, only burns the full deadline when the tool genuinely never fired. - evals/runner.py: _run_auth_check accepts case["headers"] so cases can override the default bearer-only header set and exercise X-API-Key auth scenarios (or both auths together). - knowledge/store.py: per-method exception handlers broadened from sqlite3.OperationalError to sqlite3.DatabaseError. Catches IntegrityError, ProgrammingError, and corruption variants too without crashing the agent loop. _has_fts5 (probe) and _connect (connection-time errors only) keep the narrower OperationalError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Round 3 of CodeRabbit feedback addressed in b3a9f1d. All 5 actionable comments + the duplicate Major from round 2 are applied. Real bugs
Hardening
Smoke tests pass:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
knowledge/store.py (1)
234-239:⚠️ Potential issue | 🟠 MajorWiden exception scope in
_get_dbto honor the no-crash contract.The module docstring (lines 20–22) states that the store "swallows
sqlite3.DatabaseError" so it "never crashes the agent loop on a corrupt or read-only DB." However,_get_dbonly catchesOperationalError. Corrupt databases raisesqlite3.DatabaseError(the parent class), which would propagate and crash the caller.🔧 Suggested fix
def _get_db(self) -> sqlite3.Connection | None: try: return self._connect() - except sqlite3.OperationalError as exc: + except sqlite3.DatabaseError as exc: log.error("[knowledge] connect failed: %s", exc) return None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@knowledge/store.py` around lines 234 - 239, In _get_db, broaden the exception catch to sqlite3.DatabaseError (the parent of OperationalError) so corrupt or read-only DB errors are swallowed as the module docstring requires; change the except clause in the _get_db method (currently "except sqlite3.OperationalError as exc") to catch sqlite3.DatabaseError instead and keep the existing log.error call and return None behavior.evals/runner.py (1)
237-239: 🧹 Nitpick | 🔵 TrivialPrefix unused
entriesvariable with underscore.The
entriesvalue from_await_audit_assertionis unpacked but never used within_run_prompt_case. Prefix it with_to indicate it's intentionally ignored and silence the static analysis warning.♻️ Suggested fix
- entries, passed, detail = await _await_audit_assertion( + _entries, passed, detail = await _await_audit_assertion( since, expected_tools, require_success=require_success, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/runner.py` around lines 237 - 239, In _run_prompt_case, the first value returned by _await_audit_assertion is unpacked to entries but never used; change the unpacking to prefix that variable with an underscore (e.g., _entries) so it becomes _entries, passed, detail = await _await_audit_assertion(...) to signal intentional unused value and silence static-analysis warnings referencing _await_audit_assertion and _run_prompt_case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/runner.py`:
- Around line 170-179: Replace uses of asyncio.get_event_loop().time() with
asyncio.get_running_loop().time() when computing deadline and checking time in
the loop: update the two call sites around the deadline assignment and the loop
condition where deadline is compared, keeping the same logic with variables
'deadline', 'since', and the verify.audit_entries_since /
verify.assert_tools_fired calls unchanged so the behavior and comparisons remain
identical.
- Around line 339-340: The timestamp in the payload uses
datetime.now(timezone.utc).isoformat() which is inconsistent with the codebase
alias; update the payload construction to use the datetime.UTC alias (i.e.,
datetime.now(datetime.UTC).isoformat()) so it matches other files like
knowledge/store.py and keeps style consistent; locate the payload assignment in
evals/runner.py (the payload variable and the datetime.now(...) call) and
replace timezone.utc with datetime.UTC.
In `@knowledge/store.py`:
- Around line 229-230: Replace the log.error(...) calls inside the exception
handlers that catch sqlite3.DatabaseError (the blocks that currently read
"except sqlite3.DatabaseError as exc: log.error(..., exc)") with
log.exception(...) so the error message for schema/init/database failures
includes the full traceback; keep the same descriptive message (including
self.path or other context) when calling log.exception so callers see both the
message and the exception stack. Ensure you update every occurrence of that
pattern (the sqlite3.DatabaseError except blocks where log.error is used) rather
than adding new try/except logic.
---
Duplicate comments:
In `@evals/runner.py`:
- Around line 237-239: In _run_prompt_case, the first value returned by
_await_audit_assertion is unpacked to entries but never used; change the
unpacking to prefix that variable with an underscore (e.g., _entries) so it
becomes _entries, passed, detail = await _await_audit_assertion(...) to signal
intentional unused value and silence static-analysis warnings referencing
_await_audit_assertion and _run_prompt_case.
In `@knowledge/store.py`:
- Around line 234-239: In _get_db, broaden the exception catch to
sqlite3.DatabaseError (the parent of OperationalError) so corrupt or read-only
DB errors are swallowed as the module docstring requires; change the except
clause in the _get_db method (currently "except sqlite3.OperationalError as
exc") to catch sqlite3.DatabaseError instead and keep the existing log.error
call and return None behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 547c1280-742e-4685-bc51-8c02425b92dd
📒 Files selected for processing (2)
evals/runner.pyknowledge/store.py
- evals/runner.py: use asyncio.get_running_loop() instead of the deprecated get_event_loop() inside the _await_audit_assertion coroutine - evals/runner.py: prefix unused _entries return value with underscore - evals/runner.py: use datetime.UTC alias (consistent with store.py), drop now-unused timezone import - knowledge/store.py: broaden _get_db exception catch from OperationalError to DatabaseError so corrupt-DB errors are swallowed per the module's no-crash contract - knowledge/store.py: replace log.error with log.exception in all three DatabaseError handlers (schema init, _get_db, add_chunk) so tracebacks appear in error logs Co-Authored-By: Claude <claude@anthropic.com> https://claude.ai/code/session_01YW5U6mtpLy4rzKmqd4trkH
Round 4 —
|
| # | File | Finding | Action |
|---|---|---|---|
| 1 | evals/runner.py L170, L179 |
asyncio.get_event_loop().time() used inside a coroutine — deprecated since Python 3.10, raises DeprecationWarning |
Changed to asyncio.get_running_loop().time() in both call sites inside _await_audit_assertion |
| 2 | knowledge/store.py _get_db |
Catches sqlite3.OperationalError instead of the broader sqlite3.DatabaseError, violating the module's stated no-crash contract (corrupt-DB errors would propagate) |
Changed except sqlite3.OperationalError → except sqlite3.DatabaseError |
| 3 | knowledge/store.py L230, L238, L270 |
log.error(...) in three DatabaseError handlers omits tracebacks |
Replaced with log.exception(...) (no exc arg needed — captured automatically) across schema-init, _get_db, and add_chunk |
| 4 | evals/runner.py L237 |
entries unpacked from _await_audit_assertion but never used (duplicate carryover) |
Renamed to _entries |
| 5 | evals/runner.py L340 |
datetime.now(timezone.utc) inconsistent with knowledge/store.py style (duplicate carryover) |
Changed to datetime.now(UTC); removed now-unused timezone import |
ℹ️ Not addressed (out of scope / already correct)
knowledge/store.pyL133 (_has_fts5) and L202 (WAL PRAGMA) — both still catchsqlite3.OperationalError. These are intentional:_has_fts5is probing for FTS5 support, and the WAL PRAGMA is explicitly best-effort for read-only mounts. Neither is a contract-violation.knowledge/store.pyL319, L422, L439, L479, L504, L522 — theseDatabaseErrorhandlers uselog.warning, notlog.error. CodeRabbit's finding targeted thelog.errorpattern specifically; warning-level handlers are deliberately lower-severity for expected operational failures (search miss, stats failure, etc.).
Round 4 has no new blockers. Ready to merge when the team is.
Generated by Claude Code
There was a problem hiding this comment.
♻️ Duplicate comments (1)
evals/runner.py (1)
231-241:⚠️ Potential issue | 🟠 MajorRegression:
expected_tools: []no longer asserts abstention.The logic at Line 235 still treats empty lists like normal expected-tool checks, and
verify.assert_tools_fired(..., expected=[])passes even when tools fired. This contradicts the Line 231-233 contract.💡 Suggested fix
expected_tools = case.get("expected_tools") if expected_tools is not None: require_success = case.get("tool_outcome", "success") == "success" - _entries, passed, detail = await _await_audit_assertion( - since, expected_tools, require_success=require_success, - ) - if not passed: - problems.append(detail) + if not expected_tools: + loop = asyncio.get_running_loop() + deadline = loop.time() + _AUDIT_POLL_DEADLINE_S + while True: + entries = verify.audit_entries_since(since) + fired = [e.get("tool", "?") for e in entries if e.get("tool")] + if fired: + problems.append(f"unexpected tools fired: {fired}") + break + if loop.time() >= deadline: + break + await asyncio.sleep(_AUDIT_POLL_INTERVAL_S) + else: + _entries, passed, detail = await _await_audit_assertion( + since, expected_tools, require_success=require_success, + ) + if not passed: + problems.append(detail)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evals/runner.py` around lines 231 - 241, The test treats an explicit empty list for expected_tools as an abstention assertion but the current branch treats [] like a normal expectation; add an explicit branch for expected_tools == [] that enforces “no tools fired”: detect if expected_tools is a list and len(expected_tools) == 0, then call the audit check that verifies zero tool entries (either by adding an assert_none/expect_empty flag to _await_audit_assertion or by calling a new helper like _await_no_tools_assertion(since)) and append detail to problems if that check fails; keep the existing path for non-empty expected_tools and continue to use require_success derived from case.get("tool_outcome").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@evals/runner.py`:
- Around line 231-241: The test treats an explicit empty list for expected_tools
as an abstention assertion but the current branch treats [] like a normal
expectation; add an explicit branch for expected_tools == [] that enforces “no
tools fired”: detect if expected_tools is a list and len(expected_tools) == 0,
then call the audit check that verifies zero tool entries (either by adding an
assert_none/expect_empty flag to _await_audit_assertion or by calling a new
helper like _await_no_tools_assertion(since)) and append detail to problems if
that check fails; keep the existing path for non-empty expected_tools and
continue to use require_success derived from case.get("tool_outcome").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1d06a7a4-c90c-4424-9401-2d0aeb53f4d3
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (2)
evals/runner.pyknowledge/store.py
* chore: release v0.2.0 First tagged release. Contents of community-improvements project: M1 — Security Hardening (A2A bearer auth, audit redaction, origin verification) M2 — Memory On By Default (session persistence + load-on-start) M3 — Skill Loop (skill-v1 emission + SQLite FTS5 index + curator) Plus: .gitignore cleanup for .automaker-lock + .worktrees, docs coverage of security layer, skill-loop architecture, and new env vars. Manual bump because prepare-release.yml requires GH_PAT secret (not configured). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: release v0.2.1 Bug fixes from v0.2.0 smoke testing: - Agent card now advertises bearer scheme when A2A_AUTH_TOKEN is set - Session memory persistence actually fires (moved from unreachable on_session_end to after_agent) - Test suite collects cleanly in fresh Docker env - MemoryMiddleware activates standalone (without knowledge_store) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(ci): update repo homepage after docs deploy (#149) Writes the deployed GitHub Pages URL back to the repo's `homepage` field so it renders in the About sidebar on the repo page. Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ui): live-edit config drawer with model discovery + SOUL.md editor Elevates langgraph-config.yaml + SOUL.md into a typed form inside the Gradio sidebar so forks can iterate on model / temperature / tools / middleware / persona without a code edit + restart. Save rebuilds the compiled graph in place; in-flight turns finish on the prior graph. The model dropdown is populated from the connected gateway's `/v1/models` endpoint so forks always see what's actually available through the configured api_base + api_key, no hardcoded list to drift. graph/config_io.py is the new I/O layer: YAML round-trip preserves comments and unknown top-level sections (the shipped YAML's memory/skills blocks that the dataclass doesn't model), dual-location SOUL.md handling writes to both /sandbox/SOUL.md (runtime) and config/SOUL.md (source), and gateway model discovery returns a readable error string instead of raising when the endpoint is down. Also exposes GET/POST /api/config + GET /api/config/models for external control, and falls SOUL back to config/SOUL.md in graph/prompts.py so local dev without a /sandbox mount still picks up drawer edits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: first-run setup wizard + autostart + SOUL presets Turns the "fork a template and edit code" onboarding into a download-and-run flow. A fresh clone boots without any env vars, lands in a 4-step wizard (Connect / Identity / Tools / Profile), and writes out config + SOUL.md + a .setup-complete marker on Launch — the chat UI then appears on the same page, drawer pre-populated with the wizard's values. Key pieces: - Wizard UI in chat_ui.py: visibility-toggled wizard pane vs chat pane, populated from the live config so re-runs pre-fill. 4 ship-with presets in config/soul-presets/ (generic-assistant, research, coding, blank) power the persona dropdown. - Lazy graph init in server.py: no model required at boot. The chat endpoints return a friendly "setup required" message until the wizard completes. After wizard save, the marker is flipped BEFORE the graph reload so the rebuild actually compiles (this order matters — earlier iteration reloaded before marking complete and left _graph=None). - Identity/auth/runtime sections added to LangGraphConfig so the wizard-captured name, operator, A2A token, and autostart flag round-trip through the existing YAML infrastructure. agent_name() resolver prefers YAML identity.name → env → "protoagent" so the agent card + OpenAI-compat model id reflect the wizard value without a process restart. - autostart.py: macOS LaunchAgent install/uninstall with Linux/Windows stubs. Captures sys.executable at install time so venv-based runs survive a reboot. Opt-in via wizard checkbox; toggle from drawer anytime. - Dockerfile: config volume declared so setup persistence survives docker run without a -v flag. - Docs: first-agent.md rewritten for clone→pip→run→wizard flow; old fork/sed/docker content moved to new customize-and-deploy.md guide. Tests: 29 passing (7 new — setup marker lifecycle, preset discovery, preset content shape, shipped starter presence). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ui): sync wizard/chat visibility on every page load Without this, a browser refresh after the marker was written externally (POST /api/config/setup, or /api/config/reset-setup from another tab) kept Gradio serving its initial visibility snapshot — wizard visible even though setup is done, or vice versa. app.load runs per page visit so visibility tracks is_setup_complete() live. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(llm): override OpenAI SDK User-Agent to bypass gateway WAF Cloudflare's managed WAF in front of api.proto-labs.ai (and likely other gateways behind a default WAF config) blocks the OpenAI Python SDK's `OpenAI/Python <ver>` User-Agent with a 403 "Your request was blocked". /v1/models went through fine because the gateway's model-list handler doesn't gate on UA the same way — only /v1/chat/completions 403'd, which made this look like a key or model-alias problem rather than what it actually was. tools/lg_tools.py already sets a custom UA on its outbound httpx fetches for exactly this reason; graph/llm.py had no equivalent, so ChatOpenAI fell back to the SDK default. Threading the same identifier through default_headers makes every protoAgent egress present a consistent allowlisted UA. Verified: product-director wizard → chat turn → 200 OK from api.proto-labs.ai with the groq-llama-70b alias. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review): address PR #150 review feedback Critical — path traversal in preset loader (graph/config_io.py: read_soul_preset): Inputs like "../secret" escaped config/soul-presets/ and read arbitrary .md files anywhere on disk. Resolve both the preset root and the candidate path and require the latter live inside the former before reading. 7 parametrized tests cover the malicious inputs I could think of (../, ../../, absolute paths, bare "..", mid-path ../../). Major — YAML auth.token was non-functional for A2A bearer: register_a2a_routes captured _a2a_token at register time, so wizard-set tokens were ignored until process restart. Promoted _a2a_token to a module-level mutable holder (_A2A_TOKEN: list) that the closure reads on every request, added set_a2a_token() as the public mutator, and a new auth_token= arg to register_a2a_routes as the seed source (env still the fallback). server.py's reload path now calls set_a2a_token on every YAML change so the wizard → live bearer enforcement flow works with no restart — verified: fresh boot open → wizard token set → 401 on wrong token / 200 on right → drawer clears token → open again. Major — plist XML injection in autostart.py: Agent names containing <, >, & produced malformed plists (and could theoretically inject nodes). xml.sax.saxutils.escape() every interpolated string field before embedding. Major — install_autostart defaulted to port 7870 regardless of --port flag (autostart.py / server.py): Captured the active port in a module-level _active_port at _main() time and threaded it through both finish_setup's autostart sync and the drawer's toggle_autostart callback. The generated LaunchAgent now reboots on whatever port the operator launched with. Minor — chat_ui polish: * Numeric fields (max_tokens, max_iterations, worker_max_turns) fall back to sensible defaults (4096/50/20) instead of 0 when cleared — validate_config_dict rejects zero, so "or 0" blocked legitimate saves with a confusing validation error. * _sync_visibility no longer aliases the sidebar output slot to wizard_pane when the sidebar is absent; split into two closures with matching output arities so Gradio doesn't receive duplicate updates to the same component. * Legacy load_provider_choices handler guards get_current_provider existence — KeyError risk when a fork provides get_provider_choices alone. Nitpicks: * Remove unused _FIELD_MAP from config_io.py. * ASCII hyphen (U+002D) instead of en-dash (U+2013) in the temperature validation error. * Pin ruamel.yaml>=0.18 in Dockerfile to match requirements.txt. * Document the VOLUME anonymous-volume lifecycle and named-volume recommendation in the Dockerfile comment. Not addressed (deliberate): * CodeRabbit flagged test_list_gateway_models_http_error as expecting httpx.ConnectError to be caught by except httpx.HTTPError — false positive, ConnectError → NetworkError → TransportError → RequestError → HTTPError, test already passes. * "Reuse config_io.read_soul() in graph/prompts.py" — kept the inline check to avoid introducing an import dependency from prompts.py (loaded early, widely used) into config_io.py. * "Use tuple form for @pytest.parametrize" — stylistic; comma- separated string works identically. Test surface: 36 passing (7 new — the path-traversal parametrize set). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review-2): address round-2 PR #150 feedback Major — atomic graph reload (server.py::_reload_langgraph_agent): Previously swapped _graph_config + set_a2a_token BEFORE calling create_agent_graph, so a failed build would leave the running _graph pinned to the OLD agent but reporting the NEW config and rotated bearer token. Now build first; commit config/auth/graph state only on success. Major — rollback marker on failed first-run reload (finish_setup): mark_setup_complete fires before the reload so the graph compiles. If the reload fails, the marker stays and the next page load drops the user into chat with _graph=None and no obvious recovery path. finish_setup now reset_setup()s on reload failure, so the wizard returns for a retry. Major — sanitize agent_name for plist path (autostart.py::_macos_label): Prior sanitization only lowercased + replaced spaces. `/` and `..` survived, so an agent name with path-traversal chars could target arbitrary paths relative to ~/Library/LaunchAgents/ on install / status / uninstall. Strip input to [a-z0-9_.-] and fall back to "protoagent" when the result is empty/dots-only. Verified resolved plist path stays inside LaunchAgents/ for every path-traversal payload I could think of. Major — gateway api_key off query string (POST /api/config/models): GET with ?api_key=... leaks credentials into browser history, reverse-proxy access logs, and uvicorn's own request log. Switched to POST taking a small ModelsProbeRequest body. Empty body still falls back to stored config so the drawer's initial render works. Major — round-trip identity/security/autostart through the drawer (chat_ui.py + server.py): Drawer previously only edited model/worker/middleware/knowledge/ SOUL, leaving the wizard's agent name, operator, bearer token, and autostart flag with no post-setup edit path. Added three new accordion sections (Identity / Security / Autostart), wired them into _config_components, _load_all, and _save. Extracted _sync_autostart_with_config so the wizard's finish_setup and the drawer's save_all both drive the LaunchAgent install/ uninstall from the same code path — flipping the drawer's Autostart checkbox + Save & Reload now installs/removes the plist the same way the wizard does. Verified end-to-end on product-director: * Fresh clone → wizard → "pd-renamed" via drawer → agent card says pd-renamed, old bearer token → 401, new bearer token → 200 * Invalid temperature (99) → rejected at validation; YAML + marker untouched * Path traversal via /api/config/presets/../secret → 404 Tests: 36 passing (no new cases — existing coverage was already sufficient for these fixes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: ship default knowledge store + side-effect-verified eval harness The template now ships a working memory loop and end-to-end eval suite on day one so forks have a green baseline before they touch a single line of code. Closes #154. What lands: - knowledge/store.py — sqlite + FTS5 (LIKE fallback). One ``chunks`` table backs operator notes (memory_ingest), daily-log entries (daily_log), and conversation findings extracted by MemoryMiddleware (domain='finding'). Path resolves env > config > default with an automatic ~/.protoagent/ fallback when /sandbox isn't writable. - tools/lg_tools.py — five new memory tools (memory_ingest, memory_recall, memory_list, memory_stats, daily_log) bound to the store via a closure factory so tests get a fresh store per run. ``echo`` removed; ``get_all_tools(knowledge_store)`` actually uses its parameter now. - server.py — _build_knowledge_store() constructs the store and threads it through both initial init and the drawer reload path. Defaults flipped: knowledge_middleware + memory_middleware now ON by default (config/langgraph-config.yaml + graph/config.py). - evals/ — A2A client + runner + verify helpers + 15 starter cases (tasks.json) covering agent card discovery, bearer auth gating, abstention, every shipped tool, KB recall, a chained two-tool case, and KnowledgeMiddleware injection. Side-effect-verified: audit log + reply text + KB chunks all checked independently so hallucinated tool results get caught. - docs/guides/evals.md — full how-to. README/TEMPLATE/configuration/ starter-tools/first-agent updated to reflect the new defaults and the additional five memory tools. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review): address PR #155 CodeRabbit feedback Real bugs: - evals/runner.py: teardown now runs in a finally block so seeded KB rows get cleaned up even when the verifier or client.ask() raises. expected_tools=[] now means "assert no tools fired" (was conflated with "no key" via the `or []` short-circuit, making the abstention case a no-op). - evals/runner.py + tasks.json: added a `stream` runner kind so AgentClient.stream() is reachable from tasks.json — new streaming_status_updates case asserts the SSE event sequence. - knowledge/store.py: PRAGMA journal_mode=WAL is now best-effort (read-only DBs no longer break _connect). FTS5 rebuild after schema install so an existing chunks table populated before FTS was added gets indexed. find_chunk_containing/delete_by_content reject empty/whitespace-only inputs to prevent LIKE '%%' wildcards from matching every row. Hardening: - tools/lg_tools.py: clamp memory_recall(k) to [1, 20] and memory_list(limit) to [1, 200] so the agent can't request arbitrarily large slices of the KB. Doc cleanup: - docs/guides/subagents.md: LangGraphConfig snippet had a stale "echo" reference; replaced with the new memory-tool list. - docs/tutorials/first-tool.md: WORKER_CONFIG example now appends git_sha alongside the bundled defaults instead of replacing them and dropping the memory tools. - docs/reference/starter-tools.md: "adding your own" snippet now preserves the conditional _build_memory_tools(knowledge_store) extension. - tests/test_config_io.py: starter-tool contract assertion now also covers web_search. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review-2): address round-2 PR #155 CodeRabbit feedback - evals/runner.py: collapse redundant nested teardown guard into a single `if "teardown" in case:` (SIM102); remove now-unused `setup_applied` flag - knowledge/store.py: use `datetime.UTC` alias (Python 3.11+, UP017) - tools/lg_tools.py: add `-> list` return annotation to `_build_memory_tools` (ANN202); replace explicit loop with list comprehension in `memory_recall` (PERF401) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> https://claude.ai/code/session_01148o8ppbuQwuZBsVGTQWwQ * fix(review-3): address round-3 PR #155 CodeRabbit feedback Real bugs: - evals/runner.py: setup is now inside the try block so a partial setup failure (e.g. step 2 of 3 errors) still triggers the finally teardown — rows from steps that did succeed no longer leak into the next case. Was flagged as a duplicate from round 2. - knowledge/store.py: LIKE patterns now escape % and _ via ESCAPE '\' on every clause that takes user input (find_chunk_containing, delete_by_content, _search_like). A query for "100%" or "hello_world" no longer silently matches every row containing "100" or any single character between "hello" and "world". - knowledge/store.py: FTS5 MATCH tokens are now double-quoted via _fts_quote() so user-supplied query terms can't smuggle FTS5 operators (column filters, prefix wildcards, NEAR, AND/OR/NOT). Defence in depth — the [\w']+ tokenizer already filters most special chars. Hardening: - evals/runner.py: the fixed 0.3s asyncio.sleep waiting for the audit log to flush is gone. _await_audit_assertion now polls every 50ms up to a 2s deadline and returns as soon as the assertion passes — exits early on success, only burns the full deadline when the tool genuinely never fired. - evals/runner.py: _run_auth_check accepts case["headers"] so cases can override the default bearer-only header set and exercise X-API-Key auth scenarios (or both auths together). - knowledge/store.py: per-method exception handlers broadened from sqlite3.OperationalError to sqlite3.DatabaseError. Catches IntegrityError, ProgrammingError, and corruption variants too without crashing the agent loop. _has_fts5 (probe) and _connect (connection-time errors only) keep the narrower OperationalError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review-4): address round-4 PR #155 CodeRabbit feedback - evals/runner.py: use asyncio.get_running_loop() instead of the deprecated get_event_loop() inside the _await_audit_assertion coroutine - evals/runner.py: prefix unused _entries return value with underscore - evals/runner.py: use datetime.UTC alias (consistent with store.py), drop now-unused timezone import - knowledge/store.py: broaden _get_db exception catch from OperationalError to DatabaseError so corrupt-DB errors are swallowed per the module's no-crash contract - knowledge/store.py: replace log.error with log.exception in all three DatabaseError handlers (schema init, _get_db, add_chunk) so tracebacks appear in error logs Co-Authored-By: Claude <claude@anthropic.com> https://claude.ai/code/session_01YW5U6mtpLy4rzKmqd4trkH * chore: add uv.lock generated during round-4 review session https://claude.ai/code/session_01YW5U6mtpLy4rzKmqd4trkH * feat: ship pluggable scheduler (local sqlite + Workstacean adapter) Adds a default scheduler so agents can defer work to themselves — "remind me tomorrow", recurring sweeps, deadline check-ins. Three new tools land in get_all_tools() when a backend is wired up: schedule_task, list_schedules, cancel_schedule. Two backends ship behind a single SchedulerBackend protocol: - LocalScheduler (default): sqlite + asyncio polling. Per-agent jobs.db at /sandbox/scheduler/<agent_name>/ with a ~/.protoagent/scheduler/<agent_name>/ fallback. Fires by POSTing message/send to the running agent's own /a2a endpoint, going through bearer + X-API-Key auth like a real caller (audit log + cost-v1 capture work the same). Cron expressions reschedule via croniter; ISO datetimes are one-shot. Missed-fire recovery: within 24h fires immediately, older fires roll forward without firing. - WorkstaceanScheduler: HTTP adapter to a Workstacean install's POST /publish. Activated automatically when WORKSTACEAN_API_BASE and WORKSTACEAN_API_KEY env vars are set. Topic and job IDs are namespaced cron.<agent_name>.<job_id> so a single Workstacean can serve N protoAgent forks safely. Multi-agent isolation is the headline architectural property — spinning up gina-personal alongside gina-work on the same box (or sharing one Workstacean) won't cross-fire scheduled prompts. Verified with explicit tests in test_scheduler_local.py. Wiring: - scheduler/{__init__,interface,local,workstacean}.py — module - tools/lg_tools.py — _build_scheduler_tools factory; get_all_tools takes a new optional scheduler= kwarg - graph/agent.py — create_agent_graph and create_simple_agent accept scheduler= - server.py — _build_scheduler() picks backend at boot, on_event("startup"/"shutdown") drives the polling task lifecycle, reload path reuses the running scheduler instance - config/langgraph-config.yaml + graph/{config,subagents/config}.py — worker subagent gets the three new tools in its allowlist - requirements.txt — croniter>=2.0 Tests: 48 new (test_scheduler_local.py covers add/list/cancel, multi-agent isolation, reschedule-vs-delete, missed-fire recovery, and an end-to-end fire path with httpx mocked; test_scheduler_workstacean.py covers all the publish payload assertions, namespacing, custom topic prefix, and HTTP error handling). Docs: docs/guides/scheduler.md (Diataxis how-to with the firing model, multi-agent story, env reference, and notes on the Workstacean A2A-bridge gap), plus index/configuration/README/ TEMPLATE updates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review): address round-1 CodeRabbit findings on scheduler PR - docs/guides/scheduler.md: replace jargon "multiple ginas" with "multiple agents" - scheduler/__init__.py: sort __all__ lexicographically (RUF022) - scheduler/local.py: log.error → log.exception in _init_db to preserve traceback (TRY400) - scheduler/workstacean.py: correct stale module docstring that claimed list_jobs() issues a list command — it returns [] unconditionally - server.py: add -> None return annotations to _scheduler_startup/_scheduler_shutdown (ANN202) - tests: add match= to two bare pytest.raises(ValueError) calls (PT011) - tools/lg_tools.py: wrap blocking scheduler calls in asyncio.to_thread() to avoid blocking the event loop under concurrent load; fix cancel_schedule error message to not conflate transport/DB failures with "no such job" Co-Authored-By: claude-code <claude@anthropic.com> https://claude.ai/code/session_01JmFYJSYRMRndZ43g3AYW2q * fix(review-2): address round-2 PR #156 CodeRabbit feedback Real bugs: - scheduler/local.py: _fire() now returns bool (True on 2xx, False on HTTP error or network exception). _tick() only reschedules / deletes when _fire() succeeds, so a transient failure leaves the job in place for the next tick to retry. Previously a one-shot job hit by a 5xx silently vanished. - server.py: the API key env var name now uses AGENT_NAME_ENV.upper() to match the auth handler at L893. The previous code read agent_name() (which returns the wizard-set identity.name when set), so a wizard rename pointed the scheduler at <RENAMED>_API_KEY while the auth handler still expected <ENV>_API_KEY → self-invocation 401'd silently after every wizard rename. - server.py: reload path now constructs a scheduler when _scheduler is None (first-run case: server boots pre-setup, wizard finishes, drawer triggers reload — this is when we *first* construct the scheduler). Existing instances are still reused — drawer saves don't tear down the polling loop. Surface: - tools/lg_tools.py: exported SCHEDULER_TOOL_NAMES and MEMORY_TOOL_NAMES as module constants. - graph/config_io.py::list_available_tools: now exposes scheduler + memory tool names to the wizard's checkbox group even when the runtime hasn't yet constructed the underlying backends. Otherwise the wizard would hide tools that the runtime registers as soon as the user finishes setup. Declined: - scheduler/local.py L141-149: CodeRabbit asked to re-raise sqlite3.DatabaseError from _init_db. The store is intentionally fail-soft (matches knowledge/store.py + audit.py): _resolve_db_path already falls back to ~/.protoagent/scheduler/ when /sandbox is unwritable, and re-raising would crash boot in exactly the scenario the fallback is designed to handle. The graceful degradation contract is "scheduler tools return errors when storage is broken, agent keeps serving everything else". Tests: - tests/test_scheduler_local.py: new test_fire_failure_leaves_job_in_place regression guard + test_fire_returns_bool contract test. - tests/test_config_io.py: list_available_tools assertions now check for memory + scheduler tools and no duplicates. 86 scheduler-scope tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(scheduler): YAML opt-out via middleware.scheduler (symmetric with knowledge/memory) Scheduler had asymmetric opt-out — only env-based (SCHEDULER_DISABLED=1). The knowledge and memory subsystems already exposed YAML toggles (middleware.knowledge, middleware.memory) so forks could flip them through the drawer or wizard. Scheduler now matches: - LangGraphConfig.scheduler_enabled: bool = True (default-on) - from_yaml() reads middleware.scheduler - config_to_dict() emits it for the drawer round-trip - config/langgraph-config.yaml has middleware.scheduler: true - server.py::_build_scheduler honors the YAML toggle first, env second Both subsystems are now genuinely opt-out: middleware: knowledge: true # was already so memory: true # was already so scheduler: true # NEW — was env-only audit: true Drawer/wizard can flip any of them without restart (the existing reload path already rebuilds on config change). The env opt-out (SCHEDULER_DISABLED=1) stays as a runtime escape hatch for fleet operators who can't edit YAML in the moment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review-3+4): address rounds 3 + 4 PR #156 CodeRabbit feedback Real bugs: - scheduler/local.py: stop() now suppresses only asyncio.CancelledError (the expected outcome of cancelling our own task) and logs any other exception via log.exception. Previously every exception was silently swallowed, so a polling-loop crash during shutdown would vanish without a trace. - server.py: reload path now honors the new middleware.scheduler toggle. Three states: - flipped OFF (was on) → stop + drop the running scheduler; new graph builds with scheduler=None. - flipped ON (was off / first run) → construct + start. - unchanged → reuse the running instance. Helpers _start_scheduler_async / _stop_scheduler_async fire start()/stop() onto the active loop without forcing the entire reload chain to become async. Type / nits: - server.py: added `-> "SchedulerBackend | None"` return type to _build_scheduler, with a TYPE_CHECKING import to avoid runtime cycles. - tests/test_scheduler_local.py: raw-string regex for `|` alternation (test_malformed_raises); added match= to the two bare ValueError tests (test_empty_prompt_rejected, test_malformed_schedule_rejected) so they only pass for the intended error message. - tests/test_config_io.py: assert list_schedules in names alongside schedule_task / cancel_schedule. - docs/reference/configuration.md: clarified the scheduler opt-out description — middleware.scheduler is canonical, SCHEDULER_DISABLED is a runtime escape hatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review-5): address round-5 PR #156 CodeRabbit feedback Real bugs: - scheduler/local.py::_fire(): metadata moved from params.message.metadata to params.metadata. The A2A handler reads custom metadata from params.metadata (a2a_handler.py L1244 — `msg_metadata = params.get("metadata")`), so the previous nesting silently dropped the scheduler_job_id / scheduler_kind breadcrumb. Observers now get it as intended. - server.py reload path: scheduler swap is now planned before the graph rebuild and only committed after rebuild succeeds. A failed graph rebuild used to leave the scheduler torn down or a fresh one already started, dis-aligning runtime state. The new ordering: build candidate, rebuild graph (rollback-safe on failure), commit graph + scheduler atomically. - scheduler/local.py: _resolve_db_path now sanitizes agent_name via a new _safe_segment() helper. Strips path separators, ``..``, and absolute-path prefixes; falls back to "default" when nothing usable remains. Defence in depth — the value comes from operator- controlled env / YAML, but a typo or copy-paste shouldn't be able to put a sqlite file outside the configured scheduler dir. Tests: - tests/test_scheduler_local.py::test_cron_rescheduled_after_fire: pinned to a fixed fired_at timestamp so the assertion is exact (next_fire == "2026-04-29T09:00:00+00:00") instead of a "different from original" near-tautology that depends on datetime.now(). Docs: - docs/reference/configuration.md: clarified that the scheduler's enable/disable lives in YAML (middleware.scheduler), while backend selection and runtime knobs are env-driven. Repositioned SCHEDULER_DISABLED as the runtime escape hatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: sync surface counts and add scheduler/memory tool coverage After PRs #155 (default KB store + memory tools) and #156 (default scheduler), the docs claimed nine tools, missed scheduler tools entirely in the reference, and skipped scheduler env vars. This pass syncs every stale claim flagged by the audit. Updates: - docs/reference/starter-tools.md - Corrected count: nine → twelve - New tool sections: schedule_task, list_schedules, cancel_schedule (signatures, output formats, multi-agent isolation notes) - "adding your own" snippet now threads scheduler= through get_all_tools alongside knowledge_store= - Related links include the scheduler guide - docs/reference/environment-variables.md - New "Knowledge store" section: KNOWLEDGE_DB_PATH override + the ~/.protoagent fallback - New "Audit log" section: AUDIT_PATH (used by evals/verify.py) - New "Scheduler" section: WORKSTACEAN_API_BASE/KEY/TOPIC_PREFIX, SCHEDULER_DB_DIR/INVOKE_URL/DISABLED, plus the protoLabs operators callout pointing at the ava node + secrets manager for the actual key - docs/tutorials/first-agent.md - Wizard description now mentions all twelve tools and the four middleware toggles (added Scheduler alongside Audit/Memory/ Knowledge) - docs/tutorials/first-tool.md - "Where to go next" link copy: five → twelve - docs/guides/fork-the-template.md - Tool list paragraph corrected to all twelve, with the binding-by-backend split called out - docs/guides/customize-and-deploy.md - "Add domain tools" section now mentions memory + scheduler tool binding and the middleware.* toggles for opt-out - README.md - Starter tools row now lists all twelve, grouped 4+5+3 with backend bindings shown Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: conform to workspace-config standard (.beads + .automaker baseline) Scaffolds .beads/issues.jsonl + .automaker/settings.json + standard .gitignore (narrowing any blanket ignore) via release-tools init-workspace-config. Any remaining runner-rule errors need per-job migration to namespace-profile-protolabs-linux. * chore(ci): migrate node/git workflows to org-owned runner Migrate plain-Linux jobs to namespace-profile-protolabs-linux; annotate jobs that genuinely require GitHub-hosted infra. MIGRATED: - docs.yml build (vitepress docs build + upload-pages-artifact) - prepare-release.yml prepare (python version bump + gh PR ops) ANNOTATED (allow-hosted-runner): - docker-publish.yml build-and-push — docker buildx + registry push - docs.yml deploy — GitHub Pages deploy needs hosted Pages environment - release.yml release — docker buildx + registry push (+ OIDC attestation) Also finishes the .beads scaffold that the conform commit referenced but never landed: narrow the blanket *.jsonl ignore so .beads/issues.jsonl is committed (workspace-config requires the git-friendly export tracked). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Automaker <automaker@localhost> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Ava <ava@protolabs.ai> Co-authored-by: Automaker <josh@protolabs.studio>
Closes #154.
Summary
knowledge/store.py(sqlite + FTS5, LIKE fallback). Onechunkstable backs operator notes, daily-log entries, and conversation findings extracted byMemoryMiddleware. Default-on;middleware.knowledge: falseto opt out.tools/lg_tools.py—memory_ingest,memory_recall,memory_list,memory_stats,daily_log— bound to the bundled store.echoremoved.evals/— A2A client + runner + side-effect-verifiedtasks.json(15 starter cases: agent card, auth, abstention, every shipped tool, KB recall, chained, KnowledgeMiddleware). Audit log + reply text + KB state all checked independently — hallucinated tool results get caught.docs/guides/evals.mdhow-to + every reference page updated to match the new defaults / nine-tool starter set.Why side-effect verification
When the model produces "Logged: ..." without actually calling
daily_log, text-only checks pass while the DB stays empty. The harness readsaudit.jsonland thechunkstable after every turn so that bug doesn't ship.Default flip — what existing forks will notice
knowledge_middlewareandmemory_middlewarenow default totrueinLangGraphConfig. On first run:/sandbox/knowledge/agent.db(Docker default).~/.protoagent/knowledge/agent.dbautomatically — no permission errors.middleware.knowledge: falseinconfig/langgraph-config.yaml.Test plan
python -m evals.runneragainst a wizard-set-up local agent — expect ≥13/15 cases passing on a small/fast model (text-pattern noise on a couple).python -m evals.runner --category a2a-protocol— agent card + auth gating gate at 100% (deterministic).pytest tests/test_starter_tools.py tests/test_config_io.py tests/test_skill_emission.py tests/test_a2a_handler.py— should be all green (was 202/203 with one unrelated origin test failing onmain).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Configuration
Documentation
Tests