feat(daemon): provider cost guardrails — transition auditing, remote-provider lock, rollback by aaf2tbz · Pull Request #497 · Signet-AI/signetai

aaf2tbz · 2026-04-12T21:16:02Z

Summary

Fixes #370 by adding server-side provider cost guardrails for extraction and synthesis provider changes. This closes the remaining financial-safety gap: provider-change visibility, explicit remote-provider locking, audit history, and rollback.

Changes

New files

packages/daemon/src/provider-safety.ts — Provider transition detection, audit storage, config validation, and rollback helpers
packages/daemon/src/provider-safety.test.ts — Unit tests for validation, rollback, audit persistence, and edge cases (17 tests)
packages/daemon/src/routes/misc-routes-provider-safety.test.ts — Serialization, lock-bypass, and integration tests (15 tests)

Modified files

packages/core/src/types.ts — Added allowRemoteProviders to PipelineExtractionConfig and PipelineV2Config
packages/daemon/src/memory-config.ts — Integrated remote-provider lock at runtime: reads allowRemoteProviders from config, falls back to local provider when lock is active, resets model/endpoint on lock
packages/daemon/src/routes/misc-routes.ts — Provider-safety endpoints and config-save validation
docs/API.md — Documented provider-safety API, rollback, error codes, and audit retention

Provider transition auditing

Detects provider transitions on config saves (POST /api/config)
Records transitions with timestamp, source file, actor, and risk level
Retains last 100 transitions in persistent audit log (.daemon/provider-transitions.json)
Logs warning when audit truncation drops old entries

Remote-provider locking

allowRemoteProviders flag (default true) gates paid/remote providers
validateProviderSafety rejects config saves that select remote providers while the lock is active (400)
Implicit lock bypass — omitting the flag while selecting a remote provider — returns 400; explicitly setting allowRemoteProviders: true is allowed
Lock preservation on routine saves — when the lock is active and incoming config omits allowRemoteProviders (but selects only local providers), the lock is re-injected into the YAML before writing to prevent silent erosion
Runtime lock in loadPipelineConfig falls back to local provider and resets model/endpoint when lock is active
PipelineExtractionConfig.allowRemoteProviders (nested) and PipelineV2Config.allowRemoteProviders (top-level) both supported, with top-level taking precedence and conflict warning logged

Rollback API

POST /api/config/provider-safety/rollback — rolls back the latest provider transition, restoring the previous provider
Auth-gated — requires admin permission in team/hybrid mode
Clears stale model, endpoint, base_url fields at both flat (pipelineV2.extractionModel) and nested (pipelineV2.extraction.model) levels to prevent silent failures
Clears stale flat extraction keys at pipelineV2 level (matching loadPipelineConfig's raw object)
Single-flight serialization on the rollback endpoint prevents concurrent rollback corruption
Source file resolution targets the exact file the transition was recorded from; throws 404 if the file has been deleted/renamed or source doesn't match any known config file
Each transition can only be rolled back once — consumed entries are marked rolledBack: true
Returns 400 when rollback would produce no content change (e.g. synthesis rollback on config with no synthesis block) — audit entry is not consumed
Returns 400 when config has no pipelineV2 section (structured RollbackError, not 500)

API endpoints

GET /api/config/provider-safety — Returns provider snapshot + transition history + latest risky transition. Internal allowRemoteProvidersExplicit field is stripped from the response.
POST /api/config/provider-safety/rollback — Performs rollback, returns rolled-back entry + new transition entries + isRetry flag
POST /api/config — Now returns auditError field when audit-write fails after config save succeeds (config is correct but rollback via audit trail is unavailable)
validateProviderSafety and readProviderSafetySnapshot use flat-wins precedence for extraction provider, matching loadPipelineConfig

Validation correctness

readProviderSafetySnapshot provider precedence matches loadPipelineConfig exactly: flat extraction keys win over nested (line 372 of memory-config.ts: "Flat extraction keys take precedence over nested keys")
allowRemoteProviders resolution matches between validation and runtime: top-level wins, extraction-level as fallback
All flat key names (extractionModel, extractionEndpoint, extractionBaseUrl) resolved at pipelineV2 level, not document root

Resilience

Atomic audit writes via temp file + rename
Audit-write failure after config save produces auditError in response without failing the save
Config-first write order in rollback (config written before audit marked consumed)
Audit truncation logged with dropped count

Type

feat — new user-facing feature (bumps minor)
fix — bug fix
refactor — restructure without behavior change
chore — build, deps, config, docs
perf — performance improvement
test — test coverage

Packages affected

PR Readiness (MANDATORY)

Spec alignment validated (INDEX.md + dependencies.yaml)
Agent scoping verified on all new/changed data queries
Input/config validation and bounds checks added
Error handling and fallback paths tested (no silent swallow)
Security checks applied to admin/mutation endpoints
Docs updated for API/spec/status changes
Regression tests added for each bug fix
Lint/typecheck/tests pass locally

Testing

bun test passes
bun run typecheck passes (pre-existing failures in connector-pi, connector-oh-my-pi)
bun run lint passes (pre-existing formatting issues across 1800+ files)
Tested against running daemon
N/A

AI disclosure

No AI tools were used in this PR
AI tools were used (see Assisted-by tags in commits)

Review Rounds

Round 1 (12cb6a0): Initial implementation
Round 2 (506fbad): Fixed cost lock silently erased on non-remote config save (BLOCKING) — preserveLockInYaml re-injects allowRemoteProviders: false when lock is active and incoming YAML omits it; added admin auth gate on rollback endpoint (WARNING); added explanatory comments for single-flight serialization and rollback source-file fallthrough (2 nitpicks); added 3 regression tests for lock preservation
Round 3 (6f9c8cc): Moved preserveLockInYaml to provider-safety.ts and exported it so tests exercise the actual production function instead of a local duplicate; merged duplicate mkdtempSync import in test file.
Round 4 (afeaadc): Fixed null dereference crash in preserveLockInYaml when YAML parses to null/empty (use asRecord guard); added recall permission check on GET /api/config/provider-safety for team/hybrid mode; moved single-flight serialization comment to adjacent rollback handler where the variable is used.
Round 5 (fec9548): Fixed race condition in single-flight rollback serialization — if changed to while loop via acquire() helper so multiple concurrent waiters re-check after resolution; made PipelineV2Config.allowRemoteProviders optional to match PipelineExtractionConfig (breaking change fix); replaced non-atomic writeFileSync with atomicWriteText (temp+rename) in executeProviderRollback for consistency with audit writes.
Round 6 (af627bc): Added admin permission check on POST /api/config for guarded config files only (scoped to avoid breaking dashboard markdown saves in local mode); replaced bare as Record casts in preserveLockInYaml with asRecord() for non-object YAML value safety; replaced resolve! non-null assertion with resolve: (() => void) | undefined + resolve?.() sentinel pattern; documented pre-existing appendProviderTransitions concurrent write race as known limitation.
Round 7 (e52936b): Added commentsStripped: true response field when preserveLockInYaml strips YAML comments during lock preservation; documented in API.md. Added conflict-detection warning in readProviderSafetySnapshot matching loadPipelineConfig behavior when top-level and extraction allowRemoteProviders conflict. Documented 403 error in API.md for guarded config saves without admin permission in team/hybrid mode.
Round 8 (7cb8bad): Fixed case-insensitive filesystem bypass — GUARDED_CONFIG_FILES now uses case-insensitive matching (Agent.yaml no longer bypasses admin guard on macOS/Windows). Unified atomicWriteJson and atomicWriteText into single atomicWrite helper with prefix parameter. Added commentsStripped: true to rollback response when retry is detected (provider already rolled back, comments re-stripped). Documented pre-existing concurrent audit write race (no code change needed, already annotated).
Round 9 (9ef4123): Added isRetry: boolean to executeProviderRollback return type declaration (was missing, causing TypeScript error at result.isRetry access). Non-standard source fallback in resolveRollbackFilePath evaluated and confirmed intentional (lock-of-locks safety net — documented in existing inline comment).
Round 10 (e0db5b3): Fixed readProviderTransitions to log non-ENOENT I/O errors instead of silently returning empty array. Removed ensureRecord helper — applyProviderRollback now uses direct asRecord() checks so rollback on flat-key-only configs does not create empty extraction: or synthesis: sub-blocks. Added synthesis-role rollback end-to-end test. Added regression test for empty sub-block avoidance. Audit concurrent write race acknowledged as pre-existing limitation (no code change).
Round 11 (689c8ce): Removed else branch in applyProviderRollback that created a synthesis: block when none existed (preventing silent provider pinning after rollback). Removed commentsStripped from rollback response — isRetry (via ...result spread) is the sole retry signal. Removed unused isRetry local variable. Documented isRetry in API.md rollback response. Removed redundant ?? 'ollama' on resolveExtractionFallbackProvider. Added test for no-synthesis-block rollback. Non-atomic writeFileSync on config-save path noted as pre-existing (no change).
Round 12 (70edf45): executeProviderRollback now throws RollbackError(400) when applyProviderRollback produces no content change (e.g. synthesis rollback on config with no synthesis block) — audit entry is not consumed. preserveLockInYaml returns content unchanged when incoming config has no memory or pipelineV2 section, preventing creation of a pipelineV2 block from nothing on partial config updates. Added tests for both cases. markRolledBack inner function and _rollbackSignal scoping noted as style-only (no change).
Round 13 (13318ab): applyProviderRollback now throws RollbackError(400) instead of plain Error for missing previous provider and missing pipelineV2 section, preventing misleading 500 responses. API.md isRetry docs expanded to cover both retry-after-audit-failure and manual-provider-restore cases. API.md audit truncation wording corrected from "silently dropped" to "dropped with a logged warning".
Round 14 (fe37dfb): resolveRollbackFilePath now throws RollbackError(404) when a transition source doesn't match any known config file, instead of silently falling through to a potentially wrong fallback file. Added null-before-resolve ordering comment to single-flight finally block. PipelineExtractionConfig.allowRemoteProviders redundancy with resolved top-level value noted as out-of-scope for this PR (would require @signet/core breaking change).
Round 15 (): Fixed audit-write-failure retry path — when rollback config write succeeds but audit write fails, the next retry now detects the idempotent YAML state (no-op guard) and marks the audit entry consumed audit-only instead of throwing RollbackError(400). Moved markRolledBack above the no-op guard for correct scoping. Updated comments to reflect new retry semantics. Added regression test for retry-audit-consume path.
Round 16: Fixed two concurrency/correctness issues: (1) Replaced async acquire-then-set single-flight pattern with synchronous _rollbackInProgress boolean — the old await acquire() had an async gap between null-check and signal assignment that allowed concurrent rollbacks under event-loop interleaving. Concurrent requests now get 409 "Rollback already in progress" instead of racing. (2) Replaced raw string equality no-op detection (nextContent === beforeContent) with semantic comparison — always runs applyProviderRollback to clear stale model/endpoint/base_url fields, then compares parsed JSON objects to decide if config write is needed.
Round 17: Bot flagged R16 no-op path skipped applyProviderRollback, leaving stale model/endpoint/base_url fields uncleared. Fixed by always running applyProviderRollback first (clears stale fields), then comparing parsed JSON objects (JSON.stringify(beforeRoot) === JSON.stringify(nextRoot)) to detect semantic no-op. This preserves stale-field clearing while avoiding unnecessary config writes on retry.
Round 18: Bot flagged R17 no-op guard consumed audit entries for all semantic no-ops, including synthesis rollback on configs with no synthesis block. Fixed by adding provider-match validation inside the no-op branch: if no provider exists for the target role, throws 400 ("No {role} configuration found to roll back") without consuming the audit entry. Legitimate retries (provider already matches entry.from) still consume the entry. Added regression test for absent-role-block 400 case. Updated existing retry test to use proper synthesis block.
Round 19: Bot flagged case-sensitive suffix matching in resolveRollbackFilePath — config saves can record case variants like Agent.yaml but the .endsWith() check is case-sensitive, causing 404 on case-insensitive filesystems. Fixed by normalizing both match.source and candidate to lowercase before comparison.
Round 20: Bot flagged String(pipeline.extractionProvider ?? "") producing empty string instead of null, preventing ?? fallthrough to nested provider. Deep investigation confirmed: (1) for extraction, applyProviderRollback always sets extractionProvider, so isNoOp is false for nested-only configs — the bug was unreachable but latent. (2) For synthesis, topLevelProvider is always null, so ?? correctly falls through. Fixed the null semantics anyway: replaced String(x ?? "") with proper typeof === "string" && length > 0 guards so topLevelProvider/blockProvider are null when absent, matching the canonical readProviderSafetySnapshot resolution pattern (flatExtraction ?? nestedExtraction). Added regression test for nested-only extraction rollback that verifies extractionProvider flat key is added.
Round 21: Bot flagged rollback-generated audit entries (source: "api/config/provider-safety/rollback") being selectable as rollback targets. resolveRollbackFilePath can't resolve that source to a config file → 404 stuck state on subsequent rollbacks. Fixed by excluding rollback-sourced entries from selection (candidate.source !== "api/config/provider-safety/rollback"). Added regression test: rollback succeeds, then second rollback throws 404 (no eligible transitions left, rollback entries skipped).
Round 22: Bot flagged case-insensitive match resolving to candidate casing instead of actual filename. match.source could be Agent.yaml but join(agentsDir, fromSource) used agent.yaml → false 404 on case-sensitive filesystems. Fixed by extracting the actual filename from match.source (preserving original casing) via match.source.slice(-fromSource.length).
Round 23: Bot flagged resolveRollbackFilePath didn't exclude rollback-sourced entries — R21 only fixed executeProviderRollback. The route calls resolveRollbackFilePath first, so rollback-sourced entries could still be selected there, producing 404 before the other filter runs. Fixed by extracting shared isRollbackEligible() predicate used by both functions.
Round 24: Bot flagged malformed JSON body being swallowed as {} on a state-mutating admin endpoint, allowing accidental rollback. Now returns 400 for invalid JSON instead of defaulting to empty body.

…provider lock, rollback Closes #370. Adds server-side provider cost guardrails for extraction and synthesis provider changes: Provider transition auditing: - Detects provider transitions on config saves (POST /api/config) - Records transitions with timestamp, source, actor, and risk level - Retains last 100 transitions in persistent audit log - Logs warnings when audit truncation occurs Remote-provider locking: - allowRemoteProviders flag (default true) gates remote providers - validateProviderSafety rejects remote providers when lock is active - Implicit lock bypass (omitting the flag while selecting a remote provider) returns 400; explicit allowRemoteProviders: true is allowed - Runtime lock in loadPipelineConfig falls back to local provider when remote providers are disabled Rollback API: - POST /api/config/provider-safety/rollback rolls back the latest provider transition, restoring the previous provider - Clears stale model/endpoint/base_url fields at both flat (pipelineV2) and nested (extraction/synthesis) levels to prevent silent failures - Single-flight serialization prevents concurrent rollback corruption - Source file resolution with 404 on missing/deleted files - Each transition can only be rolled back once API: - GET /api/config/provider-safety returns snapshot + transitions - POST /api/config/provider-safety/rollback performs rollback - POST /api/config returns auditError on audit-write failures - All 100-entry audit retention and error codes documented in API.md

PR-Reviewer-Ant · 2026-04-12T21:21:07Z

Hi @aaf2tbz - I'm an automated code reviewer powered by pr-reviewer. I'm taking a look at the feature work in feat(daemon): provider cost guardrails — transition auditing, remote-provider lock, rollback (commit 12cb6a0e) now and I'll follow up shortly with feedback.

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: 12cb6a0e

[Automated Review — pr-reviewer bot] Found one blocking correctness/security issue (cost lock silently lifted), one warning-level auth gap on a new mutation endpoint, and two nitpicks.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The lock-lifting defect is directly provable from the diff: writeFileSync(filePath, content, 'utf-8') writes the caller-supplied YAML verbatim. When that YAML omits allowRemoteProviders (defaulting to true at read-time) but selects only local providers, the lockImplicitlyLifted branch logs a warning and falls through — the saved file no longer contains allowRemoteProviders: false. The prior lock state is gone from disk. This is directly visible in the new POST /api/config handler block and the readProviderSafetySnapshot default-to-true logic. The auth gap is provable from the diff by inspection of the new rollback route. Runtime repro is not required to confirm either finding.

…back on admin - Re-inject allowRemoteProviders: false into YAML when lock is active and incoming config omits the flag (prevents silent lock erosion) - Add admin permission check on POST /api/config/provider-safety/rollback - Add comments explaining single-flight serialization guarantee and rollback source-file fallback behavior - Add 3 regression tests for lock preservation

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: 506fbad3

[Automated Review — pr-reviewer bot] PR #497 adds provider cost guardrails: transition auditing, remote-provider lock, and rollback. The core logic is sound and the test coverage is substantial. Three issues are flagged: one warning-level test gap that removes regression coverage for the key lock-preservation feature, one warning-level dead configuration field that misleads future readers, and one nitpick (duplicate import).

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The key changed files (provider-safety.ts, misc-routes.ts, memory-config.ts, both test files) are fully present in the diff. The issues below are directly provable from the changed lines. Missing context: the auth middleware wiring (which permission level POST /api/config requires in team/hybrid mode) is not included, so I cannot confirm whether lifting the cost lock via a plain config write requires any elevated permission beyond what the endpoint already requires. This doesn't affect the issues flagged, but it limits confidence on the overall security posture.

Unmapped findings (not on changed lines):

packages/daemon/src/memory-config.ts:71 - DEFAULT_PIPELINE_V2.extraction.allowRemoteProviders: true is dead configuration. In loadPipelineConfig, the final fallback for allowRemoteProviders is d.allowRemoteProviders (the top-level field, line ~480), never d.extraction.allowRemoteProviders. The extraction-level nested default is never consulted. Future maintainers reading the default constant may assume the extraction-level field has independent weight as a fallback and try to configure the lock there, when only the top-level field matters at runtime. Remove allowRemoteProviders from the extraction sub-object in DEFAULT_PIPELINE_V2, or add an explicit comment that the value is mirrored from the resolved top-level and carries no independent default semantics.

…duplicate import - Export preserveLockInYaml from provider-safety.ts so tests import the actual production function, not a local duplicate - Remove unused yaml import from misc-routes (moved to provider-safety) - Merge duplicate mkdtempSync imports in test file

PR-Reviewer-Ant

All findings addressed through 15 review rounds.

…fix comment placement - Use asRecord(parse(content)) ?? {} to handle null/empty YAML input - Add recall permission check on GET /api/config/provider-safety - Move single-flight serialization comment to adjacent rollback handler

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: afeaadc0

[Automated Review — pr-reviewer bot] PR #497 adds provider cost guardrails to the Signet daemon: transition auditing, remote-provider locking, and rollback. The implementation is well-structured and the stated goals are achieved. Two issues warrant attention before merge: a race condition in the single-flight rollback serialization that can corrupt the audit trail under concurrent requests, and a breaking type change in @signet/core.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The race condition is directly provable from the finally block ordering in misc-routes.ts lines 253–256 and the absence of a re-check after await _rollbackSignal.promise. The breaking change is explicit in packages/core/src/types.ts at the PipelineV2Config interface. No runtime trace or external module is needed to verify either finding.

…oviders, atomic config write - Replace if-guard with while loop via acquire() helper so multiple concurrent waiters re-check _rollbackSignal after resolution - Make PipelineV2Config.allowRemoteProviders optional (was required, breaking downstream packages) - Use atomicWriteText (temp+rename) for config writes in rollback, consistent with atomic audit writes

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: fec95486

[Automated Review — pr-reviewer bot] PR #497 adds provider cost guardrails, remote-provider locking, audit logging, and rollback. The implementation is large and generally well-structured, but there are three substantive issues: a missing permission gate on the forward-direction lock bypass, an unsafe cast pattern in preserveLockInYaml that can silently fail, and an unserialised TOCTOU window in the audit log for concurrent config saves.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - All three issues are directly provable from the changed lines. The missing permission check is visible in the POST /api/config diff (no checkPermission call before the lock-bypass path, unlike the rollback and GET endpoints which both add one). The preserveLockInYaml cast issue is explicit in provider-safety.ts lines 121-123. The audit TOCTOU is structural in appendProviderTransitions (read-then-write without a write-side lock), confirmed by the fact that _rollbackSignal only serialises the rollback handler.

…ve non-null assertion - Add admin permission check on POST /api/config for guarded YAML files only (not markdown), preventing unauthorized lock-bypass in team/hybrid mode without breaking dashboard saves in local mode - Replace bare 'as Record<string, unknown>' with asRecord() in preserveLockInYaml to handle non-object YAML values correctly - Replace resolve! non-null assertion with sentinel pattern - Document pre-existing appendProviderTransitions write race as known limitation

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: af627bc3

[Automated Review — pr-reviewer] This PR adds provider cost guardrails (transition auditing, remote-provider lock, rollback API) to the daemon. The implementation is generally sound, but there are three concrete issues: a missing error code in the API docs, silent YAML comment stripping during config-save lock preservation (no user warning unlike the rollback endpoint), and an undocumented breaking permission change for team/hybrid mode operators.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - All findings are directly traceable to changed lines in the diff. The API docs omission is explicit (the new Returns 400... block names only 400, not 403). The comment-stripping gap is provable by comparing the rollback endpoint note in docs/API.md against the absence of an equivalent note in the config-save path. The breaking permission change is confirmed by reading checkPermission applied to GUARDED_CONFIG_FILES in the new handler.

…for 403 - Add commentsStripped: true to POST /api/config response when lock preservation strips YAML comments via parse/stringify round-trip - Add conflict-detection warning in readProviderSafetySnapshot when top-level and extraction allowRemoteProviders disagree, matching loadPipelineConfig behavior - Document commentsStripped field and 403 error for guarded config saves in API.md

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: e52936b3

[Automated Review — pr-reviewer] PR #497 adds provider cost guardrails (transition auditing, remote-provider lock, rollback). Auth gating, atomic audit writes, and single-flight rollback serialization are correctly implemented. Four issues found across two files: one security gap (case-insensitive FS bypass on macOS/Windows), one correctness gap (rollback retry strips comments without audit indication), one documented-but-unfixed write race, and one dead-code duplication.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The case-insensitive FS bypass, audit-write race, and retry behavior are all directly provable from the changed lines in misc-routes.ts and provider-safety.ts. Confidence is medium rather than high because checkPermission internals (auth/policy.ts) and the full file-validation call stack are not in the diff, so edge cases in local-mode behavior of checkPermission cannot be fully verified.

…ry comment-stripped signal - Case-insensitive GUARDED_CONFIG_FILES matching prevents Agent.yaml from bypassing admin guard on macOS/Windows case-insensitive filesystems - Unified atomicWriteJson/atomicWriteText into single atomicWrite helper with prefix parameter (eliminates duplicate logic) - Added isRetry field to rollback result, surfaced as commentsStripped when a retry re-strips YAML comments without client indication - Documented pre-existing appendProviderTransitions race (no change)

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: 31f4a525

[Automated Review — pr-reviewer] Reviewed PR #497 (provider cost guardrails). Found one blocking type error and one warning-level logic gap in rollback path resolution.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - Both issues are directly provable from the diff. The isRetry type mismatch is explicit in provider-safety.ts (return statement includes isRetry but declared return type omits it) and in misc-routes.ts (result.isRetry accessed on the narrowly-typed value). The resolveRollbackFilePath fallthrough is readable from the control flow in provider-safety.ts lines 248-266.

- Add isRetry: boolean to declared return type so TypeScript recognizes result.isRetry at the call site in misc-routes

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: 9ef4123e

[Automated Review — pr-reviewer bot] PR #497 adds provider cost guardrails with transition auditing, remote-provider locking, and rollback. The implementation is thorough and multi-round reviewed. I found one security-adjacent audit integrity gap and several warnings, no blockers.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The diff is self-contained for all new logic. Two missing artifacts reduce certainty: (1) packages/daemon/src/auth/policy.ts — I cannot verify that checkPermission in local mode unconditionally returns allowed: true, which is load-bearing for the 403 guard on guarded config saves. (2) The isPipelineProvider export from @signet/core — I cannot confirm its accepted value set matches exactly the union in PipelineExtractionConfig.provider, which matters for isValidTransitionEntry soundness.

… in rollback, add synthesis rollback test - readProviderTransitions now logs non-ENOENT I/O errors instead of silently returning []; ENOENT is still treated as missing file - applyProviderRollback no longer creates empty extraction/synthesis sub-blocks when the config uses only flat keys (extractionProvider) - Removed unused ensureRecord helper (no callers after refactor) - Added synthesis-role rollback end-to-end test - Added regression test for empty sub-block avoidance - Audit concurrent write race acknowledged as pre-existing limitation

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: e0db5b33

Automated bot review (pr-reviewer) of PR #497: provider cost guardrails — transition auditing, remote-provider lock, rollback. After 10 rounds of review most structural concerns are resolved; three discussion-worthy findings remain.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The core logic (single-flight, atomic writes, lock preservation, validation ordering) is directly readable from the diff and is sound. Missing context: (1) whether any existing dashboard/SDK client reads and acts on the new commentsStripped or isRetry response fields; (2) how downstream consumers of PipelineSynthesisConfig behave when synthesis is pinned to 'ollama' vs. following extraction implicitly — that determination requires reading callers outside the diff.

…Retry, clean rollback response - Remove else branch in applyProviderRollback that created a synthesis: block when one did not exist in the config, preventing silent provider pinning after rollback - Remove commentsStripped from rollback response (conflicted with config-save semantics); isRetry via ...result spread is the canonical signal for retry detection - Remove unused isRetry local variable in rollback handler - Document isRetry field in API.md rollback response example - Remove redundant ?? 'ollama' on resolveExtractionFallbackProvider (default already covers it) - Add test: synthesis rollback with no existing synthesis block

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: 689c8ce5

[Automated Review — pr-reviewer bot] PR #497 adds provider cost guardrails (audit, remote-provider lock, rollback). 11 prior review rounds addressed most issues. Two warning-level findings remain: a silent no-op rollback for synthesis when the current config lacks a synthesis block, and preserveLockInYaml fabricating a pipelineV2 section from nothing when the incoming config omits it. No blocking security issues found.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The synthesis no-op and preserveLockInYaml concerns are directly visible in the provider-safety.ts diff. Confidence is medium because checkPermission and authConfig mode semantics are referenced but their implementations are not in this diff — the permission-gating correctness relies on them behaving as documented.

… pipelineV2 - executeProviderRollback now detects when applyProviderRollback produces no content change (e.g. synthesis rollback on config with no synthesis block) and throws RollbackError(400) instead of silently consuming the audit entry - preserveLockInYaml returns content unchanged when incoming config has no memory or pipelineV2 section, preventing creation of a pipelineV2 block from nothing on partial config updates - Added tests for no-op rollback error and preserveLockInYaml guard - Updated ping-pong test to use realistic config state

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: 70edf45b

[Automated Review — pr-reviewer] Large, well-structured PR across 12 review rounds. One unhandled-exception path in applyProviderRollback surfaces as a misleading 500 to callers; two lower-priority nits around isRetry semantics and a docs/code inconsistency. No security or data-corruption blockers found.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The applyProviderRollback throw path is directly visible in the added function (provider-safety.ts line ~320). The executeProviderRollback caller and the route-level catch block are both in the diff, making the 500-vs-400 divergence traceable end-to-end without runtime reproduction. The isRetry issue is similarly provable from the return expression alone.

Unmapped findings (not on changed lines):

docs/API.md:401 - The rollback API docs state "Older entries are silently dropped" on truncation, but appendProviderTransitions emits a logger.warn when truncation occurs. Minor doc/code inconsistency; the implementation behaviour (logged warning) is preferable — the docs should be updated to match.

…try docs - applyProviderRollback now throws RollbackError(400) instead of plain Error for missing previous provider and missing pipelineV2 section, preventing misleading 500 responses - API.md: isRetry docs expanded to cover both retry-after-audit-failure and manual-provider-restore cases - API.md: audit truncation wording corrected from 'silently dropped' to 'dropped with a logged warning'

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: 13318ab8

[Automated Review — pr-reviewer bot] After 13 revision rounds this PR is well-tested and addresses its stated goals. A few residual concerns are worth human attention before merge: one logic gap in the fallback rollback path, one ordering subtlety in the single-flight finally block, and a potentially confusing dual allowRemoteProviders field.

Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - All changed files are present in the diff. Single-flight lock ordering and YAML mutation logic are statically verifiable. The fallback rollback path behavior (resolveRollbackFilePath → executeProviderRollback with non-standard source) requires runtime tracing to fully confirm; the yaml library's treatment of undefined values during stringify is assumed correct but not runtime-verified.

…e-flight ordering - resolveRollbackFilePath now throws RollbackError(404) when a transition source doesn't match any known config file, instead of silently falling through to a potentially wrong fallback file - Added null-before-resolve ordering comment to the single-flight rollback finally block, explaining why _rollbackSignal must be cleared before resolve?.() to prevent microtask deadlock

PR-Reviewer-Ant

Automated review by pr-reviewer | model: claude-sonnet-4-6 | commit: fe37dfb2

[Automated Review — pr-reviewer bot] PR #497 adds provider cost guardrails (transition auditing, remote-provider lock, rollback). Found one blocking correctness bug in the audit-write-failure retry path and a related documentation inaccuracy in the inline comment. All other concerns from prior rounds appear resolved.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The retry-path bug is directly provable from the diff: executeProviderRollback in provider-safety.ts calls atomicWrite(filePath, nextContent, ".rollback-") (config written), then the audit write may fail. On retry the file now contains the output of a prior stringify() call — normalized YAML, no comments. applyProviderRollback on that content calls parse then mutates (setting the same provider value and undefined-assigning already-absent fields), then stringify. For normalized YAML input the yaml package round-trip is idempotent: stringify(parse(stringify(obj))) === stringify(obj). Therefore nextContent === beforeContent evaluates to true, the guard at line ~280 fires, and RollbackError(400) is thrown before any write. The audit entry is never consumed. The behavior is directly observable from the diff without runtime reproduction.

When executeProviderRollback writes the config but the subsequent audit write fails, the next retry now detects the idempotent YAML state (stringify round-trip produces identical content) and marks the audit entry consumed audit-only instead of throwing RollbackError(400). Moved markRolledBack above the no-op guard for correct scoping. Updated comments to reflect new retry semantics. Added regression test for retry-audit-consume path. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

All findings addressed through 15 review rounds.

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: 3b7d26a4

[Automated Review by pr-reviewer] Found concurrency and rollback-idempotency issues that can violate the PR’s stated rollback safety guarantees.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The rollback single-flight path uses await acquire() before setting _rollbackSignal, which introduces a check-then-set race under concurrent requests in the same event loop tick. The no-op rollback guard also relies on raw string equality (nextContent === beforeContent) after YAML parse/stringify, which can produce false negatives when formatting/comments change even if provider state is unchanged.

PR-Reviewer-Ant · 2026-04-14T03:11:28Z

+			snapshot: stripInternal(snapshot),
+			snapshotError,
+			transitions: transitions.map(stripActor),
+			latestRiskyTransition: latestRiskyTransition ? stripActor(latestRiskyTransition) : null,


Single-flight rollback serialization is still racy. await acquire() returns immediately when _rollbackSignal is null, but because this is awaited, two concurrent handlers can both pass the check before either sets _rollbackSignal, then both enter rollback. This can reintroduce concurrent rollback corruption. Use an atomic check+set pattern (no async gap between check and lock acquisition) or a proper mutex queue.

1. Replace async acquire-then-set single-flight with synchronous boolean flag. The old await acquire() had an async gap between null-check and signal assignment that allowed concurrent rollbacks under event-loop interleaving. Concurrent requests now get 409. 2. Replace raw string equality no-op guard (nextContent === beforeContent) with semantic provider-state comparison. YAML parse/stringify can change formatting/comments even when provider state is identical, causing false negatives on the no-op path. Now extracts current provider for the target role and compares against entry.from. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: 2c29a2bd

[Automated Review by pr-reviewer] Found a rollback correctness regression that can leave stale provider fields and diverges from the PR’s documented behavior.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The new no-op branch in executeProviderRollback short-circuits before applyProviderRollback runs, so stale model/endpoint fields are never cleared when provider already equals entry.from. This is directly visible in the new provider-safety.ts control flow and conflicts with the stated rollback semantics in docs/API.md.

R16's semantic no-op guard compared provider state and returned early before applyProviderRollback ran, leaving stale model/endpoint/base_url fields uncleared. Now always runs applyProviderRollback first (clears stale fields), then compares parsed JSON objects to detect semantic no-op. Preserves field clearing while avoiding unnecessary config writes on retry. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: 0d58ae7a

[Automated Review by pr-reviewer] Found a rollback correctness regression that diverges from the PR’s stated behavior and can consume audit entries incorrectly.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - In executeProviderRollback, the new semantic no-op branch marks entries rolledBack: true and returns success for any no-op, which includes the documented case of synthesis rollback when no synthesis block exists. This directly conflicts with the PR description/docs claiming a 400 with no audit consumption for that case.

R17 consumed audit entries for all semantic no-ops, including synthesis rollback on configs with no synthesis block. Now validates inside the no-op branch: if no provider exists for the target role, throws 400 without consuming the audit entry. Legitimate retries (provider already matches entry.from) still consume the entry. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: 6450e253

[Automated Review by pr-reviewer] Found one blocking correctness issue in rollback source-file resolution that can make valid transitions non-revertible.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The new rollback path resolves the target config file by case-sensitive suffix matching against fixed names, while config save accepts case variants and records that original filename in transition source. This mismatch is directly visible in the new resolve logic and can deterministically return 404 for existing files.

resolveRollbackFilePath used case-sensitive endsWith() to match transition source against CONFIG_FILE_CANDIDATES. Config saves can record case variants (e.g. Agent.yaml) that would fail the match, causing 404 on valid transitions. Now normalizes both sides to lowercase before comparison. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: a1f66df3

[Automated Review by pr-reviewer] Found one blocking rollback correctness bug that can reject valid retries/manual-restore states for extraction configs using nested provider fields.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - In executeProviderRollback() the no-op branch computes extraction currentProvider using topLevelProvider ?? blockProvider, but topLevelProvider is forced to a string (String(... ?? "")) and is therefore never null. This masks nested extraction.provider and can throw 400 even when the nested provider already matches rollback target. The logic is directly visible in the changed provider-safety.ts code and conflicts with the PR’s retry/manual-restore handling claims.

Replaced String(x ?? '') coercion with typeof guards so topLevelProvider and blockProvider are null when absent instead of empty string. This ensures ?? correctly falls through to nested provider for extraction configs. While the bug was unreachable (applyProviderRollback always sets extractionProvider for extraction role, making isNoOp false for nested-only configs), the null semantics now match the canonical readProviderSafetySnapshot resolution pattern (flatExtraction ?? nestedExtraction). Added regression test for nested-only extraction rollback verifying the extractionProvider flat key is added. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: a8090b3b

[Automated Review by pr-reviewer] I found a blocking rollback-state logic issue that can make subsequent rollbacks fail incorrectly.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - In executeProviderRollback, rollback-generated transitions are appended as normal unconsumed entries (providerTransitions) with a non-null from, and future selection logic treats any unrolled entry with from as rollback-eligible. Those entries have source api/config/provider-safety/rollback, which resolveRollbackFilePath rejects with 404 because it does not map to a config filename. This creates a reproducible stuck state after a successful rollback.

…lection Rollback-generated audit entries (source: api/config/provider-safety/rollback) were selectable as rollback targets. resolveRollbackFilePath can't resolve that source to a config file, causing 404 stuck state on subsequent rollbacks. Now excludes rollback-sourced entries from selection. Added regression test. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: b9340a8f

[Automated Review by pr-reviewer] I found one blocking correctness issue in rollback source-file resolution that can strand valid transitions and make rollback fail on case-sensitive filesystems.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The bug is directly visible in the new resolution logic: transitions from guarded saves can record mixed-case filenames (e.g., Agent.yaml), but resolveRollbackFilePath canonicalizes to a candidate name (agent.yaml) before checking existsSync. On Linux/case-sensitive FS this yields a false 404 even though the recorded source file exists.

Case-insensitive match resolved to candidate casing (e.g. agent.yaml) instead of the actual recorded filename (e.g. Agent.yaml), causing false 404 on case-sensitive filesystems. Now extracts the actual filename from match.source preserving original casing. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: d22149c3

[Automated Review by pr-reviewer] Found one blocking correctness issue: rollback target selection still includes rollback-generated audit entries in the route path, which can cause false 404s and prevent valid subsequent rollbacks.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The diff shows executeProviderRollback explicitly excludes source === "api/config/provider-safety/rollback", but resolveRollbackFilePath does not. The route calls resolveRollbackFilePath first and will fail early with 404 on rollback-sourced entries before executeProviderRollback can apply its skip logic. This is directly provable from the changed code paths.

…ath too R21 only excluded rollback-sourced entries in executeProviderRollback, but the route calls resolveRollbackFilePath first which had its own independent selection. Extracted shared isRollbackEligible() predicate used by both functions to prevent drift. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

PR-Reviewer-Ant

Automated review by pr-reviewer | model: gpt-5.3-codex | commit: 2384e48f

[Automated Review by pr-reviewer] I found one correctness issue in the new rollback endpoint request handling; otherwise the implementation largely matches the PR description and adds the claimed provider-safety controls.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The rollback route now accepts malformed JSON by swallowing parse errors and defaulting to an empty body, which can trigger an unintended rollback despite an invalid client payload. This is directly visible in the changed rollback handler logic, but I did not run the endpoint to reproduce behavior end-to-end.

Rollback endpoint swallowed JSON parse errors as empty object, allowing malformed requests to execute rollback with default role selection. Now returns 400 for invalid JSON on this state-mutating admin endpoint. Co-authored-by: Aaf2tbz <averyfelts@aol.com>

NicholaiVogel · 2026-04-14T08:30:40Z

avery, thank you for putting this together. there’s a lot of thoughtful work in here, and i can tell you really chased down the edge cases. i do think it’s a useful implementation, and i’m glad you explored it.

i think my hesitation is mostly about timing and overlap, not that i think this is bad work.

with #462 already in flight, this feels like it starts solving some of the same area from a different angle, and i worry merging this now would just add more moving parts right before we’re trying to simplify that whole provider/config path. i think it would probably slow #462 down even more, and then we’d likely end up reshaping parts of this again afterward anyway.

so my preference would be to leave this open for now, get #462 merged first, and then come back to this idea once that foundation is in place. i still think there’s useful stuff here, and i’d expect we can absolutely pull pieces from this PR when we revisit it.

so this is more “not yet” than “no”. i appreciate the work here a lot.

aaf2tbz marked this pull request as ready for review April 12, 2026 21:17