From 28aac0dc38afa5950acdd4b66cb3f7dd04e0c02a Mon Sep 17 00:00:00 2001 From: user <2662304+ktsaou@users.noreply.github.com> Date: Mon, 8 Jun 2026 00:14:20 +0300 Subject: [PATCH 1/5] Complete SOW-0055 ingest write-model cleanup Reduce ingest write-model complexity with behavior-preserving helper extraction and characterization coverage. Move SOW-0055 to completed and track residual writer file-size debt in SOW-0060. --- ...-ingest-write-model-residual-complexity.md | 286 +++++++++++ ...-ingest-write-model-residual-complexity.md | 150 ------ ...-0060-20260607-ingest-writer-file-split.md | 166 +++++++ internal/ingest/catalog_migrate.go | 235 ++++++--- internal/ingest/resolver.go | 62 ++- internal/ingest/rollup_backfill.go | 65 ++- internal/ingest/rollup_refresh.go | 126 +++-- .../sow55_characterization_catalog_test.go | 115 +++++ .../sow55_characterization_helpers_test.go | 218 +++++++++ .../sow55_characterization_resolver_test.go | 130 +++++ .../sow55_characterization_rollup_test.go | 124 +++++ .../sow55_characterization_writer_test.go | 153 ++++++ internal/ingest/writer.go | 452 +++++++++++++----- 13 files changed, 1866 insertions(+), 416 deletions(-) create mode 100644 .agents/sow/done/SOW-0055-20260607-ingest-write-model-residual-complexity.md delete mode 100644 .agents/sow/pending/SOW-0055-20260607-ingest-write-model-residual-complexity.md create mode 100644 .agents/sow/pending/SOW-0060-20260607-ingest-writer-file-split.md create mode 100644 internal/ingest/sow55_characterization_catalog_test.go create mode 100644 internal/ingest/sow55_characterization_helpers_test.go create mode 100644 internal/ingest/sow55_characterization_resolver_test.go create mode 100644 internal/ingest/sow55_characterization_rollup_test.go create mode 100644 internal/ingest/sow55_characterization_writer_test.go diff --git a/.agents/sow/done/SOW-0055-20260607-ingest-write-model-residual-complexity.md b/.agents/sow/done/SOW-0055-20260607-ingest-write-model-residual-complexity.md new file mode 100644 index 0000000..7aed2e4 --- /dev/null +++ b/.agents/sow/done/SOW-0055-20260607-ingest-write-model-residual-complexity.md @@ -0,0 +1,286 @@ +# SOW-0055 - Ingest Write-Model Residual Complexity Reduction + +## Status + +Status: completed + +Sub-state: local gates green; external review converged; ready to move to +`done/`. + +## Requirements + +### Purpose + +Reduce or justify remaining ingest writer, catalog migration, rollup, backfill, +and resolver complexity while preserving persisted canonical rows, rollups, +catalog totals, FTS state, and resolver behavior. + +### User Request + +Continue backend maintainability cleanup autonomously without weakening data +integrity, tests, benchmarks, or security posture. + +### Assistant Understanding + +Facts: + +- SOW-0050 removed the worker/notify warnings but left 9 declared backend-scope + ingest warnings in writer, catalog migration, rollup refresh/backfill, and + resolver paths. +- These paths own canonical row application, rollup materialization, + catalog-total migration, and orphan parent-link repair. +- SOW-0050 added characterization coverage for worker shutdown and idle-refresh + transaction boundaries that this SOW must preserve. + +Inferences: + +- The writer/catalog/rollup warnings are higher data-integrity risk than + store/pricing/CLI helpers and should stay in a dedicated SOW. + +Unknowns: + +- Some anonymous warning locations in `writer.go` may be dense inline SQL/helper + closures; each needs file/line review before deciding to refactor or justify. + +### Acceptance Criteria + +- Remaining ingest warnings are ranked by data-integrity and performance risk. +- Every selected refactor has characterization tests before implementation. +- Rollup, catalog, source-progress, FTS, resolver, and idempotency behavior + remain unchanged. +- Hot-path changes include benchmark evidence. +- Remaining warnings are justified or split further. +- Full gates and external review converge before completion. + +## Analysis + +Sources checked: + +- SOW-0050 declared backend/CLI strict Lizard scan after the worker/notify slice. + +Current state: + +- Worker orchestration is decomposed. Remaining ingest complexity is inside the + write model and repair/rollup helpers. + +Risks: + +- Ingest regressions can corrupt persisted rows, undercount rollups, drift + catalog totals, break FTS search, or lose parent/child lineage. + +## Pre-Implementation Gate + +Status: completed for implementation. + +Problem / root-cause model: + +- The ingest write model contains several semantically dense functions that + encode transaction-local state, carry-forward rollup behavior, and migration + math. Splitting must preserve exact data contracts. + +Evidence reviewed: + +- SOW-0050 warning inventory: + `internal/ingest/catalog_migrate.go`, `rollup_backfill.go`, + `rollup_refresh.go`, `writer.go`, and `resolver.go`. +- Current strict warning inventory: + `lizard -l go -C 8 -L 50 -a 8 -w internal/ingest/catalog_migrate.go internal/ingest/rollup_backfill.go internal/ingest/rollup_refresh.go internal/ingest/writer.go internal/ingest/resolver.go` + reports exactly 9 warnings: + - `internal/ingest/catalog_migrate.go:140` `removeOpContribution`. + - `internal/ingest/catalog_migrate.go:208` `addMigratedTotals`. + - `internal/ingest/rollup_backfill.go:38` `BackfillRollups`. + - `internal/ingest/rollup_refresh.go:51` `refreshRollups`. + - `internal/ingest/writer.go:420` `apply`. + - `internal/ingest/writer.go:639` `markSessionRollupBucketsDirty`. + - `internal/ingest/writer.go:754` `applyOpStarted`. + - `internal/ingest/writer.go:1290` `applyLogEntry`. + - `internal/ingest/resolver.go:90` `linkOrphans`. +- The `writer.go` warnings previously described as anonymous are Lizard display + labels for named receiver methods, not true anonymous functions. +- Focused coverage command: + `go test -coverprofile=/tmp/cov.out ./internal/ingest` plus + `go tool cover -func=/tmp/cov.out`. + Current relevant coverage includes `writer.apply` 100%, + `writer.applyLogEntry` 92.6%, `writer.applyOpStarted` 85.3%, + `catalog_migrate.removeOpContribution` 92.3%, + `rollup_refresh.refreshRollups` 92.3%, and + `rollup_backfill.BackfillRollups` 80.0%. + +Affected contracts and surfaces: + +- SQLite canonical tables, rollup tables, catalog totals, source progress, FTS, + resolver updates, health visibility for ingest errors, and benchmarked ingest + write paths. + +Existing patterns to reuse: + +- Existing writer, catalog, rollup, resolver, and benchmark tests. +- SOW-0047/SOW-0050 helper extraction style with transaction order preserved. +- SOW-0054 helper extraction style: package-local helpers, public contracts + unchanged, SQL ordering unchanged, and tests covering extracted behavior. + +Risk and blast radius: + +- High within ingest data integrity; no REST/frontend/source-adapter behavior + change is expected. + +Sensitive data handling plan: + +- Use synthetic events and committed sanitized fixtures only. Do not add raw + prompts, tool output, source IDs, private paths, secrets, personal data, or + private endpoints. + +Spec deltas to land before tests/code: + +- No runtime contract change is targeted. The implementation must preserve the + current `ingester.md`, `data-model.md`, and `sse-protocol.md` contracts. +- `.agents/sow/specs/ingester.md`: unchanged. It already records the batch + write order, rollup refresh order, FTS refresh behavior, source-progress + persistence, notify atomicity, post-commit promotion, resolver pass, and + shutdown-drain contracts this SOW must preserve. +- `.agents/sow/specs/data-model.md`: unchanged. It already records catalog, + rollup, log, payload, source-progress, and FTS table semantics. +- `.agents/sow/specs/sse-protocol.md`: unchanged. Resolver linkage and notify + rows must still commit atomically so open subscribers refetch changed sessions. +- Characterization tests added by this SOW are expected to pass before and after + implementation because the target behavior already exists; they are regression + pins for behavior-preserving decomposition, not new-feature tests. + +Implementation plan: + +1. Add characterization tests before production refactors: + - catalog migration: cascading identity correction and explicit cost movement; + - writer op-start: nil-extras re-emit preserving only resolver stash; + - log-entry FTS: duplicate replay does not create duplicate `fts_logs`; + - rollup refresh: carried hour/day buckets remain independent across + multiple refreshes; + - resolver: linkage failure rolls back without notify rows. +2. Refactor catalog migration helpers so remove/add migration paths share small + table-specific helper functions while preserving exact SQL column sets and + key predicates. +3. Refactor `BackfillRollups` and `refreshRollups` with orchestration/logging + and shared dirty-bucket materialization helpers. Keep the open-bucket and + carried-set rules unchanged. +4. Refactor writer helpers: + - replace the event-kind dispatch switch with a package-local dispatch table + or equivalent small helpers that preserve unknown-kind errors and + `batchMaxSeq` updates; + - split `markSessionRollupBucketsDirty` cursor draining from in-memory + marking while preserving single-connection cursor-drain discipline; + - split `applyOpStarted` into preparation, SQL upsert, and post-upsert + marking/catalog calls without changing the SQL semantics; + - split `applyLogEntry` reference resolution and FTS indexing while + preserving `INSERT ... RETURNING id` conflict behavior. +5. Refactor `linkOrphans` with a transaction runner and link-step sequence only + if strict complexity remains. Preserve the single transaction containing all + linkage updates and notify rows. +6. Re-run strict Lizard on all target files. Remaining warnings are acceptable + only if explicitly justified in this SOW and not reported by local Codacy on + changed files. +7. Validate focused tests, race tests, local Codacy, benchmark gate, full gates, + and external review. + +Validation plan: + +- Pre-code characterization: + `go test ./internal/ingest -run 'TestCatalog_|TestWriter_|TestRefreshRollups_|TestResolver_' -count=1`. +- Focused package: + `go test ./internal/ingest -count=1`. +- Focused race: + `go test -race ./internal/ingest -count=1`. +- Strict target complexity: + `lizard -l go -C 8 -L 50 -a 8 -w internal/ingest/catalog_migrate.go internal/ingest/rollup_backfill.go internal/ingest/rollup_refresh.go internal/ingest/writer.go internal/ingest/resolver.go`. +- Local Codacy on changed files. +- Benchmark gate because writer/rollup hot paths are touched: + `scripts/check-bench.sh`. +- Full local aggregate: + `./scripts/gates.sh`. +- External second-opinion review on the final staged state until convergence. + +Artifact impact plan: + +- Specs: `ingester.md`, `data-model.md`, `pricing.md`, and related specs only + if behavior contracts change; otherwise record unchanged attestations. +- Runtime project skills: likely unaffected unless a new ingest decomposition + pattern emerges. +- End-user docs: likely unaffected. +- SOW lifecycle: move to `current/` when activated. + +Open-source reference evidence: + +- This is internal data-model maintainability work. External references are + required only if a selected slice changes a source-format or protocol claim. + +Open decisions: + +- None for the operator. + +## Outcome + +Completed. + +Implemented changes: + +- Added 9 characterization tests covering catalog migration cascades, LLM cost + migration, op-start re-emits, log-entry FTS idempotency and gating, rollup + carried-set behavior, and resolver transaction atomicity. +- Split the characterization tests across small SOW-local files so the tests do + not add new strict complexity debt. +- Refactored the 9 targeted ingest write-model warnings: + - catalog migration totals now use table-specific helpers with unchanged SQL + columns, predicates, and bind order. + - rollup backfill and refresh orchestration now use smaller pass/helper + functions while preserving open-bucket carry-forward behavior. + - writer event dispatch, session rollup marking, op-start writes, and + log-entry/FTS writes now use smaller helpers with the same state updates and + error paths. + - resolver orphan linkage now runs through a transaction helper and ordered + step list while preserving the single transaction containing linkage and + notify rows. + +Spec outcome: + +- No runtime spec changed. The implementation preserved the existing + `ingester.md`, `data-model.md`, and `sse-protocol.md` contracts listed in the + Pre-Implementation Gate. + +Validation evidence: + +- `gofmt -l` on touched Go files: clean. +- `git diff --check`: clean. +- Strict target complexity: + `lizard -l go -C 8 -L 50 -a 8 -w internal/ingest/catalog_migrate.go internal/ingest/rollup_backfill.go internal/ingest/rollup_refresh.go internal/ingest/writer.go internal/ingest/resolver.go` + produced no warnings. +- Strict complexity on the SOW-0055 characterization tests produced no warnings. +- `go test ./internal/ingest -run 'TestCatalog_|TestWriter_|TestRefreshRollups_|TestResolver_' -count=1`: pass. +- `go test ./internal/ingest -count=1`: pass. +- `go test -race ./internal/ingest -count=1 -timeout=10m`: pass. +- `golangci-lint run --timeout=5m ./internal/ingest/...`: pass. +- Local Codacy on changed files: Semgrep clean, Trivy clean, and all function + complexity findings cleared. One Lizard file-length finding remains for + `internal/ingest/writer.go` at 1048 LOC; this is pre-existing residual debt + because `HEAD:internal/ingest/writer.go` had 1529 physical lines before this + SOW. It is tracked separately in SOW-0060. +- `./scripts/gates.sh`: pass in 809s. Covered lint, standalone security tools, + secrets scan, attribution scan, spec drift, build, bundle-size gate, + benchmark regression gate, Go race/coverage tests, adapter fuzz seed corpus, + frontend unit tests, Playwright E2E, and accessibility checks. + +Reviews: + +- External reviewer 1: no actionable correctness, security, race, or behavioral + findings. Noted only non-blocking residual risks: future resolver-step + allocation and the pre-existing writer file-length debt. +- External reviewer 2: no actionable findings. Noted that the SOW closeout + still needed final evidence; this section addresses that. +- One approved reviewer attempt failed before producing a review because the + provider quota was exhausted. It was not counted. +- Replacement external reviewer 3: no actionable correctness, security, race, + performance, test-coverage, or separation-of-concerns findings. Confirmed the + only remaining gap was this SOW outcome update. + +Residual debt: + +- `internal/ingest/writer.go` is still too large for Codacy/Lizard's file-NLOC + threshold even after this SOW reduced it from its previous size and cleared the + targeted function warnings. SOW-0060 tracks a behavior-preserving file split. diff --git a/.agents/sow/pending/SOW-0055-20260607-ingest-write-model-residual-complexity.md b/.agents/sow/pending/SOW-0055-20260607-ingest-write-model-residual-complexity.md deleted file mode 100644 index 8d14be8..0000000 --- a/.agents/sow/pending/SOW-0055-20260607-ingest-write-model-residual-complexity.md +++ /dev/null @@ -1,150 +0,0 @@ -# SOW-0055 - Ingest Write-Model Residual Complexity Reduction - -## Status - -Status: open - -Sub-state: split from SOW-0050 residual backend scan. Not active yet. - -## Requirements - -### Purpose - -Reduce or justify remaining ingest writer, catalog migration, rollup, backfill, -and resolver complexity while preserving persisted canonical rows, rollups, -catalog totals, FTS state, and resolver behavior. - -### User Request - -Continue backend maintainability cleanup autonomously without weakening data -integrity, tests, benchmarks, or security posture. - -### Assistant Understanding - -Facts: - -- SOW-0050 removed the worker/notify warnings but left 9 declared backend-scope - ingest warnings in writer, catalog migration, rollup refresh/backfill, and - resolver paths. -- These paths own canonical row application, rollup materialization, - catalog-total migration, and orphan parent-link repair. -- SOW-0050 added characterization coverage for worker shutdown and idle-refresh - transaction boundaries that this SOW must preserve. - -Inferences: - -- The writer/catalog/rollup warnings are higher data-integrity risk than - store/pricing/CLI helpers and should stay in a dedicated SOW. - -Unknowns: - -- Some anonymous warning locations in `writer.go` may be dense inline SQL/helper - closures; each needs file/line review before deciding to refactor or justify. - -### Acceptance Criteria - -- Remaining ingest warnings are ranked by data-integrity and performance risk. -- Every selected refactor has characterization tests before implementation. -- Rollup, catalog, source-progress, FTS, resolver, and idempotency behavior - remain unchanged. -- Hot-path changes include benchmark evidence. -- Remaining warnings are justified or split further. -- Full gates and external review converge before completion. - -## Analysis - -Sources checked: - -- SOW-0050 declared backend/CLI strict Lizard scan after the worker/notify slice. - -Current state: - -- Worker orchestration is decomposed. Remaining ingest complexity is inside the - write model and repair/rollup helpers. - -Risks: - -- Ingest regressions can corrupt persisted rows, undercount rollups, drift - catalog totals, break FTS search, or lose parent/child lineage. - -## Pre-Implementation Gate - -Status: ready for future activation. - -Problem / root-cause model: - -- The ingest write model contains several semantically dense functions that - encode transaction-local state, carry-forward rollup behavior, and migration - math. Splitting must preserve exact data contracts. - -Evidence reviewed: - -- SOW-0050 warning inventory: - `internal/ingest/catalog_migrate.go`, `rollup_backfill.go`, - `rollup_refresh.go`, `writer.go`, and `resolver.go`. - -Affected contracts and surfaces: - -- SQLite canonical tables, rollup tables, catalog totals, source progress, FTS, - resolver updates, health visibility for ingest errors, and benchmarked ingest - write paths. - -Existing patterns to reuse: - -- Existing writer, catalog, rollup, resolver, and benchmark tests. -- SOW-0047/SOW-0050 helper extraction style with transaction order preserved. - -Risk and blast radius: - -- High within ingest data integrity; no REST/frontend/source-adapter behavior - change is expected. - -Sensitive data handling plan: - -- Use synthetic events and committed sanitized fixtures only. Do not add raw - prompts, tool output, source IDs, private paths, secrets, personal data, or - private endpoints. - -Implementation plan: - -1. Audit each residual ingest warning and current coverage. -2. Add characterization tests before touching any transaction/carry-forward - boundary. -3. Refactor one package-local slice at a time. -4. Keep SQL ordering, transaction boundaries, post-commit promotions, and error - surfacing unchanged. -5. Validate focused tests, race tests, strict Lizard, local Codacy, benchmark - gate when hot paths change, full gates, and external review. - -Validation plan: - -- Focused ingest tests selected after coverage audit. -- `go test ./internal/ingest -count=1` -- `go test -race ./internal/ingest -count=1` -- `scripts/check-bench.sh` for writer/rollup/backfill hot-path changes. -- Direct strict Lizard on changed files. -- Local Codacy analysis on changed files. -- Full `./scripts/gates.sh`. -- External second-opinion review until convergence. - -Artifact impact plan: - -- Specs: `ingester.md`, `data-model.md`, `pricing.md`, and related specs only - if behavior contracts change; otherwise record unchanged attestations. -- Runtime project skills: likely unaffected unless a new ingest decomposition - pattern emerges. -- End-user docs: likely unaffected. -- SOW lifecycle: move to `current/` when activated. - -Open-source reference evidence: - -- This is internal data-model maintainability work. External references are - required only if a selected slice changes a source-format or protocol claim. - -Open decisions: - -- None for the operator. - -## Outcome - -Pending. diff --git a/.agents/sow/pending/SOW-0060-20260607-ingest-writer-file-split.md b/.agents/sow/pending/SOW-0060-20260607-ingest-writer-file-split.md new file mode 100644 index 0000000..eca2a82 --- /dev/null +++ b/.agents/sow/pending/SOW-0060-20260607-ingest-writer-file-split.md @@ -0,0 +1,166 @@ +# SOW-0060 - Ingest Writer File Split + +## Status + +Status: open + +Sub-state: split from SOW-0055 local Codacy closeout. Not active yet. + +## Requirements + +### Purpose + +Split the oversized ingest writer implementation into small package-local files +while preserving every persisted canonical row, catalog update, rollup dirty +mark, FTS write, source-progress update, pricing-miss side effect, and notify +contract. + +### User Request + +Continue backend maintainability cleanup autonomously without weakening data +integrity, tests, benchmarks, security posture, or the read-only source +contract. + +### Assistant Understanding + +Facts: + +- SOW-0055 cleared the targeted strict function-complexity warnings in + `internal/ingest/writer.go` and related ingest files. +- Local Codacy still reports one Lizard file-length finding for + `internal/ingest/writer.go`: 1048 lines of code with a threshold of 500. +- The file-length issue is pre-existing residual debt. Before SOW-0055, + `HEAD:internal/ingest/writer.go` had 1529 physical lines. +- The writer owns transaction-local application of canonical events and is a + high data-integrity surface. + +Inferences: + +- This should be a file-boundary refactor, not a behavior refactor. The safest + target shape is multiple package-local files grouped by writer responsibility: + dispatch/session/turn/op/log/source-progress/pricing helpers. + +Unknowns: + +- The final file split boundaries must be chosen after reading the current + writer helper graph to avoid circular helper movement or less readable + locality. + +### Acceptance Criteria + +- `internal/ingest/writer.go` and every new writer-related file stay below the + Codacy/Lizard file-length threshold, unless a remaining exception is + explicitly justified here. +- No runtime behavior changes. +- Existing SOW-0055 characterization tests remain unchanged or are only renamed + mechanically. +- Strict Lizard and local Codacy on changed writer files report no new function + or file-size findings. +- Full gates and external review converge before completion. + +## Analysis + +Sources checked: + +- SOW-0055 local Codacy output on changed files. +- SOW-0055 strict Lizard output on target production and characterization test + files. + +Current state: + +- Function-level complexity in the target writer paths is cleared. +- File-level size remains above the Codacy/Lizard threshold. + +Risks: + +- Moving helpers can hide transaction ordering mistakes if edits become more + than mechanical. +- Splitting by event kind can separate shared helper state too aggressively and + reduce readability. +- Package-local names can collide if the split is careless. + +## Pre-Implementation Gate + +Status: ready for future activation. + +Problem / root-cause model: + +- `writer.go` accumulated multiple event families and helper groups in one file. + SOW-0055 reduced individual function complexity, but the file still violates + file-size maintainability gates. The next cleanup should move cohesive helper + groups into package-local files without changing method receivers, SQL, or + transaction boundaries. + +Evidence reviewed: + +- Local Codacy reported: + `internal/ingest/writer.go:1 File has 1048 lines of code (threshold: 500)`. +- Baseline evidence captured during SOW-0055 before its commit: + `git show HEAD:internal/ingest/writer.go | wc -l` returned 1529 physical + lines for the pre-SOW-0055 version. + +Affected contracts and surfaces: + +- SQLite canonical tables, writer transaction order, catalog totals, rollup + dirty sets, source progress, FTS, pricing-miss emission, notify rows, and + ingest error surfacing. + +Existing patterns to reuse: + +- SOW-0055 package-local helper extraction style. +- Existing writer tests plus SOW-0055 characterization tests. +- No public API changes; all split files remain in package `ingest`. + +Risk and blast radius: + +- Medium within ingest maintainability. Runtime blast radius should remain low + if the split is mechanical and tests/gates prove behavior preservation. + +Sensitive data handling plan: + +- No fixtures or real source snapshots are required. Do not add raw prompts, + tool output, source IDs, private paths, secrets, personal data, or private + endpoints. + +Spec deltas to land before tests/code: + +- No runtime spec change is expected. If the file split reveals a documented + writer-order mismatch, update the relevant spec before tests/code. + +Implementation plan: + +1. Read the current writer helper graph and choose cohesive file boundaries. +2. Move helpers mechanically, preserving package, receiver methods, comments, + SQL text, and call order. +3. Run `gofmt`, focused ingest tests, strict Lizard, and local Codacy after each + movement slice. +4. Run full gates and external review on the final split. + +Validation plan: + +- `go test ./internal/ingest -count=1` +- `go test -race ./internal/ingest -count=1 -timeout=10m` +- Strict Lizard on all writer-related files. +- Local Codacy on changed writer-related files. +- `./scripts/gates.sh` +- External second-opinion review until convergence. + +Artifact impact plan: + +- Specs: unchanged unless runtime contracts are corrected. +- Runtime skills: update only if a durable writer file-split pattern emerges. +- End-user docs: no change expected. +- SOW lifecycle: move to `current/` when activated. + +Open-source reference evidence: + +- This is internal file-boundary maintainability work. External references are + not required unless a runtime contract changes. + +Open decisions: + +- None for the operator. + +## Outcome + +Pending. diff --git a/internal/ingest/catalog_migrate.go b/internal/ingest/catalog_migrate.go index cf3da9e..7342a5c 100644 --- a/internal/ingest/catalog_migrate.go +++ b/internal/ingest/catalog_migrate.go @@ -137,60 +137,21 @@ func normalizeToolNamespace(ns string) string { // MAX-based, not summed, so it is intentionally NOT decremented — a stale seed on // an emptied row is harmless (no op references it) and re-derives on the next // observation. +// +// The per-kind switch dispatches to small table-specific helpers — one helper +// per (table, direction) — that own the SQL exec for that table only. Each +// helper preserves the exact column set and WHERE predicate of the original +// inline UPDATE; the dispatch carries the per-call sign convention (-1 for +// remove, +1 for add) through the migrateDir constant. func (c *catalogWriter) removeOpContribution(ctx context.Context, tx *sql.Tx, prior priorOpIdentity) error { - t := prior.totals - failure := failureInc(t.status) switch prior.kind { case string(canonical.OpLLM): - if prior.provider != "" { - if _, err := tx.ExecContext(ctx, ` -UPDATE catalog_providers SET - call_count = call_count - 1, - failure_count = failure_count - ?, - total_tokens_in = total_tokens_in - ?, - total_tokens_out = total_tokens_out - ?, - total_tokens_cache_read = total_tokens_cache_read - ?, - total_tokens_cache_write = total_tokens_cache_write - ?, - total_cost_usd = total_cost_usd - ? -WHERE name = ? AND alias = ? -`, failure, t.tokensIn, t.tokensOut, t.tokensCacheRead, t.tokensCacheWrite, t.costUSD, - prior.provider, prior.providerAlias); err != nil { - return fmt.Errorf("catalog_providers migrate-out: %w", err) - } - } - if prior.provider != "" && prior.model != "" { - if _, err := tx.ExecContext(ctx, ` -UPDATE catalog_models SET - call_count = call_count - 1, - failure_count = failure_count - ?, - total_tokens_in = total_tokens_in - ?, - total_tokens_out = total_tokens_out - ?, - total_tokens_cache_read = total_tokens_cache_read - ?, - total_tokens_cache_write = total_tokens_cache_write - ?, - total_cost_usd = total_cost_usd - ?, - total_duration_us = total_duration_us - ? -WHERE provider = ? AND name = ? -`, failure, t.tokensIn, t.tokensOut, t.tokensCacheRead, t.tokensCacheWrite, t.costUSD, t.durationUS, - prior.provider, prior.model); err != nil { - return fmt.Errorf("catalog_models migrate-out: %w", err) - } + if err := migrateProviderTotals(ctx, tx, prior.provider, prior.providerAlias, prior.totals, migrateOut); err != nil { + return err } + return migrateModelTotals(ctx, tx, prior.provider, prior.model, prior.totals, migrateOut) case string(canonical.OpTool): - if prior.name != "" { - if _, err := tx.ExecContext(ctx, ` -UPDATE catalog_tools SET - call_count = call_count - 1, - failure_count = failure_count - ?, - total_tokens_in = total_tokens_in - ?, - total_tokens_out = total_tokens_out - ?, - total_cost_usd = total_cost_usd - ?, - total_duration_us = total_duration_us - ? -WHERE namespace = ? AND name = ? -`, failure, t.tokensIn, t.tokensOut, t.costUSD, t.durationUS, - normalizeToolNamespace(prior.toolNamespace), prior.name); err != nil { - return fmt.Errorf("catalog_tools migrate-out: %w", err) - } - } + return migrateToolTotals(ctx, tx, prior.toolNamespace, prior.name, prior.totals, migrateOut) } return nil } @@ -205,17 +166,62 @@ WHERE namespace = ? AND name = ? // OpFinalized (this UPDATE only moves accumulating totals). The destination key is // the EFFECTIVE post-upsert identity so it always matches the ops row and the key // the call_count upsert booked under (SOW-0004 I1 shared-ingest edge). +// +// The per-kind switch dispatches to the SAME per-table helpers as +// removeOpContribution, passing migrateIn so the sign is +1 and call_count is +// NOT touched (the OpStarted upsert already booked the +1 under the new key). func (c *catalogWriter) addMigratedTotals(ctx context.Context, tx *sql.Tx, eff effectiveCatalogIdentity, t opPriorTotals) error { // t is the prior PERSISTED contribution; the caller only reaches here when the // op row already existed (prior.found), so t.found is always true. An op started // but never finalized has status="running" (failureInc 0) and zero tokens/cost/ // duration, so the adds below move nothing meaningful — only the call_count the // OpStarted upsert already re-booked under the new key matters in that case. - failure := failureInc(t.status) switch eff.kind { case string(canonical.OpLLM): - if eff.provider != "" { - if _, err := tx.ExecContext(ctx, ` + if err := migrateProviderTotals(ctx, tx, eff.provider, eff.providerAlias, t, migrateIn); err != nil { + return err + } + return migrateModelTotals(ctx, tx, eff.provider, eff.model, t, migrateIn) + case string(canonical.OpTool): + return migrateToolTotals(ctx, tx, eff.toolNamespace, eff.name, t, migrateIn) + } + return nil +} + +// migrateDir is the per-direction sign for catalog total migration: -1 drains a +// prior catalog key (removeOpContribution), +1 books totals onto a new key +// (addMigratedTotals). The helpers below pick their SQL template based on it so +// the column sets stay identical across directions while the sign and the +// call_count handling diverge per direction. +type migrateDir int + +const ( + migrateOut migrateDir = -1 + migrateIn migrateDir = 1 +) + +// providerTotalsOutSQL drains an op's prior contribution from catalog_providers +// (the migrateOut path). The column set mirrors onOpFinalized's provider +// branch exactly and matches providerTotalsInSQL except for the sign + the +// call_count update (call_count is only decremented on the way OUT — the +// OpStarted upsert re-books +1 under the new key, so the migrateIn path does +// NOT touch call_count). +const providerTotalsOutSQL = ` +UPDATE catalog_providers SET + call_count = call_count - 1, + failure_count = failure_count - ?, + total_tokens_in = total_tokens_in - ?, + total_tokens_out = total_tokens_out - ?, + total_tokens_cache_read = total_tokens_cache_read - ?, + total_tokens_cache_write = total_tokens_cache_write - ?, + total_cost_usd = total_cost_usd - ? +WHERE name = ? AND alias = ? +` + +// providerTotalsInSQL re-books an op's prior contribution onto a new +// catalog_providers key (the migrateIn path). Same columns as the OUT SQL but +// with addition and NO call_count touch. +const providerTotalsInSQL = ` UPDATE catalog_providers SET failure_count = failure_count + ?, total_tokens_in = total_tokens_in + ?, @@ -224,13 +230,50 @@ UPDATE catalog_providers SET total_tokens_cache_write = total_tokens_cache_write + ?, total_cost_usd = total_cost_usd + ? WHERE name = ? AND alias = ? -`, failure, t.tokensIn, t.tokensOut, t.tokensCacheRead, t.tokensCacheWrite, t.costUSD, - eff.provider, eff.providerAlias); err != nil { - return fmt.Errorf("catalog_providers migrate-in: %w", err) - } - } - if eff.provider != "" && eff.model != "" { - if _, err := tx.ExecContext(ctx, ` +` + +// migrateProviderTotals applies one direction of the catalog_providers +// migration. The empty-provider guard mirrors the original inline code — an +// LLM op with no recorded provider never contributed to catalog_providers, so +// there is nothing to migrate either way. +func migrateProviderTotals(ctx context.Context, tx *sql.Tx, provider, alias string, t opPriorTotals, dir migrateDir) error { + if provider == "" { + return nil + } + stmt := providerTotalsOutSQL + verb := "migrate-out" + if dir == migrateIn { + stmt = providerTotalsInSQL + verb = "migrate-in" + } + failure := failureInc(t.status) + if _, err := tx.ExecContext(ctx, stmt, + failure, t.tokensIn, t.tokensOut, t.tokensCacheRead, t.tokensCacheWrite, t.costUSD, + provider, alias); err != nil { + return fmt.Errorf("catalog_providers %s: %w", verb, err) + } + return nil +} + +// modelTotalsOutSQL drains an op's prior contribution from catalog_models +// (the migrateOut path). Columns mirror onOpFinalized's model branch exactly +// (provider+model PK, includes total_duration_us). +const modelTotalsOutSQL = ` +UPDATE catalog_models SET + call_count = call_count - 1, + failure_count = failure_count - ?, + total_tokens_in = total_tokens_in - ?, + total_tokens_out = total_tokens_out - ?, + total_tokens_cache_read = total_tokens_cache_read - ?, + total_tokens_cache_write = total_tokens_cache_write - ?, + total_cost_usd = total_cost_usd - ?, + total_duration_us = total_duration_us - ? +WHERE provider = ? AND name = ? +` + +// modelTotalsInSQL re-books an op's prior contribution onto a new +// catalog_models key (the migrateIn path). +const modelTotalsInSQL = ` UPDATE catalog_models SET failure_count = failure_count + ?, total_tokens_in = total_tokens_in + ?, @@ -240,14 +283,47 @@ UPDATE catalog_models SET total_cost_usd = total_cost_usd + ?, total_duration_us = total_duration_us + ? WHERE provider = ? AND name = ? -`, failure, t.tokensIn, t.tokensOut, t.tokensCacheRead, t.tokensCacheWrite, t.costUSD, t.durationUS, - eff.provider, eff.model); err != nil { - return fmt.Errorf("catalog_models migrate-in: %w", err) - } - } - case string(canonical.OpTool): - if eff.name != "" { - if _, err := tx.ExecContext(ctx, ` +` + +// migrateModelTotals applies one direction of the catalog_models migration. +// An op with no provider OR no model never contributed to catalog_models (PK +// requires both), so the guard short-circuits the same way the inline code did. +func migrateModelTotals(ctx context.Context, tx *sql.Tx, provider, model string, t opPriorTotals, dir migrateDir) error { + if provider == "" || model == "" { + return nil + } + stmt := modelTotalsOutSQL + verb := "migrate-out" + if dir == migrateIn { + stmt = modelTotalsInSQL + verb = "migrate-in" + } + failure := failureInc(t.status) + if _, err := tx.ExecContext(ctx, stmt, + failure, t.tokensIn, t.tokensOut, t.tokensCacheRead, t.tokensCacheWrite, t.costUSD, t.durationUS, + provider, model); err != nil { + return fmt.Errorf("catalog_models %s: %w", verb, err) + } + return nil +} + +// toolTotalsOutSQL drains an op's prior contribution from catalog_tools (the +// migrateOut path). Tools carry no cache split; the column set mirrors +// onOpFinalized's tool branch exactly. +const toolTotalsOutSQL = ` +UPDATE catalog_tools SET + call_count = call_count - 1, + failure_count = failure_count - ?, + total_tokens_in = total_tokens_in - ?, + total_tokens_out = total_tokens_out - ?, + total_cost_usd = total_cost_usd - ?, + total_duration_us = total_duration_us - ? +WHERE namespace = ? AND name = ? +` + +// toolTotalsInSQL re-books an op's prior contribution onto a new catalog_tools +// key (the migrateIn path). +const toolTotalsInSQL = ` UPDATE catalog_tools SET failure_count = failure_count + ?, total_tokens_in = total_tokens_in + ?, @@ -255,11 +331,26 @@ UPDATE catalog_tools SET total_cost_usd = total_cost_usd + ?, total_duration_us = total_duration_us + ? WHERE namespace = ? AND name = ? -`, failure, t.tokensIn, t.tokensOut, t.costUSD, t.durationUS, - normalizeToolNamespace(eff.toolNamespace), eff.name); err != nil { - return fmt.Errorf("catalog_tools migrate-in: %w", err) - } - } +` + +// migrateToolTotals applies one direction of the catalog_tools migration. +// Tool namespace is folded through normalizeToolNamespace exactly as the inline +// code did, so the keyed row matches the one onOpStarted booked. +func migrateToolTotals(ctx context.Context, tx *sql.Tx, namespace, name string, t opPriorTotals, dir migrateDir) error { + if name == "" { + return nil + } + stmt := toolTotalsOutSQL + verb := "migrate-out" + if dir == migrateIn { + stmt = toolTotalsInSQL + verb = "migrate-in" + } + failure := failureInc(t.status) + if _, err := tx.ExecContext(ctx, stmt, + failure, t.tokensIn, t.tokensOut, t.costUSD, t.durationUS, + normalizeToolNamespace(namespace), name); err != nil { + return fmt.Errorf("catalog_tools %s: %w", verb, err) } return nil } diff --git a/internal/ingest/resolver.go b/internal/ingest/resolver.go index 424a9f0..0f152b5 100644 --- a/internal/ingest/resolver.go +++ b/internal/ingest/resolver.go @@ -79,7 +79,7 @@ func (r *resolver) Stop() { // emit are absent; without them the SSE contract (sse-protocol.md // §session_changed) is silently broken for child-first ingestions. // -// Everything runs in ONE transaction so the two UPDATEs and the notify rows +// Everything runs in ONE transaction so the link UPDATEs and the notify rows // commit atomically: the serve poller can never observe a linkage without its // notification, nor a notification before the linked rows are visible. Each // UPDATE uses RETURNING to capture exactly the rows it changed (modernc.org/ @@ -87,10 +87,51 @@ func (r *resolver) Stop() { // changed child, its newly-linked parent, and its root. When nothing links, // the transaction makes no notify rows — an open poller is not spammed for a // no-op pass. +// +// The link passes are organized as a slice of named steps so the iteration +// stays trivial and the per-step rollback contract (any error rolls the WHOLE +// tx back) lives in one place; the resolverStep type captures both the SQL +// pass and its emit hook through the same signature. func (r *resolver) linkOrphans(ctx context.Context) error { if r.db == nil { return errors.New("resolver: nil db") } + return r.runResolverTx(ctx, func(ctx context.Context, tx *sql.Tx) error { + affected := map[string]struct{}{} + for _, step := range r.resolverSteps() { + if err := step(ctx, tx, affected); err != nil { + return err + } + } + return r.emitResolverNotify(ctx, tx, affected) + }) +} + +// resolverStep is one phase of a linkOrphans pass: it accumulates the session +// ids it changes into the shared affected set. All four link passes share this +// signature so the resolver loop is a uniform for-range over a step slice. +type resolverStep func(ctx context.Context, tx *sql.Tx, affected map[string]struct{}) error + +// resolverSteps returns the ordered list of link passes a linkOrphans run +// must execute. The order is significant: parent-link must run before root +// re-resolution so root_session_id can re-point at the just-linked parent in +// the same pass (see linkRoots's WHERE clause). Op-child passes follow the +// session passes because they read the now-resolved session tree. +func (r *resolver) resolverSteps() []resolverStep { + return []resolverStep{ + r.linkParents, + r.linkRoots, + r.linkOpChildren, + r.linkOpChildrenByToolUse, + } +} + +// runResolverTx wraps body in the resolver's single-tx contract: BeginTx, +// run body (which executes every link pass + the notify emit), Commit on +// success, Rollback on any error (including a successful body whose Commit +// fails). The defer/committed pattern is the same one the worker uses; it is +// extracted here so linkOrphans stays orchestration-only. +func (r *resolver) runResolverTx(ctx context.Context, body func(context.Context, *sql.Tx) error) error { tx, err := r.db.BeginTx(ctx, nil) if err != nil { return fmt.Errorf("resolver: begin tx: %w", err) @@ -101,24 +142,7 @@ func (r *resolver) linkOrphans(ctx context.Context) error { _ = tx.Rollback() } }() - - // affected accumulates the distinct session ids whose row is now part of - // a newly-resolved parent/child/root relationship and therefore needs a - // session_changed notification. - affected := map[string]struct{}{} - if err := r.linkParents(ctx, tx, affected); err != nil { - return err - } - if err := r.linkRoots(ctx, tx, affected); err != nil { - return err - } - if err := r.linkOpChildren(ctx, tx, affected); err != nil { - return err - } - if err := r.linkOpChildrenByToolUse(ctx, tx, affected); err != nil { - return err - } - if err := r.emitResolverNotify(ctx, tx, affected); err != nil { + if err := body(ctx, tx); err != nil { return err } if err := tx.Commit(); err != nil { diff --git a/internal/ingest/rollup_backfill.go b/internal/ingest/rollup_backfill.go index d4810d8..ad91975 100644 --- a/internal/ingest/rollup_backfill.go +++ b/internal/ingest/rollup_backfill.go @@ -35,15 +35,35 @@ type BackfillStats struct { // processed day-by-day, and each day's ops are read and written inside that // day's own transaction (the writer store pins a single connection, so a // read cursor must never straddle a separate write transaction). +// +// Orchestration is split out of the per-day fold + the post-run logging so +// each helper has one job; the helper sequence here is the durable contract. func BackfillRollups(ctx context.Context, db *sql.DB, now int64, logger *slog.Logger) (BackfillStats, error) { start := time.Now() + bf, days, err := prepareBackfill(ctx, db, now) + if err != nil { + return BackfillStats{}, err + } + if err := runBackfillDays(ctx, bf, days); err != nil { + return BackfillStats{}, err + } + stats := bf.stats(start) + logBackfillResult(logger, bf, stats) + return stats, nil +} + +// prepareBackfill clears the rollup tables, loads session starts, computes the +// sorted list of closed UTC days, and constructs the per-run backfiller state. +// Splitting this out of BackfillRollups makes the run a 3-call orchestration +// (prepare → run → log) so the per-day fold body has no extra control flow. +func prepareBackfill(ctx context.Context, db *sql.DB, now int64) (*backfiller, []int64, error) { openHourStart := rollups.BucketTS(now, rollups.Hourly) openDayStart := rollups.BucketTS(now, rollups.Daily) // Wipe both rollup tables before rebuilding so this run is a TRUE full // recompute, not an overwrite. The backfill is the recovery path "when // rollups are missing OR STALE" (ingester.md §One-shot backfill), but the - // per-day upserts below (ON CONFLICT DO UPDATE) can only refresh rows the + // per-day upserts (ON CONFLICT DO UPDATE) can only refresh rows the // recompute still produces — they cannot remove a row it no longer produces // (e.g. a dimension value that has since collapsed into __other__, or any row // from data that has since changed). Deleting first makes the backfill able @@ -55,39 +75,58 @@ func BackfillRollups(ctx context.Context, db *sql.DB, now int64, logger *slog.Lo // commits before the per-day loop, keeping the single writer connection // (SetMaxOpenConns(1)) uncontended. if err := truncateRollups(ctx, db); err != nil { - return BackfillStats{}, err + return nil, nil, err } startsByDay, err := loadSessionStarts(ctx, db, openHourStart) if err != nil { - return BackfillStats{}, err + return nil, nil, err } days, err := closedDays(ctx, db, openHourStart, startsByDay) if err != nil { - return BackfillStats{}, err - } + return nil, nil, err + } + return &backfiller{ + db: db, + openHourStart: openHourStart, + openDayStart: openDayStart, + startsByDay: startsByDay, + }, days, nil +} - bf := &backfiller{db: db, openHourStart: openHourStart, openDayStart: openDayStart, startsByDay: startsByDay} +// runBackfillDays materializes every closed day in arrival order, accumulating +// per-day counts on the backfiller. Each day runs in its own transaction (see +// flushDay), so a failure in one day rolls only that day back. +func runBackfillDays(ctx context.Context, bf *backfiller, days []int64) error { for _, day := range days { if err := bf.flushDay(ctx, day); err != nil { - return BackfillStats{}, err + return err } } + return nil +} - stats := BackfillStats{ +// stats returns the accumulated per-run counters wrapped with the elapsed +// wall-clock time. Kept as a method so BackfillRollups stays orchestration-only. +func (bf *backfiller) stats(start time.Time) BackfillStats { + return BackfillStats{ HourlyRows: bf.hourlyRows, DailyRows: bf.dailyRows, DaysProcessed: bf.daysProcessed, Elapsed: time.Since(start), } +} + +// logBackfillResult emits a single structured log line summarising the run. +// Split out so the orchestrator never carries the two-branch log formatting. +func logBackfillResult(logger *slog.Logger, bf *backfiller, stats BackfillStats) { if stats.DaysProcessed == 0 { logger.Info("rollups-backfill: no closed ops or session starts; nothing to materialize", - "open_hour_start", openHourStart, "open_day_start", openDayStart) - } else { - logger.Info("rollups-backfill: materialized closed-bucket rollups", - "days", stats.DaysProcessed, "hourly_rows", stats.HourlyRows, "daily_rows", stats.DailyRows) + "open_hour_start", bf.openHourStart, "open_day_start", bf.openDayStart) + return } - return stats, nil + logger.Info("rollups-backfill: materialized closed-bucket rollups", + "days", stats.DaysProcessed, "hourly_rows", stats.HourlyRows, "daily_rows", stats.DailyRows) } // truncateRollups empties rollup_hourly and rollup_daily in one short diff --git a/internal/ingest/rollup_refresh.go b/internal/ingest/rollup_refresh.go index 38be6ea..90c8d0f 100644 --- a/internal/ingest/rollup_refresh.go +++ b/internal/ingest/rollup_refresh.go @@ -48,63 +48,99 @@ const hourSpanUS = int64(3_600_000_000) // Single-writer discipline (store.OpenWriter pins SetMaxOpenConns(1)): every // read fully drains its cursor into a slice BEFORE any write on this tx, so a // read cursor never straddles a write on the one pinned connection. +// +// The two passes (HOUR + DAY) share an identical structure — snapshot the +// carried set, iterate, skip-or-materialize, stage post-commit removal — so +// each pass runs through refreshRollupBucketPass with its own granularity and +// carried/pending maps. func (w *writer) refreshRollups(ctx context.Context, tx *sql.Tx) error { if len(w.dirtyRollupBuckets) == 0 && len(w.dirtyRollupDays) == 0 { return nil } now := w.now() - openHourStart := rollups.BucketTS(now, rollups.Hourly) - openDayStart := rollups.BucketTS(now, rollups.Daily) - - // HOUR pass. Snapshot the dirty hours BEFORE iterating: the loop stages removal - // of materialized (closed) hours, and mutating the map during range over it is a - // Go hazard. The snapshot also lets a hour skipped-because-open be RETAINED (left - // in the map) and carried to the next batch/refresh — without it, a bucket whose - // ops all arrive during its own open hour would be skipped while open and then - // never re-marked, leaving the closed bucket permanently un-materialized - // (round-7 P1). - dirtyHours := make([]int64, 0, len(w.dirtyRollupBuckets)) - for h := range w.dirtyRollupBuckets { - dirtyHours = append(dirtyHours, h) - } - for _, h := range dirtyHours { - if h >= openHourStart { - continue // open hour: never materialized; RETAINED in the dirty set. + if err := w.refreshRollupBucketPass(ctx, tx, w.hourPass(rollups.BucketTS(now, rollups.Hourly))); err != nil { + return err + } + return w.refreshRollupBucketPass(ctx, tx, w.dayPass(rollups.BucketTS(now, rollups.Daily))) +} + +// rollupPass is one granularity's worth of state and SQL plumbing the refresh +// loop needs: the carried dirty set, the post-commit pending set, the bucket +// granularity, the bucket span, the wall-clock cutoff at which a bucket is +// considered closed, and a human-readable label for error wrapping. The HOUR +// and DAY passes share the exact same loop body — they differ ONLY in these +// fields — so a typed bundle keeps the dispatch trivially boring. +type rollupPass struct { + dirty map[int64]struct{} + pending map[int64]struct{} + bucket rollups.Bucket + spanUS int64 + openFrom int64 + label string +} + +// hourPass returns the HOUR pass configured against this writer's HOUR +// carried/pending sets and the supplied open-hour cutoff. The open hour is +// never materialized — the query layer live-folds it. +func (w *writer) hourPass(openHourStart int64) rollupPass { + return rollupPass{ + dirty: w.dirtyRollupBuckets, + pending: w.pendingMaterializedBuckets, + bucket: rollups.Hourly, + spanUS: hourSpanUS, + openFrom: openHourStart, + label: "hourly", + } +} + +// dayPass returns the DAY pass configured against this writer's DAY +// carried/pending sets. The open day is never derived — even from its closed +// hours — per data-model.md §"Open-bucket rule". +func (w *writer) dayPass(openDayStart int64) rollupPass { + return rollupPass{ + dirty: w.dirtyRollupDays, + pending: w.pendingMaterializedDays, + bucket: rollups.Daily, + spanUS: daySpanUS, + openFrom: openDayStart, + label: "daily", + } +} + +// refreshRollupBucketPass materializes every CLOSED bucket from pass.dirty, +// skipping (and retaining) any bucket still in the open window. Materialized +// buckets are STAGED for removal in pass.pending, applied only after the tx +// commits (promoteMaterializedRollupBuckets). The snapshot loop mirrors the +// original inline code exactly: snapshot first so removal is safe across the +// range, and the open-bucket skip leaves the carried entry in place. +func (w *writer) refreshRollupBucketPass(ctx context.Context, tx *sql.Tx, pass rollupPass) error { + // Snapshot the dirty set BEFORE iterating: the loop stages removal of + // materialized (closed) buckets, and mutating the map during range over it + // is a Go hazard. The snapshot also lets a bucket skipped-because-open be + // RETAINED (left in the map) and carried to the next batch/refresh — + // without it, a bucket whose ops all arrive during its own open window + // would be skipped while open and then never re-marked, leaving the closed + // bucket permanently un-materialized (round-7/round-8 P1). + snapshot := make([]int64, 0, len(pass.dirty)) + for b := range pass.dirty { + snapshot = append(snapshot, b) + } + for _, b := range snapshot { + if b >= pass.openFrom { + continue // open bucket: never materialized; RETAINED in the dirty set. } - if err := w.recomputeBucket(ctx, tx, h, h+hourSpanUS, rollups.Hourly); err != nil { - return fmt.Errorf("rollups-refresh: hourly bucket %d: %w", h, err) + if err := w.recomputeBucket(ctx, tx, b, b+pass.spanUS, pass.bucket); err != nil { + return fmt.Errorf("rollups-refresh: %s bucket %d: %w", pass.label, b, err) } // Closed and materialized IN THIS TX — stage its removal from the carried // set, applied only AFTER the tx commits (promoteMaterializedRollupBuckets). // Deferring the delete keeps a materialized-then-rolled-back bucket CARRIED // (resetBatch discards the staged removal on rollback) so it is retried — // without this, an idle refresh-only pass whose commit fails would lose a - // closed bucket whose DB row never landed (round-7 P1 undercount). The - // carried set then holds only open/pending hours, so memory stays bounded. - w.pendingMaterializedBuckets[h] = struct{}{} - w.rollupMaterializedThisRefresh = true - } - - // DAY pass — mirrors the hour pass exactly, over the INDEPENDENTLY-carried - // dirtyRollupDays set (NOT derived from the dirty hours). Marking the day in - // markDirtyRollupBucket and carrying it here means a day stays tracked even after - // all of its hours have materialized and left the hour set, so a day that closes - // during a lull is still materialized by a later refresh/idle pass — the daily - // fix for the round-7 open→closed gap (round-8 P1). A day still open at refresh - // time is RETAINED; a closed day is materialized and its removal staged for - // post-commit promotion (rollback-safe, identical to hours). - dirtyDays := make([]int64, 0, len(w.dirtyRollupDays)) - for d := range w.dirtyRollupDays { - dirtyDays = append(dirtyDays, d) - } - for _, d := range dirtyDays { - if d >= openDayStart { - continue // open day: never materialized in rollup_daily; RETAINED. - } - if err := w.recomputeBucket(ctx, tx, d, d+daySpanUS, rollups.Daily); err != nil { - return fmt.Errorf("rollups-refresh: daily bucket %d: %w", d, err) - } - w.pendingMaterializedDays[d] = struct{}{} + // closed bucket whose DB row never landed (round-7/round-8 P1 undercount). + // The carried set then holds only open/pending buckets, so memory stays + // bounded. + pass.pending[b] = struct{}{} w.rollupMaterializedThisRefresh = true } return nil diff --git a/internal/ingest/sow55_characterization_catalog_test.go b/internal/ingest/sow55_characterization_catalog_test.go new file mode 100644 index 0000000..c0328e1 --- /dev/null +++ b/internal/ingest/sow55_characterization_catalog_test.go @@ -0,0 +1,115 @@ +package ingest + +import ( + "database/sql" + "testing" +) + +// SOW-0055 characterization pins for catalog_migrate.go cascade behaviour. + +// TestCatalog_CharacterizationCascadingIdentityCorrection pins the +// catalog_migrate.go cascade contract: a chain of OpStarted re-emits on the +// SAME (turn, seq) must drain the prior identity and re-book totals onto the +// new identity, with OpFinalized totals migrating through the chain. +func TestCatalog_CharacterizationCascadingIdentityCorrection(t *testing.T) { + t.Parallel() + const src = "codex:/tmp" + w, db, ctx := sow55SetupWriter(t, src, "codex", "/tmp") + tx := sow55Begin(t, ctx, db) + sow55ApplyBatch(t, ctx, w, tx, sow55ToolCascadeEvents(src)) + sow55CommitTx(t, tx) + + assertToolCascadeIntermediateDrained(t, db) + assertToolCascadeFinalIdentity(t, db) + assertToolCascadeCostMigrated(t, db) +} + +// TestCatalog_CharacterizationLLMCostMigratesAcrossCascade pins explicit cost +// movement on catalog_models AND catalog_providers when the LLM identity +// switches provider/model. The SOW-0055 split of addMigratedTotals / +// removeOpContribution must keep the cost column and its sign correct. +func TestCatalog_CharacterizationLLMCostMigratesAcrossCascade(t *testing.T) { + t.Parallel() + const src = "codex:/tmp" + w, db, ctx := sow55SetupWriter(t, src, "codex", "/tmp") + tx := sow55Begin(t, ctx, db) + sow55ApplyBatch(t, ctx, w, tx, sow55LLMCascadeEvents(src)) + sow55CommitTx(t, tx) + + assertLLMProviderDrained(t, db) + assertLLMProviderMigrated(t, db) + assertLLMModelMigrated(t, db) +} + +// ----- catalog cascade assertions ----- + +func assertToolCascadeIntermediateDrained(t *testing.T, db *sql.DB) { + t.Helper() + if got := scanInt(t, db, `SELECT COALESCE(call_count,0) FROM catalog_tools WHERE namespace='custom' AND name='search'`); got != 0 { + t.Errorf("placeholder custom/search call_count = %d, want 0 (cascade drained)", got) + } + if got := scanInt(t, db, `SELECT COALESCE(call_count,0) FROM catalog_tools WHERE namespace='mcp:files' AND name='files.read'`); got != 0 { + t.Errorf("intermediate mcp:files/files.read call_count = %d, want 0 (cascade drained)", got) + } +} + +func assertToolCascadeFinalIdentity(t *testing.T, db *sql.DB) { + t.Helper() + if got := scanInt(t, db, `SELECT call_count FROM catalog_tools WHERE namespace='mcp:files-v2' AND name='files.read_v2'`); got != 1 { + t.Errorf("final mcp:files-v2/files.read_v2 call_count = %d, want 1 (one physical op = one row)", got) + } + if got := scanInt(t, db, `SELECT failure_count FROM catalog_tools WHERE namespace='mcp:files-v2' AND name='files.read_v2'`); got != 1 { + t.Errorf("final tool failure_count = %d, want 1 (failure migrated through both corrections)", got) + } + if got := scanInt(t, db, `SELECT total_tokens_in FROM catalog_tools WHERE namespace='mcp:files-v2' AND name='files.read_v2'`); got != 7 { + t.Errorf("final tool total_tokens_in = %d, want 7 (tokens migrated, not duplicated)", got) + } +} + +func assertToolCascadeCostMigrated(t *testing.T, db *sql.DB) { + t.Helper() + if got := scanFloat(t, db, `SELECT total_cost_usd FROM catalog_tools WHERE namespace='mcp:files-v2' AND name='files.read_v2'`); !floatNear(got, 0.5) { + t.Errorf("final tool total_cost_usd = %v, want 0.5 (cost migrated through cascade)", got) + } + if got := scanFloat(t, db, `SELECT COALESCE(total_cost_usd,0) FROM catalog_tools WHERE namespace='custom' AND name='search'`); got != 0 { + t.Errorf("stale tool total_cost_usd on placeholder = %v, want 0 (cascade drained)", got) + } +} + +// ----- LLM cascade assertions ----- + +func assertLLMProviderDrained(t *testing.T, db *sql.DB) { + t.Helper() + if got := scanInt(t, db, `SELECT COALESCE(call_count,0) FROM catalog_providers WHERE name='anthropic'`); got != 0 { + t.Errorf("old provider call_count = %d, want 0 (cost migration must drain anthropic)", got) + } + if got := scanFloat(t, db, `SELECT COALESCE(total_cost_usd,0) FROM catalog_providers WHERE name='anthropic'`); got != 0 { + t.Errorf("old provider total_cost_usd = %v, want 0 (cost must be migrated off)", got) + } +} + +func assertLLMProviderMigrated(t *testing.T, db *sql.DB) { + t.Helper() + if got := scanFloat(t, db, `SELECT total_cost_usd FROM catalog_providers WHERE name='openai'`); !floatNear(got, 1.25) { + t.Errorf("new provider total_cost_usd = %v, want 1.25 (migrated, not duplicated)", got) + } + if got := scanInt(t, db, `SELECT call_count FROM catalog_providers WHERE name='openai'`); got != 1 { + t.Errorf("new provider call_count = %d, want 1 (one physical op)", got) + } +} + +func assertLLMModelMigrated(t *testing.T, db *sql.DB) { + t.Helper() + if got := scanFloat(t, db, `SELECT COALESCE(total_cost_usd,0) FROM catalog_models WHERE provider='anthropic' AND name='claude-opus-4.7'`); got != 0 { + t.Errorf("old model total_cost_usd = %v, want 0", got) + } + if got := scanFloat(t, db, `SELECT total_cost_usd FROM catalog_models WHERE provider='openai' AND name='gpt-5.5'`); !floatNear(got, 1.25) { + t.Errorf("new model total_cost_usd = %v, want 1.25", got) + } + if got := scanInt(t, db, `SELECT call_count FROM catalog_models WHERE provider='openai' AND name='gpt-5.5'`); got != 1 { + t.Errorf("new model call_count = %d, want 1", got) + } + if got := scanInt(t, db, `SELECT total_duration_us FROM catalog_models WHERE provider='openai' AND name='gpt-5.5'`); got != 900 { + t.Errorf("new model total_duration_us = %d, want 900 (migrated)", got) + } +} diff --git a/internal/ingest/sow55_characterization_helpers_test.go b/internal/ingest/sow55_characterization_helpers_test.go new file mode 100644 index 0000000..3bb9a1d --- /dev/null +++ b/internal/ingest/sow55_characterization_helpers_test.go @@ -0,0 +1,218 @@ +package ingest + +import ( + "context" + "database/sql" + "errors" + "testing" + + "github.com/netdata/ai-viewer/internal/canonical" +) + +// SOW-0055 characterization shared helpers — setup, event builders, utilities. +// Used by the per-area characterization test files. + +// ----- shared setup helpers ----- + +func sow55SetupWriter(t *testing.T, src, format, loc string) (*writer, *sql.DB, context.Context) { + t.Helper() + _, db := openTestStore(t) + ctx := context.Background() + if err := ensureSourceRowDirect(ctx, db, src, format, loc); err != nil { + t.Fatalf("ensure source: %v", err) + } + w := newWriter(src, format, loc, NopPricer{}) + return w, db, ctx +} + +func sow55Begin(t *testing.T, ctx context.Context, db *sql.DB) *sql.Tx { + t.Helper() + tx, err := db.BeginTx(ctx, nil) + if err != nil { + t.Fatalf("BeginTx: %v", err) + } + return tx +} + +func sow55ApplyEvent(t *testing.T, ctx context.Context, w *writer, tx *sql.Tx, ev canonical.Event) { + t.Helper() + if err := w.apply(ctx, tx, ev); err != nil { + t.Fatalf("apply %T: %v", ev, err) + } +} + +func sow55ApplyBatch(t *testing.T, ctx context.Context, w *writer, tx *sql.Tx, evs []canonical.Event) { + t.Helper() + for _, ev := range evs { + sow55ApplyEvent(t, ctx, w, tx, ev) + } +} + +func sow55CommitTx(t *testing.T, tx *sql.Tx) { + t.Helper() + if err := tx.Commit(); err != nil { + t.Fatalf("Commit: %v", err) + } +} + +// ----- event builders ----- + +func sow55ToolCascadeEvents(src string) []canonical.Event { + return []canonical.Event{ + canonical.SessionStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 1, Ts: 1000}, + NativeID: "s", RootNativeID: "s", Kind: canonical.KindRoot, + }, + canonical.TurnStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 2, Ts: 1000}, + SessionNativeID: "s", Seq: 1, + }, + canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 3, Ts: 1100}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpTool, Name: "search", ToolNamespace: "custom", + }, + canonical.OpFinalizedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 4, Ts: 1500}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, Status: "failed", ErrorClass: "boom", + EndTs: 1500, TokensIn: 7, TokensOut: 11, CostUSD: 0.5, + }, + canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 5, Ts: 1100}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpTool, Name: "files.read", ToolNamespace: "mcp:files", + }, + canonical.OpFinalizedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 6, Ts: 1500}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, Status: "failed", ErrorClass: "boom", + EndTs: 1500, TokensIn: 7, TokensOut: 11, CostUSD: 0.5, + }, + canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 7, Ts: 1100}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpTool, Name: "files.read_v2", ToolNamespace: "mcp:files-v2", + }, + canonical.OpFinalizedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 8, Ts: 1500}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, Status: "failed", ErrorClass: "boom", + EndTs: 1500, TokensIn: 7, TokensOut: 11, CostUSD: 0.5, + }, + } +} + +func sow55LLMCascadeEvents(src string) []canonical.Event { + return []canonical.Event{ + canonical.SessionStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 1, Ts: 1000}, + NativeID: "s", RootNativeID: "s", Kind: canonical.KindRoot, + }, + canonical.TurnStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 2, Ts: 1000}, + SessionNativeID: "s", Seq: 1, + }, + canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 3, Ts: 1100}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpLLM, Name: "chat", Provider: "anthropic", Model: "claude-opus-4.7", + }, + canonical.OpFinalizedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 4, Ts: 2000}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, Status: "completed", + EndTs: 2000, TokensIn: 12, TokensOut: 34, CostUSD: 1.25, + }, + canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 5, Ts: 1100}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpLLM, Name: "chat", Provider: "openai", Model: "gpt-5.5", + }, + canonical.OpFinalizedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 6, Ts: 2000}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, Status: "completed", + EndTs: 2000, TokensIn: 12, TokensOut: 34, CostUSD: 1.25, + }, + } +} + +func sow55NilExtrasPhase1Events(src string) []canonical.Event { + return []canonical.Event{ + canonical.SessionStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 1, Ts: 1000}, + NativeID: "parent", RootNativeID: "parent", Kind: canonical.KindRoot, + }, + canonical.TurnStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 2, Ts: 1000}, + SessionNativeID: "parent", Seq: 1, + }, + canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 3, Ts: 1100}, + SessionNativeID: "parent", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpSession, Name: "explore", + Extras: map[string]any{ + "attr.x": "should-be-dropped", + "aiViewer": map[string]any{"toolUseId": "toolu_abc", "childNativeId": "parent:agent:xyz"}, + }, + }, + } +} + +func sow55ReEmitRollupEvents(src string) []canonical.Event { + return []canonical.Event{ + canonical.SessionStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 1, Ts: 1000}, + NativeID: "s", RootNativeID: "s", Kind: canonical.KindRoot, + }, + canonical.TurnStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 2, Ts: 1000}, + SessionNativeID: "s", Seq: 1, + }, + canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 3, Ts: 3_600_000_005}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpLLM, Name: "chat", Provider: "openai", Model: "gpt-5.5", + }, + canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 4, Ts: 3_600_000_005}, + SessionNativeID: "s", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpLLM, Name: "chat", Provider: "openai", Model: "gpt-5.5", + }, + } +} + +func sow55LogBatch(src string, seqBase uint64, ts int64, message string) []canonical.Event { + return []canonical.Event{ + canonical.SessionStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: seqBase, Ts: 500}, + NativeID: "s", RootNativeID: "s", Kind: canonical.KindRoot, + }, + canonical.LogEntryEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: seqBase + 1, Ts: ts}, + SessionNativeID: "s", + Severity: "INF", Source: "agent", Message: message, + }, + } +} + +// ----- utility functions ----- + +// scanFloat returns a single float64 from a one-row SELECT. +func scanFloat(t *testing.T, db *sql.DB, query string, args ...any) float64 { + t.Helper() + var v float64 + if err := db.QueryRowContext(context.Background(), query, args...).Scan(&v); err != nil { + if errors.Is(err, sql.ErrNoRows) { + return 0 + } + t.Fatalf("query %s: %v", query, err) + } + return v +} + +// floatNear reports whether a and b agree to 6 decimal places. +func floatNear(a, b float64) bool { + const eps = 1e-6 + d := a - b + if d < 0 { + d = -d + } + return d < eps +} diff --git a/internal/ingest/sow55_characterization_resolver_test.go b/internal/ingest/sow55_characterization_resolver_test.go new file mode 100644 index 0000000..1da7be6 --- /dev/null +++ b/internal/ingest/sow55_characterization_resolver_test.go @@ -0,0 +1,130 @@ +package ingest + +import ( + "context" + "database/sql" + "strings" + "testing" + "time" +) + +// SOW-0055 characterization pins for resolver.go linkOrphans transactional behaviour. + +// TestResolver_CharacterizationLinkageFailureRollsBackWithoutNotifyRows pins +// the resolver.go linkOrphans contract: when a downstream UPDATE/notify INSERT +// fails inside the resolver's single transaction, neither the linkage UPDATEs +// nor the notify rows are visible after the failure. A helper-runner refactor +// that committed in pieces would break the SSE contract. +func TestResolver_CharacterizationLinkageFailureRollsBackWithoutNotifyRows(t *testing.T) { + t.Parallel() + const src = "aiagent_v3:/tmp" + db, childID, parentID := seedOrphanChildThenParent(t, src) + ctx := context.Background() + + assertOrphanPrecondition(t, db, childID) + sow55InstallNotifyAbortTrigger(t, db) + defer sow55DropNotifyAbortTrigger(t, db) + + r := newResolver(db, silentLogger(), time.Minute) + err := r.linkOrphans(ctx) + if err == nil { + t.Fatal("linkOrphans succeeded despite forced notify abort") + } + if !strings.Contains(err.Error(), "sow55 forced notify abort") { + t.Fatalf("error does not contain the forced abort marker: %v", err) + } + + assertResolverRollbackLinkage(t, db, childID, parentID) + assertResolverRollbackNotify(t, db) +} + +// TestResolver_CharacterizationLinkOrphansEmitsAllPassesInOneTx pins the +// complementary linkOrphans property: every link pass AND the notify emission +// run inside the SAME transaction. The happy path is exercised so the contract +// — pass results visible at commit AND notify rows landed in the same commit — +// is pinned. +func TestResolver_CharacterizationLinkOrphansEmitsAllPassesInOneTx(t *testing.T) { + t.Parallel() + const src = "aiagent_v3:/tmp" + db, childID, parentID := seedOrphanChildThenParent(t, src) + + r := newResolver(db, silentLogger(), time.Minute) + if err := r.linkOrphans(context.Background()); err != nil { + t.Fatalf("linkOrphans: %v", err) + } + + assertResolverHappyPathLinkage(t, db, childID, parentID) + assertResolverHappyPathNotify(t, db) +} + +// ----- resolver helpers ----- + +func assertOrphanPrecondition(t *testing.T, db *sql.DB, childID string) { + t.Helper() + if got := scanString(t, db, `SELECT IFNULL(parent_session_id,'') FROM sessions WHERE id=?`, childID); got != "" { + t.Fatalf("child unexpectedly linked before resolver: %q", got) + } + if got := scanInt(t, db, `SELECT COUNT(*) FROM notify`); got != 0 { + t.Fatalf("notify table non-empty before test: %d rows", got) + } +} + +func sow55InstallNotifyAbortTrigger(t *testing.T, db *sql.DB) { + t.Helper() + ctx := context.Background() + if _, err := db.ExecContext(ctx, ` +CREATE TRIGGER abort_notify BEFORE INSERT ON notify +BEGIN + SELECT RAISE(ABORT, 'sow55 forced notify abort'); +END`); err != nil { + t.Fatalf("create trigger: %v", err) + } +} + +func sow55DropNotifyAbortTrigger(t *testing.T, db *sql.DB) { + t.Helper() + if _, err := db.ExecContext(context.Background(), `DROP TRIGGER IF EXISTS abort_notify`); err != nil { + t.Logf("drop trigger: %v", err) + } +} + +func assertResolverRollbackLinkage(t *testing.T, db *sql.DB, childID, parentID string) { + t.Helper() + if got := scanString(t, db, `SELECT IFNULL(parent_session_id,'') FROM sessions WHERE id=?`, childID); got != "" { + t.Errorf("child parent_session_id = %q, want empty (linkage must roll back on notify failure)", got) + } + if got := scanString(t, db, `SELECT IFNULL(root_session_id,'') FROM sessions WHERE id=?`, childID); got == parentID { + t.Errorf("child root_session_id = %q (= parent), want still self (linkage rolled back)", got) + } +} + +func assertResolverRollbackNotify(t *testing.T, db *sql.DB) { + t.Helper() + if got := scanInt(t, db, `SELECT COUNT(*) FROM notify`); got != 0 { + t.Errorf("notify rows = %d, want 0 (whole tx must roll back together)", got) + } +} + +func assertResolverHappyPathLinkage(t *testing.T, db *sql.DB, childID, parentID string) { + t.Helper() + if got := scanString(t, db, `SELECT IFNULL(parent_session_id,'') FROM sessions WHERE id=?`, childID); got != parentID { + t.Errorf("child parent_session_id = %q, want %q (linkage must commit)", got, parentID) + } +} + +func assertResolverHappyPathNotify(t *testing.T, db *sql.DB) { + t.Helper() + rows := readNotify(t, db) + if len(rows) < 2 { + t.Errorf("notify rows = %d, want ≥ 2 (resolver must emit linkage notify rows)", len(rows)) + } + hasStats := false + for _, r := range rows { + if r.kind == "stats_invalidated" { + hasStats = true + } + } + if !hasStats { + t.Error("missing stats_invalidated notify row from resolver pass") + } +} diff --git a/internal/ingest/sow55_characterization_rollup_test.go b/internal/ingest/sow55_characterization_rollup_test.go new file mode 100644 index 0000000..e790114 --- /dev/null +++ b/internal/ingest/sow55_characterization_rollup_test.go @@ -0,0 +1,124 @@ +package ingest + +import ( + "context" + "database/sql" + "testing" + + "github.com/netdata/ai-viewer/internal/canonical" + "github.com/netdata/ai-viewer/internal/rollups" +) + +// SOW-0055 characterization pins for rollup_refresh.go carry-forward behaviour. + +// TestRefreshRollups_CharacterizationCarriedSetAcrossMultipleRefreshes pins +// the rollup_refresh.go carry-forward invariant: dirty HOUR and DAY buckets +// carried across multiple refresh passes remain independently tracked. Three +// refreshes are run: open hour+day (both skipped), hour closed (hour +// materialized, day retained), day closed (day materialized). +func TestRefreshRollups_CharacterizationCarriedSetAcrossMultipleRefreshes(t *testing.T) { + const src, format = "claude_code:/tmp", "claude_code" + hourH := ts(0, 10, 0) + nowOpenBoth := ts(0, 10, 15) + nowHourClosed := ts(0, 11, 5) + nowDayClosed := ts(1, 0, 30) + + wr, db, clk := sow55SetupRollupRefreshCarried(t, src, format, hourH, nowOpenBoth) + + assertCarriedState1BothOpen(t, db, wr, hourH, format) + + clk.now = nowHourClosed + mustRefreshRollups(t, db, wr) + assertCarriedState2HourClosed(t, db, wr, hourH, format) + + clk.now = nowDayClosed + mustRefreshRollups(t, db, wr) + assertCarriedState3DayClosed(t, db, wr, hourH, format) +} + +// ----- rollup refresh helpers ----- + +func sow55SetupRollupRefreshCarried(t *testing.T, src, format string, hourH, nowOpenBoth int64) (*writer, *sql.DB, *mutableClock) { + t.Helper() + _, db := openTestStore(t) + seedSource(t, db, src, format) + clk := &mutableClock{now: nowOpenBoth} + wr := newWriter(src, format, "/loc", NopPricer{}) + wr.now = clk.Now + + hourHEnd := ts(0, 10, 30) + batch := []canonical.Event{ + sessionStartEvent(src, "s", "claude", "/w", hourH, 1), + canonical.TurnStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 2, Ts: hourH}, + SessionNativeID: "s", Seq: 1, + }, + } + batch = append(batch, llmOpEvents(src, "s", 1, 1, hourH, hourHEnd, "m", "p", 5, 5, 0, false)...) + flushBatchReuse(t, db, wr, src, format, batch) + return wr, db, clk +} + +// mustRefreshRollups opens a tx, runs wr.refreshRollups, commits, promotes the +// staged carry-set removals, and resets per-batch state — mirroring +// worker.refreshRollupsOnly's production semantics. +func mustRefreshRollups(t *testing.T, db *sql.DB, wr *writer) { + t.Helper() + ctx := context.Background() + tx, err := db.BeginTx(ctx, nil) + if err != nil { + t.Fatalf("BeginTx: %v", err) + } + if err := wr.refreshRollups(ctx, tx); err != nil { + _ = tx.Rollback() + t.Fatalf("refreshRollups: %v", err) + } + if err := tx.Commit(); err != nil { + t.Fatalf("Commit: %v", err) + } + wr.promoteMaterializedRollupBuckets() + wr.resetBatch() +} + +func assertCarriedState1BothOpen(t *testing.T, db *sql.DB, wr *writer, hourH int64, format string) { + t.Helper() + if _, ok := wr.dirtyRollupBuckets[rollups.BucketTS(hourH, rollups.Hourly)]; !ok { + t.Fatal("expected hour H carried after first flush (open hour)") + } + if _, ok := wr.dirtyRollupDays[rollups.BucketTS(hourH, rollups.Daily)]; !ok { + t.Fatal("expected day0 carried after first flush (open day)") + } + if _, ok := readRollups(t, db, "rollup_hourly")[rollupKey{hourH, format, "total", ""}]; ok { + t.Fatal("premise broken: open hour materialized") + } +} + +func assertCarriedState2HourClosed(t *testing.T, db *sql.DB, wr *writer, hourH int64, format string) { + t.Helper() + if _, ok := readRollups(t, db, "rollup_hourly")[rollupKey{hourH, format, "total", ""}]; !ok { + t.Fatal("hour H not materialized after it closed (SOW-0055 hour pass split must keep this)") + } + if _, ok := wr.dirtyRollupBuckets[rollups.BucketTS(hourH, rollups.Hourly)]; ok { + t.Fatal("hour H still in carried HOUR set after materialization (must be removed)") + } + if _, ok := wr.dirtyRollupDays[rollups.BucketTS(hourH, rollups.Daily)]; !ok { + t.Fatal("day0 dropped from carried DAY set while still open (round-8 P1 regression)") + } + if _, ok := readRollups(t, db, "rollup_daily")[rollupKey{rollups.BucketTS(hourH, rollups.Daily), format, "total", ""}]; ok { + t.Fatal("open day materialized in rollup_daily (open-day cutoff broken)") + } +} + +func assertCarriedState3DayClosed(t *testing.T, db *sql.DB, wr *writer, hourH int64, format string) { + t.Helper() + dayBucket := rollups.BucketTS(hourH, rollups.Daily) + if _, ok := readRollups(t, db, "rollup_daily")[rollupKey{dayBucket, format, "total", ""}]; !ok { + t.Fatal("day0 not materialized after it closed (SOW-0055 day pass split must keep this)") + } + if _, ok := wr.dirtyRollupDays[dayBucket]; ok { + t.Fatal("day0 still in carried DAY set after materialization (must be removed)") + } + if wr.hasPendingRollupBuckets() { + t.Fatal("carried sets not empty after both hour and day materialized") + } +} diff --git a/internal/ingest/sow55_characterization_writer_test.go b/internal/ingest/sow55_characterization_writer_test.go new file mode 100644 index 0000000..25c5074 --- /dev/null +++ b/internal/ingest/sow55_characterization_writer_test.go @@ -0,0 +1,153 @@ +package ingest + +import ( + "database/sql" + "testing" + + "github.com/netdata/ai-viewer/internal/canonical" + "github.com/netdata/ai-viewer/internal/rollups" +) + +// SOW-0055 characterization pins for writer.go apply-op and apply-log behaviour. + +// TestWriter_CharacterizationOpStartNilExtrasReEmitPreservesResolverStash pins +// the applyOpStarted contract: a re-emit with nil extras must yield a +// stash-only extras_json — non-aiViewer keys are dropped, but resolver join +// keys (childNativeId / toolUseId) survive under $.aiViewer. +func TestWriter_CharacterizationOpStartNilExtrasReEmitPreservesResolverStash(t *testing.T) { + t.Parallel() + const src = "claude-code:/tmp" + w, db, ctx := sow55SetupWriter(t, src, "claude-code", "/tmp") + + tx1 := sow55Begin(t, ctx, db) + sow55ApplyBatch(t, ctx, w, tx1, sow55NilExtrasPhase1Events(src)) + sow55CommitTx(t, tx1) + + tx2 := sow55Begin(t, ctx, db) + sow55ApplyEvent(t, ctx, w, tx2, canonical.OpStartedEvent{ + EventBase: canonical.EventBase{SourceID: src, SourceSeq: 4, Ts: 1100}, + SessionNativeID: "parent", TurnSeq: 1, Seq: 1, ParentOpSeq: -1, + Kind: canonical.OpSession, Name: "explore", + }) + sow55CommitTx(t, tx2) + + turnID := canonicalTurnID(canonicalSessionID(src, "parent"), 1) + opID := canonicalOpID(turnID, 1) + assertNilExtrasReEmitStash(t, db, opID) +} + +// TestWriter_CharacterizationOpStartReEmitMarksRollupBucketAndCatalog pins the +// applyOpStarted post-upsert hooks: the op's start bucket is added to the +// dirty rollup set, the dirty op id is recorded for FTS, and catalog call_count +// does NOT double-count a same-identity re-emit (SOW-0004 H1a). +func TestWriter_CharacterizationOpStartReEmitMarksRollupBucketAndCatalog(t *testing.T) { + t.Parallel() + const src = "codex:/tmp" + w, db, ctx := sow55SetupWriter(t, src, "codex", "/tmp") + tx := sow55Begin(t, ctx, db) + sow55ApplyBatch(t, ctx, w, tx, sow55ReEmitRollupEvents(src)) + sow55CommitTx(t, tx) + + turnID := canonicalTurnID(canonicalSessionID(src, "s"), 1) + wantOpID := canonicalOpID(turnID, 1) + assertReEmitRollupAndCatalog(t, db, w, wantOpID) +} + +// TestWriter_CharacterizationLogEntryDuplicateReplayNoDuplicateFTS pins the +// applyLogEntry contract: a byte-identical log_entries replay must leave +// log_entries with exactly one row AND fts_logs with exactly one row. The +// INSERT ... ON CONFLICT DO NOTHING ... RETURNING id returns sql.ErrNoRows on +// conflict; that signal gates the fts_logs insert. +func TestWriter_CharacterizationLogEntryDuplicateReplayNoDuplicateFTS(t *testing.T) { + t.Parallel() + const src = "claude_code:/tmp" + w, db, ctx := sow55SetupWriter(t, src, "claude_code", "/tmp") + w.fts5IndexLogs = true + + tx1 := sow55Begin(t, ctx, db) + sow55ApplyBatch(t, ctx, w, tx1, sow55LogBatch(src, 1, 1000, "replayed line")) + sow55CommitTx(t, tx1) + + tx2 := sow55Begin(t, ctx, db) + sow55ApplyBatch(t, ctx, w, tx2, sow55LogBatch(src, 3, 1000, "replayed line")) + sow55CommitTx(t, tx2) + + assertLogDupDedup(t, db) +} + +// TestWriter_CharacterizationLogEntryFTSGateRespectsFlag pins the second half +// of the applyLogEntry FTS hook: when fts5IndexLogs=false no fts_logs row is +// inserted even for a first-seen log row, while log_entries still receives the +// row. +func TestWriter_CharacterizationLogEntryFTSGateRespectsFlag(t *testing.T) { + t.Parallel() + const src = "claude_code:/tmp" + w, db, ctx := sow55SetupWriter(t, src, "claude_code", "/tmp") + w.fts5IndexLogs = false + + tx := sow55Begin(t, ctx, db) + sow55ApplyBatch(t, ctx, w, tx, sow55LogBatch(src, 1, 1000, "gated message")) + sow55CommitTx(t, tx) + + assertLogFTSGated(t, db, "gated message") +} + +// ----- writer extras / rollup / log assertions ----- + +func assertNilExtrasReEmitStash(t *testing.T, db *sql.DB, opID string) { + t.Helper() + gotTU := scanString(t, db, `SELECT IFNULL(json_extract(extras_json,'$.aiViewer.toolUseId'),'') FROM ops WHERE id=?`, opID) + if gotTU != "toolu_abc" { + t.Errorf("toolUseId stash = %q, want %q (must survive nil-extras re-emit)", gotTU, "toolu_abc") + } + gotCN := scanString(t, db, `SELECT IFNULL(json_extract(extras_json,'$.aiViewer.childNativeId'),'') FROM ops WHERE id=?`, opID) + if gotCN != "parent:agent:xyz" { + t.Errorf("childNativeId stash = %q, want %q (must survive nil-extras re-emit)", gotCN, "parent:agent:xyz") + } + gotAttr := scanString(t, db, `SELECT IFNULL(json_extract(extras_json,'$.attr.x'),'') FROM ops WHERE id=?`, opID) + if gotAttr != "" { + t.Errorf("attr.x present after nil-extras re-emit = %q, want empty (graft must drop non-aiViewer keys)", gotAttr) + } +} + +func assertReEmitRollupAndCatalog(t *testing.T, db *sql.DB, w *writer, wantOpID string) { + t.Helper() + wantBucket := rollups.BucketTS(3_600_000_005, rollups.Hourly) + if _, ok := w.dirtyRollupBuckets[wantBucket]; !ok { + t.Fatalf("dirtyRollupBuckets missing bucket %d after applyOpStarted (post-upsert marking lost)", wantBucket) + } + wantDay := rollups.BucketTS(3_600_000_005, rollups.Daily) + if _, ok := w.dirtyRollupDays[wantDay]; !ok { + t.Fatalf("dirtyRollupDays missing day %d after applyOpStarted (daily marking lost)", wantDay) + } + if _, ok := w.dirtyOpIDs[wantOpID]; !ok { + t.Fatalf("dirtyOpIDs missing %s after applyOpStarted (FTS marking lost)", wantOpID) + } + if got := scanInt(t, db, `SELECT call_count FROM catalog_models WHERE provider='openai' AND name='gpt-5.5'`); got != 1 { + t.Errorf("catalog_models call_count = %d, want 1 (one physical op, re-emit is UPDATE)", got) + } + if got := scanInt(t, db, `SELECT call_count FROM catalog_providers WHERE name='openai'`); got != 1 { + t.Errorf("catalog_providers call_count = %d, want 1 (one physical op, re-emit is UPDATE)", got) + } +} + +func assertLogDupDedup(t *testing.T, db *sql.DB) { + t.Helper() + const msg = "replayed line" + if got := scanInt(t, db, `SELECT COUNT(*) FROM log_entries WHERE message=?`, msg); got != 1 { + t.Errorf("log_entries rows = %d, want 1 (replay must dedup via idx_log_entries_identity)", got) + } + if got := scanInt(t, db, `SELECT COUNT(*) FROM fts_logs WHERE message=?`, msg); got != 1 { + t.Errorf("fts_logs rows = %d, want 1 (replay must not duplicate the FTS row — ErrNoRows gate must hold)", got) + } +} + +func assertLogFTSGated(t *testing.T, db *sql.DB, message string) { + t.Helper() + if got := scanInt(t, db, `SELECT COUNT(*) FROM log_entries WHERE message=?`, message); got != 1 { + t.Errorf("log_entries rows = %d, want 1 (log_entries always written)", got) + } + if got := scanInt(t, db, `SELECT COUNT(*) FROM fts_logs WHERE message=?`, message); got != 0 { + t.Errorf("fts_logs rows = %d, want 0 (fts5IndexLogs=false must skip indexing)", got) + } +} diff --git a/internal/ingest/writer.go b/internal/ingest/writer.go index 6610ae3..79749a3 100644 --- a/internal/ingest/writer.go +++ b/internal/ingest/writer.go @@ -416,38 +416,86 @@ func (w *writer) drainObservabilityErrs() []error { return out } -// apply dispatches one event to its kind-specific writer. +// apply dispatches one event to its kind-specific writer. The per-kind +// switch is the body of dispatchEvent; apply itself is only responsible for +// updating the batch's max-seq observability counter and delegating to the +// dispatcher. Unknown kinds (events with no registered handler) return an +// explicit error so a future canonical event added without a writer wiring +// fails closed instead of silently dropping. func (w *writer) apply(ctx context.Context, tx *sql.Tx, ev canonical.Event) error { - seq := ev.EventSourceSeq() - if seq > w.batchMaxSeq { + if seq := ev.EventSourceSeq(); seq > w.batchMaxSeq { w.batchMaxSeq = seq } + return w.dispatchEvent(ctx, tx, ev) +} + +// dispatchEvent forwards ev to its concrete apply* writer method. The 11 +// canonical event kinds are split across three sub-dispatchers (session, +// turn-or-op, and ancillary) so no single function carries the full CCN of +// the per-kind switch while the behaviour-preserving contract — same +// handlers, same call shape, same unknown-kind error — is maintained. +func (w *writer) dispatchEvent(ctx context.Context, tx *sql.Tx, ev canonical.Event) error { + if handled, err := w.dispatchSessionEvent(ctx, tx, ev); handled { + return err + } + if handled, err := w.dispatchTurnOrOpEvent(ctx, tx, ev); handled { + return err + } + if handled, err := w.dispatchAncillaryEvent(ctx, tx, ev); handled { + return err + } + return fmt.Errorf("writer: unknown event kind %s", ev.EventKind()) +} + +// dispatchSessionEvent matches the three SessionXxx event kinds and forwards +// to the corresponding apply method. handled=false means ev is not a session +// event — the caller falls through to the next sub-dispatcher. +func (w *writer) dispatchSessionEvent(ctx context.Context, tx *sql.Tx, ev canonical.Event) (handled bool, err error) { switch e := ev.(type) { case canonical.SessionStartedEvent: - return w.applySessionStarted(ctx, tx, e) + return true, w.applySessionStarted(ctx, tx, e) case canonical.SessionUpdatedEvent: - return w.applySessionUpdated(ctx, tx, e) + return true, w.applySessionUpdated(ctx, tx, e) case canonical.SessionFinalizedEvent: - return w.applySessionFinalized(ctx, tx, e) + return true, w.applySessionFinalized(ctx, tx, e) + } + return false, nil +} + +// dispatchTurnOrOpEvent matches the four Turn/Op start/finalize event kinds +// — the turns and ops table writers — and forwards to the corresponding +// apply method. handled=false means ev is not one of these. +func (w *writer) dispatchTurnOrOpEvent(ctx context.Context, tx *sql.Tx, ev canonical.Event) (handled bool, err error) { + switch e := ev.(type) { case canonical.TurnStartedEvent: - return w.applyTurnStarted(ctx, tx, e) + return true, w.applyTurnStarted(ctx, tx, e) case canonical.TurnFinalizedEvent: - return w.applyTurnFinalized(ctx, tx, e) + return true, w.applyTurnFinalized(ctx, tx, e) case canonical.OpStartedEvent: - return w.applyOpStarted(ctx, tx, e) + return true, w.applyOpStarted(ctx, tx, e) case canonical.OpFinalizedEvent: - return w.applyOpFinalized(ctx, tx, e) + return true, w.applyOpFinalized(ctx, tx, e) + } + return false, nil +} + +// dispatchAncillaryEvent matches the four non-session/non-turn/non-op event +// kinds: payload refs, log entries, source progress, and source errors. +// handled=false means ev matches none of these — dispatchEvent then surfaces +// the unknown-kind error. +func (w *writer) dispatchAncillaryEvent(ctx context.Context, tx *sql.Tx, ev canonical.Event) (handled bool, err error) { + switch e := ev.(type) { case canonical.PayloadRefEvent: - return w.applyPayloadRef(ctx, tx, e) + return true, w.applyPayloadRef(ctx, tx, e) case canonical.LogEntryEvent: - return w.applyLogEntry(ctx, tx, e) + return true, w.applyLogEntry(ctx, tx, e) case canonical.SourceProgressEvent: - return w.applySourceProgress(e) + w.applySourceProgress(e) + return true, nil case canonical.SourceErrorEvent: - return w.applySourceError(ctx, tx, e) - default: - return fmt.Errorf("writer: unknown event kind %s", ev.EventKind()) + return true, w.applySourceError(ctx, tx, e) } + return false, nil } // applySessionStarted inserts or updates a sessions row. Parent linkage @@ -636,50 +684,88 @@ WHERE id = ? // applySessionUpdated when agent_name or cwd may have changed. Single-connection // discipline: each cursor is fully drained into a slice BEFORE any marking, and // marking is in-memory only (no SQL write follows the cursor on the batch tx). +// +// Implementation is two phases that mirror the discipline above exactly: +// (1) drainSessionRollupBuckets reads the session-start ts + every op-start +// ts into slices, fully closing each cursor before returning; (2) the +// receiver then walks those slices and applies the idempotent in-memory +// markDirtyRollupBucket marks. The split keeps the cursor-close error paths +// in one place and the marking loop trivially small. func (w *writer) markSessionRollupBucketsDirty(ctx context.Context, tx *sql.Tx, sessionID string) error { - // Session start hour: session_starts is attributed to total/agent/cwd of the - // session's start bucket, so a changed agent/cwd must re-attribute it. + startTs, opStarts, err := drainSessionRollupBuckets(ctx, tx, sessionID) + if err != nil { + return err + } + if startTs.Valid { + w.markDirtyRollupBucket(startTs.Int64) + } + for _, opStart := range opStarts { + w.markDirtyRollupBucket(opStart) + } + return nil +} + +// drainSessionRollupBuckets reads the session's start_ts and every op_start_ts +// for sessionID into slices, fully draining and closing each cursor before +// returning. Single-connection discipline (store.OpenWriter pins SetMaxOpenConns=1): +// no SQL write may straddle an open read on the pinned connection, so the +// drain happens here BEFORE the caller marks. A NULL start_ts (no sessions +// row yet) is returned as an invalid NullInt64 — the caller skips that mark. +func drainSessionRollupBuckets(ctx context.Context, tx *sql.Tx, sessionID string) (sql.NullInt64, []int64, error) { + startTs, err := loadSessionStartTs(ctx, tx, sessionID) + if err != nil { + return sql.NullInt64{}, nil, err + } + opStarts, err := loadSessionOpStartTs(ctx, tx, sessionID) + if err != nil { + return sql.NullInt64{}, nil, err + } + return startTs, opStarts, nil +} + +// loadSessionStartTs reads the session's start_ts into a NullInt64. A missing +// session row is NOT an error here — it can happen during out-of-order ingest +// when a session-update arrives before the matching SessionStarted. The +// caller treats a NULL start_ts as "nothing to mark on the start bucket". +func loadSessionStartTs(ctx context.Context, tx *sql.Tx, sessionID string) (sql.NullInt64, error) { var startTs sql.NullInt64 if err := tx.QueryRowContext(ctx, `SELECT start_ts FROM sessions WHERE id = ?`, sessionID).Scan(&startTs); err != nil { - if !errors.Is(err, sql.ErrNoRows) { - return fmt.Errorf("writer: read session start_ts %s: %w", sessionID, err) + if errors.Is(err, sql.ErrNoRows) { + return sql.NullInt64{}, nil } + return sql.NullInt64{}, fmt.Errorf("writer: read session start_ts %s: %w", sessionID, err) } + return startTs, nil +} - // Distinct op-start hour buckets for this session. Drain the whole cursor - // into a slice first (no write may straddle an open read on the single - // pinned connection), THEN mark — bucket math via the same rollups helper - // markDirtyRollupBucket uses, so the dirty set matches the op-write paths. +// loadSessionOpStartTs reads every op's start_ts for sessionID, fully draining +// the cursor before returning. Errors from Scan / Err / Close are all wrapped +// with the session id so they surface contextually in the logs. The cursor +// must NOT outlive this call — single-connection discipline requires every +// read to finish before any write follows on the batch tx. +func loadSessionOpStartTs(ctx context.Context, tx *sql.Tx, sessionID string) ([]int64, error) { rows, err := tx.QueryContext(ctx, `SELECT start_ts FROM ops WHERE session_id = ?`, sessionID) if err != nil { - return fmt.Errorf("writer: query session op buckets %s: %w", sessionID, err) + return nil, fmt.Errorf("writer: query session op buckets %s: %w", sessionID, err) } var opStarts []int64 for rows.Next() { var opStart int64 if scanErr := rows.Scan(&opStart); scanErr != nil { _ = rows.Close() - return fmt.Errorf("writer: scan session op start %s: %w", sessionID, scanErr) + return nil, fmt.Errorf("writer: scan session op start %s: %w", sessionID, scanErr) } opStarts = append(opStarts, opStart) } if iterErr := rows.Err(); iterErr != nil { _ = rows.Close() - return fmt.Errorf("writer: iterate session op buckets %s: %w", sessionID, iterErr) + return nil, fmt.Errorf("writer: iterate session op buckets %s: %w", sessionID, iterErr) } if closeErr := rows.Close(); closeErr != nil { - return fmt.Errorf("writer: close session op buckets %s: %w", sessionID, closeErr) - } - - // All reads drained — now mark (idempotent in-memory map inserts). - if startTs.Valid { - w.markDirtyRollupBucket(startTs.Int64) + return nil, fmt.Errorf("writer: close session op buckets %s: %w", sessionID, closeErr) } - for _, opStart := range opStarts { - w.markDirtyRollupBucket(opStart) - } - return nil + return opStarts, nil } func (w *writer) applySessionFinalized(ctx context.Context, tx *sql.Tx, ev canonical.SessionFinalizedEvent) error { @@ -752,70 +838,149 @@ ON CONFLICT (session_id, seq) DO UPDATE SET } func (w *writer) applyOpStarted(ctx context.Context, tx *sql.Tx, ev canonical.OpStartedEvent) error { - sessionID, err := w.requireSessionID(ctx, tx, ev.SessionNativeID, ev.Ts) + prep, err := w.prepareOpStarted(ctx, tx, ev) if err != nil { return err } + if err := w.upsertOpStarted(ctx, tx, ev, prep); err != nil { + return err + } + return w.afterOpStarted(ctx, tx, ev, prep) +} + +// opStartedPrep carries the resolved-row state applyOpStarted needs to feed the +// SQL upsert and the post-upsert hooks. It captures the canonical session/turn/ +// op ids, the optional parent op id, the (possibly-stashed) child session id, +// the resolver-grafted extras blob, and the op's prior persisted catalog +// identity — read BEFORE the upsert so the catalog migrate logic sees the OLD +// row (SOW-0004 H1a / I1). opInserted mirrors the original `!prior.found` +// signal so onOpStarted knows whether this is a genuine new insert or a +// same/changed-identity re-emit. +type opStartedPrep struct { + sessionID string + turnID string + opID string + parentOpID sql.NullString + childSessionID sql.NullString + extras any + prior priorOpIdentity + opInserted bool +} + +// prepareOpStarted does every read the upsert needs: resolves the session/turn +// ids, synthesizes the parent turn if no TurnStarted arrived first, captures +// the op's prior persisted catalog identity (so the catalog upsert can migrate +// contributions on an identity-changed re-emit — SOW-0004 I1), and computes +// the parent-op id + child-session link + marshaled extras blob. No mutating +// SQL beyond the turn ON CONFLICT DO NOTHING synth — the ops UPSERT runs in +// upsertOpStarted. +func (w *writer) prepareOpStarted(ctx context.Context, tx *sql.Tx, ev canonical.OpStartedEvent) (opStartedPrep, error) { + sessionID, turnID, opID, prior, err := w.opStartedIdentity(ctx, tx, ev) + if err != nil { + return opStartedPrep{}, err + } + parentOpID := resolveOpStartedParent(turnID, ev.ParentOpSeq) + childSessionID, opExtras, err := w.resolveOpStartedChild(ctx, tx, ev) + if err != nil { + return opStartedPrep{}, err + } + extras, err := marshalExtras(opExtras) + if err != nil { + return opStartedPrep{}, fmt.Errorf("writer: marshal op extras: %w", err) + } + return opStartedPrep{ + sessionID: sessionID, + turnID: turnID, + opID: opID, + parentOpID: parentOpID, + childSessionID: childSessionID, + extras: extras, + prior: prior, + opInserted: !prior.found, + }, nil +} + +// opStartedIdentity resolves the canonical session/turn/op ids for this +// OpStarted (creating stub session/turn rows if no SessionStarted/TurnStarted +// arrived first) and reads the op's PRIOR persisted catalog identity. The +// prior identity is captured BEFORE the ops UPSERT so the catalog upsert can +// (a) count the call ONCE per distinct op (SOW-0004 H1a) and (b) MIGRATE the +// op's contribution off its old key when this re-emit CHANGES the catalog +// identity (codex MCP enrichment re-stamping tool_namespace/name on the same +// (turn,seq) — SOW-0004 I1). ON CONFLICT DO UPDATE returns RowsAffected=1 for +// both insert and update under modernc/sqlite, so reading the row first is +// the authoritative insert-vs-update signal: prior.found=false (sql.ErrNoRows) +// ⇒ genuine new insert. The OpStarted upsert touches only identity columns + +// start_ts/extras — never the status/tokens/cost/duration columns — so the +// totals read here are exactly the contribution onOpFinalized already booked +// under the old identity. +func (w *writer) opStartedIdentity(ctx context.Context, tx *sql.Tx, ev canonical.OpStartedEvent) (string, string, string, priorOpIdentity, error) { + sessionID, err := w.requireSessionID(ctx, tx, ev.SessionNativeID, ev.Ts) + if err != nil { + return "", "", "", priorOpIdentity{}, err + } turnID := canonicalTurnID(sessionID, ev.TurnSeq) // Ensure parent turn row exists — synthesize a running turn if no - // TurnStarted arrived first. This matches the spec: turns may be - // implicit when the source format doesn't emit explicit start - // records. + // TurnStarted arrived first. This matches the spec: turns may be implicit + // when the source format doesn't emit explicit start records. if _, err := tx.ExecContext(ctx, ` INSERT INTO turns (id, session_id, seq, start_ts, status) VALUES (?, ?, ?, ?, ?) ON CONFLICT (session_id, seq) DO NOTHING `, turnID, sessionID, ev.TurnSeq, ev.Ts, string(canonical.StatusRunning)); err != nil { - return fmt.Errorf("writer: synthesize turn for op: %w", err) + return "", "", "", priorOpIdentity{}, fmt.Errorf("writer: synthesize turn for op: %w", err) } opID := canonicalOpID(turnID, ev.Seq) - // Read the op's PRIOR persisted catalog identity + terminal totals BEFORE the - // upsert so the catalog can (a) count the call ONCE per distinct op (SOW-0004 - // H1a) and (b) MIGRATE the op's contribution off its old key when this re-emit - // CHANGES the catalog identity (codex MCP enrichment re-stamping - // tool_namespace/name on the same (turn,seq) — SOW-0004 I1). A re-emitted - // OpStarted (late enrichment on the same (turn,seq), as the codex/claude_code - // replay-from-0 + enrichment design emits) is an UPDATE, not a new call. ON - // CONFLICT DO UPDATE returns RowsAffected=1 for both insert and update under - // modernc/sqlite, so reading the row first is the authoritative - // insert-vs-update signal: found=false (sql.ErrNoRows) ⇒ genuine new insert. - // The OpStarted upsert below touches only identity columns + start_ts/extras — - // never the status/tokens/cost/duration columns — so the totals read here are - // exactly the contribution onOpFinalized already booked under the old identity. prior, err := w.opPriorIdentity(ctx, tx, opID) if err != nil { - return err - } - opInserted := !prior.found - var parentOpID sql.NullString - if ev.ParentOpSeq >= 0 { - parentOpID = sql.NullString{String: canonicalOpID(turnID, ev.ParentOpSeq), Valid: true} - } - var childSessionID sql.NullString - opExtras := ev.Extras - if ev.ChildSessionNativeID != "" { - // Only point child_session_id at another session when that row - // is already in the store; otherwise the FK fires. When the child - // has not landed yet (the parent transcript is read before, or in a - // different batch than, the child sidechain), leave the link NULL and - // stash the child native id in ops.extras_json.aiViewer.childNativeId. - // The resolver re-links child_session_id from that stash once the - // child session lands (P1a) — mirroring the session - // extras_json.aiViewer.parentNativeId stash + resolver pass. - if cid, err := w.lookupSessionID(ctx, tx, ev.ChildSessionNativeID); err == nil { - childSessionID = sql.NullString{String: cid, Valid: true} - } else if errors.Is(err, sql.ErrNoRows) { - opExtras = mergeExtras(ev.Extras, map[string]any{ - "aiViewer": map[string]any{"childNativeId": ev.ChildSessionNativeID}, - }) - } else { - return err - } + return "", "", "", priorOpIdentity{}, err } - extras, err := marshalExtras(opExtras) - if err != nil { - return fmt.Errorf("writer: marshal op extras: %w", err) + return sessionID, turnID, opID, prior, nil +} + +// resolveOpStartedParent maps an OpStartedEvent.ParentOpSeq into a NullString +// op id, mirroring the original inline ParentOpSeq>=0 guard. A negative +// ParentOpSeq stays as SQL NULL (the op has no parent op). +func resolveOpStartedParent(turnID string, parentOpSeq int) sql.NullString { + if parentOpSeq < 0 { + return sql.NullString{} + } + return sql.NullString{String: canonicalOpID(turnID, parentOpSeq), Valid: true} +} + +// resolveOpStartedChild looks up the child session row (when the event names +// one) and either returns its id directly OR stashes the child native id +// into the op's extras_json.aiViewer.childNativeId so the resolver can re-link +// it once the child session lands (P1a). A SQL error other than ErrNoRows is +// propagated unchanged. When the event carries no ChildSessionNativeID, the +// child link stays NULL and the extras blob is the event's untouched extras. +func (w *writer) resolveOpStartedChild(ctx context.Context, tx *sql.Tx, ev canonical.OpStartedEvent) (sql.NullString, map[string]any, error) { + if ev.ChildSessionNativeID == "" { + return sql.NullString{}, ev.Extras, nil + } + cid, err := w.lookupSessionID(ctx, tx, ev.ChildSessionNativeID) + if err == nil { + return sql.NullString{String: cid, Valid: true}, ev.Extras, nil } + if !errors.Is(err, sql.ErrNoRows) { + return sql.NullString{}, nil, err + } + // Child has not landed yet (parent transcript is read before, or in a + // different batch than, the child sidechain) — leave the link NULL and + // stash the child native id in ops.extras_json.aiViewer.childNativeId. + // The resolver re-links child_session_id from that stash once the child + // session lands (P1a) — mirroring the session aiViewer.parentNativeId + // stash + resolver pass. + stashed := mergeExtras(ev.Extras, map[string]any{ + "aiViewer": map[string]any{"childNativeId": ev.ChildSessionNativeID}, + }) + return sql.NullString{}, stashed, nil +} + +// upsertOpStarted runs the ops UPSERT exactly as before — same column list, +// same ON CONFLICT clause, same graftAiViewerExtras wiring. Pulled out so +// applyOpStarted is orchestration-only and the dense SQL lives in one place. +func (w *writer) upsertOpStarted(ctx context.Context, tx *sql.Tx, ev canonical.OpStartedEvent, prep opStartedPrep) error { // extras_json is REPLACED wholesale on conflict EXCEPT the resolver's // `aiViewer` stash sub-object, which is grafted back when a stash-free re-emit // omits it (SOW-0003 P2.6d/P2.7c). A re-emit of the same op whose extras lack @@ -849,15 +1014,26 @@ ON CONFLICT (turn_id, seq) DO UPDATE SET child_session_id = COALESCE(ops.child_session_id, excluded.child_session_id), extras_json = `+graftAiViewerExtras("ops.extras_json")+` `, - opID, turnID, sessionID, parentOpID, ev.Seq, + prep.opID, prep.turnID, prep.sessionID, prep.parentOpID, ev.Seq, string(ev.Kind), ev.Name, nullIfEmpty(ev.ToolNamespace), nullIfEmpty(ev.Model), nullIfEmpty(ev.Provider), nullIfEmpty(ev.ProviderAlias), nullIfEmpty(ev.ReasoningKind), ev.Ts, string(canonical.StatusRunning), - childSessionID, extras, + prep.childSessionID, prep.extras, ); err != nil { return fmt.Errorf("writer: insert op: %w", err) } - w.markDirtyTurn(turnID) - w.markDirtySession(sessionID) + return nil +} + +// afterOpStarted applies the post-upsert side effects: dirty turn/session +// marking for the aggregate refresh, dirty rollup-bucket marking for the +// hourly/daily rollup recompute (ev.Ts IS the op start, the bucket the op +// rolls into), dirty fts_ops marking so refreshFTS rebuilds the search row, +// and the catalog upsert (which uses prep.prior to decide between count-once +// new-insert vs migrate-contribution identity change). Behaviour is byte-for- +// byte identical to the original inline tail of applyOpStarted. +func (w *writer) afterOpStarted(ctx context.Context, tx *sql.Tx, ev canonical.OpStartedEvent, prep opStartedPrep) error { + w.markDirtyTurn(prep.turnID) + w.markDirtySession(prep.sessionID) // The op's start hour gains an op, so that hour's rollup must be // recomputed. ev.Ts IS the op start (the OpStarted timestamp), which is // the bucket the op rolls up into. ASSUMPTION (catalog + SOW-0027): an op @@ -867,11 +1043,8 @@ ON CONFLICT (turn_id, seq) DO UPDATE SET // The op's indexed text (name/model/provider/tool_namespace) may have // changed; refreshFTS rebuilds its fts_ops row from the final persisted // columns at flush time (fts_ops is always indexed — no flag gate). - w.markDirtyOp(opID) - if err := w.catalog.onOpStarted(ctx, tx, ev, opInserted, prior); err != nil { - return err - } - return nil + w.markDirtyOp(prep.opID) + return w.catalog.onOpStarted(ctx, tx, ev, prep.opInserted, prep.prior) } // opPriorIdentity reads an op's persisted catalog identity (kind/name/namespace/ @@ -1288,10 +1461,40 @@ VALUES (NULL, ?, NULL, NULL, ?, 'ERR', ?, ?, ?) } func (w *writer) applyLogEntry(ctx context.Context, tx *sql.Tx, ev canonical.LogEntryEvent) error { - sessionID, err := w.requireSessionID(ctx, tx, ev.SessionNativeID, ev.Ts) + refs, err := w.resolveLogEntryRefs(ctx, tx, ev) if err != nil { return err } + logID, inserted, err := insertLogEntry(ctx, tx, refs, ev) + if err != nil { + return err + } + w.markDirtySession(refs.sessionID) + return w.indexLogEntryFTS(ctx, tx, refs, ev, logID, inserted) +} + +// logEntryRefs carries the canonical session/turn/op linkage + the marshaled +// extras + the normalized severity that applyLogEntry's insert and FTS hooks +// both consume. Computed once in resolveLogEntryRefs so the downstream steps +// stay parameter-light. +type logEntryRefs struct { + sessionID string + turnID sql.NullString + opID sql.NullString + extras any + severity string +} + +// resolveLogEntryRefs resolves the canonical session/turn/op linkage for the +// LogEntryEvent and marshals its extras + normalizes its severity (default +// "INF" when missing). turnID/opID are kept as sql.NullString so the insert +// binds SQL NULL when the event has no turn or op context — preserving the +// exact COALESCE behaviour idx_log_entries_identity relies on. +func (w *writer) resolveLogEntryRefs(ctx context.Context, tx *sql.Tx, ev canonical.LogEntryEvent) (logEntryRefs, error) { + sessionID, err := w.requireSessionID(ctx, tx, ev.SessionNativeID, ev.Ts) + if err != nil { + return logEntryRefs{}, err + } var turnID sql.NullString if ev.TurnSeq > 0 { turnID = sql.NullString{String: canonicalTurnID(sessionID, ev.TurnSeq), Valid: true} @@ -1302,41 +1505,57 @@ func (w *writer) applyLogEntry(ctx context.Context, tx *sql.Tx, ev canonical.Log } extras, err := marshalExtras(ev.Extras) if err != nil { - return fmt.Errorf("writer: marshal log extras: %w", err) + return logEntryRefs{}, fmt.Errorf("writer: marshal log extras: %w", err) } severity := strings.ToUpper(strings.TrimSpace(ev.Severity)) if severity == "" { severity = "INF" } - // RETURNING id yields the new log_entries.id on a genuine insert and - // sql.ErrNoRows when the ON CONFLICT DO NOTHING fires (a replayed, - // byte-identical log row). That signal drives fts_logs: index ONCE per - // first-seen log (append-only), skip on replay so no duplicate fts_logs row - // is created — matching BackfillFTS, which reads the already-deduped - // log_entries. logID stays 0 (invalid) on conflict. + return logEntryRefs{ + sessionID: sessionID, + turnID: turnID, + opID: opID, + extras: extras, + severity: severity, + }, nil +} + +// insertLogEntry runs the INSERT … ON CONFLICT DO NOTHING … RETURNING id. On +// a genuine first-seen insert it returns the new log_entries.id and inserted=true; +// on a replayed byte-identical row (idx_log_entries_identity collides) the ON +// CONFLICT DO NOTHING fires and RETURNING yields no row → sql.ErrNoRows, which +// this helper translates to (0, false, nil) so the caller can skip the FTS +// insert without surfacing an error. Any other SQL failure propagates. +func insertLogEntry(ctx context.Context, tx *sql.Tx, refs logEntryRefs, ev canonical.LogEntryEvent) (int64, bool, error) { var logID int64 insertErr := tx.QueryRowContext(ctx, ` INSERT INTO log_entries (session_id, source_id, turn_id, op_id, ts, severity, source, message, extras_json) VALUES (?, NULL, ?, ?, ?, ?, ?, ?, ?) `+logEntryOnConflict+` -RETURNING id`, sessionID, turnID, opID, ev.Ts, severity, ev.Source, ev.Message, extras).Scan(&logID) - insertedNewLog := true +RETURNING id`, refs.sessionID, refs.turnID, refs.opID, ev.Ts, refs.severity, ev.Source, ev.Message, refs.extras).Scan(&logID) if insertErr != nil { - if !errors.Is(insertErr, sql.ErrNoRows) { - return fmt.Errorf("writer: insert log_entry: %w", insertErr) + if errors.Is(insertErr, sql.ErrNoRows) { + return 0, false, nil // duplicate: the fts_logs row already exists. } - insertedNewLog = false // duplicate: the fts_logs row already exists. + return 0, false, fmt.Errorf("writer: insert log_entry: %w", insertErr) } - w.markDirtySession(sessionID) - // fts_logs is gated on the source's fts5_index_logs flag (fts_ops is not). - // Index only newly-inserted logs from THIS source; replays are no-ops. - if insertedNewLog && w.fts5IndexLogs { - if _, err := tx.ExecContext(ctx, ` + return logID, true, nil +} + +// indexLogEntryFTS inserts a matching fts_logs row when this batch wrote a +// NEW log_entries row AND the source has fts5_index_logs=true. fts_logs is +// append-only (no DELETE) — re-emitting the same log_entries row is a no-op +// here too, so an open scan + a Tail re-read of the same line never +// duplicates the FTS row. fts_ops has no flag gate; this hook is fts_logs-only. +func (w *writer) indexLogEntryFTS(ctx context.Context, tx *sql.Tx, refs logEntryRefs, ev canonical.LogEntryEvent, logID int64, insertedNewLog bool) error { + if !insertedNewLog || !w.fts5IndexLogs { + return nil + } + if _, err := tx.ExecContext(ctx, ` INSERT INTO fts_logs (message, log_id, session_id, op_id, severity, ts) VALUES (?, ?, ?, ?, ?, ?)`, - ev.Message, logID, sessionID, opID, severity, ev.Ts); err != nil { - return fmt.Errorf("writer: index log_entry in fts_logs: %w", err) - } + ev.Message, logID, refs.sessionID, refs.opID, refs.severity, ev.Ts); err != nil { + return fmt.Errorf("writer: index log_entry in fts_logs: %w", err) } return nil } @@ -1344,10 +1563,9 @@ VALUES (?, ?, ?, ?, ?, ?)`, // applySourceProgress records the cursor checkpoint and the wall-clock // timestamp the adapter emitted at. The actual write to source_progress // is deferred to the worker's flush so it happens once per batch. -func (w *writer) applySourceProgress(ev canonical.SourceProgressEvent) error { +func (w *writer) applySourceProgress(ev canonical.SourceProgressEvent) { w.lastCursor = ev.Cursor w.hasCursor = true - return nil } // applySourceError records a parse error against the source: increments From 07b536d75b6787695ea36364f633bcc33496dcaf Mon Sep 17 00:00:00 2001 From: user <2662304+ktsaou@users.noreply.github.com> Date: Wed, 10 Jun 2026 10:36:20 +0300 Subject: [PATCH 2/5] Adopt Production-Grade Loop: minimax implementer + 5-reviewer cycle Encode the operator's new operating model in the project contract: - AGENTS.md: new "Production-Grade Loop" section, Hard Rule #11, updated branch-protection + merge workflow. - project-workflow SKILL: roles table (CTO / implementer=minimax / 5 reviewers), updated cycle, Step 4/6/7/8 reflect the new protocol. - project-delegation SKILL: "The implementer is minimax" section, backup rotation order, new forbidden block wording. - project-second-opinions SKILL: the 5-reviewer set (glm, mimo, minimax, qwen, deepseek), PRODUCTION GRADE vote, P0/P1/P2/P3 stop conditions, claim-verification steps, deprecation note for codex/gemini on the production loop. This makes the protocol survive compactions: AGENTS.md is loaded at startup, and the three skills are the runtime enforcement. No production code changed. No SOW affected. Pure contract update. --- .agents/skills/project-delegation/SKILL.md | 15 ++- .../skills/project-second-opinions/SKILL.md | 119 ++++++++++++------ .agents/skills/project-workflow/SKILL.md | 50 +++++--- AGENTS.md | 116 +++++++++++++++-- 4 files changed, 234 insertions(+), 66 deletions(-) diff --git a/.agents/skills/project-delegation/SKILL.md b/.agents/skills/project-delegation/SKILL.md index 48e22c8..e12c6ea 100644 --- a/.agents/skills/project-delegation/SKILL.md +++ b/.agents/skills/project-delegation/SKILL.md @@ -7,14 +7,23 @@ description: Orchestration patterns for ai-viewer — when and how to spawn suba ## The Hard Rule -The master assistant is the **orchestrator**, **QA lead**, **integrator**, and **reviewer**. The master assistant does **not** write production code. Code is produced by spawned subagents working from a written spec and failing tests. +The master assistant (CTO) is the **orchestrator**, **QA lead**, **integrator**, and the **only role that runs reviewers**. The master assistant does **not** write production code. Code is produced by spawned subagents working from a written spec and failing tests. -This rule exists because: +### The implementer is `minimax` + +Per the Production-Grade Loop (see `AGENTS.md`), the CTO delegates all code production to **`minimax`** — the current stable minimax variant on litellm (default `nova/minimax-m2.5`). The CTO is the only role that knows the project context; the implementer is a fresh-context subagent that receives a self-contained prompt (spec excerpt, failing tests, constraints, deliverable). + +- The implementer is **not** the same instance as the `minimax` review pass. Two different invocations, two different contexts, two different jobs. +- If `minimax` is down/degraded, the CTO rotates the implementer role to the next-most-capable member of the reviewer set (default order: `qwen` → `mimo` → `deepseek` → `glm`) and logs the rotation in the SOW under `## Implementer Rotation`. +- The CTO pins to the current stable model at time of work, per the project's "always pin to latest stable" policy. Major-version upgrades require a brief SOW; minor/patch upgrades are autonomous. + +### Why this rule exists - The master assistant's context is finite. Code-writing fills it with raw output that displaces decision history. - Subagent output gets independently verified by the master before being trusted. Master-written code skips that verification step. - Compaction destroys the master's working memory; subagents start with a fresh, self-contained context every time. - Parallel subagents finish faster than serial master-context editing. +- Splitting "writer" and "reviewer" across different model families produces a more honest, less self-confirming codebase. If the master assistant ever finds itself about to call `Edit` or `Write` on a production source file, stop and delegate. @@ -186,7 +195,7 @@ Conversely, do not spawn a subagent for a one-line typo fix. Trivial verified ed ## Cross-References -- Contract: `AGENTS.md` (Delegation Protocol section) +- Contract: `AGENTS.md` "Production-Grade Loop" section (the single source of truth). - Workflow: `.agents/skills/project-workflow/SKILL.md` - Coding rules: `.agents/skills/project-coding/SKILL.md` - Gates: `.agents/skills/project-quality-gates/SKILL.md` diff --git a/.agents/skills/project-second-opinions/SKILL.md b/.agents/skills/project-second-opinions/SKILL.md index ac15f14..20aedc1 100644 --- a/.agents/skills/project-second-opinions/SKILL.md +++ b/.agents/skills/project-second-opinions/SKILL.md @@ -1,15 +1,17 @@ --- name: project-second-opinions -description: Invoke external LLMs (codex, gemini, glm, kimi, mimo, minimax, qwen, deepseek) for code review, SOW review, design validation, and second opinions on ai-viewer work. Use before marking any non-trivial SOW completed and after major architectural changes. +description: Invoke the 5-reviewer Production-Grade Loop (glm, mimo, minimax, qwen, deepseek) for code review, SOW review, design validation, and second opinions on ai-viewer work. The CTO runs reviewers; the implementer never does. Use before marking any non-trivial SOW completed and after major architectural changes. --- -# Second Opinions +# Second Opinions — the 5-Reviewer Production-Grade Loop + +This skill is the runtime enforcement of the **Production-Grade Loop** defined in `AGENTS.md`. The contract lives in `AGENTS.md`; this file is the implementation. If the two ever disagree, `AGENTS.md` wins. ## When To Run External second-opinion review is **mandatory** — not "encouraged" — for any non-trivial work. The assistant does not trust itself; review converges before "done" is uttered. -**The orchestrator (master) runs review — exactly once per iteration, on the final integrated state — never the implementation subagent.** Review is the master's QA gate on code it did not author; an implementation subagent running reviewers on its own work both duplicates the master's mandatory round (the master will run it again → 2× the slow, costly review of identical code) and collapses the author/reviewer separation. Because spawned subagents inherit `AGENTS.md` (which mandates this review), the master MUST explicitly forbid reviewers in every implementation delegation prompt — see `project-delegation` skill, the `[FORBIDDEN]` block. If a subagent reports it "ran reviewers," that round does not substitute for the master's: treat the subagent's findings as a useful head start, then run the one official round on the final state and do not re-run beyond convergence. +**The orchestrator (CTO / master assistant) runs review — exactly once per iteration, on the final integrated state — never the implementation subagent.** Review is the master's QA gate on code it did not author; an implementation subagent running reviewers on its own work both duplicates the master's mandatory round (the master will run it again → 2× the slow, costly review of identical code) and collapses the author/reviewer separation. Because spawned subagents inherit `AGENTS.md` (which mandates this review), the master MUST explicitly forbid reviewers in every implementation delegation prompt — see `project-delegation` skill, the `[FORBIDDEN]` block. If a subagent reports it "ran reviewers," that round does not substitute for the master's: treat the subagent's findings as a useful head start, then run the one official round on the final state and do not re-run beyond convergence. Mandatory before marking any of these SOWs `completed`: @@ -21,20 +23,56 @@ Mandatory before marking any of these SOWs `completed`: - Any SOW spanning > 3 files of non-trivial logic. - Any SOW the operator flags as important. -Mandatory minimum standard: +### The 5-reviewer set (CTO only) + +The CTO runs **exactly these five reviewers in parallel** on every non-trivial code-producing PR: + +| # | Reviewer | Invocation | +|---|---|---| +| 1 | `glm` | `timeout 1800 opencode run -m "llm-netdata-cloud/glm-5.1" --agent code-reviewer "PROMPT"` | +| 2 | `mimo` | `timeout 1800 opencode run -m "llm-netdata-cloud/mimo-v2.5-pro" --agent code-reviewer "PROMPT"` | +| 3 | `minimax` (fresh-context review pass; **never** the implementer instance) | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m2.7-coder" --agent code-reviewer "PROMPT"` | +| 4 | `qwen` | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.6-plus" --agent code-reviewer "PROMPT"` | +| 5 | `deepseek` | `timeout 1800 opencode run -m "deepseek/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | + +All five run in parallel (one Bash invocation each, batched in a single assistant turn). Foreground, with `timeout 1800`. The CTO is the only role that runs them. + +`codex` and `gemini` from the previous default set are **deprecated** for production-grade review on this project; they are kept in the invocation table for ad-hoc SOW/spec review only. + +### PRODUCTION GRADE vote + +Each reviewer responds with one of two outcomes: + +- `PRODUCTION GRADE` — ship it, no actionable findings. +- `NEEDS WORK` — one or more findings, each with file:line, severity (P0–P3), and a concrete fix proposal. + +The CTO does not merge until 5/5 PRODUCTION GRADE, **or** until only P3 noise remains AND the CTO has recorded the P3 findings in the SOW under `## Reviews` with a disposition. P0/P1 always block. P2 always blocks unless explicitly waived by the CTO with a documented reason (rare). -- **At least three reviewers in parallel** for code review (default set: codex + gemini + glm + qwen). -- **Same prompt across iterations**; never narrow scope on follow-up rounds. -- **Iterate until reviewers converge** with no new actionable findings. -- **Record every round in the SOW** under `## Reviews` with reviewer attribution and resolution. +### Stop conditions (P0/P1/P2/P3) -Skip only: +- **5/5 PRODUCTION GRADE, gates green, CI green** → CTO merges. +- **Any P0/P1 NEEDS WORK** → fix, push, re-trigger full 5-reviewer cycle. Iterate. +- **P2/P3 NEEDS WORK** → fix in this PR; merge when 5/5 PG or only P3 noise remains. +- **Hard stall: 5+ cycles with new P0/P1 each round** → CTO writes a `## Regression` section in the SOW, opens a follow-up SOW in `.agents/sow/pending/`, and surfaces to the operator with a recommendation. Do not loop forever. + +### Claim verification (CRITICAL — CTO's job) + +Reviewer findings are **claims, not findings**. The CTO verifies every claim before acting on it. Verification steps: + +1. **Read the file:line the reviewer cited.** Does the code actually do what the reviewer said? +2. **Run the repro.** If the reviewer says "this race fires under X", construct X and run it. +3. **Cross-check with the spec.** If the reviewer says "violates SPEC Y §3.2", open the spec and confirm. +4. **Decide**: real bug (fix), false positive (reject with evidence in the SOW `## Reviews`), disputed (escalate). + +Acting on unverified claims causes two failure modes: (a) implementing phantom bugs that don't exist, (b) ignoring real bugs because the reviewer "sounded uncertain". Verify, then act. The CTO is the only one who decides. + +Skip the 5-reviewer cycle only: - Typo / format-only changes the assistant has visually verified. - Mechanical renames with no behavior change. - Doc-only updates with no spec/runtime impact. -The bar to skip is high. When in doubt, run reviewers. +The bar to skip is high. When in doubt, run the cycle. ## Safety Rule (CRITICAL) @@ -48,33 +86,36 @@ Detection signals (any of these → do not run reviewers): When detected, complete the review work directly and return. -## Invocation Patterns +**This rule applies to the `minimax` review pass too**: when running as a reviewer, `minimax` is a fresh-context read-only session. If the prompt contains any of the above signals (it always does — the review prompt is "YOU ARE RUNNING BY ANOTHER ASSISTANT, FOR A SECOND OPINION"), the reviewer MUST NOT spawn further reviewers. The CTO is the only role that may spawn the 5-reviewer cycle. + +## Invocation Patterns (legacy / ad-hoc) -All commands use `timeout 1800` (30 minutes max wait). Run multiple reviewers in parallel (one Bash invocation per reviewer, batched in one assistant turn). Run in foreground (no `&`, no `run_in_background`). +The Production-Grade Loop supersedes the previous default set. The 5 reviewers above are mandatory for any non-trivial code-producing PR. The reviewers below remain available for **ad-hoc SOW/spec review** (one-off, off the production loop), where the CTO may pick a smaller subset: | Reviewer | Command | |---|---| -| codex | `timeout 1800 codex exec "PROMPT" --skip-git-repo-check` | -| gemini | `timeout 1800 gemini -p "PROMPT"` | -| claude (Anthropic) | `CLAUDECODE="" timeout 1800 claude -p "PROMPT"` | +| codex (ad-hoc only) | `timeout 1800 codex exec "PROMPT" --skip-git-repo-check` | +| gemini (ad-hoc only) | `timeout 1800 gemini -p "PROMPT"` | +| claude (Anthropic) (ad-hoc only) | `CLAUDECODE="" timeout 1800 claude -p "PROMPT"` | | glm | `timeout 1800 opencode run -m "llm-netdata-cloud/glm-5.1" --agent code-reviewer "PROMPT"` | | kimi | `timeout 1800 opencode run -m "llm-netdata-cloud/kimi-k2.6" --agent code-reviewer "PROMPT"` | | mimo | `timeout 1800 opencode run -m "llm-netdata-cloud/mimo-v2.5-pro" --agent code-reviewer "PROMPT"` | | qwen | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.6-plus" --agent code-reviewer "PROMPT"` | -| minimax | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m2.7-coder" --agent code-reviewer "PROMPT"` | +| minimax (review; never the implementer instance) | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m2.7-coder" --agent code-reviewer "PROMPT"` | | deepseek | `timeout 1800 opencode run -m "deepseek/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | `cd` into the project root before running. Use relative paths (some reviewers stumble on arbitrary absolute paths). -## Default Reviewer Set +## Ad-hoc Reviewer Set (off the production loop) -- **Code review** (PR-style): codex + gemini + glm + qwen in parallel. Four reviewers triangulate well; more becomes noise. - **SOW / design review**: codex + gemini + mimo in parallel. -- **Security-focused review**: add minimax + deepseek. +- **Security-focused review**: minimax + deepseek. + +The Production-Grade Loop's five reviewers (glm, mimo, minimax, qwen, deepseek) are reserved for the production PR cycle and are not used for ad-hoc SOW/spec pre-review. ## Prompt Templates -### Code review +### Code review (Production-Grade Loop) ``` YOU ARE RUNNING BY ANOTHER ASSISTANT, FOR A SECOND OPINION: @@ -83,6 +124,10 @@ Please review the following change in this repository: +Vote ONE of: +- PRODUCTION GRADE — ship it, no actionable findings. +- NEEDS WORK — list findings below, each with file:line, severity (P0/P1/P2/P3), and a concrete fix proposal. + Look for: - Correctness bugs - Race conditions @@ -101,7 +146,7 @@ MANDATORY RULES (FOLLOW ALWAYS): THIS IS A READ-ONLY REQUEST. PROVIDE YOUR REVIEW. ``` -### SOW review +### SOW review (ad-hoc) ``` YOU ARE RUNNING BY ANOTHER ASSISTANT, FOR A SECOND OPINION: @@ -124,7 +169,7 @@ MANDATORY RULES: THIS IS A READ-ONLY REQUEST. ``` -### Spec/design review +### Spec/design review (ad-hoc) ``` YOU ARE RUNNING BY ANOTHER ASSISTANT, FOR A SECOND OPINION: @@ -146,32 +191,34 @@ THIS IS A READ-ONLY REQUEST. ## Workflow -1. Decide the review type (code / SOW / spec). -2. Pick the reviewer set. +1. Decide the review type (production PR cycle / ad-hoc SOW / ad-hoc spec). +2. Pick the reviewer set: 5 reviewers for production PR; smaller subset for ad-hoc. 3. Compose the prompt (use a template; keep neutral, no embedded conclusions). -4. **Show the prompt to the user before running.** -5. Run all reviewers in parallel (multiple Bash tool calls in one turn). -6. Wait for all to return. -7. Synthesize findings: +4. Run all reviewers in parallel (multiple Bash tool calls in one turn). +5. Wait for all to return. +6. Synthesize findings: - List each unique finding once, attributed to which reviewer flagged it. - - Classify: bug, design concern, style, false positive. - - Decide which to act on. -8. Address findings (code changes + new tests where applicable). -9. **Re-run the same reviewers with the same scope** plus a short note of fixes applied. -10. Repeat until no actionable findings remain. -11. Record the review history in the SOW under `## Reviews`. + - Classify: P0/P1/P2/P3. + - **Verify every claim** (read the file:line, run the repro, cross-check spec). +7. Address findings (code changes + new tests where applicable). Reject false positives with evidence. +8. **Re-run the same reviewers with the same scope** plus a short note of fixes applied. +9. Repeat until no actionable P0/P1 findings remain. P2 fixed in PR. P3 documented in SOW. +10. Record the review history in the SOW under `## Reviews` with reviewer attribution, the CTO's claim-verification verdict, the fix applied (or "rejected — false positive" with evidence), and the final PRODUCTION GRADE count (e.g. `5/5 PG after fix`). +11. CTO merges via `gh pr merge --merge --delete-branch` once gates green, CI green, and 5/5 PG (or only P3 noise remains). ## Anti-Patterns - **Narrowing scope on follow-up reviews.** Leaves the rest unreviewed. Always use the same prompt. -- **One reviewer only for important work.** Single-reviewer blind spots are real. Minimum three for code-producing SOWs. +- **Fewer than 5 reviewers on a production PR.** Single-reviewer or 3-reviewer blind spots are real. The 5-reviewer cycle is mandatory. - **Editing the prompt to be less neutral after a reviewer disagreed.** The disagreement is data, not something to argue with. - **Running reviewers in background and forgetting.** Use foreground. The harness handles parallelism. - **Pre-screening: "skip review because I'm confident".** That's exactly when you need the review. -- **Reporting work "done" before review convergence.** The honest phrasing while review is pending is "code written, gates green, review pending". +- **Acting on unverified claims.** Always run the verification steps before implementing a fix. +- **Reporting work "done" before review convergence.** The honest phrasing while review is pending is "code written, gates green, review pending (X/5 PG)". ## Cross-References +- Contract: `AGENTS.md` "Production-Grade Loop" section (the single source of truth). - Workflow: `.agents/skills/project-workflow/SKILL.md` - Coding: `.agents/skills/project-coding/SKILL.md` - Delegation: `.agents/skills/project-delegation/SKILL.md` diff --git a/.agents/skills/project-workflow/SKILL.md b/.agents/skills/project-workflow/SKILL.md index 162eb03..7b7fb2f 100644 --- a/.agents/skills/project-workflow/SKILL.md +++ b/.agents/skills/project-workflow/SKILL.md @@ -1,13 +1,23 @@ --- name: project-workflow -description: Master orchestration cycle for ai-viewer work — the spec→test→code→review→gates→commit sequence enforced on every non-trivial task. Use at the start of any SOW work, before any Edit/Write in the project, after every milestone, and whenever the assistant catches itself about to skip a step. The single source of truth for "how we work here". +description: Master orchestration cycle for ai-viewer work — the spec→test→code→review→gates→commit sequence enforced on every non-trivial task, operating under the Production-Grade Loop (implementer=minimax, 5 reviewers=glm/mimo/minimax/qwen/deepseek). Use at the start of any SOW work, before any Edit/Write in the project, after every milestone, and whenever the assistant catches itself about to skip a step. The single source of truth for "how we work here". --- # Workflow ## Purpose -This skill is the assistant's runtime checklist. The contract lives in `AGENTS.md`; the durable spec lives in `.agents/sow/specs/workflow.md`; this file is the operational pattern the assistant follows every time. Load it at the start of every meaningful task. If the assistant ever notices it has started writing code without consulting this skill, stop and restart from the top. +This skill is the assistant's runtime checklist. The contract lives in `AGENTS.md` (the "Production-Grade Loop" section is the operating model); the durable spec lives in `.agents/sow/specs/workflow.md`; this file is the operational pattern the assistant follows every time. Load it at the start of every meaningful task. If the assistant ever notices it has started writing code without consulting this skill, stop and restart from the top. + +## Roles (per the Production-Grade Loop) + +| Role | Model | Job | +|---|---|---| +| **CTO (master assistant)** | me | orchestrate, decide, verify reviewer claims, integrate, merge, report. Does not write production code. | +| **Implementer** | `minimax` (fresh subagent) | code + tests + specs as delegated. The single producer of code. | +| **Reviewers** | `glm`, `mimo`, `minimax`, `qwen`, `deepseek` (5 in parallel, fresh-context) | vote `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. | + +**Implementer ≠ Reviewer.** The `minimax` instance that implements a SOW is **not** the same instance that reviews it. The CTO is the only role that runs reviewers. The CTO is the only role that verifies reviewer claims. See `AGENTS.md` for the full contract and `.agents/skills/project-second-opinions/SKILL.md` for reviewer invocation. ## The Cycle (Invariant Order) @@ -15,9 +25,9 @@ This skill is the assistant's runtime checklist. The contract lives in `AGENTS.m 1. Re-orient → read SOWs, specs, skills 2. Spec → update or create specs FIRST 3. Test → write failing tests against the new spec -4. Code → delegate to subagent; make tests pass +4. Code → delegate to minimax implementer; make tests pass 5. Gates → run every automated quality gate locally -6. Review → external second opinions; iterate until converged +6. Review → CTO runs 5-reviewer Production-Grade Loop; verify claims; iterate 7. Discipline → run the Discipline Checklist 8. Commit → one commit per logical step; spec + tests + code + docs together ``` @@ -68,9 +78,9 @@ A behavior without a failing test before the implementation lands is a defect. ## Step 4 — Code via Delegation -Production code is written by subagents, not in master context. See `project-delegation` skill. +Production code is written by the **`minimax` implementer** (a fresh-context subagent), not in master context. See `project-delegation` skill for the full protocol. -The delegation prompt to a subagent contains, at minimum: +The delegation prompt to the implementer contains, at minimum: - The SOW reference (file path). - The relevant spec excerpt the implementation must honor (quote verbatim). @@ -78,13 +88,16 @@ The delegation prompt to a subagent contains, at minimum: - The quality gates the change must satisfy (`./scripts/gates.sh`). - The forbidden patterns from `project-coding` skill. - An explicit instruction to not weaken or skip tests. +- An explicit `[FORBIDDEN]` block stating the implementer must NOT run external reviewers — the CTO does that. -After the subagent returns, the master assistant: +After the implementer returns, the master assistant (CTO): - Reads the actual diff (never trusts the summary). - Runs the failing tests to confirm they now pass. - Runs the gates to confirm nothing else broke. +- Confirms automated-reviewer findings (cubic, codacy) are addressed in the diff. - Decides whether to accept, ask for changes, or restart with a sharper prompt. +- Logs the implementer model in the SOW (default `nova/minimax-m2.5`; backup rotation per `AGENTS.md`). ## Step 5 — Quality Gates @@ -92,11 +105,18 @@ Run **every** gate from `project-quality-gates` skill locally. Not "the relevant If any gate fails: fix root cause, do not weaken the gate. Lowering a threshold to make a gate pass is a contract breach. -## Step 6 — External Review +## Step 6 — Production-Grade Loop (5-Reviewer Cycle) + +For non-trivial SOWs (anything beyond typos or single-line trivial fixes): the CTO runs the 5-reviewer Production-Grade Loop per `project-second-opinions` skill. -For non-trivial SOWs (anything beyond typos or single-line trivial fixes): run external second opinions in parallel per `project-second-opinions` skill. +Mandatory: -Minimum: three reviewers, same prompt, parallel execution. Iterate until reviewers converge with no new actionable findings. Record the rounds in the SOW under `## Reviews`. +- **Exactly 5 reviewers** in parallel: `glm`, `mimo`, `minimax` (fresh-context), `qwen`, `deepseek`. +- Same prompt across iterations; never narrow scope on follow-up rounds. +- Each reviewer votes `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. +- The CTO verifies every claim (read file:line, run the repro, cross-check the spec) before acting. +- P0/P1 → fix and re-trigger the cycle. P2 → fix in the same PR. P3 → document in SOW. +- Stop: 5/5 PG (or only P3 noise) + gates green + CI green → CTO merges. Anti-pattern: narrowing scope on follow-up review rounds to "review just my fixes". Always use the same broad scope plus a short note of fixes. This catches issues the first round did not surface. @@ -107,7 +127,7 @@ Before reporting completion to the operator: - [ ] Specs reflect new behavior — same commit as code. - [ ] Tests exist, pass, race-clean, coverage thresholds met. - [ ] All quality gates green locally. -- [ ] External review converged for non-trivial work. +- [ ] External review converged for non-trivial work (5/5 PG or only P3 noise with documented disposition). - [ ] No new TODO/FIXME without a tracked SOW in `pending/`. - [ ] `AGENTS.md`, skills, specs updated if a new pattern or gotcha emerged. - [ ] No half-built features. @@ -132,8 +152,8 @@ Standard PR flow for any change touching master: 2. Commit with the discipline above. 3. Push the branch (`git push -u origin `). 4. Open a PR (`gh pr create --base master --head `). -5. Run external reviewers (Step 6) on non-trivial PRs. -6. After convergence, **merge yourself**: `gh pr merge --merge --delete-branch`. +5. Run the 5-reviewer Production-Grade Loop (Step 6) on non-trivial PRs. +6. After convergence (5/5 PG or only P3 noise), **merge yourself**: `gh pr merge --merge --delete-branch`. 7. `git checkout master && git pull` so the local master tracks. No operator approval step. The operator gates SOWs, not PRs. @@ -152,7 +172,7 @@ Compact, honest, evidence-based. - Bullet points of what changed. - File paths with line numbers as evidence. - Gates: which ran, what they reported. -- Reviewers: which ran, what they found, what was done about it. +- Reviewers: which of the 5 ran, the PRODUCTION GRADE count, the claim-verification verdicts, what was done about it. - Next: what's queued, what's blocked, what needs operator input. Never report code as "working" or "ready" if any step above is incomplete. The honest phrasings are: @@ -180,7 +200,7 @@ Do not pause for technical preference, library choice, naming, or refactor strat - "I'll update the spec at the end." → contract breach. - "Let me just edit this Go file directly." → contract breach unless trivial verified typo. - "The operator can test the UI to confirm." → contract breach. -- "External review is overkill for this." → contract breach unless SOW says trivial. +- "External review is overkill for this." → contract breach unless SOW says trivial. The 5-reviewer Production-Grade Loop is the default for any non-trivial change. - "I'll fix the lint warning later." → contract breach. - "Adding `t.Skip` until I have time." → contract breach unless linked to an issue + SOW. - "Let me lower the coverage threshold for now." → contract breach. diff --git a/AGENTS.md b/AGENTS.md index ec05c84..0fc78fb 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -35,6 +35,8 @@ These are the assistant's standing orders. Violating any one is a contract breac 10. **Discipline is recorded.** After every meaningful task, the assistant runs the Discipline Checklist below and updates `AGENTS.md`, the relevant spec, and any relevant skill so the lesson is captured. Repeating a mistake the operator has already corrected is the most serious breach. +11. **The Production-Grade Loop is the operating model.** All production code is produced by `minimax` (the implementer) and reviewed in parallel by `glm`, `mimo`, `minimax`, `qwen`, `deepseek` (five reviewers, fresh-context — never the implementer reviewing its own work). The CTO verifies every reviewer claim, drives iteration, and merges only on `PRODUCTION GRADE` from all five. **This rule is a hard rule, not a guideline. It must survive restarts and compactions** — the full protocol lives in the "Production-Grade Loop" section below and in the `project-second-opinions`, `project-delegation`, and `project-workflow` skills. The CTO does not write production code; the implementer does not run external reviewers; the operator does not see technical detail. + ## Ownership Model This repository operates under a delegated-ownership contract. @@ -68,7 +70,7 @@ Every non-trivial change of architecture or design must be visible to the operat After SOW sign-off, the assistant does not ask permission for technical choices within the agreed scope. If a finding materially changes the SOW, the assistant pauses, writes an addendum, and asks. Otherwise: work proceeds; the operator receives a verified, tested, reviewed system. -**SOW sign-off is the ONLY approval gate.** The operator does NOT approve pull requests, code reviews, branch protection settings, dependency upgrades, or any other in-implementation step. PR review is performed by external LLM reviewers (see `project-second-opinions` skill) plus the assistant's own discipline checklist. The assistant **opens AND merges its own PRs** via `gh pr merge --merge --delete-branch` after external review converges. Asking the operator to "approve" a PR is a contract breach. +**SOW sign-off is the ONLY approval gate.** The operator does NOT approve pull requests, code reviews, branch protection settings, dependency upgrades, or any other in-implementation step. PR review is performed by the 5-reviewer Production-Grade Loop (see the dedicated section below + `project-second-opinions` skill). The assistant **opens AND merges its own PRs** via `gh pr merge --merge --delete-branch` after the 5-reviewer cycle converges. Asking the operator to "approve" a PR is a contract breach. ## Spec → Test → Code Protocol @@ -266,18 +268,108 @@ ai-viewer.git/ └── workflows/ CI: every gate above on every push ``` -## Second Opinions (External LLMs) — Mandatory +## Production-Grade Loop — Implementer + 5-Reviewer Cycle + +This is the **single source of truth for how code is produced**. It is the assistant's standing operating model. It must survive restarts, compactions, and the next instance of the assistant. Treat the rules below as non-negotiable, like the Hard Rules above. The runtime enforcement lives in `.agents/skills/project-second-opinions/SKILL.md`, `.agents/skills/project-delegation/SKILL.md`, and `.agents/skills/project-workflow/SKILL.md`; this section is the contract; the skills are the implementation. + +### The model split + +| Role | Model | Job | +|---|---|---| +| **CTO (master assistant)** | me | orchestrate, decide, verify reviewer claims, integrate, merge, report. Does not write production code. | +| **Implementer** | `minimax` (current stable on litellm; default `nova/minimax-m2.5`) | write code + tests + specs as delegated. The single producer of code. | +| **Reviewers** | `glm`, `mimo`, `minimax`, `qwen`, `deepseek` (5 in parallel) | independent second opinions, each voting `PRODUCTION GRADE` or `NEEDS WORK` with findings. | +| **CTO (verification)** | me | verify every reviewer claim, drive iteration, decide when to stop. | + +**Implementer ≠ Reviewer.** The `minimax` instance that implements a SOW is **not** the same instance that reviews it. The reviewer `minimax` is a fresh-context review pass that has not seen the implementation. This kills the self-review blind spot while still collecting minimax's expertise in the review set. The recursion-safety rule in `project-second-opinions` SKILL still applies: any assistant instance that detects "I am being run for review" must not invoke external reviewers. + +**I do not trust the implementer.** External reviewers are how the project gets the collective output of all five models. The CTO's job is to fuse those reviews with the implementer's work into a verified, tested, gate-green change. The implementer is fast; the reviewers are correct; the CTO is accountable. + +### The loop + +``` +1. CTO decides SOW scope, writes/updates specs, drafts failing tests +2. CTO delegates implementation to minimax (per project-delegation skill) +3. minimax writes code + makes tests pass + runs local gates +4. CTO verifies minimax's work: diff read, tests re-run, gates re-run, no collateral damage +5. CTO selects the 5 reviewers (glm, mimo, minimax, qwen, deepseek) and runs them in parallel on the final diff +6. Each reviewer votes PRODUCTION GRADE or NEEDS WORK with findings +7. CTO verifies every finding (read the code, run the repro, check the spec). Reject false positives. +8. P0/P1 findings -> fix and re-trigger full 5-reviewer cycle + P2/P3 findings -> fix in the same PR, document in SOW `## Reviews`, merge when gate green +9. CI green on all required status checks + 5 reviewers converge -> CTO merges, checks out master, pulls, continues +``` + +### When the loop runs + +The loop runs **once per PR**, not on every little change. Specific triggers (any of these is enough): + +- **End of SOW** — before opening the PR. Always. +- **Before risky changes** — schema changes, cross-cutting refactors, security-sensitive work, new adapter implementation. Earlier is better than later. +- **After critical changes** — once a non-trivial chunk is green locally (tests + gates), CTO triggers the cycle. +- **When uncertain** — if the CTO is not sure about an architectural call, trigger early on a design/spec, not on the full diff. + +The CTO judges "good chunk" per the spirit of this rule. The bar is: **a PR is the unit**. One SOW = one PR = one loop = one merge. + +### PRODUCTION GRADE vote + +Each reviewer responds with one of two outcomes: + +- `PRODUCTION GRADE` — ship it, no actionable findings. +- `NEEDS WORK` — one or more findings, each with file:line, severity, and concrete fix. + +Findings carry severity: + +- **P0** — correctness bug, data loss, security hole, race. Blocks merge. +- **P1** — design defect, missing error path, test gap on a contract. Blocks merge. +- **P2** — quality, readability, simplification, non-blocking test gap. Fix in this PR. +- **P3** — nit, taste, alternative. Document in SOW `## Reviews`, merge with note. + +### Stop conditions + +- **5/5 PRODUCTION GRADE, gates green, CI green** → CTO merges. +- **Any P0/P1 NEEDS WORK** → fix, push, re-trigger full 5-reviewer cycle. Iterate. +- **P2/P3 NEEDS WORK** → fix in this PR; merge when 5/5 PG or only P3 noise remains. +- **Hard stall: 5+ cycles with new P0/P1 each round** → CTO writes a `## Regression` section in the SOW, opens a follow-up SOW in `.agents/sow/pending/`, and surfaces to the operator with a recommendation. Do not loop forever. + +### Claim verification (CRITICAL) + +Reviewer findings are **claims, not findings**. The CTO verifies every claim before acting on it. Verification steps: + +1. **Read the file:line the reviewer cited.** Does the code actually do what the reviewer said? +2. **Run the repro.** If the reviewer says "this race fires under X", construct X and run it. +3. **Cross-check with the spec.** If the reviewer says "violates SPEC Y §3.2", open the spec and confirm. +4. **Decide**: real bug (fix), false positive (reject with evidence), disputed (escalate). + +Acting on unverified claims causes two failure modes: (a) implementing phantom bugs that don't exist, (b) ignoring real bugs because the reviewer "sounded uncertain". Verify, then act. The CTO is the only one who decides. + +### Backup implementer + +If `minimax` is down/degraded for an extended period, the CTO rotates the implementer role to the next-most-capable member of the reviewer set. Default order: `qwen` → `mimo` → `deepseek` → `glm`. The backup operates under the same protocol — the 5-reviewer cycle still runs, but the implementer slot is filled by the backup. The CTO logs the rotation in the SOW under `## Implementer Rotation`. + +### Implementer model spec + +`minimax` is the implementation model. The CTO pins to the **current stable minimax variant on litellm** at the time of work (per the project's "always pin to latest stable" policy). The default and current variant is `nova/minimax-m2.5`. Major-version upgrades require a brief SOW; minor/patch upgrades are autonomous and committed together with passing gates. If the implementer is changed (e.g. backup rotation), the CTO updates the `### The model split` table above in the same commit. + +### Automated reviewers (cubic, codacy, dependabot, Snyk, etc.) + +These run automatically on every PR. They are **supplementary signals, not part of the 5-reviewer vote**: + +- The implementer (`minimax`) addresses their findings as part of the work, before opening the PR. +- If an automated finding touches an area the 5 reviewers flagged, the CTO re-triggers the 5-reviewer cycle on the new diff. +- Cubic, Codacy, and Dependabot backlogs (e.g. SOW-0046, SOW-0047) are tracked separately in the SOW system. -External second-opinion review is **mandatory** before marking any non-trivial SOW completed. "Encouraged" was the old wording; it has been removed. +### What the operator sees -Minimum standard: +The operator sees **business outcomes, not technical detail**. In every report: -- At least three reviewers in parallel for any code-producing SOW. -- Reviewers chosen from the set in `.agents/skills/project-second-opinions/SKILL.md`. -- Findings addressed in code; reviewers re-run with the same scope plus a note of fixes; iterate until no actionable findings remain. -- Review history recorded in the SOW under `## Reviews`. +- SOW id and one-line description. +- PR link + state (open / merged / blocked). +- Reviewer verdicts (PRODUCTION GRADE counts: `4/5`, `5/5`). +- Gate status (green / red). +- Blocker (if any), with the question or decision needed. -**Safety rule (critical)**: if the assistant has itself been spawned as a reviewer (e.g. by a parent assistant), it MUST NOT invoke external reviewers — that causes infinite recursion. Detection signals are documented in the skill. +The operator does **not** see code, design rationale, file paths, test names, or gate command output. The CTO decides and reports; the operator approves SOWs and accepts or rejects the outcome. **SOW sign-off is the only operator gate.** ## Sensitive Data In Durable Artifacts @@ -335,9 +427,9 @@ The merge workflow: 1. Assistant creates a feature branch and pushes work. 2. Assistant opens a PR (`gh pr create`) — for clean history + SOC2 audit trail. -3. Assistant runs external LLM reviewers per `project-second-opinions` skill on non-trivial code-producing PRs. -4. Assistant addresses findings; iterates until reviewers converge. -5. Assistant merges itself: `gh pr merge --merge --delete-branch`. +3. Assistant runs the **5-reviewer Production-Grade Loop** (glm, mimo, minimax, qwen, deepseek) per the section above on every non-trivial code-producing PR. +4. Assistant verifies every reviewer claim, addresses findings, and iterates per the P0/P1/P2/P3 stop conditions. +5. Assistant merges itself: `gh pr merge --merge --delete-branch` — only when 5/5 PRODUCTION GRADE, gates green, and CI green. 6. Assistant continues work — no operator step. Asking the operator to approve a PR is forbidden. The operator's approval gate is the SOW, not the PR. From 73c33bc9f31f62f0d462d3cc5b11768ee18d4ec0 Mon Sep 17 00:00:00 2001 From: user <2662304+ktsaou@users.noreply.github.com> Date: Wed, 10 Jun 2026 10:53:17 +0300 Subject: [PATCH 3/5] Address 5-reviewer cycle round-1 findings on protocol PR MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P0 (durable specs were stale): - specs/workflow.md: Roles section + External Review section rewritten to match the 5-reviewer Production-Grade Loop. - specs/second-opinions.md: TL;DR, "show the user the prompts" rule, reviewer table, and ad-hoc subset all updated to the new protocol. P1 (must fix): - Implementer model pinned to llm-netdata-cloud/minimax-m3-coder everywhere (AGENTS.md, project-delegation, project-workflow, specs/second-opinions, specs/workflow). The reviewer-minimax uses the same model. Reviewer minimax invocation path also pinned. - Hard Rule #4 now says "5/5 PRODUCTION GRADE" instead of the weaker "at least one round". - Hard Rule #3 now references minimax and the Production-Grade Loop (was the generic "spawned subagents"). - "What the operator sees" / Reporting to the Operator: removed "file paths with line numbers as evidence" from the operator report in project-workflow SKILL — that lives in SOW ## Reviews and the PR description only. P2 (fixed in this PR): - AGENTS.md stop conditions: P2 wording made explicit ("fix in the same PR, re-trigger the cycle; merge only when 5/5 PG or only P3 noise remains"); hard-stall language clarified to business-level escalation only. - "When the loop runs" gained a "Contract / process changes" bullet: AGENTS.md, project skills, specs, CI config all go through the 5-reviewer cycle. - project-workflow SKILL: model name in Roles table updated to match; cross-reference now specifies the "Production-Grade Loop" section. - project-delegation SKILL: implementer vs reviewer wording differentiated. - project-second-opinions SKILL: ad-hoc reviewer table no longer lists the 5 production reviewers (they're reserved for the production cycle); the contradictory "are-not-used-for-ad-hoc" note is gone. - project-second-opinions SKILL: deepseek invocation path fixed to llm-netdata-cloud/deepseek-v4-pro. - project-second-opinions SKILL: SOW review template gained the "Vote PRODUCTION GRADE / NEEDS WORK" instruction so reviewer responses are machine-parseable. - Required First Checks: project-second-opinions added to the minimum load list (it's the runtime enforcement of the loop). P3 (documented, no fix this round): - Spec → Test → Code Protocol updated to reference the 5-reviewer cycle. - "Delegation Protocol" updated to mention the 5-reviewer cycle is delegated too — but only the CTO runs it. - Loop step 8 uses → consistently. --- .agents/skills/project-delegation/SKILL.md | 4 +- .../skills/project-second-opinions/SKILL.md | 19 ++++---- .agents/skills/project-workflow/SKILL.md | 19 ++++---- .agents/sow/specs/second-opinions.md | 37 ++++++++++------ .agents/sow/specs/workflow.md | 24 ++++++++--- AGENTS.md | 43 ++++++++++--------- 6 files changed, 86 insertions(+), 60 deletions(-) diff --git a/.agents/skills/project-delegation/SKILL.md b/.agents/skills/project-delegation/SKILL.md index e12c6ea..446f8d0 100644 --- a/.agents/skills/project-delegation/SKILL.md +++ b/.agents/skills/project-delegation/SKILL.md @@ -11,9 +11,9 @@ The master assistant (CTO) is the **orchestrator**, **QA lead**, **integrator**, ### The implementer is `minimax` -Per the Production-Grade Loop (see `AGENTS.md`), the CTO delegates all code production to **`minimax`** — the current stable minimax variant on litellm (default `nova/minimax-m2.5`). The CTO is the only role that knows the project context; the implementer is a fresh-context subagent that receives a self-contained prompt (spec excerpt, failing tests, constraints, deliverable). +Per the Production-Grade Loop (see `AGENTS.md`), the CTO delegates all code production to **`minimax`** — the current stable minimax variant on litellm (default `llm-netdata-cloud/minimax-m3-coder`). The CTO is the only role that knows the project context; the implementer is a fresh-context subagent that receives a self-contained prompt (spec excerpt, failing tests, constraints, deliverable). -- The implementer is **not** the same instance as the `minimax` review pass. Two different invocations, two different contexts, two different jobs. +- The implementer (`minimax`) is **not** the same instance as the reviewer-minimax pass. Two different invocations, two different contexts, two different jobs. The implementer writes code; the reviewer reads code. - If `minimax` is down/degraded, the CTO rotates the implementer role to the next-most-capable member of the reviewer set (default order: `qwen` → `mimo` → `deepseek` → `glm`) and logs the rotation in the SOW under `## Implementer Rotation`. - The CTO pins to the current stable model at time of work, per the project's "always pin to latest stable" policy. Major-version upgrades require a brief SOW; minor/patch upgrades are autonomous. diff --git a/.agents/skills/project-second-opinions/SKILL.md b/.agents/skills/project-second-opinions/SKILL.md index 20aedc1..a1140bf 100644 --- a/.agents/skills/project-second-opinions/SKILL.md +++ b/.agents/skills/project-second-opinions/SKILL.md @@ -31,9 +31,9 @@ The CTO runs **exactly these five reviewers in parallel** on every non-trivial c |---|---|---| | 1 | `glm` | `timeout 1800 opencode run -m "llm-netdata-cloud/glm-5.1" --agent code-reviewer "PROMPT"` | | 2 | `mimo` | `timeout 1800 opencode run -m "llm-netdata-cloud/mimo-v2.5-pro" --agent code-reviewer "PROMPT"` | -| 3 | `minimax` (fresh-context review pass; **never** the implementer instance) | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m2.7-coder" --agent code-reviewer "PROMPT"` | +| 3 | `minimax` (fresh-context review pass; **never** the implementer instance) | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m3-coder" --agent code-reviewer "PROMPT"` | | 4 | `qwen` | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.6-plus" --agent code-reviewer "PROMPT"` | -| 5 | `deepseek` | `timeout 1800 opencode run -m "deepseek/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | +| 5 | `deepseek` | `timeout 1800 opencode run -m "llm-netdata-cloud/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | All five run in parallel (one Bash invocation each, batched in a single assistant turn). Foreground, with `timeout 1800`. The CTO is the only role that runs them. @@ -97,12 +97,9 @@ The Production-Grade Loop supersedes the previous default set. The 5 reviewers a | codex (ad-hoc only) | `timeout 1800 codex exec "PROMPT" --skip-git-repo-check` | | gemini (ad-hoc only) | `timeout 1800 gemini -p "PROMPT"` | | claude (Anthropic) (ad-hoc only) | `CLAUDECODE="" timeout 1800 claude -p "PROMPT"` | -| glm | `timeout 1800 opencode run -m "llm-netdata-cloud/glm-5.1" --agent code-reviewer "PROMPT"` | -| kimi | `timeout 1800 opencode run -m "llm-netdata-cloud/kimi-k2.6" --agent code-reviewer "PROMPT"` | -| mimo | `timeout 1800 opencode run -m "llm-netdata-cloud/mimo-v2.5-pro" --agent code-reviewer "PROMPT"` | -| qwen | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.6-plus" --agent code-reviewer "PROMPT"` | -| minimax (review; never the implementer instance) | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m2.7-coder" --agent code-reviewer "PROMPT"` | -| deepseek | `timeout 1800 opencode run -m "deepseek/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | +| kimi (ad-hoc only) | `timeout 1800 opencode run -m "llm-netdata-cloud/kimi-k2.6" --agent code-reviewer "PROMPT"` | + +The five production reviewers (`glm`, `mimo`, `minimax`-fresh, `qwen`, `deepseek`) are reserved for the production PR cycle; their production-loop run is mandatory regardless of any ad-hoc use. For ad-hoc SOW/spec pre-review the CTO may still invoke any of the five if judged beneficial, but the production-loop run is independent. `cd` into the project root before running. Use relative paths (some reviewers stumble on arbitrary absolute paths). @@ -111,8 +108,6 @@ The Production-Grade Loop supersedes the previous default set. The 5 reviewers a - **SOW / design review**: codex + gemini + mimo in parallel. - **Security-focused review**: minimax + deepseek. -The Production-Grade Loop's five reviewers (glm, mimo, minimax, qwen, deepseek) are reserved for the production PR cycle and are not used for ad-hoc SOW/spec pre-review. - ## Prompt Templates ### Code review (Production-Grade Loop) @@ -151,6 +146,10 @@ THIS IS A READ-ONLY REQUEST. PROVIDE YOUR REVIEW. ``` YOU ARE RUNNING BY ANOTHER ASSISTANT, FOR A SECOND OPINION: +Vote ONE of: +- PRODUCTION GRADE — ship it, no actionable findings. +- NEEDS WORK — list findings below, each with file:line, severity (P0/P1/P2/P3), and a concrete fix proposal. + Please review SOW file: .agents/sow//.md Check: diff --git a/.agents/skills/project-workflow/SKILL.md b/.agents/skills/project-workflow/SKILL.md index 7b7fb2f..7666585 100644 --- a/.agents/skills/project-workflow/SKILL.md +++ b/.agents/skills/project-workflow/SKILL.md @@ -15,7 +15,7 @@ This skill is the assistant's runtime checklist. The contract lives in `AGENTS.m |---|---|---| | **CTO (master assistant)** | me | orchestrate, decide, verify reviewer claims, integrate, merge, report. Does not write production code. | | **Implementer** | `minimax` (fresh subagent) | code + tests + specs as delegated. The single producer of code. | -| **Reviewers** | `glm`, `mimo`, `minimax`, `qwen`, `deepseek` (5 in parallel, fresh-context) | vote `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. | +| **Reviewers** | `glm`, `mimo`, `minimax` (fresh-context, never the implementer instance), `qwen`, `deepseek` (5 in parallel) | vote `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. | **Implementer ≠ Reviewer.** The `minimax` instance that implements a SOW is **not** the same instance that reviews it. The CTO is the only role that runs reviewers. The CTO is the only role that verifies reviewer claims. See `AGENTS.md` for the full contract and `.agents/skills/project-second-opinions/SKILL.md` for reviewer invocation. @@ -97,7 +97,7 @@ After the implementer returns, the master assistant (CTO): - Runs the gates to confirm nothing else broke. - Confirms automated-reviewer findings (cubic, codacy) are addressed in the diff. - Decides whether to accept, ask for changes, or restart with a sharper prompt. -- Logs the implementer model in the SOW (default `nova/minimax-m2.5`; backup rotation per `AGENTS.md`). +- Logs the implementer model in the SOW (default `llm-netdata-cloud/minimax-m3-coder`; backup rotation per `AGENTS.md`). ## Step 5 — Quality Gates @@ -166,13 +166,14 @@ After merge: ## Reporting to the Operator -Compact, honest, evidence-based. +Compact, honest, business-outcome-only. The operator does not see file paths, design rationale, test names, or gate command output — those live in the SOW `## Reviews` and the PR description. See `AGENTS.md` "What the operator sees" for the full contract. -- TL;DR (2-3 sentences). -- Bullet points of what changed. -- File paths with line numbers as evidence. -- Gates: which ran, what they reported. -- Reviewers: which of the 5 ran, the PRODUCTION GRADE count, the claim-verification verdicts, what was done about it. +- TL;DR (2-3 sentences, business outcomes only). +- SOW id and one-line description. +- PR link + state (open / merged / blocked). +- Reviewer verdicts (PRODUCTION GRADE count: `4/5`, `5/5`). +- Gate status (green / red). +- Blocker (if any), with the question or decision needed. - Next: what's queued, what's blocked, what needs operator input. Never report code as "working" or "ready" if any step above is incomplete. The honest phrasings are: @@ -211,7 +212,7 @@ If the assistant catches itself doing any of these, stop, restart from Step 1, a ## Cross-References -- Contract: `AGENTS.md` +- Contract: `AGENTS.md` "Production-Grade Loop" section (the single source of truth). - Durable spec: `.agents/sow/specs/workflow.md` - Gates: `.agents/skills/project-quality-gates/SKILL.md` - Delegation: `.agents/skills/project-delegation/SKILL.md` diff --git a/.agents/sow/specs/second-opinions.md b/.agents/sow/specs/second-opinions.md index b8efe79..7e8a4aa 100644 --- a/.agents/sow/specs/second-opinions.md +++ b/.agents/sow/specs/second-opinions.md @@ -2,7 +2,11 @@ ## TL;DR -The assistant may — and should, for non-trivial work — consult external LLMs (codex, gemini, glm, kimi, mimo, minimax, qwen) for second opinions, code reviews, SOW reviews, and design validation. The full invocation patterns and prompts live in `.agents/skills/project-second-opinions/SKILL.md`. Always run multiple reviewers in parallel, always show the user the prompts before running. +The CTO runs the **5-reviewer Production-Grade Loop** (see `AGENTS.md`) on every non-trivial PR. The five reviewers are `glm`, `mimo`, `minimax` (fresh-context, never the implementer instance), `qwen`, `deepseek`. Each votes `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. The CTO verifies every claim before acting. + +Ad-hoc reviews (SOW/spec/design pre-review, off the production loop) may pick a smaller subset from the legacy list (`codex`, `gemini`, `claude`, `glm`, `kimi`, `mimo`, `qwen`, `minimax`, `deepseek`) at the CTO's discretion. + +The full invocation patterns and prompts live in `.agents/skills/project-second-opinions/SKILL.md`. The CTO composes and runs review prompts without operator preview — the operator sees business outcomes only. ## When to Run External Reviewers @@ -39,23 +43,32 @@ Critical neutrality rules: - Provide full context without embedded assumptions. - Do not include the assistant's own conclusion (let the reviewer reach its own). - Always ask reviewers to identify **unwanted side effects** and **security issues** explicitly. -- Show prompts to the user before running. +- The CTO does not show review prompts to the operator before running — review is a technical gate, not an operator gate. (This replaces the older "show the user the prompts" rule; the operator sees business outcomes only per `AGENTS.md`.) ## How to Run -All commands are documented in `project-second-opinions/SKILL.md` with the exact invocation flags. The short version (from the user's global instructions): +All commands are documented in `.agents/skills/project-second-opinions/SKILL.md` with the exact invocation flags. The short version: + +### Production-Grade Loop (mandatory, 5 reviewers in parallel) + +| Reviewer | Command | +|---|---| +| `glm` | `timeout 1800 opencode run -m "llm-netdata-cloud/glm-5.1" --agent code-reviewer "PROMPT"` | +| `mimo` | `timeout 1800 opencode run -m "llm-netdata-cloud/mimo-v2.5-pro" --agent code-reviewer "PROMPT"` | +| `minimax` (fresh-context review pass; **never** the implementer instance) | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m3-coder" --agent code-reviewer "PROMPT"` | +| `qwen` | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.6-plus" --agent code-reviewer "PROMPT"` | +| `deepseek` | `timeout 1800 opencode run -m "llm-netdata-cloud/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | + +### Ad-hoc (off the production loop; CTO's discretion) | Reviewer | Command | |---|---| -| codex | `timeout 1800 codex exec "PROMPT" --skip-git-repo-check` | -| gemini | `timeout 1800 gemini -p "PROMPT"` | -| claude (self) | `CLAUDECODE="" timeout 1800 claude -p "PROMPT"` | -| glm | `timeout 1800 opencode run -m "llm-netdata-cloud/glm-5.1" --agent code-reviewer "PROMPT"` | -| kimi | `timeout 1800 opencode run -m "llm-netdata-cloud/kimi-k2.6" --agent code-reviewer "PROMPT"` | -| mimo | `timeout 1800 opencode run -m "llm-netdata-cloud/mimo-v2.5-pro" --agent code-reviewer "PROMPT"` | -| qwen | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.6-plus" --agent code-reviewer "PROMPT"` | -| minimax | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m2.7-coder" --agent code-reviewer "PROMPT"` | -| deepseek | `timeout 1800 opencode run -m "deepseek/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | +| `codex` (ad-hoc only) | `timeout 1800 codex exec "PROMPT" --skip-git-repo-check` | +| `gemini` (ad-hoc only) | `timeout 1800 gemini -p "PROMPT"` | +| `claude` (ad-hoc only) | `CLAUDECODE="" timeout 1800 claude -p "PROMPT"` | +| `kimi` (ad-hoc only) | `timeout 1800 opencode run -m "llm-netdata-cloud/kimi-k2.6" --agent code-reviewer "PROMPT"` | + +`codex`, `gemini`, `claude`, and `kimi` are **deprecated for production-grade review** on this project; they remain available for one-off SOW/spec pre-review where the CTO may pick a smaller subset. Always: diff --git a/.agents/sow/specs/workflow.md b/.agents/sow/specs/workflow.md index aa8bc49..bb577f6 100644 --- a/.agents/sow/specs/workflow.md +++ b/.agents/sow/specs/workflow.md @@ -8,10 +8,10 @@ The contract counterpart is `AGENTS.md` (top-level invariants). The runtime chec ## Roles -- **Operator** — owns product direction, UX feedback, sign-off on SOWs, risk acceptance, and destructive-operation approval. -- **Master assistant** — owns technical decisions, orchestration, integration, QA, review, and long-term-memory hygiene. Does not write production code directly. -- **Subagents** — write production code under spec + failing-test constraints supplied by the master assistant. -- **External reviewers** — independent LLMs (codex, gemini, glm, qwen, etc.) called for second opinions; they do not write code in this repo. +- **Operator** — owns product direction, UX feedback, sign-off on SOWs, risk acceptance, and destructive-operation approval. Does not see technical detail (see "What the operator sees" in `AGENTS.md`). +- **CTO (master assistant)** — owns technical decisions, orchestration, integration, QA, claim verification, and long-term-memory hygiene. Does not write production code directly. Runs the 5-reviewer Production-Grade Loop on every non-trivial PR. +- **Implementer** — a fresh-context `minimax` subagent (default `llm-netdata-cloud/minimax-m3-coder`) that writes production code under spec + failing-test constraints supplied by the CTO. The single producer of code. +- **Reviewers** — exactly five independent LLMs, run in parallel: `glm`, `mimo`, `minimax` (fresh-context, never the implementer instance), `qwen`, `deepseek`. Each votes `PRODUCTION GRADE` or `NEEDS WORK` (with P0–P3 findings). The CTO verifies every claim. See `AGENTS.md` "Production-Grade Loop" and `.agents/sow/specs/second-opinions.md`. ## The Invariant Cycle @@ -75,11 +75,21 @@ All gates listed in `quality-gates.md` run locally before reporting any work don Weakening a gate to make it pass is a contract breach. Fix the root cause. -### External Review +### External Review (the 5-Reviewer Production-Grade Loop) -For non-trivial SOWs: minimum three external reviewers in parallel. Findings addressed; reviewers re-run with the same scope plus a fix note; iterate until convergence. History recorded in the SOW under `## Reviews`. +For non-trivial SOWs: the CTO runs the 5-reviewer Production-Grade Loop per `AGENTS.md` and `.agents/sow/specs/second-opinions.md`. The five reviewers (`glm`, `mimo`, `minimax`-fresh, `qwen`, `deepseek`) run in parallel. Each votes `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. The CTO verifies every claim. -The assistant does not claim work "done" before review convergence. The honest mid-flight phrasing is "code written, gates green, review pending". +Stop conditions: + +- 5/5 PRODUCTION GRADE → merge. +- Any P0/P1 → fix, push, re-trigger full cycle. +- P2 → fix in the same PR; merge when 5/5 PG or only P3 noise remains. +- P3 → document in SOW `## Reviews`, merge with note. +- Hard stall (5+ cycles with new P0/P1 each round) → write a `## Regression` section, open a follow-up SOW, surface to the operator. + +Findings addressed in code; reviewers re-run with the same scope plus a fix note; iterate until convergence. History recorded in the SOW under `## Reviews`. + +The CTO does not claim work "done" before review convergence. The honest mid-flight phrasing is "code written, gates green, review pending (X/5 PRODUCTION GRADE)". ### Merge Protection diff --git a/AGENTS.md b/AGENTS.md index 0fc78fb..05e9843 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -19,9 +19,9 @@ These are the assistant's standing orders. Violating any one is a contract breac 2. **Specs first, tests second, code last.** The order is invariant. The assistant updates the relevant spec before writing tests; writes tests before writing implementation; writes implementation only after both. See `.agents/sow/specs/workflow.md`. -3. **The assistant never writes code in master context.** All production code is produced by spawned subagents working from a written spec + failing tests. The master assistant is the orchestrator, QA lead, reviewer, and integrator. Master-context Edit/Write is permitted only for: `AGENTS.md`, `.agents/sow/specs/*`, `.agents/skills/*`, SOW files, `README.md`, `LICENSE`, top-level config the assistant owns end-to-end, and trivial typo/format fixes the assistant has verified by reading. See `.agents/skills/project-delegation/SKILL.md`. +3. **The assistant never writes code in master context.** All production code is produced by the `minimax` implementer (per the Production-Grade Loop, Hard Rule #11) working from a written spec + failing tests. The master assistant is the orchestrator, QA lead, claim verifier, and integrator. Master-context Edit/Write is permitted only for: `AGENTS.md`, `.agents/sow/specs/*`, `.agents/skills/*`, SOW files, `README.md`, `LICENSE`, top-level config the assistant owns end-to-end, and trivial typo/format fixes the assistant has verified by reading. See `.agents/skills/project-delegation/SKILL.md`. -4. **The assistant does not trust itself.** Any code the assistant or its subagents have just produced is buggy by default. Before claiming any work "done", "working", or "ready for the operator": (a) automated tests covering the change must exist and pass, (b) all configured quality gates must pass, (c) at least one round of external second-opinion review must have run with findings addressed. Without all three, the assistant must report the work as "code written, not yet verified" — never as working. See `.agents/skills/project-quality-gates/SKILL.md` and `.agents/skills/project-second-opinions/SKILL.md`. +4. **The assistant does not trust itself.** Any code the assistant or its subagents have just produced is buggy by default. Before claiming any work "done", "working", or "ready for the operator": (a) automated tests covering the change must exist and pass, (b) all configured quality gates must pass, (c) the 5-reviewer Production-Grade Loop must have converged with 5/5 PRODUCTION GRADE (or only P3 noise with documented disposition) and the CTO must have verified every claim. Without all three, the assistant must report the work as "code written, not yet verified" — never as working. See `.agents/skills/project-quality-gates/SKILL.md` and `.agents/skills/project-second-opinions/SKILL.md`. 5. **Untested ≡ broken.** The operator will not manually test code for the assistant. Manual UI walkthroughs by the assistant are diagnostics, not proof. Every behavior the project ships has at least one automated test exercising it. Coverage thresholds are enforced in CI. @@ -35,7 +35,7 @@ These are the assistant's standing orders. Violating any one is a contract breac 10. **Discipline is recorded.** After every meaningful task, the assistant runs the Discipline Checklist below and updates `AGENTS.md`, the relevant spec, and any relevant skill so the lesson is captured. Repeating a mistake the operator has already corrected is the most serious breach. -11. **The Production-Grade Loop is the operating model.** All production code is produced by `minimax` (the implementer) and reviewed in parallel by `glm`, `mimo`, `minimax`, `qwen`, `deepseek` (five reviewers, fresh-context — never the implementer reviewing its own work). The CTO verifies every reviewer claim, drives iteration, and merges only on `PRODUCTION GRADE` from all five. **This rule is a hard rule, not a guideline. It must survive restarts and compactions** — the full protocol lives in the "Production-Grade Loop" section below and in the `project-second-opinions`, `project-delegation`, and `project-workflow` skills. The CTO does not write production code; the implementer does not run external reviewers; the operator does not see technical detail. +11. **The Production-Grade Loop is the operating model.** All production code is produced by `minimax` (the implementer) and reviewed in parallel by `glm`, `mimo`, `minimax` (fresh-context review pass — never the implementer instance), `qwen`, `deepseek` (five reviewers). The CTO verifies every reviewer claim, drives iteration, and merges only on `PRODUCTION GRADE` from all five (or only P3 noise with documented disposition). **This rule is a hard rule, not a guideline. It must survive restarts and compactions** — the full protocol lives in the "Production-Grade Loop" section below and in the `project-second-opinions`, `project-delegation`, and `project-workflow` skills. The CTO does not write production code; the implementer does not run external reviewers; the operator does not see technical detail. ## Ownership Model @@ -80,7 +80,7 @@ Mandatory ordering for any change with runtime behavior: 2. **Write tests against the new spec.** Tests fail because the implementation does not yet exist or does not yet match the spec. Failing tests are the executable contract. 3. **Write the implementation.** Implementation makes the tests pass without weakening them. Subagent-produced (see Delegation Protocol). 4. **Run all automated gates.** See `.agents/skills/project-quality-gates/SKILL.md`. Any failure blocks completion. -5. **Run second-opinion review.** See `.agents/skills/project-second-opinions/SKILL.md`. Address findings; re-run reviewers until converged. +5. **Run the 5-reviewer Production-Grade Loop.** See `AGENTS.md` "Production-Grade Loop" and `.agents/skills/project-second-opinions/SKILL.md`. The CTO runs `glm`, `mimo`, `minimax` (fresh-context), `qwen`, `deepseek` in parallel. Each votes `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. The CTO verifies every claim. Re-trigger on P0/P1; fix P2 in the same PR; document P3. 6. **Commit spec + tests + code + doc updates together.** Drift between artifacts is impossible if they ship in one commit. Skipping a step is forbidden. If a step is genuinely not applicable (e.g. doc-only change), the SOW must justify the skip in writing. @@ -89,13 +89,14 @@ Detailed workflow lives at `.agents/sow/specs/workflow.md`. The runtime checklis ## Delegation Protocol -The assistant orchestrates; subagents produce. Rules: +The assistant orchestrates; the `minimax` implementer produces; the 5 reviewers verify. Rules: -- **Production code is always written by subagents.** Master-context Edit/Write on production source files is forbidden. Permitted master-context edits: contract docs (`AGENTS.md`), specs, skills, SOWs, README, LICENSE, trivial verified typo fixes. +- **Production code is always written by the `minimax` implementer** (default `llm-netdata-cloud/minimax-m3-coder`). Master-context Edit/Write on production source files is forbidden. Permitted master-context edits: contract docs (`AGENTS.md`), specs, skills, SOWs, README, LICENSE, trivial verified typo fixes. - **Heavy investigation is always delegated.** Multi-file reads, exploratory searches, and cross-cutting audits go to `Explore` or `general-purpose` subagents. -- **Parallelize aggressively.** When subtasks are independent (e.g. running 3 reviewers, or scaffolding 2 unrelated packages), launch them in a single message with parallel Agent invocations. -- **Subagent prompts are self-contained.** They include file paths, the spec excerpts they must honor, the tests they must make pass, and the quality gates they must satisfy. They do not assume conversation context. -- **Verify subagent output before trusting it.** Read the actual changes; do not rely on the subagent's summary. Run the quality gates yourself before reporting progress to the operator. +- **The 5-reviewer cycle is also delegated — but only the CTO runs it.** The implementer never runs reviewers; the master runs the 5-reviewer Production-Grade Loop on the final integrated state. +- **Parallelize aggressively.** When subtasks are independent (e.g. running 5 reviewers, or scaffolding 2 unrelated packages), launch them in a single message with parallel Agent invocations. +- **Subagent prompts are self-contained.** They include file paths, the spec excerpts they must honor, the tests they must make pass, and the quality gates they must satisfy. They do not assume conversation context. Implementation prompts include the `[FORBIDDEN]` block stating the implementer MUST NOT run external reviewers. +- **Verify subagent output before trusting it.** Read the actual changes; do not rely on the subagent's summary. Run the quality gates yourself before reporting progress to the operator. Verify every reviewer claim before acting on it. Detailed patterns and prompt templates: `.agents/skills/project-delegation/SKILL.md`. @@ -133,7 +134,7 @@ The assistant performs these every time, regardless of how confident it feels. T 1. Read `.agents/sow/pending/` and `.agents/sow/current/` for overlap, contradictions, existing decisions. 2. Read `.agents/sow/specs/index.md` and the specs it points to that touch the affected areas. -3. Read every `project-*` skill under `.agents/skills/` whose trigger matches the work. At minimum: `project-workflow`, `project-coding`, `project-quality-gates`, `project-delegation`. +3. Read every `project-*` skill under `.agents/skills/` whose trigger matches the work. At minimum: `project-workflow`, `project-coding`, `project-quality-gates`, `project-delegation`, `project-second-opinions` (the runtime enforcement of the Production-Grade Loop). 4. Read source code, tests, fixtures as ground truth. 5. Ask the operator only for irreducible product/design/risk decisions. Never technical ones. @@ -145,7 +146,7 @@ The assistant runs this checklist before reporting a task complete to the operat - [ ] Tests exist covering the new/changed behavior; tests pass; race detector clean. - [ ] Coverage thresholds met for affected packages. - [ ] All quality gates green locally. -- [ ] Second-opinion review run for non-trivial work; findings addressed; reviewers converged. +- [ ] 5-reviewer Production-Grade Loop run for non-trivial work; CTO verified every claim; 5/5 PRODUCTION GRADE (or only P3 noise with documented disposition). - [ ] No new TODO/FIXME left without a SOW in `.agents/sow/pending/`. - [ ] `AGENTS.md`, relevant skills, and relevant specs updated if a new pattern, gotcha, or convention emerged. - [ ] No half-built features in the diff. @@ -277,8 +278,8 @@ This is the **single source of truth for how code is produced**. It is the assis | Role | Model | Job | |---|---|---| | **CTO (master assistant)** | me | orchestrate, decide, verify reviewer claims, integrate, merge, report. Does not write production code. | -| **Implementer** | `minimax` (current stable on litellm; default `nova/minimax-m2.5`) | write code + tests + specs as delegated. The single producer of code. | -| **Reviewers** | `glm`, `mimo`, `minimax`, `qwen`, `deepseek` (5 in parallel) | independent second opinions, each voting `PRODUCTION GRADE` or `NEEDS WORK` with findings. | +| **Implementer** | `minimax` (current stable on litellm; default `llm-netdata-cloud/minimax-m3-coder`) | write code + tests + specs as delegated. The single producer of code. | +| **Reviewers** | `glm`, `mimo`, `minimax`, `qwen`, `deepseek` (5 in parallel, fresh-context) | independent second opinions, each voting `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. | | **CTO (verification)** | me | verify every reviewer claim, drive iteration, decide when to stop. | **Implementer ≠ Reviewer.** The `minimax` instance that implements a SOW is **not** the same instance that reviews it. The reviewer `minimax` is a fresh-context review pass that has not seen the implementation. This kills the self-review blind spot while still collecting minimax's expertise in the review set. The recursion-safety rule in `project-second-opinions` SKILL still applies: any assistant instance that detects "I am being run for review" must not invoke external reviewers. @@ -295,9 +296,10 @@ This is the **single source of truth for how code is produced**. It is the assis 5. CTO selects the 5 reviewers (glm, mimo, minimax, qwen, deepseek) and runs them in parallel on the final diff 6. Each reviewer votes PRODUCTION GRADE or NEEDS WORK with findings 7. CTO verifies every finding (read the code, run the repro, check the spec). Reject false positives. -8. P0/P1 findings -> fix and re-trigger full 5-reviewer cycle - P2/P3 findings -> fix in the same PR, document in SOW `## Reviews`, merge when gate green -9. CI green on all required status checks + 5 reviewers converge -> CTO merges, checks out master, pulls, continues +8. P0/P1 findings → fix and re-trigger full 5-reviewer cycle + P2 findings → fix in the same PR, re-trigger the cycle; merge only when 5/5 PG or only P3 noise remains + P3 findings → fix in the same PR, document in SOW `## Reviews`, merge when gate green +9. CI green on all required status checks + 5 reviewers converge → CTO merges, checks out master, pulls, continues ``` ### When the loop runs @@ -308,6 +310,7 @@ The loop runs **once per PR**, not on every little change. Specific triggers (an - **Before risky changes** — schema changes, cross-cutting refactors, security-sensitive work, new adapter implementation. Earlier is better than later. - **After critical changes** — once a non-trivial chunk is green locally (tests + gates), CTO triggers the cycle. - **When uncertain** — if the CTO is not sure about an architectural call, trigger early on a design/spec, not on the full diff. +- **Contract / process changes** — any PR that touches the production process (AGENTS.md, project skills, specs, workflows, CI config) goes through the 5-reviewer cycle. Only mechanical typo/format fixes and purely informational README-only docs are exempt. The CTO judges "good chunk" per the spirit of this rule. The bar is: **a PR is the unit**. One SOW = one PR = one loop = one merge. @@ -322,15 +325,15 @@ Findings carry severity: - **P0** — correctness bug, data loss, security hole, race. Blocks merge. - **P1** — design defect, missing error path, test gap on a contract. Blocks merge. -- **P2** — quality, readability, simplification, non-blocking test gap. Fix in this PR. -- **P3** — nit, taste, alternative. Document in SOW `## Reviews`, merge with note. +- **P2** — quality, readability, simplification, non-blocking test gap. Fix in this PR; re-trigger the cycle; merge only when 5/5 PG or only P3 noise remains. CTO may explicitly waive with a documented reason (rare). +- **P3** — nit, taste, alternative. Fix in this PR, document in SOW `## Reviews`, merge with note. ### Stop conditions - **5/5 PRODUCTION GRADE, gates green, CI green** → CTO merges. - **Any P0/P1 NEEDS WORK** → fix, push, re-trigger full 5-reviewer cycle. Iterate. - **P2/P3 NEEDS WORK** → fix in this PR; merge when 5/5 PG or only P3 noise remains. -- **Hard stall: 5+ cycles with new P0/P1 each round** → CTO writes a `## Regression` section in the SOW, opens a follow-up SOW in `.agents/sow/pending/`, and surfaces to the operator with a recommendation. Do not loop forever. +- **Hard stall: 5+ cycles with new P0/P1 each round** → CTO writes a `## Regression` section in the SOW, opens a follow-up SOW in `.agents/sow/pending/`, and surfaces to the operator with a business-level recommendation (e.g. "SOW X is blocked on recurring P0 findings; recommend re-scoping or accepting reduced scope"). Do not loop forever; do not over-share reviewer detail in the operator report. ### Claim verification (CRITICAL) @@ -349,7 +352,7 @@ If `minimax` is down/degraded for an extended period, the CTO rotates the implem ### Implementer model spec -`minimax` is the implementation model. The CTO pins to the **current stable minimax variant on litellm** at the time of work (per the project's "always pin to latest stable" policy). The default and current variant is `nova/minimax-m2.5`. Major-version upgrades require a brief SOW; minor/patch upgrades are autonomous and committed together with passing gates. If the implementer is changed (e.g. backup rotation), the CTO updates the `### The model split` table above in the same commit. +`minimax` is the implementation model. The CTO pins to the **current stable minimax variant on litellm** at the time of work (per the project's "always pin to latest stable" policy). The default and current variant is `llm-netdata-cloud/minimax-m3-coder` (the latest stable coder variant, released 2026-06-01). The implementer and the reviewer-minimax use the same model so the split is unambiguous and so a version bump to one is a version bump to both. Major-version upgrades require a brief SOW; minor/patch upgrades are autonomous and committed together with passing gates. If the implementer is changed (e.g. backup rotation), the CTO updates the `### The model split` table above in the same commit. ### Automated reviewers (cubic, codacy, dependabot, Snyk, etc.) From 3e74b2d83af27f2efa44aede58bca757f06adc66 Mon Sep 17 00:00:00 2001 From: user <2662304+ktsaou@users.noreply.github.com> Date: Wed, 10 Jun 2026 11:05:34 +0300 Subject: [PATCH 4/5] Address 5-reviewer round-2 findings: P1 stop conditions + backup implementer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 fixes (must fix before merge): - Stop conditions split: P2 and P3 are now distinct bullets. P2 explicitly requires re-triggering the full 5-reviewer cycle; P3 does not. The previous grouping "P2/P3 NEEDS WORK → fix in this PR; merge when 5/5 PG or only P3 noise remains" hid the re-trigger requirement for P2 and contradicted the P2 definition. Applied in AGENTS.md and project-second-opinions SKILL. - Backup implementer rotation: clarified that the rotated reviewer is REMOVED from the 5-reviewer cycle for that PR (so the implementer is not also reviewing their own work) and the 5-reviewer set is filled by substituting from the ad-hoc set (codex/gemini/claude/kimi). Applied in AGENTS.md and project-delegation SKILL. P2 fixes (fixed in this PR): - "Reviewer unavailability" paragraph added to AGENTS.md and project-second-opinions SKILL: a single unavailable reviewer is substituted from the ad-hoc set with a SOW ## Reviews log entry; two or more simultaneous unavailability escalates as a hard stall. - Recording Reviews section in spec/second-opinions.md rewritten to match the SKILL workflow: reviewer attribution + vote, claim verification verdict, fix applied, final PG count. No raw prompts or reviewer output in the SOW. - "the user" -> "the operator" in spec/second-opinions.md. - "Recommended" section in spec/second-opinions.md reframed as "signals that the change is non-trivial and therefore mandatory under the 5-reviewer cycle" (was blurring mandatory/optional). - Discipline checklist in spec/workflow.md updated to "5/5 PRODUCTION GRADE, or only P3 noise with documented disposition". - Ad-hoc reviewer set paragraph in project-second-opinions SKILL simplified (removed the self-contradictory "reserved but CTO may still invoke" wording). - qwen model bumped to llm-netdata-cloud/qwen3.7-plus (matches workstation litellm config; the "always pin to latest stable" policy applies). Applied in spec/second-opinions.md and project-second-opinions SKILL. P3 polish (fixed in this PR): - project-workflow SKILL roles table now includes the implementer model version and the reviewer model versions. - AGENTS.md Hard Rule #11 wording cleaned up: "minimax" no longer appears twice in the reviewer list; uses "minimax implementer" / "minimax reviewer" labels. No new regressions. No production code changed. --- .agents/skills/project-delegation/SKILL.md | 2 +- .agents/skills/project-second-opinions/SKILL.md | 9 +++++---- .agents/skills/project-workflow/SKILL.md | 4 ++-- .agents/sow/specs/second-opinions.md | 12 ++++++++---- .agents/sow/specs/workflow.md | 2 +- AGENTS.md | 13 ++++++++++--- 6 files changed, 27 insertions(+), 15 deletions(-) diff --git a/.agents/skills/project-delegation/SKILL.md b/.agents/skills/project-delegation/SKILL.md index 446f8d0..6bf3b3f 100644 --- a/.agents/skills/project-delegation/SKILL.md +++ b/.agents/skills/project-delegation/SKILL.md @@ -14,7 +14,7 @@ The master assistant (CTO) is the **orchestrator**, **QA lead**, **integrator**, Per the Production-Grade Loop (see `AGENTS.md`), the CTO delegates all code production to **`minimax`** — the current stable minimax variant on litellm (default `llm-netdata-cloud/minimax-m3-coder`). The CTO is the only role that knows the project context; the implementer is a fresh-context subagent that receives a self-contained prompt (spec excerpt, failing tests, constraints, deliverable). - The implementer (`minimax`) is **not** the same instance as the reviewer-minimax pass. Two different invocations, two different contexts, two different jobs. The implementer writes code; the reviewer reads code. -- If `minimax` is down/degraded, the CTO rotates the implementer role to the next-most-capable member of the reviewer set (default order: `qwen` → `mimo` → `deepseek` → `glm`) and logs the rotation in the SOW under `## Implementer Rotation`. +- If `minimax` is down/degraded, the CTO rotates the implementer role to the next-most-capable member of the reviewer set (default order: `qwen` → `mimo` → `deepseek` → `glm`). The rotated reviewer is **removed from the 5-reviewer cycle** for that PR (so the implementer is not also reviewing their own work); the 5-reviewer set is filled by substituting a reviewer from the ad-hoc set (`codex`, `gemini`, `claude`, `kimi`) chosen by the CTO. The CTO logs the rotation AND the substitution in the SOW under `## Implementer Rotation`. - The CTO pins to the current stable model at time of work, per the project's "always pin to latest stable" policy. Major-version upgrades require a brief SOW; minor/patch upgrades are autonomous. ### Why this rule exists diff --git a/.agents/skills/project-second-opinions/SKILL.md b/.agents/skills/project-second-opinions/SKILL.md index a1140bf..eab67cd 100644 --- a/.agents/skills/project-second-opinions/SKILL.md +++ b/.agents/skills/project-second-opinions/SKILL.md @@ -32,7 +32,7 @@ The CTO runs **exactly these five reviewers in parallel** on every non-trivial c | 1 | `glm` | `timeout 1800 opencode run -m "llm-netdata-cloud/glm-5.1" --agent code-reviewer "PROMPT"` | | 2 | `mimo` | `timeout 1800 opencode run -m "llm-netdata-cloud/mimo-v2.5-pro" --agent code-reviewer "PROMPT"` | | 3 | `minimax` (fresh-context review pass; **never** the implementer instance) | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m3-coder" --agent code-reviewer "PROMPT"` | -| 4 | `qwen` | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.6-plus" --agent code-reviewer "PROMPT"` | +| 4 | `qwen` | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.7-plus" --agent code-reviewer "PROMPT"` | | 5 | `deepseek` | `timeout 1800 opencode run -m "llm-netdata-cloud/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | All five run in parallel (one Bash invocation each, batched in a single assistant turn). Foreground, with `timeout 1800`. The CTO is the only role that runs them. @@ -52,8 +52,9 @@ The CTO does not merge until 5/5 PRODUCTION GRADE, **or** until only P3 noise re - **5/5 PRODUCTION GRADE, gates green, CI green** → CTO merges. - **Any P0/P1 NEEDS WORK** → fix, push, re-trigger full 5-reviewer cycle. Iterate. -- **P2/P3 NEEDS WORK** → fix in this PR; merge when 5/5 PG or only P3 noise remains. -- **Hard stall: 5+ cycles with new P0/P1 each round** → CTO writes a `## Regression` section in the SOW, opens a follow-up SOW in `.agents/sow/pending/`, and surfaces to the operator with a recommendation. Do not loop forever. +- **P2 NEEDS WORK** → fix in the same PR, re-trigger the full 5-reviewer cycle; merge only when 5/5 PG or only P3 noise remains. +- **P3 NEEDS WORK** → fix in the same PR, document in SOW `## Reviews`, merge with note when gates green and CI green. +- **Hard stall: 5+ cycles with new P0/P1 each round** → CTO writes a `## Regression` section in the SOW, opens a follow-up SOW in `.agents/sow/pending/`, and surfaces to the operator with a business-level recommendation. Do not loop forever. ### Claim verification (CRITICAL — CTO's job) @@ -99,7 +100,7 @@ The Production-Grade Loop supersedes the previous default set. The 5 reviewers a | claude (Anthropic) (ad-hoc only) | `CLAUDECODE="" timeout 1800 claude -p "PROMPT"` | | kimi (ad-hoc only) | `timeout 1800 opencode run -m "llm-netdata-cloud/kimi-k2.6" --agent code-reviewer "PROMPT"` | -The five production reviewers (`glm`, `mimo`, `minimax`-fresh, `qwen`, `deepseek`) are reserved for the production PR cycle; their production-loop run is mandatory regardless of any ad-hoc use. For ad-hoc SOW/spec pre-review the CTO may still invoke any of the five if judged beneficial, but the production-loop run is independent. +The five production reviewers (`glm`, `mimo`, `minimax`-fresh, `qwen`, `deepseek`) run mandatorily on every non-trivial PR. For ad-hoc SOW/spec pre-review, the CTO may invoke any reviewer (including the five production reviewers) at their discretion. Ad-hoc rounds are independent of the production-loop run. If a production reviewer is unavailable for a cycle (litellm error, model deprecated, timeout), the CTO retries once, then substitutes from the ad-hoc set (`codex`, `gemini`, `claude`, `kimi`) and logs the substitution in the SOW `## Reviews` with the reason. Two or more simultaneous unavailability → operator surface as a hard stall. `cd` into the project root before running. Use relative paths (some reviewers stumble on arbitrary absolute paths). diff --git a/.agents/skills/project-workflow/SKILL.md b/.agents/skills/project-workflow/SKILL.md index 7666585..30bbc53 100644 --- a/.agents/skills/project-workflow/SKILL.md +++ b/.agents/skills/project-workflow/SKILL.md @@ -14,8 +14,8 @@ This skill is the assistant's runtime checklist. The contract lives in `AGENTS.m | Role | Model | Job | |---|---|---| | **CTO (master assistant)** | me | orchestrate, decide, verify reviewer claims, integrate, merge, report. Does not write production code. | -| **Implementer** | `minimax` (fresh subagent) | code + tests + specs as delegated. The single producer of code. | -| **Reviewers** | `glm`, `mimo`, `minimax` (fresh-context, never the implementer instance), `qwen`, `deepseek` (5 in parallel) | vote `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. | +| **Implementer** | `minimax` (default `llm-netdata-cloud/minimax-m3-coder`; fresh subagent) | code + tests + specs as delegated. The single producer of code. | +| **Reviewers** | `glm`, `mimo`, `minimax` (fresh-context, never the implementer instance; `llm-netdata-cloud/minimax-m3-coder`), `qwen` (currently `llm-netdata-cloud/qwen3.7-plus`), `deepseek` (5 in parallel) | vote `PRODUCTION GRADE` or `NEEDS WORK` with P0–P3 findings. | **Implementer ≠ Reviewer.** The `minimax` instance that implements a SOW is **not** the same instance that reviews it. The CTO is the only role that runs reviewers. The CTO is the only role that verifies reviewer claims. See `AGENTS.md` for the full contract and `.agents/skills/project-second-opinions/SKILL.md` for reviewer invocation. diff --git a/.agents/sow/specs/second-opinions.md b/.agents/sow/specs/second-opinions.md index 7e8a4aa..ac514fc 100644 --- a/.agents/sow/specs/second-opinions.md +++ b/.agents/sow/specs/second-opinions.md @@ -56,7 +56,7 @@ All commands are documented in `.agents/skills/project-second-opinions/SKILL.md` | `glm` | `timeout 1800 opencode run -m "llm-netdata-cloud/glm-5.1" --agent code-reviewer "PROMPT"` | | `mimo` | `timeout 1800 opencode run -m "llm-netdata-cloud/mimo-v2.5-pro" --agent code-reviewer "PROMPT"` | | `minimax` (fresh-context review pass; **never** the implementer instance) | `timeout 1800 opencode run -m "llm-netdata-cloud/minimax-m3-coder" --agent code-reviewer "PROMPT"` | -| `qwen` | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.6-plus" --agent code-reviewer "PROMPT"` | +| `qwen` | `timeout 1800 opencode run -m "llm-netdata-cloud/qwen3.7-plus" --agent code-reviewer "PROMPT"` | | `deepseek` | `timeout 1800 opencode run -m "llm-netdata-cloud/deepseek-v4-pro" --agent code-reviewer "PROMPT"` | ### Ad-hoc (off the production loop; CTO's discretion) @@ -93,10 +93,14 @@ Therefore: **never narrow the scope between repeated reviews.** Use the same pro ## Recording Reviews -For every reviewer round triggered during a SOW: +For every reviewer round during a SOW, record in the SOW under `## Reviews`: -- Record the prompt used (in the SOW under `## Reviews`). -- Record the reviewer's findings (paraphrased; full output too noisy for a SOW). +- Reviewer attribution and vote (`PRODUCTION GRADE` / `NEEDS WORK`). +- The CTO's claim-verification verdict for each finding. +- The fix applied (or "rejected — false positive" with evidence). +- The final PRODUCTION GRADE count (e.g. `5/5 PG after fix`). + +Do not record full prompts or raw reviewer output in the SOW — those are technical detail. The SOW `## Reviews` is an audit trail of outcomes and verdicts, not a technical dump. - Record what was changed in response. This becomes part of the SOW's audit trail and helps future SOWs learn which reviewers catch which classes of issues. diff --git a/.agents/sow/specs/workflow.md b/.agents/sow/specs/workflow.md index bb577f6..4951eca 100644 --- a/.agents/sow/specs/workflow.md +++ b/.agents/sow/specs/workflow.md @@ -109,7 +109,7 @@ Before reporting to the operator: - Specs reflect new behavior — same commit as code. - Tests exist, pass, race-clean, coverage thresholds met. - All quality gates green locally. -- External review converged for non-trivial work. +- External review converged for non-trivial work (5/5 PRODUCTION GRADE, or only P3 noise with documented disposition). - No new TODO/FIXME without a tracked SOW in `pending/`. - `AGENTS.md`, skills, specs updated if a new pattern or gotcha emerged. - No half-built features. diff --git a/AGENTS.md b/AGENTS.md index 05e9843..5863de5 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -35,7 +35,7 @@ These are the assistant's standing orders. Violating any one is a contract breac 10. **Discipline is recorded.** After every meaningful task, the assistant runs the Discipline Checklist below and updates `AGENTS.md`, the relevant spec, and any relevant skill so the lesson is captured. Repeating a mistake the operator has already corrected is the most serious breach. -11. **The Production-Grade Loop is the operating model.** All production code is produced by `minimax` (the implementer) and reviewed in parallel by `glm`, `mimo`, `minimax` (fresh-context review pass — never the implementer instance), `qwen`, `deepseek` (five reviewers). The CTO verifies every reviewer claim, drives iteration, and merges only on `PRODUCTION GRADE` from all five (or only P3 noise with documented disposition). **This rule is a hard rule, not a guideline. It must survive restarts and compactions** — the full protocol lives in the "Production-Grade Loop" section below and in the `project-second-opinions`, `project-delegation`, and `project-workflow` skills. The CTO does not write production code; the implementer does not run external reviewers; the operator does not see technical detail. +11. **The Production-Grade Loop is the operating model.** All production code is produced by the `minimax` implementer (default `llm-netdata-cloud/minimax-m3-coder`) and reviewed in parallel by exactly five reviewers: `glm`, `mimo`, the `minimax` reviewer (fresh-context, never the implementer instance), `qwen`, `deepseek`. The CTO verifies every reviewer claim, drives iteration, and merges only on `PRODUCTION GRADE` from all five (or only P3 noise with documented disposition). **This rule is a hard rule, not a guideline. It must survive restarts and compactions** — the full protocol lives in the "Production-Grade Loop" section below and in the `project-second-opinions`, `project-delegation`, and `project-workflow` skills. The CTO does not write production code; the implementer does not run external reviewers; the operator does not see technical detail. ## Ownership Model @@ -332,7 +332,8 @@ Findings carry severity: - **5/5 PRODUCTION GRADE, gates green, CI green** → CTO merges. - **Any P0/P1 NEEDS WORK** → fix, push, re-trigger full 5-reviewer cycle. Iterate. -- **P2/P3 NEEDS WORK** → fix in this PR; merge when 5/5 PG or only P3 noise remains. +- **P2 NEEDS WORK** → fix in the same PR, re-trigger the full 5-reviewer cycle; merge only when 5/5 PG or only P3 noise remains. +- **P3 NEEDS WORK** → fix in the same PR, document in SOW `## Reviews`, merge with note when gates green and CI green. - **Hard stall: 5+ cycles with new P0/P1 each round** → CTO writes a `## Regression` section in the SOW, opens a follow-up SOW in `.agents/sow/pending/`, and surfaces to the operator with a business-level recommendation (e.g. "SOW X is blocked on recurring P0 findings; recommend re-scoping or accepting reduced scope"). Do not loop forever; do not over-share reviewer detail in the operator report. ### Claim verification (CRITICAL) @@ -348,7 +349,7 @@ Acting on unverified claims causes two failure modes: (a) implementing phantom b ### Backup implementer -If `minimax` is down/degraded for an extended period, the CTO rotates the implementer role to the next-most-capable member of the reviewer set. Default order: `qwen` → `mimo` → `deepseek` → `glm`. The backup operates under the same protocol — the 5-reviewer cycle still runs, but the implementer slot is filled by the backup. The CTO logs the rotation in the SOW under `## Implementer Rotation`. +If `minimax` is down/degraded for an extended period, the CTO rotates the implementer role to the next-most-capable member of the reviewer set. Default order: `qwen` → `mimo` → `deepseek` → `glm`. The backup operates under the same protocol — but with **one twist**: the rotated reviewer is **removed from the 5-reviewer cycle** for that PR (so the implementer is not also reviewing their own work), and the 5-reviewer set is filled by substituting a reviewer from the ad-hoc set (`codex`, `gemini`, `claude`, `kimi`) chosen by the CTO. The CTO logs the rotation AND the substitution in the SOW under `## Implementer Rotation`. ### Implementer model spec @@ -362,6 +363,12 @@ These run automatically on every PR. They are **supplementary signals, not part - If an automated finding touches an area the 5 reviewers flagged, the CTO re-triggers the 5-reviewer cycle on the new diff. - Cubic, Codacy, and Dependabot backlogs (e.g. SOW-0046, SOW-0047) are tracked separately in the SOW system. +### Reviewer unavailability (single reviewer down) + +If a reviewer in the 5-reviewer set is unavailable for a cycle (litellm error, model deprecated, timeout, rate limit), the CTO retries once. If still unavailable, the CTO substitutes from the ad-hoc set (`codex`, `gemini`, `claude`, `kimi`) and logs the substitution in the SOW `## Reviews` with the reason. The 5-reviewer count remains 5; the substitution is transparent to the operator report (operator still sees `5/5 PG` or `4/5 PG` etc., not the substitution details). + +If two or more reviewers in the 5-reviewer set are unavailable simultaneously, the CTO surfaces to the operator as a hard stall with a business-level recommendation. + ### What the operator sees The operator sees **business outcomes, not technical detail**. In every report: From a6c0c8324b732f1e6d228fe4310f4d4ca703a548 Mon Sep 17 00:00:00 2001 From: user <2662304+ktsaou@users.noreply.github.com> Date: Sun, 14 Jun 2026 17:19:59 +0300 Subject: [PATCH 5/5] Address round-3 P3 nit: align spec/workflow.md stop conditions with AGENTS.md The spec's stop conditions listed P2 and P3 with terse wording that omitted 're-trigger the full cycle' for P2 and 'fix in this PR' for P3, while AGENTS.md and project-second-opinions SKILL both include those qualifiers. Align the spec for parity. --- .agents/sow/specs/workflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.agents/sow/specs/workflow.md b/.agents/sow/specs/workflow.md index 4951eca..709a90b 100644 --- a/.agents/sow/specs/workflow.md +++ b/.agents/sow/specs/workflow.md @@ -83,8 +83,8 @@ Stop conditions: - 5/5 PRODUCTION GRADE → merge. - Any P0/P1 → fix, push, re-trigger full cycle. -- P2 → fix in the same PR; merge when 5/5 PG or only P3 noise remains. -- P3 → document in SOW `## Reviews`, merge with note. +- P2 → fix in the same PR, re-trigger the full cycle; merge only when 5/5 PG or only P3 noise remains. +- P3 → fix in this PR, document in SOW `## Reviews`, merge with note. - Hard stall (5+ cycles with new P0/P1 each round) → write a `## Regression` section, open a follow-up SOW, surface to the operator. Findings addressed in code; reviewers re-run with the same scope plus a fix note; iterate until convergence. History recorded in the SOW under `## Reviews`.