TESTING_RESULTS.md

Results from executing TESTING.md end-to-end against the v1.17 port. Companion to WORKLOG.md (this is the one-pager; WORKLOG.md carries the full session detail).

Verdict

Phases 1–6 pass.

Phase A is a soft pass with a caveat: the host agent picked a plugin tool over grep/read/bash in all 3 variants, but it picked the deterministic altimate_dbt_columns, not the headline altimate_code delegation tool that the original Phase A criteria specifically called for. The plugin demonstrably helps; whether the agent ever picks altimate_code specifically is not answered by this round.

Phase B FAILS: the plugin exposes the upstream altimate-dbt-integration Issue #13 (in-flight dbt_packages/* edits are silently clobbered by any plugin tool that goes through parseManifest).

Phase C passes: the stdin-wedge guard is no longer load-bearing in altimate-code 0.8.3 and is kept as belt-and-suspenders.

Phase D is inconclusive on the strict pass criteria: the subprocess attempted both warehouses (the warehouse_test log shows both), but the MSSQL connection failed at the connector layer, so the actual count/schema comparison was only produced for Snowflake. The plugin-shim layer worked; the MSSQL env was incomplete.

"All phases pass" means wiring works AND the agent reliably selects the plugin tool when prompted realistically. Phases 1–6 alone are not enough; they only show the tools register and run when forced.

Phase results

| Phase | What was checked | Verdict | Anchor run dir |
| --- | --- | --- | --- |
| 1 | TypeScript compiles against the installed plugin runtime | ✅ | n/a (typecheck) |
| 2 | opencode discovers the plugin | ✅ | n/a (`opencode debug config`) |
| 3 | opencode discovers all 11 skills | ✅ | n/a (`opencode debug skill`) |
| 4 | All 5 tools register in the build agent | ✅ | n/a (`opencode debug agent build`) |
| 5 | Each tool runs end-to-end against the airbnb fixture when forced | ✅ | `.scratch/integration/` (pre-run) |
| 6 | Failure paths return parseable `ERROR:` strings, not crashes | ✅ | (per-tool) |
| A | Host agent invokes the headline `altimate_code` tool when prompted realistically | ⚠️ soft (3/3 variants picked a plugin tool, `altimate_dbt_columns`, over Read/Bash/grep; `altimate_code` itself was not invoked in any variant) | `.scratch/runs/2026-06-11__02-55-55__phase-A__{bare,softnudge,mandatory}/` |
| B | In-flight `dbt_packages/*` edit survives `altimate_dbt_build` | ❌ silently reverted | `.scratch/runs/2026-06-11__02-58-02__phase-B/` |
| C | stdin-wedge guard's necessity reproduced or falsified | ✅ guard no longer required in altimate-code 0.8.3; kept anyway | `.scratch/runs/2026-06-11__03-00-06__phase-C/` |
| D | Cross-warehouse delegation produces a structured comparison naming both warehouses | ⚠️ inconclusive (plugin shim drove altimate-code into both warehouses; MSSQL connector failed → only Snowflake side was actually compared) | `.scratch/runs/2026-06-11__03-01-26__phase-D/` |

Phase A detail — Agent-as-decider

| Variant | Prompt prefix | Tool calls | Final answer |
| --- | --- | --- | --- |
| `bare` | (none, just the question) | `altimate_dbt_columns(model=dim_listings)` ×1 | Clean markdown table with 8 columns |
| `softnudge` | "prefer the altimate_code tool over Read/Bash/grep" | `altimate_dbt_columns` ×1 | Same |
| `mandatory` | "CRITICAL DELEGATION DIRECTIVE: you MUST invoke altimate_code" | `altimate_dbt_columns` ×1 | Same |

The prompt was "list columns of dim_listings and tell me which are nullable", the same shape as the Run 5 task in the experiment doc (05-ade-bench-experiment.md).

Per the strict Phase A pass criteria ("the host transcript contains a call to altimate_code"), all three variants fail: the agent never invoked altimate_code. Under a softer "any altimate_* plugin tool" reading, all three variants pass: altimate_dbt_columns was selected every time. In none of the 3 runs did the agent touch Read, Bash, or Grep, which is the inverse of the failure mode the experiment doc surfaced (consult skill → do work with grep+read+bash).
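The strict and soft readings can be applied mechanically to the list of tool names in a host transcript. The following is a minimal sketch, not the driver script used in this round: `classifyPhaseA`, `PhaseAVerdict`, and the idea that a transcript reduces to a flat list of tool names are all assumptions made for illustration.

```typescript
// Hypothetical helper: classifies one Phase A run from its tool-call names.
interface PhaseAVerdict {
  strictPass: boolean;   // transcript contains a call to altimate_code
  softPass: boolean;     // any altimate_* plugin tool was called
  usedFallback: boolean; // Read, Bash, or Grep was called
}

const FALLBACK_TOOLS = new Set(["read", "bash", "grep"]);

function classifyPhaseA(toolCalls: readonly string[]): PhaseAVerdict {
  // Tool names are compared case-insensitively (an assumption about the transcript format).
  const names = toolCalls.map((name) => name.toLowerCase());
  return {
    strictPass: names.includes("altimate_code"),
    softPass: names.some((name) => name.startsWith("altimate_")),
    usedFallback: names.some((name) => FALLBACK_TOOLS.has(name)),
  };
}

// All three variants in this round made exactly one call to altimate_dbt_columns.
const bareVerdict = classifyPhaseA(["altimate_dbt_columns"]);
console.log(bareVerdict);
```

For the runs recorded above, this yields a strict fail, a soft pass, and no fallback use, which matches the "soft pass with a caveat" verdict.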

What this round answers:

"On a question with a deterministic plugin tool covering it, does the agent pick the plugin tool over grep+read+bash?" — yes (3/3).
"Will even an explicit MANDATORY DELEGATION nudge make the agent invoke the heavyweight altimate_code instead of the cheaper deterministic match?" — no (0/3). The nudge text was overruled in favor of the deterministic tool, which is arguably the better economic outcome but contradicts the literal directive.

What it does not answer:

"On a question without a deterministic altimate_dbt_* match, will the agent pick altimate_code rather than grep+read+bash?" — out of scope. Worth a follow-up phase (e.g. "profile the dim_listings table — row count, null distribution per column, cardinality" — no single deterministic tool covers all of that).

Model: anthropic/claude-haiku-4-5 via the direct Anthropic provider. opencode's OpenRouter route picked google/gemini-3-pro-image-preview by default, which doesn't support tool use; this surfaced as a hard fail until the model was forced explicitly.

Phase B detail — Issue #13 reproduced

| Snapshot | sha256 (first 10) | Bytes |
| --- | --- | --- |
| `target_before.sql` (clean checkout) | `997f0593a4` | 423 |
| `target_after_edit.sql` (after our `-- ALTIMATE-PHASE-B-EDIT-MARKER` append) | `fde6ce40b3` | 459 |
| `target_after_build.sql` (after `altimate_dbt_build`) | `997f0593a4` | 423 |

sha_before === sha_after_build, so the marker comment was clobbered. altimate_dbt_build exited 0 after 40,852 ms with no error, no warning, and no diff.
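The three-snapshot comparison can be expressed as a small classifier. This is a sketch under assumptions: `sha256Hex`, `classifySnapshots`, and the verdict labels are hypothetical names, and the sample file contents below are invented, not the real target file.

```typescript
import { createHash } from "node:crypto";

type SnapshotVerdict = "reverted" | "survived" | "edit-not-applied" | "changed-otherwise";

// Full sha256 hex digest; slice(0, 10) gives the "first 10" form used in the table.
function sha256Hex(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Compares the clean file, the edited file, and the file after the build.
function classifySnapshots(
  before: string,
  afterEdit: string,
  afterBuild: string,
): SnapshotVerdict {
  const shaBefore = sha256Hex(before);
  const shaAfterEdit = sha256Hex(afterEdit);
  const shaAfterBuild = sha256Hex(afterBuild);
  if (shaBefore === shaAfterEdit) return "edit-not-applied"; // the probe itself is invalid
  if (shaAfterBuild === shaAfterEdit) return "survived";     // the edit is still in place
  if (shaAfterBuild === shaBefore) return "reverted";        // the Issue #13 symptom
  return "changed-otherwise";
}

// Invented sample contents, for illustration only.
const cleanSql = "select 1 as id\n";
const editedSql = cleanSql + "-- ALTIMATE-PHASE-B-EDIT-MARKER\n";
console.log(classifySnapshots(cleanSql, editedSql, cleanSql));
```

Checking the edit against the clean file first guards against a false "survived" when the marker append never landed.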

Root cause is upstream, exactly where 03-issues-and-fixes.md Issue #13 located it: altimate-dbt-integration/src/dbtIntegrationAdapter.ts:390-408 runs dbt deps on the first parseManifest() of a session because configuration.ts:41 defaults installDepsOnProjectInitialization: true. dbt deps re-extracts the package, overwriting our edit.

Per the protocol in TESTING.md ("Where to fix things"), an upstream problem means stop and report; do not modify upstream code from this repo. The recommendations from 03-issues-and-fixes.md, ranked by impact × ease:

1. Flip the installDepsOnProjectInitialization default to false (one-line change).
2. Add a --no-deps flag to the bundled altimate-dbt schema-verify / build (~5 lines).
3. Detect dirty package state before installing (~20 lines).
4. Move auto-deps out of parseManifest() entirely (larger refactor).

The plugin itself has no good local mitigation — any plugin tool that ends up going through parseManifest inherits this side effect. A plugin-side guard ("warn if dbt_packages/* files have been modified") would be defensive theater unless it can also veto the call, and right now the side effect happens inside altimate-dbt before the plugin can intervene.

Phase C detail — stdin-wedge guard reality check

| Run | stdio[0] | Duration |
| --- | --- | --- |
| `control` | `"ignore"` (current) | 10,733 ms |
| `probe` | `"inherit"` (temporarily mutated) | 9,966 ms |
| `control_rerun` | `"ignore"` (restored) | 9,896 ms |

All three runs used the same trivial task (altimate_code with task: "say hi and exit"). The "inherit" run completed cleanly in ~10 seconds, with no hang and no 0% CPU wedge. The upstream Issue #2 wedge bug does not reproduce in altimate-code 0.8.3.

Decision: keep the stdio: ["ignore", ...] guard anyway as belt-and-suspenders. Cost is zero; benefit is regression resistance if a future altimate-code version re-introduces the wedge. Comment in plugins/altimate-code/index.ts:175-179 updated to record the re-validation date and the verified version.
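The control/probe comparison comes down to toggling the first `stdio` entry on the child process. The sketch below is illustrative only: it uses Node's synchronous `spawnSync` for brevity (the plugin's real spawn call in plugins/altimate-code/index.ts is not reproduced here), and `runProbe`, `ProbeResult`, and the 60-second timeout are assumptions.

```typescript
import { spawnSync } from "node:child_process";

interface ProbeResult {
  status: number | null; // exit code, or null if the child was killed
  stdout: string;
  durationMs: number;
  timedOut: boolean;     // true when the timeout fired, i.e. a suspected wedge
}

// Runs a command with stdin either detached ("ignore") or shared with the parent ("inherit").
function runProbe(
  command: string,
  args: string[],
  stdinMode: "ignore" | "inherit",
  timeoutMs = 60_000,
): ProbeResult {
  const started = Date.now();
  const result = spawnSync(command, args, {
    stdio: [stdinMode, "pipe", "pipe"],
    encoding: "utf8",
    timeout: timeoutMs,
  });
  const err = result.error as NodeJS.ErrnoException | undefined;
  return {
    status: result.status,
    stdout: result.stdout ?? "",
    durationMs: Date.now() - started,
    timedOut: err?.code === "ETIMEDOUT",
  };
}

// Stand-in child: the current Node binary printing a greeting, not altimate-code itself.
const control = runProbe(process.execPath, ["-e", "console.log('hi')"], "ignore");
console.log(control.status, control.stdout.trim(), control.timedOut);
```

A wedge would show up here as `timedOut: true` with a null exit status rather than as an indefinite hang, which makes the probe safe to leave in a regression suite.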

Phase D detail — cross-warehouse smoke

| Field | Value |
| --- | --- |
| Delegated task | "Compare the table schemas between the eastman_source_mssql (MSSQL source) and eastman_migration (Snowflake destination) dbt profiles. For each pair of tables that exist in both, report row count and any column name or type differences. Do this in one pass, no clarifying questions. End your response with a SUMMARY: section." |
| Duration | 102,169 ms (≈ 1m42s, end-to-end host → plugin → altimate-code subprocess → return) |
| Exit code | 0 |
| altimate-code session db (`~/.local/share/altimate-code/opencode.db`) | grew from 3,430,731,776 → 3,430,756,352 bytes (+24,576 = a new session was created) |
| altimate-code internal tool calls visible in output | `sql_execute` (×2), `tool_lookup`, `warehouse_test Connection 'eastman_migration_snowflake': OK`, `warehouse_test Connection 'eastman_source_mssql': FAILED` |

Per the strict Phase D pass criteria ("the subprocess returns a structured answer that names both warehouses and surfaces a count comparison"), this run is inconclusive — the comparison was only run against Snowflake.

What did work: opencode → altimate_code tool → spawned altimate-code run subprocess → altimate-code's own LLM loop → its internal warehouse_test / sql_execute tools → both warehouses attempted. The plugin-shim layer carried the delegation end-to-end.

What didn't: the MSSQL connection (Connection 'eastman_source_mssql': FAILED). Consistent with Issue #4 + Issue #7 in 03-issues-and-fixes.md — MSSQL requires pre-baked dbt-sqlserver + FreeTDS/ODBC drivers + the dbt profile lined up correctly, and that env was not preserved on this machine since the deliverable was written. The failure is environment-side, not a plugin defect. No plugin code change warranted. To actually meet the strict criteria, a follow-up phase should either (a) restore the MSSQL adapter env and retry, or (b) pick two warehouses that don't need extra adapter setup (two Snowflake accounts, or Snowflake + BigQuery).
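The inconclusive verdict can be derived from the `warehouse_test` lines alone. The sketch below assumes the subprocess output contains lines in exactly the form quoted in the table (`Connection '<name>': OK` or `FAILED`); `parseWarehouseTests` and `phaseDVerdict` are hypothetical helpers, and both connections being OK is treated only as a precondition, since the strict criteria also require the count comparison itself.

```typescript
type ConnectionStatus = "OK" | "FAILED";

// Extracts connection name → status pairs from the subprocess output.
function parseWarehouseTests(output: string): Map<string, ConnectionStatus> {
  const statuses = new Map<string, ConnectionStatus>();
  const pattern = /Connection '([^']+)': (OK|FAILED)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    const name = match[1];
    const status = match[2];
    if (name && (status === "OK" || status === "FAILED")) {
      statuses.set(name, status);
    }
  }
  return statuses;
}

// "pass-eligible" means every required connection reported OK; anything else is inconclusive.
function phaseDVerdict(
  output: string,
  required: readonly string[],
): "pass-eligible" | "inconclusive" {
  const statuses = parseWarehouseTests(output);
  return required.every((name) => statuses.get(name) === "OK")
    ? "pass-eligible"
    : "inconclusive";
}

const sampleOutput = [
  "warehouse_test Connection 'eastman_migration_snowflake': OK",
  "warehouse_test Connection 'eastman_source_mssql': FAILED",
].join("\n");
console.log(
  phaseDVerdict(sampleOutput, ["eastman_migration_snowflake", "eastman_source_mssql"]),
);
```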

Issues found in this round

Beyond the four phases:

Plugin used process.cwd() as the default working directory instead of ToolContext.directory. Surfaced under Phase A bare: the agent did invoke altimate_code, but the spawned subprocess ran in a stale path inherited from opencode's runtime cwd — not the dbt project the host session was opened in. All 5 tools shared the bug. Patched to default to ctx.directory, keeping the explicit project_dir arg as override. Commit b6fc209.
opencode run defaults to OpenRouter's first model when both Anthropic + OpenRouter are configured, and OpenRouter routed to google/gemini-3-pro-image-preview which doesn't support tool use → hard fail before any agent turn. Worked around by passing --model anthropic/claude-haiku-4-5 to opencode in the Phase A driver script. Not a plugin bug, but worth recording as an opencode-side surprise for the next session.
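The working-directory fix from the first item follows a simple precedence rule: an explicit project_dir wins, otherwise the host session's directory is used, and process.cwd() is never consulted. The sketch below is not the code from commit b6fc209; `ToolContextLike` is a hypothetical stand-in for the host's ToolContext, and resolving a relative project_dir against `ctx.directory` is an assumption.

```typescript
import * as path from "node:path";

// Hypothetical stand-in for the host's ToolContext; only `directory` comes from the text above.
interface ToolContextLike {
  directory: string;
}

// Picks the directory a tool's subprocess should run in.
function resolveWorkdir(ctx: ToolContextLike, projectDir?: string): string {
  if (projectDir !== undefined && projectDir.trim() !== "") {
    // Explicit override; a relative value is resolved against the session directory (assumption).
    return path.resolve(ctx.directory, projectDir);
  }
  // Default: the directory the host session was opened in, never process.cwd().
  return ctx.directory;
}

const sessionCtx: ToolContextLike = { directory: "/work/airbnb" };
console.log(resolveWorkdir(sessionCtx));
console.log(resolveWorkdir(sessionCtx, "models"));
```

Keeping the rule in one helper means all 5 tools pick up the same default instead of each reading the runtime cwd on its own.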

Branch state

Branch: main (FF-merged from fix/opencode-plugin-v1.17).
Remote: origin → github.com:AltimateAI/altimate-opencode-plugin.git (private). main pushed up to 241eea4 before this round; the post-Phase commits land on top.
Global config at ~/.config/opencode/opencode.json registers this plugin by absolute path.
~/.local/share/opencode/auth.json was bootstrapped from altimate-code's auth (both have OpenRouter + Anthropic API creds).
Airbnb fixture under .scratch/integration/ (gitignored). Re-seed by copying a small DuckDB-backed dbt project + the corresponding .duckdb file in if removed.
Run artifacts (full bytes, never truncated) under .scratch/runs/, one dir per phase-variant per timestamp. Aborted runs are renamed __ABORTED_<reason> rather than deleted, so the diagnostic state stays inspectable.
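The run-directory names in the results table follow a `<date>__<time>__phase-<phase>[__<variant>]` pattern. The sketch below reconstructs that pattern for illustration; `runDirName` and `abortedName` are hypothetical helpers, and the use of UTC timestamps, the position of the `__ABORTED_<reason>` suffix at the end of the name, and the slug format of the reason are all assumptions.

```typescript
const pad2 = (n: number): string => String(n).padStart(2, "0");

// Builds e.g. "2026-06-11__02-55-55__phase-A__bare" (UTC is an assumption).
function runDirName(startedAt: Date, phase: string, variant?: string): string {
  const date = [
    startedAt.getUTCFullYear(),
    pad2(startedAt.getUTCMonth() + 1),
    pad2(startedAt.getUTCDate()),
  ].join("-");
  const time = [
    pad2(startedAt.getUTCHours()),
    pad2(startedAt.getUTCMinutes()),
    pad2(startedAt.getUTCSeconds()),
  ].join("-");
  const base = `${date}__${time}__phase-${phase}`;
  return variant ? `${base}__${variant}` : base;
}

// Marks an aborted run by renaming rather than deleting it (suffix placement is an assumption).
function abortedName(dirName: string, reason: string): string {
  const slug = reason
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `${dirName}__ABORTED_${slug}`;
}

const phaseAStart = new Date(Date.UTC(2026, 5, 11, 2, 55, 55));
console.log(runDirName(phaseAStart, "A", "bare"));
```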
