test: remediate tautological instruction-file tests + standing proof-policy sweep#306

Merged
clkao merged 17 commits into next from spacedock-ensign/tautological-test-remediation
Jun 5, 2026

@clkao clkao commented Jun 5, 2026

Tests matching substrings over instruction files the model itself wrote proved
nothing — a meaning-inverting paraphrase still passed. This remediates 56 such
tests and adds a standing guard that enforces the proof-policy.
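
As a minimal illustration of the failure mode described above, the sketch below shows a presence check that stays green when the instruction text is paraphrased into the opposite meaning. The helper name `presenceCheck` and both instruction sentences are hypothetical and are not taken from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// presenceCheck mimics a tautological test: it passes whenever the
// instruction text contains the expected substring, whatever the
// surrounding sentence actually says.
func presenceCheck(instructions, needle string) bool {
	return strings.Contains(instructions, needle)
}

func main() {
	// Hypothetical instruction text and a paraphrase that inverts its meaning.
	original := "Always present the gate before merging."
	inverted := "Never assume you must present the gate before merging."

	fmt.Println(presenceCheck(original, "present the gate")) // true
	fmt.Println(presenceCheck(inverted, "present the gate")) // true
}
```

Both calls report `true`, so a substring assertion of this kind cannot tell the instruction apart from its negation.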

What changed

  • Demote 56 presence-checks to non-AC lints or re-bind them to independent code sources.
  • Add a standing AC-3 sweep in both packages flagging any undeclared instruction-file tautology.
  • Close the match axis: a test that ingests instruction-file bytes must declare itself whatever idiom it uses to inspect them; the enumerated allowlist of match functions is gone.
  • Cover the reader axis (param/field/method/closure/path-build) by taint; document the known out-of-scope classes.

Evidence

  • go test ./...: 1136 passed; both AC-3 sweeps and all evasion controls are green.
  • Every demotion, re-bind, and sweep evasion shape is mutation-controlled (RED-then-GREEN), backed by two detached adversarial audits and three validations.

hw

clkao and others added 17 commits June 5, 2026 00:52
…kers

A go/ast scan over the integration test files flags any test that reads an
LLM-ingested instruction file and substring/regex-matches it without
self-classifying as either a non-AC text-consistency lint (markNonAC, naming
its behavioral oracle) or a code-bound invariant (markCodeBoundInvariant,
naming its independent source). Mutation-controlled by
TestSweepDetectsAnUndeclaredTautology. This is the reproducible AC-3 metric.
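
The scan described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the sweep shipped in this PR: the function `undeclaredTests`, the helper `calleeName`, the sample source, and the signatures assumed for `readSkill` and `markNonAC` are all hypothetical. Only the marker and reader names come from the commit messages.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// sample is a hypothetical test file. readSkill and markNonAC stand in for
// the real helpers; their signatures are assumed for illustration.
const sample = `package x

import (
	"strings"
	"testing"
)

func TestUndeclared(t *testing.T) {
	if !strings.Contains(readSkill(t), "present the gate") {
		t.Fatal("missing")
	}
}

func TestDeclared(t *testing.T) {
	markNonAC(t, "live gate-guardrail scenario")
	if !strings.Contains(readSkill(t), "present the gate") {
		t.Fatal("missing")
	}
}
`

// calleeName renders a call target as "name" or "pkg.Name"; any other
// shape yields the empty string.
func calleeName(call *ast.CallExpr) string {
	switch fn := call.Fun.(type) {
	case *ast.Ident:
		return fn.Name
	case *ast.SelectorExpr:
		if pkg, ok := fn.X.(*ast.Ident); ok {
			return pkg.Name + "." + fn.Sel.Name
		}
	}
	return ""
}

// undeclaredTests lists Test* functions that call a reader and a matcher
// but never call one of the declaring markers.
func undeclaredTests(src string, readers, matchers, markers map[string]bool) ([]string, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "sample_test.go", src, 0)
	if err != nil {
		return nil, err
	}
	var flagged []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil || !strings.HasPrefix(fn.Name.Name, "Test") {
			continue
		}
		reads, matches, declared := false, false, false
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if call, ok := n.(*ast.CallExpr); ok {
				name := calleeName(call)
				reads = reads || readers[name]
				matches = matches || matchers[name]
				declared = declared || markers[name]
			}
			return true
		})
		if reads && matches && !declared {
			flagged = append(flagged, fn.Name.Name)
		}
	}
	return flagged, nil
}

func main() {
	flagged, err := undeclaredTests(sample,
		map[string]bool{"readSkill": true},
		map[string]bool{"strings.Contains": true, "regexp.MatchString": true},
		map[string]bool{"markNonAC": true, "markCodeBoundInvariant": true},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(flagged) // [TestUndeclared]
}
```

Because the check works on the syntax tree alone, it needs no type information and can run over test files without compiling them.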

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o code source

Bucket A (demote to markNonAC, behavioral oracle = live gate-guardrail scenario):
TestGatePresentationPresentInSkill, TestAllNineAssemblyRulesPresentInSkill,
TestGatePresentationAbsentFromFOCore, TestFOCoreInvokesPresentGateSkill.

Bucket B (re-bind to markCodeBoundInvariant): the seam-name + FO-internal checks
now compare the skill frontmatter against the FO contract's actual
Skill(skill="spacedock:present-gate") invocation (independent source); the leak
check binds to the code-derived spacedock vocabulary (AST-extracted from the
dispatch router, status stage-option keys, and cli.go command verbs).

Mutation-controlled: name-drift from the contract seam, user-invocable flip, and
a leak-token insertion each RED; restored green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…seam to contract

Bucket A (demote to markNonAC, oracle = live rejection-flow scenario):
TestFeedbackProcedurePresentInSkill, TestFeedbackFaithfulnessClausesPresentInSkill,
TestFeedbackProcedureAbsentFromFOCore, TestFOCoreInvokesFeedbackRejectionSkill,
TestAlwaysOnMachineryRetainedInFOCore, TestClaudeBareModeSeamStaysConsistent.

Bucket B (re-bind to markCodeBoundInvariant): seam-name + FO-internal checks
compare the skill frontmatter against the FO contract's actual
Skill(skill="spacedock:feedback-rejection-flow") invocation. Mutation-controlled:
name-drift and user-invocable flip each RED; restored green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ecisions to dispatch router

Bucket A demote (oracle = live team-using scenarios): TestGenericBlocksPresentInSkill,
TestGenericBlocksAbsentFromFORuntime, TestFORuntimeInvokesSkill.
Bucket C demote (structural floor): TestFORuntimeDroppedMaterially.
Bucket B re-bind (markCodeBoundInvariant): TestSkillFreeOfSpacedockTokens binds to
the code-derived spacedock vocabulary; TestSpacedockDecisionsStayInFORuntime binds
the retention anchors to the dispatch router's actual subcommands.
Mutation-controlled: leak insertion + decision-anchor rename each RED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d dispatch-flag docs

Bucket C demote (markNonAC, structural/absence): TestShipLocalCeremonyBlockExists,
TestTerminalTeardownIsBoundedBestEffort, TestAwaitingCompletionStillBansPreCompletionTeamDelete,
TestNoPluginStatusPathInVendoredSkills, TestNoPRMergeOrModBehaviorIntroduced,
TestSkillSurfaceDocumentsSpacedockBinInvariant.
Bucket B re-bind: TestFirstOfficerDispatchDocsUseFlagFileMode binds the required
file-backed flags to dispatch.go's isBuildRequestFlag (mutation-controlled — a
flag renamed in the router reds the docs check).

The integration AC-3 sweep TestNoUndeclaredTautologicalProof is now GREEN: zero
undeclared tautological-behavioral-proof tests remain in skills/integration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…markers

A go/ast scan over the hostneutrality test files flags any test that reads a
markdown instruction file (readSkill/readText/parseSpans/parseProseSpansForOverlap,
or os.ReadFile of an instruction-path ident or inline .md literal) and matches it
(inline or via assertAll) without self-classifying via markNonAC or
markCodeBoundInvariant. The go/parser code-scan invariants and the spanHostQualified
unit test are NOT flagged. Mutation-controlled by
TestHostneutralitySweepDetectsAnUndeclaredTautology.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…v checks to code

Bucket B re-bind (markCodeBoundInvariant): TestClaudeAdapterOwnsRelocatedCommands +
TestSharedCoreHasNoUnqualifiedClaudeHelpers bind their command tokens to the
dispatch router (dispatch.go); TestCodexRuntimeAdaptersAreLoadable binds the
host-branch token to the binary's CODEX_THREAD_ID read (build.go);
TestNoCrossFileRestatement binds to its different-file n-gram source.
Bucket C demote (markNonAC, prose with no code analog): TestNoDevLeakageInUniversalCore,
TestWorktreeIsolationClauseSurvives, TestRuntimeAdaptersUseNeutralLocationVocabulary,
TestDevDisciplinesSurviveInDevHomes, TestNoAuditTrailExposition,
TestCodexAwaitingCompletionPinsMailboxSemantics, TestLiveScenarioRecommendedPracticePresent.
Mutation-controlled: subcommand rename + CODEX_THREAD_ID removal each RED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ommand-level oracles

The four split-root prose tests carried the SOLE proof of behavioral claims via a
substring match. Demoted to non-AC text-consistency lints, each naming the real
command-level behavioral oracle that already proves it by driving the binary and
observing output/on-disk state:
- TestFOHaltGateProse -> TestBootJSONStateBackendEntityDirAbsent (binary emits the
  halt signal) + TestStateInitResumesFreshClone (the recovery works)
- TestFOSyncProse / TestEnsignSyncProse -> TestTwoWriterSyncHappyPath +
  TestTwoWriterSameEntityConflictHalts (real 2-writer push/pull-rebase/conflict-halt)
- TestCommissionJourneyProse -> TestStateNewBirthsSplitRoot +
  TestCommissionOrphanBranchScaffolding + TestStateInitInlineNoOp

No new live model scenario was needed: the behaviors are already proven by hermetic
command-level tests, which is stronger and cheaper than a live drive. All cited
oracles verified present and green.

The hostneutrality AC-3 sweep TestNoUndeclaredHostneutralityTautology is now GREEN.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…demote the surfaced tests

The sweep now auto-discovers instruction-file reader helpers to a fixpoint
(a test -> startupStep1 -> foSharedCore -> os.ReadFile chain is caught), and
detects a direct os.ReadFile of a .md skill-tree path. This closed the gap where
a tautology hid behind a multi-hop helper. The newly-surfaced tests are
classified:
- markCodeBoundInvariant: TestStartupEmbeddedRangeBracketsContractVersion
  (contract.CONTRACT_VERSION), TestPiRuntimeAdaptersAreLoadable +
  TestUserSkillReferenceClosureResolves (os.Stat on the real tree).
- markNonAC: TestStartupAbortSplitsByBinaryPresence (doc-as-deliverable),
  TestStartupGateGuidanceHasSingleProseSource (single-source lint),
  TestCodexIdleNotificationRuntimeContract (oracle: captured idle evidence),
  TestPiFirstOfficerRuntimeForbidsSubagentAcceptanceForStages (oracle: Pi live
  runner), TestUserSkillsPresentWithFrontmatter, TestFOContractCarriesWorkingPrinciplesSection,
  TestShippedInstructionsCarryNoInsiderJargon, TestCommissionStateBackendDecisionRule,
  TestReconcileStep0RequiresTeamIdentityForRoster + TestReconcileStep0DropsOptionalTeamNameFraming
  (oracle: internal/dispatch reconcile_session_test.go code gates).

Integration AC-3 sweep GREEN; full offline go test ./... green (1125 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…halt/sync/journey lints

Per team-lead decision (b): the four split-root prose lints (TestFOHaltGateProse,
TestFOSyncProse, TestEnsignSyncProse, TestCommissionJourneyProse) are honestly
demoted to non-AC text-consistency lints whose OWED behavioral proof is the live
FO/ensign drive split out as task ev3e (fo-halt-sync-journey-live-drives). The
command-level tests cited prove the MECHANISM/SIGNAL the FO keys on, not that the
FO obeys it end-to-end; ev3e owns that live drive. The markNonAC oracle string now
names ev3e as the owed drive so the gap is tracked, not silently covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Delete as owed

Per team-lead point 4 (apply the litmus to the extra offenders beyond the named 19):
- TestTerminalTeardownIsBoundedBestEffort: bind to the EXISTING teardown-grade
  drive (#285) — TestTerminalTeardownGradePassesOnMarkerEmission +
  TestTerminalTeardownGradeFailsWhenMarkerNeverEmitted (mutation-controlled
  expectTerminalTeardownGrade) + the live-e2e run. Not a no-drive claim.
- TestAwaitingCompletionStillBansPreCompletionTeamDelete: the pre-completion-
  TeamDelete ban has NO dedicated drive (distinct from the terminal-teardown HANG
  the #285 grade + TestSonnetTeamDeleteHangReplay cover). It is exercised
  IMPLICITLY by every live team scenario (a premature teardown breaks the run) but
  has no dedicated mutation-controlled assertion. Marked OWED and flagged to
  team-lead for a follow-up task — not silently capped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…3e mechanics-covered framing

Per team-lead endorsement:
- The prose-only dev-hygiene lints (TestNoDevLeakageInUniversalCore,
  TestWorktreeIsolationClauseSurvives, TestRuntimeAdaptersUseNeutralLocationVocabulary,
  TestDevDisciplinesSurviveInDevHomes, TestNoAuditTrailExposition) are now declared
  as TEXT-HYGIENE lints (a property of the text), explicitly NOT behavioral claims
  and with NO behavioral-oracle pointer — distinct from the Bucket-A demotions. No
  forced re-bind (no genuine independent source exists; theater avoided).
- The four halt/sync/journey lints: behavioral-issuance rides task ev3e's halt
  drive; the sync/journey MECHANICS are noted as already oracle-covered by the
  named command-level tests (state_sync_test.go, build_statecommit_test.go,
  state_init_test.go / state_new_test.go) per ev3e's ideation, which folded the
  sync/journey residual into the halt scenario. All cited oracles verified green.

go test ./... → 1125 passed; both AC-3 sweeps GREEN.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ery + split-.md, mutation-controlled

Validation Cycle 1 findings 2-4 (integration half). The integration sweep's
reader-discovery missed two reader shapes and one path-literal evasion that a
planted tautology could hide behind:

- path-arg reader (readSkill(t,root,rel)): the helper os.ReadFiles a value built
  from its own string parameter; the .md literal lives in the CALLER. Added
  readsParamPath — parameter-flow detection into the read argument.
- WalkDir collector (shippedSkillText): walks a tree returning .md paths the
  caller reads+matches. Added walksForMarkdown.
- split-.md suffix (name + "." + "md"): added constStringConcat to reconstruct
  constant string concatenations before .md detection, in both the reader and the
  direct-read path.
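
To make the split-.md case concrete, here is a hedged sketch of rejoining trailing string literals in a `+` chain before testing for the `.md` suffix. It is not the PR's `constStringConcat`; the names `flatten`, `trailingLiteral`, and `endsInMarkdown` are hypothetical, and the sketch handles only `+` concatenation of string literals.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// flatten returns the operands of a left-to-right "+" chain, unwrapping
// parentheses along the way.
func flatten(e ast.Expr) []ast.Expr {
	switch v := e.(type) {
	case *ast.BinaryExpr:
		if v.Op == token.ADD {
			return append(flatten(v.X), flatten(v.Y)...)
		}
	case *ast.ParenExpr:
		return flatten(v.X)
	}
	return []ast.Expr{e}
}

// trailingLiteral joins the run of string literals at the end of the chain,
// stopping at the first operand that is not a string literal.
func trailingLiteral(e ast.Expr) string {
	ops := flatten(e)
	suffix := ""
	for i := len(ops) - 1; i >= 0; i-- {
		lit, ok := ops[i].(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			break
		}
		s, err := strconv.Unquote(lit.Value)
		if err != nil {
			break
		}
		suffix = s + suffix
	}
	return suffix
}

// endsInMarkdown reports whether the expression's constant tail ends in ".md".
func endsInMarkdown(exprSrc string) (bool, error) {
	e, err := parser.ParseExpr(exprSrc)
	if err != nil {
		return false, err
	}
	return strings.HasSuffix(trailingLiteral(e), ".md"), nil
}

func main() {
	for _, src := range []string{`name + "." + "md"`, `name + ".json"`} {
		ok, err := endsInMarkdown(src)
		if err != nil {
			panic(err)
		}
		fmt.Println(src, ok)
	}
}
```

A plain suffix test on each literal would miss `name + "." + "md"` because no single literal ends in `.md`; rejoining the constant tail first is what closes that gap in this sketch.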

The four occupant tests the extended discovery surfaced (TestNoPluginPrivateStatusPathInContracts,
TestNoPluginPrivateStatusPathInUserSkills, TestShippedSurfaceHasNoHiddenMachineDependency,
TestPortabilityCheckDiscriminatesHostSpecific) are honest structural-absence /
portability lints — marked markNonAC naming their behavioral coverage (launcher
smoke seam) or pure-portability disposition.

Also classified the two TestPiFirstOfficerRuntime* presence checks the rebase onto
origin/next surfaced — markNonAC naming the Pi live runner.

New planted control TestSweepDetectsEvasionShapes drives each evasion shape
(path-arg, WalkDir, split-.md, multi-hop transitive) and asserts the sweep REDs
then GREENs once declared. Each mechanism independently mutation-verified to RED
when removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…th-arg/WalkDir/split-.md discovery, mutation-controlled

Validation Cycle 1 findings 1, 3 (HN half), 4 (HN half). The HN sweep had NO
transitive reader discovery (the integration sweep already did) — a tautology one
hop down behind a wrapper that calls the named reader readSkill left it GREEN. It
also missed the path-arg/WalkDir reader shapes and the split-.md suffix evasion.

Refactored sweepHostneutralityTautologies to the two-pass fixpoint design: seed
the named readers, grow to a fixpoint over direct-literal / param-path / WalkDir
readers AND transitive callers of a known reader. Added readsParamPath (param-flow
into os.ReadFile/os.Open), walksForMarkdown, and constStringConcat (rejoins a split
.md suffix before detection). Code-scanning via parser.ParseFile (scanFile) is NOT
a ReadFile/Open read, so the go/parser host-neutrality invariants stay unflagged
(39 package tests green, no false positives).
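
The transitive half of the fixpoint can be sketched as below, assuming the call graph has already been extracted into a map from caller to callees. The function `growReaders` and the example graph are hypothetical; the real sweep also grows the set from direct-literal, param-path, and WalkDir readers, which this sketch leaves out.

```go
package main

import (
	"fmt"
	"sort"
)

// growReaders starts from the seed readers and repeatedly marks every
// function that calls a known reader, until a pass adds nothing new.
func growReaders(calls map[string][]string, seeds []string) map[string]bool {
	readers := make(map[string]bool, len(seeds))
	for _, s := range seeds {
		readers[s] = true
	}
	for changed := true; changed; {
		changed = false
		for caller, callees := range calls {
			if readers[caller] {
				continue
			}
			for _, callee := range callees {
				if readers[callee] {
					readers[caller] = true
					changed = true
					break
				}
			}
		}
	}
	return readers
}

func main() {
	// Hypothetical call graph: TestHidden reaches readSkill through a wrapper,
	// while TestClean only parses Go source.
	calls := map[string][]string{
		"TestHidden": {"wrapper"},
		"wrapper":    {"readSkill"},
		"TestClean":  {"scanFile"},
	}
	readers := growReaders(calls, []string{"readSkill"})

	names := make([]string, 0, len(readers))
	for name := range readers {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names) // [TestHidden readSkill wrapper]
}
```

Iterating to a fixpoint rather than checking one hop is what catches a reader hidden behind any number of wrappers. `TestClean` stays out of the set because `scanFile` is never marked as a reader.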

TestHostneutralitySweepDetectsAnUndeclaredTautology gains the multi-hop case
(finding 1's explicit ask). New TestHostneutralitySweepDetectsEvasionShapes drives
multi-hop / path-arg / WalkDir / split-.md and asserts RED-then-GREEN. Each
mechanism independently mutation-verified to RED when removed — including the
transitive fixpoint itself (finding 4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… reader axis classes

Cycle-2 validation + FO detached audit: the sweep rested on three allow-lists,
each admitting a one-line evasion (M1 match idioms: strings.Index/ContainsAny/
EqualFold/bytes.*/regexp.Match/Split; M2 path-build: strings.Join not just +;
M3 flow: struct-field/method-receiver, not just string params). Whack-a-mole —
close the CLASS per both reviewers' endorsed direction.

Match axis: drop the matchFuncs enumeration. The sweep now keys on the READ — a
test that ingests instruction-file content MUST declare markNonAC/markCodeBoundInvariant
regardless of how it inspects the bytes. Zero new markers needed: every shipped
instruction-content reader already declares (the manifest/.json + docs/dev recipe
readers are not instruction-surface and stay unflagged).

Reader axis: detect the read by TAINT (readsInstructionContent). Any string derived
from an instruction-file path — built by +/strings.Join/filepath.Join/fmt.Sprintf,
flowed through a param, a package-wide struct field (instructionTaintedFields), a
method receiver, or a local — flowing into any read sink (ReadFile/Open/ReadAll/
bufio) is an ingest; plus WalkDir-collected .md. isInstructionPathLiteral is the
positive instruction-surface predicate (skill-tree/contract segment, not .json,
not docs/dev recipes — resolves Cycle-2 P1 by scoping to the shipped surface).
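
A much reduced sketch of the taint idea follows, assuming taint starts at any string literal containing `.md` and flows through local assignments into `os.ReadFile` or `os.Open`. The function `readsTaintedPath` and both sample sources are hypothetical. Unlike the `readsInstructionContent` pass described above, the sketch keeps one file-wide set of tainted names and does not model params, struct fields, method receivers, or the instruction-surface predicate.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// viaJoin builds the path with strings.Join and passes it through a local.
const viaJoin = `package x

import (
	"os"
	"strings"
)

func load(root string) []byte {
	p := strings.Join([]string{root, "skills", "SKILL.md"}, "/")
	q := p
	data, _ := os.ReadFile(q)
	return data
}
`

// manifestOnly reads a .json file, which the sketch treats as untainted.
const manifestOnly = `package x

import "os"

func load(root string) []byte {
	data, _ := os.ReadFile(root + "/manifest.json")
	return data
}
`

// readsTaintedPath reports whether a value derived from a ".md" literal
// reaches os.ReadFile or os.Open anywhere in the file.
func readsTaintedPath(src string) (bool, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "sample.go", src, 0)
	if err != nil {
		return false, err
	}
	tainted := map[string]bool{}

	// isTainted is true if the expression mentions a ".md" literal or a
	// name already marked as tainted.
	isTainted := func(e ast.Expr) bool {
		found := false
		ast.Inspect(e, func(n ast.Node) bool {
			switch v := n.(type) {
			case *ast.BasicLit:
				if v.Kind == token.STRING && strings.Contains(v.Value, ".md") {
					found = true
				}
			case *ast.Ident:
				if tainted[v.Name] {
					found = true
				}
			}
			return !found
		})
		return found
	}

	hit := false
	// ast.Inspect visits nodes in source order, so assignments are seen
	// before the reads that follow them.
	ast.Inspect(file, func(n ast.Node) bool {
		switch v := n.(type) {
		case *ast.AssignStmt:
			for i, rhs := range v.Rhs {
				if i >= len(v.Lhs) || !isTainted(rhs) {
					continue
				}
				if id, ok := v.Lhs[i].(*ast.Ident); ok {
					tainted[id.Name] = true
				}
			}
		case *ast.CallExpr:
			sel, ok := v.Fun.(*ast.SelectorExpr)
			if !ok {
				return true
			}
			pkg, ok := sel.X.(*ast.Ident)
			if !ok || pkg.Name != "os" || (sel.Sel.Name != "ReadFile" && sel.Sel.Name != "Open") {
				return true
			}
			for _, arg := range v.Args {
				if isTainted(arg) {
					hit = true
				}
			}
		}
		return true
	})
	return hit, nil
}

func main() {
	for _, src := range []string{viaJoin, manifestOnly} {
		hit, err := readsTaintedPath(src)
		if err != nil {
			panic(err)
		}
		fmt.Println(hit)
	}
}
```

The first sample is reported as an ingest even though the path is built by `strings.Join` and renamed before the read. That is the property an allow-list of path-building idioms fails to guarantee.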

TestSweepDetectsEvasionShapes gains M1 (strings.Index + regexp.Regexp.Match([]byte)),
M2 (strings.Join .md path), M3 (struct-field + method flow) RED-then-GREEN, alongside
the retained 4 cycle-1/2 shapes. Each mechanism independently mutation-verified to
RED when removed. Full integration package green (60), vet clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h + reader axis classes

Port the integration sweep's Cycle-3 positive/taint redesign to the HN sweep so
both packages close the same evasion classes (M1 match idioms, M2 strings.Join
path-build, M3 struct-field/method-receiver flow).

Match axis: drop the matchFuncs enumeration; key on the READ — a test that ingests
instruction-file content MUST declare regardless of inspection idiom. Reader axis:
detect by TAINT (readsInstructionContent) over instructionPathIdents + .md
skill-tree segments, built by any idiom, flowed through param/struct-field/method/
local, into any read sink (ReadFile/Open/ReadAll/bufio) or WalkDir-collected .md.
scanFile's parser.ParseFile over .go source is not a content read sink, so the
go/parser code invariants stay unflagged (40 package tests green, no false positives).

TestHostneutralitySweepDetectsEvasionShapes gains M1 (strings.Index +
regexp.Regexp.Match([]byte)), M2 (strings.Join .md path), M3 (struct-field + method
flow). Each mechanism independently mutation-verified to RED when removed; the M1
positive rule proven load-bearing by re-introducing an enumerated match gate and
observing the strings.Index shape evade again.

Removed the now-dead constStringConcat (it only rejoined + concatenations); the
segment-taint predicate subsumes it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-only)

Cycle-3 captain Option C: the sweep doc comments overclaimed that the guard catches
a tautology regardless of path-construction/flow shape through ANY indirection. The
cycle-3 validation + detached audit falsified that on the READER axis (M-D:
[]string/...string-param + range/slice-element flow evades).

Corrected the wording in both nonac_marker_test.go files to the real bound:
- MATCH axis: closed/universal/load-bearing (keying on the read subsumes every
  inspection idiom) — kept.
- READER axis: detects in-package reads of recognized instruction paths via the
  covered flow shapes (bare-string param, :=/= local, struct field, method receiver,
  closure; path built by +/strings.Join/filepath.Join/fmt.Sprintf).
- KNOWN out-of-scope, tracked in the forked follow-up sweep-guard-reader-axis-invert
  (id 4qnn7dbzkyh9qv65t618vtxy), audit-backstopped: M-A unrecognized surfaces
  (AGENTS.md/mods), M-B cross-package, M-C package-var-from-another-file, M-D
  []string/range flow.

Also corrected the stale 'calls a reader AND a match' map comments (the rule now
keys on the READ alone). NO detection logic changed — the non-comment +/- diff is
empty for both files; go vet clean; go test ./... 1131 green (unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
clkao added a commit that referenced this pull request Jun 5, 2026
@clkao clkao merged commit 2f27fdf into next Jun 5, 2026
9 of 10 checks passed
clkao added a commit that referenced this pull request Jun 5, 2026
…e+timeout saga, API-overload reckoning, 0.19.6 slate filed