test: remediate tautological instruction-file tests + standing proof-policy sweep #306
Merged
…kers A go/ast scan over the integration test files flags any test that reads an LLM-ingested instruction file and substring/regex-matches it without self-classifying as either a non-AC text-consistency lint (markNonAC, naming its behavioral oracle) or a code-bound invariant (markCodeBoundInvariant, naming its independent source). Mutation-controlled by TestSweepDetectsAnUndeclaredTautology. This is the reproducible AC-3 metric. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
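A minimal sketch of the kind of go/ast scan this commit describes, not the repository's actual sweep. It assumes a much narrower rule than the real one: any `Test*` function that calls `os.ReadFile` counts as reading an instruction file, and it is flagged unless it also calls `markNonAC` or `markCodeBoundInvariant`. The marker call shape `markNonAC(t, "...")` and the sample test source are hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
	"strings"
)

// markers are the self-classification helpers named in the commit message.
var markers = map[string]bool{"markNonAC": true, "markCodeBoundInvariant": true}

// undeclaredReaders returns the Test* functions in src that call os.ReadFile
// but never call a marker. Simplification: every os.ReadFile is treated as an
// instruction-file read.
func undeclaredReaders(src string) ([]string, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "sample_test.go", src, 0)
	if err != nil {
		return nil, err
	}
	var flagged []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil || fn.Recv != nil || !strings.HasPrefix(fn.Name.Name, "Test") {
			continue
		}
		reads, declared := false, false
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			switch f := call.Fun.(type) {
			case *ast.Ident:
				if markers[f.Name] {
					declared = true
				}
			case *ast.SelectorExpr:
				if pkg, ok := f.X.(*ast.Ident); ok && pkg.Name == "os" && f.Sel.Name == "ReadFile" {
					reads = true
				}
			}
			return true
		})
		if reads && !declared {
			flagged = append(flagged, fn.Name.Name)
		}
	}
	sort.Strings(flagged)
	return flagged, nil
}

// sample is hypothetical test source; it is only parsed, never compiled.
const sample = `package x

import ("os"; "strings"; "testing")

func TestUndeclared(t *testing.T) {
	b, _ := os.ReadFile("skills/demo/SKILL.md")
	if !strings.Contains(string(b), "gate") { t.Fatal("missing") }
}

func TestDeclared(t *testing.T) {
	markNonAC(t, "oracle: live scenario")
	b, _ := os.ReadFile("skills/demo/SKILL.md")
	_ = b
}
`

func main() {
	flagged, err := undeclaredReaders(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(flagged) // prints [TestUndeclared]
}
```

The same function doubles as a mutation control in the spirit of TestSweepDetectsAnUndeclaredTautology: deleting the marker call from `TestDeclared` in the sample should make it appear in the result.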
…o code source Bucket A (demote to markNonAC, behavioral oracle = live gate-guardrail scenario): TestGatePresentationPresentInSkill, TestAllNineAssemblyRulesPresentInSkill, TestGatePresentationAbsentFromFOCore, TestFOCoreInvokesPresentGateSkill. Bucket B (re-bind to markCodeBoundInvariant): the seam-name + FO-internal checks now compare the skill frontmatter against the FO contract's actual Skill(skill="spacedock:present-gate") invocation (independent source); the leak check binds to the code-derived spacedock vocabulary (AST-extracted from the dispatch router, status stage-option keys, and cli.go command verbs). Mutation-controlled: name-drift from the contract seam, user-invocable flip, and a leak-token insertion each RED; restored green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…seam to contract Bucket A (demote to markNonAC, oracle = live rejection-flow scenario): TestFeedbackProcedurePresentInSkill, TestFeedbackFaithfulnessClausesPresentInSkill, TestFeedbackProcedureAbsentFromFOCore, TestFOCoreInvokesFeedbackRejectionSkill, TestAlwaysOnMachineryRetainedInFOCore, TestClaudeBareModeSeamStaysConsistent. Bucket B (re-bind to markCodeBoundInvariant): seam-name + FO-internal checks compare the skill frontmatter against the FO contract's actual Skill(skill="spacedock:feedback-rejection-flow") invocation. Mutation-controlled: name-drift and user-invocable flip each RED; restored green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ecisions to dispatch router Bucket A demote (oracle = live team-using scenarios): TestGenericBlocksPresentInSkill, TestGenericBlocksAbsentFromFORuntime, TestFORuntimeInvokesSkill. Bucket C demote (structural floor): TestFORuntimeDroppedMaterially. Bucket B re-bind (markCodeBoundInvariant): TestSkillFreeOfSpacedockTokens binds to the code-derived spacedock vocabulary; TestSpacedockDecisionsStayInFORuntime binds the retention anchors to the dispatch router's actual subcommands. Mutation-controlled: leak insertion + decision-anchor rename each RED. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d dispatch-flag docs Bucket C demote (markNonAC, structural/absence): TestShipLocalCeremonyBlockExists, TestTerminalTeardownIsBoundedBestEffort, TestAwaitingCompletionStillBansPreCompletionTeamDelete, TestNoPluginStatusPathInVendoredSkills, TestNoPRMergeOrModBehaviorIntroduced, TestSkillSurfaceDocumentsSpacedockBinInvariant. Bucket B re-bind: TestFirstOfficerDispatchDocsUseFlagFileMode binds the required file-backed flags to dispatch.go's isBuildRequestFlag (mutation-controlled — a flag renamed in the router reds the docs check). The integration AC-3 sweep TestNoUndeclaredTautologicalProof is now GREEN: zero undeclared tautological-behavioral-proof tests remain in skills/integration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
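The Bucket B re-bind above can be illustrated with a small sketch: derive the required flag set from the router's source via go/ast, then check the docs against that set, so a rename in the router turns the docs check red. This is an illustration under assumptions, not the code in this PR: the body of `isBuildRequestFlag`, the flag names, and the docs excerpt are all invented.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// routerSrc is a hypothetical stand-in for dispatch.go; the flag names are invented.
const routerSrc = `package dispatch

func isBuildRequestFlag(name string) bool {
	switch name {
	case "--prompt-file", "--context-file":
		return true
	}
	return false
}
`

// mutatedSrc simulates renaming one flag in the router without touching the docs.
var mutatedSrc = strings.Replace(routerSrc, "--context-file", "--ctx-file", 1)

// docs is a hypothetical instruction-file excerpt.
const docs = "Dispatch with --prompt-file and --context-file so arguments stay file-backed."

// caseLiterals returns every string literal used in a case clause of funcName.
func caseLiterals(src, funcName string) ([]string, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "router.go", src, 0)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil || fn.Name.Name != funcName {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			cc, ok := n.(*ast.CaseClause)
			if !ok {
				return true
			}
			for _, e := range cc.List {
				if lit, ok := e.(*ast.BasicLit); ok && lit.Kind == token.STRING {
					if s, err := strconv.Unquote(lit.Value); err == nil {
						out = append(out, s)
					}
				}
			}
			return true
		})
	}
	return out, nil
}

// undocumented returns the router-derived flags that docs never mentions.
func undocumented(src, docs string) ([]string, error) {
	flags, err := caseLiterals(src, "isBuildRequestFlag")
	if err != nil {
		return nil, err
	}
	var missing []string
	for _, f := range flags {
		if !strings.Contains(docs, f) {
			missing = append(missing, f)
		}
	}
	return missing, nil
}

func main() {
	green, err := undocumented(routerSrc, docs)
	if err != nil {
		panic(err)
	}
	red, err := undocumented(mutatedSrc, docs)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(green), red) // prints 0 [--ctx-file]
}
```

The point of the design is that the expected strings come from an independent source (the code), so the test can fail for a reason other than the docs disagreeing with themselves. A real check would also want to fail when the extracted vocabulary is empty, since an empty set passes vacuously.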
…markers A go/ast scan over the hostneutrality test files flags any test that reads a markdown instruction file (readSkill/readText/parseSpans/parseProseSpansForOverlap, or os.ReadFile of an instruction-path ident or inline .md literal) and matches it (inline or via assertAll) without self-classifying via markNonAC or markCodeBoundInvariant. The go/parser code-scan invariants and the spanHostQualified unit test are NOT flagged. Mutation-controlled by TestHostneutralitySweepDetectsAnUndeclaredTautology. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…v checks to code Bucket B re-bind (markCodeBoundInvariant): TestClaudeAdapterOwnsRelocatedCommands + TestSharedCoreHasNoUnqualifiedClaudeHelpers bind their command tokens to the dispatch router (dispatch.go); TestCodexRuntimeAdaptersAreLoadable binds the host-branch token to the binary's CODEX_THREAD_ID read (build.go); TestNoCrossFileRestatement binds to its different-file n-gram source. Bucket C demote (markNonAC, prose with no code analog): TestNoDevLeakageInUniversalCore, TestWorktreeIsolationClauseSurvives, TestRuntimeAdaptersUseNeutralLocationVocabulary, TestDevDisciplinesSurviveInDevHomes, TestNoAuditTrailExposition, TestCodexAwaitingCompletionPinsMailboxSemantics, TestLiveScenarioRecommendedPracticePresent. Mutation-controlled: subcommand rename + CODEX_THREAD_ID removal each RED. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ommand-level oracles

The four split-root prose tests carried the SOLE proof of behavioral claims via a substring match. Demoted to non-AC text-consistency lints, each naming the real command-level behavioral oracle that already proves it by driving the binary and observing output/on-disk state:

- TestFOHaltGateProse -> TestBootJSONStateBackendEntityDirAbsent (binary emits the halt signal) + TestStateInitResumesFreshClone (the recovery works)
- TestFOSyncProse / TestEnsignSyncProse -> TestTwoWriterSyncHappyPath + TestTwoWriterSameEntityConflictHalts (real 2-writer push/pull-rebase/conflict-halt)
- TestCommissionJourneyProse -> TestStateNewBirthsSplitRoot + TestCommissionOrphanBranchScaffolding + TestStateInitInlineNoOp

No new live model scenario was needed: the behaviors are already proven by hermetic command-level tests, which is stronger and cheaper than a live drive. All cited oracles verified present and green. The hostneutrality AC-3 sweep TestNoUndeclaredHostneutralityTautology is now GREEN.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…demote the surfaced tests

The sweep now auto-discovers instruction-file reader helpers to a fixpoint (a test -> startupStep1 -> foSharedCore -> os.ReadFile chain is caught), and detects a direct os.ReadFile of a .md skill-tree path. This closed the gap where a tautology hid behind a multi-hop helper. The newly-surfaced tests are classified:

- markCodeBoundInvariant: TestStartupEmbeddedRangeBracketsContractVersion (contract.CONTRACT_VERSION), TestPiRuntimeAdaptersAreLoadable + TestUserSkillReferenceClosureResolves (os.Stat on the real tree).
- markNonAC: TestStartupAbortSplitsByBinaryPresence (doc-as-deliverable), TestStartupGateGuidanceHasSingleProseSource (single-source lint), TestCodexIdleNotificationRuntimeContract (oracle: captured idle evidence), TestPiFirstOfficerRuntimeForbidsSubagentAcceptanceForStages (oracle: Pi live runner), TestUserSkillsPresentWithFrontmatter, TestFOContractCarriesWorkingPrinciplesSection, TestShippedInstructionsCarryNoInsiderJargon, TestCommissionStateBackendDecisionRule, TestReconcileStep0RequiresTeamIdentityForRoster + TestReconcileStep0DropsOptionalTeamNameFraming (oracle: internal/dispatch reconcile_session_test.go code gates).

Integration AC-3 sweep GREEN; full offline go test ./... green (1125 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
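The fixpoint discovery described in this commit can be sketched over a plain call graph. This is an assumption-laden illustration: the real sweep builds its graph from go/ast and restricts readers to instruction-file reads, whereas here the graph is a hand-written map and any function that reaches the seed counts as a reader. `growReaders`, `TestStartup`, and `TestPure` are hypothetical names; `startupStep1` and `foSharedCore` come from the chain quoted in the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// growReaders expands seed to every function that transitively calls a known
// reader. calls maps a caller to the functions it invokes directly.
func growReaders(calls map[string][]string, seed []string) map[string]bool {
	readers := map[string]bool{}
	for _, s := range seed {
		readers[s] = true
	}
	// Repeat until a full pass adds nothing new: that is the fixpoint.
	for changed := true; changed; {
		changed = false
		for caller, callees := range calls {
			if readers[caller] {
				continue
			}
			for _, callee := range callees {
				if readers[callee] {
					readers[caller] = true
					changed = true
					break
				}
			}
		}
	}
	return readers
}

func main() {
	calls := map[string][]string{
		"TestStartup":  {"startupStep1"},
		"startupStep1": {"foSharedCore"},
		"foSharedCore": {"os.ReadFile"},
		"TestPure":     {"strings.ToUpper"},
	}
	readers := growReaders(calls, []string{"os.ReadFile"})

	var names []string
	for name := range readers {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names) // prints [TestStartup foSharedCore os.ReadFile startupStep1]
}
```

A single pass in map order could miss `TestStartup` if it is visited before `startupStep1` has been marked; iterating to a fixpoint removes that ordering dependence, which is what lets a multi-hop helper chain be caught.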
…halt/sync/journey lints Per team-lead decision (b): the four split-root prose lints (TestFOHaltGateProse, TestFOSyncProse, TestEnsignSyncProse, TestCommissionJourneyProse) are honestly demoted to non-AC text-consistency lints whose OWED behavioral proof is the live FO/ensign drive split out as task ev3e (fo-halt-sync-journey-live-drives). The command-level tests cited prove the MECHANISM/SIGNAL the FO keys on, not that the FO obeys it end-to-end; ev3e owns that live drive. The markNonAC oracle string now names ev3e as the owed drive so the gap is tracked, not silently covered. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Delete as owed

Per team-lead point 4 (apply the litmus to the extra offenders beyond the named 19):

- TestTerminalTeardownIsBoundedBestEffort: bind to the EXISTING teardown-grade drive (#285): TestTerminalTeardownGradePassesOnMarkerEmission + TestTerminalTeardownGradeFailsWhenMarkerNeverEmitted (mutation-controlled expectTerminalTeardownGrade) + the live-e2e run. Not a no-drive claim.
- TestAwaitingCompletionStillBansPreCompletionTeamDelete: the pre-completion-TeamDelete ban has NO dedicated drive (distinct from the terminal-teardown HANG the #285 grade + TestSonnetTeamDeleteHangReplay cover). It is exercised IMPLICITLY by every live team scenario (a premature teardown breaks the run) but has no dedicated mutation-controlled assertion. Marked OWED and flagged to team-lead for a follow-up task, not silently capped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…3e mechanics-covered framing

Per team-lead endorsement:

- The prose-only dev-hygiene lints (TestNoDevLeakageInUniversalCore, TestWorktreeIsolationClauseSurvives, TestRuntimeAdaptersUseNeutralLocationVocabulary, TestDevDisciplinesSurviveInDevHomes, TestNoAuditTrailExposition) are now declared as TEXT-HYGIENE lints (a property of the text), explicitly NOT behavioral claims and with NO behavioral-oracle pointer, distinct from the Bucket-A demotions. No forced re-bind (no genuine independent source exists; theater avoided).
- The four halt/sync/journey lints: behavioral-issuance rides task ev3e's halt drive; the sync/journey MECHANICS are noted as already oracle-covered by the named command-level tests (state_sync_test.go, build_statecommit_test.go, state_init_test.go / state_new_test.go) per ev3e's ideation, which folded the sync/journey residual into the halt scenario.

All cited oracles verified green. go test ./... → 1125 passed; both AC-3 sweeps GREEN.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ery + split-.md, mutation-controlled

Validation Cycle 1 findings 2-4 (integration half). The integration sweep's reader-discovery missed two reader shapes a planted tautology could hide behind:

- path-arg reader (readSkill(t,root,rel)): the helper os.ReadFiles a value built from its own string parameter; the .md literal lives in the CALLER. Added readsParamPath: parameter-flow detection into the read argument.
- WalkDir collector (shippedSkillText): walks a tree returning .md paths the caller reads+matches. Added walksForMarkdown.
- split-.md suffix (name + "." + "md"): added constStringConcat to reconstruct constant string concatenations before .md detection, in both the reader and the direct-read path.

The four occupant tests the extended discovery surfaced (TestNoPluginPrivateStatusPathInContracts, TestNoPluginPrivateStatusPathInUserSkills, TestShippedSurfaceHasNoHiddenMachineDependency, TestPortabilityCheckDiscriminatesHostSpecific) are honest structural-absence / portability lints, marked markNonAC naming their behavioral coverage (launcher smoke seam) or pure-portability disposition. Also classified the two TestPiFirstOfficerRuntime* presence checks the rebase onto origin/next surfaced: markNonAC naming the Pi live runner.

New planted control TestSweepDetectsEvasionShapes drives each evasion shape (path-arg, WalkDir, split-.md, multi-hop transitive) and asserts the sweep REDs then GREENs once declared. Each mechanism independently mutation-verified to RED when removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
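The split-.md evasion is easiest to see in code. The sketch below is not the PR's `constStringConcat`, whose body is not shown here; it is a hypothetical `trailingLiteral` helper that folds the rightmost run of string literals in a `+` chain, which is enough to recover the `.md` suffix from `name + "." + "md"` even though `name` is not constant.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// trailingLiteral returns the constant string suffix of e, and whether e is
// built entirely from string literals.
func trailingLiteral(e ast.Expr) (suffix string, whole bool) {
	switch v := e.(type) {
	case *ast.BasicLit:
		if v.Kind == token.STRING {
			if s, err := strconv.Unquote(v.Value); err == nil {
				return s, true
			}
		}
	case *ast.ParenExpr:
		return trailingLiteral(v.X)
	case *ast.BinaryExpr:
		if v.Op == token.ADD {
			right, rightWhole := trailingLiteral(v.Y)
			if !rightWhole {
				// A non-constant piece on the right ends the constant suffix.
				return right, false
			}
			left, leftWhole := trailingLiteral(v.X)
			return left + right, leftWhole
		}
	}
	return "", false
}

// endsInMarkdown reports whether the Go expression in src has a constant
// suffix ending in ".md".
func endsInMarkdown(src string) (bool, error) {
	expr, err := parser.ParseExpr(src)
	if err != nil {
		return false, err
	}
	suffix, _ := trailingLiteral(expr)
	return strings.HasSuffix(suffix, ".md"), nil
}

func main() {
	for _, src := range []string{
		`name + "." + "md"`,
		`dir + "/SKILL" + ".m" + "d"`,
		`"skills/" + name + ".json"`,
	} {
		ok, err := endsInMarkdown(src)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-30s %v\n", src, ok)
	}
}
```

A detector that only looks for a single `.md` literal would miss the first two expressions; folding the literals before testing the suffix is the step that closes that gap.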
…th-arg/WalkDir/split-.md discovery, mutation-controlled Validation Cycle 1 findings 1, 3 (HN half), 4 (HN half). The HN sweep had NO transitive reader discovery (the integration sweep already did) — a tautology one hop down behind a wrapper that calls the named reader readSkill left it GREEN. It also missed the path-arg/WalkDir reader shapes and the split-.md suffix evasion. Refactored sweepHostneutralityTautologies to the two-pass fixpoint design: seed the named readers, grow to a fixpoint over direct-literal / param-path / WalkDir readers AND transitive callers of a known reader. Added readsParamPath (param-flow into os.ReadFile/os.Open), walksForMarkdown, and constStringConcat (rejoins a split .md suffix before detection). Code-scanning via parser.ParseFile (scanFile) is NOT a ReadFile/Open read, so the go/parser host-neutrality invariants stay unflagged (39 package tests green, no false positives). TestHostneutralitySweepDetectsAnUndeclaredTautology gains the multi-hop case (finding 1's explicit ask). New TestHostneutralitySweepDetectsEvasionShapes drives multi-hop / path-arg / WalkDir / split-.md and asserts RED-then-GREEN. Each mechanism independently mutation-verified to RED when removed — including the transitive fixpoint itself (finding 4). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… reader axis classes

Cycle-2 validation + FO detached audit: the sweep rested on three allow-lists, each admitting a one-line evasion (M1 match idioms: strings.Index/ContainsAny/EqualFold/bytes.*/regexp.Match/Split; M2 path-build: strings.Join not just +; M3 flow: struct-field/method-receiver, not just string params). Whack-a-mole; close the CLASS per both reviewers' endorsed direction.

Match axis: drop the matchFuncs enumeration. The sweep now keys on the READ: a test that ingests instruction-file content MUST declare markNonAC/markCodeBoundInvariant regardless of how it inspects the bytes. Zero new markers needed: every shipped instruction-content reader already declares (the manifest/.json + docs/dev recipe readers are not instruction-surface and stay unflagged).

Reader axis: detect the read by TAINT (readsInstructionContent). Any string derived from an instruction-file path (built by +/strings.Join/filepath.Join/fmt.Sprintf, flowed through a param, a package-wide struct field (instructionTaintedFields), a method receiver, or a local) flowing into any read sink (ReadFile/Open/ReadAll/bufio) is an ingest; plus WalkDir-collected .md. isInstructionPathLiteral is the positive instruction-surface predicate (skill-tree/contract segment, not .json, not docs/dev recipes; resolves Cycle-2 P1 by scoping to the shipped surface).

TestSweepDetectsEvasionShapes gains M1 (strings.Index + regexp.Regexp.Match([]byte)), M2 (strings.Join .md path), M3 (struct-field + method flow) RED-then-GREEN, alongside the retained 4 cycle-1/2 shapes. Each mechanism independently mutation-verified to RED when removed. Full integration package green (60), vet clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
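To make the match-axis change concrete, the following sketch contrasts an enumerated match rule with a rule keyed on the read alone. It is a simplified illustration, not the sweep's implementation: both rules here look only at package-qualified calls in a hypothetical sample test, the enumerated list is deliberately short, and marker declarations are ignored.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// qualifiedCalls collects "pkg.Func" for every selector call found in src.
func qualifiedCalls(src string) (map[string]bool, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "sample_test.go", src, 0)
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
			if pkg, ok := sel.X.(*ast.Ident); ok {
				seen[pkg.Name+"."+sel.Sel.Name] = true
			}
		}
		return true
	})
	return seen, nil
}

// enumeratedRule flags a read only when it is paired with a listed match
// function. Any inspection idiom missing from the list slips through.
func enumeratedRule(seen map[string]bool) bool {
	return seen["os.ReadFile"] && (seen["strings.Contains"] || seen["regexp.MatchString"])
}

// readKeyedRule flags the read itself, however the bytes are inspected.
func readKeyedRule(seen map[string]bool) bool {
	return seen["os.ReadFile"]
}

// evader is hypothetical test source that matches with strings.Index.
const evader = `package x

import ("os"; "strings"; "testing")

func TestEvades(t *testing.T) {
	b, _ := os.ReadFile("skills/demo/SKILL.md")
	if strings.Index(string(b), "gate") < 0 { t.Fatal("missing") }
}
`

func main() {
	seen, err := qualifiedCalls(evader)
	if err != nil {
		panic(err)
	}
	fmt.Println("enumerated rule flags it:", enumeratedRule(seen))
	fmt.Println("read-keyed rule flags it:", readKeyedRule(seen))
}
```

Under these assumptions the enumerated rule reports false for the `strings.Index` shape while the read-keyed rule reports true, which is the M1 evasion class the commit closes by dropping the enumeration.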
…h + reader axis classes

Port the integration sweep's Cycle-3 positive/taint redesign to the HN sweep so both packages close the same evasion classes (M1 match idioms, M2 strings.Join path-build, M3 struct-field/method-receiver flow).

Match axis: drop the matchFuncs enumeration; key on the READ: a test that ingests instruction-file content MUST declare regardless of inspection idiom.

Reader axis: detect by TAINT (readsInstructionContent) over instructionPathIdents + .md skill-tree segments, built by any idiom, flowed through param/struct-field/method/local, into any read sink (ReadFile/Open/ReadAll/bufio) or WalkDir-collected .md. scanFile's parser.ParseFile over .go source is not a content read sink, so the go/parser code invariants stay unflagged (40 package tests green, no false positives).

TestHostneutralitySweepDetectsEvasionShapes gains M1 (strings.Index + regexp.Regexp.Match([]byte)), M2 (strings.Join .md path), M3 (struct-field + method flow). Each mechanism independently mutation-verified to RED when removed; the M1 positive rule proven load-bearing by re-introducing an enumerated match gate and observing the strings.Index shape evade again. Removed the now-dead constStringConcat (-only concat); the segment-taint predicate subsumes it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-only)

Cycle-3 captain Option C: the sweep doc comments overclaimed that the guard catches a tautology regardless of path-construction/flow shape through ANY indirection. The cycle-3 validation + detached audit falsified that on the READER axis (M-D: []string/...string-param + range/slice-element flow evades). Corrected the wording in both nonac_marker_test.go files to the real bound:

- MATCH axis: closed/universal/load-bearing (keying on the read subsumes every inspection idiom), kept.
- READER axis: detects in-package reads of recognized instruction paths via the covered flow shapes (bare-string param, :=/= local, struct field, method receiver, closure; path built by +/strings.Join/filepath.Join/fmt.Sprintf).
- KNOWN out-of-scope, tracked in the forked follow-up sweep-guard-reader-axis-invert (id 4qnn7dbzkyh9qv65t618vtxy), audit-backstopped: M-A unrecognized surfaces (AGENTS.md/mods), M-B cross-package, M-C package-var-from-another-file, M-D []string/range flow.

Also corrected the stale 'calls a reader AND a match' map comments (the rule now keys on the READ alone). NO detection logic changed: the non-comment +/- diff is empty for both files; go vet clean; go test ./... 1131 green (unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
clkao added a commit that referenced this pull request on Jun 5, 2026
…e+timeout saga, API-overload reckoning, 0.19.6 slate filed
Tests matching substrings over instruction files the model itself wrote proved
nothing — a meaning-inverting paraphrase still passed. This remediates 56 such
tests and adds a standing guard that enforces the proof-policy.
What changed
Evidence
go test ./... → 1136 passed; both AC-3 sweeps + all evasion controls green.