fix: implement audit findings from consolidated review #3935

Open
piotr-roslaniec wants to merge 17 commits into fix/review-findings from fix/audit-findings

@piotr-roslaniec piotr-roslaniec commented Apr 9, 2026

Summary

Implements 15 fixes from the consolidated audit of PR #3933 (covenant signer + fault isolation). These address findings from 5 review passes (initial triage, 2 multi-agent lens reviews, PR reviewer feedback, validation review).

Fixes included

P1 - High

  • A2: Use canonicaljson.Marshal for handoff payload hash (cross-language determinism)
  • A3/A23: Improve wallet registry error messages, elevate log to Error, add WalletChainData godoc
  • A4: Extract Submit critical section into createOrDedup helper (5 unlock points -> defer)
  • A5: Deterministic tiebreaker when both timestamps unparseable (lexicographic RequestID)
  • A6: Restrict healthz auth bypass to GET method only
  • A22: Poison route keys from skipped jobs to preserve dedupe semantics

P2 - Medium

  • A7: Document AuthToken CLI flag /proc visibility risk
  • A12/A27: Cancel service context on init failure + add SIGINT/SIGTERM signal handling
  • A15: Document advisory flock limitations and storage requirements
  • A16: Add aggregate load summary with skip count
  • A24: Remove superseded job from byRequestID on dedup replacement
  • A25: Rename misleading test after resilient loading change
  • A29: Use errors.Is() for errJobNotFound comparison

Follow-up issues

Pre-mainnet (must address before mainnet deployment)

A1: Default requireApprovalTrustRoots to true in production

  • File: pkg/covenantsigner/server.go:70-75
  • Risk: When signerApprovalVerifier is nil, the service accepts requests without cryptographic signer approval verification. The requireApprovalTrustRoots config flag exists as mitigation but defaults to false.
  • Action: Default requireApprovalTrustRoots to true in production configs, or fail startup when port is non-zero and no verifier is available.
  • Effort: Low

A8: Bitcoin regtest integration tests for witness scripts

  • File: pkg/tbtc/covenant_signer.go (design)
  • Risk: Go unit tests cannot enforce cleanstack, signature encoding quirks, or opcode limits. A script passing unit tests could be rejected by the Bitcoin network, locking funds.
  • Action: Introduce btcd or Bitcoin Core regtest integration tests that compile witness scripts, sign transactions, and broadcast on a local regtest node.
  • Effort: High

Post-merge (code quality and maintainability)

A9: Split monolithic validation.go (1922 lines)

  • File: pkg/covenantsigner/validation.go
  • Issue: Mixes input parsing, crypto verification, normalization, and commitment computation in a single file.
  • Action: Split into focused files: validation_quote.go, validation_approval.go, validation_template.go.
  • Effort: High

A10: Deduplicate UTXO resolution logic

  • File: pkg/tbtc/covenant_signer.go:512-628
  • Issue: resolveSelfV1ActiveUtxo and resolveQcV1ActiveUtxo are near-identical (~60 lines each).
  • Action: Extract shared resolveActiveUtxo(request, witnessScript, templateName) helper.
  • Effort: Medium

A11: Deduplicate trust root normalization

  • File: pkg/covenantsigner/validation.go:628-724
  • Issue: normalizeDepositorTrustRoots and normalizeCustodianTrustRoots are structurally identical.
  • Action: Unify via generics or shared inner function.
  • Effort: Low

A13: strictUnmarshal trailing token rejection

  • File: pkg/covenantsigner/validation.go:71-75
  • Issue: Inconsistent with server-level decodeJSON which rejects trailing JSON tokens.
  • Action: Add trailing-token rejection or document single-document context constraint.
  • Effort: Low
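
One way the trailing-token rejection proposed for A13 could look is sketched below. This is an illustration under assumptions, not the PR's code: `strictDecode` and the `quote` type are hypothetical names, and whether the real `strictUnmarshal` also rejects unknown fields is assumed here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// quote is a hypothetical payload type used only for this illustration.
type quote struct {
	Amount int `json:"amount"`
}

// strictDecode decodes exactly one JSON document into v. It rejects unknown
// fields and any non-whitespace tokens that follow the first document.
func strictDecode(data []byte, v any) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(v); err != nil {
		return err
	}
	// A second Decode must report io.EOF; any other outcome means trailing data.
	var extra json.RawMessage
	if err := dec.Decode(&extra); !errors.Is(err, io.EOF) {
		return errors.New("unexpected trailing JSON tokens")
	}
	return nil
}

func main() {
	var q quote
	fmt.Println(strictDecode([]byte(`{"amount":1}`), &q))
	fmt.Println(strictDecode([]byte(`{"amount":1} {"amount":2}`), &q))
}
```

Trailing whitespace is still accepted, because the second `Decode` reaches end of input without finding another token.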

A14: Document Engine interface contract

  • File: pkg/covenantsigner/engine.go
  • Issue: No documentation of what nil Transition means (OnSubmit: default to Pending; OnPoll: no-op).
  • Action: Add godoc specifying the contract and allowed return states.
  • Effort: Low

A26: Rate limiting on submit endpoint

  • File: pkg/covenantsigner/server.go:345
  • Issue: Authenticated callers can spawn unbounded concurrent 5-minute signing operations. Mitigated by auth and routeRequestID dedup but no per-client throttle exists.
  • Action: Add token-bucket rate limiting per auth token (e.g., golang.org/x/time/rate, ~5 req/min).
  • Effort: Medium

A28: Replace reflect.DeepEqual for Handoff comparison

  • File: pkg/covenantsigner/service.go:201
  • Issue: sameJobRevision uses reflect.DeepEqual for map[string]any Handoff comparison. Correct but potentially slow at scale.
  • Action: Define a concrete HandoffData struct with an Equal method. Acceptable as-is for single-digit RPM throughput.
  • Effort: Low

A30: Document qcV1SignerHandoff schema

  • File: pkg/tbtc/covenant_signer.go:35
  • Issue: The Kind constant serves as a version identifier but the downstream contract for handoff artifacts is implicit.
  • Action: Add doc comment noting this struct defines the downstream API schema.
  • Effort: Low

Type-safety follow-up (from A3)

A3 (type-level): Add HasRegistryData to WalletChainData

  • File: pkg/tbtc/chain.go:418
  • Issue: Zero [32]byte for MembersIDsHash means "unavailable" but this is implicit. Downstream guards work but future code could silently use the zero value.
  • Action: Add HasRegistryData bool field or use *[32]byte for MembersIDsHash to make unavailability explicit at the type level.
  • Effort: Medium (touches many callers)

Pre-existing test failure

TestSubmitHandlerPreservesContextValues fails on the base branch (fix/review-findings). Confirmed not introduced by these changes -- the test expects request context values to propagate through the service context detachment, which is not the current design.

Test plan

  • go build ./pkg/covenantsigner/... and go build ./pkg/tbtc/... compile cleanly
  • All covenantsigner tests pass (except pre-existing TestSubmitHandlerPreservesContextValues)
  • All relevant tbtc tests pass (GetWallet fault isolation, signer approval, payload hash)
  • Pinned hash test (TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering) still passes after canonicaljson switch
  • CI green

…hange

TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys now asserts
success (resilient loading), not failure. Rename to reflect actual
behavior.

Direct == comparison is correct today since errJobNotFound is never
wrapped, but errors.Is is more resilient to future wrapping changes.

The auth bypass checked only the path, allowing any HTTP method to
skip bearer auth on /healthz. Restrict to GET to match the registered
handler and prevent unintended bypass on other methods.
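
The method-restricted bypass can be sketched as a small middleware. This is an assumption-laden illustration, not the server's actual code: `withBearerAuth`, `demoHandler`, and `statusFor` are hypothetical helpers, and the constant-time token comparison is a choice made for this sketch.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withBearerAuth requires a bearer token on every request except GET /healthz.
func withBearerAuth(token string, next http.Handler) http.Handler {
	want := []byte("Bearer " + token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Only the exact method and path pair skips authentication.
		if r.Method == http.MethodGet && r.URL.Path == "/healthz" {
			next.ServeHTTP(w, r)
			return
		}
		got := []byte(r.Header.Get("Authorization"))
		if subtle.ConstantTimeCompare(got, want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demoHandler wraps a trivial 200 handler with the hypothetical token "secret".
func demoHandler() http.Handler {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return withBearerAuth("secret", ok)
}

// statusFor sends one in-memory request and returns the response status.
func statusFor(h http.Handler, method, path, authHeader string) int {
	req := httptest.NewRequest(method, path, nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := demoHandler()
	fmt.Println(statusFor(h, http.MethodGet, "/healthz", ""))
	fmt.Println(statusFor(h, http.MethodPost, "/healthz", ""))
	fmt.Println(statusFor(h, http.MethodPost, "/healthz", "Bearer secret"))
}
```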

The auth token is visible in /proc/PID/cmdline when passed as a CLI
flag. Add documentation recommending environment variables or config
files for non-loopback deployments.

Operators previously saw only individual warnings per corrupt file but
had no summary of total loaded vs skipped. Add a summary log line at
the end of load() for operational visibility.

When load replaces a job during route-key deduplication, the superseded
job's entry remained in byRequestID, leaking stale data in the
secondary index.
…s unparseable

When two duplicate-route-key jobs both have unparseable timestamps, the
winner previously depended on non-deterministic file iteration order.
Use lexicographic RequestID comparison as a stable tiebreaker.
…dedupe

When load() skips a malformed job file, GetByRouteRequest can no longer
find that job. A retry then silently creates a duplicate signing job,
breaking node-local idempotency. Fix by partially parsing skipped files
to extract route keys and marking them as poisoned. GetByRouteRequest
returns an error for poisoned keys, forcing the caller to investigate
rather than creating a duplicate.

The Submit method had 5 separate mutex.Unlock() call sites, making the
locking pattern fragile for a security-critical signing path. Extract
the dedup-check-and-create logic into a helper that uses defer Unlock,
reducing the main Submit method to two clean lock scopes.
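
The shape of such a helper can be sketched as follows. Only the `createOrDedup` name comes from the commit message; the `service` and `job` types, their fields, and the helper's signature are simplified assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// job and service are reduced, hypothetical stand-ins for the real types.
type job struct {
	RequestID string
	RouteKey  string
}

type service struct {
	mu         sync.Mutex
	byRouteKey map[string]*job
}

func newService() *service {
	return &service{byRouteKey: make(map[string]*job)}
}

// createOrDedup returns the existing job for routeKey or creates a new one.
// The single deferred Unlock covers every return path of the critical section.
func (s *service) createOrDedup(routeKey, requestID string) (*job, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if existing, ok := s.byRouteKey[routeKey]; ok {
		return existing, false
	}
	j := &job{RequestID: requestID, RouteKey: routeKey}
	s.byRouteKey[routeKey] = j
	return j, true
}

func main() {
	s := newService()
	first, created := s.createOrDedup("route-1", "req-1")
	fmt.Println(first.RequestID, created)
	second, created := s.createOrDedup("route-1", "req-2")
	fmt.Println(second.RequestID, created)
}
```

With the lock confined to the helper, an early return added later cannot leave the mutex held.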
…gnals

Two fixes:
- Call cancelService() when net.Listen fails after context creation to
  prevent a context leak on initialization error.
- Add SIGINT/SIGTERM signal handling so in-flight signing operations are
  cancelled promptly on any shutdown path, not only when the parent
  context is cancelled.
… requirements

flock(2) is advisory and is not specified by POSIX; the behavior relied
on here is Linux-specific. Document that the data
directory must use local or block-level storage with single-writer
access, not network filesystems.

Three improvements for operator visibility during registry outages:
- Sentinel errors now mention that the wallet registry may be
  unavailable, helping operators distinguish registry failures from
  genuinely missing data.
- GetWallet log elevated from Warn to Error with actionable message
  explaining that signer approval operations will fail.
- WalletChainData godoc documents zero-value semantics for registry-
  sourced fields.

Switch from encoding/json.Marshal to canonicaljson.Marshal for the
content-addressed handoff bundle ID. Both produce identical output for
current payloads (alphabetical key ordering, no HTML content), but
canonicaljson explicitly disables HTML escaping, making the
serialization contract clearer for non-Go consumers.
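
The difference in HTML escaping can be shown with the standard library alone. The sketch below does not use the canonicaljson package; it approximates the described behavior (sorted map keys, no HTML escaping) with `encoding/json`, and the choice of SHA-256 and the function names are assumptions.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// payloadHash serializes payload with sorted map keys and HTML escaping
// disabled, then returns the hex-encoded SHA-256 of the resulting bytes.
func payloadHash(payload map[string]any) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(payload); err != nil {
		return "", err
	}
	// Encode appends a newline; drop it so only the document is hashed.
	sum := sha256.Sum256(bytes.TrimSuffix(buf.Bytes(), []byte("\n")))
	return hex.EncodeToString(sum[:]), nil
}

// defaultHash hashes json.Marshal output, which escapes <, > and & by default.
func defaultHash(payload map[string]any) (string, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	plain := map[string]any{"kind": "example", "id": "1"}
	a, _ := payloadHash(plain)
	b, _ := defaultHash(plain)
	fmt.Println(a == b) // identical when no HTML-sensitive characters are present

	html := map[string]any{"note": "a<b"}
	c, _ := payloadHash(html)
	d, _ := defaultHash(html)
	fmt.Println(c == d) // the hashes diverge once escaping differs
}
```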
… timestamps

The earlier tiebreaker fix incorrectly entered the "both unparseable"
branch when only the candidate had an invalid timestamp. Parse both
timestamps explicitly to distinguish three cases: only candidate bad
(keep existing), only existing bad (replace), both bad (lexicographic
tiebreak).
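
The three cases can be written as a single decision function. This is a sketch under assumptions: `candidateWins` is a hypothetical name, timestamps are assumed to be RFC 3339 strings, equal valid timestamps are assumed to keep the existing job, and the tiebreak direction (greater RequestID wins) is chosen arbitrarily for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// candidateWins reports whether the candidate job should replace the existing
// job that shares its route key.
func candidateWins(existingID, existingUpdatedAt, candidateID, candidateUpdatedAt string) bool {
	existingTime, existingErr := time.Parse(time.RFC3339, existingUpdatedAt)
	candidateTime, candidateErr := time.Parse(time.RFC3339, candidateUpdatedAt)

	switch {
	case existingErr == nil && candidateErr == nil:
		// Both parse: the newer job wins.
		return candidateTime.After(existingTime)
	case existingErr == nil:
		// Only the candidate is unparseable: keep the existing job.
		return false
	case candidateErr == nil:
		// Only the existing job is unparseable: replace it.
		return true
	default:
		// Both unparseable: stable lexicographic tiebreak on RequestID.
		return candidateID > existingID
	}
}

func main() {
	fmt.Println(candidateWins("a", "2026-04-09T10:00:00Z", "b", "2026-04-09T11:00:00Z"))
	fmt.Println(candidateWins("a", "2026-04-09T10:00:00Z", "b", "garbage"))
	fmt.Println(candidateWins("a", "garbage", "b", "2026-04-09T11:00:00Z"))
	fmt.Println(candidateWins("a", "garbage", "b", "garbage"))
}
```

Because the result depends only on the two jobs' fields, the winner no longer varies with the order in which files are read.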
… context

The test injected a value into the HTTP request context and expected it
to be visible in the engine, but the submit handler deliberately
derives its context from the service context (not the request context)
to survive HTTP disconnects. Fix the test to inject the value into the
service context, which matches the actual design contract.

@mswilkison mswilkison left a comment

Findings

  • High: pkg/covenantsigner/server.go:136 now installs its own signal.NotifyContext for SIGINT/SIGTERM, but the normal node start path still builds the process root context from context.Background() in cmd/start.go:145 and then waits forever on <-ctx.Done() in cmd/start.go:276. With covenant signer enabled, the first SIGINT/SIGTERM will now be consumed by the signer's internal handler, which only cancels serviceCtx and shuts down the signer HTTP server. The rest of the node keeps running instead of exiting on the first signal. Before this change, the process terminated normally. This signal handling needs to live at the top level and cancel the shared root context, rather than being added inside covenantsigner.Initialize.

  • Medium: The new poisoned-route safeguard can mask a valid surviving job. In pkg/covenantsigner/store.go:240, every malformed file with a recoverable route key marks that route as poisoned, and pkg/covenantsigner/store.go:371 returns errPoisonedRouteKey before checking byRouteKey. That means if startup sees both a valid job and a malformed stale sibling for the same route key, retries by routeRequestId stop deduping to the valid job and instead fail with "manual recovery required." A realistic path to that state already exists when a replacement leaves the old file behind after delete failure. Poisoning should only apply when no valid winner for that route survives load, or be cleared once a valid job is loaded for the same key.

…ess signals

The signal.NotifyContext for SIGINT/SIGTERM inside Initialize consumed
the first signal, cancelling only the signer's service context while
the rest of the node kept running. The process root context in
cmd/start.go is context.Background() and relies on the OS default
signal handler to terminate. Revert to the original parent-context-only
shutdown path.
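
For reference, the top-level alternative suggested in the review (signal handling that cancels a shared root context) could look roughly like the sketch below. It is not what this commit does, and `newRootContext` and `startSigner` are hypothetical stand-ins for the code in cmd/start.go and covenantsigner.Initialize.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// newRootContext returns a context that is cancelled on the first SIGINT or
// SIGTERM, or when the returned stop function is called.
func newRootContext() (context.Context, context.CancelFunc) {
	return signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
}

// startSigner is a hypothetical stand-in for a component that derives its
// service context from the parent and installs no signal handler of its own.
func startSigner(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithCancel(parent)
}

func main() {
	rootCtx, stop := newRootContext()
	serviceCtx, cancelService := startSigner(rootCtx)
	defer cancelService()

	// Simulate shutdown without sending a real signal.
	stop()

	<-serviceCtx.Done()
	fmt.Println("signer stopped:", serviceCtx.Err())
}
```

Every component derived from the root context observes the same cancellation, so a single signal stops the whole process rather than one subsystem.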
…me key

A malformed file processed before its valid sibling would poison the
route key, then the valid job would load into byRouteKey but remain
inaccessible via GetByRouteRequest because the poison check ran first.
Clear the poison entry when a valid job is successfully indexed for that
route key.