fix: implement audit findings from consolidated review #3935
piotr-roslaniec wants to merge 17 commits into fix/review-findings from …
…hange TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys now asserts success (resilient loading), not failure. Rename to reflect actual behavior.
Direct == comparison is correct today since errJobNotFound is never wrapped, but errors.Is is more resilient to future wrapping changes.
The auth bypass checked only the path, allowing any HTTP method to skip bearer auth on /healthz. Restrict to GET to match the registered handler and prevent unintended bypass on other methods.
The auth token is visible in /proc/PID/cmdline when passed as a CLI flag. Add documentation recommending environment variables or config files for non-loopback deployments.
Operators previously saw only individual warnings per corrupt file but had no summary of total loaded vs skipped. Add a summary log line at the end of load() for operational visibility.
When load replaces a job during route-key deduplication, the superseded job's entry remained in byRequestID, leaking stale data in the secondary index.
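To illustrate the leak and its fix, here is a simplified sketch. The job and store types are stand-ins and the index method is hypothetical; only the byRequestID and byRouteKey names are taken from this conversation.

```go
package main

import "fmt"

// job and store are simplified stand-ins, not the real types.
type job struct {
	RequestID string
	RouteKey  string
}

type store struct {
	byRequestID map[string]*job
	byRouteKey  map[string]*job
}

// index installs j as the winner for its route key and drops the superseded
// job from the secondary index so no stale entry is left behind.
func (s *store) index(j *job) {
	if old, ok := s.byRouteKey[j.RouteKey]; ok && old.RequestID != j.RequestID {
		delete(s.byRequestID, old.RequestID)
	}
	s.byRouteKey[j.RouteKey] = j
	s.byRequestID[j.RequestID] = j
}

func main() {
	s := &store{byRequestID: map[string]*job{}, byRouteKey: map[string]*job{}}
	s.index(&job{RequestID: "a", RouteKey: "route-1"})
	s.index(&job{RequestID: "b", RouteKey: "route-1"})
	fmt.Println(len(s.byRequestID)) // 1
}
```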
…s unparseable When two duplicate-route-key jobs both have unparseable timestamps, the winner previously depended on non-deterministic file iteration order. Use lexicographic RequestID comparison as a stable tiebreaker.
…dedupe When load() skips a malformed job file, GetByRouteRequest can no longer find that job. A retry then silently creates a duplicate signing job, breaking node-local idempotency. Fix by partially parsing skipped files to extract route keys and marking them as poisoned. GetByRouteRequest returns an error for poisoned keys, forcing the caller to investigate rather than creating a duplicate.
The Submit method had 5 separate mutex.Unlock() call sites, making the locking pattern fragile for a security-critical signing path. Extract the dedup-check-and-create logic into a helper that uses defer Unlock, reducing the main Submit method to two clean lock scopes.
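The shape of such a helper could be sketched as follows. The createOrDedup name comes from the PR summary; the service and job types and the signature are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// job and service are simplified stand-ins for the real types.
type job struct{ RouteKey string }

type service struct {
	mu   sync.Mutex
	jobs map[string]*job
}

// createOrDedup returns the existing job for routeKey or creates a new one.
// The single deferred Unlock covers every return path.
func (s *service) createOrDedup(routeKey string) (j *job, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if existing, ok := s.jobs[routeKey]; ok {
		return existing, false
	}
	j = &job{RouteKey: routeKey}
	s.jobs[routeKey] = j
	return j, true
}

func main() {
	s := &service{jobs: map[string]*job{}}
	_, first := s.createOrDedup("route-1")
	_, second := s.createOrDedup("route-1")
	fmt.Println(first, second) // true false
}
```

With the lock scope confined to the helper, an early return added later cannot skip the unlock.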
…gnals Two fixes:
- Call cancelService() when net.Listen fails after context creation to prevent a context leak on initialization error.
- Add SIGINT/SIGTERM signal handling so in-flight signing operations are cancelled promptly on any shutdown path, not only when the parent context is cancelled.
… requirements POSIX flock is advisory and Linux-specific. Document that the data directory must use local or block-level storage with single-writer access, not network filesystems.
Three improvements for operator visibility during registry outages:
- Sentinel errors now mention that the wallet registry may be unavailable, helping operators distinguish registry failures from genuinely missing data.
- GetWallet log elevated from Warn to Error with actionable message explaining that signer approval operations will fail.
- WalletChainData godoc documents zero-value semantics for registry-sourced fields.
Switch from encoding/json.Marshal to canonicaljson.Marshal for the content-addressed handoff bundle ID. Both produce identical output for current payloads (alphabetical key ordering, no HTML content), but canonicaljson explicitly disables HTML escaping, making the serialization contract clearer for non-Go consumers.
… timestamps The earlier tiebreaker fix incorrectly entered the "both unparseable" branch when only the candidate had an invalid timestamp. Parse both timestamps explicitly to distinguish three cases: only candidate bad (keep existing), only existing bad (replace), both bad (lexicographic tiebreak).
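The three cases can be sketched as a single decision function. Several details here are assumptions: the shouldReplace name, the RFC 3339 timestamp format, the "newer wins" rule for two valid timestamps, and the direction of the lexicographic tiebreak (greater RequestID wins).

```go
package main

import (
	"fmt"
	"time"
)

// job is a simplified stand-in carrying only the fields the decision needs.
type job struct {
	RequestID string
	UpdatedAt string
}

// shouldReplace reports whether candidate should supersede existing for the
// same route key.
func shouldReplace(existing, candidate job) bool {
	existingAt, existingErr := time.Parse(time.RFC3339Nano, existing.UpdatedAt)
	candidateAt, candidateErr := time.Parse(time.RFC3339Nano, candidate.UpdatedAt)

	switch {
	case existingErr != nil && candidateErr != nil:
		// Both bad: deterministic lexicographic tiebreak on RequestID.
		return candidate.RequestID > existing.RequestID
	case candidateErr != nil:
		return false // only candidate bad: keep existing
	case existingErr != nil:
		return true // only existing bad: replace
	default:
		return candidateAt.After(existingAt) // both valid: newer wins (assumed)
	}
}

func main() {
	valid := job{RequestID: "a", UpdatedAt: "2024-01-02T03:04:05Z"}
	bad := job{RequestID: "b", UpdatedAt: "not-a-time"}
	fmt.Println(shouldReplace(valid, bad)) // false
	fmt.Println(shouldReplace(bad, valid)) // true
}
```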
0495130 to 5ef7617
… context The test injected a value into the HTTP request context and expected it to be visible in the engine, but the submit handler deliberately derives its context from the service context (not the request context) to survive HTTP disconnects. Fix the test to inject the value into the service context, which matches the actual design contract.
mswilkison left a comment
Findings
- High: pkg/covenantsigner/server.go:136 now installs its own signal.NotifyContext for SIGINT/SIGTERM, but the normal node start path still builds the process root context from context.Background() in cmd/start.go:145 and then waits forever on <-ctx.Done() in cmd/start.go:276. With covenant signer enabled, the first SIGINT/SIGTERM will now be consumed by the signer's internal handler, which only cancels serviceCtx and shuts down the signer HTTP server. The rest of the node keeps running instead of exiting on the first signal. Before this change, the process terminated normally. This signal handling needs to live at the top level and cancel the shared root context, rather than being added inside covenantsigner.Initialize.
- Medium: The new poisoned-route safeguard can mask a valid surviving job. In pkg/covenantsigner/store.go:240, every malformed file with a recoverable route key marks that route as poisoned, and pkg/covenantsigner/store.go:371 returns errPoisonedRouteKey before checking byRouteKey. That means if startup sees both a valid job and a malformed stale sibling for the same route key, retries by routeRequestId stop deduping to the valid job and instead fail with "manual recovery required." A realistic path to that state already exists when a replacement leaves the old file behind after delete failure. Poisoning should only apply when no valid winner for that route survives load, or be cleared once a valid job is loaded for the same key.
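The top-level approach suggested in the High finding might be sketched as below. This is not the code in cmd/start.go; newRootContext and runSigner are hypothetical, and the sketch calls stop() to simulate the cancellation a real SIGINT/SIGTERM would trigger.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// newRootContext returns a process-wide context that is cancelled on the
// first SIGINT or SIGTERM. Hypothetical helper for illustration.
func newRootContext() (context.Context, context.CancelFunc) {
	return signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
}

// runSigner stands in for a subsystem: it derives its service context from
// the shared parent and stops when the parent is cancelled.
func runSigner(parent context.Context, done chan<- struct{}) {
	serviceCtx, cancel := context.WithCancel(parent)
	defer cancel()
	<-serviceCtx.Done()
	close(done)
}

func main() {
	ctx, stop := newRootContext()
	defer stop()

	done := make(chan struct{})
	go runSigner(ctx, done)

	stop() // simulate shutdown; in a real process a signal cancels ctx
	<-done
	fmt.Println("signer stopped:", ctx.Err())
}
```

Because every subsystem derives from the one root context, a single signal shuts down the whole node rather than only the signer.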
…ess signals The signal.NotifyContext for SIGINT/SIGTERM inside Initialize consumed the first signal, cancelling only the signer's service context while the rest of the node kept running. The process root context in cmd/start.go is context.Background() and relies on the OS default signal handler to terminate. Revert to the original parent-context-only shutdown path.
…me key A malformed file processed before its valid sibling would poison the route key, then the valid job would load into byRouteKey but remain inaccessible via GetByRouteRequest because the poison check ran first. Clear the poison entry when a valid job is successfully indexed for that route key.
Summary
Implements 15 fixes from the consolidated audit of PR #3933 (covenant signer + fault isolation). These address findings from 5 review passes (initial triage, 2 multi-agent lens reviews, PR reviewer feedback, validation review).
Fixes included
P1 - High
- canonicaljson.Marshal for handoff payload hash (cross-language determinism)
- createOrDedup helper (5 unlock points -> defer)
P2 - Medium
- errors.Is() for errJobNotFound comparison
Follow-up issues
Pre-mainnet (must address before mainnet deployment)
A1: Default requireApprovalTrustRoots to true in production
- pkg/covenantsigner/server.go:70-75
- … signerApprovalVerifier is nil, the service accepts requests without cryptographic signer approval verification. The requireApprovalTrustRoots config flag exists as mitigation but defaults to false.
- … requireApprovalTrustRoots to true in production configs, or fail startup when port is non-zero and no verifier is available.
A8: Bitcoin regtest integration tests for witness scripts
- pkg/tbtc/covenant_signer.go (design)
- … cleanstack, signature encoding quirks, or opcode limits. A script passing unit tests could be rejected by the Bitcoin network, locking funds.
- … btcd or Bitcoin Core regtest integration tests that compile witness scripts, sign transactions, and broadcast on a local regtest node.
Post-merge (code quality and maintainability)
A9: Split monolithic validation.go (1922 lines)
- pkg/covenantsigner/validation.go
- … validation_quote.go, validation_approval.go, validation_template.go.
A10: Deduplicate UTXO resolution logic
- pkg/tbtc/covenant_signer.go:512-628
- resolveSelfV1ActiveUtxo and resolveQcV1ActiveUtxo are near-identical (~60 lines each).
- … resolveActiveUtxo(request, witnessScript, templateName) helper.
A11: Deduplicate trust root normalization
- pkg/covenantsigner/validation.go:628-724
- normalizeDepositorTrustRoots and normalizeCustodianTrustRoots are structurally identical.
A13: strictUnmarshal trailing token rejection
- pkg/covenantsigner/validation.go:71-75
- … decodeJSON which rejects trailing JSON tokens.
A14: Document Engine interface contract
- pkg/covenantsigner/engine.go
A26: Rate limiting on submit endpoint
- pkg/covenantsigner/server.go:345
- … golang.org/x/time/rate, ~5 req/min).
A28: Replace reflect.DeepEqual for Handoff comparison
- pkg/covenantsigner/service.go:201
- sameJobRevision uses reflect.DeepEqual for map[string]any Handoff comparison. Correct but potentially slow at scale.
- … HandoffData struct with an Equal method. Acceptable as-is for single-digit RPM throughput.
A30: Document qcV1SignerHandoff schema
- pkg/tbtc/covenant_signer.go:35
- Kind constant serves as a version identifier but the downstream contract for handoff artifacts is implicit.
Type-safety follow-up (from A3)
A3 (type-level): Add HasRegistryData to WalletChainData
- pkg/tbtc/chain.go:418
- … [32]byte for MembersIDsHash means "unavailable" but this is implicit. Downstream guards work but future code could silently use the zero value.
- … HasRegistryData bool field or use *[32]byte for MembersIDsHash to make unavailability explicit at the type level.
Pre-existing test failure
TestSubmitHandlerPreservesContextValues fails on the base branch (fix/review-findings). Confirmed not introduced by these changes -- the test expects request context values to propagate through the service context detachment, which is not the current design.
Test plan
- go build ./pkg/covenantsigner/... and go build ./pkg/tbtc/... compile cleanly
- … TestSubmitHandlerPreservesContextValues)
- … TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering) still passes after canonicaljson switch