feat(tbtc): fault isolation for wallet registry and resilient store loading #3933
lrsaturnino wants to merge 3 commits into feat/psbt-covenant-final-project-pr
…re loading GetWallet now degrades gracefully when the wallet registry is unavailable, returning Bridge-sourced fields with a zero MembersIDsHash instead of failing outright. Covenant signer store loading skips unreadable files, malformed JSON, and invalid timestamps on duplicate route keys rather than aborting the entire load.
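A minimal sketch of the best-effort registry read described above. The types, interfaces, and the HasRegistryData helper are hypothetical stand-ins rather than the actual pkg/chain/ethereum code; they only illustrate the degraded-mode contract, in which a zero MembersIDsHash means "registry data unavailable".

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// WalletChainData is a simplified, hypothetical stand-in for the real type.
type WalletChainData struct {
	State        string
	MainUtxoHash [32]byte
	// MembersIDsHash is zero when the wallet registry could not be read.
	MembersIDsHash [32]byte
}

// HasRegistryData makes the degraded-mode contract explicit (hypothetical helper).
func (w *WalletChainData) HasRegistryData() bool {
	return w.MembersIDsHash != [32]byte{}
}

// bridgeReader and registryReader are assumed interfaces for this sketch.
type bridgeReader interface {
	Wallet(id [20]byte) (state string, mainUtxoHash [32]byte, err error)
}

type registryReader interface {
	MembersIDsHash(id [20]byte) ([32]byte, error)
}

// getWallet fails only when the Bridge read fails; a registry failure is
// logged and yields Bridge-sourced fields with a zero MembersIDsHash.
func getWallet(b bridgeReader, r registryReader, id [20]byte) (*WalletChainData, error) {
	state, utxo, err := b.Wallet(id)
	if err != nil {
		return nil, fmt.Errorf("cannot get wallet from Bridge: %w", err)
	}
	data := &WalletChainData{State: state, MainUtxoHash: utxo}

	hash, err := r.MembersIDsHash(id)
	if err != nil {
		log.Printf("warning: wallet registry unavailable for wallet %x: %v", id, err)
		return data, nil
	}
	data.MembersIDsHash = hash
	return data, nil
}

// Fakes used to simulate outages.
type fakeBridge struct{ fail bool }

func (f fakeBridge) Wallet([20]byte) (string, [32]byte, error) {
	if f.fail {
		return "", [32]byte{}, errors.New("bridge unavailable")
	}
	return "Live", [32]byte{1}, nil
}

type fakeRegistry struct{ fail bool }

func (f fakeRegistry) MembersIDsHash([20]byte) ([32]byte, error) {
	if f.fail {
		return [32]byte{}, errors.New("rpc timeout")
	}
	return [32]byte{2}, nil
}

func main() {
	w, err := getWallet(fakeBridge{}, fakeRegistry{fail: true}, [20]byte{})
	fmt.Println(w.State, w.HasRegistryData(), err) // prints "Live false <nil>"
}
```

A caller that needs registry data can check HasRegistryData and report the registry outage directly instead of a generic missing-hash error.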
mswilkison
left a comment
Two substantive issues remain:
- pkg/covenantsigner/store.go
  load() now skips unreadable/malformed job files and still returns success. After a restart, that can leave the process without the persisted job that originally owned a routeRequestId, while Service.Submit still deduplicates only through store.GetByRouteRequest(...). If the skipped file belonged to an already-accepted request, a retry will create a second signing job for the same covenant request instead of failing closed. I don't think this should merge without a fail-closed safeguard for skipped persisted jobs, or an equivalent quarantine/recovery path that preserves dedupe semantics. This is not limited to the skipped > 0 && loaded == 0 case; losing even one accepted job is enough to break node-local idempotency/replay protection.
- pkg/chain/ethereum/tbtc.go
  The best-effort wallet-registry read looks safe for the current caller set, but the degraded-mode contract is implicit: a zero MembersIDsHash now means "registry data unavailable". That needs to be explicit in WalletChainData documentation at minimum, and the signer-approval path should surface the likely cause more clearly. Today the operator-facing error degrades into "wallet chain data must include members IDs hash", which obscures the underlying registry outage. Also, the new tests exercise the local mock chain, not the real Ethereum adapter path that changed.
Non-blocking:
- When load replaces one job with another for the same route key, the superseded job can remain reachable through byRequestID.
- TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys now asserts success and should be renamed.
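One possible shape for the fail-closed safeguard requested in the first substantive issue, as a rough sketch only. It assumes the store records how many persisted files it skipped during load; the store, job, submit, and errStoreDegraded names are hypothetical and do not reflect the actual Service.Submit implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errStoreDegraded signals that dedupe state may be incomplete (hypothetical).
var errStoreDegraded = errors.New(
	"persisted jobs were skipped during load; refusing to create new jobs",
)

type job struct {
	RequestID      string
	RouteRequestID string
}

// store is a simplified stand-in that remembers how many files load skipped.
type store struct {
	byRoute map[string]*job
	skipped int
}

func (s *store) getByRouteRequest(route string) (*job, bool) {
	j, ok := s.byRoute[route]
	return j, ok
}

// submit returns the existing job for a known route. For an unknown route it
// fails closed while skipped files exist, because one of them may have held
// the job that originally owned this route.
func submit(s *store, route string) (*job, error) {
	if existing, ok := s.getByRouteRequest(route); ok {
		return existing, nil
	}
	if s.skipped > 0 {
		return nil, fmt.Errorf("route %q: %w", route, errStoreDegraded)
	}
	j := &job{
		RequestID:      fmt.Sprintf("job-%d", len(s.byRoute)+1),
		RouteRequestID: route,
	}
	s.byRoute[route] = j
	return j, nil
}

func main() {
	s := &store{byRoute: map[string]*job{}, skipped: 1}
	_, err := submit(s, "route-1")
	fmt.Println(errors.Is(err, errStoreDegraded)) // prints "true"
}
```

The trade-off in this sketch is availability: a single skipped file blocks all new submissions until an operator repairs or quarantines it, which is the price of preserving node-local idempotency.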
Summary
1. GetWallet: wallet registry failure is fatal
Issue: GetWallet calls both the Bridge and the wallet registry. If the registry call fails (transient RPC outage, registry contract issue), the entire GetWallet call returns an error, even though most callers only need Bridge-sourced fields (State, timestamps, MainUtxoHash) and never touch MembersIDsHash. A transient registry outage cascades into failures for all wallet queries.
Solution: The registry call is now best-effort. On failure, GetWallet logs a warning and returns Bridge-sourced fields with a zero-valued MembersIDsHash. Downstream consumers that need registry data (e.g. signer approval certificate) already guard against the zero value.
2. Store load aborts on any single corrupt file
Issue: Store.load() iterates persisted job files and returns a hard error on any of: unreadable file (disk I/O error), malformed JSON, or unparseable timestamp when resolving duplicate route keys. A single corrupt file prevents the entire store from loading, blocking the covenant signer from starting.
Solution: Each failure mode now logs a warning and skips the offending file instead of aborting:
- descriptor.Content() error → skip with warning.
- json.Unmarshal error → skip with warning.
- When isNewerOrSameJobRevision fails, the loader checks which job has a parseable timestamp: if the candidate's timestamp is valid, it replaces the existing job (whose timestamp must be the broken one); otherwise the candidate is skipped. This preserves the best available data.
Test coverage
GetWallet fault isolation (pkg/tbtc/get_wallet_fault_isolation_test.go)
- TestGetWalletReturnsDataWhenRegistryFails: MembersIDsHash is zero, no error
- TestGetWalletReturnsFullDataWhenRegistrySucceeds: MembersIDsHash populated
- TestGetWalletBridgeFailureStillReturnsError
Test mock (pkg/tbtc/chain_test.go)
Adds a walletRegistryErrs map and setWalletRegistryErr() to localChain, enabling simulation of registry failures that return degraded data with zero MembersIDsHash.
Resilient store loading (pkg/covenantsigner/store_test.go)
- TestStoreLoadSkipsUnreadableFile
- TestStoreLoadSkipsMalformedJSON
- TestStoreLoadSkipsInvalidTimestampOnDuplicateRouteKey: "invalid-timestamp" → store loads, valid-timestamp job wins
- TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys (updated)
Test helpers (pkg/covenantsigner/covenantsigner_test.go)
- faultingDescriptor: returns injected error from Content(), simulating unreadable files.
- contentFaultingHandle: injects faulting descriptors into the ReadAll channel alongside normal ones.
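The skip-and-continue loading described under "Store load aborts on any single corrupt file" can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Store.load(): the job fields, the RFC 3339 timestamp format, the descriptor type, and the skipped counter are all hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
	"time"
)

type job struct {
	RequestID      string `json:"requestId"`
	RouteRequestID string `json:"routeRequestId"`
	UpdatedAt      string `json:"updatedAt"` // assumed RFC 3339
}

// descriptor mimics a persisted file whose content may fail to read.
type descriptor struct {
	name    string
	content []byte
	err     error
}

func (d descriptor) Content() ([]byte, error) { return d.content, d.err }

// unreadable builds a descriptor that simulates a disk I/O error.
func unreadable(name string) descriptor {
	return descriptor{name: name, err: errors.New("simulated I/O error")}
}

func parseUpdatedAt(j *job) (time.Time, error) {
	return time.Parse(time.RFC3339Nano, j.UpdatedAt)
}

// isNewerOrSame errors when either timestamp cannot be parsed.
func isNewerOrSame(existing, candidate *job) (bool, error) {
	e, err := parseUpdatedAt(existing)
	if err != nil {
		return false, err
	}
	c, err := parseUpdatedAt(candidate)
	if err != nil {
		return false, err
	}
	return !c.Before(e), nil
}

// load skips faulty files instead of aborting and reports how many it dropped.
func load(descriptors []descriptor) (map[string]*job, int) {
	byRoute := make(map[string]*job)
	skipped := 0
	for _, d := range descriptors {
		content, err := d.Content()
		if err != nil {
			log.Printf("warning: skipping unreadable file %s: %v", d.name, err)
			skipped++
			continue
		}
		var j job
		if err := json.Unmarshal(content, &j); err != nil {
			log.Printf("warning: skipping malformed file %s: %v", d.name, err)
			skipped++
			continue
		}
		existing, ok := byRoute[j.RouteRequestID]
		if !ok {
			byRoute[j.RouteRequestID] = &j
			continue
		}
		newer, err := isNewerOrSame(existing, &j)
		if err != nil {
			// One of the two jobs is discarded: keep the candidate only if
			// its own timestamp parses (so the existing one is the broken one).
			log.Printf("warning: invalid timestamp for route %s: %v", j.RouteRequestID, err)
			if _, perr := parseUpdatedAt(&j); perr == nil {
				byRoute[j.RouteRequestID] = &j
			}
			skipped++
			continue
		}
		if newer {
			byRoute[j.RouteRequestID] = &j
		}
	}
	return byRoute, skipped
}

func main() {
	jobs, skipped := load([]descriptor{
		unreadable("a.json"),
		{name: "b.json", content: []byte("{not json")},
		{name: "c.json", content: []byte(`{"requestId":"old","routeRequestId":"r1","updatedAt":"invalid-timestamp"}`)},
		{name: "d.json", content: []byte(`{"requestId":"new","routeRequestId":"r1","updatedAt":"2024-01-02T03:04:05Z"}`)},
	})
	fmt.Println(len(jobs), skipped, jobs["r1"].RequestID) // prints "1 3 new"
}
```

Returning the skipped count alongside the loaded jobs is one way to let callers decide whether to fail closed, which relates to the reviewer's concern about dedupe semantics after a partial load.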