Skip to content

feat(tbtc): fault isolation for wallet registry and resilient store loading#3933

Open
lrsaturnino wants to merge 3 commits intofeat/psbt-covenant-final-project-prfrom
feat/fault-isolation-resilient-loading
Open

feat(tbtc): fault isolation for wallet registry and resilient store loading#3933
lrsaturnino wants to merge 3 commits intofeat/psbt-covenant-final-project-prfrom
feat/fault-isolation-resilient-loading

Conversation

@lrsaturnino
Copy link
Copy Markdown
Member

@lrsaturnino lrsaturnino commented Apr 9, 2026

Summary

1. GetWallet — wallet registry failure is fatal

Issue: GetWallet calls both the Bridge and the wallet registry. If the registry call fails (transient RPC outage, registry contract issue), the entire GetWallet returns an error — even though most callers only need Bridge-sourced fields (State, timestamps, MainUtxoHash) and never touch MembersIDsHash. A transient registry outage cascades into failures for all wallet queries.

Solution: The registry call is now best-effort. On failure, GetWallet logs a warning and returns Bridge-sourced fields with a zero-valued MembersIDsHash. Downstream consumers that need registry data (e.g. signer approval certificate) already guard against the zero value.

2. Store load aborts on any single corrupt file

Issue: Store.load() iterates persisted job files and returns a hard error on any of: unreadable file (disk I/O error), malformed JSON, or unparseable timestamp when resolving duplicate route keys. A single corrupt file prevents the entire store from loading, blocking the covenant signer from starting.

Solution: Each failure mode now logs a warning and skips the offending file instead of aborting:

  • Unreadable file: descriptor.Content() error → skip with warning.
  • Malformed JSON: json.Unmarshal error → skip with warning.
  • Invalid timestamp on duplicate route key: when isNewerOrSameJobRevision fails, the loader checks which job has a parseable timestamp — if the candidate's timestamp is valid, it replaces the existing job (whose timestamp must be the broken one); otherwise the candidate is skipped. This preserves the best available data.

Test coverage

GetWallet fault isolation (pkg/tbtc/get_wallet_fault_isolation_test.go)

Test Scenario
TestGetWalletReturnsDataWhenRegistryFails Registry error → returns Bridge fields, MembersIDsHash is zero, no error
TestGetWalletReturnsFullDataWhenRegistrySucceeds Both sources succeed → all fields including MembersIDsHash populated
TestGetWalletBridgeFailureStillReturnsError Unknown wallet (Bridge failure) → still returns error (no behavior change)

Test mock (pkg/tbtc/chain_test.go)

  • Added walletRegistryErrs map and setWalletRegistryErr() to localChain, enabling simulation of registry failures that return degraded data with zero MembersIDsHash.

Resilient store loading (pkg/covenantsigner/store_test.go)

Test Scenario
TestStoreLoadSkipsUnreadableFile One readable job + one faulting descriptor → store loads, valid job accessible
TestStoreLoadSkipsMalformedJSON One valid job + one garbage JSON file → store loads, valid job accessible
TestStoreLoadSkipsInvalidTimestampOnDuplicateRouteKey Two jobs on same route key, one with "invalid-timestamp" → store loads, valid-timestamp job wins
TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys (updated) Now expects success instead of error — validates the graceful degradation

Test helpers (pkg/covenantsigner/covenantsigner_test.go)

  • faultingDescriptor: returns injected error from Content(), simulating unreadable files.
  • contentFaultingHandle: injects faulting descriptors into the ReadAll channel alongside normal ones.

…re loading

GetWallet now degrades gracefully when the wallet registry is
unavailable, returning Bridge-sourced fields with a zero MembersIDsHash
instead of failing outright. Covenant signer store loading skips
unreadable files, malformed JSON, and invalid timestamps on duplicate
route keys rather than aborting the entire load.
Copy link
Copy Markdown

@mswilkison mswilkison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two substantive issues remain:

  1. pkg/covenantsigner/store.go
    load() now skips unreadable/malformed job files and still returns success. After a restart, that can leave the process without the persisted job that originally owned a routeRequestId, while Service.Submit still deduplicates only through store.GetByRouteRequest(...). If the skipped file belonged to an already-accepted request, a retry will create a second signing job for the same covenant request instead of failing closed. I don't think this should merge without a fail-closed safeguard for skipped persisted jobs, or an equivalent quarantine/recovery path that preserves dedupe semantics. This is not limited to the skipped > 0 && loaded == 0 case; losing even one accepted job is enough to break node-local idempotency/replay protection.

  2. pkg/chain/ethereum/tbtc.go
    The best-effort wallet-registry read looks safe for the current caller set, but the degraded-mode contract is implicit: a zero MembersIDsHash now means "registry data unavailable". That needs to be explicit in WalletChainData documentation at minimum, and the signer-approval path should surface the likely cause more clearly. Today the operator-facing error degrades into wallet chain data must include members IDs hash, which obscures the underlying registry outage. Also, the new tests exercise the local mock chain, not the real Ethereum adapter path that changed.

Non-blocking:

  • When load replaces one job with another for the same route key, the superseded job can remain reachable through byRequestID.
  • TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys now asserts success and should be renamed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants