Signer leader #30
Conversation
- additional unit tests
- tssmock mocking + processor handling
- tmp unit-test for tss signature making inside processor
- change api
- wip: node receives the tss engine
- fix: leader mechanism to use ethaddress instead of indices
- connecting the node
Pull request overview
This PR implements a leader-based protocol for TSS (Threshold Signature Scheme) signing where a designated leader guardian can forward VAAv1 messages with specific committees to other guardians for signing. The changes remove the old tsscomm protocol and replace it with a gossip-based approach.
Changes:
- Removes the old tsscomm protocol service and replaces it with a gossip-based TSS communication mechanism
- Implements leader-based VAA forwarding where leaders publish VAAv1 messages via gossip
- Adds comprehensive test coverage for TSS gossip functionality and processor handling
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| proto/tsscomm/v1/tsscomm.proto | Removes entire tsscomm proto definition |
| proto/gossip/v1/gossip.proto | Adds TSSGossipMessage to gossip protocol |
| node/pkg/tss/gossiper.go | Implements leader VAA witnessing and gossip message handling |
| node/pkg/tss/client.go | Refactors SignerClient to support VAA handling |
| node/pkg/tss/api.go | Defines Gossiper interface and updates Signer interface |
| node/pkg/processor/processor.go | Implements TSS response handling and signature processing |
| node/pkg/p2p/p2p.go | Adds TSS gossip topic support |
| node/cmd/guardiand/node.go | Updates CLI flags for TSS configuration |
```go
p.logger.Info("TSS signing request stopped",
	zap.String("hash", hash),
	zap.String("status", status.GetProtocol()),
	zap.String("code", codes.Code(status.GetCode()).String()),
	zap.String("message", status.GetMessage()),
)

// TODO: consider additional handling based on status code
```
The TODO suggests that different status codes should be handled differently (e.g., retries for transient failures vs. permanent failures), but this logic is missing. Consider implementing appropriate error handling based on status codes.
Suggested change:

```go
code := codes.Code(status.GetCode())
p.logger.Info("TSS signing request stopped",
	zap.String("hash", hash),
	zap.String("status", status.GetProtocol()),
	zap.String("code", code.String()),
	zap.String("message", status.GetMessage()),
)
// Additional handling based on status code. This is intentionally limited to
// logging/diagnostics to avoid changing existing functional behavior.
switch code {
case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted, codes.Aborted:
	// These errors are often transient and may be retried by the caller.
	p.logger.Warn("transient TSS signer error; request may be retried by caller",
		zap.String("hash", hash),
		zap.String("status", status.GetProtocol()),
		zap.String("code", code.String()),
		zap.String("message", status.GetMessage()),
	)
case codes.Canceled:
	// The signing request was canceled (e.g., by the client); no retry implied.
	p.logger.Debug("TSS signing request was canceled",
		zap.String("hash", hash),
		zap.String("status", status.GetProtocol()),
		zap.String("code", code.String()),
		zap.String("message", status.GetMessage()),
	)
case codes.OK:
	// A successful status should normally be represented as a signature response,
	// but if it appears here, no additional action is required.
default:
	// Non-transient, non-OK errors are considered non-recoverable.
	p.logger.Error("non-recoverable TSS signer error",
		zap.String("hash", hash),
		zap.String("status", status.GetProtocol()),
		zap.String("code", code.String()),
		zap.String("message", status.GetMessage()),
	)
}
```
- fx: tss used wrong pubsub topic in p2p package
- fx: removed redundant check
```go
if p.GuardianSigner == nil {
	return nil, errors.New("guardian signer must not be nil")
}
if p.SocketPath == "" {
```
Should this also make sure that `p.LeaderAddress` is in the guardian set?
I'm not sure that's relevant. When this function actually runs, a valid `guardianSetState` may not exist yet.
Pull request overview
Copilot reviewed 28 out of 29 changed files in this pull request and generated 8 comments.
```go
func (dt vaaHandling) verifyGossipSig(msg *gossipv1.TSSGossipMessage, gs *common.GuardianSet) error {
	pubKey, err := ethcrypto.Ecrecover(ethcrypto.Keccak256(msg.Message), msg.Signature)
	if err != nil {
		return fmt.Errorf("failed to recover public key: %w", err)
	}

	signerAddr := ethcommon.BytesToAddress(ethcrypto.Keccak256(pubKey[1:])[12:])

	if signerAddr != dt.leaderAddress {
		return fmt.Errorf("signature not from leader: got %s, want %s", signerAddr.Hex(), dt.leaderAddress.Hex())
	}

	if _, ok := gs.KeyIndex(signerAddr); !ok {
		return fmt.Errorf("leader address %s not in guardian set", signerAddr.Hex())
	}
```
The signature recovery process appears incomplete. The code recovers the public key and derives an address, but the comparison with dt.leaderAddress seems insufficient for proper authentication. The recovered address should ideally be cross-verified with the guardian set to ensure it's a valid guardian, not just the leader. Additionally, the signature format should be validated to ensure it follows the expected ECDSA signature structure (65 bytes with recovery ID).
The line `if _, ok := gs.KeyIndex(signerAddr); !ok {` ensures the guardian is part of the guardian set, and `Ecrecover` checks the structure of the signature.
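For defense in depth, an explicit format check before calling `Ecrecover` makes the expected signature shape visible at the call site. The helper below is a hypothetical sketch (not part of the PR), assuming the compact 65-byte `R || S || V` layout that go-ethereum's `Ecrecover` accepts:

```go
package main

import (
	"errors"
	"fmt"
)

// validateSigFormat is a hypothetical pre-check for a compact ECDSA
// signature as used by go-ethereum's Ecrecover: 65 bytes laid out as
// R (32) || S (32) || V (1), where V is the recovery id 0 or 1.
func validateSigFormat(sig []byte) error {
	if len(sig) != 65 {
		return fmt.Errorf("signature must be 65 bytes, got %d", len(sig))
	}
	if v := sig[64]; v != 0 && v != 1 {
		return errors.New("recovery id must be 0 or 1")
	}
	return nil
}

func main() {
	fmt.Println(validateSigFormat(make([]byte, 64))) // too short: error
	fmt.Println(validateSigFormat(make([]byte, 65))) // zeroed 65-byte sig passes the format check
}
```

This only validates the envelope; cryptographic validity is still established by the recovery itself.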
```go
case s.conn.signResponses <- resp: // forward response to consumer.
default:
	// Drop the response if the channel is full to avoid blocking. This is not
	// ideal, but prevents deadlocks. Log as an error, since it indicates that
	// the consumer is not keeping up.
	logger.Error("signResponses channel full, dropping response", zap.Stringer("response", resp))
}
```
When the channel is full, the code logs an error but still drops the response. This could lead to lost signatures in high-throughput scenarios. Consider a buffered channel with monitoring/alerting as it approaches capacity, or backpressure that slows the sender down, rather than dropping critical signature responses.
```go
p.storeSignedVAA(signed)
// TODO: handle ctx.
if err := p.thresholdSigner.WitnessNewVaaV1(context.Background(), signed); err != nil {
```
Using context.Background() here bypasses the cancellation chain and could cause the WitnessNewVaaV1 call to continue even after the processor is shutting down. Consider passing through the context from the HandleQuorum caller or maintaining a processor-level context that gets canceled on shutdown.
```go
func (s *SignerClient) gossipListener(ctx context.Context, logger *zap.Logger) {
	logger = logger.Named("gossipListener")
	dt := s.vaaData
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-dt.incomingGossip:
			gs := dt.gst.Get()
			if gs == nil {
				logger.Debug("nil guardian set")
				continue
			}

			if err := dt.verifyGossipSig(msg, gs); err != nil {
				logger.Warn("invalid gossip signature", zap.Error(err))
				continue
			}

			newVaa, err := vaa.Unmarshal(msg.Message)
			if err != nil {
				logger.Warn("malformed VAA", zap.Error(err))
				continue
			}

			if newVaa.Version != vaa.VaaVersion1 {
				continue
			}

			if err := newVaa.Verify(gs.Keys); err != nil {
				logger.Warn("invalid VAA", zap.Error(err))
				continue
			}

			if err := s.vaaToSignRequest(newVaa, gs); err != nil {
				logger.Error("failed to create sign request from VAA", zap.Error(err))
				continue
			}
		}
	}
}
```
The gossipListener goroutine doesn't have any mechanism to signal that it has started or is running. If the goroutine fails to start or exits unexpectedly, there's no way for the caller to detect this. Consider using an error channel or a sync.WaitGroup to ensure the goroutine has started successfully before proceeding.
```go
	rq.Committee = append(rq.Committee, key)
}

if len(rq.Committee) < s.configurations.ThresholdSize {
```
Should this support `ThresholdSize == 0`? If not, that can be asserted when loading the config.
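Asserting it at config-load time could look like the following sketch; the `Configurations` struct and `Validate` method here are illustrative stand-ins, not the PR's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// Configurations mirrors the threshold setting discussed above.
type Configurations struct {
	ThresholdSize int
}

// Validate rejects a non-positive threshold when the config is loaded,
// so later code (like the committee-size check) can assume
// ThresholdSize >= 1.
func (c Configurations) Validate() error {
	if c.ThresholdSize < 1 {
		return errors.New("ThresholdSize must be at least 1")
	}
	return nil
}

func main() {
	fmt.Println(Configurations{ThresholdSize: 0}.Validate()) // error
	fmt.Println(Configurations{ThresholdSize: 3}.Validate()) // <nil>
}
```

Validating once at load time keeps the invariant out of every downstream check.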
```go
return &signerClient{}

type Parameters struct {
	// configurations for the signer client.
	Configurations `json:"configurations"`
```
Is the Parameters struct serialized somewhere? (using the json tags)