Motivation
While reviewing the stale ACK fix in #3173 / #3160, we found a similar stale ACK risk in the Artery system message delivery path.
Classic remoting and Artery use different implementations, so the Classic EndpointReader UID fix does not cover this path. Artery uses SystemMessageDelivery for acked system message delivery.
Current behavior
Artery ACK/NACK messages carry the remote unique address:
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:53
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:54
However, SystemMessageDelivery.notify currently accepts ACK/NACK replies by comparing only the address:
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:157-160
case ack: Ack => if (ack.from.address == remoteAddress) ackCallback.invoke(ack)
case nack: Nack => if (nack.from.address == remoteAddress) nackCallback.invoke(nack)
When a remote system restarts with a new UID, the association moves to a new incarnation:
remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala:306-317
SystemMessageDelivery then notices the incarnation change on send/resend and clears delivery state, including resetting the sequence number:
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:222-225
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:242-246
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:296-302
ACK processing then clears unacknowledged messages by sequence number only:
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:189-197
This means a delayed ACK/NACK from an older UID for the same address can be accepted after a new incarnation has started, if the old UID is not quarantined and the old UID mapping is still known.
Why the old UID can still reach this path
Inbound decoding and filtering are based on originUid and the association registry:
remote/src/main/scala/org/apache/pekko/remote/artery/Codecs.scala:427-428
remote/src/main/scala/org/apache/pekko/remote/artery/Handshake.scala:328-332
remote/src/main/scala/org/apache/pekko/remote/artery/InboundQuarantineCheck.scala:45-71
The registry adds uid -> association mappings:
remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala:1180-1197
A new UID handshake does not appear to remove the old UID mapping. The unused-quarantined cleanup checks the association current state, so when the same association currently has a healthy new UID, an old UID mapping may remain:
remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala:1236-1256
Possible impact
If an old-UID ACK with a low sequence number is accepted after the new incarnation has reset SystemMessageDelivery sequence numbers, it can incorrectly acknowledge system messages from the new incarnation. That can cause new system messages to be removed from the unacknowledged buffer even though the ACK belongs to the previous remote UID.
This is similar in shape to the stale ACK issue fixed for Classic remoting, but it is in the Artery system message delivery implementation.
Proposed change
In SystemMessageDelivery.notify, validate the full UniqueAddress of ACK/NACK replies against the current association state instead of comparing only the address.
For example, accept the reply only when:
outboundContext.associationState.uniqueRemoteAddress().contains(reply.from)
This should be applied to both Ack and Nack.
Suggested tests
Add focused coverage in remote/src/test/scala/org/apache/pekko/remote/artery/SystemMessageDeliverySpec.scala:
- Establish an old
UniqueAddress for the same remote address.
- Send at least one acked system message and then move the association to a new
UniqueAddress / incarnation.
- Trigger
SystemMessageDelivery state clearing and sequence number reset for the new incarnation.
- Inject
Ack(1, oldUniqueAddress) and verify it does not acknowledge the new incarnation message.
- Inject
Ack(1, newUniqueAddress) and verify the current message is acknowledged.
- Add the same stale-source check for
Nack.
Related
Motivation
While reviewing the stale ACK fix in #3173 / #3160, we found a similar stale ACK risk in the Artery system message delivery path.
Classic remoting and Artery use different implementations, so the Classic
EndpointReaderUID fix does not cover this path. Artery usesSystemMessageDeliveryfor acked system message delivery.Current behavior
Artery ACK/NACK messages carry the remote unique address:
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:53remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:54However,
SystemMessageDelivery.notifycurrently accepts ACK/NACK replies by comparing only the address:remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:157-160When a remote system restarts with a new UID, the association moves to a new incarnation:
remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala:306-317SystemMessageDeliverythen notices the incarnation change on send/resend and clears delivery state, including resetting the sequence number:remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:222-225remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:242-246remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:296-302ACK processing then clears unacknowledged messages by sequence number only:
remote/src/main/scala/org/apache/pekko/remote/artery/SystemMessageDelivery.scala:189-197This means a delayed ACK/NACK from an older UID for the same address can be accepted after a new incarnation has started, if the old UID is not quarantined and the old UID mapping is still known.
Why the old UID can still reach this path
Inbound decoding and filtering are based on
originUidand the association registry:remote/src/main/scala/org/apache/pekko/remote/artery/Codecs.scala:427-428remote/src/main/scala/org/apache/pekko/remote/artery/Handshake.scala:328-332remote/src/main/scala/org/apache/pekko/remote/artery/InboundQuarantineCheck.scala:45-71The registry adds
uid -> associationmappings:remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala:1180-1197A new UID handshake does not appear to remove the old UID mapping. The unused-quarantined cleanup checks the association current state, so when the same association currently has a healthy new UID, an old UID mapping may remain:
remote/src/main/scala/org/apache/pekko/remote/artery/Association.scala:1236-1256Possible impact
If an old-UID ACK with a low sequence number is accepted after the new incarnation has reset
SystemMessageDeliverysequence numbers, it can incorrectly acknowledge system messages from the new incarnation. That can cause new system messages to be removed from the unacknowledged buffer even though the ACK belongs to the previous remote UID.This is similar in shape to the stale ACK issue fixed for Classic remoting, but it is in the Artery system message delivery implementation.
Proposed change
In
SystemMessageDelivery.notify, validate the fullUniqueAddressof ACK/NACK replies against the current association state instead of comparing only the address.For example, accept the reply only when:
This should be applied to both
AckandNack.Suggested tests
Add focused coverage in
remote/src/test/scala/org/apache/pekko/remote/artery/SystemMessageDeliverySpec.scala:UniqueAddressfor the same remote address.UniqueAddress/ incarnation.SystemMessageDeliverystate clearing and sequence number reset for the new incarnation.Ack(1, oldUniqueAddress)and verify it does not acknowledge the new incarnation message.Ack(1, newUniqueAddress)and verify the current message is acknowledged.Nack.Related