At site we are now regularly troubled by instances where the aggregators lose contact with crossbar. Crossbar reports a ping-pong error and drops the aggregator client; the aggregator then sees the dropped connection and reconnects a few seconds later. During the outage, the aggregator doesn't get data on its subscribed feeds, and that data is lost.
This is likely related to disk I/O causing something to hang in the aggregator and foiling the ping checks.
If a more direct solution cannot be found, I propose making ocs more robust to such dropouts by adding the following:
- Agents may publish their data multiple times. Each bundle they send will be tagged with some identifier (so duplicates can be removed/ignored by the aggregator).
- The Aggregator will monitor identifiers on feeds and (a) warn / alert when a bundle is dropped, and (b) quietly accept (and not record) any duplicate packets.
The scheme could be made backwards compatible -- only some agents (such as ACU) will need to enable this safety function.
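To make the proposal concrete, here is a rough sketch of the aggregator-side bookkeeping. It assumes each bundle carries an integer sequence number that increases by one per bundle on a given feed; that choice of identifier, along with the names `BundleTracker` and `check`, is hypothetical and not part of the existing ocs API.

```python
from collections import defaultdict, deque


class BundleTracker:
    """Hypothetical helper: tracks per-feed bundle identifiers.

    Assumes identifiers are integers that increase by 1 per bundle
    on each feed. Not part of ocs; for illustration only.
    """

    def __init__(self, window=1000):
        # Remember only the most recent `window` identifiers per feed,
        # so memory stays bounded. An identifier older than the window
        # would be treated as new.
        self._recent = defaultdict(lambda: deque(maxlen=window))
        self._highest = {}

    def check(self, feed, seq):
        """Return (record, missing).

        record  -- False if this bundle is a duplicate and should be ignored.
        missing -- identifiers skipped between the previous highest and seq;
                   the caller can warn / alert on these.
        """
        recent = self._recent[feed]
        if seq in recent:
            # Duplicate publication: accept quietly, do not record.
            return False, []

        highest = self._highest.get(feed)
        missing = []
        if highest is not None and seq > highest + 1:
            # Gap in the sequence: these bundles have not arrived (yet).
            missing = list(range(highest + 1, seq))

        recent.append(seq)
        if highest is None or seq > highest:
            self._highest[feed] = seq
        return True, missing
```

An agent that does not tag its bundles would simply bypass the tracker, which keeps the scheme backwards compatible. Note that the gap report is made as soon as a later identifier arrives, so a repeated publication that shows up afterwards is still recorded, but the warning will already have fired; a real implementation might delay the alert by a short grace period.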