Aggregator misses data due to losing contact with crossbar #439

@mhasself

At site we are now regularly troubled by instances where the Aggregators lose contact with crossbar. Crossbar reports a ping-pong error, and drops the aggregator client; then the aggregator sees the dropped connection and reconnects a few seconds later. During the outage, aggregator doesn't get data on its subscribed feeds, and data is lost.

This is likely related to disk I/O causing something in the aggregator to hang, so that it fails to answer the ping checks in time.

If a more direct solution cannot be found, I propose making ocs more robust to such dropouts by adding the following:

  • Agents may publish their data multiple times. Each bundle they send will be tagged with some identifier (so duplicates can be removed/ignored by the aggregator).
  • The Aggregator will monitor identifiers on feeds and (a) warn / alert when a bundle is dropped, and (b) quietly accept (and not record) any duplicate packets.

The scheme could be made backwards compatible -- only some agents (such as ACU) will need to enable this safety function.
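
To make the proposal concrete, here is a minimal sketch of the aggregator-side bookkeeping. It assumes each bundle carries a per-feed integer sequence number that increases by one; that tagging scheme, and the `FeedTracker` name and its methods, are hypothetical and not part of the current ocs API.

```python
from collections import deque


class FeedTracker:
    """Track bundle sequence numbers for one feed (hypothetical helper)."""

    def __init__(self, window=1000):
        self.highest = None    # highest sequence number seen so far
        self.recent = set()    # sequence numbers seen within the window
        self.order = deque()   # arrival order, used to trim the window
        self.window = window   # how many identifiers to remember
        self.missing = set()   # identifiers skipped and not yet received

    def check(self, seq):
        """Classify one bundle.

        Returns (status, gaps): status is "new" or "duplicate"; gaps lists
        the sequence numbers this bundle revealed to be missing.
        """
        if seq in self.recent:
            # A repeat publication: accept quietly, do not record.
            return "duplicate", []

        self.recent.add(seq)
        self.order.append(seq)
        if len(self.order) > self.window:
            # Forget the oldest identifier to bound memory use.
            self.recent.discard(self.order.popleft())

        gaps = []
        if self.highest is None:
            self.highest = seq
        elif seq > self.highest:
            # Everything between the old and new high-water mark was skipped.
            gaps = list(range(self.highest + 1, seq))
            self.missing.update(gaps)
            self.highest = seq
        else:
            # A late copy filled an earlier gap.
            self.missing.discard(seq)
        return "new", gaps
```

The aggregator would keep one tracker per subscribed feed, record a bundle only when the status is "new", and raise a warning for any identifiers still in `missing` after the agent's repeat publications have had time to arrive. Feeds from agents that do not tag their bundles would bypass the tracker entirely, which keeps the scheme backwards compatible.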
