Allow to configure "crash toleration" of the BRUPOP controller #748

@mikn

Image I'm using: 1.5.0

Issue or Feature Request: We recently pushed a bad release that didn't boot. Even though the first node crashed and the controller registered this, it went on to update the next node, and so on. Unless you stop it manually, the controller will keep crashing nodes indefinitely, which seems unintuitive. I propose a new configuration setting that sets a ceiling on "allowed crashes across the cluster" and pauses the controller from performing further updates once that threshold is reached.
It could be called CRASH_TOLERANCE or similar.
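
To make the proposal concrete, here is a rough sketch of the check. Everything in it is hypothetical: the `CRASH_TOLERANCE` environment variable, the `NodeUpdateOutcome` enum, and both function names are placeholders for illustration, not existing Brupop types or settings. It assumes an unset value keeps today's behavior (no limit).

```rust
/// Hypothetical setting name taken from this proposal; not an existing Brupop option.
const CRASH_TOLERANCE_ENV: &str = "CRASH_TOLERANCE";

/// Hypothetical per-node result of an update attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeUpdateOutcome {
    Succeeded,
    Crashed,
    Pending,
}

/// Parse the configured ceiling. `None` or an empty string means "no limit",
/// which matches the current behavior of never pausing.
fn parse_crash_tolerance(raw: Option<&str>) -> Result<Option<usize>, std::num::ParseIntError> {
    match raw.map(str::trim) {
        None | Some("") => Ok(None),
        Some(value) => value.parse::<usize>().map(Some),
    }
}

/// Read the ceiling from the (hypothetical) environment variable.
fn crash_tolerance_from_env() -> Result<Option<usize>, std::num::ParseIntError> {
    parse_crash_tolerance(std::env::var(CRASH_TOLERANCE_ENV).ok().as_deref())
}

/// Return true when the number of crashed nodes has reached the ceiling,
/// meaning the controller should stop scheduling further updates.
/// Note that a ceiling of 0 pauses updates even before any crash.
fn should_pause_updates(outcomes: &[NodeUpdateOutcome], tolerance: Option<usize>) -> bool {
    let crashed = outcomes
        .iter()
        .filter(|outcome| **outcome == NodeUpdateOutcome::Crashed)
        .count();
    match tolerance {
        Some(limit) => crashed >= limit,
        None => false,
    }
}
```

With `CRASH_TOLERANCE=1`, the bad release above would have stopped after the first node failed to boot instead of rolling on through the cluster.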

Wouldn't mind implementing this if you find this palatable.
