Distributed leader election (for microservices) using k/v stores

This code servers as a template demonstrating how to handle a locking task this is connecting a destination Telnet host while implementing a "distributed leader election" operational model leveraging the fail-closed architecture software pattern.

This pattern can be implemented with code and supported on regular key/values stored in Redis or DragonFly (same Redis API,no need for Redis Streams or any other advance module) working as a locking mechanism. This locking mechanism ensures the following actions:

Only one instance, out of potentially many of a single microservice type bein run, can hold the LOCK_KEY (the leader) at a time, a typical scenario en Docker Swarm, K8s and/or HashiCopr Nomad orchestrators.
The leader periodically renews its lock; if it fails in doing so, another instance can take over the leadership.

Meaning of `fail-closed` architecture software pattern

The code has been architected using the fail-closed architecture software pattern, which design assumes dependencies will sometimes break (for example, Redis going down) and defines that, in those cases, the system should stop doing the “dangerous” or critical thing rather than continue in an incorrect or inconsistent state.

By implementing this pattern it's posible to ensure high availability in a system where only one node should perform a certain action.

How it works in practice

When an instance cannot acquire the lock (because another instance holds it or because Redis is unreachable), it calls close_telnet() and goes to sleep as standby.
When the instance is leader but lock renewal fails, it immediately drops leadership, closes locking task and waits before trying again.

In both cases, failure leads to “closed” behavior: no locking task can run safely, no leader work, and the instances stay passive until it can prove it's safe to get the lock again, that means Redis is up again to signal who should be the leader.

This is the essence of fail-close: default to off/standby when not 100% sure things are correct.

Strengths of this design

Atomic operations: The code uses Redis' atomic SET NX command for lock acquisition and a Lua script for renewal, which prevents race conditions and ensures consistency.
Optimized Redis connection: The Redis connection is configured with appropriate timeouts and retry settings:
```
socket_connect_timeout=1.0,
socket_timeout=1.0,
retry_on_timeout=True
```
This prevents the application from hanging indefinitely if Redis is unresponsive.
Error handling: The code properly handles Redis errors in both the try_acquire() and renew() functions, preventing crashes when Redis is unavailable.
Configurable parameters: Key parameters like TTL, renewal frequency, and sleep times are configurable via environment variables, allowing for tuning without code changes.
Unique instance ID: Using UUID, ensures each instance has a unique identifier, which is essential for proper lock ownership.

Areas for improvement

Sleep timing: The code uses time.sleep() with fixed intervals. This could be inefficient in some scenarios:
- In line 124, it sleeps for a fixed STANDBY_SLEEP_MS regardless of how long until the current lock expires
- In line 137, it sleeps for RENEW_EVERY_MS without accounting for how long the renewal operation took
Lack of backoff strategy: There's no exponential backoff when Redis is unavailable, which could lead to excessive connection attempts during outages.
No graceful shutdown: There's no handling for graceful shutdown (e.g., via signals), which could leave resources in an inconsistent state.
Single Redis Instance: The code connects to a single Redis instance, creating a single point of failure for the leader election mechanism.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
src/kvdistlead		src/kvdistlead
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Distributed leader election (for microservices) using k/v stores

Meaning of `fail-closed` architecture software pattern

How it works in practice

Strengths of this design

Areas for improvement

About

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Distributed leader election (for microservices) using k/v stores

Meaning of fail-closed architecture software pattern

How it works in practice

Strengths of this design

Areas for improvement

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors

Uh oh!

Languages

Meaning of `fail-closed` architecture software pattern