When the system becomes slow or unstable, we currently have no clear way to tell whether the root cause is in our application logic or in Kafka.
Proposed solution
Integrate kafka_exporter and add a Prometheus scrape config for it in the monitoring repo so we can observe broker health, topic behavior, and consumer group lag.
This will help us quickly answer whether:
- the application is slow or failing to process messages efficiently
- consumers are falling behind
- Kafka itself is experiencing broker or topic-level issues
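
The three questions above can be sketched as a small triage over the exporter's metrics output. This is only an illustration, not the proposed implementation: in practice the checks would more likely live in Prometheus alerting rules or dashboards. It assumes the metric names exposed by the commonly used danielqsj/kafka_exporter (`kafka_brokers`, `kafka_consumergroup_lag`, `kafka_topic_partition_under_replicated_partition`). The `triage` function and its `expected_brokers` and `lag_threshold` parameters are hypothetical names introduced here.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text format: name{label="v",...} value
SAMPLE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) per sample line, skipping comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(labels or "")), float(value)


def triage(text, expected_brokers, lag_threshold):
    """Summarize Kafka-side health and consumer lag (hypothetical helper).

    expected_brokers and lag_threshold are assumed, deployment-specific values.
    """
    lag_by_group = defaultdict(float)
    brokers = None
    under_replicated = 0
    for name, labels, value in parse_samples(text):
        if name == "kafka_consumergroup_lag":
            # Sum lag over all topics and partitions of a consumer group.
            lag_by_group[labels.get("consumergroup", "unknown")] += value
        elif name == "kafka_brokers":
            brokers = value
        elif name == "kafka_topic_partition_under_replicated_partition" and value > 0:
            under_replicated += 1

    broker_missing = brokers is not None and brokers < expected_brokers
    return {
        # True points at Kafka itself (broker or topic-level issue).
        "kafka_unhealthy": broker_missing or under_replicated > 0,
        # Non-empty with healthy Kafka points at slow or failing consumers.
        "lagging_groups": {
            group: lag for group, lag in lag_by_group.items() if lag > lag_threshold
        },
    }
```

Read this way, an unhealthy Kafka flag suggests starting with the brokers, lagging groups on an otherwise healthy cluster suggest the consumers or application logic, and neither signal suggests the slowdown is elsewhere in the application.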