A heartbeat monitor for ClickHouse — polls system tables, exposes Prometheus metrics on /metrics.
- A lightweight sidecar that connects to ClickHouse and exposes 60+ Prometheus-compatible metric families
- A poll-based exporter for queries, merges, mutations, replication, parts health, Keeper state, distributed DDL, dictionaries, disk usage, query regression detection, and ClickHouse failure counters
- Compatible with ClickHouse 22.x+ (auto-detects version for correct system table schemas)
- Ships with a Grafana dashboard and a Helm chart with ServiceMonitor and PrometheusRule resources
- Zero config beyond a DSN — sensible defaults for everything
- Not a replacement for clickhouse-exporter or built-in Prometheus endpoints — clickpulse computes operational signals (merge pressure, part count explosions, mutation backlogs) that raw counters miss
- Not a query profiler — it captures top-N queries by time and memory, not full query plans or flame graphs
- Not a cluster manager — it reads system tables, it never writes to ClickHouse or modifies settings
- Not an alerting engine — pair it with Alertmanager or Grafana alerts for thresholds
Observe, don't interfere. clickpulse opens a read-only window into ClickHouse's own system tables. It adds no extensions, modifies no data, and uses minimal resources. The metrics tell you what's happening; you decide what to do about it.
Create a dedicated user with read access to system tables:
CREATE USER clickpulse IDENTIFIED BY 'your-secure-password';
GRANT SELECT ON system.* TO clickpulse;

| System table | Collector | Notes |
|---|---|---|
| system.processes | Queries/activity | Active queries, memory, elapsed time |
| system.query_log | Query regression | Mean time deltas, call counts |
| system.merges | Merge pressure | Active merges, bytes/sec, part count |
| system.mutations | Mutations | Stuck mutations, parts remaining |
| system.replicas | Replication | Queue size, lag, readonly state |
| system.replication_queue | Replication queue | Task counts, oldest task age, retry counts, missing-part failures |
| system.parts | Parts health | Part counts per partition and table, sizes, compression |
| system.disks | Disk/storage | Free space, total, usage ratio |
| system.dictionaries | Dictionaries | Load status, staleness |
| system.distributed_ddl_queue | Distributed DDL | Stuck operations |
| system.zookeeper / Keeper API | Keeper health | Latency, ephemeral nodes, leader |
| system.metrics | Server metrics | ClickHouse internal counters |
| system.events | Server events | Cumulative insert, replication, Kafka, object storage, Keeper memory, and guardrail counters |

clickpulse connects using a ClickHouse DSN:
clickhouse://clickpulse:password@localhost:9000/default?secure=false
For production, use secure=true for TLS.
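
As a quick sanity check before deploying, the DSN can be inspected with Go's standard library alone. The sketch below is illustrative and not clickpulse's own parsing code: the `dsnInfo` type and `parseDSN` function are hypothetical names, and it assumes the `secure` flag is passed as a literal `true` or `false` query parameter, as in the example above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// dsnInfo holds the DSN parts worth checking before deployment.
// Hypothetical type, used only for this illustration.
type dsnInfo struct {
	User     string
	Host     string
	Database string
	Secure   bool
}

// parseDSN splits a clickhouse:// DSN using net/url only.
// It validates the scheme and does not open a connection.
func parseDSN(dsn string) (dsnInfo, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return dsnInfo{}, fmt.Errorf("parse DSN: %w", err)
	}
	if u.Scheme != "clickhouse" {
		return dsnInfo{}, fmt.Errorf("unexpected scheme %q", u.Scheme)
	}
	return dsnInfo{
		User:     u.User.Username(), // safe even when no user is present
		Host:     u.Host,
		Database: strings.TrimPrefix(u.Path, "/"),
		Secure:   u.Query().Get("secure") == "true",
	}, nil
}

func main() {
	info, err := parseDSN("clickhouse://clickpulse:password@localhost:9000/default?secure=false")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if !info.Secure {
		fmt.Println("warning: TLS is disabled; use secure=true in production")
	}
	fmt.Printf("user=%s host=%s database=%s\n", info.User, info.Host, info.Database)
}
```

A check like this could run in CI to catch a production DSN that is missing `secure=true`.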
# Build
make build
# Run
export CLICKHOUSE_DSN="clickhouse://clickpulse@localhost:9000/default"
./bin/clickpulse serve
# Docker
docker build -t clickpulse:dev .
docker run -e CLICKHOUSE_DSN="clickhouse://clickpulse@localhost:9000/default" -p 9188:9188 clickpulse:dev

Metrics at http://localhost:9188/metrics, health check at /healthz.
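
To confirm that a running instance is serving both endpoints, a small probe along these lines may help. It is a sketch that assumes the default port 9188 and the standard Prometheus text exposition format; `probe` and `countSamples` are hypothetical helper names, not part of clickpulse.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// probe fetches baseURL+path and returns the HTTP status code and body.
func probe(client *http.Client, baseURL, path string) (int, string, error) {
	resp, err := client.Get(strings.TrimRight(baseURL, "/") + path)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

// countSamples counts lines that carry a sample in Prometheus text
// exposition output, skipping blank lines and # HELP / # TYPE comments.
func countSamples(body string) int {
	n := 0
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		n++
	}
	return n
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	base := "http://localhost:9188" // assumes the default METRICS_PORT

	code, _, err := probe(client, base, "/healthz")
	if err != nil {
		fmt.Println("healthz unreachable:", err)
		return
	}
	fmt.Println("healthz status:", code)

	code, body, err := probe(client, base, "/metrics")
	if err != nil {
		fmt.Println("metrics unreachable:", err)
		return
	}
	fmt.Printf("metrics status: %d, samples: %d\n", code, countSamples(body))
}
```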
The Helm PrometheusRule covers reachability, scrape errors, sustained replication queue backlog, missing-part replication failures, replica sync failures, replicated data loss, Kafka failures, object storage failures, query memory limits, ClickHouse guardrail rejections, merge backlog, stuck mutations, part explosions, disk fullness, and Keeper health.
helm upgrade --install clickpulse charts/clickpulse/ -f clickpulse-values.yaml -n clickpulse-system

Minimal clickpulse-values.yaml:
targets:
  - name: prod
    dsn: "clickhouse://clickpulse@clickhouse:9000/default?secure=true"
serviceMonitor:
  enabled: true
  labels:
    release: prometheus-operator
prometheusRule:
  enabled: true
  labels:
    release: prometheus-operator

sudo cp bin/clickpulse /usr/local/bin/
sudo cp deploy/clickpulse.service /etc/systemd/system/
sudo mkdir -p /etc/clickpulse
sudo cp deploy/clickpulse.env.example /etc/clickpulse/clickpulse.env
sudo chmod 600 /etc/clickpulse/clickpulse.env
# Edit CLICKHOUSE_DSN in /etc/clickpulse/clickpulse.env, then:
sudo systemctl daemon-reload
sudo systemctl enable --now clickpulse

All configuration is via environment variables:

| Variable | Default | Description |
|---|---|---|
| CLICKHOUSE_DSN or DATABASE_URL | (required) | ClickHouse connection string |
| METRICS_PORT | 9188 | Port for the HTTP metrics server |
| POLL_INTERVAL | 5s | How often to collect metrics |
| SLOW_QUERY_THRESHOLD | 5s | Duration after which a query is counted as slow |
| REGRESSION_THRESHOLD | 2.0 | Mean time ratio above which a query is flagged as regressed |
| STMT_LIMIT | 50 | Number of top queries to track per dimension |
| TELEGRAM_BOT_TOKEN | (disabled) | Telegram bot token for alerts |
| TELEGRAM_CHAT_ID | (disabled) | Telegram chat ID for alerts |
| ALERT_WEBHOOK_URL | (disabled) | Slack or generic webhook URL for alerts |
| ALERT_COOLDOWN | 5m | Minimum interval between repeated alerts |
| GRAFANA_URL | (disabled) | Grafana base URL for anomaly annotations |
| GRAFANA_TOKEN | (disabled) | Grafana service account token |
| GRAFANA_DASHBOARD_UID | (optional) | Scope annotations to a specific dashboard |

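
For readers who want to see how environment-based configuration with these defaults might look in Go, here is a minimal sketch. It is not the code in internal/config: the `Config` struct and `loadConfig` function are hypothetical, only a subset of the variables is covered, and the precedence of CLICKHOUSE_DSN over DATABASE_URL when both are set is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config mirrors a subset of the documented variables (hypothetical type).
type Config struct {
	DSN                 string
	MetricsPort         int
	PollInterval        time.Duration
	SlowQueryThreshold  time.Duration
	RegressionThreshold float64
	StmtLimit           int
}

// loadConfig reads settings through getenv so tests can inject a fake lookup.
func loadConfig(getenv func(string) string) (Config, error) {
	// Defaults taken from the configuration table.
	cfg := Config{
		MetricsPort:         9188,
		PollInterval:        5 * time.Second,
		SlowQueryThreshold:  5 * time.Second,
		RegressionThreshold: 2.0,
		StmtLimit:           50,
	}

	// Assumption: CLICKHOUSE_DSN wins when both variables are set.
	cfg.DSN = getenv("CLICKHOUSE_DSN")
	if cfg.DSN == "" {
		cfg.DSN = getenv("DATABASE_URL")
	}
	if cfg.DSN == "" {
		return cfg, errors.New("CLICKHOUSE_DSN or DATABASE_URL is required")
	}

	var err error
	if v := getenv("METRICS_PORT"); v != "" {
		if cfg.MetricsPort, err = strconv.Atoi(v); err != nil {
			return cfg, fmt.Errorf("METRICS_PORT: %w", err)
		}
	}
	if v := getenv("POLL_INTERVAL"); v != "" {
		if cfg.PollInterval, err = time.ParseDuration(v); err != nil {
			return cfg, fmt.Errorf("POLL_INTERVAL: %w", err)
		}
	}
	if v := getenv("SLOW_QUERY_THRESHOLD"); v != "" {
		if cfg.SlowQueryThreshold, err = time.ParseDuration(v); err != nil {
			return cfg, fmt.Errorf("SLOW_QUERY_THRESHOLD: %w", err)
		}
	}
	if v := getenv("REGRESSION_THRESHOLD"); v != "" {
		if cfg.RegressionThreshold, err = strconv.ParseFloat(v, 64); err != nil {
			return cfg, fmt.Errorf("REGRESSION_THRESHOLD: %w", err)
		}
	}
	if v := getenv("STMT_LIMIT"); v != "" {
		if cfg.StmtLimit, err = strconv.Atoi(v); err != nil {
			return cfg, fmt.Errorf("STMT_LIMIT: %w", err)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the loader testable without touching the process environment.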
cmd/clickpulse/main.go        CLI entry point (delegates to internal/cli)
internal/
  cli/                        Cobra commands: serve, version, status, doctor
  config/                     Environment-based configuration
  collector/                  Poll loop + collectors
    processes.go              system.processes (active queries, memory, elapsed)
    merges.go                 system.merges (merge pressure, bytes/sec)
    mutations.go              system.mutations (stuck mutations, parts remaining)
    replication.go            system.replicas (queue size, lag, readonly)
    replication_queue.go      system.replication_queue (task age, retries, failures)
    parts.go                  system.parts (partition/table part counts, sizes)
    querylog.go               system.query_log (mean time deltas, stateful)
    disks.go                  system.disks (free space, usage ratio)
    dictionaries.go           system.dictionaries (load status, staleness)
    ddl.go                    system.distributed_ddl_queue (stuck DDL)
    discrepancy.go            replication consistency checks
    keeper.go                 Keeper health via mntr
    server.go                 system.metrics + system.events
    querier.go                Interface for testability
  keeper/                     ClickHouse Keeper / ZooKeeper health
  metrics/                    Prometheus metric definitions
  snapshot/                   Point-in-time cluster snapshot
  doctor/                     Connectivity and permission diagnostics
  alerter/                    Telegram, webhook alerting
  annotator/                  Grafana anomaly annotations
charts/clickpulse/            Helm chart with ServiceMonitor + PrometheusRule
grafana/
  clickpulse-dashboard.json   Importable Grafana dashboard
deploy/
  clickpulse.service          systemd unit file
  clickpulse.env.example      Environment file template

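
The layout lists querier.go as an "interface for testability". The actual interface is not shown here, so the sketch below only illustrates the general pattern under assumed names: `Querier`, `QueryInt`, `fakeQuerier`, and `collectActiveMerges` are all hypothetical, and the SQL string is an illustrative query rather than the one clickpulse runs.

```go
package main

import (
	"context"
	"fmt"
)

// Querier is a hypothetical, narrowed stand-in for a database handle.
// Collectors depend on this interface instead of a concrete driver.
type Querier interface {
	QueryInt(ctx context.Context, query string) (int64, error)
}

// fakeQuerier returns canned results keyed by query text, for tests.
type fakeQuerier struct {
	results map[string]int64
}

func (f fakeQuerier) QueryInt(_ context.Context, query string) (int64, error) {
	v, ok := f.results[query]
	if !ok {
		return 0, fmt.Errorf("unexpected query: %s", query)
	}
	return v, nil
}

// Illustrative query; not necessarily what the merges collector uses.
const activeMergesQuery = "SELECT count() FROM system.merges"

// collectActiveMerges reads the number of active merges through a Querier.
func collectActiveMerges(ctx context.Context, q Querier) (int64, error) {
	return q.QueryInt(ctx, activeMergesQuery)
}

func main() {
	q := fakeQuerier{results: map[string]int64{activeMergesQuery: 3}}
	n, err := collectActiveMerges(context.Background(), q)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("active merges:", n)
}
```

With this shape, each collector can be unit-tested against canned rows without a running ClickHouse server.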
- Query fingerprints are truncated to 80 characters
- No support for multiple ClickHouse clusters in a single process
- Query regression detection requires system.query_log (enabled by default)
- Keeper metrics require either ClickHouse Keeper or ZooKeeper access
- Part count metrics are per-partition and per-table; the default part explosion alert remains partition-scoped
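
Two of the behaviors mentioned above, fingerprint truncation and regression flagging by mean time ratio, can be sketched in a few lines. This is an illustration rather than clickpulse's implementation: `truncateFingerprint` and `isRegressed` are hypothetical names, truncation is assumed to count Unicode characters rather than bytes, and a missing baseline is assumed to mean "not regressed".

```go
package main

import "fmt"

// truncateFingerprint caps a query fingerprint at max characters.
// Assumption: the limit counts runes, not bytes.
func truncateFingerprint(s string, max int) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max])
}

// isRegressed reports whether the current mean time exceeds the baseline
// mean by more than the given ratio (REGRESSION_THRESHOLD defaults to 2.0).
func isRegressed(baselineMean, currentMean, threshold float64) bool {
	if baselineMean <= 0 {
		return false // assumption: no usable baseline means no verdict
	}
	return currentMean/baselineMean > threshold
}

func main() {
	fp := truncateFingerprint("SELECT user_id, count() FROM events WHERE ts > now() - INTERVAL ? DAY GROUP BY user_id ORDER BY count() DESC", 80)
	fmt.Println(len([]rune(fp)), "characters kept")

	// Baseline mean 120 ms, current mean 300 ms: ratio 2.5 exceeds 2.0.
	fmt.Println("regressed:", isRegressed(120, 300, 2.0))
}
```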
- Core scaffold and CLI (serve, version, status, doctor)
- Processes collector (system.processes)
- Merge pressure collector (system.merges)
- Mutations collector (system.mutations)
- Replication collector (system.replicas)
- Replication queue collector (system.replication_queue)
- Parts health collector (system.parts)
- Query regression detection (system.query_log)
- Disk/storage collector (system.disks)
- Keeper health collector
- Distributed DDL collector
- Dictionaries collector
- Server metrics/events collector
- Grafana dashboard
- Helm chart with ServiceMonitor + PrometheusRule
- Built-in alerting (Telegram, Slack)
- Anomaly annotations