ppiankov/clickpulse

clickpulse

A heartbeat monitor for ClickHouse — polls system tables, exposes Prometheus metrics on /metrics.

What clickpulse is

  • A lightweight sidecar that connects to ClickHouse and exposes 60+ Prometheus-compatible metric families
  • A poll-based exporter for queries, merges, mutations, replication, parts health, Keeper state, distributed DDL, dictionaries, disk usage, query regression detection, and ClickHouse failure counters
  • Compatible with ClickHouse 22.x+ (auto-detects version for correct system table schemas)
  • Ships with a Grafana dashboard and a Helm chart with ServiceMonitor and PrometheusRule resources
  • Zero config beyond a DSN — sensible defaults for everything

What clickpulse is NOT

  • Not a replacement for clickhouse-exporter or built-in Prometheus endpoints — clickpulse computes operational signals (merge pressure, part count explosions, mutation backlogs) that raw counters miss
  • Not a query profiler — it captures top-N queries by time and memory, not full query plans or flame graphs
  • Not a cluster manager — it reads system tables, it never writes to ClickHouse or modifies settings
  • Not an alerting engine — pair it with Alertmanager or Grafana alerts for thresholds

Philosophy

Observe, don't interfere. clickpulse opens a read-only window into ClickHouse's own system tables. It adds no extensions, modifies no data, and uses minimal resources. The metrics tell you what's happening; you decide what to do about it.

ClickHouse prerequisites

Required: a monitoring user

Create a dedicated user with read access to system tables:

CREATE USER clickpulse IDENTIFIED BY 'your-secure-password';
GRANT SELECT ON system.* TO clickpulse;

System tables used

System table                   Collector          Notes
system.processes               Queries/activity   Active queries, memory, elapsed time
system.query_log               Query regression   Mean time deltas, call counts
system.merges                  Merge pressure     Active merges, bytes/sec, part count
system.mutations               Mutations          Stuck mutations, parts remaining
system.replicas                Replication        Queue size, lag, readonly state
system.replication_queue       Replication queue  Task counts, oldest task age, retry counts, missing-part failures
system.parts                   Parts health       Part counts per partition and table, sizes, compression
system.disks                   Disk/storage       Free space, total, usage ratio
system.dictionaries            Dictionaries       Load status, staleness
system.distributed_ddl_queue   Distributed DDL    Stuck operations
system.zookeeper / Keeper API  Keeper health      Latency, ephemeral nodes, leader
system.metrics                 Server metrics     ClickHouse internal counters
system.events                  Server events      Cumulative insert, replication, Kafka, object storage, Keeper memory, and guardrail counters

Connection string

clickpulse connects using a ClickHouse DSN:

clickhouse://clickpulse:password@localhost:9000/default?secure=false

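For production, set secure=true to enable TLS. ClickHouse conventionally serves the TLS native protocol on port 9440 rather than 9000, so change the port as well unless your server is configured otherwise.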
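
To sanity-check a DSN before handing it to clickpulse, the sketch below splits one into its parts with Go's standard net/url package. The helper parseDSN and the dsnInfo type are hypothetical names used for illustration only; they are not part of clickpulse, and the real driver may accept options that this sketch ignores. It also assumes the secure flag is spelled exactly "true".

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// dsnInfo holds the DSN parts that matter for a quick sanity check.
// Hypothetical type, not taken from the clickpulse source.
type dsnInfo struct {
	User     string
	Host     string
	Port     string
	Database string
	Secure   bool
}

// parseDSN splits a clickhouse:// DSN into its components.
// Assumption: "secure=true" is the only spelling that enables TLS.
func parseDSN(dsn string) (dsnInfo, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return dsnInfo{}, err
	}
	if u.Scheme != "clickhouse" {
		return dsnInfo{}, fmt.Errorf("unexpected scheme %q", u.Scheme)
	}
	return dsnInfo{
		User:     u.User.Username(), // safe even when no user is given
		Host:     u.Hostname(),
		Port:     u.Port(),
		Database: strings.TrimPrefix(u.Path, "/"),
		Secure:   u.Query().Get("secure") == "true",
	}, nil
}

func main() {
	info, err := parseDSN("clickhouse://clickpulse:password@localhost:9000/default?secure=false")
	if err != nil {
		fmt.Println("invalid DSN:", err)
		return
	}
	fmt.Printf("%+v\n", info)
}
```

The password is deliberately left out of dsnInfo so that printing the result does not leak credentials into logs.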

Quick start

# Build
make build

# Run
export CLICKHOUSE_DSN="clickhouse://clickpulse@localhost:9000/default"
./bin/clickpulse serve

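# Docker
# Note: inside the container, localhost is the container itself.
# Point the DSN at a ClickHouse host the container can actually reach.
docker build -t clickpulse:dev .
docker run -e CLICKHOUSE_DSN="clickhouse://clickpulse@localhost:9000/default" -p 9188:9188 clickpulse:dev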

Metrics are served at http://localhost:9188/metrics, and the health check at http://localhost:9188/healthz.
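
A readiness probe or smoke test can poll the health endpoint. The following is a minimal sketch, assuming /healthz answers HTTP 200 when the exporter is healthy (the status code is an assumption, not something this README states). checkHealth and demo are hypothetical helpers, and demo uses a local stand-in server so the example runs without clickpulse.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkHealth requests baseURL + "/healthz" and treats any non-200 reply as unhealthy.
// Assumption: clickpulse signals health with HTTP 200.
func checkHealth(client *http.Client, baseURL string) error {
	resp, err := client.Get(baseURL + "/healthz")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("healthz returned %s", resp.Status)
	}
	return nil
}

// demo starts a stand-in server that mimics the /healthz endpoint,
// then runs checkHealth against it.
func demo() error {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	return checkHealth(client, srv.URL)
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("healthy")
}
```

Against a real instance, pass http://localhost:9188 as baseURL instead of the stand-in server's URL.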

The Helm PrometheusRule covers reachability, scrape errors, sustained replication queue backlog, missing-part replication failures, replica sync failures, replicated data loss, Kafka failures, object storage failures, query memory limits, ClickHouse guardrail rejections, merge backlog, stuck mutations, part explosions, disk fullness, and Keeper health.

Helm (Kubernetes)

helm upgrade --install clickpulse charts/clickpulse/ -f clickpulse-values.yaml -n clickpulse-system

Minimal clickpulse-values.yaml:

targets:
  - name: prod
    dsn: "clickhouse://clickpulse@clickhouse:9000/default?secure=true"

serviceMonitor:
  enabled: true
  labels:
    release: prometheus-operator

prometheusRule:
  enabled: true
  labels:
    release: prometheus-operator

systemd

sudo cp bin/clickpulse /usr/local/bin/
sudo cp deploy/clickpulse.service /etc/systemd/system/
sudo mkdir -p /etc/clickpulse
sudo cp deploy/clickpulse.env.example /etc/clickpulse/clickpulse.env
sudo chmod 600 /etc/clickpulse/clickpulse.env
# Edit CLICKHOUSE_DSN in /etc/clickpulse/clickpulse.env, then:
sudo systemctl daemon-reload
sudo systemctl enable --now clickpulse

Configuration

All configuration is via environment variables:

Variable                        Default     Description
CLICKHOUSE_DSN or DATABASE_URL  (required)  ClickHouse connection string
METRICS_PORT                    9188        Port for the HTTP metrics server
POLL_INTERVAL                   5s          How often to collect metrics
SLOW_QUERY_THRESHOLD            5s          Duration after which a query is counted as slow
REGRESSION_THRESHOLD            2.0         Mean time ratio above which a query is flagged as regressed
STMT_LIMIT                      50          Number of top queries to track per dimension
TELEGRAM_BOT_TOKEN              (disabled)  Telegram bot token for alerts
TELEGRAM_CHAT_ID                (disabled)  Telegram chat ID for alerts
ALERT_WEBHOOK_URL               (disabled)  Slack or generic webhook URL for alerts
ALERT_COOLDOWN                  5m          Minimum interval between repeated alerts
GRAFANA_URL                     (disabled)  Grafana base URL for anomaly annotations
GRAFANA_TOKEN                   (disabled)  Grafana service account token
GRAFANA_DASHBOARD_UID           (optional)  Scope annotations to a specific dashboard
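
To illustrate how environment-based defaults like these are typically resolved, here is a minimal sketch that reads four of the variables and falls back to the documented defaults. It is not the code in internal/config: the settings type and the envOr and loadSettings helpers are hypothetical, and it assumes durations use Go's time.ParseDuration syntax (5s, 5m), which matches the defaults listed above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// settings mirrors a subset of the documented variables (hypothetical type).
type settings struct {
	MetricsPort         int
	PollInterval        time.Duration
	SlowQueryThreshold  time.Duration
	RegressionThreshold float64
}

// envOr returns the value of key, or def when the variable is unset or empty.
func envOr(key, def string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return def
}

// loadSettings parses each variable and names the offending one on error.
func loadSettings() (settings, error) {
	var s settings
	var err error
	if s.MetricsPort, err = strconv.Atoi(envOr("METRICS_PORT", "9188")); err != nil {
		return s, fmt.Errorf("METRICS_PORT: %w", err)
	}
	if s.PollInterval, err = time.ParseDuration(envOr("POLL_INTERVAL", "5s")); err != nil {
		return s, fmt.Errorf("POLL_INTERVAL: %w", err)
	}
	if s.SlowQueryThreshold, err = time.ParseDuration(envOr("SLOW_QUERY_THRESHOLD", "5s")); err != nil {
		return s, fmt.Errorf("SLOW_QUERY_THRESHOLD: %w", err)
	}
	if s.RegressionThreshold, err = strconv.ParseFloat(envOr("REGRESSION_THRESHOLD", "2.0"), 64); err != nil {
		return s, fmt.Errorf("REGRESSION_THRESHOLD: %w", err)
	}
	return s, nil
}

func main() {
	s, err := loadSettings()
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```

Wrapping each parse error with the variable name keeps a misconfigured deployment easy to diagnose from a single log line.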

Architecture

cmd/clickpulse/main.go              CLI entry point (delegates to internal/cli)
internal/
  cli/                               Cobra commands: serve, version, status, doctor
  config/                            Environment-based configuration
  collector/                         Poll loop + collectors
    processes.go                     system.processes (active queries, memory, elapsed)
    merges.go                        system.merges (merge pressure, bytes/sec)
    mutations.go                     system.mutations (stuck mutations, parts remaining)
    replication.go                   system.replicas (queue size, lag, readonly)
    replication_queue.go             system.replication_queue (task age, retries, failures)
    parts.go                         system.parts (partition/table part counts, sizes)
    querylog.go                      system.query_log (mean time deltas, stateful)
    disks.go                         system.disks (free space, usage ratio)
    dictionaries.go                  system.dictionaries (load status, staleness)
    ddl.go                           system.distributed_ddl_queue (stuck DDL)
    discrepancy.go                   replication consistency checks
    keeper.go                        Keeper health via mntr
    server.go                        system.metrics + system.events
    querier.go                       Interface for testability
  keeper/                            ClickHouse Keeper / ZooKeeper health
  metrics/                           Prometheus metric definitions
  snapshot/                          Point-in-time cluster snapshot
  doctor/                            Connectivity and permission diagnostics
  alerter/                           Telegram, webhook alerting
  annotator/                         Grafana anomaly annotations
charts/clickpulse/                   Helm chart with ServiceMonitor + PrometheusRule
grafana/
  clickpulse-dashboard.json          Importable Grafana dashboard
deploy/
  clickpulse.service                 systemd unit file
  clickpulse.env.example             Environment file template
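
The layout above pairs a poll loop with per-table collectors and a query interface kept separate for testability. The sketch below shows one common way to arrange that in Go. Every name in it (Querier, fakeQuerier, collectActiveMerges, poll, runDemo) is hypothetical and chosen for illustration; the actual interface in querier.go may look quite different.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Querier is a hypothetical minimal interface: run a query that yields one number.
type Querier interface {
	QueryValue(ctx context.Context, query string) (float64, error)
}

// fakeQuerier returns canned values so collectors can be tested without ClickHouse.
type fakeQuerier map[string]float64

func (f fakeQuerier) QueryValue(_ context.Context, query string) (float64, error) {
	v, ok := f[query]
	if !ok {
		return 0, fmt.Errorf("no canned result for %q", query)
	}
	return v, nil
}

const activeMergesQuery = "SELECT count() FROM system.merges"

// collectActiveMerges reads a single value through the Querier.
func collectActiveMerges(ctx context.Context, q Querier) (float64, error) {
	return q.QueryValue(ctx, activeMergesQuery)
}

// poll runs the collector once per tick until n values are read or ctx ends.
func poll(ctx context.Context, q Querier, interval time.Duration, n int) ([]float64, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	out := make([]float64, 0, n)
	for len(out) < n {
		select {
		case <-ctx.Done():
			return out, ctx.Err()
		case <-ticker.C:
			v, err := collectActiveMerges(ctx, q)
			if err != nil {
				return out, err
			}
			out = append(out, v)
		}
	}
	return out, nil
}

// runDemo polls a fake backend twice with a short interval.
func runDemo() ([]float64, error) {
	q := fakeQuerier{activeMergesQuery: 3}
	return poll(context.Background(), q, 10*time.Millisecond, 2)
}

func main() {
	vals, err := runDemo()
	fmt.Println(vals, err)
}
```

Because collectors depend only on the interface, a map-backed fake is enough to exercise the poll loop in unit tests, with no database involved.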

Known limitations

  • Query fingerprints are truncated to 80 characters
  • No support for multiple ClickHouse clusters in a single process
  • Query regression detection requires system.query_log (enabled by default)
  • Keeper metrics require either ClickHouse Keeper or ZooKeeper access
  • Part count metrics are per-partition and per-table; the default part explosion alert remains partition-scoped
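
For context on the query regression item above, REGRESSION_THRESHOLD is described as a mean time ratio. The sketch below shows that ratio test in isolation. How clickpulse derives the baseline from system.query_log is not documented here, so the baselineMean parameter and the isRegressed helper are assumptions made for illustration.

```go
package main

import "fmt"

// isRegressed reports whether the current mean query time exceeds the
// baseline mean by more than the given ratio threshold.
// Assumption: a missing or zero baseline is never flagged.
func isRegressed(baselineMean, currentMean, threshold float64) bool {
	if baselineMean <= 0 {
		return false // no usable baseline yet
	}
	return currentMean/baselineMean > threshold
}

func main() {
	// Ratio 2.5 against the default threshold of 2.0.
	fmt.Println(isRegressed(100, 250, 2.0))
	// Ratio 1.5 against the default threshold of 2.0.
	fmt.Println(isRegressed(100, 150, 2.0))
}
```

With the default threshold of 2.0, a query whose mean time grows from 100 ms to 250 ms is flagged, while growth to 150 ms is not.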

Roadmap

  • Core scaffold and CLI (serve, version, status, doctor)
  • Processes collector (system.processes)
  • Merge pressure collector (system.merges)
  • Mutations collector (system.mutations)
  • Replication collector (system.replicas)
  • Replication queue collector (system.replication_queue)
  • Parts health collector (system.parts)
  • Query regression detection (system.query_log)
  • Disk/storage collector (system.disks)
  • Keeper health collector
  • Distributed DDL collector
  • Dictionaries collector
  • Server metrics/events collector
  • Grafana dashboard
  • Helm chart with ServiceMonitor + PrometheusRule
  • Built-in alerting (Telegram, Slack)
  • Anomaly annotations

License

MIT
