Skip to content

viragtripathi/cockroachdb-analyzer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CockroachDB Analyzer

Modular diagnostic and performance analyzer for CockroachDB clusters. Works with CockroachDB via SQL (crdb_internal tables) or HTTP admin API, and can also analyze offline JSON range dumps.

Features

  • Hot Ranges -- Find the largest or busiest ranges in the cluster
  • Hot Nodes -- Identify overloaded nodes by lease count and data size
  • Data Skew -- Detect disproportionately large ranges
  • Table Stats -- Per-table breakdown of range count and total size
  • Contention -- Surface lock contention events and slow statements
  • Range Details -- Inspect specific ranges by ID
  • Index Usage -- Find unused and hottest indexes
  • Lease Balance -- Check replica/lease distribution across nodes
  • Statement Fingerprints -- Identify slowest queries, most executed, highest rows read
  • Cluster Health -- Node liveness, capacity, version skew detection
  • Node Hotspot -- Deep-dive into why a specific node is running hot
  • Rebalance Status -- Check if cluster rebalancing is complete with per-node distribution diagnostics
  • Job Status -- Detect stuck GC jobs, coordinator imbalance, and failed jobs
  • Statement Errors -- Failing queries, retry errors, per-node failure distribution, contention correlation
  • Snapshot / History / Compare -- Save point-in-time snapshots and diff them later

All commands support table, JSON, and CSV output formats.

Requirements

  • Python 3.10+
  • Network access to a CockroachDB cluster (SQL port 26257 and/or HTTP admin port 8080)

Installation

This project uses uv for fast, reproducible dependency management.

# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/cockroachdb/cockroachdb-analyzer.git
cd cockroachdb-analyzer
uv sync            # installs runtime deps into .venv
uv sync --dev      # includes dev tools (ruff, mypy, pytest)

The crdb-analyzer CLI is available inside the managed venv. Run it with:

uv run crdb-analyzer --help
Alternative: install with pip
pip install -e .
pip install -e ".[dev]"   # with dev tools

Quick Start

Set your connection URL

# Local insecure cluster
export CRDB_SQL_URL="postgresql://root@localhost:26257/defaultdb?sslmode=disable"

# CockroachDB Cloud (TLS)
export CRDB_SQL_URL="postgresql://myuser:changeme@host:26257/mydb?sslmode=verify-full&sslrootcert=/path/to/ca.crt"

Analyzer Commands

# Top 10 ranges by size
crdb-analyzer hot-ranges --limit 10

# Node load distribution
crdb-analyzer hot-nodes

# Data skew detection
crdb-analyzer data-skew --limit 20

# Per-table range breakdown (filter by database/table)
crdb-analyzer table-stats --database mydb --table users

# Contention and slow statements
crdb-analyzer contention --limit 10

# Unused and hot indexes
crdb-analyzer index-usage --limit 30

# Replica/lease balance across nodes
crdb-analyzer lease-balance

# Slow queries, most executed, highest rows read
crdb-analyzer stmt-fingerprints --limit 20

# Cluster health: node liveness, capacity, version skew
crdb-analyzer cluster-health

# Deep-dive: why is node 2 running hot?
crdb-analyzer node-hotspot --node-id 2

# Check if cluster rebalancing is complete
crdb-analyzer rebalance-status

# With a custom balance threshold (default: 5%)
crdb-analyzer rebalance-status --balance-threshold 10

# Detect stuck GC jobs, coordinator imbalance, failed jobs
crdb-analyzer job-status

# Statement failures, retries, and contention errors (last 1h default)
crdb-analyzer stmt-errors

# Look back further
crdb-analyzer stmt-errors --since 6h --limit 50

Diagnosing a Hot Node

When you see one node with more leases, replicas, or data than others, node-hotspot explains why:

crdb-analyzer node-hotspot --node-id 2

This shows:

  • Node vs Cluster Average -- leases, replicas, and data size compared to the mean
  • Top Tables by Lease Count -- which tables have disproportionately many leases on this node
  • Largest Ranges -- the biggest ranges whose lease sits on this node
  • Store Capacity -- disk usage across all nodes to spot hardware asymmetry
  • Zone Configs -- any lease preferences or constraints that could be pinning to this node
  • Locality Match -- whether the node's locality matches any zone config lease preferences

Checking Rebalance Status

crdb-analyzer rebalance-status

This checks:

  • Replication Stats -- under/over-replicated and unavailable ranges (must all be 0)
  • Store Balance -- range count spread across stores (configurable threshold, default 5%)
  • Per-Node Range Distribution -- expected vs actual replicas per node with RF breakdown, flags OVER/UNDER nodes
  • Rebalance Direction -- net add/remove voter events per store in the last hour
  • Rangelog Activity -- detects active add_voter/remove_voter pairs (rebalancing) vs add_voter-only (initial placement)
  • Cluster Settings -- shows current values and adjustment commands for kv.snapshot_rebalance.max_rate, kv.range_split.load_qps_threshold, and range_max_bytes

Works on self-hosted and CockroachDB Cloud clusters (virtual clusters get a reduced view).

Detecting Stuck Jobs

crdb-analyzer job-status

This detects:

  • Schema Change GC backlog -- stuck "waiting for MVCC GC" jobs from repeated TRUNCATE TABLE operations that occupy dead ranges and inflate node storage
  • Coordinator imbalance -- one node coordinating disproportionately many jobs (adds CPU/IO load)
  • Failed/reverting jobs -- recent job failures with error details
  • GC TTL configuration -- shows gc.ttlseconds per zone with the exact ALTER ... CONFIGURE ZONE command to adjust it

Getting Range IDs for range-details

Range IDs appear in the output of hot-ranges, data-skew, and table-stats. Run one of those first, then use the range_id column values:

# Step 1: find interesting ranges
crdb-analyzer hot-ranges --limit 5
# Output shows range_id column: 16895, 18879, 14166, ...

# Step 2: drill into specific ranges
crdb-analyzer range-details --range-ids 16895,18879,14166

Historical Analysis: Snapshot, History, Compare

Any analysis can be saved as a point-in-time snapshot for later review or comparison.

# Save a snapshot automatically when running any analyzer
crdb-analyzer hot-nodes --save
crdb-analyzer hot-ranges --limit 20 --save

# Or explicitly capture a snapshot
crdb-analyzer snapshot hot-nodes
crdb-analyzer snapshot cluster-health

# List all snapshots
crdb-analyzer history

# Filter by type or time window
crdb-analyzer history --type hot-nodes
crdb-analyzer history --since 2h
crdb-analyzer history --since 1d

# View a specific snapshot
crdb-analyzer history --show abc123def456

# Compare two snapshots (shows added/removed/changed rows)
crdb-analyzer compare abc123 def456

Snapshots are stored in SQLite by default (~/.crdb-analyzer/snapshots.db). To store them in the CockroachDB cluster instead:

crdb-analyzer --snapshot-store crdb snapshot hot-nodes
crdb-analyzer --snapshot-store crdb history --type hot-nodes

Continuous Monitoring with the Daemon

The daemon command runs snapshot cycles on a schedule so you always have historical data to look back on. It captures all (or selected) analyses every N seconds and optionally prunes old snapshots.

# Default: all 9 analyses every 15 minutes
crdb-analyzer daemon

# Every 5 minutes, only hot-ranges and hot-nodes
crdb-analyzer daemon --interval 300 --analyses hot-ranges,hot-nodes

# Every 10 minutes, keep only the last 7 days of data
crdb-analyzer daemon --interval 600 --retention-days 7

# Track specific nodes for hotspot analysis each cycle
crdb-analyzer daemon --node-ids 1,2,3

# Combine: every 5 min, track nodes 2 and 3, keep 14 days
crdb-analyzer daemon --interval 300 --node-ids 2,3 --retention-days 14

# Store snapshots in CockroachDB instead of local SQLite
crdb-analyzer --snapshot-store crdb daemon --interval 900

The daemon prints a summary each cycle and responds to Ctrl+C / SIGTERM for graceful shutdown.

Running as a background service (systemd example):

# /etc/systemd/system/crdb-analyzer.service
[Unit]
Description=CockroachDB Analyzer Daemon
After=network.target

[Service]
Type=simple
EnvironmentFile=/etc/crdb-analyzer/env
ExecStart=/usr/local/bin/crdb-analyzer daemon --interval 900 --retention-days 30
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target

Create the environment file (keep credentials out of unit files):

# /etc/crdb-analyzer/env
CRDB_SQL_URL=postgresql://myuser:changeme@host:26257/defaultdb?sslmode=verify-full
sudo systemctl enable --now crdb-analyzer

Running with nohup:

nohup crdb-analyzer daemon --interval 900 --retention-days 14 > /var/log/crdb-analyzer.log 2>&1 &

Once the daemon has been collecting data, you can look back at any point:

# What did hot-ranges look like 24 hours ago?
crdb-analyzer history --type hot-ranges --since 24h

# View a specific snapshot
crdb-analyzer history --show <snapshot_id>

# Compare morning vs evening
crdb-analyzer compare <morning_id> <evening_id>

Output formats

crdb-analyzer hot-ranges --limit 5                    # table (default)
crdb-analyzer --format json hot-nodes                  # JSON
crdb-analyzer --format csv data-skew --limit 100 > skew.csv  # CSV

Pass URL directly

crdb-analyzer --sql-url "postgresql://root@localhost:26257/defaultdb?sslmode=disable" hot-ranges

Analyze offline JSON range dumps

crdb-analyzer hot-ranges --from-file ranges.json
crdb-analyzer hot-nodes --from-file ranges.json
crdb-analyzer data-skew --from-file ranges.json

HTTP admin API mode

# Self-hosted (insecure)
crdb-analyzer --admin-url "http://localhost:8080" hot-ranges

# With authentication (CLI flags)
crdb-analyzer \
  --admin-url "https://your-cluster:8080" \
  --admin-user "myuser" \
  --admin-password "changeme" \
  hot-ranges

# With authentication (environment variables)
export CRDB_ADMIN_USER="myuser"
export CRDB_ADMIN_PASSWORD="changeme"
crdb-analyzer --admin-url "https://your-cluster:8080" hot-ranges

Credentials can also be set in the YAML config file (see below).

Note: CockroachDB Cloud admin UIs often use session-based auth which may not work with basic auth. For Cloud clusters, prefer --sql-url which all analyzers support.

YAML config file

Create ~/.crdb-analyzer.yaml or ./crdb-analyzer.yaml:

sql_url: "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
admin_url: "http://localhost:8080"
admin_user: "myuser"        # optional
admin_password: "changeme"  # optional
ca_cert: "/path/to/ca.crt"  # optional, for TLS
client_cert: "/path/to/client.crt"  # optional
client_key: "/path/to/client.key"   # optional
timeout: 30

Debug logging

crdb-analyzer -v hot-ranges --limit 5

All Commands

Command Description
hot-ranges Find hottest ranges by size (SQL) or QPS (API/file)
hot-nodes Show load distribution across nodes
data-skew Detect data skew by finding the largest ranges
table-stats Per-table range distribution and size breakdown
contention Lock contention and slow statements
range-details Inspect specific ranges by ID
index-usage Unused and hot indexes
lease-balance Replica/lease distribution across nodes
stmt-fingerprints Slow queries, most executed, highest rows read
cluster-health Node liveness, capacity, version skew
node-hotspot Diagnose why a specific node is running hot
rebalance-status Check if cluster rebalancing is complete
job-status Detect stuck GC jobs, coordinator imbalance
stmt-errors Failing queries, retries, contention errors
snapshot Capture point-in-time snapshot of any analysis
history Browse and view historical snapshots
compare Diff two snapshots side by side
daemon Run continuous snapshots on a schedule

Architecture

crdb_analyzer/
├── cli.py                 # Click CLI with 19 commands
├── config.py              # Config resolution (CLI > env > YAML file)
├── retry.py               # CockroachDB-aware retry with exponential backoff
├── clients/
│   ├── http_client.py     # HTTP client for /_status/ranges, /hotranges, etc.
│   └── sql_client.py      # psycopg 3 client with retry and reconnect
├── analyzers/
│   ├── base.py            # Base class with shared range data extraction
│   ├── hot_ranges.py      # Top ranges by size (SQL) or QPS (API)
│   ├── hot_nodes.py       # Per-node load aggregation
│   ├── data_skew.py       # Largest ranges detection
│   ├── table_stats.py     # Per-table/index breakdown
│   ├── contention.py      # Lock contention + slow statements
│   ├── index_usage.py     # Unused and hot indexes
│   ├── lease_balance.py   # Replica/lease distribution
│   ├── stmt_fingerprints.py # Statement fingerprint analysis
│   ├── cluster_health.py  # Cluster health overview
│   ├── node_hotspot.py   # Per-node hotspot deep-dive
│   ├── rebalance_status.py # Rebalance completion check
│   ├── job_status.py     # Stuck jobs, GC backlog, coordinator imbalance
│   └── stmt_errors.py   # Statement failures, retries, contention
├── storage/
│   ├── base.py            # Abstract snapshot store interface
│   ├── sqlite_store.py    # SQLite backend (default, local)
│   └── crdb_store.py      # CockroachDB backend (centralized)
└── formatters/
    └── output.py          # Table, JSON, CSV output

Reliability

  • Automatic retry with exponential backoff and jitter on CockroachDB serialization failures (40001), connection resets, and timeouts
  • Auto-reconnect on stale or broken SQL connections

Development

uv sync --dev

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Type check
uv run mypy crdb_analyzer/

# Run a specific command
uv run crdb-analyzer hot-ranges --limit 5

License

Apache License 2.0 - see LICENSE for details.

Acknowledgements

This project was inspired by Andrew Deally's article Troubleshooting Hot Ranges — CockroachDB.

About

Modular diagnostic and performance analyzer for CockroachDB

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages