Commit 5270fb9

chore: Modernize the Apache CouchDB mixin (#1522)
* modernize the apache couchdb mixin
* fix links and try to fix lint
* pr feedback
* make fmt
* fix units on histogram
* fix a couple of issues caught in PR review
* fix lint with selector
* fix lint with selector
* fix some issues due to recent commits; interval/legends
* make fmt
* address PR feedback minus the description implementation
* use couchdb_database_reads_total as varMetric; filter out zero values; fix log link in dashboards
* make fmt
* withIncludeVars(true)
* remove public mixin setting of the filteringSelector
* merge from main; re-build the mixin
1 parent a4f2a73 commit 5270fb9

21 files changed: +2741 -5780 lines changed

apache-couchdb-mixin/README.md

Lines changed: 32 additions & 3 deletions

````diff
@@ -18,7 +18,7 @@ and the following alerts:
 - CouchDBReplicatorJobsCrashing
 - CouchDBReplicatorChangesQueuesDying
 - CouchDBReplicatorConnectionOwnersCrashing
-- CouchDBReplicatorConnectionWorkersCrashing
+- CouchDBReplicatorWorkersCrashing
 
 ## Apache CouchDB Overview
 
@@ -58,6 +58,35 @@ scrape_configs:
           __path__: /var/log/couchdb/couchdb.log
 ```
 
+## CouchDB Version Compatibility
+
+This mixin supports **Apache CouchDB 3.3.1 and later** and handles differences in metric naming conventions between versions.
+
+### Metric Naming Changes
+
+Between CouchDB 3.3.0 and 3.5.0, there was a change in how some metrics are named. Specifically, some metrics that previously had a `_total` suffix no longer include it in newer versions:
+
+- **CouchDB 3.3.0 and earlier**: `couchdb_open_os_files_total`
+- **CouchDB 3.5.0 and later**: `couchdb_open_os_files`
+
+### How the Mixin Handles This
+
+By default, the mixin is configured to work with both naming conventions automatically through the `metricsSource` configuration in `config.libsonnet`. This ensures dashboards and alerts work correctly regardless of which CouchDB version you're running.
+
+If you need to customize this behavior, you can modify the `metricsSource` in your `config.libsonnet`:
+
+```jsonnet
+{
+  _config+:: {
+    // For CouchDB 3.5.0+ only (no _total suffix):
+    metricsSource: ['prometheus'],
+
+    // OR, for backwards compatibility with both versions:
+    // metricsSource: ['prometheus', 'prometheusWithTotal'],
+  },
+}
+```
+
 ## Alerts Overview
 
 - CouchDBUnhealthyCluster: At least one of the nodes in a cluster is reporting the cluster as being unstable.
@@ -68,8 +97,8 @@ scrape_configs:
 - CouchDBManyReplicatorJobsPending: There is a high number of replicator jobs pending for a node.
 - CouchDBReplicatorJobsCrashing: There are replicator jobs crashing for a node.
 - CouchDBReplicatorChangesQueuesDying: There are replicator changes queue process deaths for a node.
-- CouchDBReplicatorConnectionOwnersCrashing: There are replicator connection owner process crashes for a node.
-- CouchDBReplicatorConnectionWorkersCrashing: There are replicator connection worker process crashes for a node.
+- CouchDBReplicatorOwnersCrashing: There are replicator connection owner process crashes for a node.
+- CouchDBReplicatorWorkersCrashing: There are replicator connection worker process crashes for a node.
 
 ## Install tools
 
````
apache-couchdb-mixin/alerts/alerts.libsonnet renamed to apache-couchdb-mixin/alerts.libsonnet

Lines changed: 33 additions & 33 deletions

```diff
@@ -1,14 +1,14 @@
 {
-  prometheusAlerts+:: {
+  new(this): {
     groups+: [
       {
         name: 'ApacheCouchDBAlerts',
         rules: [
           {
             alert: 'CouchDBUnhealthyCluster',
             expr: |||
-              min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable) < %(alertsCriticalClusterIsUnstable5m)s
-            ||| % $._config,
+              min by(job, couchdb_cluster) (couchdb_couch_replicator_cluster_is_stable{%(filteringSelector)s}) < %(alertsCriticalClusterIsUnstable5m)s
+            ||| % this.config,
             'for': '5m',
             labels: {
               severity: 'critical',
@@ -19,14 +19,14 @@
               (
                 '{{$labels.couchdb_cluster}} has reported a value of {{ printf "%%.0f" $value }} for its stability over the last 5 minutes, ' +
                 'which is below the threshold of %(alertsCriticalClusterIsUnstable5m)s.'
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
             alert: 'CouchDBHigh4xxResponseCodes',
             expr: |||
-              sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"4.*"}[5m])) > %(alertsWarning4xxResponseCodes5m)s
-            ||| % $._config,
+              sum by(job, instance) (increase(couchdb_httpd_status_codes{%(filteringSelector)s}[5m])) > %(alertsWarning4xxResponseCodes5m)s
+            ||| % (this.config { filteringSelector: if this.config.filteringSelector != '' then this.config.filteringSelector + ',code=~"4.."' else 'code=~"4.."' }),
             'for': '5m',
             labels: {
               severity: 'warning',
@@ -37,14 +37,14 @@
               (
                 '{{ printf "%%.0f" $value }} 4xx responses have been detected over the last 5 minutes on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsWarning4xxResponseCodes5m)s.'
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
             alert: 'CouchDBHigh5xxResponseCodes',
             expr: |||
-              sum by(job, instance) (increase(couchdb_httpd_status_codes{code=~"5.*"}[5m])) > %(alertsCritical5xxResponseCodes5m)s
-            ||| % $._config,
+              sum by(job, instance) (increase(couchdb_httpd_status_codes{%(filteringSelector)s}[5m])) > %(alertsCritical5xxResponseCodes5m)s
+            ||| % (this.config { filteringSelector: if this.config.filteringSelector != '' then this.config.filteringSelector + ',code=~"5.."' else 'code=~"5.."' }),
             'for': '5m',
             labels: {
               severity: 'critical',
@@ -55,14 +55,14 @@
               (
                 '{{ printf "%%.0f" $value }} 5xx responses have been detected over the last 5 minutes on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsCritical5xxResponseCodes5m)s.'
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
             alert: 'CouchDBModerateRequestLatency',
             expr: |||
-              sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) > %(alertsWarningRequestLatency5m)s
-            ||| % $._config,
+              sum by(job, instance) (rate(couchdb_request_time_seconds_sum{%(filteringSelector)s}[5m]) / rate(couchdb_request_time_seconds_count{%(filteringSelector)s}[5m])) * 1000 > %(alertsWarningRequestLatency5m)s
+            ||| % this.config,
             'for': '5m',
             labels: {
               severity: 'warning',
```
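The reworked latency expressions replace a lifetime average (raw `_sum / _count` of the counters) with a windowed one: `rate(..._sum[5m]) / rate(..._count[5m])` yields seconds per request over the last five minutes, and the `* 1000` converts to milliseconds to match the ms-based thresholds. A minimal Python sketch of the arithmetic (sample values are hypothetical):

```python
def avg_request_latency_ms(sum_increase_s: float, count_increase: float, window_s: float = 300.0) -> float:
    """Mirrors rate(couchdb_request_time_seconds_sum[5m]) /
    rate(couchdb_request_time_seconds_count[5m]) * 1000.
    rate() divides each counter increase by the window, so the windows
    cancel, leaving (seconds of request time) / (number of requests)."""
    sum_rate = sum_increase_s / window_s    # request-seconds accrued per second
    count_rate = count_increase / window_s  # requests per second
    return (sum_rate / count_rate) * 1000.0  # milliseconds per request

# 12 request-seconds spread over 40 requests in the 5m window -> 300 ms average
print(avg_request_latency_ms(12.0, 40.0))  # 300.0
```

Because the windows cancel, only the ratio of the two increases matters; the `rate()` form simply makes the average reflect recent traffic instead of the process's entire lifetime.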
```diff
@@ -73,14 +73,14 @@
               (
                 'An average of {{ printf "%%.0f" $value }}ms of request latency has occurred over the last 5 minutes on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsWarningRequestLatency5m)sms. '
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
             alert: 'CouchDBHighRequestLatency',
             expr: |||
-              sum by(job, instance) (couchdb_request_time_seconds_sum / couchdb_request_time_seconds_count) > %(alertsCriticalRequestLatency5m)s
-            ||| % $._config,
+              sum by(job, instance) (rate(couchdb_request_time_seconds_sum{%(filteringSelector)s}[5m]) / rate(couchdb_request_time_seconds_count{%(filteringSelector)s}[5m])) * 1000 > %(alertsCriticalRequestLatency5m)s
+            ||| % this.config,
             'for': '5m',
             labels: {
               severity: 'critical',
@@ -91,14 +91,14 @@
               (
                 'An average of {{ printf "%%.0f" $value }}ms of request latency has occurred over the last 5 minutes on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsCriticalRequestLatency5m)sms. '
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
             alert: 'CouchDBManyReplicatorJobsPending',
             expr: |||
-              sum by(job, instance) (couchdb_couch_replicator_jobs_pending) > %(alertsWarningPendingReplicatorJobs5m)s
-            ||| % $._config,
+              sum by(job, instance) (couchdb_couch_replicator_jobs_pending{%(filteringSelector)s}) > %(alertsWarningPendingReplicatorJobs5m)s
+            ||| % this.config,
             'for': '5m',
             labels: {
               severity: 'warning',
@@ -109,14 +109,14 @@
               (
                 '{{ printf "%%.0f" $value }} replicator jobs are pending on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsWarningPendingReplicatorJobs5m)s. '
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
             alert: 'CouchDBReplicatorJobsCrashing',
             expr: |||
-              sum by(job, instance) (increase(couchdb_couch_replicator_jobs_crashes_total[5m])) > %(alertsCriticalCrashingReplicatorJobs5m)s
-            ||| % $._config,
+              sum by(job, instance) (increase(couchdb_couch_replicator_jobs_crashes_total{%(filteringSelector)s}[5m])) > %(alertsCriticalCrashingReplicatorJobs5m)s
+            ||| % this.config,
             'for': '5m',
             labels: {
               severity: 'critical',
@@ -127,14 +127,14 @@
               (
                 '{{ printf "%%.0f" $value }} replicator jobs have crashed over the last 5 minutes on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsCriticalCrashingReplicatorJobs5m)s. '
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
             alert: 'CouchDBReplicatorChangesQueuesDying',
             expr: |||
-              sum by(job, instance) (increase(couchdb_couch_replicator_changes_queue_deaths_total[5m])) > %(alertsWarningDyingReplicatorChangesQueues5m)s
-            ||| % $._config,
+              sum by(job, instance) (increase(couchdb_couch_replicator_changes_queue_deaths_total{%(filteringSelector)s}[5m])) > %(alertsWarningDyingReplicatorChangesQueues5m)s
+            ||| % this.config,
             'for': '5m',
             labels: {
               severity: 'warning',
@@ -145,14 +145,14 @@
               (
                 '{{ printf "%%.0f" $value }} replicator changes queue processes have died over the last 5 minutes on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsWarningDyingReplicatorChangesQueues5m)s. '
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
-            alert: 'CouchDBReplicatorConnectionOwnersCrashing',
+            alert: 'CouchDBReplicatorOwnersCrashing',
             expr: |||
-              sum by(job, instance) (increase(couchdb_couch_replicator_connection_owner_crashes_total[5m])) > %(alertsWarningCrashingReplicatorConnectionOwners5m)s
-            ||| % $._config,
+              sum by(job, instance) (increase(couchdb_couch_replicator_connection_owner_crashes_total{%(filteringSelector)s}[5m])) > %(alertsWarningCrashingReplicatorConnectionOwners5m)s
+            ||| % this.config,
             'for': '5m',
             labels: {
               severity: 'warning',
@@ -163,14 +163,14 @@
               (
                 '{{ printf "%%.0f" $value }} replicator connection owner processes have crashed over the last 5 minutes on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsWarningCrashingReplicatorConnectionOwners5m)s. '
-              ) % $._config,
+              ) % this.config,
             },
           },
           {
-            alert: 'CouchDBReplicatorConnectionWorkersCrashing',
+            alert: 'CouchDBReplicatorWorkersCrashing',
             expr: |||
-              sum by(job, instance) (increase(couchdb_couch_replicator_connection_worker_crashes_total[5m])) > %(alertsWarningCrashingReplicatorConnectionWorkers5m)s
-            ||| % $._config,
+              sum by(job, instance) (increase(couchdb_couch_replicator_connection_worker_crashes_total{%(filteringSelector)s}[5m])) > %(alertsWarningCrashingReplicatorConnectionWorkers5m)s
+            ||| % this.config,
             'for': '5m',
             labels: {
               severity: 'warning',
@@ -181,7 +181,7 @@
               (
                 '{{ printf "%%.0f" $value }} replicator connection worker processes have crashed over the last 5 minutes on {{$labels.instance}}, ' +
                 'which is above the threshold of %(alertsWarningCrashingReplicatorConnectionWorkers5m)s. '
-              ) % $._config,
+              ) % this.config,
             },
           },
         ],
```
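The Jsonnet conditional in the 4xx/5xx alert expressions merges the static `filteringSelector` with a status-code matcher, guarding against a dangling comma when no static filter is set. A small Python sketch of that logic (the helper name is hypothetical, not part of the mixin):

```python
def merge_selectors(filtering_selector: str, code_matcher: str) -> str:
    """Mirrors the Jsonnet conditional: append the status-code matcher to a
    non-empty filteringSelector, otherwise use the matcher on its own, so the
    rendered PromQL never contains a selector like {,code=~"4.."}."""
    if filtering_selector != "":
        return filtering_selector + "," + code_matcher
    return code_matcher

print(merge_selectors('job="couchdb"', 'code=~"4.."'))  # job="couchdb",code=~"4.."
print(merge_selectors('', 'code=~"5.."'))               # code=~"5.."
```

The merged string is then substituted for `%(filteringSelector)s` inside the braces of `couchdb_httpd_status_codes{...}`.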
apache-couchdb-mixin/config.libsonnet

Lines changed: 43 additions & 20 deletions

```diff
@@ -1,26 +1,49 @@
 {
-  _config+:: {
-    enableMultiCluster: false,
-    couchDBSelector: if self.enableMultiCluster then 'job=~"$job", cluster=~"$cluster"' else 'job=~"$job"',
-    multiClusterSelector: 'job=~"$job"',
+  local this = self,
+  filteringSelector: '',  // set to apply static filters to all queries and alerts, i.e. job="bar"
+  groupLabels: ['job', 'couchdb_cluster', 'cluster'],
+  logLabels: ['job', 'cluster', 'instance'],
+  instanceLabels: ['instance'],
 
-    dashboardTags: ['apache-couchdb-mixin'],
-    dashboardPeriod: 'now-1h',
-    dashboardTimezone: 'default',
-    dashboardRefresh: '1m',
+  dashboardTags: ['apache-couchdb-mixin'],
+  uid: 'couchdb',
+  dashboardNamePrefix: 'Apache CouchDB',
+  dashboardPeriod: 'now-1h',
+  dashboardTimezone: 'default',
+  dashboardRefresh: '1m',
+  metricsSource: [
+    'prometheus',
+    /*
+     * prometheusWithTotal is kept for backwards compatibility, as some metrics carry a _total suffix in earlier versions of CouchDB,
+     * i.e. couchdb_open_os_files_total => couchdb_open_os_files.
+     * This ensures that the signals for metrics suffixed with _total continue to work as expected.
+     * This was identified as a noticeable change from 3.3.0 to 3.5.0.
+     */
+    'prometheusWithTotal',
+  ],
 
-    //alert thresholds
-    alertsCriticalClusterIsUnstable5m: 1,  // 1 is stable
-    alertsWarning4xxResponseCodes5m: 5,
-    alertsCritical5xxResponseCodes5m: 0,
-    alertsWarningRequestLatency5m: 500,  // ms
-    alertsCriticalRequestLatency5m: 1000,  // ms
-    alertsWarningPendingReplicatorJobs5m: 10,
-    alertsCriticalCrashingReplicatorJobs5m: 0,
-    alertsWarningDyingReplicatorChangesQueues5m: 0,
-    alertsWarningCrashingReplicatorConnectionOwners5m: 0,
-    alertsWarningCrashingReplicatorConnectionWorkers5m: 0,
+  // Logging configuration
+  enableLokiLogs: true,
+  extraLogLabels: ['level'],
+  logsVolumeGroupBy: 'level',
+  showLogsVolume: true,
 
-    enableLokiLogs: true,
+  // alert thresholds
+  alertsCriticalClusterIsUnstable5m: 1,  // 1 is stable
+  alertsWarning4xxResponseCodes5m: 5,
+  alertsCritical5xxResponseCodes5m: 0,
+  alertsWarningRequestLatency5m: 500,  // ms
+  alertsCriticalRequestLatency5m: 1000,  // ms
+  alertsWarningPendingReplicatorJobs5m: 10,
+  alertsCriticalCrashingReplicatorJobs5m: 0,
+  alertsWarningDyingReplicatorChangesQueues5m: 0,
+  alertsWarningCrashingReplicatorConnectionOwners5m: 0,
+  alertsWarningCrashingReplicatorConnectionWorkers5m: 0,
+
+  // Signals configuration
+  signals+: {
+    overview: (import './signals/overview.libsonnet')(this),
+    nodes: (import './signals/nodes.libsonnet')(this),
+    replicator: (import './signals/replicator.libsonnet')(this),
   },
 }
```
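The two `metricsSource` entries exist because the same logical metric can be exposed under two names depending on the CouchDB version. A rough Python sketch of the idea (the function and return shape are illustrative, not part of the mixin):

```python
def metric_name_variants(base: str, metrics_source: list) -> list:
    """For a base name like 'couchdb_open_os_files', return every name the
    configured sources should match: the bare name for CouchDB 3.5.0+
    ('prometheus') and the _total-suffixed name for 3.3.x and earlier
    ('prometheusWithTotal')."""
    variants = []
    if "prometheus" in metrics_source:
        variants.append(base)              # newer naming, no suffix
    if "prometheusWithTotal" in metrics_source:
        variants.append(base + "_total")   # older naming, with suffix
    return variants

print(metric_name_variants("couchdb_open_os_files", ["prometheus", "prometheusWithTotal"]))
# ['couchdb_open_os_files', 'couchdb_open_os_files_total']
```

Declaring both sources by default lets dashboards and alerts match whichever name a given CouchDB version emits.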
