New Threshold Syntax

Draft Proposal for Nagios Plugins Development Guidelines and Threshold Syntax Specification 2.0

This page documents and discusses proposed new syntax for Nagios Thresholds. The current method for defining thresholds via the command line is inconsistent and difficult to interpret. This proposal suggests a different way of specifying thresholds, which will also make changes to performance data returned.

This is derived from a proposal posted at nagiosplugins.org/rfc/new_threshold_syntax incorporating additional ideas posted on nagios-plugins mail list. This wiki is open to public editing but please post on nagios-plugins mail list when you add or make significant changes here.

Problem

The current method of specifying thresholds is confusing when there are different checks required. For instance, in check_http, to check page size and time, you can specify -w {warn time}, -c {crit time}, -m {minpagesize}[:maxpagesize], -M {maxage of document}.

Also, note the ways of defining the range are inconsistent. Some alert above the value (time, maxage), some alert below the value (pagesize). This is inconsistent for the same plugin!

So, to check that a web page is returned within 5 seconds, the minimum page size is 10K and the maximum age is 1 day, you would invoke:

check_http -H $HOSTADDRESS$ -c 5 -m 10000 -M 1d

Furthermore, the current specification for ranges in the developer guidelines fails the "obviousness" test: a range of 3:5 will alert if the value is outside that range, rather than inside as you would expect.

Also, the performance data returned by check_http is always time and size. Perhaps you want only time, or you want age as well.

Proposal

Thresholds

This document proposes that threshold arguments are to be specified like:

--threshold={threshold definition} --th={threshold definition}

The threshold definition is a subgetopt format with a list of keywords followed by '=' or ':' and then data. This has a form:

--threshold metric={metric},name={name},label={label},ok={range},warn={range},crit={range},absent=critical|warning|ok|unknown,display=yes|no,perf=yes|no,unit={unit},prefix={SI prefix}

OR

--threshold=metric:{metric},name:{name},label:{label},ok:{range},warn:{range},crit:{range},absent:critical|warning|ok|unknown,display:yes|no,perf:yes|no,unit:{unit},prefix:{SI prefix}

Where:

all keywords (ok, warn, crit, unit, etc) are optional but at least one must be present
all keywords are case-insensitive and may appear as either upper or lower case
both '=' and ':' are acceptable separators between a keyword and its value. Using ':' may be preferred for plugins whose command-line option libraries can not deal with --option=keyword=data cleanly. Use of '=' is preferred for the '--option keyword=data' form of command line specification
plugins need not support all keywords from this specification, and there may be keywords supported by the plugin that are not part of this specification. If any keywords not understood by the plugin are used, the return status should be UNKNOWN
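
As an illustration of the rules above, here is a minimal Python sketch of how a plugin might split a threshold definition into keyword/value pairs. The names parse_threshold, KNOWN_KEYWORDS and UnknownKeyword are hypothetical and not part of this proposal. The sketch assumes values contain no quoted commas and does not handle the long level names ('warning', 'critical') or their one-letter forms.

```python
import re

# Keywords from this proposal; a real plugin would list only those it supports.
KNOWN_KEYWORDS = {
    "metric", "name", "regex", "label", "perf_label", "ok", "warn", "crit",
    "aok", "awarn", "acrit", "absent", "display", "perf", "order",
    "prefix", "unit", "uom",
}


class UnknownKeyword(ValueError):
    """Raised so the caller can exit with Nagios status UNKNOWN."""


def parse_threshold(definition):
    """Split 'key=value,key:value,...' into a list of (keyword, value) pairs.

    A list (not a dict) is returned because levels may be repeated.
    """
    pairs = []
    for item in definition.split(","):
        # The keyword ends at the first '=' or ':'; both separators are allowed.
        match = re.match(r"^([A-Za-z_]+)[=:](.*)$", item)
        if match is None:
            raise ValueError(f"malformed item: {item!r}")
        keyword = match.group(1).lower()  # keywords are case-insensitive
        if keyword not in KNOWN_KEYWORDS:
            raise UnknownKeyword(keyword)
        pairs.append((keyword, match.group(2)))
    return pairs


if __name__ == "__main__":
    print(parse_threshold("metric=time,OK:0..5,warn=5..10"))
```

Returning an ordered list keeps repeated levels and their order, which matters for the 'awarn' and 'acrit' levels described further below.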

The main set of keywords that may be understood by plugins based on this specification are:

metric

metric={metric}
metric:{metric}

This specifies metric of the data being checked
Supporting this keyword by the plugin is required and this keyword must be specified unless plugin has other means of associating threshold definition with a specific metric such as having metric as a named parameter: --total_connections_received=WARN:threshold,CRIT:threshold
{metric} may be specified as a regular expression (pcre or perl compatible regular expression to be more precise) and include special regex characters such as * ? if this is supported by the plugin. Regular expressions can capture data with () which may be used in a custom label
{metric} is alphanumeric and may also include any visible ASCII character and horizontal spaces. If characters other than A-Za-z0-9_- are used the metric name must be enclosed in double (") or single (') quotes. A non-quoted backslash (\) is the escape character and preserves the literal value of the next character that follows (\\ is used for backslash itself, \" and \' for quotes, \* and \? for quoting * and ? in a regular expression, etc).

name

name={name}
name:{name}

Supporting this keyword by plugin writers and use of this keyword is optional and has significance only for special cases where plugins can check multiple named attributes and each has different metrics that can be checked. One example is process name (if we allow check_process to check multiple processes at the same time), whereas amount of process memory and cpu are then metrics. Another example is network interface name, whereas bytes_in and bytes_out are metrics.
If name is not specified then the threshold for the specified metric applies to all attributes being checked.
{name} is alphanumeric and may also include any visible ASCII character and horizontal spaces. If characters other than A-Za-z0-9_- are used the name must be enclosed in double (") or single (') quotes. A non-quoted backslash (\) is the escape character and preserves the literal value of the next character that follows (\\ is used for backslash itself, \" and \' for quotes)

regex

regex={yes_or_no}
regex:{yes_or_no}

This specifies if the metric name is to be processed as regular expression or not.
If regular expressions are enabled then same threshold specification would apply to more than one actual metric or named attribute
{yes_or_no} is one of either 'yes' or 'y' meaning the metric name is to be treated as a regular expression OR 'no' or 'n' meaning it is not a regular expression
Support of this keyword is optional and should be done only if plugin supports regex matching of metric names. However any plugin that supports this specification and regex matching must support this keyword to turn matching on and off for specific metric.
Plugins that support regex matching may choose default "yes" or "no" and should document it in the help. If you're writing new plugin and not sure, choose 'no' for default.

label

label={label}
label:{label}

This specifies the name to use when doing status output of a named attribute or metric on the status line or in perfdata. This is purely for convenience reasons such as to make short output for a long name. If label is not specified then the plugin should use the name provided in the 'name' keyword or in 'metric' if name was not specified.
If regular expressions are supported and enabled then $1, $2, $3 may be present as part of the {label}. They are to be replaced with data captured by () in the regular expression. If these are not present or are not different for multiple matched metrics or names, then the plugin must add a suffix such as _2, _3, etc to the label to differentiate different metrics in the output
{label} is alphanumeric with acceptable set of symbols A-Za-z0-9_- and may include special character $ followed by a number if regular expressions are enabled and used
Support of this keyword is optional for plugin writers
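
The label substitution and suffix rules can be sketched as follows. This is only an illustration: the function name expand_labels is hypothetical, and the sketch assumes that the pattern must match the whole metric name (the proposal does not say whether partial matches count).

```python
import re


def expand_labels(pattern, label, metric_names):
    """Return {metric_name: label} for every metric name the pattern matches."""
    regex = re.compile(pattern)
    labels = {}
    seen = {}  # how many times each expanded label has been produced so far
    for metric in metric_names:
        match = regex.fullmatch(metric)
        if match is None:
            continue

        def substitute(ref):
            # Replace $N with capture group N; leave unknown references as-is.
            index = int(ref.group(1))
            if index > regex.groups:
                return ref.group(0)
            return match.group(index) or ""

        text = re.sub(r"\$(\d+)", substitute, label)
        seen[text] = seen.get(text, 0) + 1
        if seen[text] > 1:
            # Same label produced again: differentiate with _2, _3, ...
            text = f"{text}_{seen[text]}"
        labels[metric] = text
    return labels


if __name__ == "__main__":
    names = ["if_eth0_bytes_in", "if_eth1_bytes_in", "cpu"]
    print(expand_labels(r"if_(\w+)_bytes_in", "$1_in", names))
```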

perf_label

perf_label={label}
perf_label:{label}

This specifies the name to use specifically for perfdata output. This overrides values supplied in metric, name and label for perfdata output. This is to be used for convenience reasons such as to make short output for a long name or to make output in English when the original name might not have been.
If regular expressions are enabled then $1, $2, $3 may be present as part of the {label}. They are to be replaced with data captured by () in the regular expression. If these are not present or are not different for multiple matched metrics or names, then the plugin must add a suffix such as _2, _3, etc to the label to differentiate different metrics in the output
{label} is alphanumeric with acceptable set of symbols A-Za-z0-9_- and may include special character $ followed by a number if regular expressions are enabled and used

ok, warn, crit

ok:{range},warn:{range},crit:{range}
ok={range},warn={range},crit={range}

These are called "levels" and specify type of alert and exit plugin code depending on numeric value of the metric being checked
specification for {range} and how matching is done is discussed in a separate sub-section further below
For warning the full name is 'warning' and it can be abbreviated to 'warn' and 'w' and can be lower or upper case such as 'WARN'.
For critical the full name is 'critical' and it can be abbreviated to 'crit' and 'c' and can be lower or upper case such as 'CRIT'
Levels may be repeated more than once to define an additional range. This allows non-continuous ranges to be defined
When two or more 'warn' or 'crit' levels are repeated then matching is done as a logical OR, which means that if data matches either the first or second level the alert is issued. If you need to treat multiple levels as logical AND then see below about 'awarn' and 'acrit' levels
Supporting basic 'ok', 'warn', 'crit' is required for all plugins that comply with this specification.
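
A small sketch of how the level abbreviations and repeated levels might be collected. The names LEVEL_ALIASES and collect_levels are hypothetical; the input is assumed to be a list of (keyword, value) pairs already split from the threshold definition.

```python
# Accepted spellings for each level, as listed above (matched case-insensitively).
LEVEL_ALIASES = {
    "ok": "ok",
    "warning": "warn", "warn": "warn", "w": "warn",
    "critical": "crit", "crit": "crit", "c": "crit",
}


def collect_levels(pairs):
    """Group range strings by canonical level name, keeping repeats in order.

    Ranges collected under the same level are meant to be ORed when matching.
    """
    levels = {"ok": [], "warn": [], "crit": []}
    for keyword, value in pairs:
        canonical = LEVEL_ALIASES.get(keyword.lower())
        if canonical is not None:
            levels[canonical].append(value)
    return levels


if __name__ == "__main__":
    pairs = [("metric", "x"), ("WARN", "0..5"), ("w", "10..20"), ("critical", "30..inf")]
    print(collect_levels(pairs))
```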

awarn, acrit, aok

aok:{range},awarn:{range},acrit:{range}
aok={range},awarn={range},acrit={range}

These are special types of "warn", "crit" and "ok" levels and what is written above about "warn", "crit" and "ok" also applies here except when there are multiple 'warn' and 'awarn' or 'crit' and 'acrit' keywords in same metric specification or with multiple metrics
'awarn' may be abbreviated to 'aw' and 'acrit' to 'ac'
When two or more 'warn' or 'crit' levels are repeated the matching is done as a logical OR however if 'warn' is followed by 'awarn' or 'crit' by 'acrit' then matching is done as a logical AND which means data should match conditions in both first ('warn') and second ('awarn') levels for alert to be issued. This is rarely (if at all) needed for a single metric...
If multiple thresholds are specified and they use 'awarn' and 'acrit' for specifying levels, the plugin should issue the alert when multiple of these thresholds match together rather than just one. That is, if a threshold has only one warning level specified as 'awarn' then this is ANDed with the threshold that is specified before it, which may have had 'warn' or 'awarn' for its level. If you want independent thresholds then always use 'warn' or 'crit' for the first level. Also note that with 'awarn' and 'acrit' the order of '--threshold' options is important, as is the order of levels inside the threshold definition. Also pay attention to the 'order' keyword, which can be used to specify and force a certain order of thresholds even if they are not specified in that order on the command line.
Supporting 'aok', 'awarn', 'acrit' is optional for plugin writers

absent

absent={nagios_status}
absent:{nagios_status}

This specifies type of alert and nagios exit code if named metric is not available on the device.
{nagios_status} is one of 'ok' or 'critical' or 'warning' or 'unknown' (without quotes)
'critical' can be abbreviated to 'crit' or 'c'.
'warning' can be abbreviated to 'warn' or 'w'
'unknown' can be abbreviated to 'u'
When this keyword is present, and metric data is not available, the plugin should have value of 'U' (meaning unknown and understood by rrdtool) in performance data in place of actual numeric value
Supporting this keyword is optional for plugin writers

display

display={yes_or_no}
display:{yes_or_no}

This specifies if the metric data should be included in nagios status output.
if metric is specified but without the display keyword, then if no alert is raised the metric data may be included in status according to plugin default settings.
{yes_or_no} is one of either 'yes' or 'y' meaning to include data in status OR 'no' or 'n' meaning not to include.
Supporting this keyword is optional for plugin writers

perf

perf={yes_or_no}
perf:{yes_or_no}

This specifies if the metric data should or should not be included in the performance output.
if metric is specified but none of perf, ok, warning and critical are specified, then no alert is raised, but the performance data may be returned according to the default settings of the plugin. The recommended default setting is 'yes'
{yes_or_no} is one of either 'yes' or 'y' meaning to include data in performance output OR 'no' or 'n' meaning not to include.
Supporting this keyword is optional for plugin writers

order

order={order_number}
order:{order_number}

This specifies order in which data from specified metric should appear in the status and in performance data output
Lower order metrics should appear first followed by those with higher order number
For status line output plugins may decide to have those metrics that raised CRITICAL or WARNING alerts appear first outside of this order
{order_number} is an integer number consisting of digits 0-9
Supporting this keyword is optional for plugin writers

prefix

prefix={prefix}
prefix:{prefix}

The prefix is used to multiply the input range for display of data in status output. Supporting prefix is optional for plugin writers. Allowed {prefix} values are a slight expansion of those defined by NIST at http://physics.nist.gov/cuu/Units/prefixes.html:

Y - yotta - 10^24
Z - zetta - 10^21
E - exa - 10^18
P - peta - 10^15
T - tera - 10^12
G - giga - 10^9
M - mega - 10^6
k or K - kilo - 10^3
h - hecto - 10^2
da - deka - 10^1
d or de - deci - 10^-1
c - centi - 10^-2
m or ml - milli - 10^-3
u or µ or mc - micro - 10^-6
n - nano - 10^-9
p - pico - 10^-12
f - femto - 10^-15
a - atto - 10^-18
z - zepto - 10^-21
y - yocto - 10^-24

The prefix may be either a short abbreviation with 1 or 2 symbols (1st column above) or the full name. Note that 'u' and 'mc' are both considered valid replacements for µ and should be supported even though only µ is in NIST.

unit, uom

unit={unit}
unit:{unit}
uom={unit}
uom:{unit}

{unit} is a single symbol or string that specifies the Unit of Measurement (UOM) for plugins that do not know about the type of value returned (SNMP, Windows performance counters, etc.). Supporting 'unit' and 'uom' keywords is optional for plugin writers.

Valid base UOMs are:

c - a continuous counter (such as bytes transmitted on an interface)
% - percentage

Additionally supported are custom UOMs that may combine scaling prefix with a label specifying the type of data returned. These UOMs include but are not limited to:

s - seconds, also: ms, us
b - bits, also: kb, Mb, Tb
B - bytes, also: KB, MB, TB
C - degrees in Celsius

For a list of abbreviations used in engineering and scientific measurements see http://www.engineeringtoolbox.com/ANSI-abbreviations-scientific-engineering-terms-d_1622.html

Custom UOMs may either be label abbreviations or may include one-letter prefix followed by abbreviation or may include full NIST prefix followed by '-' and then label. Therefore the following are all equivalent:

us = micro-s = micro-second
kb = kilo-b = kilo-bits

Performance data processing programs should first look for base one-letter units 'c' and '%' after the numeric value in performance data to process data. If some other string is found then it can be treated as a custom UOM string to use as a label for graphing. When processing custom UOMs programs should first check for known NIST scaling prefixes that could precede the actual custom label. Note that a prefix can not appear by itself as UOM, therefore "UOM=m" should be treated as a full custom UOM label (likely meaning meters) and not be processed as the 10^-3 scaling prefix milli for some unknown unit label.
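
The processing order described above could look roughly like the following sketch. The function name parse_uom and the KNOWN_LABELS set are assumptions made for illustration: a one-letter prefix is only split off when the remainder is a label the program already knows, because otherwise a string such as "min" would be ambiguous.

```python
# Powers of ten for one-letter prefixes and for full prefix names.
ONE_LETTER = {
    "Y": 24, "Z": 21, "E": 18, "P": 15, "T": 12, "G": 9, "M": 6,
    "k": 3, "K": 3, "h": 2, "d": -1, "c": -2, "m": -3, "u": -6, "µ": -6,
    "n": -9, "p": -12, "f": -15, "a": -18, "z": -21, "y": -24,
}
FULL_NAME = {
    "yotta": 24, "zetta": 21, "exa": 18, "peta": 15, "tera": 12, "giga": 9,
    "mega": 6, "kilo": 3, "hecto": 2, "deka": 1, "deci": -1, "centi": -2,
    "milli": -3, "micro": -6, "nano": -9, "pico": -12, "femto": -15,
    "atto": -18, "zepto": -21, "yocto": -24,
}
# Assumption: labels this program recognises after a one-letter prefix.
KNOWN_LABELS = {"s", "b", "B"}


def parse_uom(uom):
    """Return (kind, exponent, label) for a UOM string from performance data."""
    if uom in ("c", "%"):
        return ("base", 0, uom)  # base units are checked first
    if "-" in uom:
        prefix, label = uom.split("-", 1)
        if prefix.lower() in FULL_NAME and label:
            return ("custom", FULL_NAME[prefix.lower()], label)
    if len(uom) >= 2 and uom[0] in ONE_LETTER and uom[1:] in KNOWN_LABELS:
        return ("custom", ONE_LETTER[uom[0]], uom[1:])
    # A prefix on its own (for example "m") is a plain label, not a scale.
    return ("custom", 0, uom)


if __name__ == "__main__":
    for text in ("%", "us", "micro-second", "kb", "m"):
        print(text, parse_uom(text))
```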

Levels, Ranges and Rules for determining state

As a reminder the levels are specified as:

ok:{range},warn={range},crit={range}

{Range} can be a single numeric value, a "simple range" or a "complex range".

Plugins may also in addition choose to support nagios plugins old range format (a:b and @a:b) and other formats as long as this is clearly documented.

Single Value in Range

For backward compatibility a single value is defined as alert if data is above the specified value or below 0. That means "warn=10" causes a WARNING alert to be issued if metric data falls outside of range {0..10}

Single value can only be specified for 'warn' and 'crit' levels and not for 'ok'

Simple Range

Simple ranges are of the format:

start..end

Where:

start and end must be defined
start and end match the regular expression /^[+-]?[0-9]+\.?[0-9]*$|^inf$/ (ie, a numeric or "inf")
start ≤ end
if start = "inf", this is negative infinity. This can also be written as "-inf"
if end = "inf", this is positive infinity
endpoints are inclusive of the range
alert is raised if value is inside start and end range

This simple range does not require quoting at the shell.
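
A minimal sketch of parsing and matching a simple range, following the rules listed above. The function names are hypothetical. "-inf" is accepted as a start value because the text allows it, even though the quoted regular expression alone would not match it.

```python
import math
import re

NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$")


def parse_endpoint(text, is_start):
    """Convert one endpoint to a float; 'inf' is negative at the start."""
    if text == "inf":
        return -math.inf if is_start else math.inf
    if text == "-inf" and is_start:
        return -math.inf
    if NUMBER.match(text):
        return float(text)
    raise ValueError(f"bad endpoint: {text!r}")


def parse_simple_range(text):
    """Parse 'start..end' into a (start, end) tuple of floats."""
    start_text, separator, end_text = text.partition("..")
    if not separator or not start_text or not end_text:
        raise ValueError("start and end must both be defined")
    start = parse_endpoint(start_text, True)
    end = parse_endpoint(end_text, False)
    if start > end:
        raise ValueError("start must not be greater than end")
    return start, end


def in_simple_range(value, text):
    """True if value is inside the range; endpoints are inclusive."""
    start, end = parse_simple_range(text)
    return start <= value <= end


if __name__ == "__main__":
    print(in_simple_range(5, "0..5"), in_simple_range(5.1, "0..5"))
```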

Complex range

Complex ranges are an extension of simple ranges. They allow you to specify precisely whether start and end are included in the range, using the () and [] bracket notation well known from mathematics, and to specify the exact opposite (negation) of a range using ^.

Here are rules regarding brackets:

() brackets are used when end points (or an end point) are not included (open interval in math)
[] brackets are used when end points are included (closed interval in math)
It is possible to mix them, with the start end point using a ( bracket and the end using a ] bracket, or the other way around
For infinity either bracket is allowed and both are equivalent. Therefore, for an infinite end point, use the same bracket as at the other end

Negation is done by prefixing brackets with ^. Mathematically it means outside of the specified interval. So open interval ^(start..end) becomes the union of two closed intervals [-inf..start] U [end..inf]. A closed interval becomes the union of two open intervals, i.e. ^[start..end] is (-inf..start) U (end..inf). As this shows, negation is never strictly necessary: the same effect can always be achieved by specifying two warn or crit levels, one for each of the two intervals.

Here are some cases for examples:

If no brackets are used as with simple range this is the same as using square brackets such as:

crit:[start..end] which means critical alert if metric>=start && metric<=end

If range is included in the () brackets this means the start and end are not included and then:

crit:(start..end) means critical alert if metric>start && metric<end

We can have mixed brackets such as:

crit:(start..end] which means critical alert if metric>start && metric<=end

Negation applied to [] brackets

crit:^[start..end] means critical alert if !(metric>=start && metric<=end) i.e. metric<start || metric>end

Negation with mixed brackets

crit:^(start..end] means critical alert if !(metric>start && metric<=end) i.e. metric<=start || metric>end

In above:

start and end must be defined
start and end match the regular expression /^[+-]?[0-9]+\.?[0-9]*$|^inf$/ (ie, a numeric or "inf")
start ≤ end
if start = "inf", this is negative infinity. This can also be written as "-inf"
if end = "inf", this is positive infinity
endpoints are excluded from the range if () are used
endpoints are included in the range if [] are used
alert is raised if value is within start and end range, unless ^ is used, in which case alert is raised if outside the range

Note that due to shell characters, quoting may be required for complex syntax.
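
The bracket and negation rules can be illustrated with the following sketch. The function names are hypothetical. It assumes that ^ is only valid in front of a bracketed range, since the text describes negation as "prefixing brackets"; a range without brackets is treated like [start..end].

```python
import math
import re

NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$")


def _endpoint(text, is_start):
    if text == "inf":
        return -math.inf if is_start else math.inf
    if text == "-inf":
        return -math.inf
    if NUMBER.match(text):
        return float(text)
    raise ValueError(f"bad endpoint: {text!r}")


def parse_range(text):
    """Return (negated, start, start_inclusive, end, end_inclusive)."""
    negated = text.startswith("^")
    body = text[1:] if negated else text
    start_inclusive = end_inclusive = True  # no brackets means [start..end]
    if body and body[0] in "[(":
        if len(body) < 2 or body[-1] not in "])":
            raise ValueError("unbalanced brackets")
        start_inclusive = body[0] == "["
        end_inclusive = body[-1] == "]"
        body = body[1:-1]
    elif negated:
        # Assumption: negation must be applied to a bracketed range.
        raise ValueError("^ must be followed by a bracket")
    start_text, separator, end_text = body.partition("..")
    if not separator:
        raise ValueError("expected start..end")
    start = _endpoint(start_text, True)
    end = _endpoint(end_text, False)
    if start > end:
        raise ValueError("start must not be greater than end")
    return negated, start, start_inclusive, end, end_inclusive


def matches(value, text):
    """True if an alert should be raised for value under this range."""
    negated, start, start_inc, end, end_inc = parse_range(text)
    above = value >= start if start_inc else value > start
    below = value <= end if end_inc else value < end
    inside = above and below
    return not inside if negated else inside


if __name__ == "__main__":
    for spec in ("[10..20]", "(10..20)", "(10..20]", "^[10..20]", "^(10..20]"):
        print(spec, matches(10, spec))
```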

"Old-Style" Nagios Plugins Specification Range

As a reminder, the Nagios Developer Guidelines define the range for a threshold in the following format:

start:end - means alert if outside of the range {start..end} i.e. if data < start or data > end
@start:end - means alert if inside the range {start..end} i.e. if data ≥ start and data ≤ end

These are equivalent to new definitions in the following way:

Old-Style - New Range - Generate an alert if x...

10 -----> 10 or ^[0..10] - alert if < 0 or > 10, (outside the range of {0 .. 10})
10: -----> ^[10..inf] - alert if < 10, (outside {10 .. ∞})
~:10 -----> ^[-inf..10] - alert if > 10, (outside the range of {-∞ .. 10})
10:20 -----> ^[10..20] - alert if < 10 or > 20, (outside the range of {10 .. 20})
@10:20 -----> 10..20 - alert if ≥ 10 and ≤ 20, (inside the range of {10 .. 20})

Plugins may optionally choose to accept old style nagios range definition as well as simple and complex range defined in this document.
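
The equivalence table above can be expressed as a small conversion routine. This is a sketch only; the function name old_to_new is hypothetical, and treating an empty start (as in ":10") as 0 is an assumption.

```python
def old_to_new(old):
    """Convert an old-style range string to the new range notation."""
    inside = old.startswith("@")  # '@' means alert when inside the range
    body = old[1:] if inside else old
    if ":" in body:
        start, end = body.split(":", 1)
    else:
        start, end = "0", body  # a single value means the range 0..value
    if start == "~":
        start = "-inf"
    elif start == "":
        start = "0"  # assumption: an omitted start means 0
    if end == "":
        end = "inf"
    if inside:
        return f"{start}..{end}"
    # Old-style ranges without '@' alert when outside, hence the negation.
    return f"^[{start}..{end}]"


if __name__ == "__main__":
    for old in ("10", "10:", "~:10", "10:20", "@10:20"):
        print(old, "->", old_to_new(old))
```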

Rules for determining state

Given a numeric value, the state of the threshold is calculated from the following ordered rules:

If no levels are specified, return OK
If an ok level is specified and value is within range, return OK
If a critical level is specified and value is within range, return CRITICAL
If a warning level is specified and value is within range, return WARNING
If an ok level is specified, return CRITICAL
Otherwise return OK
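
The ordered rules above translate almost directly into code. In this sketch each level is a list of closed (start, end) intervals to keep the example short; the function name determine_state and that representation are assumptions, and a fuller implementation would use complete range matching instead.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def _inside(value, ranges):
    # Repeated ranges for one level are ORed together.
    return any(start <= value <= end for start, end in ranges)


def determine_state(value, ok=(), warn=(), crit=()):
    """Apply the ordered rules to one numeric value."""
    if not ok and not warn and not crit:
        return OK  # no levels specified
    if _inside(value, ok):
        return OK
    if _inside(value, crit):
        return CRITICAL
    if _inside(value, warn):
        return WARNING
    if ok:
        return CRITICAL  # an ok level exists but the value missed every range
    return OK


if __name__ == "__main__":
    inf = float("inf")
    print(determine_state(150, warn=[(100, 200)], crit=[(200, inf)]))
```

Note that a value sitting exactly on a boundary shared by the warn and crit ranges (200 in this example) is CRITICAL, because the critical rule is evaluated first.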

Performance data

Performance data is currently defined in nagios as:

'label'=value[UOM];[warn];[crit];[min];[max]

Because the specification for a range has changed, the warning and critical values may not fit performance data any more. Therefore the format is extended to be:

'label'=value[UOM];[warn];[crit];[min];[max];[warn-extended];[crit-extended]

Where [warn-extended] and [crit-extended] are each a list of one or more complex-range definitions (with [] and () brackets being required). The following is an example of what is acceptable for warn-extended and crit-extended:

'label'=value[UOM];;;[min];[max];[-inf..-5],(30..inf);(-inf..-10),(40..inf)

However, if there is only one warning or critical level, specified as a single value or a simple range, it can always be converted to old-style Nagios range format, with start..end being equivalent to old @start:end. Plugins should do this so that performance graphing utilities that are not aware of the new format continue to work. Therefore a threshold definition of:

--threshold=metric:misses,ok:0..100,warn:100..200,crit:200..inf

Should go into performance data (assuming value of 20 and min 0 and max 1000) as:

'misses'=20;@100:200;200;0;1000;[100..200];[200..inf]
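
For illustration, assembling the extended performance data string might look like the sketch below. The function name format_perfdata and its parameter names are hypothetical; the old-style warn and crit fields are passed in already converted, and the extended fields are joined with commas as in the example format above.

```python
def format_perfdata(label, value, uom="", warn="", crit="",
                    minimum="", maximum="",
                    warn_extended=(), crit_extended=()):
    """Build one perfdata item in the extended seven-field format."""
    fields = [
        f"'{label}'={value}{uom}",
        warn,                      # old-style range, empty if not convertible
        crit,
        str(minimum),
        str(maximum),
        ",".join(warn_extended),   # one or more complex ranges
        ",".join(crit_extended),
    ]
    return ";".join(fields)


if __name__ == "__main__":
    print(format_perfdata(
        "misses", 20, warn="@100:200", crit="200", minimum=0, maximum=1000,
        warn_extended=["[100..200]"], crit_extended=["[200..inf]"],
    ))
```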

Plan and Examples

Multiple parties are currently working on libraries that would support this format.

There is at least one Perl library that partially supports the new specification. There is also a Java JNRPE library that partially supports the new specification.

There will be C library routines for parsing the threshold values.
There will be C library routines for the collection and output of the performance data.

There are plans for the plugin authors to update check_procs and other plugins.

check_procs

The new basic syntax would be: check_procs [filter options] [threshold options]

Where filter options are the current -u {username}, -C {command}, etc. This reduces the set of processes that are to be calculated. The new threshold metrics will be:

number - alert on number of matching processes. Performance data returns number of processes
rss-threshold - alert on rss size if any matching process is in range. Perf data returns average rss
rss-max - Same as --rss, but perf data returns max rss
rss-sum - alert on the total rss of all matching processes. Perf data returns rss_sum
vsz-threshold - alert on vsz size if any matching process is in range. Perf data returns average vsz
vsz-max - Same as --vsz, but perf data returns max vsz
vsz-sum - alert on the total vsz of all matching processes. Perf data returns vsz_sum
cpu-threshold - alert on cpu % of all matching processes. Perf data returns average cpu
cpu-max - Same as --cpu, but perf data returns max cpu
cpu-sum - alert on total cpu. Perf data returns cpu_sum

check_http

Updated check_http example is:

check_http -H $HOSTADDRESS$ --th metric=time,ok=0..5 --th metric=size,ok=10..inf,prefix=Ki --th metric=age,ok=0..1,unit=d

We believe this is more readable (shows that user is interested in the time, the size and the age) and more consistent (alerting above 5, less than 10 and above 1, respectively).

Performance data will only be output if the metric has been specified. So time performance data is only shown if "--th metric=time,perf=yes" has been specified on the command line. Both warning_range and critical_range can be left unspecified - this effectively means "I am not going to alert on this value, but I'd like to be informed about it in the performance data".

Other examples.

Check that httpd processes are OK if the virtual size is under 8096 bytes, WARNING up to 16182, and CRITICAL above that.

old: check_procs -w 8096 -c 16182 -C httpd --metric VSZ
new: check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182

There should always be one and only one 'tnslsnr' process. Otherwise critical.

old: check_procs -w 1:1 -c 1:1 -C tnslsnr
new: check_procs -C tnslsnr --th metric=count,ok=1..1

Load averages (1,5,15 minute) should be within reasonable ranges.

old: check_load -w 1.0,0.8,0.7 -c 1.5,1.3,1.0
new: check_load --th metric=1min,ok=0..1.0,warn=1.0..1.5 --th metric=5min,ok=0..0.8,warn=0.8..1.3 --th metric=15min,ok=0..0.7,warn=0.7..1.0

Terminology

metric

Something that a check is going to be measured against. For example, for disk checks, it could be used or free or inodes_free; for http checks, it could be time [taken] or size; for process checks, it could be cpu or number [of processes] or vsz

range

This defines a continuous range of values for which an alert would be raised

level

This is an alert level within Nagios - OK, WARNING or CRITICAL

threshold

This consists of a level with a range

Limitations

This assumes that you are always comparing numbers as the metric values.

There may be some limitations in the precision of values. All internal logic should use double precision.

If there are multiple metrics, the alert will be on an OR basis, that is, any single metric which passes its threshold will cause the plugin to return a failed state.

Comments

Those who want to make a comment without editing main text of this proposal should do it below (the other option is to post on nagios-plugins mail list):
