Tool for monitoring the SMART attributes of NVMe devices on remote hosts.
This tool reads the SMART attributes reported by the nvme-cli tools on a list of hosts and compares the values with the thresholds specified in the config.
Ensure the nvme-cli tool is installed on the hosts you need to monitor.
cd /opt
git clone https://github.com/igor-kremin/automon.git nvmemonitor
/opt/nvmemonitor/nvmemonitor --add example.com
To update an existing installation:
cd /opt/nvmemonitor
git pull
- Create the config file /opt/nvmemonitor/nvmemonitor.conf and add as many hosts as you wish.
available_spare__less=10
percentage_used__grater=80
temperature__grater=60
[host1.example.com] # host in Germany
device /dev/nvme0n1 # INTEL
device /dev/nvme1n1 # INTEL
percentage_used__grater=90
[host2.example.com:4444] # host in Ireland
device /dev/nvme0n1 # SAMSUNG
device /dev/nvme1n1 # TOSHIBA
device /dev/nvme2n1 # SAMSUNG
The configuration file allows comments, from the # symbol to the end of the line.
Syntax of nvmemonitor.conf file:
list of monitored parameters with the suffix __less or __grater, which apply to all listed hosts.
<param>__<less|grater>=<value> ex: available_spare__less=10 - will inform you if the SMART value is less than 10
<host>[:port] ex: [server2.example.com:4444]
device /dev/<nvme> ex: device /dev/nvme0n1
list of monitored parameters for the previously defined host.
<param>__<less|grater>=<value> ex: percentage_used__grater=90 - will inform you if the SMART value is greater than 90, only on server2.example.com
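The syntax above can be sketched as a small parser. This is only an illustration of the file format, not the code nvmemonitor itself uses. The function names parse_config and effective_thresholds are hypothetical, and the rule that a host-level value overrides the global one is an assumption based on the percentage_used__grater=90 example.

```python
def parse_config(text):
    """Parse nvmemonitor.conf-style text into (global_thresholds, hosts).

    Hypothetical helper: illustrates the documented syntax only.
    """
    global_thresholds = {}  # e.g. {"available_spare__less": 10}
    hosts = {}              # "host[:port]" -> {"devices": [...], "thresholds": {...}}
    current = None          # host section currently being filled, if any

    for raw in text.splitlines():
        # Comments run from '#' to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
            current = hosts.setdefault(name, {"devices": [], "thresholds": {}})
        elif line.startswith("device "):
            if current is None:
                raise ValueError("device line before any [host] section")
            current["devices"].append(line.split(None, 1)[1])
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            target = global_thresholds if current is None else current["thresholds"]
            target[key] = int(value)
        else:
            raise ValueError(f"unrecognised line: {raw!r}")
    return global_thresholds, hosts


def effective_thresholds(global_thresholds, host_entry):
    """Assumption: a host-level threshold replaces the global one of the same name."""
    return {**global_thresholds, **host_entry["thresholds"]}


SAMPLE = """\
available_spare__less=10
percentage_used__grater=80
[host1.example.com] # host in Germany
device /dev/nvme0n1 # INTEL
percentage_used__grater=90
"""

global_thresholds, hosts = parse_config(SAMPLE)
host1_limits = effective_thresholds(global_thresholds, hosts["host1.example.com"])
```

With the sample above, host1_limits keeps available_spare__less at 10 and raises percentage_used__grater to 90 for that host only.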
Sample output and list of available attributes from the nvme-cli software:
critical_warning : 0
temperature : 40 C (313 Kelvin)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 6%
endurance group critical warning summary: 0
data_units_read : 1,127,748,199
data_units_written : 2,821,015,917
host_read_commands : 104,667,029,322
host_write_commands : 110,234,002,637
controller_busy_time : 296,158
power_cycles : 20
power_on_hours : 24,532
unsafe_shutdowns : 1
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 40 C (313 Kelvin)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Each attribute value is converted to an int for comparison.
For temperature attributes, the Celsius value is used.
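The conversion and comparison described above might look like the following sketch. The helper names to_int, parse_smart_log and violations are hypothetical and are not taken from the nvmemonitor source. Taking the first number in each value is an assumption that happens to select the Celsius figure, because it precedes the Kelvin one in the sample output.

```python
import re


def to_int(raw):
    """Return the first integer found in an nvme-cli value string.

    "1,127,748,199" -> 1127748199, "100%" -> 100,
    "40 C (313 Kelvin)" -> 40 (the Celsius figure comes first).
    """
    match = re.search(r"-?\d[\d,]*", raw)
    if match is None:
        raise ValueError(f"no number in {raw!r}")
    return int(match.group().replace(",", ""))


def parse_smart_log(output):
    """Map attribute names to ints for every 'name : value' line."""
    values = {}
    for line in output.splitlines():
        name, sep, raw = line.partition(":")
        if not sep:
            continue  # not a 'name : value' line
        try:
            values[name.strip()] = to_int(raw)
        except ValueError:
            continue  # value without a number, skip it
    return values


def violations(values, thresholds):
    """Yield (attribute, value, limit) for each threshold that is crossed.

    Threshold keys use the config suffixes __less and __grater.
    """
    for key, limit in thresholds.items():
        attr, _, op = key.rpartition("__")
        if attr not in values:
            continue
        value = values[attr]
        if (op == "less" and value < limit) or (op == "grater" and value > limit):
            yield attr, value, limit
```

For example, list(violations(parse_smart_log(output), {"temperature__grater": 60})) is empty for the sample output above, since the reported temperature is 40.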
The alert command uses the program /opt/bin/alert-via-telegram to send the test result to Telegram.
alert-via-telegram, thanks to Gena Makhomed, can be found here: https://github.com/makhomed/automon/blob/master/bin/alert-via-telegram
The alert program receives one command. See the source of the /opt/automon/bin/alert-via-telegram program for details.
Using /opt/automon/bin/alert-via-telegram as an example, you can write your own alert program that sends alerts via email, SMS, or any other channel.
nvmemonitor test shows the SMART test results if any warning occurred.
nvmemonitor alert sends the SMART test results to the alert program if any warning occurred.
nvmemonitor --add <host>[:port] [<host>[:port]] adds the host(s) and their devices to the nvmemonitor.conf config.
nvmemonitor --show <host>[:port] [<host>[:port]] shows the devices on the given hosts.
nvmemonitor --list lists all monitored devices.
nvmemonitor --progress shows progress information while the checks run.
If you want to send alerts to Telegram, you have to create a Telegram bot and configure the telegram-send script. For details, see the documentation at https://pypi.python.org/pypi/telegram-send.
For the tool to work properly, you need to configure promptless ssh connections to the monitored hosts.
This can be done by running ssh-keygen -t rsa and copying the public key from /root/.ssh/id_rsa.pub
to /root/.ssh/authorized_keys on the monitored servers.
You also need to check the connection to each monitored server with the command ssh example.com and answer yes to the ssh host key question.
Alternatively, you can use the following commands:
ssh-keygen -q -N ""
ssh-copy-id host1.example.com
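To show why the ssh connection must not prompt, here is a minimal sketch of how a SMART log could be read from a remote host. It assumes the tool runs nvme smart-log over ssh and that the optional port in host[:port] is the ssh port. Neither assumption is confirmed by this README, and the function names are hypothetical.

```python
import subprocess


def split_host(entry, default_port=22):
    """Split "host[:port]" into (host, port); the port is assumed to be the ssh port."""
    host, sep, port = entry.partition(":")
    return host, (int(port) if sep else default_port)


def ssh_command(entry, device):
    """Build an ssh argv that reads one device's SMART log without prompting."""
    host, port = split_host(entry)
    return [
        "ssh",
        "-o", "BatchMode=yes",      # fail instead of asking for a password
        "-o", "ConnectTimeout=10",  # do not hang on unreachable hosts
        "-p", str(port),
        host,
        "nvme", "smart-log", device,
    ]


def read_smart_log(entry, device):
    """Run the command and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        ssh_command(entry, device),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With BatchMode=yes, a host that still requires a password or an unconfirmed host key fails immediately instead of blocking a cron run.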
Ensure the configuration file /opt/nvmemonitor/nvmemonitor.conf exists and defines the hosts to check.
After that, configure a cron job, for example in the file /etc/cron.d/nvmemonitor:
0 * * * * root /opt/nvmemonitor/nvmemonitor alert