ZhoubaWiki:IcingaIntroduction

Introduction to Icinga monitoring system (en)

Icinga is a monitoring system checking hosts and services you specify and notifying you when things go wrong and when they recover. The systems to be monitored can be nearly anything connected to a network.

Installation

Icinga

Apache, libjpeg, libpng

sudo apt-get install apache2 build-essential libgd2-xpm-dev
sudo apt-get install libjpeg62 libjpeg62-dev libpng12 libpng12-dev

http://docs.icinga.org/latest/en/quickstart-idoutils.html#installpackages

Icinga itself, MySQL

sudo apt-get install icinga icinga-doc icinga-idoutils mysql-server libdbd-mysql mysql-client

nagios-plugins

sudo apt-get install nagios-plugins

https://wiki.icinga.org/display/howtos/Setting+up+Icinga+with+IDOUtils+on+Ubuntu

check_nrpe

*check_nrpe can also be installed from packages, but the package pulls in the whole Nagios core 3.0 as a dependency.

PNP4Nagios

PNP4Nagios

sudo apt-get install pnp4nagios

https://wiki.icinga.org/display/howtos/Setting+up+PNP+with+Icinga+on+Debian

check_icingastats (check_nagiostats) - for monitoring and charting the Icinga process

wget https://www.monitoringexchange.org/attachment/download/Check-Plugins/Software/Nagios/check_nagiostats/check_nagiostats

https://wiki.icinga.org/display/howtos/check_icingastats

http://docs.icinga.org/latest/en/perfgraphs.html

https://www.monitoringexchange.org/inventory/Check-Plugins/Software/Nagios/check_nagiostats

  • Don't forget to adjust how performance data from the check_nrpe plugin is visualised (custom PNP4Nagios templates):

http://docs.pnp4nagios.org/pnp-0.4/tpl_custom

  • To replace the chart's yellow colour with another one, edit the PHP template:

Change $i=0; to $i=1; in the template file /etc/pnp4nagios/templates/check_icingastats.php.

Check logic

Not every detail of the check logic matters here, so only the important points are covered.

Active checks

Icinga is capable of monitoring hosts and services in two ways: actively and passively. Active checks are the most common method for monitoring hosts and services. The main features of active checks are as follows:

  • Active checks are initiated by the Icinga process
  • Active checks are run on a regularly scheduled basis

Additional information on this topic can be found here http://docs.icinga.org/latest/en/activechecks.html .

State types

There are two state types in Icinga - SOFT states and HARD states. These state types are a crucial part of the monitoring logic, as they are used to determine when event handlers are executed and when notifications are initially sent out.

This document http://docs.icinga.org/latest/en/statetypes.html describes the difference between SOFT and HARD states, how they occur, and what happens when they occur.
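
The SOFT/HARD distinction can be illustrated with a small Python sketch. It is a simplified model, not Icinga's actual implementation: it tracks a single service, ignores the state of the host, and the names ServiceState and process_result are made up for this illustration. The idea is that a non-OK result stays SOFT while the check is being retried and becomes HARD once max_check_attempts is reached.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Simplified state tracker for one service (hypothetical helper)."""
    max_check_attempts: int
    status: str = "OK"        # last check result: OK, WARNING, CRITICAL, ...
    state_type: str = "HARD"  # SOFT or HARD
    attempt: int = 1

    def process_result(self, result: str) -> str:
        """Record one check result and return the resulting state type."""
        if result == "OK":
            # Recovering from a SOFT problem gives a SOFT recovery;
            # every other OK result is a HARD OK state.
            was_soft_problem = self.status != "OK" and self.state_type == "SOFT"
            self.state_type = "SOFT" if was_soft_problem else "HARD"
            self.attempt = 1
        else:
            if self.status == "OK":
                self.attempt = 1      # first failure starts the retry cycle
            elif self.state_type == "SOFT":
                self.attempt += 1     # still retrying
            # A problem that is already HARD keeps its attempt count.
            if self.attempt >= self.max_check_attempts:
                self.state_type = "HARD"
            else:
                self.state_type = "SOFT"
        self.status = result
        return self.state_type
```

With max_check_attempts set to 3, three consecutive CRITICAL results yield SOFT, SOFT and then HARD, and only the HARD state would trigger the initial notification.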

Host and service checks

Details on checks, host and service states and related topics can be found here:

Notifications

This document http://docs.icinga.org/latest/en/notifications.html explains exactly when and how host and service notifications are sent out, as well as who receives them.

Configuration

General

There are some directives in config files that require attention to make the system work as desired.

Main config

  • Command file needs to be specified explicitly
cfg_file=/etc/icinga/general-commands.cfg
  • Other object config files may be specified using the containing directory
cfg_dir=/etc/icinga/objects/
  • Resource file location
resource_file=/etc/icinga/resource.cfg
  • The Icinga daemon runs as this user and as a member of this group
icinga_user=nagios
icinga_group=nagios
  • Icinga daemon cannot be controlled using CGIs without enabling external commands
check_external_commands=1
  • Some plugins need more time to complete
service_check_timeout=150
  • Performance data need to be processed to enable charting
process_performance_data=1
  • Some plugins need this directive enabled
enable_environment_macros=1

http://docs.icinga.org/latest/en/configmain.html
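
The directives above can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: it treats the main config as plain key=value lines with # comments, and the helper names parse_main_config and missing_settings as well as the REQUIRED table are invented for this example.

```python
# Directives from the list above that must be switched on (illustrative choice).
REQUIRED = {
    "check_external_commands": "1",
    "process_performance_data": "1",
    "enable_environment_macros": "1",
}


def parse_main_config(text):
    """Parse 'key=value' lines; directives such as cfg_file may repeat."""
    directives = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        directives.setdefault(key.strip(), []).append(value.strip())
    return directives


def missing_settings(text, required=REQUIRED):
    """Return the required directives that are absent or set differently."""
    directives = parse_main_config(text)
    return sorted(
        key for key, wanted in required.items()
        if wanted not in directives.get(key, [])
    )
```

Calling missing_settings(open("/etc/icinga/icinga.cfg").read()) would list the directives that still need attention; an empty list means all three are enabled.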

cgi.cfg

This configuration file holds the authorisation settings and other directives that control the behaviour of the CGIs. Nothing in it is critical for a basic setup.

http://docs.icinga.org/latest/en/configcgi.html

ido2db.cfg

Contains the settings of the broker module and its database login data.

resource.cfg

Contains user-defined macros (= variables). They are

  • useful for sensitive data, as the CGIs don't have access to this file,
  • or for storing a value that contains a semicolon (;), since this character marks the start of a comment in config files,
  • or for long strings, to keep service and host definitions short and clear.

Object

A few rules to begin with:

  • Lines that start with a '#' character are taken to be comments and are not processed
  • Directive names are case-sensitive
  • Characters that appear after a semicolon (;) in configuration lines are treated as comments and are not processed

It's important to say that these config files can be arranged at will to make the configuration clear and easy to maintain for anyone. Theoretically, all the object definitions could be in one file, or each definition could be placed in its own file (not recommended :-) ).

How does the configuration model work?

  • Let's define some hosts.
  • To monitor the services on these hosts, we define the services and link each one to the host it runs on and to the command that eventually calls the plugin.
  • A command has only two directives: command_name and command_line. The name links the command to a service definition, and the command line is the plugin (with its parameters) that is called to actually execute the check.

The last two object types deal with notifications:

  • Time periods (when people should be notified) are defined in their own object type.
  • Contact definitions hold each person's settings: when, in which cases and to which address notifications should be sent. A contact definition also lets that person view and control the services he is assigned to from the CGIs. More on a contact's CGI privileges here: Bugweis:IcingaIntroduction#Userprivileges

Host definition

  • It is very useful to define a template (marked by the register 0 directive) that predefines common attributes; these can be overridden later in individual host definitions.
define host{
   name                            generic-host    ; The name of this host template
   use                             pnp-hst         ; PNP4Nagios integration
   notifications_enabled           1               ; Host notifications are enabled
   event_handler_enabled           1               ; Host event handler is enabled
   flap_detection_enabled          1               ; Flap detection is enabled
   failure_prediction_enabled      1               ; Failure prediction is enabled
   process_perf_data               1               ; Process performance data - for charting purposes
   retain_status_information       1               ; Retain status information across program restarts
   retain_nonstatus_information    1               ; Retain non-status information across program restarts
   check_command                   check-host-alive
   max_check_attempts              10
   notification_interval           30
   notification_period             24x7            ; Time period definition's name
   notification_options            d,u,r,f,s       ; down, unreachable, recovery, flapping, scheduled downtime
   contact_groups                  admins
   register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
  • PNP4Nagios integration's action_url has to be defined to offer extended CGI functionality (charts).
define host {
   name       pnp-hst
   register   0
   action_url /pnp4nagios/graph?host=$HOSTNAME$' class='tips' rel='/pnp4nagios/popup?host=$HOSTNAME$&srv=_HOST_
}
  • Finally the host definition itself - the most important directives: host_name and address.
  • Note the use <template-name> directive: it links the host to the predefined template.
define host{
   use                     generic-host            ; Name of host template to use
   host_name               localhost
   address                 127.0.0.1
}

http://docs.icinga.org/latest/en/objectdefinitions.html#host

Service definition

As with host definitions, service definitions use the template model too.

  • It is advantageous to define two templates, which differ in performance data processing and PNP's action_url integration. Since not all plugins generate performance data, it is useful to let Icinga generate chart links only for chartable services.
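Plain service template (first definition) and charted service template (second definition; lines marked -> differ from the plain one):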
define service{
   name                            generic-service
   active_checks_enabled           1
   parallelize_check               1
   notifications_enabled           1
   event_handler_enabled           1
   flap_detection_enabled          1
   failure_prediction_enabled      1
   process_perf_data               0
   retain_status_information       1
   retain_nonstatus_information    1
   is_volatile                     0
   check_period                    24x7
   normal_check_interval           5
   retry_check_interval            5
   max_check_attempts              5
   notification_interval           30
   notification_period             24x7
   notification_options            w,u,c,r
   contact_groups                  admins
   register                        0

}
define service{
-> name                            generic-service-charted   ; The 'name' of this service template
   active_checks_enabled           1             ; Active service checks are enabled
   parallelize_check               1             ; Active service checks should be parallelized
   notifications_enabled           1             ; Service notifications are enabled
   event_handler_enabled           1             ; Service event handler is enabled
   flap_detection_enabled          1             ; Flap detection is enabled
   failure_prediction_enabled      1             ; Failure prediction is enabled
-> process_perf_data               1             ; Performance data processing for charting
   retain_status_information       1             ; Retain status information across program restarts
   retain_nonstatus_information    1             ; Retain non-status information across program restarts
   is_volatile                     0
   check_period                    24x7
   normal_check_interval           5
   retry_check_interval            5
   max_check_attempts              5
   notification_interval           30            ; Re-send notifications every 30 time units while the problem persists
   notification_period             24x7
   notification_options            w,u,c,r,f,s   ; warning, unknown, critical, recovery, flapping, scheduled downtime
   contact_groups                  admins
   register                        0
-> use                             pnp-svc       ; PNP integration
}
  • PNP4Nagios integration
define service {
   name       pnp-svc
   register   0
   action_url /pnp4nagios/graph?host=$HOSTNAME$&srv=$SERVICEDESC$' class='tips' rel='/pnp4nagios/popup?host=$HOSTNAME$&srv=$SERVICEDESC$
}
  • Service definition itself
  • Note the check_command directive of the specific service: it consists of the check command name followed by an exclamation-mark-separated list of arguments, in the order expected by the command definition.
define service{
   use                             generic-service-charted ; Name of service template to use
   host_name                       localhost
   service_description             Time Offset
   check_command                   check_ntp_time!de.pool.ntp.org!0.25!0.5
}

http://docs.icinga.org/latest/en/objectdefinitions.html#service

Command definition

  • command_line specifies what to execute
define command{
   command_name    check_ntp_time
   command_line    $USER1$/check_ntp_time -H '$HOSTADDRESS$' -w '$ARG1$' -c '$ARG2$'
}

http://docs.icinga.org/latest/en/objectdefinitions.html#command

Time periods

  • These definitions are mostly self-explanatory.
  • Four time periods are predefined at installation: always, never, workhours (workdays from 09:00 to 17:00) and nonworkhours (the complement of workhours).

http://docs.icinga.org/latest/en/objectdefinitions.html#timeperiod

Contact definition

  • Nothing tricky here, just pay attention to the notification options: the <host/service>_notification_options notation is the same as in IcingaIntroduction#Hostdefinition and IcingaIntroduction#Servicedefinition
define contact{
   contact_name                    admin
   alias                           Administrator
   service_notification_period     24x7
   host_notification_period        24x7
   service_notification_options    w,u,c,r
   host_notification_options       d,r
   service_notification_commands   notify-service-by-email
   host_notification_commands      notify-host-by-email
   email                           admin@domain.com
}

http://docs.icinga.org/latest/en/objectdefinitions.html#contact

Hostgroups, Servicegroups and Contactgroups

  • Some definitions can be aggregated into groups and assigned as a group. Follow these links for more details:

http://docs.icinga.org/latest/en/objectdefinitions.html#hostgroup

http://docs.icinga.org/latest/en/objectdefinitions.html#servicegroup

http://docs.icinga.org/latest/en/objectdefinitions.html#contactgroup

Restart and verification of config files

  • Every time you modify your configuration files, you also have to restart Icinga. It is important to run a sanity check on your configuration files because in case of an error Icinga will not be (re)started. In order to verify your configuration, run Icinga using the -v command line option:
/usr/sbin/icinga -v /etc/icinga/icinga.cfg
  • Finally, restart the daemon. This may take some time, so be patient.
sudo /etc/init.d/icinga restart


  • Alternatively, use a simple script that verifies the correctness of the config files for you. It is located in the root directory of the Mom server.
sudo /res
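
The verify-then-restart routine described above can be sketched in Python as follows. This is not the script mentioned above, only an illustration: it assumes that icinga -v exits with a non-zero status when the configuration contains errors, and the function name verify_then_restart is made up for this example.

```python
import subprocess

# Commands taken from the steps above.
VERIFY_CMD = ["/usr/sbin/icinga", "-v", "/etc/icinga/icinga.cfg"]
RESTART_CMD = ["sudo", "/etc/init.d/icinga", "restart"]


def verify_then_restart(run=subprocess.call):
    """Restart Icinga only if the configuration check exits with status 0.

    `run` takes an argument list and returns the exit status; it is
    injectable so the logic can be tried without touching a real daemon.
    """
    if run(VERIFY_CMD) != 0:
        return False  # configuration is broken; leave the running daemon alone
    return run(RESTART_CMD) == 0
```

The function returns True only when both the verification and the restart succeed, so a broken configuration never reaches the restart step.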

Plugins

Unlike many other monitoring tools, Icinga does not include any internal mechanisms for checking the status of hosts and services on your network. Instead, Icinga relies on external programs (called plugins) to do all the dirty work.

Plugins are compiled executables or scripts (Perl scripts, shell scripts, etc.) that can be run from a command line to check the status of a host or service. Icinga uses the results from plugins to determine the current status of hosts and services on your network.

Icinga will execute a plugin whenever there is a need to check the status of a service or host. The plugin does something (notice the very general term) to perform the check and then simply returns the results to Icinga. Icinga will process the results that it receives from the plugin and take any necessary actions (sending out notifications, etc).

http://docs.icinga.org/latest/en/plugins.html

Parameter passing

 check_command     check_sample!argument1!parameter!5
                                |         |         +---------------------------------+
                                |         +---------------------------------+         |
                                +---------------------------------+         |         |
                                                                  |         |         |
 Host macro ----------------------------------------+             |         |         |
                                                    |             |         |         |
 User macro --------+                               |             |         |         |
                    |                               |             |         |         |
 command_line      $USER1$/sample-plugin.pl -H $HOSTADDRESS$ -a $ARG1$ -p $ARG2$ -n $ARG3$

results in:

 /usr/local/icinga/libexec/sample-plugin.pl -H 192.168.1.2 -a argument1 -p parameter -n 5

The diagram shows the check_command directive of the service definition at the top, the command_line directive of the command definition below it, and the resulting plugin invocation at the bottom. $USERn$ macros are defined in resource.cfg (see Bugweis:IcingaIntroduction#resource.cfg); $HOSTADDRESS$ is one of the standard macros, which are listed and described here: http://docs.icinga.org/latest/en/macrolist.html .
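
The substitution can also be expressed as a short Python sketch. It only mimics the idea (split check_command on the exclamation mark, then replace $MACRO$ placeholders) and is not how Icinga is implemented; the function name expand_command and the dictionaries passed to it are hypothetical.

```python
import re


def expand_command(check_command, commands, macros):
    """Expand a check_command into the command line that would be run.

    check_command: 'name!arg1!arg2...' from a service definition
    commands:      mapping of command_name -> command_line
    macros:        values for non-argument macros such as USER1, HOSTADDRESS
    """
    name, *args = check_command.split("!")
    values = dict(macros)
    for index, arg in enumerate(args, start=1):
        values[f"ARG{index}"] = arg

    def substitute(match):
        # Unknown macros expand to an empty string in this sketch.
        return values.get(match.group(1), "")

    return re.sub(r"\$(\w+)\$", substitute, commands[name])
```

Fed with the command_line from the diagram, USER1 set to /usr/local/icinga/libexec and HOSTADDRESS set to 192.168.1.2, the call expand_command("check_sample!argument1!parameter!5", ...) reproduces the resulting command line shown above.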

Plugin usage

Some words about particular plugins

Nagios plugins official pack

Community created plugins

check_nrpe

  • This plugin needs special attention: it retrieves data that can only be collected locally on a remote server, by querying the NRPE daemon running there.
  • $ARG1$ stands for a remotely defined command to execute
define command{
   command_name check_remote
   command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

https://www.monitoringexchange.org/wiki/Plugin:check_nrpe

Additional information on this topic can be found here Bugweis:IcingaIntroduction#NRPE.

NRPE

Nagios Remote Plugin Executor (or NRPE for short) is an addon used to execute plugins to monitor "local" resources on remote Linux or Unix systems. Some resources cannot (or should not) be monitored via SNMP or using other agents across the network so you have to check them using programs installed locally on the machines to be monitored and transmit the results back to the Icinga server.

Installation

  • nagios-plugins
sudo apt-get install nagios-plugins
  • nagios-nrpe-server
sudo apt-get install nagios-nrpe-server
  • External system utilities that the plugins need to obtain the status of the checked objects, such as iostat, bc, smartmontools, etc.
sudo apt-get install sysstat bc smartmontools

Configuration

  • Edit /etc/nagios/nrpe.cfg

Important settings:

  • Adjust port if server is behind a firewall / router
server_port=5666
  • Add the monitoring server's IP; the localhost IP is still useful for testing plugins
allowed_hosts=31.31.73.149,127.0.0.1
  • Adjust the time plugins are allowed to run before the NRPE daemon returns its response to the monitoring server
command_timeout=60
  • Server-specific command definitions go here. All these plugins run locally and are triggered remotely by the monitoring server.
  • Note the /usr/bin/sudo prefix. Some plugins need superuser privileges to retrieve their data (for example check_smart, check_raid, check_rdiff). Make sure the nagios user is among the sudoers; if not, add the line nagios ALL=(ALL) NOPASSWD: /usr/lib/nagios/plugins/ to /etc/sudoers.
command[check_apt]=/usr/lib/nagios/plugins/check_apt
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_raid]=/usr/bin/sudo /usr/lib/nagios/plugins/check_raid0
...
  • Don't forget to restart the daemon after the config has been changed!
sudo /etc/init.d/nagios-nrpe-server restart

Example Commands

command[check_apt]=/usr/lib/nagios/plugins/check_apt -t 55
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 15% -c 10% -p /
command[check_swap]=/usr/lib/nagios/plugins/check_swap -w 10% -c 5%
command[check_raid]=/usr/bin/sudo /usr/lib/nagios/plugins/check_raid0
command[check_smart_sda]=/usr/bin/sudo /usr/lib/nagios/plugins/check_smart_noperf -d /dev/cciss/c0d0 -i scsi
command[check_mem]=/usr/lib/nagios/plugins/check_mem -w 90 -c 95 -u -C
command[check_iftrafficn]=/usr/lib/nagios/plugins/check_iftrafficn -i eth0 -u m -b 100
command[check_ntp_time]=/usr/lib/nagios/plugins/check_ntp_time -H 'de.pool.ntp.org' -w '0.25' -c '0.5' -t 55
command[check_process]=/usr/lib/nagios/plugins/check_process -w 500 -c 750
command[check_open_files]=/usr/lib/nagios/plugins/check_open_files -w '80' -c '90'
command[check_iostat_overall]=/usr/lib/nagios/plugins/check_iostat_overall -w 1750,100,150 -c 2000,150,175 -d cciss/c0d0
command[check_smart_raid0]=/usr/bin/sudo /usr/lib/nagios/plugins/check_smart_raid -d /dev/cciss/c0d0 -i cciss,0 -m scsi
command[check_smart_raid1]=/usr/bin/sudo /usr/lib/nagios/plugins/check_smart_raid -d /dev/cciss/c0d0 -i cciss,1 -m scsi
command[check_cubert_mysql]=/usr/lib/nagios/plugins/check_mysqld -H localhost -u nagios -p mysqlPassword -a uptime,threads_running,threads_connected,slow_queries,open_tables,questions -w ,24,32,400000,2048,384 -c ,32,48,500000,4096,512 -A 'com_commit,com_rollback,com_delete,com_update,com_insert,com_insert_select,com_select,qcache_hits,qcache_inserts,qcache_not_cached,questions,bytes_sent,bytes_received,connections,open_tables,threads_cached,threads_connected,threads_created,threads_running'
  • As we can see, all plugin parameters are hardcoded in the config. This is not the only option, but it is the safest one from a security point of view.
  • mysqlPassword isn't really the mysql password:-)
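
When a command definition is suspected to be wrong, it can help to list what nrpe.cfg actually defines. The Python sketch below is a hedged illustration: it assumes the command[name]=command line format shown above, and the helper names parse_nrpe_commands and uses_sudo are invented for this example.

```python
import re

# Matches lines such as: command[check_load]=/usr/lib/nagios/plugins/check_load -w ...
COMMAND_RE = re.compile(r"^command\[(?P<name>[^\]]+)\]=(?P<line>.+)$")


def parse_nrpe_commands(text):
    """Return {command name: command line} for every command[...] entry."""
    commands = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # skip commented-out definitions
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group("name")] = match.group("line").strip()
    return commands


def uses_sudo(command_line):
    """True if the plugin is run through /usr/bin/sudo."""
    return command_line.split()[0] == "/usr/bin/sudo"
```

Listing the commands for which uses_sudo returns True shows which plugins depend on the sudoers entry for the nagios user.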

Testing

  • It is very useful to debug the settings locally first. Since check_nrpe is an ordinary executable, this is easy.
  • First try:
/usr/lib/nagios/plugins/check_nrpe -H localhost
  • This command should return the NRPE version. If it does, individual commands may be tested like this:
/usr/lib/nagios/plugins/check_nrpe -H localhost -u -t 60 -c check_apt
  • A very similar command is then issued on the monitoring machine to get the remote results (with the host IP and plugin path changed as necessary). Explanation of the check_nrpe switches: IcingaIntroduction#check_nrpe .

More on the NRPE topic to be found here http://docs.icinga.org/latest/en/nrpe.html .

Errors

Let's mention some common error messages returned by plugins, their possible causes and potential fixes.

Return code of 13 is out of bounds
  • Plugin uses a temporary file and nagios user doesn't have the privileges to read it or to write to it.
  • How to fix? Delete or chown the temp file.
(No output returned from plugin)
  • Generally a problem in the plugin itself: something in the script failed before any output was printed, or the plugin printed nothing by design (which is improbable).
  • How to fix? Go through the plugin script code; script failures should at least produce some kind of error message.
Service check did not exit properly.
  • The Perl plugin is not compatible with Icinga's embedded Perl interpreter (ePN). See the rules for developing Perl plugins for use with the embedded Perl interpreter: http://docs.icinga.org/latest/en/epnplugins.html
  • How to fix? Add path to perl interpreter in front of a path to the particular command in command definition to bypass ePN; for example:
command_line    /usr/bin/perl /usr/lib/nagios/plugins/check_mem -w '$ARG1$' -c '$ARG2$' -u -C
NRPE: Unable to read output
  • As far as I know, there are two possibilities:
    • Wrong command definition in nrpe.cfg.
    • The plugin needs superuser privileges AND the nagios user is not among the sudoers.
CHECK_NRPE: Error - Could not complete SSL handshake
  • The monitoring server's IP address is not allowed to access the remote host.
  • How to fix? Add it to allowed_hosts in nrpe.cfg, see IcingaIntroduction#Configuration1
CHECK_NRPE: No output returned from daemon
  • Potential error(s) in nrpe.cfg
  • or insufficient permissions to read input files by plugin (nagios user).
CHECK_NRPE: Socket timeout after 60 seconds.
  • The plugin did not complete in time; short network outages or slow plugin processing on the remote host may be the cause.

User/contact addition and CGI privileges

User/contact addition

Icinga contact creation

  • Create a contact definition in contacts.cfg,
define contact{
   contact_name                    john
   alias                           Johnny
   service_notification_period     24x7
   host_notification_period        24x7
   service_notification_options    n
   host_notification_options       n
   service_notification_commands   notify-service-by-email
   host_notification_commands      notify-host-by-email
   email                           johnny@web.com
}
  • then either assign the contact directly to hosts and services
define service{
   use                             generic-service-charted
   host_name                       localhost
   service_description             rdiff backup@Zapp
   check_command                   check_rdiff!/mnt/disk/backups/Zapp2324
   notification_interval           0
   contacts                        john
}
  • or make him a member of some contactgroup which is already assigned
define contactgroup{
   contactgroup_name       admins
   alias                   Nagios Administrators
   members                 root, john
}
  • or do both - contact can be assigned both directly and using contactgroup too.

Icinga CGI user creation

htpasswd /etc/icinga/htpasswd.users john
  • Note: the CGI username and the Icinga contact_name should be identical.
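
Whether every CGI user has a contact of the same name can be checked with a few lines of Python. This is a rough sketch under assumptions: it expects htpasswd lines in the usual user:hash form and contact definitions written as define contact{ ... } blocks without nested braces, and all function names are hypothetical.

```python
import re


def contact_names(objects_cfg):
    """Collect contact_name values from 'define contact{...}' blocks."""
    names = set()
    for block in re.findall(r"define\s+contact\s*\{(.*?)\}", objects_cfg, re.S):
        match = re.search(r"^\s*contact_name\s+(\S+)", block, re.M)
        if match:
            names.add(match.group(1))
    return names


def cgi_users(htpasswd_text):
    """Collect user names from 'user:hash' lines of an htpasswd file."""
    return {
        line.split(":", 1)[0]
        for line in htpasswd_text.splitlines()
        if ":" in line
    }


def users_without_contact(objects_cfg, htpasswd_text):
    """CGI users that have no contact definition of the same name."""
    return sorted(cgi_users(htpasswd_text) - contact_names(objects_cfg))
```

A user reported by users_without_contact is not necessarily a mistake, since a read-only user does not have to be defined as a contact, but for ordinary users it usually points to a name mismatch.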

User privileges

Now this user:

  • has access to CGIs at http://31.31.73.149/icinga
  • is authorised for each host he is a contact of to:
    • view host status information
    • view history and notifications for the host
    • issue host commands
    • view status information for all services on the host
    • view history and notification information for all services on the host
    • issue commands for all services on the host
  • is authorised for each service he is a contact of to:
    • view service status information
    • view history and notifications for the service
    • issue service commands
  • and is NOT authorised for these activities:
    • viewing the raw log file via the showlog CGI
    • viewing Icinga process information via the extended information CGI
    • issuing Icinga process commands via the command CGI
    • viewing host group, contact, contact group, time period, and command definitions via the configuration CGI
    • viewing host/service status information, history and notifications for services he is not a contact of
    • issuing host/service commands for objects he is not a contact of.

CGI privileges

Extended privileges

We can grant authenticated contacts or other authenticated users permission to additional information in the CGIs by adding them to various authorization variables in the CGI configuration file cgi.cfg.

authorized_for_full_command_resolution
  • Users, contacts and contactgroups assigned to this directive can view a command in the config command expander exactly as Icinga would execute it. For example:

Expand.png

authorized_for_system_information
authorized_for_system_commands
authorized_for_configuration_information
authorized_for_all_hosts
authorized_for_all_host_commands
authorized_for_all_services
authorized_for_all_service_commands
  • The other directives are hopefully self-explanatory.

Read-only privileges

It is possible to create a user with read-only privileges, for example for a flat-panel PC hung on the wall that displays a monitoring overview. Such a user does not need to issue commands but must be authorised to view all services. The user does not have to be defined as a contact and assigned to individual services; only a few directives in cgi.cfg need to be altered.

authorized_for_read_only
  • the user can view host and service information, but can't issue commands
authorized_for_comments_read_only
  • if the user is read only, he can also see comments
authorized_for_downtimes_read_only
  • if the user is read only, he can also see downtimes

More to be found at: http://docs.icinga.org/latest/en/cgiauth.html

Final notes

Flapping

Flapping occurs when a service or host changes state too frequently, resulting in a storm of problem and recovery notifications. More information can be found here: http://docs.icinga.org/latest/en/flapping.html

IP vs. FQDN in host definition

Normally, an IP address is used, although it could really be anything you want (so long as it can be used to check the status of the host). You can use a FQDN to identify the host instead of an IP address, but if DNS services are not available this could cause problems. When used properly, the $HOSTADDRESS$ macro will contain this address.

Time-Saving Tricks For Object Definitions

This documentation http://docs.icinga.org/latest/en/objecttricks.html attempts to explain how you can exploit the (somewhat) hidden features of template-based object definitions to save your sanity. How so, you ask? Several types of objects allow you to specify multiple host names and/or hostgroup names in definitions, allowing you to "copy" the object definition to multiple hosts or services. More on groups to be found here: Bugweis:IcingaIntroduction#HostgroupsServicegroupsandContactgroups

PNP4Nagios does not draw charts for services with a long check interval

This happens when the interval between checks of a service exceeds the heartbeat of the RRD database. The heartbeat defines the maximum gap allowed between two data inputs to the database. There are two solutions: either lower the service check interval below the current heartbeat, or increase the heartbeat in the pnp4nagios config file. It is controlled by the directive RRD_HEARTBEAT = 12000 in /etc/pnp4nagios/process_perfdata.cfg (the value is in seconds).
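
The comparison behind this rule is simple arithmetic, sketched here in Python. The sketch assumes that one check-interval time unit is 60 seconds (the interval_length setting is not discussed on this page, so treat that value as an assumption), and the function name exceeds_heartbeat is made up for the example.

```python
def exceeds_heartbeat(check_interval, rrd_heartbeat=12000, interval_length=60):
    """True if the gap between two checks is longer than the RRD heartbeat.

    check_interval:  normal_check_interval of the service, in time units
    rrd_heartbeat:   RRD_HEARTBEAT from process_perfdata.cfg, in seconds
    interval_length: seconds per time unit (assumed to be 60 here)
    """
    return check_interval * interval_length > rrd_heartbeat
```

For instance, a service checked once a day (a check interval of 1440) waits 86400 seconds between checks, which is well above a heartbeat of 12000 seconds, so its chart would stay empty until one of the two settings is changed.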