In this post we will learn how to monitor computers and services with Nagios. This will allow us to receive timely and actionable alerts about issues in our infrastructure.
- What is Nagios.
- Nagios: How it works.
- Description of the scenario we are working with in this post.
- Installing Nagios.
- Monitoring services and computers.
- At-a-glance overview of our devices and services.
- Testing notifications.
- Closing thoughts.
What is Nagios.
Nagios is an application that monitors computers, services, networks and infrastructure. It sends alerts in case of trouble, and it sends notifications when the issue is resolved.
Unlike commercial alternatives such as SolarWinds, Nagios is free and Open Source, while still offering the option of purchasing a support plan.
Nagios: How it works.
Nagios can obtain information about the monitored resources in two different ways:
- Using the Nagios Agent: after deploying the agent to the computers you want to monitor, the agents collect the information and perform the checks themselves before sending the results to the Nagios server. The main advantage of this method is the flexibility it offers, since we can define custom checks and write code to monitor any resource we can think of. The disadvantage is that we have to push the definition of these custom checks to the clients.
- Agentless: the Nagios server uses the facilities provided by the device being monitored (for example, WMI or SNMP). The main advantage of this mode is that it reduces the complexity of our deployment and lets us centralize all the configuration on the Nagios server. The disadvantage is that it does not allow for custom checks, so if the device does not provide a facility for monitoring a resource, we can’t monitor it this way.
Nagios is able to use both modes at the same time, combining the results from agent checks and from agentless checks.
Monitored service states, and transition flow between states.
The state of a resource is determined through two components:
- The state of the service or resource: OK, WARNING, CRITICAL or UNKNOWN for services; UP, DOWN or UNREACHABLE for hosts.
- The type of state the resource is in: soft or hard.
To prevent false positives caused by temporary circumstances, Nagios will perform the check a number of times before considering a problem to be “happening”. The number of retries is defined by the max_check_attempts variable, and once the limit is reached Nagios changes the issue from soft to hard. This also works in reverse: before marking an issue as resolved, the check is repeated the same number of times.
Unless we specify otherwise, notifications are only sent when a resource is in a hard problem state.
We can write our own handler for these events to perform custom actions depending on the resource state. For example, if the web server is soft down and the check has already been retried 3 times, we can trigger the deployment of a secondary instance of the web server and add it to the load balancer, in anticipation of the main web server going down.
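As an illustration of how these pieces fit together, here is a minimal service definition sketch; the host name webserver01 and the event handler command handle-webserver-failure are hypothetical placeholders, not part of the scenario described later in this post:
```
define service {
    use                  generic-service            ; template shipped with Debian's nagios3 package
    host_name            webserver01                ; hypothetical host
    service_description  HTTP
    check_command        check_http                 ; standard nagios-plugins check
    max_check_attempts   4                          ; retries before the problem becomes a hard state
    event_handler        handle-webserver-failure   ; hypothetical command run on state changes
}
```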
An in-depth description of the states is available in the manual.
What does it mean when a resource is “flapping”?
When a resource rapidly and repeatedly transitions between OK and a problem state, the resource is tagged as flapping. This could indicate a deeper issue, or it could be caused by perfectly normal conditions: for example, we may know for sure that the database becomes unresponsive at a certain time of the day because of heavy load, and that this does not affect regular operations.
In any case, flapping generates many notifications about the same issue. To mitigate this, we can increase max_check_attempts to a value that lets these false positives fall under the threshold, being careful not to raise it so much that real issues go unreported.
Nagios contains logic that tries to address flapping, but given the nature of the problem it might not be enough to eliminate it in our specific environment.
Description of the scenario we are working with in this post.
We have three virtual machines in the same network, all three using Debian 8 (Jessie).
- nagiosHost: 192.168.100.1, will function as the Nagios server. It will also host a MongoDB instance, and an SSH server.
- servidorLamp: 192.168.100.2, will have a LAMP stack installed.
- escritorioUsuario: 192.168.100.3, will be a desktop for the end-user to use.
This scenario can be deployed automatically through the use of this Vagrantfile, including all services and configuration.
Installing Nagios.
nagios-nrpe-server is the Debian package that provides the agent installed on the monitored machines; the agent sends status reports back to the Nagios server and executes remote checks. As explained at the beginning, these checks can be customized by the administrator, and nagios-plugins is a package that contains many useful pre-defined checks. Later on we’ll explain how to install additional components that will let us monitor a MongoDB database, and how to create our own custom checks to, for example, check for the existence of zombie processes.
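On the Nagios server itself, a sketch of the installation from the standard Debian Jessie repositories (nagios3 provides the server and web interface; nagios-nrpe-plugin provides the check_nrpe command the server uses to query the agents):
```
# On nagiosHost (the Nagios server)
apt-get install nagios3 nagios-plugins nagios-nrpe-plugin
```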
To be able to run checks on-demand, we have to make the following change on /etc/nagios3/nagios.cfg
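The relevant directive is check_external_commands; the rest of the file can stay as shipped:
```
# /etc/nagios3/nagios.cfg
check_external_commands=1
```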
We must also change the owner and mode of two directories, /var/lib/nagios3 and /var/lib/nagios3/rw, so the nagios service is able to write temporary files to them. If we don’t, we will get error messages along the lines of Could not stat() command file ‘/var/lib/nagios3/rw/nagios.cmd’.
We will implement these changes at the package manager level, to preserve them in case of updates.
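A sketch of that, using dpkg-statoverride so the permissions survive package upgrades (the group and modes below follow the usual Debian recommendation for this error; adjust them if your setup differs):
```
# let the web server group send external commands through the command file
dpkg-statoverride --update --add nagios www-data 2710 /var/lib/nagios3/rw
dpkg-statoverride --update --add nagios nagios 751 /var/lib/nagios3
```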
We are also going to create a new user called ‘nagiosadmin’ with the password ‘nagios’ to be able to access the web panel. Then, we restart the service so all our changes go into effect.
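Assuming the default Debian layout, that would look something like this:
```
# create the web interface user (we are prompted for the password, 'nagios' in this scenario);
# -c creates the htpasswd file, omit it if the file already exists
htpasswd -c /etc/nagios3/htpasswd.users nagiosadmin

# restart Nagios so all the changes take effect
service nagios3 restart
```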
Monitoring services and computers.
Logically speaking, services live inside devices. We are going to install the monitoring agent on nagiosHost; the process is exactly the same for the other two devices on our network, servidorLamp and escritorioUsuario.
Installing the monitoring agent.
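A sketch of the installation on each machine we want to monitor, using the Debian package names (nagios-nrpe-server provides the agent, nagios-plugins the standard checks):
```
# On the monitored machine (nagiosHost, servidorLamp, escritorioUsuario)
apt-get install nagios-nrpe-server nagios-plugins
```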
We enable remote custom commands.
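One way to do this, assuming it matches the original setup, is to allow the NRPE daemon to accept command arguments from the server in /etc/nagios/nrpe.cfg:
```
# /etc/nagios/nrpe.cfg
# allow the Nagios server to pass arguments to the commands defined here
dont_blame_nrpe=1
```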
And we whitelist the server.
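The whitelist is the allowed_hosts directive in the same file; after changing it we restart the agent:
```
# /etc/nagios/nrpe.cfg
allowed_hosts=127.0.0.1,192.168.100.1

# apply the changes
service nagios-nrpe-server restart
```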
Adding devices to the monitoring list
The /etc/nagios3/conf.d folder is consumed in its entirety to build the runtime configuration, so we can split the settings across files however we see fit, following whatever criteria suit our infrastructure best.
These steps are to be performed on nagiosHost.
/etc/nagios3/conf.d/nodes.cfg contains the list and addresses of the computers we are going to monitor. In our specific scenario, these would be its contents.
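A sketch of those contents, using the host names and addresses from the scenario and the generic-host template that Debian ships in conf.d:
```
# /etc/nagios3/conf.d/nodes.cfg
define host {
    use        generic-host
    host_name  nagiosHost
    alias      Nagios server
    address    192.168.100.1
}

define host {
    use        generic-host
    host_name  servidorLamp
    alias      LAMP server
    address    192.168.100.2
}

define host {
    use        generic-host
    host_name  escritorioUsuario
    alias      User desktop
    address    192.168.100.3
}
```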
Devices belong to groups, and monitoring policies can be applied to groups. We are going to store these definitions in /etc/nagios3/conf.d/hostgroups_nagios2.cfg.
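A possible set of groups for this scenario; the group names are our own choice rather than anything mandated by Nagios, and the ssh-servers and mongodb-servers groups are referenced again later:
```
# /etc/nagios3/conf.d/hostgroups_nagios2.cfg
define hostgroup {
    hostgroup_name  debian-servers
    alias           Debian machines
    members         nagiosHost,servidorLamp,escritorioUsuario
}

define hostgroup {
    hostgroup_name  ssh-servers
    alias           SSH servers
    members         nagiosHost,servidorLamp,escritorioUsuario
}

define hostgroup {
    hostgroup_name  mongodb-servers
    alias           MongoDB servers
    members         nagiosHost
}
```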
Defining services to monitor
/etc/nagios3/conf.d/services_nagios2.cfg contains the definitions of the services we wish to monitor across groups of devices.
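For example, checking SSH and basic reachability across the groups defined above (check_ssh and check_ping are standard nagios-plugins commands; the thresholds are illustrative):
```
# /etc/nagios3/conf.d/services_nagios2.cfg
define service {
    use                  generic-service
    hostgroup_name       ssh-servers
    service_description  SSH
    check_command        check_ssh
}

define service {
    use                  generic-service
    hostgroup_name       debian-servers
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}
```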
/etc/nagios3/conf.d/nagiosHost.cfg
This file will contain the services we wish to monitor on nagiosHost: number of currently logged-in users, number of processes running and cpu load over time.
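Since nagiosHost is the machine Nagios itself runs on, these can be plain local checks; the following sketch mirrors the thresholds in Debian's sample localhost configuration (they are just example values):
```
# /etc/nagios3/conf.d/nagiosHost.cfg
define service {
    use                  generic-service
    host_name            nagiosHost
    service_description  Current Users
    check_command        check_users!20!50
}

define service {
    use                  generic-service
    host_name            nagiosHost
    service_description  Total Processes
    check_command        check_procs!250!400
}

define service {
    use                  generic-service
    host_name            nagiosHost
    service_description  Current Load
    check_command        check_load!5.0!4.0!3.0!10.0!6.0!4.0
}
```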
/etc/nagios3/conf.d/escritorioUsuario.cfg
Likewise for escritorioUsuario: this will contain the relevant definitions to check for zombie processes.
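A sketch of the file; check_remote_zombies is a command name we choose ourselves, and it is defined in the next section:
```
# /etc/nagios3/conf.d/escritorioUsuario.cfg
define service {
    use                  generic-service
    host_name            escritorioUsuario
    service_description  Zombie Processes
    check_command        check_remote_zombies
}
```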
Since this is a custom check, we need to explicitly define it.
Adding custom service checks.
Custom service checks are defined in two parts: A mandatory definition in the Nagios server, and an optional second definition in the monitored device.
On the Nagios server, we define the custom command name and the way the server will invoke it.
On the monitored device, we define the command that will actually run on the device itself.
If we need to gather information about the state of the monitored device that can only be collected by running code on the device, then the secondary definition becomes necessary.
For example, in our scenario we are going to check for zombie processes and check whether a MongoDB server is available. To gather information about the list of processes on the monitored machine, we have to run code on the machine itself. To find out whether a MongoDB server is up and accepting connections, we can check from the outside, by trying to connect to the MongoDB server like any other client would.
Checking for zombie processes.
Nagios server
/etc/nagios3/commands.cfg
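A sketch of the server-side command; it delegates to check_nrpe (provided by nagios-nrpe-plugin), which asks the agent on the monitored device to run its local check_zombie_procs command. Both command names are our own choice:
```
# /etc/nagios3/commands.cfg
define command {
    command_name  check_remote_zombies
    command_line  /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c check_zombie_procs
}
```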
Monitored device.
/etc/nagios/nrpe_local.cfg
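And the client-side definition, which runs check_procs locally and flags processes in state Z; the 5/10 warning and critical thresholds are just an example:
```
# /etc/nagios/nrpe_local.cfg
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
```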
Installing additional plugins to monitor other services.
To be able to monitor the state of the MongoDB database on nagiosHost, we are going to install nagios-plugin-mongodb. The process is very similar to adding a custom check: the only explicit difference is that we don’t have a check_mongodb command already available in the PATH, thus we have to install it first.
Installing nagios-plugin-mongodb
Since the plugin is coded in Python, the install process is very simple. We are going to clone the repository on the Nagios server, install the requirements and then the python MongoDB connector.
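A sketch of that installation, assuming the widely used mzupan/nagios-plugin-mongodb repository and pymongo as the Python connector:
```
# On the Nagios server
cd /usr/lib/nagios/plugins
git clone https://github.com/mzupan/nagios-plugin-mongodb.git

# install the Python MongoDB connector the plugin relies on
apt-get install python-pip
pip install pymongo
```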
We add the relevant service definition to the configuration. Since we may add additional MongoDB servers in the future, it makes sense to define it as a service that applies to a group of hosts.
/etc/nagios3/conf.d/services_nagios2.cfg
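For example, applied to the mongodb-servers group defined earlier (the warning/critical values of 2 and 4 seconds are illustrative):
```
# /etc/nagios3/conf.d/services_nagios2.cfg
define service {
    use                  generic-service
    hostgroup_name       mongodb-servers
    service_description  MongoDB connectivity
    check_command        check_mongodb!connect!2!4
}
```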
commands.cfg
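And the matching command definition; the -A/-W/-C flags (action, warning, critical) follow the plugin's documented syntax, and the path assumes we cloned the repository under /usr/lib/nagios/plugins:
```
# /etc/nagios3/commands.cfg
define command {
    command_name  check_mongodb
    command_line  /usr/lib/nagios/plugins/nagios-plugin-mongodb/check_mongodb.py -H $HOSTADDRESS$ -A $ARG1$ -W $ARG2$ -C $ARG3$
}
```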
The syntax of check_mongodb.py is explained in the documentation, and each plugin we install may have a different syntax.
Setting up email notifications
Nagios can send notification emails to contact groups, which in turn are made up of individual contacts. Let’s create a contact called root that will belong to the contact group admins.
/etc/nagios3/conf.d/contacts_nagios2.cfg will contain the list of contacts. It also contains the name of the commands used to send the notifications, which we’ll define later.
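A sketch of the contact; notify-host-by-email and notify-service-by-email are the command names Debian's default configuration already uses (we adapt them below), and the email address is a placeholder:
```
# /etc/nagios3/conf.d/contacts_nagios2.cfg
define contact {
    contact_name                   root
    alias                          Root
    email                          ourusername@gmail.com
    service_notification_period    24x7
    host_notification_period       24x7
    service_notification_options   w,u,c,r
    host_notification_options      d,r
    service_notification_commands  notify-service-by-email
    host_notification_commands     notify-host-by-email
}
```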
And /etc/nagios3/conf.d/contact_groups_nagios2.cfg will contain the list of groups, along with their memberships.
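A sketch of the group, with root as its only member:
```
# /etc/nagios3/conf.d/contact_groups_nagios2.cfg
define contactgroup {
    contactgroup_name  admins
    alias              Nagios administrators
    members            root
}
```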
Now, we need to add to each service the groups that we’d like to be notified in case of issues using the contact_groups variable. For the SSH service, it would look like this.
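A sketch, reusing the SSH service defined earlier and the admins contact group:
```
# /etc/nagios3/conf.d/services_nagios2.cfg
define service {
    use                  generic-service
    hostgroup_name       ssh-servers
    service_description  SSH
    check_command        check_ssh
    contact_groups       admins
}
```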
Customizing the email contents and the sending methods.
In this post we will use sendEmail, so we can send the emails through Gmail’s SMTP servers.
We have to adapt the email-sending syntax on commands.cfg.
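A sketch of the adapted commands, assuming the sendEmail tool (installed on Debian via the sendemail package; adjust the binary path if yours differs) and Gmail's SMTP server on port 587 with TLS. $USER4$ and $USER5$ hold the credentials, as described below:
```
# /etc/nagios3/commands.cfg
define command {
    command_name  notify-host-by-email
    command_line  /usr/bin/sendemail -s smtp.gmail.com:587 -o tls=yes -xu "$USER4$" -xp "$USER5$" -f "$USER4$" -t "$CONTACTEMAIL$" -u "Host $HOSTNAME$ is $HOSTSTATE$" -m "Info: $HOSTOUTPUT$ / Time: $LONGDATETIME$"
}

define command {
    command_name  notify-service-by-email
    command_line  /usr/bin/sendemail -s smtp.gmail.com:587 -o tls=yes -xu "$USER4$" -xp "$USER5$" -f "$USER4$" -t "$CONTACTEMAIL$" -u "Service $SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$" -m "Info: $SERVICEOUTPUT$ / Time: $LONGDATETIME$"
}
```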
You could further customize the command line to include other channels of communication. We could add SMS notifications through Twilio or messages through Slack, for example.
The variables $USER4$ and $USER5$ will contain our Gmail username and password, used to authenticate against the SMTP server.
/etc/nagios3/resource.cfg
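With placeholder values:
```
# /etc/nagios3/resource.cfg
$USER4$=ourusername@gmail.com
$USER5$=ourapppassword
```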
WARNING: These credentials are stored in plaintext. Besides using proper file permissions, it is recommended we use service credentials generated explicitly and uniquely for Nagios. In GMail’s case, we can enable 2-Factor authentication and generate an app password.
At-a-glance overview of our devices and services.
We can get a bird’s-eye view of our infrastructure through the use of the map view in the web interface.
If we click on any of the devices, we get a quick view of the services’ health.
Testing notifications.
Device offline.
What happens if escritorioUsuario is offline? The expected behaviour is that Nagios will retry the connection a number of times, and then notify us that we have a problem if the device is still not responding.
After a number of retries, Nagios marks the device as officially offline and proceeds to send an email notification.
After the machine comes back online, Nagios detects that the problem has been solved and notifies us.
Zombie process.
To generate a zombie process, we execute the following line on the terminal of escritorioUsuario.
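One quick way to do it (an assumption on our part, any command that leaves an unreaped child will do): the background child exits after one second, but its parent has exec'd into sleep and never reaps it, so it lingers as a zombie for about five minutes, long enough for the next check to catch it.
```
# the inner bash forks 'sleep 1', then replaces itself with 'sleep 300';
# when 'sleep 1' exits, nothing reaps it and it shows up in state Z
bash -c 'sleep 1 & exec sleep 300' &
```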
MongoDB server availability.
If we stop the MongoDB service to simulate a failure, we can see that Nagios detects that new connections are not being established and sends an email notification about it.
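For example (the service name depends on how MongoDB was installed; with Debian Jessie's mongodb package it is mongodb):
```
# On nagiosHost
service mongodb stop
```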
Closing thoughts.
It is of vital importance that we are aware of issues in our infrastructure before the customers are.
Nagios condenses all the information about the state of our deployment into a dashboard we can take in at a glance. The time investment required to set it up is worth the peace of mind: we know our infrastructure is healthy, and if that were to change we would be the first to know.