One of the problems we have with Nagios Core is ensuring that there is a mechanism our backup Nagios server can use to determine that it needs to start publishing records to the alert database when the primary host is no longer around or is unable to write records to the alert database.
Most HA designs implement a heartbeat between the primary and backup hosts, which helps the backup determine that the primary is no longer available. For a hosted service like a web server or a VRRP default gateway, this makes sense. It makes a lot less sense for a service host whose job is to examine other devices and hosts.
Rather than implementing a heartbeat between the two servers, we could simply implement a rally point in the alerting database that helps the backup host determine whether the primary is healthy. Let's call this an Inert Alert Record since the purpose of the record will not be to alert anyone... it's just a record.
The primary host would need to implement a flow that updates this well-known inert alert record every minute. The backup host would check that record and, once it determines that the primary has not updated it for N or more intervals, start upserting records in the alert database. When the primary comes back online, the inert record again reflects recent upserts from the primary host, and the backup host stops upserting records.
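To make the flow concrete, here is a rough sketch of both sides in Python. It is not tied to any real Nagios or alert database API. SQLite stands in for the alert database, and the table layout, the `INERT_KEY` name, the function names, and the choice of N = 3 are all assumptions made up for illustration. Treating a missing record as stale is also an assumption; you may prefer the opposite.

```python
import sqlite3

INTERVAL_SECONDS = 60    # the primary updates the record once a minute
MISSED_INTERVALS = 3     # "N" from the text; 3 is an arbitrary example value
INERT_KEY = "nagios-primary-inert"  # hypothetical well-known record key


def create_schema(db):
    """Create a stand-in alerts table (hypothetical layout)."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS alerts ("
        "alert_key TEXT PRIMARY KEY, source TEXT, updated_at REAL)"
    )


def primary_touch(db, now):
    """Primary side: upsert the inert record with the current timestamp."""
    db.execute(
        "INSERT OR REPLACE INTO alerts (alert_key, source, updated_at) "
        "VALUES (?, ?, ?)",
        (INERT_KEY, "primary", now),
    )
    db.commit()


def backup_should_publish(db, now):
    """Backup side: True when the inert record is N or more intervals old."""
    row = db.execute(
        "SELECT updated_at FROM alerts WHERE alert_key = ?", (INERT_KEY,)
    ).fetchone()
    if row is None:
        # Assumption: no record at all means the primary never checked in.
        return True
    return (now - row[0]) >= MISSED_INTERVALS * INTERVAL_SECONDS
```

The backup would call `backup_should_publish(db, time.time())` on each pass and only upsert its own check results while it returns True. Because the decision is recomputed from the record every time, the backup steps aside on its own as soon as the primary starts touching the record again.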
What's nice about using a pattern like this is that you don't need a lot of custom software or communication flows... instead we exploit the system of record we are concerned with: the alerting database. We also avoid issues where both hosts are able to reach the alert database but unable to reach one another... no split brain with both hosts clobbering one another's changes.
This allows us to handle the following types of cases:
- primary host is down while backup host is not
- primary host is unable to access the alert database while the backup host still has access
Downsides include:
- the backup host has to monitor the inert record every minute
- the backup host has to perform redundant checking against all monitored hosts and devices
- does not cover the situation where the primary host is able to reach the alerting database but unable to reach the hosts/devices it is responsible for monitoring