High Availability

Hardware failure is the main cause of outages in an Internet security system. The ability of a system to continue providing services after a failure is called failover. Sophos UTM provides high availability (HA) failover, allowing you to set up a hot standby system that takes over if the primary system fails (active-passive). Alternatively, you can use Sophos UTM to set up a cluster, which distributes dedicated network traffic across a collection of nodes (active-active), similar to conventional load-balancing approaches, in order to optimize resource utilization and reduce computing time.

The concepts of high availability and cluster as implemented in Sophos UTM are closely related: a high availability system can be considered a two-node cluster, which is the minimum requirement to provide redundancy.

Each node within the cluster can assume one of the following roles:

Master: The primary system in a hot standby/cluster setup. Within a cluster, the master is responsible for synchronizing and distributing data.
Slave: The standby system in a hot standby/cluster setup which takes over operations if the master fails.
Worker: A simple cluster node, responsible for data processing only.

All nodes monitor one another by means of a so-called heartbeat signal, a periodically sent multicast UDP packet used to check whether the other nodes are still alive. If a node fails to send this packet due to a technical error, it is declared dead. Depending on the role the failed node had assumed, the configuration of the setup changes as follows:

If the master node fails, the slave will take its place and the worker node with the highest ID will become slave.
If the slave node fails, the worker node with the highest ID will become slave.
If a worker node fails, you may notice a performance decrease due to the lost processing power. However, the failover capability is not impaired.
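The promotion rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described here, not the product's implementation: the function name handle_node_failure, the role strings, and the dictionary that maps node IDs to roles are all hypothetical, and the sketch assumes that a slave node exists whenever the master fails.

```python
from typing import Dict

# Hypothetical role labels used only for this illustration.
MASTER, SLAVE, WORKER = "master", "slave", "worker"


def handle_node_failure(roles: Dict[int, str], failed_id: int) -> Dict[int, str]:
    """Return a new node-ID-to-role map after the given node is declared dead."""
    failed_role = roles[failed_id]
    # Drop the dead node; the input mapping is left unmodified.
    remaining = {nid: role for nid, role in roles.items() if nid != failed_id}

    if failed_role == MASTER:
        # The slave takes the master's place (assumes a slave exists).
        slaves = [nid for nid, role in remaining.items() if role == SLAVE]
        if slaves:
            remaining[slaves[0]] = MASTER

    if failed_role in (MASTER, SLAVE):
        # In both cases the slave position is now vacant:
        # the worker with the highest ID becomes the new slave.
        workers = [nid for nid, role in remaining.items() if role == WORKER]
        if workers:
            remaining[max(workers)] = SLAVE

    # A failed worker changes no roles; only processing power is lost.
    return remaining
```

For a hypothetical four-node cluster {1: master, 2: slave, 3: worker, 4: worker}, losing node 1 leaves node 2 as master and node 4 as slave, while losing node 3 leaves the master and slave assignments unchanged.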

Note – HA settings are part of the hardware configuration and cannot be saved in a backup. This also means that a backup restore does not overwrite HA settings.

Reporting

All reporting data is consolidated on the master node and synchronized to the other cluster nodes at five-minute intervals. In case of a takeover, you will therefore lose no more than five minutes of reporting data. However, the data collection process differs by report type. The graphs displayed on the Logging & Reporting > Hardware tabs represent only the data of the node that is currently master. Accounting information, on the other hand, such as that shown on the Logging & Reporting > Network Usage page, represents data collected by all nodes involved. For example, today's CPU usage histogram shows the current processor utilization of the master node; after a takeover, it would show the data of the node that was previously the slave. Information about top accounting services, by contrast, is a collection of data from all nodes that were involved in the distributed processing of traffic that passed through the unit.

Notes

The Address Resolution Protocol (ARP) is only used by the actual master. That is to say, slave and worker nodes do not send or reply to ARP requests.
In case of a failover event, the unit that takes over operations performs an ARP announcement (also known as gratuitous ARP), which is usually an ARP request intended to update the ARP caches of other hosts that receive the request. Gratuitous ARP is used to announce that the IP address of the master has moved to the slave.
All interfaces configured on the master must have a physical link, that is, each port must be properly connected to a network device.
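To illustrate what a gratuitous ARP announcement looks like on the wire, the following Python sketch assembles a generic Ethernet frame carrying an ARP request in which the sender and target IP addresses are identical. It shows the general packet layout only and does not reproduce the exact frame Sophos UTM sends; the function name build_gratuitous_arp and the example MAC and IP addresses are hypothetical.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame containing a gratuitous ARP request."""
    hw_addr = bytes.fromhex(mac.replace(":", ""))  # 6-byte MAC of the announcing unit
    ip_addr = socket.inet_aton(ip)                 # 4-byte IPv4 address being announced

    # Ethernet header: broadcast destination, our source MAC, EtherType 0x0806 (ARP).
    ethernet = b"\xff" * 6 + hw_addr + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # hardware length 6, protocol length 4, operation 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw_addr       # sender hardware address
    arp += ip_addr       # sender protocol address
    arp += b"\x00" * 6   # target hardware address (unset in an announcement)
    arp += ip_addr       # target protocol address equals the sender address

    return ethernet + arp


# Example with placeholder addresses (documentation IP range).
frame = build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.1")
```

Because the frame is broadcast and names the announcing unit's own MAC address as the owner of the IP address, hosts that receive it can refresh their ARP cache entry for that address. Actually sending such a frame requires a raw socket and elevated privileges, which this sketch deliberately leaves out.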