Configuring Fault Tolerance

The Traffic Managers in a cluster periodically check that they can communicate with the network using ICMP pings through each active network interface. They then broadcast a message describing their health (good or failed).

If one of the Traffic Managers in a cluster fails, the other Traffic Managers will take over any Traffic IP addresses that the failed Traffic Manager was managing.

Fault Tolerance Configuration Settings

You can configure fault tolerance using the settings in System > Fault Tolerance > General:

•flipper!autofailback: If a Traffic Manager that is hosting Traffic IP addresses fails, another Traffic Manager in the cluster raises the dropped addresses until the original Traffic Manager recovers.

Set to "Yes" to ensure that the Traffic IP addresses are automatically returned to the original Traffic Manager when it regains connectivity.

Set to "No" to force the original Traffic Manager into a pending mode upon recovery, in which case it waits to be instructed when it should raise the Traffic IP addresses.

The Admin UI displays a warning when a Traffic Manager is in the pending state. Use the Diagnose > Cluster Diagnosis page or the Services > Traffic IP Groups page to fully reactivate a pending Traffic Manager.

A pending Traffic Manager is automatically reactivated if all other Traffic Managers in the cluster fail. In this case, the Traffic Manager also takes over all Traffic IP addresses it is responsible for. A pending Traffic Manager also reactivates automatically if doing so would not cause any disruption to the services already running.

•flipper!autofailback_delay: The time interval, in seconds, after which a recovered Traffic Manager should re-raise dropped Traffic IP addresses.

If flipper!autofailback is set to "Yes", a previously-failed cluster member has to remain healthy for this period of time before failback can occur. This mechanism can help to ensure a recovered Traffic Manager is not still subject to some external factor blocking connectivity before it is required to raise Traffic IP addresses. If, during the waiting period, a further failure is detected, the delay timer restarts.

Manual failback is possible at any time during delayed auto-failback. To perform a manual failback, reactivate a pending Traffic Manager on the Diagnose page.

Changing the value of flipper!autofailback_delay has no effect on an already running delay timer until the timer is forced to restart through either a subsequent failure or a successful failback event. To enforce an immediate change, or just to cancel the current timer, disable auto-failback by setting flipper!autofailback to "No". If you then re-set flipper!autofailback to "Yes", the new delay period applies. Setting flipper!autofailback_delay to 0 (zero) in this situation stops a running timer and results in immediate failback.

Turning off auto-failback during a delay timer countdown always prevents a Traffic Manager from being used, even if the system had previously entered a healthy state and the delay timer was running.

•flipper!monitor_interval: The time interval, in milliseconds, at which the Traffic Manager checks whether it can contact other devices on the network. The result is then announced to other Traffic Managers on a multicast address. Decrease the interval to detect network failures sooner; note that this increases the workload on the system.

•flipper!monitor_timeout: The amount of time, in seconds, that the Traffic Manager should wait for a response when it has tried to contact one of the devices used to check connectivity. If a response is not received within this time, the Traffic Manager assumes it cannot contact the device. Decrease this value to allow the Traffic Manager to detect failures sooner, although the reliability of the system is reduced because slower connections might be incorrectly regarded as having failed.

•flipper!heartbeat_method: In a cluster, each Traffic Manager must send out periodic heartbeat messages to advise that all software and network services are running.

Use Unicast to instruct the Traffic Manager to send directed UDP messages to each Traffic Manager in your cluster. Use this setting if the switches in your network are unable to handle the potentially more efficient multicast packet method. Unicast messages can also reach machines on separate subnets; for example, Traffic Managers in disparate data centers. Use the Communication Port: field to specify the network port the Traffic Manager must listen on for unicast messages. This port must be contactable from all other Traffic Managers in the cluster.

Use Multicast to instruct the Traffic Manager to send these messages using multicast network packets, which should traverse all network ports on all switches that your Traffic Managers are connected to. Multicast messages only reach machines on the same subnet. Set the multicast address that all Traffic Managers should listen on using the Multicast address and port: field (by default, this is 239.100.1.1:9090).

•flipper!use_bindip: Whether multicast messages should be sent and received on the management interface alone.

Set this to “Yes” if your Traffic Managers communicate with each other on a dedicated network, separate from the one used by the main traffic. Note, however, that the reliability of the system could be compromised as a result.

Set this to “No” (default) to instruct the Traffic Manager to send the multicast messages it uses to announce its connectivity out over every network interface. This increases the reliability of the system by ensuring that failure of one network interface does not cause other Traffic Managers in the cluster to assume that this machine has failed.

•flipper!arp_count: The number of ARP packets that the Traffic Manager sends when a network interface is raised. Although other devices on the network require only one ARP request to acknowledge a new interface on a Traffic Manager, ARP packets can get lost or missed. To improve the reliability and speed of the broadcast, the Traffic Manager sends the number of ARP packets defined here when an interface is raised.

•flipper!igmp_interval: The time interval, in seconds, at which the Traffic Manager sends unsolicited IGMP Membership Report packets for Multi-Hosted Traffic IP groups. Use this interval to prevent switches and routers from forgetting the Traffic Manager's multicast subscription(s). To disable this feature, set the interval to 0.

•flipper!verbose: Whether a Traffic Manager should log all connectivity tests. Set this to Yes to output detailed information to the system log.

Ivanti recommends only using this for diagnostic purposes.

•flipper!frontend_check_addrs: The IP addresses to use for front-end connectivity checking. If the Traffic Manager cannot successfully ping these addresses, it decides that its network connectivity is broken and performs fail-over to the other Traffic Managers in the cluster.

If your Traffic Manager setup does not require external connectivity (for example, if it is part of an intranet), leave this field blank to disable the external connectivity check.
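The interaction between flipper!autofailback and flipper!autofailback_delay described above can be sketched as a small shell function. This is an illustrative model only, not the Traffic Manager's implementation: the function name failback_state and its arguments are hypothetical, and the model ignores the automatic reactivation cases (all other cluster members failed, or reactivation causing no disruption).

```shell
# failback_state AUTOFAILBACK DELAY HEALTHY_FOR   (hypothetical helper)
#   AUTOFAILBACK  value of flipper!autofailback ("Yes" or "No")
#   DELAY         value of flipper!autofailback_delay, in seconds
#   HEALTHY_FOR   whole seconds since the most recent detected failure
#                 (a further failure resets this to 0, restarting the wait)
failback_state() {
    if [ "$1" != "Yes" ]; then
        echo "pending"      # waits to be reactivated manually
    elif [ "$3" -ge "$2" ]; then
        echo "failback"     # delay satisfied; Traffic IPs can be re-raised
    else
        echo "waiting"      # still inside the delay period
    fi
}

failback_state Yes 60 15     # prints: waiting
failback_state Yes 60 60     # prints: failback
failback_state No  60 600    # prints: pending
```

The 60-second delay is an example value chosen for the sketch, not a product default.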

Understanding Traffic Manager Fault Tolerance Checks

Each Traffic Manager checks that its network interfaces are operating correctly. It does this by:

•Periodically pinging the default gateway, to ensure that its front-end network interface is functioning.

•Periodically pinging each of the back-end nodes in the pools that are in use, to ensure that its back-end network interfaces are functioning.

The Traffic Manager concludes that it has failed if it cannot ping the default gateway, or if it cannot ping any of the nodes in any of the currently used pools.
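The decision rule above can be approximated from a shell prompt when you want to confirm what a Traffic Manager is likely to see. The following is a minimal sketch, not the product's own check: the function names probe and local_health are hypothetical, the addresses are placeholders, the -c and -W options assume the Linux iputils version of ping, and the rule is read as "failed if the gateway does not answer, or if none of the listed nodes answers".

```shell
# probe ADDRESS: succeed if ADDRESS answers one ICMP echo request within
# 1 second (Linux iputils ping options; adjust for other platforms).
probe() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# local_health GATEWAY NODE...   (hypothetical helper)
# Prints "failed" if the gateway does not answer, or if nodes are listed
# and none of them answers; prints "good" otherwise.
# GATEWAY is required; pass "" to skip the front-end check.
local_health() {
    local gateway node
    gateway=$1
    shift
    if [ -n "$gateway" ] && ! probe "$gateway"; then
        echo "failed"       # front-end check failed
        return
    fi
    if [ "$#" -eq 0 ]; then
        echo "good"         # no back-end nodes to check
        return
    fi
    for node in "$@"; do
        if probe "$node"; then
            echo "good"     # at least one node is reachable
            return
        fi
    done
    echo "failed"           # no listed node answered
}

# Example (placeholder addresses):
#   local_health 192.0.2.1 192.0.2.10 192.0.2.11
```

Keeping the ICMP test in its own probe function makes it easy to substitute a different reachability test if your gateway or nodes do not respond to pings.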

Overriding Front-End Checks

You might need to override the front-end check if, for example, your gateway does not respond to ICMP pings. To override the front-end check, set flipper!frontend_check_addrs to an empty string.

Health Broadcasts

Each Traffic Manager regularly broadcasts the results of its local health checks, whether it is healthy or not.

By default, each machine broadcasts these heartbeat messages twice per second. You can configure this behavior using the flipper!monitor_interval setting.
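As a quick check of how flipper!monitor_interval relates to the heartbeat rate, the conversion can be done in the shell. This sketch assumes one heartbeat per monitoring interval; the 500 ms figure is inferred from the "twice per second" default stated above rather than quoted from the product, and heartbeats_per_second is a hypothetical helper name.

```shell
# heartbeats_per_second INTERVAL_MS   (hypothetical helper)
# Converts a monitoring interval in milliseconds to heartbeats per second.
heartbeats_per_second() {
    LC_ALL=C awk -v ms="$1" 'BEGIN { printf "%.1f\n", 1000 / ms }'
}

heartbeats_per_second 500    # prints: 2.0
heartbeats_per_second 200    # prints: 5.0
```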

You can choose to send broadcast messages by one of two methods, unicast or multicast. Select your preference with the flipper!heartbeat_method setting.

If you use unicast, you must ensure that “State Sharing” is enabled. To do this, set state_sync_time on the System > Global Settings page to a non-zero value.

In many circumstances, unicast is appropriate. However, if your network setup means that all Traffic Managers can receive multicast messages, for example, if your Traffic Manager cluster is within a single network segment, use multicast instead. If, after trying both methods, you still encounter issues, contact your support provider for assistance.

Ivanti recommends that users of Hyper-V-based Traffic Managers use the default unicast method. When two or more instances are clustered together, using multicast can, in some circumstances, result in packet loss.

To debug heartbeat communication problems, Ivanti recommends using the “tcpdump” program to capture all heartbeat messages on your network:

# tcpdump -i eth0 ip multicast

To capture only the heartbeat messages issued by your Traffic Managers, use:

# tcpdump -i eth0 dst host 239.100.1.1 and dst port 9090

This example uses port 9090, the default for Traffic Manager management communications. Substitute in the port number your Traffic Manager uses, if different. For more information on the Traffic Manager’s network port requirements, see Default Ports Used by the Traffic Manager.
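If your cluster uses a different multicast address or port, the capture filter can be derived from the value in the Multicast address and port: field. The following is a minimal sketch that assumes an IPv4 address:port string; heartbeat_filter is a hypothetical helper and is not part of the Traffic Manager or tcpdump.

```shell
# heartbeat_filter ADDRESS:PORT   (hypothetical helper)
# Prints a tcpdump filter expression for the given IPv4 address and port.
heartbeat_filter() {
    local addr port
    addr=${1%:*}     # text before the last colon
    port=${1##*:}    # text after the last colon
    echo "dst host $addr and dst port $port"
}

heartbeat_filter 239.100.1.1:9090
# prints: dst host 239.100.1.1 and dst port 9090

# To capture with the generated filter (run as root; interface name assumed):
#   tcpdump -i eth0 $(heartbeat_filter 239.100.1.1:9090)
```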

Determining the Health of a Cluster

Each Traffic Manager listens for the health messages from all of the other Traffic Managers in the cluster. A Traffic Manager concludes that one of its peers has failed if any one of the following conditions is true:

•It receives an "I have failed" health message from the peer.

•It does not receive any messages from the peer within the value set in flipper!monitor_timeout.

•The peer reports that one of its child processes is no longer servicing traffic. This occurs if any child process fails to respond within the value set in flipper!child_timeout.

The Traffic Manager concludes that a peer has recovered when it starts receiving "I am healthy" messages from the peer.
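The three failure conditions and the recovery rule can be summarized as a single decision. The sketch below is illustrative only: peer_status and its arguments are hypothetical, the 5-second timeout is an example value rather than a product default, and silence lasting at least flipper!monitor_timeout seconds is treated as a timeout.

```shell
# peer_status LAST_MESSAGE SILENT_FOR MONITOR_TIMEOUT CHILDREN_OK   (hypothetical helper)
#   LAST_MESSAGE     "healthy" or "failed", from the peer's latest health message
#   SILENT_FOR       whole seconds since a message was last received from the peer
#   MONITOR_TIMEOUT  value of flipper!monitor_timeout, in seconds
#   CHILDREN_OK      "yes" if the peer reports all child processes servicing traffic
peer_status() {
    if [ "$1" = "failed" ]; then
        echo "failed"       # peer announced "I have failed"
    elif [ "$2" -ge "$3" ]; then
        echo "failed"       # no message within monitor_timeout
    elif [ "$4" != "yes" ]; then
        echo "failed"       # a child process stopped servicing traffic
    else
        echo "healthy"      # receiving "I am healthy" messages
    fi
}

peer_status healthy 0 5 yes    # prints: healthy
peer_status failed  0 5 yes    # prints: failed
peer_status healthy 7 5 yes    # prints: failed
peer_status healthy 0 5 no     # prints: failed
```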