Monitoring and Troubleshooting

If you have a problem with a cluster, a representative from Ivanti Global Support Center might ask you to create a snapshot that includes group communication statistics to assist with debugging the problem. When you enable the group communication monitor, the system records statistics related to all the cluster nodes. As the local node communicates with other nodes in the cluster, the system captures statistics related to intra-cluster communication. The Group Communication tab is displayed only when you enable clustering on your system. On a standalone system, you do not have access to the Group Communication tab.

You can also enable the cluster network troubleshooting server on the Network Connectivity page.

- Enabling Monitor all cluster nodes from this node can impact system performance and stability. Perform extensive monitoring ONLY when directed by the Ivanti Support representative.
- Performing log synchronization across cluster nodes can impact your system performance and stability.

Group Communication

To enable group communication monitoring:

  1. Enter the maximum size of the statistics log.
  2. Enter the interval, in seconds, at which events are to be logged.
  3. If you want to monitor all cluster nodes from the current local node, select the Monitor all cluster nodes from this node check box. If you do not select this option, the group communication monitor gathers statistics only for the local node.

- The Node Monitor under Maintenance > Troubleshooting > Monitoring > Node Monitor, which is enabled by default, is not related to Monitor all cluster nodes from this node under Group Communication and is not known to cause any system impact.
- If you select the Monitor all cluster nodes from this node option, the cluster nodes must be able to communicate over UDP port 6543. A minimal sketch for checking this network path follows the procedure below.

  4. Select the Enable group communication monitoring check box to start the monitoring tool.
  5. Click Save Changes.
  6. If you want to include the monitoring results in the system snapshot, choose Maintenance > Troubleshooting > System Snapshot, and select the Include debug log check box.
  7. Take a system snapshot to retrieve the results.
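
Because the appliances do not provide a shell, you cannot probe UDP port 6543 from the cluster nodes themselves. If you need to confirm that the network path between the node subnets permits this traffic, one option is to place a temporary test host on each subnet and run a simple probe such as the following minimal sketch. The script and the test-host placement are illustrative assumptions, not a product tool.

```python
# Rough UDP path check between two test hosts placed on the same subnets
# as the cluster nodes. This only validates that intermediate firewalls
# and routers allow UDP/6543; it does not exercise the appliances.
# Usage: "python3 udp_probe.py listen" on one host,
#        "python3 udp_probe.py send <listener-ip>" on the other.
import socket
import sys

PORT = 6543  # UDP port required by the Monitor all cluster nodes option


def listen() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", PORT))
    print(f"Listening on UDP/{PORT} ...")
    data, peer = sock.recvfrom(1024)
    print(f"Received {data!r} from {peer[0]}; the path is open.")


def send(target: str) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"cluster-port-check", (target, PORT))
    print(f"Probe sent to {target}:{PORT}; check the listener output.")


if __name__ == "__main__":
    if len(sys.argv) >= 2 and sys.argv[1] == "listen":
        listen()
    elif len(sys.argv) >= 3 and sys.argv[1] == "send":
        send(sys.argv[2])
    else:
        print("usage: udp_probe.py listen | send <listener-ip>")
```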

Node Monitoring

To enable node monitoring:

  1. Select Maintenance > Troubleshooting > Monitoring > Node Monitor.
  2. Select the Node monitoring enabled check box to start monitoring cluster nodes.
  3. For Maximum node monitor log size, enter the maximum size (in MB) of the log file. Valid values are 1 through 30.
  4. Specify the interval (in seconds) that defines how often nodes are to be monitored.
  5. Select the commands to use to monitor the node.
  6. If you select dsstatdump, enter its parameters as well.
  7. Click Save Changes.
  8. To include the node monitoring results in the system snapshot, select Maintenance > Troubleshooting > System Snapshot, and select the Include debug log check box.
  9. Take a system snapshot to retrieve the results.

Network Connectivity Monitoring

Use the Network Connectivity tab to enable the cluster node troubleshooting server and to select a node on which to perform troubleshooting tasks. The troubleshooting tool allows you to determine the network connectivity between cluster nodes.

The server component of this tool runs on the node to which connectivity is being tested. The client component runs on the node from which connectivity is being tested. The basic scenario for testing connectivity is this:

  • The administrator starts the server component on the passive node.
  • The administrator tests connectivity to the server node by starting the client component on the active node and then contacting the passive node that is running the server component.

The server component must run on a node that is configured either as standalone or as a disabled node in a cluster. Cluster services cannot be running on the same node as the server component.

To enable network connectivity monitoring:

  1. Select Maintenance > Troubleshooting > Cluster > Network Connectivity, and then select the Enable cluster network troubleshooting server check box to enable the server component.
  2. Click Save Changes.
  3. On another machine, select Maintenance > Troubleshooting > Cluster > Network Connectivity.

    Perform one of the following steps:

    • Select a node from the list.
    • Enter the IP address of the server node.
  4. Click Go to begin troubleshooting the machine on which the server component is running.
  5. Click the Details link below the fields to view the results.

Monitoring using SNMP Traps

You can monitor clusters using the standard logging tools. Specifically, you can use several cluster-specific SNMP traps to monitor events that occur on your cluster nodes, such as:

  • External interface down
  • Internal interface down
  • Disabled node
  • Changed virtual IP address (VIP)
  • Deleted cluster node (cluster stop)

In general, configure your SNMP traps on a clusterwide basis so that any cluster node can send the traps it generates to the correct target. Clusterwide trap configuration is particularly important when you also use a load balancer, because the load balancer independently determines which cluster node manages a given administrative session, so you cannot predict which node will generate a particular trap.

You can use SNMP traps that are included in the Ivanti Standard MIB to monitor these events. These traps include:

  • iveNetExternalInterfaceDownTrap—Supplies the type of event that brought down the external interface.
  • iveNetInternalInterfaceDownTrap—Supplies the type of event that brought down the internal interface.
  • iveClusterDisableNodeTrap—Supplies the cluster name on which nodes are disabled, along with a space-separated list of disabled node names.
  • iveClusterChangedVIPTrap—Supplies the type of the VIP, whether external or internal, and its value before and after the change.
  • iveClusterDelete—Supplies the name of the cluster node on which the cluster delete event was initiated.

These traps are always enabled and available in the MIB. You cannot disable the traps.
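
If you want to capture these traps with a lightweight script rather than a full network management system, the following is a minimal trap receiver sketch using the third-party pysnmp library (4.x asyncore API). The community string, listening address, and port are assumptions and must match the SNMP settings configured on your nodes; resolving the numeric OIDs back to the trap names above requires loading the Ivanti Standard MIB into whatever receiver you use.

```python
# Minimal SNMP trap receiver sketch (pysnmp 4.x). Prints every variable
# binding from incoming traps, including the cluster traps listed above.
from pysnmp.entity import engine, config
from pysnmp.carrier.asyncore.dgram import udp
from pysnmp.entity.rfc3413 import ntfrcv

snmp_engine = engine.SnmpEngine()

# Listen on the standard trap port; binding to 162 usually requires root.
config.addTransport(
    snmp_engine,
    udp.domainName,
    udp.UdpTransport().openServerMode(("0.0.0.0", 162)),
)

# SNMPv1/v2c community; must match the community configured on the nodes.
config.addV1System(snmp_engine, "ivanti-cluster", "public")


def on_trap(engine_obj, state_ref, context_engine_id, context_name,
            var_binds, cb_ctx):
    print("--- trap received ---")
    for oid, value in var_binds:
        print(f"{oid.prettyPrint()} = {value.prettyPrint()}")


ntfrcv.NotificationReceiver(snmp_engine, on_trap)

snmp_engine.transportDispatcher.jobStarted(1)  # run until interrupted
try:
    snmp_engine.transportDispatcher.runDispatcher()
finally:
    snmp_engine.transportDispatcher.closeDispatcher()
```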

Troubleshooting

You can use a built-in feature on the clustering Status page to identify the status of each cluster node. Mouse over the Status light icon to display a tool tip containing a hexadecimal number. The hexadecimal number is a snapshot of the status of the system.

| Value | Meaning |
|-------|---------|
| 0x000001 | System is in standalone mode. |
| 0x000002 | System is in cluster disabled state. |
| 0x000004 | System is in cluster enabled state. |
| 0x000008 | System is unreachable (because it is offline, has the wrong password, has a different cluster definition, a different version, or another problem). |
| 0x00002000 | The node owns the VIPs (on) or not (off). |
| 0x000100 | System is synchronizing its state from another node (initial synchronization phase). |
| 0x000200 | System is transitioning from one state to another. |
| 0x00020000 | Group communication subsystems on the local and remote nodes are disconnected from each other. |
| 0x00040000 | Management interface (mgt0) displays disconnected. |
| 0x00080000 | Management gateway is unreachable for ARP pings. |
| 0x000800 | Interface int0 displays disconnected (no carrier). |
| 0x001000 | Interface int1 displays disconnected (no carrier). |
| 0x002000 | System is syncing its state to another node that is joining. |
| 0x004000 | Initial synchronization as master or slave is taking place. |
| 0x008000 | System is the leader of the cluster. |
| 0x010000 | The spread daemon is running and the cache server is connected to it. |
| 0x020000 | The gateway on int0 is unreachable for ARP pings (see log file). |
| 0x040000 | The gateway on int1 is unreachable for ARP pings (see log file). |
| 0x080000 | Leader election is taking place. |
| 0x100000 | Server lifecycle process is busy. |
| 0x200000 | System is performing post state synchronization activities. |
| 0x30004 | The spread daemon is running and the cache server is connected to it; the gateway on int0 is unreachable for ARP pings (see log file); system is in cluster enabled state. |
| 0x38004 | The spread daemon is running and the cache server is connected to it; the gateway on int0 is unreachable for ARP pings (see log file); system is the leader of the cluster; system is in cluster enabled state. |

Each code that you see in the system may correspond to a single state, but it may also represent a combination of states, in which case the displayed value does not itself appear in the table above. Instead, the code you see is the sum of several of the hexadecimal values shown above, and you need to factor out its components, as in the following example:

  • 0x38004—The rightmost digit (4) in this hexadecimal number corresponds to:
    • 0x000004—The system is in a cluster enabled state.
  • 0x38004—The digit in the fourth position from the right (8) corresponds to:
    • 0x008000—This system is the leader of the cluster.
  • 0x38004—The leftmost digit (3) in this hexadecimal number does not appear in the table, which indicates that it is the sum of two other digits, in this case 1 and 2, corresponding to the following codes:
    • 0x020000—The gateway on int0 is unreachable for ARP pings (see log file).
    • 0x010000—The spread daemon is running and the cache server is connected to it.
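
If you prefer not to factor the code by hand, you can test each bit with a bitwise AND. The following minimal sketch is not part of the product; the bit-to-meaning mapping is copied from the table above and is intentionally partial, so extend it with the remaining rows as needed.

```python
# Factor a cluster status code into the individual state bits from the
# table above. The mapping is partial (basic states plus the bits used
# in the 0x38004 example); add the remaining rows as needed.
STATUS_BITS = {
    0x000001: "System is in standalone mode.",
    0x000002: "System is in cluster disabled state.",
    0x000004: "System is in cluster enabled state.",
    0x000008: "System is unreachable.",
    0x008000: "System is the leader of the cluster.",
    0x010000: "The spread daemon is running and the cache server is connected to it.",
    0x020000: "The gateway on int0 is unreachable for ARP pings (see log file).",
    0x040000: "The gateway on int1 is unreachable for ARP pings (see log file).",
    0x080000: "Leader election is taking place.",
}


def decode_status(code: int) -> list[str]:
    """Return the meaning of every bit that is set in the status code."""
    return [text for bit, text in STATUS_BITS.items() if code & bit]


if __name__ == "__main__":
    # 0x38004 = 0x000004 + 0x008000 + 0x010000 + 0x020000
    for meaning in decode_status(0x38004):
        print(meaning)
```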

Restarting or Rebooting Cluster Nodes

When you create a cluster of two or more nodes, the clustered nodes act as a logical entity. When you reboot one of the nodes using either the serial console or the admin console, all nodes in the cluster restart or reboot.

To reboot only one node:

  1. Select System > Clustering > Status to disable the node you want to restart or reboot within the cluster.
  2. Select Maintenance > System > Platform.
  3. Reboot the node, then enable the node within the cluster again.