When TSIM is configured in HA, how does it determine when to fail over? Which processes are monitored, and how long should failover take?
TSIM HA monitors the following processes through an internal health check:
jserver
rate
agent controller
mcell
database
httpd
services
The health check runs at the interval set by the attribute pronet.ha.availability.scan.frequency.in.secs (default is 60 seconds), which is in the file:
<installationdirectory>/pw/custom/conf/pronet.conf
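To check which scan interval is in effect, the attribute can be read from the file with a few lines of Python. This is a minimal sketch that assumes pronet.conf holds one `name=value` entry per line with `#` comments; the helper name `read_int_attribute` and the sample text are hypothetical, not part of TSIM.

```python
SCAN_KEY = "pronet.ha.availability.scan.frequency.in.secs"


def read_int_attribute(conf_text: str, key: str, default: int) -> int:
    """Return the integer value of `key` from name=value text, or `default`.

    Assumption: one name=value entry per line, '#' starts a comment line,
    and the last occurrence of a key wins.
    """
    value = default
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, raw_value = line.partition("=")
        if sep and name.strip() == key:
            value = int(raw_value.strip())
    return value


# Hypothetical excerpt; in practice, read the text from the pronet.conf file.
sample = "# HA health check\n" + SCAN_KEY + "=60\n"
scan_interval_secs = read_int_attribute(sample, SCAN_KEY, default=60)
```

When the attribute is absent from the file, the sketch falls back to the documented default of 60 seconds.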
If any of these processes is down, the health check repeats the scan the number of times set by the following attribute in the same pronet.conf file: pronet.ha.availability.max.retry.count (default is 4 before 10.7FP3 and 6 after 10.7FP3)
This gives the process the chance to recover and avoids triggering failover too frequently. If the process is still down after pronet.ha.availability.scan.frequency.in.secs times pronet.ha.availability.max.retry.count seconds (up to 6 minutes when using the current defaults of 60 seconds and 6 retries), failover occurs. However, when ALL of the above processes are found to be down by the health check, failover occurs immediately.
For large TSIM environments, we do not recommend reducing these default values, because the process health check can sometimes be delayed and this delay should not trigger failover. These default values should also not be increased without BMC analyzing the logs and checking the process behavior that triggered the undue failover.
It is important to note that TSIM HA has the notion of a primary TSIM and a secondary TSIM. When the primary TSIM is up and running, it becomes the active node of the HA deployment even if the secondary TSIM has been running.
For troubleshooting a TSIM HA deployment or TSIM failover issues, refer to the Troubleshooting Guide: Troubleshooting an Infrastructure Management high-availability deployment
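The timing rule above can be illustrated with a small model. This is a simplified sketch, not TSIM's actual implementation: it assumes failover triggers once a process has been found down on `max_retry_count` consecutive scans, that a clean scan resets the count, and that the process names are those listed above. The function names are hypothetical.

```python
# Process names as listed in this article (their exact internal names are an assumption).
MONITORED = ("jserver", "rate", "agent controller", "mcell",
             "database", "httpd", "services")


def max_failover_delay_secs(scan_interval_secs=60, max_retry_count=6):
    """Longest wait before failover when a process stays down."""
    return scan_interval_secs * max_retry_count


def scan_that_triggers_failover(scan_results, max_retry_count=6):
    """Return the 1-based scan number at which failover would trigger, or None.

    scan_results: one set per scan, holding the names of processes found down.
    """
    consecutive_failed = 0
    for scan_number, down in enumerate(scan_results, start=1):
        if set(MONITORED) <= set(down):
            return scan_number          # everything is down: fail over immediately
        if down:
            consecutive_failed += 1     # at least one process is still down
            if consecutive_failed >= max_retry_count:
                return scan_number
        else:
            consecutive_failed = 0      # processes recovered: start counting again
    return None


print(max_failover_delay_secs())        # 360 seconds, i.e. 6 minutes
print(max_failover_delay_secs(60, 4))   # 240 seconds with the pre-10.7FP3 default
```

In this model, a process that recovers before the retry count is exhausted never causes failover, which is why lowering either attribute makes a slow health check more likely to trigger one.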