I am seeing a number of errors in the grid.log for my BMC Atrium Orchestrator (BAO) peers. The most common error is this:
08 Jul 2013 15:13:46,728 WARN HANode : Failed to obtain high availability node's lock within the configured timeout of 20000 ms while processing a heartbeat message from another high availability node. Cancelling this activity.
This warning indicates that the peers were unable to communicate with each other to share grid information, which can lead to system instability. For example, peers that cannot communicate cannot change the state of adapters, because a peer cannot enable an adapter without first notifying all other peers. A peer may also try to reach the other peers to determine the current job list. If that attempt fails, the peer assumes the other peer is offline and elects itself as the peer in charge of managing jobs, which means the peers are no longer coordinating job management. The root cause is usually that BAO was unable to send and receive messages as quickly as it needs to, either because of high load or because of a slow network connection.
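Because the usual cause is load or network latency, it can help to see when the warnings cluster. The following is a minimal Python sketch, not a BMC-supplied tool, that counts these warnings per hour. It assumes every entry follows the same layout as the sample line above (timestamp, level, component, message); the function name and the regular expression are illustrative, and the month abbreviation is parsed with the default English/C locale.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the sample line: "08 Jul 2013 15:13:46,728 WARN HANode : <message>"
LINE_RE = re.compile(
    r"^(\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}),\d{3}\s+WARN\s+HANode\s*:\s*(.*)$"
)
MARKER = "Failed to obtain high availability node's lock"


def count_lock_timeouts_per_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:00' to the number of lock-timeout warnings."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or MARKER not in match.group(2):
            continue  # not a lock-timeout warning from HANode
        stamp = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return counts


# Hypothetical usage; the path to grid.log depends on your installation:
# with open("grid.log", encoding="utf-8", errors="replace") as handle:
#     for hour, total in sorted(count_lock_timeouts_per_hour(handle).items()):
#         print(hour, total)
```

Hours with many hits can then be compared against job volume or network monitoring data for the same period.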
More recent releases of Atrium Orchestrator 7.6, particularly 7.6.02 Service Pack 6 and 7.6.03, allow BAO to run peer communications in parallel rather than serializing them, so peers process requests more quickly.
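To illustrate why parallel communication helps, here is a generic Python sketch of the two approaches. It is not BAO's implementation: `notify_peer` is a hypothetical stand-in in which a short sleep simulates one network round trip, and the peer names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def notify_peer(peer, delay=0.05):
    """Hypothetical stand-in for one round trip to a peer; the sleep simulates latency."""
    time.sleep(delay)
    return f"{peer}: ack"


def notify_serial(peers):
    """Contact peers one after another; total time is roughly the sum of the delays."""
    return [notify_peer(peer) for peer in peers]


def notify_parallel(peers):
    """Contact all peers at once; total time is roughly the slowest single delay."""
    with ThreadPoolExecutor(max_workers=max(1, len(peers))) as pool:
        return list(pool.map(notify_peer, peers))
```

Both functions return the same acknowledgements in the same order. With serialized calls, one slow peer delays every message queued behind it, which makes it more likely that a timeout such as the 20000 ms one in the warning is exceeded.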