HP ProLiant BL660c HP BladeSystem c-Class Onboard Administrator Failover - Page 2

Purpose of this White Paper, OA Failover Process Description, OA Failover Testing, General


Purpose of this White Paper
HP c-Class BladeSystem Onboard Administrators (OA) are frequently configured in pairs to provide fault tolerance for
enclosure management. This white paper describes the interaction of the Active and Standby OAs when
configured in fault-tolerant pairs, and how they behave during failover events.
OA Failover Process Description
In the redundant OA configuration, the Standby OA is initialized as a “hot standby”, ready to take over from the
Active OA when the situation warrants. The Active and Standby OAs exchange keep-alives over two separate
communication links (Ethernet and Serial) in order to maintain redundancy. The Active OA also sends enclosure
system configuration changes to the Standby OA so that both OAs contain the same configuration information.
OA failover events can occur in two separate scenarios:
(1) An Active OA hardware failure. When the Standby OA cannot communicate with the Active OA over either
of the communication links, the Standby OA initiates a takeover event. Actual OA hardware failures are
extremely rare.
(2) A customer-initiated “forced” failover for administrative purposes.
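As an illustration of the two scenarios above, the following Python sketch models the Standby OA's takeover decision: it takes over only when keep-alives have gone silent on both links, or when a forced failover is requested. The names `LinkSilence` and `should_take_over` and the 5-second timeout are assumptions made for this sketch; the source does not state the actual keep-alive timeout or describe how the OA firmware implements this check.

```python
from dataclasses import dataclass


@dataclass
class LinkSilence:
    """Seconds since the last keep-alive was received on each link."""
    ethernet_s: float
    serial_s: float


def should_take_over(silence: LinkSilence,
                     timeout_s: float = 5.0,  # assumed value, not from the source
                     forced: bool = False) -> bool:
    """Return True if the Standby OA should initiate a takeover."""
    if forced:
        # Scenario (2): customer-initiated "forced" failover.
        return True
    # Scenario (1): the Active OA is unreachable over BOTH links.
    # A single silent link is not treated as an Active OA failure.
    return silence.ethernet_s > timeout_s and silence.serial_s > timeout_s
```

Requiring silence on both links is what the redundant Ethernet and Serial paths make possible: losing one link alone does not trigger a takeover in this model.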
In both cases, the actual failover processing is exactly the same. The Standby OA initializes itself as the new Active
OA and resets the previously Active OA to make sure it is not in an indeterminate state (this phase takes
approximately 15 seconds). It then proceeds to check the status and configuration of all the devices in the enclosure.
During this process, any interconnect module which was in a powered off state prior to failover will be powered on if
sufficient enclosure power is available. The duration of this phase depends on the configuration complexity; lab
measurements using large enclosure configurations show that it completes within 7 minutes. Users can log into
the GUI/CLI within a minute of failover initiation, while the background enclosure device inventory is still running.
The original Active OA will initialize itself as the new Standby OA if the failover was not caused by an OA hardware
failure. When the new Active and Standby OAs reestablish redundancy, the Active OA transfers the enclosure
configuration data to the Standby OA in order to make sure any incremental changes are also stored in the new
Standby OA.
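The timings quoted in this section can be turned into a rough timeline for a single failover. The Python sketch below is only illustrative: the function and key names are hypothetical, the figures are the approximate values given above, and it assumes all three are measured from the moment the failover is initiated.

```python
from datetime import datetime, timedelta

# Approximate figures quoted in the text (lab measurements, not guarantees).
RESET_PHASE = timedelta(seconds=15)       # new Active OA resets the previous one
LOGIN_AVAILABLE = timedelta(minutes=1)    # GUI/CLI login possible again
INVENTORY_COMPLETE = timedelta(minutes=7) # device inventory done (large enclosures)


def failover_milestones(initiated_at: datetime) -> dict[str, datetime]:
    """Estimate when each failover phase should have finished."""
    return {
        "previous_active_reset": initiated_at + RESET_PHASE,
        "gui_cli_login_available": initiated_at + LOGIN_AVAILABLE,
        "device_inventory_complete": initiated_at + INVENTORY_COMPLETE,
    }
```

The gap between the second and third milestones is the window in which the OA accepts logins while the enclosure inventory is still in progress.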
OA Failover Testing
General Discussion
In normal customer operational situations, OA failures or forced failovers are not commonplace events. When a
failover occurs it is usually because of a specific operational or administrative issue, although it could occasionally
result from an actual hardware failure. In these situations, it would be extremely rare for multiple OA failovers to
occur within a short time. However, it is possible using the OA CLI to trigger successive OA failovers in rapid
succession. While this does not make sense from an operational perspective, some customers may want to
include repetitive OA failover testing as part of their qualification processes and have used the CLI in this way to
script back-to-back failovers.
However, initiating multiple failover events within a short time span may cause
intermittent undesirable results. The remaining sections of this white paper address these situations and provide
best practices for testing OA failover and recovery in the data center.
As described in the previous section, during an OA failover, the Standby OA will complete the basic failover
operations very quickly. Within a minute, users can log back into the OA via GUI or CLI. Although the OA appears
to be fully operational, there are other processes necessary for the resynchronization of enclosure device data and
status, which may run for several minutes after the GUI and CLI appear operational. Thus, for repetitive OA failover
testing, it is recommended to wait at least 7 minutes from the time a failover is initiated before attempting another OA
failover, in order to ensure the entire enclosure is fully synchronized.
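For scripted qualification tests, the 7-minute recommendation can be enforced in the test harness itself. The following Python sketch is one possible way to do so, not an HP-provided tool: `run_failover_cycles` is a hypothetical name, and `trigger_failover` is a caller-supplied function that would issue the forced failover by whatever means the test environment uses (the source does not name the CLI command, so none is assumed here).

```python
import time
from typing import Callable, List

# Minimum spacing between failover initiations recommended in the text.
MIN_FAILOVER_INTERVAL_S = 7 * 60


def run_failover_cycles(trigger_failover: Callable[[], None],
                        cycles: int,
                        interval_s: float = MIN_FAILOVER_INTERVAL_S,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> List[float]:
    """Trigger `cycles` failovers, spaced at least `interval_s` apart.

    Returns the clock reading at which each failover was initiated.
    """
    if interval_s < MIN_FAILOVER_INTERVAL_S:
        raise ValueError("interval is shorter than the recommended 7 minutes")
    started: List[float] = []
    for i in range(cycles):
        start = clock()
        trigger_failover()  # hypothetical hook supplied by the caller
        started.append(start)
        if i < cycles - 1:
            # Measure from initiation, so time spent inside the trigger counts.
            remaining = interval_s - (clock() - start)
            if remaining > 0:
                sleep(remaining)
    return started
```

The wait is measured from the moment each failover is initiated, matching the wording of the recommendation, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.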
In addition, if the enclosure is configured with Virtual Connect (VC), there is an additional recovery process that VC
needs to perform after the OA failover. The following sections discuss the OA-VC recovery interactions in further
detail.