HP Surestore E Disk Array XP256 HP XP P9000 External Storage Access Manager Us - Page 51

Disaster recovery, Main types of failures that can disrupt your system, The basic recovery process



6 Disaster recovery
On-site disasters, such as power supply failures, can disrupt the normal operation of your ESAM
system. Being able to quickly identify the type of failure and recover the affected system or
component helps to ensure that you can restore high-availability protection for host applications
as soon as possible.
Main types of failures that can disrupt your system
The main types of failures that can disrupt the system are power failures, hardware failures,
connection or communication failures, and software failures. These types of failures can cause
system components to function improperly or stop functioning.
System components typically affected by these types of failures include:
• Main control unit (primary storage system)
• Service processor (primary or secondary storage system)
• Remote control unit (secondary storage system)
• Volume pairs
• Quorum disks
The basic recovery process
The basic process for recovering from an on-site disaster is the same, regardless of the type of
failure that caused the disruption in the system. The recovery process involves:
• Detecting failures
• Determining the type of failure
• Determining which recovery procedure to use
• Completing the recovery procedure
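The four steps above can be sketched as a short loop. This is only an illustration of the flow, not part of the ESAM software: `FailureType`, `recover`, the `classify` callback, and the `procedures` table are hypothetical names introduced here, and the failure categories are taken from the list of main failure types earlier in this chapter.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional


class FailureType(Enum):
    """The four main failure types named in this chapter."""
    POWER = "power"
    HARDWARE = "hardware"
    CONNECTION = "connection or communication"
    SOFTWARE = "software"


def recover(
    messages: List[str],
    classify: Callable[[str], Optional[FailureType]],
    procedures: Dict[FailureType, Callable[[], str]],
) -> List[str]:
    """Walk each message through the four recovery steps (hypothetical helper)."""
    results = []
    for message in messages:             # 1. detect failures
        failure = classify(message)      # 2. determine the type of failure
        if failure is None:
            continue                     # message does not indicate a failure
        procedure = procedures[failure]  # 3. determine which procedure to use
        results.append(procedure())      # 4. complete the recovery procedure
    return results
```

The point of the sketch is that only steps 2 and 3 depend on the failure type. The surrounding loop is the same for every failure, which matches the statement that the basic process does not change.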
System failure messages
The system automatically generates messages that you can use to detect failures. Each message
contains information that helps you determine the type of failure that occurred.
• System information messages (SIM): generated by the primary and secondary storage systems
• Path failure messages: generated by the multipath software on the host
Detecting failures
Detecting failures is the first task in the recovery process. Failure detection is essential because you
need to know the type of failure before you can determine which recovery procedure to use.
You have two options for detecting failures. You can check whether failover has occurred and then
determine the type of failure that caused it, or you can check for failures directly by using the
SIM and path failure system messages.
• “Option 1: Check for failover first” (page 51)
• “Option 2: Check for failures only” (page 52)
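The difference between the two options can be sketched as follows. This is a minimal illustration under stated assumptions, not an ESAM interface: `Detection`, `detect`, and the two callbacks are hypothetical, and the sketch assumes that under option 1 the cause of a failover is looked up in the same SIM and path failure messages that option 2 reads directly.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    # None means the failover check was skipped (option 2).
    failover_occurred: Optional[bool]
    # SIM and path failure messages that were consulted.
    messages: List[str]


def detect(
    option: int,
    failover_occurred: Callable[[], bool],
    read_messages: Callable[[], List[str]],
) -> Detection:
    """Hypothetical helper that mirrors the two detection options."""
    if option == 1:
        occurred = failover_occurred()
        # Assumption: messages are read only to explain a failover that happened.
        messages = read_messages() if occurred else []
        return Detection(occurred, messages)
    if option == 2:
        # Go straight to the failure messages without checking for failover.
        return Detection(None, read_messages())
    raise ValueError("option must be 1 or 2")
```

In this sketch, option 1 makes one extra check up front and reads messages only when a failover happened, while option 2 always reads the messages.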
Option 1: Check for failover first
You can use status information about the secondary volume and path status information to see if
failover occurred. You can do this using RWC, RAID Manager, or multipath software.