Compaq ProLiant 1000 Architecting and Deploying High-Availability Solutions - Page 6

What Causes Downtime?



Architecting and Deploying High-Availability Solutions
ECG064/1198
Remote Hot Sites (functional locations geographically distant from the primary operations center) are an option if the Recovery Point and Recovery Time for an application are not very critical. An example might be a billing application, where the mailing of monthly statements could be delayed with minimal impact on the business.
Electronic Vaulting (a method of electronically storing, managing, and protecting data in a computer "vault" located off-site in a physically secure location) is an option if Recovery Point is more important than Recovery Time; if, for instance, the loss of an indeterminate amount of data cannot be tolerated or historical data needs to be available online for reference. An example might be an inventory application where the most current transactions are recoverable by other means and the application can be restarted where it left off, using the historical data as a basis for inventory status. This is a good example of a data-centric operation.
On-line Hot Backup (data backup that is conducted while the system is in full operation) is necessary if the Recovery Time is more critical than the Recovery Point. A good example is an on-line traffic or production control system where history is not as important as the current state of the situation. In air-traffic control, where the planes were five minutes ago is not as critical as where they are now, because in five minutes they may have moved 50 miles each, but in what directions? This is a classic example of a transaction-centric operation.
24 x 365 (continuous availability) is the only viable option where both the Recovery Point and Recovery Time are critical for an application.
Using the criteria of Recovery Point and Recovery Time, which state of availability is right for your
organization?
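The four options above can be read as a two-by-two decision on Recovery Point and Recovery Time. The sketch below is one way to encode that reading. It assumes each criterion can be reduced to a yes/no "critical" flag, which is a simplification of the text's relative wording ("more important than"); the names `Option` and `choose_option` are hypothetical and do not come from the original paper.

```python
from enum import Enum


class Option(Enum):
    """The four availability options described in the text."""
    REMOTE_HOT_SITE = "Remote Hot Site"
    ELECTRONIC_VAULTING = "Electronic Vaulting"
    ONLINE_HOT_BACKUP = "On-line Hot Backup"
    CONTINUOUS = "24 x 365 (continuous availability)"


def choose_option(recovery_point_critical: bool,
                  recovery_time_critical: bool) -> Option:
    """Map the two criteria to an option.

    Assumption: each criterion is a simple boolean. Real planning
    would weigh the two against each other and against cost.
    """
    if recovery_point_critical and recovery_time_critical:
        return Option.CONTINUOUS
    if recovery_point_critical:
        # Data matters more than speed of recovery (data-centric).
        return Option.ELECTRONIC_VAULTING
    if recovery_time_critical:
        # Current state matters more than history (transaction-centric).
        return Option.ONLINE_HOT_BACKUP
    return Option.REMOTE_HOT_SITE


if __name__ == "__main__":
    # The three worked examples from the text, as (point, time) flags.
    examples = {
        "billing": (False, False),
        "inventory": (True, False),
        "air-traffic control": (False, True),
    }
    for app, (rp, rt) in examples.items():
        print(f"{app}: {choose_option(rp, rt).value}")
```

Under these assumptions the billing, inventory, and air-traffic examples fall into the Remote Hot Site, Electronic Vaulting, and On-line Hot Backup cells respectively, matching the text.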
3. What Causes Downtime?
After looking at your information systems, the user community, and the cost of downtime, you can
determine the level of availability you need. Now it is time to focus on the events that can have a negative
impact on your ability to keep an application – and an organization – up and running.
Component faults due to hardware, software, or interoperability issues. While the industry has come a long way in improving Mean Time Between Failures (MTBF) for individual hardware, packaging, and mechanical components, the interdependent nature of today's multivendor and networked solutions makes them vulnerable to hardware, software, and network interoperability problems.
Administrative intervention. Just because it's planned downtime doesn't mean it's not downtime. Management tasks like system maintenance, database backups, index builds, table reorganizations, cache changes, application/operating system updates, system reconfiguration, and physical moves may require that a system be brought down. Or the intervention itself may cause a failure.
Building-level incidents. In addition to system problems, disasters affecting a site or building, such as fire, power loss, or flooding, can interrupt service by damaging systems, robbing them of power, or preventing access to them.
Metropolitan area disaster. Disasters, such as floods, fire, and blackouts, can also affect whole cities, impacting systems located throughout the metropolitan area.
Regional events. Computing can also be interrupted by disasters that affect systems across an even larger region. Hurricanes, earthquakes, or geopolitical disruptions can cause outages over hundreds of square miles.
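The component-fault item in the list above notes that good MTBF for individual parts does not protect an interdependent solution. A rough way to see why uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR), and multiplies the availabilities of components that must all be up at once. The sketch assumes component failures are independent, and every figure in it is hypothetical rather than taken from the paper or from any vendor data.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component.

    MTBF = mean time between failures, MTTR = mean time to repair.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(availabilities: list[float]) -> float:
    """Availability of a chain in which every component must work.

    Assumption: components fail independently of one another.
    """
    return math.prod(availabilities)


if __name__ == "__main__":
    # Hypothetical (MTBF hours, MTTR hours) pairs, for illustration only.
    components = {
        "server": (50_000, 4),
        "network switch": (100_000, 2),
        "database software": (2_000, 1),
        "application": (1_000, 1),
    }
    per_component = {
        name: availability(mtbf, mttr)
        for name, (mtbf, mttr) in components.items()
    }
    for name, a in per_component.items():
        print(f"{name}: {a:.5f}")
    total = series_availability(list(per_component.values()))
    print(f"whole chain: {total:.5f}")
```

Because every factor is below 1, the chain is always less available than its weakest component, which is the vulnerability the text attributes to multivendor and networked solutions.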
Do you know the probability of each of these events affecting your operation? Do you know what will happen to your applications, particularly those in the “24 x 365” zone, in each of these cases? Do you know that minimizing the potential negative impact can cost less than the alternative? Understanding these factors is crucial to determining the level of availability required by your organization.
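One way to approach the questions above is to estimate an expected annual downtime cost per event class and compare it with the yearly cost of mitigation. The sketch below does this with a simple model: expected occurrences per year, times outage hours per occurrence, times cost per hour of downtime. The class `DowntimeEvent`, the function `expected_annual_cost`, and all numbers are hypothetical placeholders; the paper gives no such figures.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEvent:
    name: str
    events_per_year: float   # expected occurrences per year (assumed)
    outage_hours: float      # expected outage length per occurrence (assumed)


def expected_annual_cost(events: list[DowntimeEvent],
                         cost_per_hour: float) -> float:
    """Sum of frequency x duration x hourly cost over all event classes."""
    return sum(e.events_per_year * e.outage_hours * cost_per_hour
               for e in events)


if __name__ == "__main__":
    # Placeholder values for the five causes named in the text.
    events = [
        DowntimeEvent("component fault", 2.0, 3.0),
        DowntimeEvent("administrative intervention", 12.0, 1.0),
        DowntimeEvent("building-level incident", 0.05, 48.0),
        DowntimeEvent("metropolitan area disaster", 0.01, 120.0),
        DowntimeEvent("regional event", 0.005, 240.0),
    ]
    hourly_cost = 10_000.0        # hypothetical cost of one hour of downtime
    mitigation_cost = 150_000.0   # hypothetical yearly cost of mitigation

    exposure = expected_annual_cost(events, hourly_cost)
    print(f"expected annual downtime cost: {exposure:,.0f}")
    print(f"mitigation cheaper than exposure: {mitigation_cost < exposure}")
```

The model ignores partial mitigation and correlated events; it is only meant to show how probability, impact, and cost combine into a single figure that can be set against the price of higher availability.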