Compaq DS20E Technical Guide - Page 9

Server Management, Reliability, Availability, and Maintainability - alpha

Page 9 highlights

Server Management

The AlphaServer products support important operational and platform management requirements.

Operational Management

Server/Network Management. Compaq Insight Manager is included with every system. This software tool allows you to monitor and control Alpha-based servers. Insight Manager consists of two components: a Windows-based console application and server- or client-based management data collection agents. Management agents monitor over 1,000 management parameters. Key subsystems are instrumented to make health, configuration, and performance data available to the agent software. The agents act on that data by initiating alarms in the event of faults and by providing updated management information, such as network interface or storage subsystem performance statistics.

Remote Server Management. An integrated remote management console (RMC) lets the operator perform several tasks from a serial console: monitor the system power, temperature, and fans, and reset, halt, and power the system on or off, regardless of the operating system or hardware state. The monitoring can be done locally or remotely through a modem.

Platform Management

The AlphaServer DS20E systems support platform management tasks such as manipulating and monitoring hardware performance, configuration, and errors. For example, the operating systems provide a number of tools to characterize system performance and display errors logged in the system error log file.

In addition, system console firmware provides hardware configuration tools and diagnostics to facilitate quick hardware installation and troubleshooting. The system operator can use simple console commands to show the system configuration, devices, boot and operational flags, and recorded errors. The console also aids in inventory support by giving access to serial numbers and revisions of hardware and firmware.

Error Reporting

Compaq Analyze, a diagnostic tool used to determine the cause of hardware failures, is installed with the operating systems. It provides automatic background analysis, as it constantly views and reads the error log file. It analyzes both single error/fault events and multiple events. When an error condition is detected, it collects the error information and sends it and an analysis to the user. The tool requires a graphics monitor for its output display.

Reliability, Availability, and Maintainability

The AlphaServer DS20E system achieves an unparalleled level of reliability and availability through the careful application of technologies that balance redundancy, error correction, and fault management. Reliability and availability features are built into the CPU, memory, and I/O, and implemented at the system level.

Processor Features

• CPU data cache provides error correction code (ECC) protection.
• Parity protection on CPU cache tag store.
• Multi-tiered power-up diagnostics to verify the functionality of the hardware.

With two processors, when you power up or reset the system, each CPU, in parallel, runs a set of diagnostic tests. If any tests fail, the failing CPU is configured out of the system. Responsibility for initializing memory and booting the console firmware is transferred to the other CPU, and the boot process continues. This feature ensures that a system can still power up and boot the operating system in case of a CPU failure. LEDs on the control panel indicate test status and component failure information.

Memory Features

• The memory ECC scheme is designed to provide maximum protection for user data. The memory scheme corrects single-bit errors and detects double-bit errors and total DRAM failure. It also detects RAM address errors.
• Memory failover. The power-up diagnostics are designed to provide the largest amount of usable memory, configuring around errors.
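
The "corrects single-bit errors and detects double-bit errors" behavior of ECC memory can be illustrated with a toy example. The sketch below uses an extended Hamming (8,4) code on a 4-bit value. This is an assumption made purely for illustration: the guide does not specify which code the DS20E uses, and real memory ECC operates on much wider words. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                      # bits[0] is the overall parity bit
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) & 1             # makes the total number of 1s even
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    overall_odd = sum(bits) & 1         # 1 if an odd number of bits flipped

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        bits[syndrome] ^= 1             # single-bit error: syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"    # two bits flipped: detected, not corrected

    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status


word = encode(0b1011)
print(decode(word))                          # (11, 'ok')
print(decode(word ^ (1 << 5)))               # (11, 'corrected')
print(decode(word ^ (1 << 5) ^ (1 << 2)))    # (None, 'uncorrectable')
```

The extra overall parity bit is what separates the two cases: a nonzero syndrome with odd overall parity points at one flipped bit, while a nonzero syndrome with even overall parity means two bits changed and the data cannot be trusted.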

I/O Features

• ECC protection on the switch interconnect and parity protection on the PCI and SCSI buses.
• Extensive error correction built into disk drives.
• Optional internal RAID improves reliability and data security.
• Disk hot swap.
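
Parity protection, as listed for the PCI and SCSI buses, is weaker than ECC: it can detect an odd number of flipped bits but cannot correct them. A minimal sketch of even parity follows. The 32-bit word and the helper names `parity_bit` and `transfer_ok` are assumptions for illustration only and do not reflect the actual bus widths or signal groupings of the DS20E.

```python
def parity_bit(value: int) -> int:
    """Even parity: the bit that makes the total count of 1s even."""
    return bin(value).count("1") % 2


def transfer_ok(value: int, parity: int) -> bool:
    """Receiver-side check: recompute parity and compare with the sent bit."""
    return parity_bit(value) == parity


word = 0xDEADBEEF                 # illustrative 32-bit value
sent_parity = parity_bit(word)    # sender drives this alongside the data

print(transfer_ok(word, sent_parity))          # clean transfer passes
print(transfer_ok(word ^ 0b1, sent_parity))    # one flipped bit is detected
print(transfer_ok(word ^ 0b11, sent_parity))   # two flipped bits go unnoticed
```

The last case shows why parity is paired with retries or higher-level checks rather than relied on alone: an even number of bit flips leaves the parity unchanged.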

