Dell PowerEdge T140 EMC PowerEdge Servers Troubleshooting Guide - Page 100




Preventing problems before they happen and solving punctures after they occur
Dell's RAID controllers contain a number of features to prevent many types of problems and to handle a variety of errors that do occur. The primary job of a RAID controller is to preserve the integrity of the data contained on its array(s). Even in the more extreme cases of damage (such as punctures), the array's data is often available and the server can remain in production. Part of any maintenance plan should be the proactive maintenance of the RAID arrays. Dell's RAID controllers are highly reliable and very good at managing their arrays without user intervention. However, disregarding proper maintenance can cause even the most sophisticated technologies to experience problems over time. There are a number of things that can help maintain the health of arrays and prevent the majority of data errors, double faults, and punctures.
It is highly recommended to perform routine and regular maintenance. Proactive maintenance can correct existing errors and prevent some errors from occurring. It is not possible to prevent all errors, but most serious errors can be mitigated significantly with proactive maintenance. For storage and RAID subsystems, these steps are:
• Update drivers and firmware on controllers, hard drives, backplanes, and other devices.
• Perform routine Check Consistency operations (Dell recommends every 30 days).
• Inspect cabling for signs of wear and damage, and ensure good connections.
• Review logs for indications of problems.
This doesn’t have to be a high-level technical review, but could simply be a cursory view of the logs looking for extremely obvious indications of potential problems. Contact Dell Technical Support with any questions or concerns.
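The 30-day Check Consistency interval and the cursory log review described above can be sketched in a few lines of Python. This is only an illustration, not a Dell tool: the function names, the keyword list, and the idea of reading the log as plain text lines are all assumptions, and the keywords should be adjusted to the wording your own controller and system logs actually use.

```python
from datetime import date, timedelta

# Hypothetical keyword list (an assumption, not taken from Dell documentation).
SUSPECT_KEYWORDS = ("puncture", "double fault", "degraded", "failed")


def consistency_check_overdue(last_run: date, today: date, interval_days: int = 30) -> bool:
    """Return True when more than interval_days have passed since last_run."""
    return today - last_run > timedelta(days=interval_days)


def cursory_log_review(lines, keywords=SUSPECT_KEYWORDS):
    """Return (line_number, text) pairs for lines that mention any keyword.

    The match is case-insensitive and purely textual; it is meant to surface
    obvious problems, not to replace a detailed technical review.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, text.strip()))
    return hits
```

For example, `consistency_check_overdue(date(2024, 1, 1), date(2024, 2, 15))` reports that a check is overdue, and `cursory_log_review(open(path))` can be pointed at any exported log file, where `path` is whatever location you exported the log to.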
Troubleshooting thermal issue
Thermal issues can occur due to malfunctioning ambient temperature sensors, malfunctioning fans, dusty heat sinks, malfunctioning thermal sensors, and so on.
To resolve thermal issues:
1. Check the LCD and Embedded System Management (ESM) logs for any additional error messages to identify the faulty component.
2. Ensure that airflow to the machine is not blocked. Placing it in an enclosed area or blocking the air vent can cause it to overheat. If installed in a rack, ensure that the rack cooling system is working normally.
3. Check that the ambient temperature is within acceptable levels.
4. Check the internal system fans for obstructions and ensure that all fans are spinning properly. Swap any failing fans with a known-good fan for testing.
5. Ensure that all the required shrouds and blanks are installed.
6. Check that all the fans are functioning properly, the heat sink is installed correctly, and thermal grease is applied.
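If ambient temperature and fan speed readings are already being collected by some means, the checks in steps 3 and 4 can be expressed as a simple comparison against limits. The sketch below is a minimal illustration under stated assumptions: `ThermalReading` and `thermal_findings` are hypothetical names, and the limits are parameters the caller must supply from the system's own specifications, because this guide does not state numeric thresholds.

```python
from dataclasses import dataclass


@dataclass
class ThermalReading:
    ambient_c: float  # ambient temperature in degrees Celsius
    fan_rpm: dict     # fan name -> measured speed in RPM


def thermal_findings(reading, max_ambient_c, min_fan_rpm):
    """Return human-readable findings for readings outside the given limits.

    max_ambient_c and min_fan_rpm are caller-supplied assumptions; take real
    values from the specifications for your system.
    """
    findings = []
    if reading.ambient_c > max_ambient_c:
        findings.append(
            f"ambient temperature {reading.ambient_c:.1f} C exceeds limit {max_ambient_c:.1f} C"
        )
    # Sort by fan name so the report order is stable between runs.
    for name, rpm in sorted(reading.fan_rpm.items()):
        if rpm < min_fan_rpm:
            findings.append(f"{name} at {rpm} RPM is below minimum {min_fan_rpm} RPM")
    return findings
```

An empty list means no reading fell outside the supplied limits. It does not cover the physical checks in the list above, such as blocked airflow, missing shrouds and blanks, or heat sink seating.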