HP Cluster Platform Introduction v2010 Microsoft Windows HPC Server 2008 Insta - Page 64




• The compute node fails to copy the operating system from the WDS server. When the compute node starts the WinPE environment, a command window is displayed. You should see it add any drivers (if you injected drivers into the OS image) and then start the provisioning. One of the early steps is copying the OS image to the compute node. If this copy fails, an error appears in the command window and the commands stop.
  - Make sure you added the correct network drivers for your compute node to the image. Check or add drivers using the HPC Management Console: select the Configuration tab, then Images on the right side, then Manage drivers.
• The compute node may fail to join the domain.
  - Verify that the username and password for the domain credentials are correct.
  - Verify that the supplied username has the privileges required to add domain accounts and machines.
  - Verify that the DNS on the head node is correct.
  - Check the list of machines in the domain and verify that there are no machine name conflicts.
  NOTE: Sometimes removing the machine's information from the domain and restarting the provisioning will clean up artifact information on the machine.
• The provisioning might appear to fail. Check the compute node console for reasons why the provisioning might appear to be stalled. Possible reasons include:
  - Looking for a license key or waiting for a license key to be entered. Enter a key at the console, or just click Next to continue.
  - Unsigned driver loading. Unsigned drivers added to the OS image might cause a hang on reboot. The screen indicates this error and lists the unsigned driver causing the issue. Remove the driver from your OS image.
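The machine name conflict check in the domain-join item above can be scripted once you have both lists as plain text. The following is a minimal sketch, not part of the HP or Microsoft tooling: the function name and the sample machine names are hypothetical, and it assumes you have already exported the domain's machine list by whatever means your site uses.

```python
def find_name_conflicts(planned_names, domain_machines):
    """Return the planned node names that already exist in the domain.

    The comparison ignores case and surrounding whitespace, because
    Windows computer names are not case-sensitive.
    """
    existing = {name.strip().upper() for name in domain_machines}
    return sorted(n for n in planned_names if n.strip().upper() in existing)


# Hypothetical sample data: names you intend to provision, and names
# exported from the domain.
planned = ["NODE001", "node002", "NODE003"]
in_domain = ["HEADNODE", "Node002", "FILESRV"]

conflicts = find_name_conflicts(planned, in_domain)
print(conflicts)
```

Any name this reports is a candidate for removal from the domain, or for renaming, before you restart the provisioning.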
9.3 Diagnostics Fail
Running diagnostics is a valuable way to check cluster health, but failures can be confusing. Some failures to be aware of:
• Initial diagnostics of the head node might fail, indicating that the management service is not running. Often the test was run too soon, and the service has since started. Clear the error and rerun the test to verify that the service is running.
• The SOA test may fail. If you do not designate a broker node, this test fails. Do not run this test if you are not using SOA.
• The Windows Update check might fail. If you do not perform Windows Updates after the installation, this test fails. Initiate a Windows update, or schedule an update from the maintenance section of the compute node templates. Take the nodes offline and select Maintain to run the updates.
• The DNS test might fail, indicating that the private node is communicating through the head node. If the private node is not on the public network, the node's DNS entry should be proxied through the head node and should not show up in DNS. Sometimes a DNS entry appears with the node's private address. Such entries are incorrect and can be deleted.
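To find the incorrect DNS entries described in the last item above, you can compare each record's address against the private subnet. This is a minimal sketch using only the Python standard library. The function name, host names, addresses, and subnet are hypothetical examples, and it assumes you have already exported the DNS records as (host, address) pairs.

```python
import ipaddress


def find_private_dns_entries(dns_records, private_subnet):
    """Return the (host, address) records whose address lies in the
    cluster's private subnet. These are the entries to review and delete."""
    net = ipaddress.ip_network(private_subnet)
    return [
        (host, addr)
        for host, addr in dns_records
        if ipaddress.ip_address(addr) in net
    ]


# Hypothetical sample data: 192.0.2.x stands in for the public network,
# 192.168.0.0/24 for the private cluster network.
records = [
    ("HEADNODE", "192.0.2.10"),
    ("NODE001", "192.168.0.11"),
    ("NODE002", "192.0.2.12"),
]

stale = find_private_dns_entries(records, "192.168.0.0/24")
print(stale)
```

The sketch only reports candidates. Deleting the records is still done in your DNS management tool.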
9.4 InfiniBand
MPI applications might not run on InfiniBand.
• Verify that the network topology uses the Application network.
• Verify that all IB network adapters have an IP address assigned. Use clusrun ipconfig to dump the IP configuration of all nodes in the cluster.
• Verify that the IB network adapter is named 'Application'. If not, rerun network configuration and reconfigure the cluster network.
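The adapter checks above can be automated by scanning saved ipconfig output. The sketch below is an illustration under stated assumptions: it expects the usual English-language ipconfig layout ("... adapter NAME:" headers followed by indented "IPv4 Address" lines), it handles one node's output at a time, and it makes no assumption about how clusrun separates the output of different nodes. The function names and sample addresses are hypothetical.

```python
import re

# Header line such as "Ethernet adapter Application:" (starts in column 0).
ADAPTER_RE = re.compile(r"^\S.*adapter (.+):\s*$")
# Detail line such as "   IPv4 Address. . . . . . . . . . . : 172.16.0.11".
IPV4_RE = re.compile(r"IPv4 Address[ .]*:\s*([\d.]+)")


def adapter_ipv4(ipconfig_text):
    """Map each adapter name to its IPv4 address, or None if none is listed."""
    adapters = {}
    current = None
    for line in ipconfig_text.splitlines():
        header = ADAPTER_RE.match(line)
        if header:
            current = header.group(1)
            adapters[current] = None
            continue
        address = IPV4_RE.search(line)
        if address and current is not None:
            adapters[current] = address.group(1)
    return adapters


def check_application_adapter(ipconfig_text):
    """Report whether an adapter named 'Application' exists and has an address."""
    adapters = adapter_ipv4(ipconfig_text)
    if "Application" not in adapters:
        return "no adapter named 'Application'"
    if adapters["Application"] is None:
        return "'Application' adapter has no IPv4 address"
    return "ok"


# Hypothetical ipconfig output from a single node.
sample = """Windows IP Configuration

Ethernet adapter Application:

   IPv4 Address. . . . . . . . . . . : 172.16.0.11
   Subnet Mask . . . . . . . . . . . : 255.255.0.0

Ethernet adapter Private:

   IPv4 Address. . . . . . . . . . . : 192.168.0.11
"""

print(check_application_adapter(sample))
```

A result other than "ok" corresponds to the second or third check in the list: assign an address to the adapter, or rerun network configuration so that the adapter is named 'Application'.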