HP 1032 ClusterPack V2.4 Tutorial - Page 126

Restart application from a checkpoint if a Compute Node crashes


2.3.6 Restart application from a checkpoint if a Compute Node crashes
If a Compute Node crashes, jobs submitted to an AppRS queue will automatically be
restarted on a new node or set of nodes as those resources become available. No user
intervention is necessary.
2.3.7 Determine if the application fails to complete
The job state of EXIT is assigned to jobs that end abnormally.
Using the Clusterware Pro V5.1 Web Interface:
From the Jobs tab:
Review the job states in the Jobs table.
Use the Previous and Next buttons to view more Jobs.
Using the Clusterware Pro V5.1 CLI:
% bjobs <job_ID>
References:
3.7.8 How do I access the Clusterware Pro V5.1 Web Interface?
3.7.9 How do I access the Clusterware Pro V5.1 Command Line Interface?
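The CLI check for abnormally ended jobs can be scripted. The following is a minimal sketch, not part of ClusterPack: the function name exited_jobs is hypothetical, and it assumes that bjobs prints one header line followed by one row per job, with the job ID in the first column and the job state in the third. Verify the column layout on your system before relying on it.

```shell
# Hypothetical helper, not a ClusterPack or Clusterware Pro command.
# Reads bjobs-style output on standard input and prints the job ID of
# every job whose state is EXIT.
# Assumption: line 1 is a header; column 1 is the job ID, column 3 the state.
exited_jobs() {
    awk 'NR > 1 && $3 == "EXIT" { print $1 }'
}
```

From a POSIX shell (the function syntax is not valid in csh), pipe the job listing into the helper, for example: bjobs -a | exited_jobs. An empty result means that no listed job is in the EXIT state.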
2.3.8 Check impact on the job if a Compute Node crashes
If a Compute Node crashes or becomes unavailable, you may want to check which jobs
are affected.
Using the Clusterware Pro V5.1 CLI:
List your current and recently finished jobs:
% bjobs -a
Request information on a particular job: