HP 1032 ClusterPack V2.4 Tutorial - Page 124

Prepare application for checkpoint restart



and should not be used while AppRS jobs are running.
% apprs_clean all
For jobs submitted to non-AppRS queues, the user's job submission script should include
commands to remove files that are no longer needed when the job completes. If the job
fails to run to completion, it may be necessary to remove these files manually. To find
out which hosts the job executed on, use the command:
% bhist -l <jobid>
Included in the output is the list of hosts that the job executed on and the working directory
used for execution. This information can be used to manually delete files from a job that was
unable to complete successfully.
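The cleanup described above for non-AppRS queues can be sketched as follows. This is a minimal illustration, not part of ClusterPack: the function names, the scratch directory layout, and the use of ssh as the remote shell are all assumptions. The first function is meant to be called from the job script when the job finishes; the second only prints removal commands for review, using hosts and a working directory read by hand from the bhist -l output.

```shell
#!/bin/sh
# Hypothetical helper: remove a job's scratch directory.
# Refuses an empty argument or "/" as a basic safety check.
cleanup_scratch() {
    dir=$1
    case "$dir" in
        ""|"/") echo "refusing to remove '$dir'" >&2; return 1 ;;
    esac
    rm -rf -- "$dir"
}

# Hypothetical helper: print (do not run) one removal command per host.
# Usage: print_cleanup_commands <working-directory> <host>...
# ssh is an assumed remote shell; substitute whatever the cluster uses.
print_cleanup_commands() {
    workdir=$1
    shift
    for host in "$@"; do
        printf "ssh %s rm -rf -- '%s'\n" "$host" "$workdir"
    done
}

# Possible use inside a job script (all names below are assumptions):
#   SCRATCH=/tmp/myjob.$LSB_JOBID
#   mkdir -p "$SCRATCH"
#   trap 'cleanup_scratch "$SCRATCH"' EXIT
#   ./my_app
```

An exit trap of this kind does not run if the job is killed outright, which is one case where the manual cleanup path is still needed.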
References:
3.7.9 How do I access the Clusterware Pro V5.1 Command Line Interface?
2.3.5 Prepare application for checkpoint restart
Any job submitted to an AppRS-enabled queue is restarted on a new set of hosts if:
  • Any host allocated to the job becomes unavailable or unreachable by the other hosts
    while the job is executing.
  • The job is explicitly migrated using the LSF command bmig.
  • The user's job exits with exit code 3.
(For more information on exit values, see the HP Application ReStart User's Guide.)
As long as an application can generate restart files and be restarted from those files, AppRS
will ensure that files marked as Highly Available are present when the application is
restarted. AppRS will requeue any application that exits with a status of either 2 or 3. If the
application (or script that invokes the application) should not be requeued, an exit status
other than 2 or 3 should be used.
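The exit-status rule above can be sketched as a pair of small helpers for the script that invokes the application. This is an illustration under stated assumptions, not AppRS code: the function names are hypothetical, the replacement status 1 is an arbitrary choice, and testing for a non-empty restart file is only one possible way to decide whether to restart.

```shell
#!/bin/sh
# Hypothetical helper: report whether to start fresh or from a restart file.
# Prints "restart" if the given file exists and is non-empty, else "fresh".
start_mode() {
    if [ -s "$1" ]; then
        echo restart
    else
        echo fresh
    fi
}

# Hypothetical helper: map an exit status to one that will not be requeued.
# Statuses 2 and 3 cause a requeue, so they are replaced with 1 here;
# any other status is passed through unchanged.
no_requeue_status() {
    case "$1" in
        2|3) echo 1 ;;
        *)   echo "$1" ;;
    esac
}

# Possible use in a job script (application name and file are assumptions):
#   mode=$(start_mode my_app.restart)
#   ./my_app "$mode"; status=$?
#   exit "$(no_requeue_status "$status")"
```

A script that does want a requeue would instead exit with the application's own status of 2 or 3 unchanged.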
A job submission script for a checkpoint/restart application should follow the example
in /opt/apprs/examples/job_template:
#!/bin/sh
#BSUB -n 2 # Number of processors requested
#BSUB -e test.stderr # Standard error file
#BSUB -o test.stdout # Standard output file
#BSUB -q normal_apprs
#APPRS INPUT # list input files separated by spaces
#APPRS HIGHLYVISIBLE # list HV (progress) files
#APPRS HIGHLYAVAILABLE # list HA (restart) files