HP 1032 ClusterPack V2.4 Tutorial - Page 143

ClusterPack, Application ReStart AppRS Overview

ClusterPack
Application ReStart (AppRS) Overview
3.4.1 What is AppRS?
AppRS is a collection of software that works with Platform Computing's Clusterware™ to
provide a fail-over system that preserves the contents of an application's current working
directory (CWD) when a fail-over occurs.

Many technical applications provide application-level checkpoint/restart facilities, in which
the application saves its state to a set of files and can later restore its state from them.
Checkpoint/restart is particularly helpful for long-running applications because it can
minimize the computing time lost to computer failure. Two factors, however, diminish the
usefulness of this capability.

First, computer failure frequently leaves the restart files inaccessible. Using a shared file
system does not preclude data loss and can degrade performance. Redundant hardware solutions
are often financially impractical for the large clusters used in technical computing.

Second, applications affected by computer failure generally require human detection and
intervention before they can be restarted from their restart files. Valuable compute time is
often lost between the time the job fails and the time a user becomes aware of the failure.

Clusterware™ with AppRS provides functionality to migrate and restart applications affected by
an unreachable host, and to ensure that the CWD contents of such applications are preserved
across the migration.
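
To make the idea of application-level checkpoint/restart concrete, the following is a minimal
sketch in Python of an application that saves its state to a restart file in its CWD and
resumes from that file when it is started again. It is not part of AppRS or Clusterware™; the
file name state.ckpt.json, the function names, and the toy workload are all assumptions made
for illustration.

```python
import json
import os
import tempfile

CHECKPOINT = "state.ckpt.json"  # hypothetical restart file name, kept in the CWD


def save_checkpoint(state, path=CHECKPOINT):
    """Write the state atomically so a crash never leaves a half-written restart file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # rename within one directory replaces the old file in one step


def load_checkpoint(path=CHECKPOINT):
    """Return the saved state, or a fresh state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(steps, path=CHECKPOINT, every=10):
    """Toy workload: sum 0..steps-1, checkpointing every `every` steps.

    If a restart file is present, the loop resumes from the saved step
    instead of starting over.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], steps):
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]
```

A restarted run only helps if the restart file is still reachable on the new host, which is
the gap that preserving the CWD across a migration is meant to close.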
AppRS is accessed by submitting jobs to AppRS-enabled queues. Such queues generally end
in "_apprs". A number of utilities are also available for monitoring a job and its files:
apprs_hist
apprs_ls
apprs_clean
apprs_mpijob
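
The queue naming convention mentioned above can be sketched as a small Python helper that
picks out queue names ending in "_apprs". This is only an illustration: the queue names in the
usage note are hypothetical, and because the convention holds only "generally", a site may
name its AppRS-enabled queues differently.

```python
APPRS_SUFFIX = "_apprs"  # naming convention from the text; a convention, not a guarantee


def apprs_queues(queue_names, suffix=APPRS_SUFFIX):
    """Return the queue names that follow the AppRS naming convention."""
    return [name for name in queue_names if name.endswith(suffix)]
```

For example, given the hypothetical names "normal", "normal_apprs", "long_apprs", and
"apprs_test", the helper keeps only "normal_apprs" and "long_apprs".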
More information is available in the apprs man page or the HP Application ReStart User's Guide.
% man apprs