CHECKPOINT(5)              Grid Engine File Formats              CHECKPOINT(5)

NAME
       checkpoint - Grid Engine checkpointing environment configuration file
       format

DESCRIPTION
       Checkpointing is a facility to save the complete status of an executing
       program or job and to restore and restart it from this so-called
       checkpoint at a later point in time if the original program or job was
       halted, e.g. through a system crash.

       Grid Engine provides various levels of checkpointing support (see
       ge_ckpt(1)). The checkpointing environment described here is a means
       to configure the different types of checkpointing in use for your Grid
       Engine cluster or parts thereof. For that purpose you can define the
       operations to be executed when initiating a checkpoint generation,
       migrating a checkpoint to another host, or restarting a checkpointed
       application, as well as the list of queues which are eligible for a
       checkpointing method.

       Supporting different operating systems may easily force Grid Engine to
       introduce operating system dependencies into the checkpointing
       configuration file, and updates of the supported operating system
       versions may lead to frequently changing implementation details.
       Please refer to the <ge_root>/ckpt directory for more information.

       Use the -ackpt, -dckpt, -mckpt or -sckpt options of the qconf(1)
       command to manipulate checkpointing environments from the command line,
       or use the corresponding qmon(1) dialogue for X-Windows based
       interactive configuration.

       Note that Grid Engine allows backslashes (\) to be used to escape
       newline (\newline) characters. The backslash and the newline are
       replaced with a space (" ") character before any interpretation.
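
       The continuation rule above can be sketched in Python as follows.
       This is only an illustration, assuming the raw file text is already
       held in a string; the function name join_continuations is
       hypothetical and is not part of Grid Engine.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space.

    This mirrors the rule stated above: the substitution happens before
    any other interpretation of the file content.
    """
    return text.replace("\\\n", " ")


if __name__ == "__main__":
    # Hypothetical two-line entry that continues onto a second line.
    raw = "when\\\nsx\n"
    print(repr(join_continuations(raw)))
```

       A plain string replacement is enough here because the rule is purely
       textual and is applied before parsing.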

FORMAT
       The format of a checkpoint file is defined as follows:

   ckpt_name
       The name of the checkpointing environment as defined for ckpt_name in
       ge_types(1). It is used with the qsub(1) -ckpt switch or with the
       qconf(1) options mentioned above.

   interface
       The type of checkpointing to be used. Currently, the following types
       are valid:

       hibernator
              The Hibernator kernel level checkpointing is interfaced.

       cpr    The SGI kernel level checkpointing is used.

       cray-ckpt
              The Cray kernel level checkpointing is assumed.

       transparent
              Grid Engine assumes that the jobs submitted with reference to
              this checkpointing interface use a checkpointing library such
              as the one provided by the public domain package Condor.

       userdefined
              Grid Engine assumes that the jobs submitted with reference to
              this checkpointing interface perform their private
              checkpointing method.

       application-level
              Uses all of the interface commands configured in the
              checkpointing object, as in the case of the kernel level
              checkpointing interfaces (cpr, cray-ckpt, etc.), except for
              the restart_command (see below). The restart_command is not
              used, even if it is configured; the job script is invoked
              instead in case of a restart.
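
       The interface types listed above can be checked with a short Python
       sketch. It is not taken from Grid Engine; the names VALID_INTERFACES,
       check_interface and ignores_restart_command are hypothetical, and the
       sketch encodes only what this page states.

```python
# Interface names exactly as listed in this page.
KERNEL_LEVEL = {"hibernator", "cpr", "cray-ckpt"}
VALID_INTERFACES = KERNEL_LEVEL | {
    "transparent",
    "userdefined",
    "application-level",
}


def check_interface(name: str) -> str:
    """Return the name unchanged if it is a listed interface type."""
    if name not in VALID_INTERFACES:
        raise ValueError(f"unknown checkpointing interface: {name!r}")
    return name


def ignores_restart_command(name: str) -> bool:
    """True only for application-level, where the job script is
    invoked on restart instead of the configured restart_command."""
    return check_interface(name) == "application-level"
```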

   ckpt_command
       A command-line type command string to be executed by Grid Engine in
       order to initiate a checkpoint.

   migr_command
       A command-line type command string to be executed by Grid Engine during
       a migration of a checkpointing job from one host to another.

   restart_command
       A command-line type command string to be executed by Grid Engine when
       restarting a previously checkpointed application.

   clean_command
       A command-line type command string to be executed by Grid Engine in
       order to clean up after a checkpointed application has finished.

   ckpt_dir
       A file system location in which checkpoints of potentially
       considerable size should be stored.

   ckpt_signal
       A Unix signal to be sent to a job by Grid Engine to initiate a
       checkpoint generation. The value for this field can either be a
       symbolic name from the list produced by the -l option of the kill(1)
       command or an integer number which must be a valid signal on the
       systems used for checkpointing.
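
       A ckpt_signal value could be pre-checked with a Python sketch like the
       one below. It rests on several assumptions: the function name
       parse_ckpt_signal is hypothetical, symbolic names are accepted with or
       without a SIG prefix, and validity is tested only against the host
       running the script, which may differ from the systems actually used
       for checkpointing.

```python
import signal


def parse_ckpt_signal(value: str) -> int:
    """Return the signal number for a symbolic name or an integer string.

    Assumption: only the local host's signal table is consulted.
    """
    text = value.strip()
    if text.isdigit():
        number = int(text)
        # valid_signals() may contain enum members and plain ints.
        valid = {int(s) for s in signal.valid_signals()}
        if number not in valid:
            raise ValueError(f"not a valid signal number here: {number}")
        return number
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name  # kill -l commonly prints names without SIG
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None
```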

   when
       The points in time when checkpoints are expected to be generated.
       Valid values for this parameter are composed of the letters s, m, x
       and r and any combination thereof without any separating character in
       between. The same letters are allowed for the -c option of the qsub(1)
       command, which overrides the definitions in the checkpointing
       environment used. The meaning of the letters is defined as follows:

       s      A job is checkpointed, aborted and, if possible, migrated if the
              corresponding ge_execd(8) is shut down on the job's machine.

       m      Checkpoints are generated periodically at the min_cpu_interval
              interval defined by the queue (see queue_conf(5)) in which a
              job executes.

       x      A job is checkpointed, aborted and, if possible, migrated as
              soon as the job gets suspended (manually as well as
              automatically).

       r      A job will be rescheduled (not checkpointed) when the host on
              which the job currently runs goes into an unknown state and the
              time interval reschedule_unknown (see ge_conf(5)) defined in
              the global/local cluster configuration is exceeded.
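
       The letter rules for the when parameter can be illustrated with the
       following Python sketch. The names parse_when and effective_when are
       hypothetical, dropping repeated letters is an assumption (this page
       does not say how duplicates are treated), and only the letter form of
       the qsub(1) -c option is modelled.

```python
# Letters and their meaning, summarised from the list above.
WHEN_LETTERS = {
    "s": "checkpoint on shutdown of the execution daemon",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host stays in unknown state",
}


def parse_when(value: str) -> list:
    """Split a when string into its letters, rejecting anything else."""
    if not value:
        raise ValueError("when must contain at least one letter")
    unknown = sorted(set(value) - set(WHEN_LETTERS))
    if unknown:
        raise ValueError(f"invalid when letters: {''.join(unknown)}")
    # Assumption: keep first occurrences, drop repeated letters.
    return list(dict.fromkeys(value))


def effective_when(env_when: str, qsub_c: str = None) -> list:
    """A -c value given at submission overrides the environment's when."""
    return parse_when(qsub_c if qsub_c is not None else env_when)
```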

RESTRICTIONS
       Note that the checkpointing, migration and restart procedures provided
       by default with the Grid Engine distribution, as well as the way they
       are invoked in the ckpt_command, migr_command or restart_command
       parameters of any default checkpointing environment, should not be
       changed. Otherwise, the functionality remains the full responsibility
       of the administrator configuring the checkpointing environment. Grid
       Engine will just invoke these procedures and evaluate their exit
       status. If the procedures do not perform their tasks properly or are
       not invoked in a proper fashion, the checkpointing mechanism may
       behave unexpectedly; Grid Engine has no means to detect this.
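
       Because only the exit status is evaluated, a procedure can be tried by
       hand before it is configured. The Python sketch below shows one way to
       do that. It is not how Grid Engine invokes the procedures: the
       function name run_procedure is hypothetical, and the environment, user
       and working directory of a real invocation are not modelled.

```python
import shlex
import subprocess
import sys


def run_procedure(command_line: str) -> int:
    """Run a command-line type string and return its exit status."""
    argv = shlex.split(command_line)  # split like a POSIX shell would
    completed = subprocess.run(argv, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in procedure: a Python one-liner that exits with status 0.
    demo = shlex.quote(sys.executable) + ' -c "import sys; sys.exit(0)"'
    print("exit status:", run_procedure(demo))
```

       A procedure that returns a zero status without doing its work would
       pass this check as well, which is the limitation described above.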

SEE ALSO
       ge_intro(1), ge_ckpt(1), ge_types(1), qconf(1), qmod(1), qsub(1),
       ge_execd(8).

COPYRIGHT
       See ge_intro(1) for a full statement of rights and permissions.

GE 6.2u5                 $Date: 2007/02/14 12:58:39 $                 CHECKPOINT(5)