CHECKPOINT(5)              Grid Engine File Formats              CHECKPOINT(5)

NAME

       checkpoint - Grid Engine checkpointing environment configuration file
       format

DESCRIPTION

       Checkpointing is a facility to save the complete state of an executing
       program or job and to restore and restart it from this so-called
       checkpoint at a later point in time if the original program or job
       was halted, e.g. through a system crash.

       Grid Engine provides various levels of checkpointing support (see
       ge_ckpt(1)). The checkpointing environment described here is a means
       to configure the different types of checkpointing in use for your Grid
       Engine cluster or parts thereof. For that purpose you can define the
       operations to be executed when initiating a checkpoint generation, a
       migration of a checkpoint to another host, or a restart of a
       checkpointed application, as well as the list of queues which are
       eligible for a checkpointing method.

       Supporting different operating systems may force Grid Engine to
       introduce operating system dependencies into the checkpointing
       configuration file, and updates of the supported operating system
       versions may lead to frequently changing implementation details.
       Please refer to the <ge_root>/ckpt directory for more information.

       Use the -ackpt, -dckpt, -mckpt or -sckpt options of the qconf(1)
       command to manipulate checkpointing environments from the command
       line, or use the corresponding qmon(1) dialogue for X-Windows based
       interactive configuration.

       Note that Grid Engine allows a backslash (\) to be used to escape a
       newline character (that is, a backslash immediately followed by a
       newline). The backslash and the newline are replaced with a single
       space (" ") character before any interpretation.

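       The line continuation rule above can be illustrated with a minimal
       Python sketch. It is not Grid Engine code; the function name and the
       script path in the sample entry are hypothetical and only mimic the
       replacement of a backslash-newline pair with one space.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical entry split across two physical lines with a trailing backslash.
raw = "ckpt_command /path/to/checkpoint.sh \\\n    job_state\n"
joined = join_continuations(raw)

# The entry is now one logical line; surrounding whitespace is kept as-is.
print(joined)
```

       Because the pair is replaced with a space rather than removed, the
       words on either side of the break stay separated.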

FORMAT

       The format of a checkpoint file is defined as follows:

   ckpt_name
       The name of the checkpointing environment as defined for ckpt_name in
       ge_types(1). It is used with the -ckpt switch of qsub(1) and with the
       qconf(1) options mentioned above.

   interface
       The type of checkpointing to be used. Currently, the following types
       are valid:

       hibernator
              The Hibernator kernel-level checkpointing is used.

       cpr    The SGI kernel-level checkpointing is used.

       cray-ckpt
              The Cray kernel-level checkpointing is used.

       transparent
              Grid Engine assumes that the jobs submitted with reference to
              this checkpointing interface use a checkpointing library such
              as the one provided by the public domain package Condor.

       userdefined
              Grid Engine assumes that the jobs submitted with reference to
              this checkpointing interface implement their own checkpointing
              method.

       application-level
              All of the interface commands configured in the checkpointing
              object are used, as with the kernel-level checkpointing
              interfaces (cpr, cray-ckpt, etc.), except for restart_command
              (see below). The restart_command is not used, even if it is
              configured; the job script is invoked instead when the job is
              restarted.

   ckpt_command
       A command line to be executed by Grid Engine in order to initiate a
       checkpoint.

   migr_command
       A command line to be executed by Grid Engine during a migration of a
       checkpointing job from one host to another.

   restart_command
       A command line to be executed by Grid Engine when restarting a
       previously checkpointed application.

   clean_command
       A command line to be executed by Grid Engine in order to clean up
       after a checkpointed application has finished.

   ckpt_dir
       A file system location in which checkpoints, which may be of
       considerable size, should be stored.

   ckpt_signal
       A Unix signal to be sent to a job by Grid Engine to initiate a
       checkpoint generation. The value for this field can either be a
       symbolic name from the list produced by the -l option of the kill(1)
       command or an integer number which must be a valid signal on the
       systems used for checkpointing.

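       As an illustration of the two accepted forms of ckpt_signal, the
       following Python sketch resolves either a symbolic name or an integer
       to a signal number using the standard signal module. It is not how
       Grid Engine validates the field; the function name is hypothetical,
       and the check only reflects the signals known on the machine where
       the sketch runs.

```python
import signal


def resolve_ckpt_signal(value: str) -> int:
    """Return the signal number for a symbolic name or an integer string.

    Raises ValueError if the value does not denote a signal that is valid
    on the machine running this check.
    """
    value = value.strip()
    if value.isdigit():
        number = int(value)
        if number in {int(s) for s in signal.valid_signals()}:
            return number
        raise ValueError(f"{number} is not a valid signal number here")
    # kill -l may print names with or without the SIG prefix; accept both.
    name = value.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None


# Both spellings of the same signal resolve to the same number.
print(resolve_ckpt_signal("TERM"), resolve_ckpt_signal("SIGTERM"))
```

       Signal numbers differ between operating systems, so a symbolic name
       is usually the more portable choice in a mixed cluster.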
   when
       The points in time at which checkpoints are expected to be generated.
       Valid values for this parameter are composed of the letters s, m, x
       and r, in any combination and without any separating character. The
       same letters are allowed for the -c option of the qsub(1) command,
       which overrides the definition in the checkpointing environment used.
       The meaning of the letters is as follows:

       s      A job is checkpointed, aborted and, if possible, migrated when
              the corresponding ge_execd(8) is shut down on the job's
              machine.

       m      Checkpoints are generated periodically at the min_cpu_interval
              defined by the queue (see queue_conf(5)) in which the job
              executes.

       x      A job is checkpointed, aborted and, if possible, migrated as
              soon as the job is suspended (manually or automatically).

       r      A job is rescheduled (not checkpointed) when the host on which
              the job currently runs goes into an unknown state and the time
              interval reschedule_unknown (see ge_conf(5)) defined in the
              global/local cluster configuration is exceeded.

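       The parameters described in this section can be put together in a
       short Python sketch that reads a checkpointing environment and checks
       the interface and when values. Several things here are assumptions
       and not taken from this page: the one "name value" pair per line
       layout, all function names, and every value in the sample
       environment (paths, environment name, signal). Only the letters s,
       m, x and r described above are handled.

```python
VALID_INTERFACES = {
    "hibernator", "cpr", "cray-ckpt",
    "transparent", "userdefined", "application-level",
}
WHEN_LETTERS = {
    "s": "checkpoint when ge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host state is unknown",
}

# Hypothetical environment; layout and values are illustrative assumptions.
EXAMPLE = """\
ckpt_name        my_ckpt
interface        userdefined
ckpt_command     /path/to/checkpoint.sh
migr_command     /path/to/migrate.sh
restart_command  /path/to/restart.sh
clean_command    /path/to/clean.sh
ckpt_dir         /path/to/ckpt_dir
ckpt_signal      USR1
when             sx
"""


def parse_ckpt_env(text: str) -> dict:
    """Parse 'name value' lines into a dict, skipping empty lines."""
    env = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue
        env[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return env


def check_when(value: str) -> str:
    """Accept a non-empty combination of s, m, x and r without separators."""
    if not value or set(value) - set(WHEN_LETTERS):
        raise ValueError(f"invalid when value: {value!r}")
    return value


def effective_when(env: dict, qsub_c=None) -> str:
    """A value given to qsub -c overrides the environment's when setting."""
    return check_when(qsub_c if qsub_c is not None else env["when"])


def check_env(env: dict) -> None:
    """Raise ValueError if interface or when hold an unsupported value."""
    if env.get("interface") not in VALID_INTERFACES:
        raise ValueError(f"invalid interface: {env.get('interface')!r}")
    check_when(env.get("when", ""))


env = parse_ckpt_env(EXAMPLE)
check_env(env)
for letter in effective_when(env):
    print(letter, "-", WHEN_LETTERS[letter])
```

       Calling effective_when(env, "m") in this sketch returns "m",
       mirroring how the -c option of qsub(1) takes precedence over the
       when value of the environment.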

RESTRICTIONS

       Note that the checkpointing, migration and restart procedures provided
       by default with the Grid Engine distribution, as well as the way they
       are invoked in the ckpt_command, migr_command or restart_command
       parameters of any default checkpointing environment, should not be
       changed. Otherwise, their functionality becomes the full
       responsibility of the administrator configuring the checkpointing
       environment. Grid Engine just invokes these procedures and evaluates
       their exit status. If the procedures do not perform their tasks
       properly or are not invoked in a proper fashion, the checkpointing
       mechanism may behave unexpectedly, and Grid Engine has no means to
       detect this.

SEE ALSO

       ge_intro(1), ge_ckpt(1), ge_types(1), qconf(1), qmod(1), qsub(1),
       ge_execd(8).

COPYRIGHT

       See ge_intro(1) for a full statement of rights and permissions.


GE 6.2u5                 $Date: 2007/02/14 12:58:39 $            CHECKPOINT(5)