Grid(Engine_CKPT)                   GE 6.1                   Grid(Engine_CKPT)

NAME

       Grid Engine Checkpointing - the Grid Engine checkpointing mechanism and
       checkpointing support

DESCRIPTION

       Grid Engine supports two levels of checkpointing: the user level and an
       operating-system-provided transparent level. User level checkpointing
       refers to applications that do their own checkpointing by writing
       restart files at certain times or algorithmic steps and by properly
       processing these restart files when restarted.
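       
       As an illustration of user level checkpointing, the following Python
       sketch writes a restart file every few steps and resumes from it when
       started again. It is a minimal sketch and not part of Grid Engine: the
       function names, the JSON restart format and the toy summation workload
       are all assumptions made for the example.

```python
import json
import os
import tempfile


def write_restart(restart_path, state):
    """Write the restart file atomically (hypothetical helper)."""
    # Write to a temporary file in the same directory, then rename it, so
    # that an abort in the middle of a write cannot leave a truncated file.
    directory = os.path.dirname(os.path.abspath(restart_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, restart_path)


def run_with_restart(restart_path, total_steps, checkpoint_every):
    """Sum 0..total_steps-1, resuming from restart_path if it exists."""
    state = {"step": 0, "total": 0}
    if os.path.exists(restart_path):
        # Process the restart file left behind by an earlier, aborted run.
        with open(restart_path) as handle:
            state = json.load(handle)

    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            write_restart(restart_path, state)

    write_restart(restart_path, state)
    return state
```

       A job restarted after an abort would call run_with_restart() with the
       same restart path and continue from the last recorded step instead of
       starting again from step 0.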
       
       Transparent checkpointing has to be provided by the operating system
       and is usually integrated in the operating system kernel. An example
       of a kernel-integrated checkpointing facility is the Hibernator
       package from Softway for SGI IRIX platforms.
       
       Checkpointing jobs need to be identified to the Grid Engine system by
       using the -ckpt option of the qsub(1) command. The argument to this
       flag refers to a checkpointing environment, which defines the
       attributes of the checkpointing method to be used (see checkpoint(5)
       for details). Checkpointing environments are set up by the qconf(1)
       options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1) option -c can be
       used to override the -when attribute of the referenced checkpointing
       environment.
       
       If a queue is of the type CHECKPOINTING, jobs need to have the
       checkpointing attribute flagged (see the -ckpt option to qsub(1)) to be
       permitted to run in such a queue. Checkpointing jobs are aborted under
       conditions in which batch or interactive jobs are suspended or even
       stay unaffected. These conditions are:

       ·  Explicit suspension of the queue or job via qmod(1) by the cluster
          administration or a queue owner if the x occasion specifier (see
          qsub(1) -c and checkpoint(5)) was assigned to the job.

       ·  A load average value exceeding the migration threshold as configured
          for the corresponding queues (see queue_conf(5)).

       ·  Shutdown of the Grid Engine execution daemon sge_execd(8) that is
          responsible for the checkpointing job.
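       
       The three abort conditions listed above can be summarised as a single
       predicate. The Python sketch below is only a reading aid: the JobState
       class and its field names are hypothetical, the migration threshold is
       reduced to a single load number, and the real decision is made inside
       Grid Engine, not by code like this.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    """Hypothetical snapshot of a checkpointing job and its queue."""
    occasions: str              # occasion specifiers assigned to the job, e.g. "sx"
    suspended_via_qmod: bool    # queue or job was explicitly suspended
    load_avg: float             # current load average on the host
    migration_threshold: float  # threshold configured for the queue
    execd_shutting_down: bool   # execution daemon is being shut down


def should_abort(job):
    """Return True if any of the three abort conditions holds."""
    # Explicit suspension only counts if the x occasion specifier is set.
    if job.suspended_via_qmod and "x" in job.occasions:
        return True
    # Load average exceeds the migration threshold of the queue.
    if job.load_avg > job.migration_threshold:
        return True
    # The responsible execution daemon is shutting down.
    if job.execd_shutting_down:
        return True
    return False
```

       For example, a job suspended via qmod(1) without the x occasion
       specifier is not aborted by the first condition alone.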
       
       After abortion, the jobs will migrate to other queues unless they were
       submitted to one specific queue by an explicit user request. The
       migration of jobs leads to dynamic load balancing. Note: The
       abortion of checkpointed jobs will free all resources (memory, swap
       space) which the job occupies at that time. This is in contrast to the
       situation for suspended regular jobs, which still occupy swap space.

RESTRICTIONS

       When a job migrates to a queue on another machine, at present no files
       are transferred automatically to that machine. This means that all
       files which are used throughout the entire job, including restart files,
       executables and scratch files, must be visible or transferred explicitly
       (e.g. at the beginning of the job script).
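       
       One way to act on this restriction is to verify, at the start of the
       job, that every file the job depends on is visible on the current
       machine. The helper below is a minimal Python sketch; the function
       name and the idea of passing an explicit list of paths are assumptions
       made for the example, not a Grid Engine facility.

```python
import os


def missing_files(required_paths):
    """Return the paths from required_paths that are not visible here."""
    return [path for path in required_paths if not os.path.exists(path)]
```

       A job script could call missing_files() with its restart files,
       executables and scratch files, and transfer or report any path that is
       returned before starting the real work.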
       
       There are also some practical limitations regarding the use of disk
       space for transparently checkpointing jobs. Checkpoints of a
       transparently checkpointed application are usually stored in a
       checkpoint file or directory by the operating system. The file or
       directory contains all the text, data, and stack space for the process,
       along with some additional control information. This means that jobs
       which use a very large virtual address space will generate very large
       checkpoint files. Also, the workstations on which the jobs will actually
       execute may have little free disk space. Thus it is not always possible
       to transfer a transparent checkpointing job to a machine, even though
       that machine is idle. Since large virtual memory jobs must wait for a
       machine that is both idle and has a sufficient amount of free disk
       space, such jobs may suffer long turnaround times.
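       
       The disk space limitation can be made concrete with a rough check of
       whether a machine could hold a checkpoint of a given size. This Python
       sketch assumes that the checkpoint is about as large as the virtual
       address space of the job plus a small allowance for control
       information; the function name and the overhead_fraction parameter are
       hypothetical.

```python
import shutil


def can_hold_checkpoint(checkpoint_dir, virtual_size_bytes, overhead_fraction=0.05):
    """Estimate whether checkpoint_dir has room for one checkpoint."""
    # Assumed size: the virtual address space plus a margin for the
    # additional control information stored with the checkpoint.
    needed = int(virtual_size_bytes * (1 + overhead_fraction))
    free = shutil.disk_usage(checkpoint_dir).free
    return free >= needed
```

       An idle machine for which this check fails would not be a usable
       migration target for the job, which is why large jobs may wait longer.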

SEE ALSO

       sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5), Grid Engine
       Installation and Administration Guide, Grid Engine User's Guide

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.

$Date: 2007/11/06 18:18:12 $           1                     Grid(Engine_CKPT)