Grid(Engine_CKPT)                  GE 6.2u5                  Grid(Engine_CKPT)

NAME

       Grid Engine Checkpointing - the Grid Engine checkpointing mechanism and
       checkpointing support

DESCRIPTION

       Grid Engine supports two levels of checkpointing: the user level and
       a transparent level provided by the operating system. User level
       checkpointing refers to applications that do their own checkpointing
       by writing restart files at certain times or algorithmic steps and by
       properly processing these restart files when restarted.

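       The restart-file pattern described above can be sketched as follows.
       This is a minimal illustration, not part of Grid Engine: the JSON
       restart file, the function names and the stop_after parameter (which
       only simulates an abort) are assumptions made for the example.

```python
import json
import os
import tempfile


def save_restart(path, state):
    """Write the restart file atomically so an abort never leaves it half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # rename over the old restart file


def load_restart(path):
    """Return the saved state, or a fresh state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..n_steps-1, writing a restart file every few steps.

    stop_after (hypothetical) simulates an abort: the function returns None
    after that many steps in this invocation without finishing.
    """
    state = load_restart(path)
    done_here = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_here >= stop_after:
            return None  # simulated abort; the restart file stays on disk
        state["total"] += state["step"]
        state["step"] += 1
        done_here += 1
        if state["step"] % checkpoint_every == 0:
            save_restart(path, state)
    save_restart(path, state)
    return state["total"]
```

       A job that is aborted and started again simply calls run() with the
       same restart file and continues from the last step that was saved.
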
       Transparent checkpointing has to be provided by the operating system
       and is usually integrated in the operating system kernel. An example
       of a kernel integrated checkpointing facility is the Hibernator
       package from Softway for SGI IRIX platforms.

       Checkpointing jobs need to be identified to the Grid Engine system by
       using the -ckpt option of the qsub(1) command. The argument to this
       flag refers to a so-called checkpointing environment, which defines
       the attributes of the checkpointing method to be used (see
       checkpoint(5) for details). Checkpointing environments are set up
       with the qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The
       qsub(1) option -c can be used to override the when attribute of the
       referenced checkpointing environment.

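       As an illustration of the submission options above, the following
       sketch assembles a qsub(1) command line. Only the -ckpt and -c
       options and the x occasion specifier come from this page; the
       environment name my_ckpt_env, the script name job.sh and the helper
       function are hypothetical.

```python
import shlex


def build_qsub_command(ckpt_env, script, when=None):
    """Return the argument list for submitting a checkpointing job.

    ckpt_env is the name of a checkpointing environment; when, if given,
    is passed to -c to override the environment's when attribute.
    """
    argv = ["qsub", "-ckpt", ckpt_env]
    if when is not None:
        argv += ["-c", when]
    argv.append(script)
    return argv


# Hypothetical environment and script names, for illustration only.
cmd = build_qsub_command("my_ckpt_env", "job.sh", when="x")
print(shlex.join(cmd))
```

       On a host with Grid Engine installed, the list could be passed to
       subprocess.run(cmd, check=True) to submit the job.
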
       If a queue is of the type CHECKPOINTING, jobs need to have the
       checkpointing attribute flagged (see the -ckpt option to qsub(1)) to
       be permitted to run in such a queue. In contrast to regular batch
       jobs, checkpointing jobs are aborted under conditions in which batch
       or interactive jobs are suspended or even remain unaffected. These
       conditions are:

       ·  Explicit suspension of the queue or job via qmod(1) by the cluster
          administration or a queue owner, if the x occasion specifier (see
          qsub(1) -c and checkpoint(5)) was assigned to the job.

       ·  A load average value exceeding the suspend threshold as configured
          for the corresponding queues (see queue_conf(5)).

       ·  Shutdown of the Grid Engine execution daemon ge_execd(8) that is
          responsible for the checkpointing job.

       After being aborted, the jobs migrate to other queues unless they
       were submitted to one specific queue by an explicit user request. The
       migration of jobs leads to dynamic load balancing. Note: Aborting a
       checkpointed job frees all resources (memory, swap space) that the
       job occupies at that time. This is in contrast to suspended regular
       jobs, which still occupy swap space.

RESTRICTIONS

       When a job migrates to a queue on another machine, at present no
       files are transferred automatically to that machine. This means that
       all files used throughout the entire job, including restart files,
       executables and scratch files, must be visible on that machine or
       transferred explicitly (e.g. at the beginning of the job script).

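       One way to handle the explicit transfer at the beginning of a job is
       sketched below. The function and its directory arguments are
       assumptions for illustration; it presumes a source directory (for
       example on a shared file system) from which missing files can be
       copied into the job's working directory.

```python
import os
import shutil


def stage_in(names, source_dir, work_dir):
    """Copy each named file into work_dir unless it is already visible there.

    Returns the list of names that were copied. Raises FileNotFoundError
    if a file is missing from both directories.
    """
    os.makedirs(work_dir, exist_ok=True)
    copied = []
    for name in names:
        target = os.path.join(work_dir, name)
        if os.path.exists(target):
            continue  # already visible on this machine
        source = os.path.join(source_dir, name)
        if not os.path.exists(source):
            raise FileNotFoundError(
                f"{name} not found in {source_dir} or {work_dir}"
            )
        shutil.copy2(source, target)
        copied.append(name)
    return copied
```

       Calling such a function first thing after every start, including a
       restart after migration, lets the job fail early with a clear error
       instead of partway through the computation.
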
       There are also some practical limitations regarding the use of disk
       space for transparently checkpointing jobs. Checkpoints of a
       transparently checkpointed application are usually stored in a
       checkpoint file or directory by the operating system. The file or
       directory contains all the text, data, and stack space for the
       process, along with some additional control information. This means
       that jobs which use a very large virtual address space will generate
       very large checkpoint files. In addition, the workstations on which
       the jobs will actually execute may have little free disk space. Thus
       it is not always possible to transfer a transparent checkpointing job
       to a machine, even though that machine is idle. Since large virtual
       memory jobs must wait for a machine that is both idle and has a
       sufficient amount of free disk space, such jobs may suffer long
       turnaround times.

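       The disk space constraint can be approximated with a simple check.
       This is a rough sketch under stated assumptions: the checkpoint is
       taken to be about as large as the virtual address space of the
       process, and the 10 percent allowance for control information is an
       arbitrary illustrative value, not a figure from Grid Engine.

```python
import shutil


def fits_on_disk(checkpoint_dir, virtual_size_bytes, overhead=0.1):
    """Estimate whether a checkpoint of the given size fits in checkpoint_dir.

    overhead is an assumed allowance for additional control information.
    """
    needed = int(virtual_size_bytes * (1 + overhead))
    free = shutil.disk_usage(checkpoint_dir).free
    return free >= needed
```

       For example, fits_on_disk("/tmp", 8 * 1024**3) estimates whether a
       job with an 8 GiB address space could be checkpointed under /tmp.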

SEE ALSO

       ge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5), Grid Engine
       Installation and Administration Guide, Grid Engine User's Guide

COPYRIGHT

       See ge_intro(1) for a full statement of rights and permissions.

$Date: 2009/06/16 13:58:24 $           1                     Grid(Engine_CKPT)