Grid(Engine_CKPT) GE 6.2u5 Grid(Engine_CKPT)

Grid Engine Checkpointing - the Grid Engine checkpointing mechanism and
checkpointing support

Grid Engine supports two levels of checkpointing: the user level and a
transparent level provided by the operating system. User level
checkpointing refers to applications which do their own checkpointing by
writing restart files at certain times or algorithmic steps and by
properly processing these restart files when restarted.

Transparent checkpointing has to be provided by the operating system
and is usually integrated in the operating system kernel. An example
of a kernel integrated checkpointing facility is the Hibernator package
from Softway for SGI IRIX platforms.
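User level checkpointing, as described above, can be illustrated with a
minimal Python sketch. It is not part of Grid Engine. The restart file
format (JSON), the function names and the toy computation are all
assumptions chosen for illustration. The application writes a restart file
every few steps and resumes from it when started again.

```python
import json
import os
import tempfile


def save_restart(path, state):
    """Write state to path atomically, so an abort never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp, path)  # atomic rename within one file system


def load_restart(path):
    """Return the saved state, or a fresh state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, steps, every=10, stop_after=None):
    """Sum 0..steps-1, writing a restart file every `every` steps.

    stop_after simulates an abort after that many steps in this run.
    """
    state = load_restart(path)
    done_this_run = 0
    while state["step"] < steps:
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % every == 0:
            save_restart(path, state)
        if stop_after is not None and done_this_run >= stop_after:
            return None  # aborted; work since the last restart file is lost
    save_restart(path, state)
    return state["total"]
```

Writing to a temporary file and renaming it means that a job aborted in
the middle of a write still finds the previous, complete restart file.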
20
Checkpointing jobs need to be identified to the Grid Engine system by
using the -ckpt option of the qsub(1) command. The argument to this
flag refers to a so-called checkpointing environment, which defines the
attributes of the checkpointing method to be used (see checkpoint(5)
for details). Checkpointing environments are set up by the qconf(1)
options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1) option -c can be
used to override the when attribute of the referenced checkpointing
environment.
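As a sketch of the submission described above, the following Python helper
only assembles a qsub command line from the -ckpt and -c options. The
helper name, the environment name "my_ckpt" and the script name "job.sh"
are placeholders, not Grid Engine defaults. Nothing is executed here, since
running the command requires a Grid Engine installation.

```python
def build_qsub_command(script, ckpt_env, when=None):
    """Return the argument list for submitting a checkpointing job.

    ckpt_env names a checkpointing environment (see checkpoint(5)).
    when, if given, is passed to -c to override the when attribute
    of that environment.
    """
    command = ["qsub", "-ckpt", ckpt_env]
    if when is not None:
        command += ["-c", when]
    command.append(script)
    return command


if __name__ == "__main__":
    # "job.sh" and "my_ckpt" are hypothetical names used for illustration.
    print(" ".join(build_qsub_command("job.sh", "my_ckpt", when="x")))
```

The resulting list can be handed to subprocess.run() on a submit host.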
29
If a queue is of the type CHECKPOINTING, jobs need to have the
checkpointing attribute flagged (see the -ckpt option to qsub(1)) to be
permitted to run in such a queue. As opposed to the behavior for regular
batch jobs, checkpointing jobs are aborted under conditions for which
batch or interactive jobs are suspended or even stay unaffected. These
conditions are:

· Explicit suspension of the queue or job via qmod(1) by the cluster
  administration or a queue owner if the x occasion specifier (see
  qsub(1) -c and checkpoint(5)) was assigned to the job.

· A load average value exceeding the suspend threshold as configured
  for the corresponding queues (see queue_conf(5)).

· Shutdown of the Grid Engine execution daemon ge_execd(8) that is
  responsible for the checkpointing job.

After abortion, the jobs will migrate to other queues unless they were
submitted to one specific queue by an explicit user request. The
migration of jobs leads to a dynamic load balancing. Note: The abortion
of checkpointed jobs will free all resources (memory, swap space)
which the job occupies at that time. This is opposed to the situation
for suspended regular jobs, which still occupy swap space.
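The abort conditions and the migration rule above can be restated as a
small decision model. This is only an illustration of the text. It is not
how Grid Engine implements the logic, and all names in it are hypothetical.

```python
from enum import Enum


class Event(Enum):
    EXPLICIT_SUSPEND = "queue or job suspended via qmod"
    LOAD_OVER_SUSPEND_THRESHOLD = "load average above the suspend threshold"
    EXECD_SHUTDOWN = "execution daemon shut down"


def checkpointing_job_is_aborted(event, has_x_specifier):
    """Return True if a checkpointing job is aborted by the given event."""
    if event is Event.EXPLICIT_SUSPEND:
        # Explicit suspension aborts the job only if the x occasion
        # specifier was assigned to it.
        return has_x_specifier
    return event in (Event.LOAD_OVER_SUSPEND_THRESHOLD, Event.EXECD_SHUTDOWN)


def may_migrate(aborted, bound_to_one_queue):
    """An aborted job migrates unless the user requested one specific queue."""
    return aborted and not bound_to_one_queue
```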
53
When a job migrates to a queue on another machine, at present no files
are transferred automatically to that machine. This means that all
files which are used throughout the entire job, including restart files,
executables and scratch files, must be visible on that machine or
transferred explicitly (e.g. at the beginning of the job script).
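One way to perform the explicit transfer at the beginning of a job is
sketched below in Python. It assumes that a shared directory holding the
needed files is visible from every execution host. The function name,
directory layout and file names are hypothetical.

```python
import os
import shutil
import tempfile


def stage_files(required, shared_dir, work_dir):
    """Make every required file present in work_dir.

    Files already in work_dir are left alone; missing ones are copied
    from shared_dir. Returns the list of names that were copied.
    """
    os.makedirs(work_dir, exist_ok=True)
    copied = []
    for name in required:
        target = os.path.join(work_dir, name)
        if os.path.exists(target):
            continue
        source = os.path.join(shared_dir, name)
        if not os.path.exists(source):
            raise FileNotFoundError(
                f"{name} is neither in {work_dir} nor in {shared_dir}")
        shutil.copy2(source, target)
        copied.append(name)
    return copied


if __name__ == "__main__":
    # Demonstration with throwaway directories standing in for real paths.
    with tempfile.TemporaryDirectory() as root:
        shared = os.path.join(root, "shared")
        local = os.path.join(root, "local")
        os.makedirs(shared)
        with open(os.path.join(shared, "restart.dat"), "w") as handle:
            handle.write("state")
        print(stage_files(["restart.dat"], shared, local))
```

Calling stage_files() again after a migration copies only what is missing
on the new machine.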
60
There are also some practical limitations regarding use of disk space
for transparently checkpointing jobs. Checkpoints of a transparently
checkpointed application are usually stored in a checkpoint file or
directory by the operating system. The file or directory contains all
the text, data, and stack space for the process, along with some
additional control information. This means jobs which use a very large
virtual address space will generate very large checkpoint files. Also,
the workstations on which the jobs will actually execute may have little
free disk space. Thus it is not always possible to transfer a
transparent checkpointing job to a machine, even though that machine is
idle. Since large virtual memory jobs must wait for a machine that is
both idle and has a sufficient amount of free disk space, such jobs may
suffer long turnaround times.
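The disk space constraint can be checked ahead of time with a rough
estimate. The Python sketch below assumes that the checkpoint is about as
large as the virtual address space of the job plus a margin for control
information. The 10% margin and the function name are assumptions, not
figures from Grid Engine.

```python
import shutil


def has_room_for_checkpoint(checkpoint_dir, virtual_size_bytes, overhead=0.1):
    """Return True if checkpoint_dir has room for an estimated checkpoint.

    The estimate is the virtual address space size plus a fractional
    overhead for control information (an assumed value).
    """
    needed = int(virtual_size_bytes * (1 + overhead))
    free = shutil.disk_usage(checkpoint_dir).free
    return free >= needed
```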
74
ge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5), Grid Engine
Installation and Administration Guide, Grid Engine User's Guide

See ge_intro(1) for a full statement of rights and permissions.

$Date: 2009/06/16 13:58:24 $ 1 Grid(Engine_CKPT)