GE_SHEPHERD(8)        Grid Engine Administrative Commands       GE_SHEPHERD(8)

NAME
       ge_shepherd - Grid Engine single job controlling agent

SYNOPSIS
       ge_shepherd

DESCRIPTION
       ge_shepherd provides the parent process functionality for a single Grid
       Engine job. The parent functionality is necessary on UNIX systems to
       retrieve resource usage information (see getrusage(2)) after a job has
       finished. In addition, ge_shepherd forwards signals to the job, such as
       the signals for suspension, resumption, termination and the Grid Engine
       checkpointing signal (see ge_ckpt(1) for details).
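
       The parent-process role described above can be illustrated with a
       minimal Python sketch. It is not the ge_shepherd implementation: the
       function name run_and_account, the choice of forwarded signals and the
       128+signal convention for a killed child are assumptions made for
       illustration only.

```python
import os
import signal
import sys


def run_and_account(argv):
    """Run argv as a child process; return (exit_code, rusage).

    Hypothetical helper, not part of Grid Engine.
    """
    pid = os.fork()
    if pid == 0:
        # Child: become the job. If exec fails, leave without returning
        # into the parent's code path.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)

    # Parent: relay selected signals to the child instead of acting on them.
    def forward(signum, _frame):
        os.kill(pid, signum)

    # Assumed, illustrative signal set; the real set is not stated here.
    forwarded = (signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2)
    previous = {s: signal.signal(s, forward) for s in forwarded}
    try:
        # wait4 reaps the child and returns its resource usage in one call.
        _, status, rusage = os.wait4(pid, 0)
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)

    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
    else:
        # Shell-style convention for a child terminated by a signal.
        code = 128 + os.WTERMSIG(status)
    return code, rusage


if __name__ == "__main__":
    code, usage = run_and_account([sys.executable, "-c", "pass"])
    print("exit code:", code)
    print("user CPU seconds:", usage.ru_utime)
```

       Only a parent can collect a child's usage figures this way, which is
       why each job needs its own controlling process.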

       ge_shepherd receives information about the job to be started from
       ge_execd(8). While executing the job, it starts up to five child
       processes. First, a prolog script is run if this feature is enabled by
       the prolog parameter in the cluster configuration (see ge_conf(5)).
       Next, a parallel environment startup procedure is run if the job is a
       parallel job (see sge_pe(5) for more information). After that, the job
       itself is run, followed by a parallel environment shutdown procedure
       for parallel jobs, and finally an epilog script if requested by the
       epilog parameter in the cluster configuration. The prolog and epilog
       scripts, as well as the parallel environment startup and shutdown
       procedures, are provided by the Grid Engine administrator and are
       intended for site-specific actions to be taken before and after
       execution of the actual user job.
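
       The ordering of these child processes can be sketched as follows. The
       function planned_methods and its boolean parameters are hypothetical
       names chosen for this example; they only encode the sequence given in
       the paragraph above.

```python
def planned_methods(has_prolog, has_epilog, is_parallel):
    """Return the methods run for one job, in execution order.

    Hypothetical helper: the flags stand in for the prolog and epilog
    settings of the cluster configuration and for the job type.
    """
    steps = []
    if has_prolog:
        steps.append("prolog")
    if is_parallel:
        steps.append("parallel start")
    steps.append("job")  # the job itself is always run
    if is_parallel:
        steps.append("parallel stop")
    if has_epilog:
        steps.append("epilog")
    return steps


if __name__ == "__main__":
    for step in planned_methods(True, True, True):
        print(step)
```

       A serial job with neither prolog nor epilog configured reduces to a
       single child process, the job itself.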

       After the job has finished and the epilog script has been processed,
       ge_shepherd retrieves resource usage statistics about the job, places
       them in a job-specific subdirectory of the ge_execd(8) spool directory
       for reporting through ge_execd(8), and finishes.

       ge_shepherd also places an exit status file in the spool directory.
       This exit status can be viewed with qacct -j JobId (see qacct(1)); it
       is not the exit status of ge_shepherd itself but of one of the methods
       executed by ge_shepherd. This exit status can have several meanings,
       depending on the method in which an error occurred (if any). The
       possible methods are: prolog, parallel start, job, parallel stop,
       epilog, suspend, restart, terminate, clean, migrate, and checkpoint.

       The following exit values are returned:

       0      All methods: Operation was executed successfully.

       99     Job script, prolog and epilog: When FORBID_RESCHEDULE is not set
              in the configuration (see ge_conf(5)), the job gets re-queued.
              Otherwise see "Other".

       100    Job script, prolog and epilog: When FORBID_APPERROR is not set
              in the configuration (see ge_conf(5)), the job gets re-queued.
              Otherwise see "Other".

       Other  Job script: This is the exit status of the job itself. No action
              is taken upon this exit status because its meaning is not known.
              Prolog, epilog and parallel start: The queue is set to error
              state and the job is re-queued.
              Parallel stop: The queue is set to error state, but the job is
              not re-queued. It is assumed that the job itself ran
              successfully and only the cleanup script failed.
              Suspend, restart, terminate, clean, and migrate: Always
              successful.
              Checkpoint: Success, except for kernel checkpointing, where it
              means that the checkpoint was not successful and did not happen
              (but migration will still be performed by Grid Engine).
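
       The table above can be read as a lookup from method and exit value to
       the resulting action. The sketch below mirrors the table as written;
       the function action_for, its keyword parameters and the returned
       strings are hypothetical and are not part of Grid Engine.

```python
# Methods for which exit values 99 and 100 have a special meaning.
SPECIAL_CODES = {"job", "prolog", "epilog"}
# Methods whose exit value is always treated as success.
ALWAYS_OK = {"suspend", "restart", "terminate", "clean", "migrate"}


def action_for(method, status, forbid_reschedule=False, forbid_apperror=False):
    """Describe the action taken for a method's exit value.

    Hypothetical helper; the two flags stand in for FORBID_RESCHEDULE and
    FORBID_APPERROR in the configuration.
    """
    if status == 0:
        return "success"
    if method in SPECIAL_CODES:
        if status == 99 and not forbid_reschedule:
            return "job re-queued"
        if status == 100 and not forbid_apperror:
            return "job re-queued"
    # Everything else falls under the "Other" row of the table.
    if method == "job":
        return "no action"
    if method in {"prolog", "epilog", "parallel start"}:
        return "queue set to error state, job re-queued"
    if method == "parallel stop":
        return "queue set to error state, job not re-queued"
    if method in ALWAYS_OK:
        return "treated as success"
    if method == "checkpoint":
        return "treated as success unless kernel checkpointing"
    raise ValueError("unknown method: %r" % (method,))


if __name__ == "__main__":
    print(action_for("job", 99))
    print(action_for("parallel stop", 1))
```

       Note that 99 and 100 are special only for the job script, prolog and
       epilog; for every other method they are handled like any other
       non-zero value.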

RESTRICTIONS
       ge_shepherd should not be invoked manually, but only by ge_execd(8).

FILES
       sgepasswd contains a list of user names and their corresponding
       encrypted passwords. If available, the password file is used by
       ge_shepherd. To change the contents of this file, use the sgepasswd
       command; changing the file manually is not advised.
       <execd_spool>/job_dir/<job_id>    job-specific directory

SEE ALSO
       ge_intro(1), ge_conf(5), ge_execd(8).

COPYRIGHT
       See ge_intro(1) for a full statement of rights and permissions.



GE 6.2u5                $Date: 2007/07/19 09:04:33 $            GE_SHEPHERD(8)