1pbs_mom(8B) PBS pbs_mom(8B)
2
3
4
6 pbs_mom - start a pbs batch execution mini-server
7
9 pbs_mom [-a alarm] [-C chkdirectory] [-c config] [-D] [-d directory]
10 [-F] [-h help] [-H hostname] [-L logfile] [-M MOMport] [-R RPPport]
11 [-p|-P|-q|-r] [-w] [-x]
12 DESCRIPTION
13 The pbs_mom command starts the operation of a batch Ma‐
14 chine Oriented Mini-server, MOM, on the local host. Typically, this
15 command will be in a local boot file such as /etc/rc.local. To ensure
16 that the pbs_mom command is not runnable by the general user community,
17 the server will only execute if its real and effective uid is zero.
18
19 One function of pbs_mom is to place jobs into execution as directed by
20 the server, establish resource usage limits, monitor the job's usage,
21 and notify the server when the job completes. If they exist, pbs_mom
22 will execute a prologue script before executing a job and an epilogue
23 script after executing the job. The next function of pbs_mom is to re‐
24 spond to resource monitor requests. This was done by a separate
25 process in previous versions of PBS but has now been combined into one
26 process. The resource monitor function is provided mainly for the PBS
27 scheduler. It provides information about the status of running jobs,
28 available memory, etc. The next function of pbs_mom is to respond to
29 task manager requests. This involves communicating with running tasks
30 over a TCP socket as well as communicating with other MOMs within a job
31 (aka a "sisterhood").
32
33 pbs_mom will record a diagnostic message in a log file for any error
34 occurrence. The log files are maintained in the mom_logs directory be‐
35 low the home directory of the server. If the log file cannot be
36 opened, the diagnostic message is written to the system console.
37
39 -A alias Used with -m (multi-mom option) to give the alias name
40 of this instance of pbs_mom
41
42 -a alarm Specifies the alarm timeout in seconds for computing a
43 resource. Every time a resource request is processed,
44 an alarm is set for the given amount of time. If the
45 request has not completed before the given time, an
46 alarm signal is generated. The default is 5 seconds.
47
48 -C chkdirectory Specifies the path of the directory used to hold check‐
49 point files. [Currently this is only valid on Cray
50 systems.] The default directory is
51 PBS_HOME/spool/checkpoint, see the -d option. The di‐
52 rectory specified with the -C option must be owned by
53 root and accessible (rwx) only by root to protect the
54 security of the checkpoint files.
55
56 -c config Specifies an alternative configuration file, see de‐
57 scription below. If this is a relative file name it
58 will be relative to PBS_HOME/mom_priv, see the -d op‐
59 tion. If the specified file cannot be opened, pbs_mom
60 will abort. If the -c option is not supplied, pbs_mom
61 will attempt to open the default
62 configuration file "config" in PBS_HOME/mom_priv. If
63 this file is not present, pbs_mom will log the fact and
64 continue.
65
66 -h help Displays the help/usage message.
67
68 -H hostname Sets the MOM's hostname. This can be useful on multi-
69 homed networks.
70
71 -D Debug mode. Do not fork.
72
73 -d directory Specifies the path of the directory which is the home
74 of the server's working files, PBS_HOME. This option is
75 typically used along with -M when debugging MOM. The
76 default directory is given by $PBS_SERVER_HOME which is
77 typically /usr/spool/PBS.
78
79 -F Do not fork. Use when running under systemd.
80
81 -L logfile Specifies an absolute path name for use as the log
82 file. If not specified, MOM will open a file named for
83 the current date in the PBS_HOME/mom_logs directory,
84 see the -d option.
85
86 -m Directs the MOM to start in multi-mom mode. In addition
87 to using -m the -M, -R and -A options need to be used
88 to properly start a MOM in multi-mom mode. For example,
89 "pbs_mom -m -M 30002 -R 30003 -A alias-host" will start
90 pbs_mom with the service port on port 30002, the man‐
91 ager port at 30003 and with the name alias-host.
92
93 -M port Specifies the port number on which the mini-server
94 (MOM) will listen for batch requests.
95
96 -R port Specifies the port number on which the mini-server
97 (MOM) will listen for resource monitor requests, task
98 manager requests and inter-MOM messages.
99
100 -p (Default after version 2.4.0) (Preserve running jobs)
101 -- Specifies the impact on jobs which were in execution
102 when the mini-server shut down. The -p option tries
103 to preserve any running jobs when the MOM restarts.
104 The new mini-server will not be the parent of any run‐
105 ning jobs; MOM has lost control of her offspring (not
106 a new situation for a mother). The MOM will allow the
107 jobs to continue to run and monitor them indirectly via
108 polling. All recovered jobs will report an exit code of
109 0 when they are complete. The -p option is mutually ex‐
110 clusive with the -r, -P and -q options.
111
112 -P (Terminate all jobs and remove them from the queue) --
113 Specifies the impact on jobs which were in execution
114 when the mini-server shut down. With the -P option, it
115 is assumed that either the entire system has been
116 restarted or the MOM has been down so long that it can
117 no longer guarantee that the pid of any running process
118 is the same as the recorded job process pid of a recov‐
119 ering job. Unlike the -p option, no attempt is made to
120 preserve or recover running jobs. All jobs are
121 terminated and removed from the queue. The -P option
122 is mutually exclusive with the -p, -q and -r options.
123
124 -q (Requeue all jobs - This is the default behavior in
125 versions prior to 2.4.0) -- Specifies the impact on
126 jobs which were in execution when the mini-server shut
127 down. Do not terminate running processes. With the -q
128 option, it is assumed that either the entire system has
129 been restarted or the MOM has been down so long that it
130 can no longer guarantee that the pid of any running
131 process is the same as the recorded job process pid of
132 a recovering job. No attempt is made to kill job pro‐
133 cesses. The MOM will mark the jobs as terminated and
134 notify the batch server which owns the job. Re-runnable
135 jobs will be requeued. The -q option is mutually ex‐
136 clusive with the -p, -P and -r options.
137
138 -r (Terminate running processes and requeue all jobs) --
139 Specifies the impact on jobs which were in execution
140 when the mini-server shut down. With the -r option, MOM
141 will kill any processes belonging to running jobs, mark
142 the jobs as terminated and notify the batch server that
143 owns the job. Re-runnable jobs are reset to a queued
144 state so they can be run again. The -r option is mutu‐
145 ally exclusive with the -p, -P and -q options.
146
147 If the -r option is used following a reboot, process
148 IDs (pids) may be reused and MOM may kill a process
149 that is not a batch session.
150
151 -S port Specifies the port number on which the pbs_server is
152 listening for requests. If pbs_server is started with
153 a -p option, pbs_mom will need to use the -S option and
154 match the port value which was used to start
155 pbs_server.
156
157 -w When started with -w, pbs_moms wait until they get
158 their MOM hierarchy file from pbs_server to send their
159 first update, or until 10 minutes pass. This reduces
160 network traffic on startup and can bring up clusters
161 faster.
162
163 -x Disables the check for privileged port resource monitor
164 connections. This is used mainly for testing since the
165 privileged port is the only mechanism used to prevent
166 any ordinary user from connecting.
167
169 The configuration file may be specified on the command line at program
170 start with the -c flag. The use of this file is to provide several
171 types of run time information to pbs_mom: static resource names and
172 values, external resources provided by a program to be run on request
173 via a shell escape, and values to pass to internal set up functions at
174 initialization (and re-initialization).
175
176 Each item type is on a single line with the component parts separated
177 by white space. If the line starts with a hash mark (pound sign, #),
178 the line is considered to be a comment and is skipped.
179
180 Static Resources
181 For static resource names and values, the configuration file
182 contains a list of resource name/value pairs, one pair per
183 line and separated by white space. An example of static re‐
184 source names and values could be the number of tape drives of
185 different types, which could be specified by
186
187 tape3480 4
188 tape3420 2
189 tapedat 1
190 tape8mm 1
191
192 Shell Commands
193 If the first character of the value is an exclamation mark (!),
194 the entire rest of the line is saved to be executed through the
195 services of the system(3) standard library routine.
196
197 The shell escape provides a means for the resource monitor to
198 yield arbitrary information to the scheduler. Parameter substi‐
199 tution is done such that the value of any qualifier sent with
200 the query, as explained below, replaces a token with a percent
201 sign (%) followed by the name of the qualifier. For example,
202 here is a configuration file line which gives a resource name of
203 "escape":
204
205 escape !echo %xxx %yyy
206
207 If a query for "escape" is sent with no qualifiers, the command
208 executed would be "echo %xxx %yyy". If one qualifier is sent,
209 "escape[xxx=hi there]", the command executed would be "echo hi
210 there %yyy". If two qualifiers are sent, "es‐
211 cape[xxx=hi][yyy=there]", the command executed would be "echo hi
212 there". If a qualifier is sent with no matching token in the
213 command line, "escape[zzz=snafu]", an error is reported.
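
The substitution rules described above can be sketched in Python. This is an illustration only, not pbs_mom's actual code: the function name expand_shell_escape and the use of a dict to hold the qualifiers are assumptions made for the example.

```python
import re

def expand_shell_escape(command, qualifiers):
    """Replace each %name token in command with the matching qualifier value.

    Hypothetical helper: tokens with no qualifier are left untouched, and a
    qualifier with no matching token is reported as an error.
    """
    tokens = set(re.findall(r"%(\w+)", command))
    unknown = [name for name in qualifiers if name not in tokens]
    if unknown:
        raise ValueError("no matching token for qualifier: " + ", ".join(unknown))
    return re.sub(r"%(\w+)",
                  lambda m: qualifiers.get(m.group(1), m.group(0)),
                  command)

print(expand_shell_escape("echo %xxx %yyy", {}))
print(expand_shell_escape("echo %xxx %yyy", {"xxx": "hi there"}))
print(expand_shell_escape("echo %xxx %yyy", {"xxx": "hi", "yyy": "there"}))
```

The three calls mirror the three queries in the text: no qualifiers, one qualifier, and two qualifiers.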
214
215 size[fs=<FS>]
216 Specifies that the available and configured disk space in the
217 <FS> filesystem is to be reported to the pbs_server and sched‐
218 uler. NOTE: To request disk space on a per-job basis, specify
219 the file resource, as in 'qsub -l nodes=1,file=1000kb'. For exam‐
220 ple, the available and configured disk space in the /lo‐
221 calscratch filesystem will be reported:
222
223 size[fs=/localscratch]
224
225 Initialization Value
226 An initialization value directive has a name which starts with a
227 dollar sign ($) and must be known to MOM via an internal table.
228 The entries in this table now are:
229
230 auto_ideal_load
231 if jobs are running, sets ideal_load based on a simple ex‐
232 pression. An expression starts with the variable 't'
233 (total assigned CPUs) or 'c' (existing CPUs), followed by an
234 operator (+ - / *) and a float constant.
235
236 $auto_ideal_load t-0.2
237
238 auto_max_load
239 if jobs are running, sets max_load based on a simple ex‐
240 pression. An expression starts with the variable 't'
241 (total assigned CPUs) or 'c' (existing CPUs), followed by an
242 operator (+ - / *) and a float constant.
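
As a rough illustration of how an expression such as "t-0.2" could be evaluated, the following Python sketch parses the variable, operator, and constant. The function name eval_load_expr and its argument names are hypothetical; pbs_mom's own parser may accept or reject inputs differently.

```python
import operator
import re

# Map each supported operator character to its arithmetic function.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_load_expr(expr, total_assigned_cpus, existing_cpus):
    """Evaluate '<t|c><op><float>' as used by $auto_ideal_load / $auto_max_load."""
    m = re.fullmatch(r"\s*([tc])\s*([-+*/])\s*(\d+(?:\.\d*)?|\.\d+)\s*", expr)
    if m is None:
        raise ValueError("unsupported expression: %r" % expr)
    var, op, const = m.groups()
    base = total_assigned_cpus if var == "t" else existing_cpus
    return OPS[op](base, float(const))

# With 4 assigned CPUs on an 8-CPU node, "t-0.2" gives roughly 3.8.
print(eval_load_expr("t-0.2", total_assigned_cpus=4, existing_cpus=8))
```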
243
244 cputmult
245 which sets a factor used to adjust cpu time used by a
246 job. This is provided to allow adjustment of time
247 charged and limits enforced where the job might run on
248 systems with different cpu performance. If Mom's system
249 is faster than the reference system, set cputmult to a
250 decimal value greater than 1.0. If Mom's system is
251 slower, set cputmult to a value between 1.0 and 0.0. For
252 example:
253
254 $cputmult 1.5
255 $cputmult 0.75
256
257 configversion
258 specifies the version of the config file data, a string.
259
260 check_poll_time
261 specifies the MOM interval in seconds. MOM checks each
262 job for updated resource usages, exited processes, over-
263 limit conditions, etc. once per interval. This value
264 should be equal to or lower than pbs_server's job_stat_rate.
265 High values result in stale information reported to
266 pbs_server. Low values result in increased system usage
267 by MOM. Default is 45 seconds.
268
269 down_on_error
270 causes MOM to report itself as state "down" to pbs_server
271 in the event of a failed health check. This feature is
272 EXPERIMENTAL and likely to be removed in the future. See
273 HEALTH CHECK below.
274
275 enablemomrestart
276 enable automatic restarts of MOM. If enabled, MOM will
277 check if its binary has been updated and restart itself
278 at a safe point when no jobs are running; thus making up‐
279 grades easier. The check is made by comparing the mtime
280 of the pbs_mom executable. Command-line args, the
281 process name, and the PATH env variable are preserved
282 across restarts. It is recommended that this not be en‐
283 abled in the config file, but enabled when desired with
284 momctl (see RESOURCES for more information.)
285
286 ideal_load
287 ideal processor load. Represents a low water mark for
288 the load average. A node that is currently busy will
289 consider itself free after falling below ideal_load.
290
291 igncput
292 Ignore cpu time violations on this mom, meaning jobs will
293 not be cancelled due to exceeding their limits for cpu
294 time.
295
296 ignmem Ignore memory violations on this mom, meaning jobs will
297 not be cancelled due to exceeding their memory limits.
298
299 ignvmem
300 If set to true, then pbs_mom will ignore vmem/pvmem limit
301 enforcement.
302
303 ignwalltime
304 If set to true, then pbs_mom will ignore walltime limit
305 enforcement.
306
307 job_output_file_mask
308 Specifies a mask for creating job output and error files.
309 Values can be specified in base 8, 10, or 16; leading 0
310 implies octal and leading 0x or 0X hexadecimal. A value
311 of "userdefault" will use the user's default umask.
312 $job_output_file_mask 027
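
The base rules for the mask value (leading 0 for octal, leading 0x or 0X for hexadecimal, otherwise decimal) can be sketched as follows. The function name parse_file_mask is hypothetical, and returning None for "userdefault" is simply a convention chosen for this example.

```python
def parse_file_mask(value):
    """Parse a $job_output_file_mask value into an integer mask.

    Returns None for "userdefault", meaning the user's own umask applies.
    """
    if value == "userdefault":
        return None
    text = value.lower()
    if text.startswith("0x"):
        return int(text[2:], 16)   # hexadecimal
    if text.startswith("0") and len(text) > 1:
        return int(text[1:], 8)    # octal
    return int(text, 10)           # decimal

# "027", "0x17" and "23" all describe the same mask.
print(parse_file_mask("027"), parse_file_mask("0x17"), parse_file_mask("23"))
```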
313
314 log_directory
315 Changes the log directory. Default is $TORQUE‐
316 HOME/mom_logs/. $TORQUEHOME default is /var/spool/torque/
317 but can be changed in the ./configure script. The value
318 is a string and should be the full path to the desired
319 mom log directory. $log_directory /opt/torque/mom_logs/
320
321 logevent
322 which sets the mask that determines which event types are
323 logged by pbs_mom. For example:
324
325 $logevent 0x1ff
326 $logevent 255
327
328 The first example would set the log event mask to 0x1ff
329 (511) which enables logging of all events including debug
330 events. The second example would set the mask to 0x0ff
331 (255) which enables all events except debug events.
332
333 log_file_suffix
334 Optional suffix to append to log file names. If %h is the
335 suffix, pbs_mom appends the hostname for where the log
336 files are stored if it knows it; otherwise it appends
337 the hostname where the mom is running. For example,
338 "$log_file_suffix tom" yields log files such as 20100223.tom
339
340 log_keep_days
341 Specifies how many days to keep log files. pbs_mom
342 deletes log files older than the specified number of
343 days. If not specified, pbs_mom won't delete log files
344 based on their age.
345
346 loglevel
347 specifies the verbosity of logging with higher numbers
348 specifying more verbose logging. Values may range be‐
349 tween 0 and 7.
350
351 log_file_max_size
352 If this is set to a value > 0 then pbs_mom will roll
353 the current log file to log-file-name.1 when its size is
354 greater than or equal to the value of
355 log_file_max_size. This value is interpreted as kilo‐
356 bytes.
357
358 log_file_roll_depth
359 If this is set to a value >=1 and log_file_max_size is
360 set then pbs_mom will continue rolling the log files to
361 log-file-name.log_file_roll_depth.
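
The interaction of log_file_max_size and log_file_roll_depth can be pictured with the Python sketch below. It is an assumption-laden model, not pbs_mom's code: the helper names are made up, a kilobyte is taken to be 1024 bytes, and older files are assumed to shift up by one suffix on each roll.

```python
import os
import tempfile

def roll_logs(path, depth):
    """Shift path.(depth-1) to path.depth, ..., path.1 to path.2, then path to path.1."""
    for i in range(depth - 1, 0, -1):
        src = "%s.%d" % (path, i)
        if os.path.exists(src):
            os.replace(src, "%s.%d" % (path, i + 1))
    if os.path.exists(path):
        os.replace(path, path + ".1")

def maybe_roll(path, max_size_kb, depth):
    """Roll the log when it has reached max_size_kb; return True if rolled."""
    if max_size_kb > 0 and os.path.getsize(path) >= max_size_kb * 1024:
        roll_logs(path, depth)
        return True
    return False

with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "20100223")
    with open(log, "wb") as fh:
        fh.write(b"x" * 2048)          # 2 KB, above the 1 KB limit
    maybe_roll(log, max_size_kb=1, depth=2)
    print(sorted(os.listdir(tmp)))
```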
362
363 max_load
364 maximum processor load. Nodes over this load average are
365 considered busy (see ideal_load above).
366
367 memory_pressure_threshold
368 The option is only available, if pbs_mom is enabled to
369 use cpusets. If set to a value > 0, a job gets killed if
370 its memory pressure exceeds this value, and if $mem‐
371 ory_pressure_duration is set. The default is 0 (memory
372 pressure recording is off).
373 See cpuset(7) for more information about memory pressure.
374
375 memory_pressure_duration
376 The option is only available, if pbs_mom is enabled to
377 use cpusets. Specifies the number of subsequent MOM in‐
378 tervals a job's memory pressure must be above $mem‐
379 ory_pressure_threshold to get killed. The default is 0
380 (jobs are never killed due to memory pressure).
381 See cpuset(7) for more information about memory pressure.
382
383 node_check_script
384 specifies the fully qualified pathname of the health
385 check script to run (see HEALTH CHECK for more informa‐
386 tion).
387
388 node_check_interval
389 specifies when to run the MOM health check. The check
390 can be either periodic, event-driven, or both. The value
391 starts with an integer specifying the number of MOM in‐
392 tervals between subsequent executions of the specified
393 health check. After the integer is an optional comma-
394 separated list of event names. Currently supported are
395 "jobstart" and "jobend". This value defaults to 1 with
396 no events indicating the check is run every MOM interval.
397 (see HEALTH CHECK for more information)
398
399 $node_check_interval 0 #Disabled.
400 $node_check_interval 0,jobstart #Only runs at job starts
401 $node_check_interval 10,jobstart,jobend
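
The value format (an integer followed by an optional comma-separated list of event names) can be parsed as in the Python sketch below. The function name and the returned tuple are illustrative assumptions; trailing comments such as "#Disabled." are assumed to have been stripped already.

```python
SUPPORTED_EVENTS = {"jobstart", "jobend"}

def parse_node_check_interval(value):
    """Split '<interval>[,event...]' into (interval, [events])."""
    parts = [p.strip() for p in value.split(",")]
    interval = int(parts[0])
    events = parts[1:]
    for event in events:
        if event not in SUPPORTED_EVENTS:
            raise ValueError("unsupported event: %r" % event)
    return interval, events

print(parse_node_check_interval("10,jobstart,jobend"))
print(parse_node_check_interval("0,jobstart"))
```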
402
403 nodefile_suffix
404 Specifies the suffix to append to a host name to denote
405 the data channel network adapter in a multihomed compute
406 node. For example, with "$nodefile_suffix i" and the
407 control channel adapter named node01, the data
408 channel would have a hostname of node01i.
409
410 nospool_dir_list
411 If the job's output file should be in one of the paths
412 specified here, then it will be spooled directly in that
413 directory instead of the normal spool directory.
414 Specified in the format path1, path2, etc.
415 $nospool_dir_list /home/mike/*,/var/tmp/spool/
416
417 pbsclient
418 which causes a host name to be added to the list of hosts
419 which will be allowed to connect to MOM as long as they
420 are using a privileged port for the purposes of resource
421 monitor requests. For example, here are two configura‐
422 tion file lines which will allow the hosts "fred" and
423 "wilma" to connect:
424
425 $pbsclient fred
426 $pbsclient wilma
427
428 Two host names are always allowed to connect to
429 pbs_mom, "localhost" and the name returned to pbs_mom by
430 the system call gethostname(). These names need not be
431 specified in the configuration file. The hosts listed as
432 "clients" can issue Resource Monitor (RM) requests.
433 Other MOM nodes and servers do not need to be listed as
434 clients.
435
436 pbsserver
437 which defines hostnames running pbs_server that will be
438 allowed to submit jobs, issue Resource Monitor (RM) re‐
439 quests, and get status updates. MOM will continually at‐
440 tempt to contact all server hosts for node status and
441 state updates. Like $PBS_SERVER_HOME/server_name, the
442 hostname may be followed by a colon and a port number.
443 This parameter replaces the oft-confused $clienthost pa‐
444 rameter from TORQUE 2.0.0p0 and earlier. Note that the
445 hostname in $PBS_SERVER_HOME/server_name is used if no
446 $pbsserver parameters are found.
447
448 prologalarm
449 Specifies the maximum duration (in seconds) which the MOM
450 will wait for the job prolog or job epilog to com‐
451 plete. This parameter defaults to 300 seconds (5 minutes).
452
453 rcpcmd Specifies the full path and arguments to be used for re‐
454 mote file copies. This overrides the compile-time de‐
455 fault found in configure. This must contain 2 words: the
456 full path to the command and the switches. The copy com‐
457 mand must be able to recursively copy files to the remote
458 host and accept arguments of the form "user@host:files".
459 For example:
460
461 $rcpcmd /usr/bin/rcp -rp
462 $rcpcmd /usr/bin/scp -rpB
463
464 restricted
465 which causes a host name to be added to the list of hosts
466 which will be allowed to connect to MOM without needing
467 to use a privileged port. These names allow for wildcard
468 matching. For example, here is a configuration file line
469 which will allow queries from any host from the domain
470 "ibm.com".
471
472 $restricted *.ibm.com
473
474 The restriction which applies to these connections is
475 that only internal queries may be made. No resources
476 from a config file will be found. This is to prevent any
477 shell commands from being run by a non-root process.
478 This parameter is generally not required except for some
479 versions of OSX.
480
481 remote_checkpoint_dirs
482 Specifies what server checkpoint directories are remotely
483 mounted. This directive is used to tell the MOM which
484 directories are shared with the server. Using remote
485 checkpoint directories eliminates the need to copy the
486 checkpoint files back and forth between the MOM and the
487 server. This parameter is available in 2.4.1 and later.
488
489 $remote_checkpoint_dirs /var/spool/torque/checkpoint
490
491 remote_reconfig
492 Enables the ability to remotely reconfigure pbs_mom with
493 a new config file. Default is disabled. This parameter
494 accepts various forms of true, yes, and 1.
495
496 source_login_batch
497 Specifies whether or not mom will source the /etc/pro‐
498 file, etc. type files for batch jobs. Parameter accepts
499 various forms of true, false, yes, no, 1 and 0. Default
500 is True.
501
502 source_login_interactive
503 Specifies whether or not mom will source the /etc/pro‐
504 file, etc. type files for interactive jobs. Parameter ac‐
505 cepts various forms of true, false, yes, no, 1 and 0. De‐
506 fault is True.
507
508 spool_as_final_name
509 If set to true, jobs will spool directly as their output
510 files, with no intermediate locations or steps. This is
511 mostly useful for shared filesystems with fast writing
512 capability.
513
514 status_update_time
515 Specifies (in seconds) how often MOM updates its status
516 information to pbs_server. This value should correlate
517 with the server's scheduling interval. Low values in‐
518 crease the load on pbs_server and the network. High val‐
519 ues cause pbs_server to report stale information. De‐
520 fault is 45 seconds.
521
522 tmpdir Sets the directory basename for a per-job temporary di‐
523 rectory. Before job launch, MOM will append the jobid to
524 the tmpdir basename and create the directory. After the
525 job exit, MOM will recursively delete it. The env vari‐
526 able TMPDIR will be set for all pro/epilog scripts, the
527 job script, and TM tasks.
528 Directory creation and removal is done as the job owner
529 and group, so the owner must have write permission to
530 create the directory. If the directory already exists
531 and is owned by the job owner, it will not be deleted af‐
532 ter the job. If the directory already exists and is NOT
533 owned by the job owner, the job start will be rejected.
534
535 timeout
536 Specifies the number of seconds before TCP messages will
537 time out. TCP messages include job obituaries, and TM
538 requests if RPP is disabled. Default is 60 seconds.
539
540 usecp specifies which directories should be staged with cp in‐
541 stead of rcp/scp. If a shared filesystem is available on
542 all hosts in a cluster, this directive is used to make
543 these filesystems known to MOM. For example, if /home is
544 NFS mounted on all nodes in a cluster:
545
546 $usecp *:/home /home
547
548 varattr
549 This is similar to a shell escape above, but includes a
550 TTL. The command will only be run every TTL seconds. A
551 TTL of -1 will cause the command to be executed only
552 once. A TTL of 0 will cause the command to be run every
553 time varattr is requested. This parameter may be used
554 multiple times, but all output will be grouped into a
555 single "varattr" attribute in the request and status out‐
556 put. The command should output data in the form of
557 varattrname=value1[+value2]...
558
559 $varattr 3600 /path/to/script [<ARGS>]...
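
The TTL rules (-1 runs the command once, 0 runs it on every request, N re-runs it at most every N seconds) can be modelled with the small Python class below. It is a sketch under assumptions: the class name VarAttr is invented, and the run callable stands in for executing the configured command.

```python
import time

class VarAttr:
    """Cache the output of a command according to a varattr-style TTL."""

    def __init__(self, ttl, run, clock=time.monotonic):
        self.ttl = ttl          # seconds; -1 means run once, 0 means always run
        self.run = run          # callable standing in for the command
        self.clock = clock      # injectable clock, useful for testing
        self.value = None
        self.last_run = None

    def get(self):
        now = self.clock()
        if self.last_run is None:
            due = True                      # never run yet
        elif self.ttl < 0:
            due = False                     # run only once
        elif self.ttl == 0:
            due = True                      # run on every request
        else:
            due = now - self.last_run >= self.ttl
        if due:
            self.value = self.run()
            self.last_run = now
        return self.value

calls = []
attr = VarAttr(ttl=-1, run=lambda: calls.append(1) or "dataset13=20070104")
attr.get()
attr.get()
print(len(calls))  # 1: with a TTL of -1 the command ran only once
```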
560
561 use_smt
562 This option is only available if pbs_mom is enabled to
563 use cpusets. It only has an effect if there is more than
564 one logical processor per physical core in the system
565 (simultaneous multithreading or hyperthreading is enabled
566 via BIOS settings). If set to true, all logical proces‐
567 sors of allocated cores are added to the cpuset of a job.
568 If set to false, only the first logical processor per al‐
569 located core is contained in the cpuset of a job. The
570 default is true.
571
572 wallmult
573 which sets a factor used to adjust wall time usage by a
574 job to a common reference system. The factor is used for
575 walltime calculations and limits in the same way that cputmult is
576 used for cpu time.
577
578 The configuration file must be executable and "secure". It must be
579 owned by a user id and group id less than 10 and not be world writable.
580 Output from this file must be in the format $VAR=$VAL, i.e.,
581
582 dataset13=20070104
583 dataset22=20070202
584 viraltest=abdd3
585
586 xauthpath
587 Specifies the path to the xauth binary to enable X11 forwarding.
588
589 mom_host
590 Sets the local hostname as used by pbs_mom.
591
593 There is also an optional layout file for creating multiple moms on one
594 box in a specified layout. In the file, each mom on the single box is
595 given its own hostname, cpu indexes, memory nodes (a linux construct),
596 and memory size. This is useful for NUMA systems. Each line in the file
597 specifies one mom. The file follows the following format:
598
599 <hostname> cpus=<X> mem=<Y> memsize=<Z>
600 cpus and mem can be comma separated lists, while memsize should
601 be a memory size in the format:
602
603 <number><units>
604 For example, a file could contain the following line:
605
606 foohost-1 cpus=1,2 mem=1,2,3,4 memsize=8GB
607 This would specify that foohost-1 has cpus 1 and 2, memory nodes
608 1-4, and a total of 8 GB of memory.
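
To make the layout line format concrete, the following Python sketch parses one line into its parts. The function name parse_layout_line and the dictionary it returns are assumptions for illustration; the real parser may be stricter or more lenient.

```python
import re

def parse_layout_line(line):
    """Parse '<hostname> cpus=<X> mem=<Y> memsize=<Z>' into a dict."""
    fields = line.split()
    entry = {"hostname": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key in ("cpus", "mem"):
            # Comma-separated lists of cpu indexes or memory nodes.
            entry[key] = [int(v) for v in value.split(",")]
        elif key == "memsize":
            m = re.fullmatch(r"(\d+)([A-Za-z]+)", value)
            if m is None:
                raise ValueError("bad memsize: %r" % value)
            entry[key] = (int(m.group(1)), m.group(2))
        else:
            raise ValueError("unknown field: %r" % field)
    return entry

print(parse_layout_line("foohost-1 cpus=1,2 mem=1,2,3,4 memsize=8GB"))
```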
609
611 Resource Monitor queries can be made with momctl's -q option to re‐
612 trieve and set pbs_mom options. Any configured static resource may be
613 retrieved with a request of the same name. These are resource requests
614 not otherwise documented in the PBS ERS.
615
616 cycle forces an immediate MOM cycle
617
618 status_update_time
619 retrieve or set the $status_update_time parameter
620
621 check_poll_time
622 retrieve or set the $check_poll_time parameter
623
624 configversion
625 retrieve the config version
626
627 jobstartblocktime
628 retrieve or set the $jobstartblocktime parameter
629
630 enablemomrestart
631 retrieve or set the $enablemomrestart parameter
632
633 loglevel
634 retrieve or set the $loglevel parameter
635
636 down_on_error
637 retrieve or set the EXPERIMENTAL $down_on_error parameter
638
639 diag0 - diag4
640 retrieves various diagnostic information
641
642 rcpcmd retrieve or set the $rcpcmd parameter
643
644 version
645 retrieves the pbs_mom version
646
648 The health check script is executed directly by the pbs_mom daemon un‐
649 der the root user id. It must be accessible from the compute node and
650 may be a script or compiled executable program. It may make any needed
651 system calls and execute any combination of system utilities but should
652 not execute resource manager client commands. Also, as of TORQUE
653 1.0.1, the pbs_mom daemon blocks until the health check is completed
654 and does not possess a built-in timeout. Consequently, it is advisable
655 to keep the health check script's execution time short and verify that the
656 script will not block even under failure conditions.
657
658 If the script detects a failure, it should return the keyword 'ERROR'
659 to stdout followed by an error message. The message (up to 256 charac‐
660 ters) immediately following the ERROR string will be assigned to the
661 node attribute 'message' of the associated node.
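
A health check along these lines might look like the Python sketch below, which reports low disk space. The threshold, the checked path, and the helper name check_message are all hypothetical choices for the example; the only behavior taken from the text is printing ERROR followed by a message on stdout when a failure is detected.

```python
import shutil

def check_message(path, free_bytes, total_bytes, min_free_fraction=0.05):
    """Return an 'ERROR ...' line if free space is below the threshold, else None."""
    fraction = free_bytes / total_bytes
    if fraction < min_free_fraction:
        return "ERROR low disk space on %s: %.1f%% free" % (path, fraction * 100)
    return None

if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    message = check_message("/", usage.free, usage.total)
    if message:
        # Keep the output short: only the text after ERROR is kept as the node message.
        print(message)
```

The check does no network I/O and runs no resource manager client commands, so it should return quickly even when the node is unhealthy.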
662
663 If the script detects a failure when run from "jobstart", then the job
664 will be rejected. This should probably only be used with advanced
665 schedulers like Moab so that the job can be routed to another node.
666
667 TORQUE currently ignores ERROR messages by default, but advanced sched‐
668 ulers like Moab can be configured to react appropriately.
669
670 If the experimental $down_on_error MOM setting is enabled, MOM will set
671 itself to state down and report to pbs_server; and pbs_server will re‐
672 port the node as "down". Additionally, the experimental "down_on_er‐
673 ror" server attribute can be enabled which has the same effect but
674 moves the decision to pbs_server. It is redundant to have MOM's
675 $down_on_error and pbs_server's down_on_error features enabled. See
676 "down_on_error" in pbs_server_attributes(7B).
677
679 $PBS_SERVER_HOME/server_name
680 contains the hostname running pbs_server.
681
682 $PBS_SERVER_HOME/mom_priv
683 the default directory for configuration files, typically
684 (/usr/spool/pbs)/mom_priv.
685
686 $PBS_SERVER_HOME/mom_logs
687 directory for log files recorded by the server.
688
689 $PBS_SERVER_HOME/mom_priv/prologue
690 the administrative script to be run before job execution.
691
692 $PBS_SERVER_HOME/mom_priv/epilogue
693 the administrative script to be run after job execution.
694
696 pbs_mom handles the following signals:
697
698 SIGHUP causes pbs_mom to re-read its configuration file, close and re‐
699 open the log file, and reinitialize resource structures.
700
701 SIGALRM
702 results in a log file entry. The signal is used to limit the
703 time taken by certain child processes, such as the prologue
704 and epilogue.
705
706 SIGINT and SIGTERM
707 result in pbs_mom exiting without terminating any running jobs.
708 This is the action for the following signals as well: SIGXCPU,
709 SIGXFSZ, SIGCPULIM, and SIGSHUTDN.
710
711 SIGUSR1, SIGUSR2
712 cause MOM to increase and decrease logging levels, respec‐
713 tively.
714
715 SIGPIPE, SIGINFO
716 are ignored.
717
718 SIGBUS, SIGFPE, SIGILL, SIGTRAP, and SIGSYS
719 cause a core dump if the PBSCOREDUMP environmental variable is
720 defined.
721
722 All other signals have their default behavior installed.
723
725 If the mini-server command fails to begin operation, the server exits
726 with a value greater than zero.
727
729 pbs_server(8B), pbs_scheduler_basl(8B), pbs_scheduler_tcl(8B), the PBS
730 External Reference Specification, and the PBS Administrator's Guide.
731
732
733
734Local pbs_mom(8B)