SCHED_CONF(5)              Grid Engine File Formats              SCHED_CONF(5)



NAME

       sched_conf - Grid Engine default scheduler configuration file

DESCRIPTION

       sched_conf defines the configuration file format for Grid Engine's
       scheduler. In order to modify the configuration, use the graphical
       user interface qmon(1) or the -msconf option of the qconf(1) command.
       A default configuration is provided together with the Grid Engine
       distribution package.

       Note that Grid Engine allows backslashes (\) to be used to escape
       newline (\newline) characters. The backslash and the newline are
       replaced with a space (" ") character before any interpretation.

FORMAT

       The following parameters are recognized by the Grid Engine scheduler
       if present in sched_conf:

   algorithm
       Note: Deprecated; may be removed in a future release.
       Allows for the selection of alternative scheduling algorithms.

       Currently, default is the only allowed setting.

   load_formula
       A simple algebraic expression used to derive a single weighted load
       value from all or part of the load parameters reported by ge_execd(8)
       for each host and from all or part of the consumable resources (see
       complex(5)) being maintained for each host. The load formula
       expression syntax is that of a summation of weighted load values,
       that is:

              {w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]

       Note that no blanks are allowed in the load formula.
       The load values and consumable resources (load_val1, ...) are
       specified by the name defined in the complex (see complex(5)).
       Note: Administrator-defined load values (see the load_sensor
       parameter in ge_conf(5) for details) and consumable resources
       available for all hosts (see complex(5)) may be used as well as Grid
       Engine default load parameters.
       The weighting factors (w1, ...) are positive integers. After the
       expression is evaluated for each host, the results are assigned to
       the hosts and are used to sort the hosts according to the weighted
       load. The sorted host list is used to sort queues subsequently.
       The default load formula is "np_load_avg".

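       For illustration, a formula that weights the CPU load average three
       times as heavily as the free memory value might look as follows
       (this assumes that both np_load_avg and mem_free are defined in the
       complex of the cluster at hand):

              load_formula               np_load_avg*3+mem_free
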
   job_load_adjustments
       The load which is imposed by the Grid Engine jobs running on a
       system varies in time, and often, e.g. for the CPU load, requires
       some amount of time to be reported in the appropriate quantity by
       the operating system. Consequently, if a job was started very
       recently, the reported load may not provide a sufficient
       representation of the load which is already imposed on that host by
       the job. The reported load will adapt to the real load over time,
       but the period of time in which the reported load is too low may
       already lead to an oversubscription of that host. Grid Engine allows
       the administrator to specify job_load_adjustments which are used in
       the Grid Engine scheduler to compensate for this problem.
       The job_load_adjustments are specified as a comma separated list of
       arbitrary load parameters or consumable resources and (separated by
       an equal sign) an associated load correction value. Whenever a job
       is dispatched to a host by the scheduler, the load parameter and
       consumable value set of that host is increased by the values
       provided in the job_load_adjustments list. These correction values
       are decayed linearly over time until, after load_adjustment_decay_time
       from the start, the corrections reach the value 0. If the
       job_load_adjustments list is assigned the special value NONE, no
       load corrections are performed.
       The adjusted load and consumable values are used to compute the
       combined and weighted load of the hosts with the load_formula (see
       above) and to compare the load and consumable values against the
       load threshold lists defined in the queue configurations (see
       queue_conf(5)). If the load_formula consists simply of the default
       CPU load average parameter np_load_avg, and if the jobs are very
       compute intensive, one might want to set the job_load_adjustments
       list to np_load_avg=1.00, which means that every new job dispatched
       to a host will require 100 % CPU time, and thus the machine's load
       is instantly increased by 1.00.

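       Matching the example given above, the corresponding sched_conf entry
       would read:

              job_load_adjustments       np_load_avg=1.00
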
   load_adjustment_decay_time
       The load corrections in the "job_load_adjustments" list above are
       decayed linearly over time from the point of the job start, where
       the corresponding load or consumable parameter is raised by the full
       correction value, until after a time period of
       "load_adjustment_decay_time", where the correction becomes 0. Proper
       values for "load_adjustment_decay_time" greatly depend upon the load
       or consumable parameters used and the specific operating system(s).
       Therefore, they can only be determined on-site and experimentally.
       For the default np_load_avg load parameter a
       "load_adjustment_decay_time" of 7 minutes has proven to yield
       reasonable results.

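       Using the time value syntax described in queue_conf(5), the 7 minute
       value mentioned above could be written as:

              load_adjustment_decay_time 0:7:0
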
   maxujobs
       The maximum number of jobs any user may have running in a Grid
       Engine cluster at the same time. If set to 0 (default) the users may
       run an arbitrary number of jobs.

   schedule_interval
       At the time the scheduler thread initially registers at the event
       master thread in the ge_qmaster(8) process, schedule_interval is
       used to set the time interval in which the event master thread sends
       scheduling event updates to the scheduler thread. A scheduling event
       is a status change that has occurred within ge_qmaster(8) which may
       trigger or affect scheduler decisions (e.g. a job has finished and
       thus the allocated resources are available again).
       In the Grid Engine default scheduler the arrival of a scheduling
       event report triggers a scheduler run. The scheduler waits for event
       reports otherwise.
       schedule_interval is a time value (see queue_conf(5) for a
       definition of the syntax of time values).

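       For example, to have scheduling event updates sent every 15 seconds
       (an illustrative value; the right interval depends on the cluster),
       one would set:

              schedule_interval          0:0:15
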
   queue_sort_method
       This parameter determines in which order several criteria are taken
       into account to produce a sorted queue list. Currently, two settings
       are valid: seqno and load. However, in both cases Grid Engine
       attempts to maximize the number of soft requests (see qsub(1) -soft
       option) being fulfilled by the queues for a particular job as the
       primary criterion.
       Then, if the queue_sort_method parameter is set to seqno, Grid
       Engine will use the seq_no parameter as configured in the current
       queue configurations (see queue_conf(5)) as the next criterion to
       sort the queue list. The load_formula (see above) only has a meaning
       if two queues have equal sequence numbers. If queue_sort_method is
       set to load, the load according to the load_formula is the criterion
       after maximizing a job's soft requests, and the sequence number is
       only used if two hosts have the same load. The sequence number
       sorting is most useful if you want to define a fixed order in which
       queues are to be filled (e.g. the cheapest resource first).

       The default for this parameter is load.

   halftime
       When executing under a share based policy, the scheduler "ages"
       (i.e. decreases) usage to implement a sliding window for achieving
       the share entitlements as defined by the share tree. The halftime
       defines the time interval in which accumulated usage will have been
       decayed to half its original value. Valid values are specified in
       hours or according to the time format as specified in queue_conf(5).
       If the value is set to 0, the usage is not decayed.

   usage_weight_list
       Grid Engine accounts for the consumption of the resources CPU-time,
       memory and IO to determine the usage which is imposed on a system by
       a job. A single usage value is computed from these three input
       parameters by multiplying the individual values by weights and
       adding them up. The weights are defined in the usage_weight_list.
       The format of the list is

              cpu=wcpu,mem=wmem,io=wio

       where wcpu, wmem and wio are the configurable weights. The weights
       are real numbers. The sum of all three weights should be 1.

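       For example, a list that weights CPU consumption most heavily (the
       numbers are purely illustrative) could be:

              usage_weight_list          cpu=0.6,mem=0.3,io=0.1
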
   compensation_factor
       Determines how fast Grid Engine should compensate for past usage
       below or above the share entitlement defined in the share tree.
       Recommended values are between 2 and 10, where 10 means faster
       compensation.

   weight_user
       The relative importance of the user shares in the functional policy.
       Values are of type real.

   weight_project
       The relative importance of the project shares in the functional
       policy. Values are of type real.

   weight_department
       The relative importance of the department shares in the functional
       policy. Values are of type real.

   weight_job
       The relative importance of the job shares in the functional policy.
       Values are of type real.

   weight_tickets_functional
       The maximum number of functional tickets available for distribution
       by Grid Engine. Determines the relative importance of the functional
       policy. See sge_priority(5) for an overview on job priorities.

   weight_tickets_share
       The maximum number of share based tickets available for distribution
       by Grid Engine. Determines the relative importance of the share tree
       policy. See sge_priority(5) for an overview on job priorities.

   weight_deadline
       The weight applied on the remaining time until a job's latest start
       time. Determines the relative importance of the deadline. See
       sge_priority(5) for an overview on job priorities.

   weight_waiting_time
       The weight applied on a job's waiting time since submission.
       Determines the relative importance of the waiting time. See
       sge_priority(5) for an overview on job priorities.

   weight_urgency
       The weight applied on a job's normalized urgency when determining
       the priority finally used. Determines the relative importance of
       urgency. See sge_priority(5) for an overview on job priorities.

   weight_priority
       The weight applied on a job's normalized POSIX priority when
       determining the priority finally used. Determines the relative
       importance of the POSIX priority. See sge_priority(5) for an
       overview on job priorities.

   weight_ticket
       The weight applied on the normalized ticket amount when determining
       the priority finally used. Determines the relative importance of the
       ticket policies. See sge_priority(5) for an overview on job
       priorities.

   flush_finish_sec
       This parameter is provided for tuning the system's scheduling
       behavior. By default, a scheduler run is triggered at the schedule
       interval. When this parameter is set to 1 or larger, the scheduler
       will be triggered x seconds after a job has finished. Setting this
       parameter to 0 disables the flush after a job has finished.

   flush_submit_sec
       This parameter is provided for tuning the system's scheduling
       behavior. By default, a scheduler run is triggered at the schedule
       interval. When this parameter is set to 1 or larger, the scheduler
       will be triggered x seconds after a job was submitted to the system.
       Setting this parameter to 0 disables the flush after a job was
       submitted.

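       For example, to trigger a scheduler run one second after each
       submission and each job end (whether such low-latency flushing is
       advisable depends on the size of the cluster):

              flush_submit_sec           1
              flush_finish_sec           1
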
   schedd_job_info
       The default scheduler can keep track of why jobs could not be
       scheduled during the last scheduler run. This parameter enables or
       disables the observation. The value true enables the monitoring;
       false turns it off.

       It is also possible to activate the observation only for certain
       jobs. This is done by setting the parameter to job_list followed by
       a comma separated list of job ids.

       The user can obtain the collected information with the command
       qstat -j.

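       For example, to collect scheduling information only for two specific
       jobs (the job ids below are placeholders):

              schedd_job_info            job_list 1034,1035
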
   params
       This is foreseen for passing additional parameters to the Grid
       Engine scheduler. The following values are recognized:

       DURATION_OFFSET
              If set, overrides the default value of 60 seconds. This
              parameter is used by the Grid Engine scheduler when planning
              resource utilization as the delta between net job runtimes
              and the total time until resources become available again.
              The net job runtime as specified with -l h_rt=... or
              -l s_rt=... or default_duration always differs from the total
              job runtime due to delays before and after the actual job
              start and finish. Among the delays before job start are the
              time until the end of a schedule_interval, the time it takes
              to deliver a job to sge_execd(8), and the delays caused by
              prolog in queue_conf(5), start_proc_args in sge_pe(5) and
              starter_method in queue_conf(5) (notify, terminate_method or
              checkpointing); among the delays after the actual job finish
              are procedures such as stop_proc_args in sge_pe(5) or epilog
              in queue_conf(5), and the delay until a new
              schedule_interval.
              If the offset is too low, resource reservations (see
              max_reservation) can be delayed repeatedly due to an overly
              optimistic job circulation time.

       JC_FILTER
              Note: Deprecated; may be removed in a future release.
              If set to true, the scheduler limits the number of jobs it
              looks at during a scheduling run. At the beginning of the
              scheduling run it assigns each job a specific category, which
              is based on the job's requests, priority settings, and the
              job owner. All scheduling policies will assign the same
              importance to each job in one category. Therefore the jobs
              per category have a FIFO order and can be limited to the
              number of free slots in the system.

              An exception are jobs which request a resource reservation.
              They are included regardless of the number of jobs in a
              category.

              This setting is turned off by default, because in very rare
              cases the scheduler can make a wrong decision. It is also
              advised to turn report_pjob_tickets off. Otherwise qstat -ext
              can report outdated ticket amounts. The information shown by
              qstat -j for a job that was excluded in a scheduling run is
              very limited.

       PROFILE
              If set equal to 1, the scheduler logs profiling information
              summarizing each scheduling run.

       MONITOR
              If set equal to 1, the scheduler records information for each
              scheduling run, allowing job resource utilization to be
              reproduced, in the file <ge_root>/<cell>/common/schedule.

       PE_RANGE_ALG
              This parameter sets the algorithm for the pe range
              computation. The default is auto, which means that the
              scheduler will select the best algorithm, and it should not
              be necessary to change it to a different setting in normal
              operation. If a custom setting is needed, the following
              values are available:
              auto     : the scheduler selects the best algorithm
              least    : starts the resource matching with the lowest slot
                         amount first
              bin      : starts the resource matching in the middle of the
                         pe slot range
              highest  : starts the resource matching with the highest slot
                         amount first

       Changing params will take immediate effect. The default for params
       is none.

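       For example, assuming multiple values may be combined in one comma
       separated list (as with other lists in this file), profiling and
       monitoring could be enabled together with:

              params                     PROFILE=1,MONITOR=1
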
   reprioritize_interval
       Interval (HH:MM:SS) to reprioritize jobs on the execution hosts
       based on the current ticket amount for the running jobs. If the
       interval is set to 00:00:00 the reprioritization is turned off. The
       default value is 00:00:00. The reprioritization tickets are
       calculated by the scheduler, and update events for running jobs are
       only sent after the scheduler has calculated new values. How often
       the scheduler should calculate the tickets is defined by the
       reprioritize_interval. Because the scheduler is only triggered at a
       specific interval (schedule_interval), the reprioritize_interval
       only has a meaning if it is set greater than the schedule_interval.
       For example, if the schedule_interval is 2 minutes and
       reprioritize_interval is set to 10 seconds, the jobs get
       re-prioritized every 2 minutes.

   report_pjob_tickets
       This parameter allows tuning of the system's scheduling run time. It
       is used to enable / disable the reporting of pending job tickets to
       the qmaster. It does not influence the ticket calculation. When the
       reporting is turned off, the sort order of jobs in qstat and qmon is
       based only on the submit time.
       The reporting should be turned off in a system with a very large
       amount of jobs by setting this parameter to "false".

   halflife_decay_list
       The halflife_decay_list allows configuring different decay rates for
       the "finished jobs" usage, which is used in the pending job ticket
       calculation to account for jobs which have just ended. This allows
       the pending jobs algorithm to count finished jobs against a user or
       project for a configurable decayed time period. This feature is
       turned off by default, and the halftime is used instead.
       The halflife_decay_list also allows one to configure different decay
       rates for each usage type being tracked (cpu, io, and mem). The list
       is specified in the following format:

              <USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>]]

       <USAGE_TYPE> can be one of the following: cpu, io, or mem.
       <TIME> can be -1, 0 or a timespan specified in minutes. If <TIME> is
       -1, only the usage of currently running jobs is used. 0 means that
       the usage is not decayed.

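       For example, to decay cpu usage with a one hour half-life, leave mem
       usage undecayed, and count only running jobs for io:

              halflife_decay_list        cpu=60:mem=0:io=-1
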
   policy_hierarchy
       This parameter sets up a dependency chain of ticket based policies.
       Each ticket based policy in the dependency chain is influenced by
       the previous policies and influences the following policies. A
       typical scenario is to assign precedence for the override policy
       over the share-based policy. The override policy determines in such
       a case how share-based tickets are assigned among jobs of the same
       user or project. Note that all policies contribute to the ticket
       amount assigned to a particular job regardless of the policy
       hierarchy definition. Yet the tickets calculated in each of the
       policies can be different depending on "POLICY_HIERARCHY".

       The "POLICY_HIERARCHY" parameter can be an up to 3 letter
       combination of the first letters of the 3 ticket based policies
       S(hare-based), F(unctional) and O(verride). So a value "OFS" means
       that the override policy takes precedence over the functional
       policy, which finally influences the share-based policy. Less than 3
       letters mean that some of the policies do not influence other
       policies and also are not influenced by other policies. So a value
       of "FS" means that the functional policy influences the share-based
       policy and that there is no interference with the other policies.

       The special value "NONE" switches off policy hierarchies.

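       The "OFS" scenario described above is configured as:

              policy_hierarchy           OFS
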
   share_override_tickets
       If set to "true" or "1", override tickets of any override object
       instance are shared equally among all running jobs associated with
       the object. The pending jobs will get as many override tickets as
       they would have if they were running. If set to "false" or "0", each
       job gets the full value of the override tickets associated with the
       object. The default value is "true".

   share_functional_shares
       If set to "true" or "1", functional shares of any functional object
       instance are shared among all the jobs associated with the object.
       If set to "false" or "0", each job associated with a functional
       object gets the full functional shares of that object. The default
       value is "true".

   max_functional_jobs_to_schedule
       The maximum number of pending jobs to schedule in the functional
       policy. The default value is 200.

   max_pending_tasks_per_job
       The maximum number of subtasks per pending array job to schedule.
       This parameter exists in order to reduce scheduling overhead. The
       default value is 50.

   max_reservation
       The maximum number of reservations scheduled within a schedule
       interval. When a runnable job cannot be started due to a shortage of
       resources, a reservation can be scheduled instead. A reservation can
       cover consumable resources of the global host, any execution host
       and any queue. For parallel jobs, reservations are also done for the
       slots resource as specified in sge_pe(5). As the job runtime, the
       maximum of the time specified with -l h_rt=... or -l s_rt=... is
       assumed. For jobs that have neither of them, the default_duration is
       assumed. Reservations prevent jobs of lower priority as specified in
       sge_priority(5) from utilizing the reserved resource quota during
       the time of reservation. Jobs of lower priority are allowed to
       utilize those reserved resources only if their prospective job end
       is before the start of the reservation (backfilling). Reservation is
       done only for non-immediate jobs (-now no) that request reservation
       (-R y). If max_reservation is set to "0" no job reservation is done.

       Note that reservation scheduling can be performance consuming and
       hence reservation scheduling is switched off by default. Since
       reservation scheduling performance consumption is known to grow with
       the number of pending jobs, the use of the -R y option is
       recommended only for those jobs actually queuing for bottleneck
       resources. Together with the max_reservation parameter this
       technique can be used to narrow down performance impacts.

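       For example, to allow up to 20 reservations per scheduling interval:

              max_reservation            20

       A job then requests a reservation at submission time, e.g. (the
       script name is a placeholder):

              qsub -R y -now no -l h_rt=8:0:0 job_script
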
   default_duration
       When job reservation is enabled through the max_reservation
       parameter, the default_duration is assumed as the runtime for jobs
       that have neither -l h_rt=... nor -l s_rt=... specified. In contrast
       to a h_rt/s_rt time limit, the default_duration is not enforced.
425

FILES

       <ge_root>/<cell>/common/sched_configuration
                 scheduler thread configuration

SEE ALSO

       ge_intro(1), qalter(1), qconf(1), qstat(1), qsub(1), complex(5),
       queue_conf(5), ge_execd(8), ge_qmaster(8), Grid Engine Installation
       and Administration Guide

       See ge_intro(1) for a full statement of rights and permissions.


GE 6.2u5                 $Date: 2009/07/08 14:42:40 $            SCHED_CONF(5)