SCHED_CONF(5)             Grid Engine File Formats             SCHED_CONF(5)


NAME
       sched_conf - Grid Engine default scheduler configuration file

DESCRIPTION
       sched_conf defines the configuration file format for Grid Engine's
       default scheduler provided by sge_schedd(8). To modify the
       configuration, use the graphical user interface qmon(1) or the
       -msconf option of the qconf(1) command. A default configuration is
       provided together with the Grid Engine distribution package.

       Note: Grid Engine allows backslashes (\) to be used to escape
       newline (\newline) characters. The backslash and the newline are
       replaced with a space (" ") character before any interpretation.

FORMAT
       The following parameters are recognized by the Grid Engine scheduler
       if present in sched_conf:

   algorithm
       Allows for the selection of alternative scheduling algorithms.

       Currently, default is the only allowed setting.

   load_formula
       A simple algebraic expression used to derive a single weighted load
       value from all or part of the load parameters reported by
       sge_execd(8) for each host and from all or part of the consumable
       resources (see complex(5)) maintained for each host. The load
       formula expression syntax is that of a summation of weighted load
       values, that is:

          {w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]

       Note that no blanks are allowed in the load formula.
       The load values and consumable resources (load_val1, ...) are
       specified by the name defined in the complex (see complex(5)).
       Note: Administrator-defined load values (see the load_sensor
       parameter in sge_conf(5) for details) and consumable resources
       available for all hosts (see complex(5)) may be used as well as the
       Grid Engine default load parameters.
       The weighting factors (w1, ...) are positive integers. After the
       expression has been evaluated for each host, the results are
       assigned to the hosts and are used to sort the hosts according to
       the weighted load. The sorted host list is subsequently used to
       sort the queues.
       The default load formula is "np_load_avg".

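       As an illustration (the weighting factor is purely an example), a
       site that wants hosts sorted primarily by the normalized load
       average, with the raw load average as a secondary contribution,
       could use a formula such as:

          load_formula              np_load_avg*5+load_avg
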
   job_load_adjustments
       The load imposed by the Grid Engine jobs running on a system varies
       over time and often, e.g. for the CPU load, requires some amount of
       time to be reported in the appropriate quantity by the operating
       system. Consequently, if a job was started very recently, the
       reported load may not provide a sufficient representation of the
       load which is already imposed on that host by the job. The reported
       load will adapt to the real load over time, but the period of time
       in which the reported load is too low may already lead to an
       oversubscription of that host. Grid Engine allows the administrator
       to specify job_load_adjustments which are used by the Grid Engine
       scheduler to compensate for this problem.
       The job_load_adjustments are specified as a comma-separated list of
       arbitrary load parameters or consumable resources and (separated by
       an equal sign) an associated load correction value. Whenever a job
       is dispatched to a host by sge_schedd(8), the load parameter and
       consumable value set of that host is increased by the values
       provided in the job_load_adjustments list. These correction values
       are decayed linearly over time until, after
       load_adjustment_decay_time from the start, the corrections reach the
       value 0. If the job_load_adjustments list is assigned the special
       value NONE, no load corrections are performed.
       The adjusted load and consumable values are used to compute the
       combined and weighted load of the hosts with the load_formula (see
       above) and to compare the load and consumable values against the
       load threshold lists defined in the queue configurations (see
       queue_conf(5)). If the load_formula consists simply of the default
       CPU load average parameter np_load_avg, and if the jobs are very
       compute-intensive, one might want to set the job_load_adjustments
       list to np_load_avg=1.00, which means that every new job dispatched
       to a host will require 100% CPU time, and thus the machine's load is
       instantly increased by 1.00.

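       For a mixed workload in which a typical job is only expected to keep
       about half a CPU busy, a smaller correction could be configured (the
       value 0.50 is purely illustrative):

          job_load_adjustments      np_load_avg=0.50
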
   load_adjustment_decay_time
       The load corrections in the "job_load_adjustments" list above are
       decayed linearly over time from the point of the job start, where
       the corresponding load or consumable parameter is raised by the full
       correction value, until after a time period of
       "load_adjustment_decay_time", where the correction becomes 0.
       Proper values for "load_adjustment_decay_time" greatly depend upon
       the load or consumable parameters used and the specific operating
       system(s). Therefore, they can only be determined on-site and
       experimentally. For the default np_load_avg load parameter a
       "load_adjustment_decay_time" of 7 minutes has proven to yield
       reasonable results.

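       The linear decay described above can be sketched as follows; this is
       an illustration of the documented behavior, not Grid Engine source
       code:

```python
def adjusted_load(reported_load, correction, age, decay_time):
    """Host load seen by the scheduler while a correction decays.

    reported_load -- load value reported by sge_execd for the host
    correction    -- job_load_adjustments value applied at job start
    age           -- seconds elapsed since the job started
    decay_time    -- load_adjustment_decay_time in seconds
    """
    # Fraction of the correction still in effect, decayed linearly to 0.
    remaining = max(0.0, 1.0 - age / decay_time)
    return reported_load + correction * remaining

# np_load_avg=1.00 with the suggested 7-minute decay time:
print(round(adjusted_load(0.2, 1.0, 0, 420), 2))    # 1.2 (full correction)
print(round(adjusted_load(0.2, 1.0, 210, 420), 2))  # 0.7 (half decayed)
print(round(adjusted_load(0.2, 1.0, 420, 420), 2))  # 0.2 (fully decayed)
```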
   maxujobs
       The maximum number of jobs any user may have running in a Grid
       Engine cluster at the same time. If set to 0 (the default), users
       may run an arbitrary number of jobs.

   schedule_interval
       At the time sge_schedd(8) initially registers with sge_qmaster(8),
       schedule_interval is used to set the time interval in which
       sge_qmaster(8) sends scheduling event updates to sge_schedd(8). A
       scheduling event is a status change that has occurred within
       sge_qmaster(8) which may trigger or affect scheduler decisions (e.g.
       a job has finished and thus the allocated resources are available
       again).
       In the Grid Engine default scheduler the arrival of a scheduling
       event report triggers a scheduler run; otherwise the scheduler waits
       for event reports.
       schedule_interval is a time value (see queue_conf(5) for a
       definition of the syntax of time values).

   queue_sort_method
       This parameter determines in which order several criteria are taken
       into account to produce a sorted queue list. Currently, two settings
       are valid: seqno and load. In both cases, however, Grid Engine
       attempts to maximize the number of soft requests (see the qsub(1)
       -soft option) fulfilled by the queues for a particular job as the
       primary criterion.
       Then, if the queue_sort_method parameter is set to seqno, Grid
       Engine will use the seq_no parameter as configured in the current
       queue configurations (see queue_conf(5)) as the next criterion to
       sort the queue list. The load_formula (see above) only has a meaning
       if two queues have equal sequence numbers. If queue_sort_method is
       set to load, the load according to the load_formula is the criterion
       after maximizing a job's soft requests, and the sequence number is
       only used if two hosts have the same load. The sequence number
       sorting is most useful if you want to define a fixed order in which
       queues are to be filled (e.g. the cheapest resource first).

       The default for this parameter is load.

   halftime
       When executing under a share-based policy, the scheduler "ages"
       (i.e. decreases) usage to implement a sliding window for achieving
       the share entitlements as defined by the share tree. The halftime
       defines the time interval in which accumulated usage will have
       decayed to half its original value. Valid values are specified in
       hours or according to the time format as specified in queue_conf(5).
       If the value is set to 0, the usage is not decayed.

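       The halftime decay can be sketched as follows; this is an
       illustration of the documented behavior, not Grid Engine source
       code:

```python
def decayed_usage(usage, hours_elapsed, halftime_hours):
    """Accumulated usage after aging under the share-based policy.

    A halftime of 0 means usage is not decayed at all.
    """
    if halftime_hours == 0:
        return usage
    # Usage halves once per halftime interval.
    return usage * 0.5 ** (hours_elapsed / halftime_hours)

# With a halftime of 24 hours, usage halves once per day:
print(decayed_usage(1000.0, 24, 24))  # 500.0
print(decayed_usage(1000.0, 48, 24))  # 250.0
```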
   usage_weight_list
       Grid Engine accounts for the consumption of the resources CPU time,
       memory and IO to determine the usage which is imposed on a system by
       a job. A single usage value is computed from these three input
       parameters by multiplying the individual values by weights and
       adding them up. The weights are defined in the usage_weight_list.
       The format of the list is

          cpu=wcpu,mem=wmem,io=wio

       where wcpu, wmem and wio are the configurable weights. The weights
       are real numbers. The sum of all three weights should be 1.

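       The combination described above can be sketched as follows; this is
       an illustration of the documented behavior, not Grid Engine source
       code:

```python
def combined_usage(cpu, mem, io, weights):
    """Single usage value derived from the three usage inputs.

    weights is a mapping like {"cpu": wcpu, "mem": wmem, "io": wio};
    the three weights should sum to 1.
    """
    return weights["cpu"] * cpu + weights["mem"] * mem + weights["io"] * io

# Weighting CPU usage twice as heavily as memory and IO:
w = {"cpu": 0.5, "mem": 0.25, "io": 0.25}
print(combined_usage(100.0, 40.0, 20.0, w))  # 65.0
```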
   compensation_factor
       Determines how fast Grid Engine should compensate for past usage
       below or above the share entitlement defined in the share tree.
       Recommended values are between 2 and 10, where 10 means faster
       compensation.

   weight_user
       The relative importance of the user shares in the functional policy.
       Values are of type real.

   weight_project
       The relative importance of the project shares in the functional
       policy. Values are of type real.

   weight_department
       The relative importance of the department shares in the functional
       policy. Values are of type real.

   weight_job
       The relative importance of the job shares in the functional policy.
       Values are of type real.

   weight_tickets_functional
       The maximum number of functional tickets available for distribution
       by Grid Engine. Determines the relative importance of the functional
       policy. See sge_priority(5) for an overview of job priorities.

   weight_tickets_share
       The maximum number of share-based tickets available for distribution
       by Grid Engine. Determines the relative importance of the share tree
       policy. See sge_priority(5) for an overview of job priorities.

   weight_deadline
       The weight applied to the remaining time until a job's latest start
       time. Determines the relative importance of the deadline. See
       sge_priority(5) for an overview of job priorities.

   weight_waiting_time
       The weight applied to a job's waiting time since submission.
       Determines the relative importance of the waiting time. See
       sge_priority(5) for an overview of job priorities.

   weight_urgency
       The weight applied to a job's normalized urgency when determining
       the priority finally used. Determines the relative importance of
       urgency. See sge_priority(5) for an overview of job priorities.

   weight_ticket
       The weight applied to the normalized ticket amount when determining
       the priority finally used. Determines the relative importance of the
       ticket policies. See sge_priority(5) for an overview of job
       priorities.

   flush_finish_sec
       This parameter is provided for tuning the system's scheduling
       behavior. By default, a scheduler run is triggered once per
       schedule_interval. When this parameter is set to 1 or larger, the
       scheduler will instead be triggered the given number of seconds
       after a job has finished. Setting this parameter to 0 disables the
       flush after a job has finished.

   flush_submit_sec
       This parameter is provided for tuning the system's scheduling
       behavior. By default, a scheduler run is triggered once per
       schedule_interval. When this parameter is set to 1 or larger, the
       scheduler will instead be triggered the given number of seconds
       after a job was submitted to the system. Setting this parameter to 0
       disables the flush after a job was submitted.

   schedd_job_info
       The default scheduler can keep track of why jobs could not be
       scheduled during the last scheduler run. This parameter enables or
       disables the observation. The value true enables the monitoring;
       false turns it off.

       It is also possible to activate the observation only for certain
       jobs. This is done by setting the parameter to job_list followed by
       a comma-separated list of job ids.

       The user can obtain the collected information with the command
       qstat -j.

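       For example, to collect scheduling information only for two specific
       jobs (the job ids 3298 and 3342 are of course placeholders):

          schedd_job_info           job_list 3298,3342
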
   params
       This setting is provided for passing additional parameters to the
       Grid Engine scheduler. The following values are recognized:

       DURATION_OFFSET
              If set, overrides the default value of 60 seconds. This
              parameter is used by the Grid Engine scheduler, when planning
              resource utilization, as the delta between net job runtimes
              and the total time until resources become available again.
              The net job runtime, as specified with -l h_rt=...,
              -l s_rt=... or default_duration, always differs from the
              total job runtime due to delays before and after the actual
              job start and finish. The delays before job start include the
              time until the end of a schedule_interval, the time it takes
              to deliver a job to sge_execd(8), and the delays caused by
              prolog in queue_conf(5), start_proc_args in sge_pe(5) and
              starter_method in queue_conf(5). The delays around job finish
              include notify, terminate_method or checkpointing delays,
              procedures run after the actual job finish, such as
              stop_proc_args in sge_pe(5) or epilog in queue_conf(5), and
              the delay until a new schedule_interval.
              If the offset is too low, resource reservations (see
              max_reservation) can be delayed repeatedly due to an overly
              optimistic job circulation time.

       JC_FILTER
              If set to true, the scheduler limits the number of jobs it
              looks at during a scheduling run. At the beginning of the
              scheduling run it assigns each job a specific category, which
              is based on the job's requests, priority settings, and the
              job owner. All scheduling policies will assign the same
              importance to each job in one category. Therefore the jobs in
              a category have a FIFO order, and their number can be limited
              to the number of free slots in the system.

              An exception are jobs which request a resource reservation.
              They are included regardless of the number of jobs in a
              category.

              This setting is turned off by default, because in very rare
              cases the scheduler can make a wrong decision. It is also
              advised to turn report_pjob_tickets off. Otherwise qstat -ext
              can report outdated ticket amounts. The information shown by
              qstat -j for a job that was excluded from a scheduling run is
              very limited.

       PROFILE
              If set equal to 1, the scheduler logs profiling information
              summarizing each scheduling run.

       MONITOR
              If set equal to 1, the scheduler records information for each
              scheduling run, allowing job resource utilization to be
              reproduced, in the file $SGE_ROOT/$SGE_CELL/common/schedule.

       SELECT_PE_RANGE_ALG
              This parameter sets the algorithm for the PE range
              computation. The default is auto, which means that the
              scheduler will select the best one, and it should not be
              necessary to change it to a different setting in normal
              operation. If a custom setting is needed, the following
              values are available:

              auto    : the scheduler selects the best algorithm
              least   : starts the resource matching with the lowest
                        slot amount first
              bin     : starts the resource matching in the middle of
                        the PE slot range
              highest : starts the resource matching with the highest
                        slot amount first

       Changing params takes immediate effect. The default for params is
       none.

   reprioritize_interval
       Interval (HH:MM:SS) at which jobs on the execution hosts are
       reprioritized based on the current ticket amount of the running
       jobs. If the interval is set to 00:00:00 the reprioritization is
       turned off. The default value is 00:00:00.

   report_pjob_tickets
       This parameter allows the system's scheduling run time to be tuned.
       It is used to enable or disable the reporting of pending job tickets
       to the qmaster. It does not influence the ticket calculation. When
       the reporting is turned off, the sort order of jobs in qstat and
       qmon is based only on the submit time.
       In a system with a very large number of jobs, the reporting should
       be turned off by setting this parameter to "false".

   halflife_decay_list
       The halflife_decay_list allows different decay rates to be
       configured for the finished-job usage, which is used in the pending
       job ticket calculation to account for jobs which have just ended.
       This allows the pending-jobs algorithm to count finished jobs
       against a user or project for a configurable, decayed time period.
       This feature is turned off by default, and the halftime is used
       instead.
       The halflife_decay_list also allows one to configure different decay
       rates for each usage type tracked (cpu, io, and mem). The list is
       specified in the following format:

          <USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>]]

       <USAGE_TYPE> can be one of the following: cpu, io, or mem.
       <TIME> can be -1, 0 or a timespan specified in minutes. If <TIME> is
       -1, only the usage of currently running jobs is used. 0 means that
       the usage is not decayed.

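       For example, the following setting (the values are purely
       illustrative) decays finished-job CPU usage with a halflife of 60
       minutes, never decays memory usage, and counts only the usage of
       currently running jobs for IO:

          halflife_decay_list       cpu=60:mem=0:io=-1
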
   policy_hierarchy
       This parameter sets up a dependency chain of ticket-based policies.
       Each ticket-based policy in the dependency chain is influenced by
       the previous policies and influences the following policies. A
       typical scenario is to assign precedence for the override policy
       over the share-based policy. The override policy determines in such
       a case how share-based tickets are assigned among jobs of the same
       user or project. Note that all policies contribute to the ticket
       amount assigned to a particular job regardless of the policy
       hierarchy definition. Yet the tickets calculated in each of the
       policies can be different depending on "POLICY_HIERARCHY".

       The "POLICY_HIERARCHY" parameter can be a combination of up to three
       letters, the first letters of the three ticket-based policies
       S(hare-based), F(unctional) and O(verride). So the value "OFS" means
       that the override policy takes precedence over the functional
       policy, which finally influences the share-based policy. Fewer than
       three letters mean that some of the policies do not influence other
       policies and also are not influenced by other policies. So the value
       "FS" means that the functional policy influences the share-based
       policy and that there is no interference with the other policies.

       The special value "NONE" switches off policy hierarchies.

   share_override_tickets
       If set to "true" or "1", override tickets of any override object
       instance are shared equally among all running jobs associated with
       the object. Pending jobs will get as many override tickets as they
       would have if they were running. If set to "false" or "0", each job
       gets the full value of the override tickets associated with the
       object. The default value is "true".

   share_functional_shares
       If set to "true" or "1", functional shares of any functional object
       instance are shared among all the jobs associated with the object.
       If set to "false" or "0", each job associated with a functional
       object gets the full functional shares of that object. The default
       value is "true".

   max_functional_jobs_to_schedule
       The maximum number of pending jobs to schedule in the functional
       policy. The default value is 200.

   max_pending_tasks_per_job
       The maximum number of subtasks per pending array job to schedule.
       This parameter exists in order to reduce scheduling overhead. The
       default value is 50.

   max_reservation
       The maximum number of reservations scheduled within a schedule
       interval. When a runnable job cannot be started due to a shortage of
       resources, a reservation can be scheduled instead. A reservation can
       cover consumable resources of the global host, any execution host
       and any queue. For parallel jobs, reservations are also made for the
       slots resource as specified in sge_pe(5). As the job runtime, the
       maximum of the times specified with -l h_rt=... or -l s_rt=... is
       assumed. For jobs that have neither of them, the default_duration is
       assumed. Reservations prevent jobs of lower priority as specified in
       sge_priority(5) from utilizing the reserved resource quota for the
       duration of the reservation. Jobs of lower priority are allowed to
       utilize those reserved resources only if their prospective job end
       is before the start of the reservation (backfilling). Reservation is
       done only for non-immediate jobs (-now no) that request reservation
       (-R y). If max_reservation is set to "0", no job reservation is
       done.

       Note that reservation scheduling can be performance-intensive and is
       therefore switched off by default. Since the performance consumption
       of reservation scheduling is known to grow with the number of
       pending jobs, use of the -R y option is recommended only for those
       jobs actually queuing for bottleneck resources. Together with the
       max_reservation parameter, this technique can be used to narrow down
       performance impacts.

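       The backfilling condition described above can be sketched as
       follows; this is an illustration of the documented rule, not Grid
       Engine source code:

```python
def may_backfill(now, runtime_limit, reservation_start):
    """A lower-priority job may use reserved resources only if its
    prospective end lies before the start of the reservation."""
    # Prospective job end is now plus the assumed runtime (h_rt/s_rt or
    # default_duration).
    return now + runtime_limit < reservation_start

# A 10-minute job fits in front of a reservation starting in 15 minutes:
print(may_backfill(now=0, runtime_limit=600, reservation_start=900))   # True
# A 20-minute job does not:
print(may_backfill(now=0, runtime_limit=1200, reservation_start=900))  # False
```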
   default_duration
       When job reservation is enabled through the max_reservation
       parameter, the default duration is assumed as the runtime for jobs
       that have neither -l h_rt=... nor -l s_rt=... specified. In contrast
       to an h_rt/s_rt time limit, the default_duration is not enforced.

FILES
       $SGE_ROOT/$SGE_CELL/common/sched_configuration
              sge_schedd configuration

SEE ALSO
       sge_intro(1), qalter(1), qconf(1), qstat(1), qsub(1), complex(5),
       queue_conf(5), sge_execd(8), sge_qmaster(8), sge_schedd(8), Grid
       Engine Installation and Administration Guide.

COPYRIGHT
       See sge_intro(1) for a full statement of rights and permissions.



GE 6.1                $Date: 2007/07/19 08:17:18 $             SCHED_CONF(5)