SCHED_CONF(5)             Grid Engine File Formats             SCHED_CONF(5)



NAME
       sched_conf - Grid Engine default scheduler configuration file

DESCRIPTION
       sched_conf defines the configuration file format for Grid Engine's
       scheduler.  To modify the configuration, use the graphical user
       interface qmon(1) or the -msconf option of the qconf(1) command.  A
       default configuration is provided with the Grid Engine distribution
       package.

       Note that Grid Engine allows backslashes (\) to be used to escape
       newline characters.  The backslash and the newline are replaced with
       a space (" ") character before any interpretation.

FORMAT
       The following parameters are recognized by the Grid Engine scheduler
       if present in sched_conf:

   algorithm
       Note: Deprecated, may be removed in a future release.
       Allows for the selection of alternative scheduling algorithms.

       Currently default is the only allowed setting.

   load_formula
       A simple algebraic expression used to derive a single weighted load
       value from all or part of the load parameters reported by ge_execd(8)
       for each host and from all or part of the consumable resources (see
       complex(5)) being maintained for each host.  The load formula
       expression syntax is that of a summation of weighted load values,
       that is:

       {w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]

       Note that no blanks are allowed in the load formula.
       The load values and consumable resources (load_val1, ...) are
       specified by the name defined in the complex (see complex(5)).
       Note: Administrator-defined load values (see the load_sensor
       parameter in ge_conf(5) for details) and consumable resources
       available for all hosts (see complex(5)) may be used as well as the
       Grid Engine default load parameters.
       The weighting factors (w1, ...) are positive integers.  After the
       expression is evaluated for each host, the results are assigned to
       the hosts and are used to sort the hosts according to the weighted
       load.  The sorted host list is subsequently used to sort the queues.
       The default load formula is "np_load_avg".

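       As an illustration, a hypothetical formula that ranks hosts by the
       normalized load average but adds a penalty for occupied job slots
       (slots_used is an assumed consumable that would have to be defined
       in the complex) could look like this:

           load_formula              np_load_avg+slots_used*2
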
   job_load_adjustments
       The load imposed by Grid Engine jobs running on a system varies over
       time and often, e.g. for the CPU load, requires some amount of time
       to be reported in the appropriate quantity by the operating system.
       Consequently, if a job was started very recently, the reported load
       may not sufficiently represent the load which is already imposed on
       that host by the job.  The reported load will adapt to the real load
       over time, but the period in which the reported load is too low may
       already lead to an oversubscription of that host.  Grid Engine
       allows the administrator to specify job_load_adjustments, which are
       used by the Grid Engine scheduler to compensate for this problem.
       The job_load_adjustments are specified as a comma-separated list of
       arbitrary load parameters or consumable resources and (separated by
       an equal sign) an associated load correction value.  Whenever a job
       is dispatched to a host by the scheduler, the load parameter and
       consumable value set of that host is increased by the values
       provided in the job_load_adjustments list.  These correction values
       are decayed linearly over time until, load_adjustment_decay_time
       after the job start, they reach 0.  If the job_load_adjustments
       list is assigned the special value NONE, no load corrections are
       performed.
       The adjusted load and consumable values are used to compute the
       combined and weighted load of the hosts with the load_formula (see
       above) and to compare the load and consumable values against the
       load threshold lists defined in the queue configurations (see
       queue_conf(5)).  If the load_formula consists simply of the default
       CPU load average parameter np_load_avg, and if the jobs are very
       compute intensive, one might want to set the job_load_adjustments
       list to np_load_avg=1.00, which means that every new job dispatched
       to a host will require 100 % CPU time, and thus the machine's load
       is instantly increased by 1.00.

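       As a sketch, the adjustment from the example above would be
       configured as:

           job_load_adjustments      np_load_avg=1.00
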
   load_adjustment_decay_time
       The load corrections in the job_load_adjustments list above are
       decayed linearly over time from the point of the job start, where
       the corresponding load or consumable parameter is raised by the
       full correction value, until after a time period of
       load_adjustment_decay_time, where the correction becomes 0.  Proper
       values for load_adjustment_decay_time greatly depend upon the load
       or consumable parameters used and the specific operating system(s).
       Therefore, they can only be determined on-site and experimentally.
       For the default np_load_avg load parameter, a
       load_adjustment_decay_time of 7 minutes has proven to yield
       reasonable results.

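       For instance, the 7-minute decay suggested above would be written in
       the time value syntax of queue_conf(5) as:

           load_adjustment_decay_time  0:7:0
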
   maxujobs
       The maximum number of jobs any user may have running in a Grid
       Engine cluster at the same time.  If set to 0 (default) the users
       may run an arbitrary number of jobs.

   schedule_interval
       At the time the scheduler thread initially registers with the event
       master thread in the ge_qmaster(8) process, schedule_interval is
       used to set the time interval in which the event master thread
       sends scheduling event updates to the scheduler thread.  A
       scheduling event is a status change that has occurred within
       ge_qmaster(8) which may trigger or affect scheduler decisions (e.g.
       a job has finished and thus the allocated resources are available
       again).
       In the Grid Engine default scheduler the arrival of a scheduling
       event report triggers a scheduler run; otherwise the scheduler
       waits for event reports.
       schedule_interval is a time value (see queue_conf(5) for a
       definition of the syntax of time values).

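       For example, a sketch that requests event updates every 15 seconds:

           schedule_interval         0:0:15
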
   queue_sort_method
       This parameter determines in which order several criteria are taken
       into account to produce a sorted queue list.  Currently, two
       settings are valid: seqno and load.  In both cases, however, Grid
       Engine attempts to maximize the number of soft requests (see
       qsub(1) -s option) being fulfilled by the queues for a particular
       job as the primary criterion.
       Then, if the queue_sort_method parameter is set to seqno, Grid
       Engine will use the seq_no parameter as configured in the current
       queue configurations (see queue_conf(5)) as the next criterion to
       sort the queue list.  The load_formula (see above) only has a
       meaning if two queues have equal sequence numbers.  If
       queue_sort_method is set to load, the load according to the
       load_formula is the criterion after maximizing a job's soft
       requests, and the sequence number is only used if two hosts have
       the same load.  The sequence number sorting is most useful if you
       want to define a fixed order in which queues are to be filled
       (e.g. the cheapest resource first).

       The default for this parameter is load.

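       For example, to fill queues in the fixed order defined by their
       seq_no values:

           queue_sort_method         seqno
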
   halftime
       When executing under a share-based policy, the scheduler "ages"
       (i.e. decreases) usage to implement a sliding window for achieving
       the share entitlements as defined by the share tree.  The halftime
       defines the time interval in which accumulated usage will have
       decayed to half its original value.  Valid values are specified in
       hours or according to the time format as specified in
       queue_conf(5).
       If the value is set to 0, the usage is not decayed.

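       For example, a one-week sliding window (168 hours) could be
       configured as:

           halftime                  168
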
   usage_weight_list
       Grid Engine accounts for the consumption of the resources CPU time,
       memory and IO to determine the usage which is imposed on a system
       by a job.  A single usage value is computed from these three input
       parameters by multiplying the individual values by weights and
       adding them up.  The weights are defined in the usage_weight_list.
       The format of the list is

       cpu=wcpu,mem=wmem,io=wio

       where wcpu, wmem and wio are the configurable weights.  The weights
       are real numbers.  The sum of all three weights should be 1.

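       For example, a configuration that weights CPU consumption most
       heavily while still accounting for memory and IO:

           usage_weight_list         cpu=0.8,mem=0.1,io=0.1
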
   compensation_factor
       Determines how fast Grid Engine should compensate for past usage
       below or above the share entitlement defined in the share tree.
       Recommended values are between 2 and 10, where 10 means faster
       compensation.

   weight_user
       The relative importance of the user shares in the functional
       policy.  Values are of type real.

   weight_project
       The relative importance of the project shares in the functional
       policy.  Values are of type real.

   weight_department
       The relative importance of the department shares in the functional
       policy.  Values are of type real.

   weight_job
       The relative importance of the job shares in the functional policy.
       Values are of type real.

   weight_tickets_functional
       The maximum number of functional tickets available for distribution
       by Grid Engine.  Determines the relative importance of the
       functional policy.  See sge_priority(5) for an overview of job
       priorities.

   weight_tickets_share
       The maximum number of share-based tickets available for
       distribution by Grid Engine.  Determines the relative importance of
       the share tree policy.  See sge_priority(5) for an overview of job
       priorities.

   weight_deadline
       The weight applied to the remaining time until a job's latest start
       time.  Determines the relative importance of the deadline.  See
       sge_priority(5) for an overview of job priorities.

   weight_waiting_time
       The weight applied to a job's waiting time since submission.
       Determines the relative importance of the waiting time.  See
       sge_priority(5) for an overview of job priorities.

   weight_urgency
       The weight applied to a job's normalized urgency when determining
       the priority finally used.  Determines the relative importance of
       urgency.  See sge_priority(5) for an overview of job priorities.

   weight_priority
       The weight applied to a job's normalized POSIX priority when
       determining the priority finally used.  Determines the relative
       importance of POSIX priority.  See sge_priority(5) for an overview
       of job priorities.

   weight_ticket
       The weight applied to the normalized ticket amount when determining
       the priority finally used.  Determines the relative importance of
       the ticket policies.  See sge_priority(5) for an overview of job
       priorities.

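       Taken together, the priority finally used is essentially a weighted
       sum of the normalized inputs, roughly along the lines of (see
       sge_priority(5) for the authoritative definition):

           prio = weight_priority * pprio
                + weight_urgency  * urg
                + weight_ticket   * tckts
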
   flush_finish_sec
       This parameter is provided for tuning the system's scheduling
       behavior.  By default, a scheduler run is triggered at the schedule
       interval.  When this parameter is set to 1 or larger, the scheduler
       will be triggered that many seconds after a job has finished.
       Setting this parameter to 0 disables the flush after a job has
       finished.

   flush_submit_sec
       This parameter is provided for tuning the system's scheduling
       behavior.  By default, a scheduler run is triggered at the schedule
       interval.  When this parameter is set to 1 or larger, the scheduler
       will be triggered that many seconds after a job was submitted to
       the system.  Setting this parameter to 0 disables the flush after a
       job was submitted.

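       For example, to trigger a scheduling run one second after every job
       submission and one second after every job finish:

           flush_submit_sec          1
           flush_finish_sec          1
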
   schedd_job_info
       The default scheduler can keep track of why jobs could not be
       scheduled during the last scheduler run.  This parameter enables or
       disables the observation.  The value true enables the monitoring;
       false turns it off.

       It is also possible to activate the observation only for certain
       jobs.  This is done by setting the parameter to job_list followed
       by a comma-separated list of job ids.

       The user can obtain the collected information with the command
       qstat -j.

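       For example, a sketch that collects scheduling information only for
       two jobs (the job ids 3127 and 3128 are placeholders):

           schedd_job_info           job_list 3127,3128
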
   params
       This setting is provided for passing additional parameters to the
       Grid Engine scheduler.  The following values are recognized:

       DURATION_OFFSET
              If set, overrides the default value of 60 seconds.  This
              parameter is used by the Grid Engine scheduler when planning
              resource utilization as the delta between net job runtimes
              and the total time until resources become available again.
              Net job runtime as specified with -l h_rt=... or
              -l s_rt=... or default_duration always differs from total
              job runtime due to delays before and after actual job start
              and finish.  Among the delays before job start are the time
              until the end of a schedule_interval, the time it takes to
              deliver a job to sge_execd(8), and the delays caused by
              prolog in queue_conf(5), start_proc_args in sge_pe(5) and
              starter_method in queue_conf(5) (notify, terminate_method
              or checkpointing); among the delays after actual job finish
              are procedures such as stop_proc_args in sge_pe(5) or
              epilog in queue_conf(5), and the delay until a new
              schedule_interval.
              If the offset is too low, resource reservations (see
              max_reservation) can be delayed repeatedly due to an overly
              optimistic job circulation time.

       JC_FILTER
              Note: Deprecated, may be removed in a future release.
              If set to true, the scheduler limits the number of jobs it
              looks at during a scheduling run.  At the beginning of the
              scheduling run it assigns each job a specific category,
              which is based on the job's requests, priority settings,
              and the job owner.  All scheduling policies will assign the
              same importance to each job in one category.  The jobs
              within a category therefore have a FIFO order, and the
              number of jobs examined per category can be limited to the
              number of free slots in the system.

              An exception are jobs which request a resource reservation.
              They are included regardless of the number of jobs in a
              category.

              This setting is turned off by default, because in very rare
              cases the scheduler can make a wrong decision.  It is also
              advised to turn report_pjob_tickets off.  Otherwise qstat
              -ext can report outdated ticket amounts.  The information
              shown with qstat -j for a job that was excluded in a
              scheduling run is very limited.

       PROFILE
              If set equal to 1, the scheduler logs profiling information
              summarizing each scheduling run.

       MONITOR
              If set equal to 1, the scheduler records information for
              each scheduling run, allowing one to reproduce job resource
              utilization, in the file <ge_root>/<cell>/common/schedule.

       PE_RANGE_ALG
              This parameter sets the algorithm for the PE range
              computation.  The default is auto, which means that the
              scheduler will select the best one, and it should not be
              necessary to change it to a different setting in normal
              operation.  If a custom setting is needed, the following
              values are available:
              auto    : the scheduler selects the best algorithm
              least   : starts the resource matching with the lowest
                        slot amount first
              bin     : starts the resource matching in the middle of
                        the PE slot range
              highest : starts the resource matching with the highest
                        slot amount first

       Changing params will take immediate effect.  The default for params
       is none.

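       For example, a sketch that enables both profiling and monitoring
       (assuming multiple values may be given as a comma-separated list):

           params                    PROFILE=1,MONITOR=1
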
   reprioritize_interval
       The interval (HH:MM:SS) at which jobs are reprioritized on the
       execution hosts, based on the current ticket amount of the running
       jobs.  If the interval is set to 00:00:00, reprioritization is
       turned off.  The default value is 00:00:00.  The reprioritization
       tickets are calculated by the scheduler, and update events for
       running jobs are only sent after the scheduler has calculated new
       values.  How often the scheduler should calculate the tickets is
       defined by the reprioritize_interval.  Because the scheduler is
       only triggered at a specific interval (schedule_interval), the
       reprioritize_interval only has a meaning if set greater than the
       schedule_interval.  For example, if the schedule_interval is 2
       minutes and reprioritize_interval is set to 10 seconds, the jobs
       get reprioritized every 2 minutes.

   report_pjob_tickets
       This parameter allows one to tune the system's scheduling run time.
       It is used to enable / disable the reporting of pending job tickets
       to the qmaster.  It does not influence the ticket calculation.
       When the reporting is turned off, the sort order of jobs in qstat
       and qmon is based only on the submit time.
       The reporting should be turned off in a system with a very large
       amount of jobs by setting this parameter to "false".

   halflife_decay_list
       The halflife_decay_list allows one to configure different decay
       rates for the "finished_jobs" usage types, which is used in the
       pending job ticket calculation to account for jobs which have just
       ended.  This allows the pending jobs algorithm to count finished
       jobs against a user or project for a configurable decayed time
       period.  This feature is turned off by default, and the halftime is
       used instead.
       The halflife_decay_list also allows one to configure different
       decay rates for each usage type being tracked (cpu, io, and mem).
       The list is specified in the following format:

       <USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>]]

       <USAGE_TYPE> can be one of the following: cpu, io, or mem.
       <TIME> can be -1, 0 or a timespan specified in minutes.  If <TIME>
       is -1, only the usage of currently running jobs is used.  0 means
       that the usage is not decayed.

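       For example, to decay CPU usage with a one-hour half-life, count
       only the usage of currently running jobs for IO, and never decay
       memory usage:

           halflife_decay_list       cpu=60:io=-1:mem=0
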
   policy_hierarchy
       This parameter sets up a dependency chain of ticket-based policies.
       Each ticket-based policy in the dependency chain is influenced by
       the previous policies and influences the following policies.  A
       typical scenario is to assign precedence for the override policy
       over the share-based policy.  The override policy determines in
       such a case how share-based tickets are assigned among jobs of the
       same user or project.  Note that all policies contribute to the
       ticket amount assigned to a particular job regardless of the policy
       hierarchy definition.  Yet the tickets calculated in each of the
       policies can be different depending on POLICY_HIERARCHY.

       The POLICY_HIERARCHY parameter can be an up to 3 letter combination
       of the first letters of the 3 ticket-based policies S(hare-based),
       F(unctional) and O(verride).  So a value "OFS" means that the
       override policy takes precedence over the functional policy, which
       finally influences the share-based policy.  Fewer than 3 letters
       mean that some of the policies neither influence other policies nor
       are influenced by other policies.  So a value of "FS" means that
       the functional policy influences the share-based policy and that
       there is no interference with the other policies.

       The special value "NONE" switches off policy hierarchies.

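       For example, to let the override policy influence the functional
       policy, which in turn influences the share-based policy:

           policy_hierarchy          OFS
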
   share_override_tickets
       If set to "true" or "1", override tickets of any override object
       instance are shared equally among all running jobs associated with
       the object.  Pending jobs will get as many override tickets as they
       would have if they were running.  If set to "false" or "0", each
       job gets the full value of the override tickets associated with the
       object.  The default value is "true".

   share_functional_shares
       If set to "true" or "1", functional shares of any functional object
       instance are shared among all the jobs associated with the object.
       If set to "false" or "0", each job associated with a functional
       object gets the full functional shares of that object.  The default
       value is "true".

   max_functional_jobs_to_schedule
       The maximum number of pending jobs to schedule in the functional
       policy.  The default value is 200.

   max_pending_tasks_per_job
       The maximum number of subtasks per pending array job to schedule.
       This parameter exists in order to reduce scheduling overhead.  The
       default value is 50.

   max_reservation
       The maximum number of reservations scheduled within a schedule
       interval.  When a runnable job cannot be started due to a shortage
       of resources, a reservation can be scheduled instead.  A
       reservation can cover consumable resources with the global host,
       any execution host and any queue.  For parallel jobs, reservations
       are also made for the slots resource as specified in sge_pe(5).  As
       the job runtime, the maximum of the time specified with
       -l h_rt=... or -l s_rt=... is assumed.  For jobs that have neither
       of them, the default_duration is assumed.  Reservations prevent
       jobs of lower priority, as specified in sge_priority(5), from
       utilizing the reserved resource quota during the time of
       reservation.  Jobs of lower priority are allowed to utilize those
       reserved resources only if their prospective job end is before the
       start of the reservation (backfilling).  Reservation is done only
       for non-immediate jobs (-now no) that request reservation (-R y).
       If max_reservation is set to "0", no job reservation is done.

       Note that reservation scheduling can be performance-consuming and
       hence is switched off by default.  Since the performance
       consumption of reservation scheduling is known to grow with the
       number of pending jobs, the use of the -R y option is recommended
       only for those jobs actually queuing for bottleneck resources.
       Together with the max_reservation parameter, this technique can be
       used to narrow down performance impacts.

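       For example, a sketch that allows up to 20 reservations per
       schedule interval:

           max_reservation           20

       combined with a submission that requests a reservation (the
       runtime request and script name are placeholders; see qsub(1)):

           qsub -R y -l h_rt=1:0:0 job.sh
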
   default_duration
       When job reservation is enabled through the max_reservation
       parameter, the default_duration is assumed as the runtime for jobs
       that have neither -l h_rt=... nor -l s_rt=... specified.  In
       contrast to an h_rt/s_rt time limit, the default_duration is not
       enforced.

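       For example, to assume a runtime of one hour for jobs that specify
       neither h_rt nor s_rt:

           default_duration          1:0:0
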
FILES
       <ge_root>/<cell>/common/sched_configuration
                 scheduler thread configuration

SEE ALSO
       ge_intro(1), qalter(1), qconf(1), qstat(1), qsub(1), complex(5),
       queue_conf(5), ge_execd(8), ge_qmaster(8), Grid Engine Installation
       and Administration Guide

COPYRIGHT
       See ge_intro(1) for a full statement of rights and permissions.



GE 6.2u5              $Date: 2009/07/08 14:42:40 $             SCHED_CONF(5)