SCHED_CONF(5)             Grid Engine File Formats             SCHED_CONF(5)


NAME
       sched_conf - Grid Engine default scheduler configuration file

DESCRIPTION
       sched_conf defines the configuration file format for Grid Engine's
       default scheduler provided by sge_schedd(8). To modify the
       configuration, use the graphical user interface qmon(1) or the
       -msconf option of the qconf(1) command. A default configuration is
       provided together with the Grid Engine distribution package.

       Note: Grid Engine allows backslashes (\) to be used to escape
       newline (\newline) characters. The backslash and the newline are
       replaced with a space (" ") character before any interpretation.

FORMAT
       The following parameters are recognized by the Grid Engine scheduler
       if present in sched_conf:

   algorithm
       Allows for the selection of alternative scheduling algorithms.

       Currently, default is the only allowed setting.

   load_formula
       A simple algebraic expression used to derive a single weighted load
       value from all or part of the load parameters reported by
       sge_execd(8) for each host and from all or part of the consumable
       resources (see complex(5)) maintained for each host. The load
       formula expression syntax is that of a summation of weighted load
       values, that is:

          {w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]

       Note that no blanks are allowed in the load formula.
       The load values and consumable resources (load_val1, ...) are
       specified by the name defined in the complex (see complex(5)).
       Note: Administrator-defined load values (see the load_sensor
       parameter in sge_conf(5) for details) and consumable resources
       available for all hosts (see complex(5)) may be used as well as the
       Grid Engine default load parameters.
       The weighting factors (w1, ...) are positive integers. After the
       expression has been evaluated for each host, the results are
       assigned to the hosts and are used to sort the hosts according to
       the weighted load. The sorted host list is subsequently used to
       sort the queues.
       The default load formula is "np_load_avg".

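       As an illustration (the weighting factor is purely an example), a
       site that wants hosts sorted primarily by the normalized load
       average, with the raw load average as a secondary contribution,
       could use a formula such as:

          load_formula              np_load_avg*5+load_avg
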
   job_load_adjustments
       The load imposed by the Grid Engine jobs running on a system varies
       over time and often, e.g. for the CPU load, requires some amount of
       time to be reported in the appropriate quantity by the operating
       system. Consequently, if a job was started very recently, the
       reported load may not provide a sufficient representation of the
       load which is already imposed on that host by the job. The reported
       load will adapt to the real load over time, but the period of time
       in which the reported load is too low may already lead to an
       oversubscription of that host. Grid Engine allows the administrator
       to specify job_load_adjustments which are used by the Grid Engine
       scheduler to compensate for this problem.
       The job_load_adjustments are specified as a comma-separated list of
       arbitrary load parameters or consumable resources and (separated by
       an equal sign) an associated load correction value. Whenever a job
       is dispatched to a host by sge_schedd(8), the load parameter and
       consumable value set of that host is increased by the values
       provided in the job_load_adjustments list. These correction values
       are decayed linearly over time until, after
       load_adjustment_decay_time from the start, the corrections reach the
       value 0. If the job_load_adjustments list is assigned the special
       value NONE, no load corrections are performed.
       The adjusted load and consumable values are used to compute the
       combined and weighted load of the hosts with the load_formula (see
       above) and to compare the load and consumable values against the
       load threshold lists defined in the queue configurations (see
       queue_conf(5)). If the load_formula consists simply of the default
       CPU load average parameter np_load_avg, and if the jobs are very
       compute-intensive, one might want to set the job_load_adjustments
       list to np_load_avg=1.00, which means that every new job dispatched
       to a host will require 100% CPU time, and thus the machine's load is
       instantly increased by 1.00.

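       For a mixed workload in which a typical job is only expected to keep
       about half a CPU busy, a smaller correction could be configured (the
       value 0.50 is purely illustrative):

          job_load_adjustments      np_load_avg=0.50
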
   load_adjustment_decay_time
       The load corrections in the "job_load_adjustments" list above are
       decayed linearly over time from the point of the job start, where
       the corresponding load or consumable parameter is raised by the full
       correction value, until after a time period of
       "load_adjustment_decay_time", where the correction becomes 0.
       Proper values for "load_adjustment_decay_time" greatly depend upon
       the load or consumable parameters used and the specific operating
       system(s). Therefore, they can only be determined on-site and
       experimentally. For the default np_load_avg load parameter a
       "load_adjustment_decay_time" of 7 minutes has proven to yield
       reasonable results.

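       The linear decay described above can be sketched as follows; this is
       an illustration of the documented behavior, not Grid Engine source
       code:

```python
def adjusted_load(reported_load, correction, age, decay_time):
    """Host load seen by the scheduler while a correction decays.

    reported_load -- load value reported by sge_execd for the host
    correction    -- job_load_adjustments value applied at job start
    age           -- seconds elapsed since the job started
    decay_time    -- load_adjustment_decay_time in seconds
    """
    # Fraction of the correction still in effect, decayed linearly to 0.
    remaining = max(0.0, 1.0 - age / decay_time)
    return reported_load + correction * remaining

# np_load_avg=1.00 with the suggested 7-minute decay time:
print(round(adjusted_load(0.2, 1.0, 0, 420), 2))    # 1.2 (full correction)
print(round(adjusted_load(0.2, 1.0, 210, 420), 2))  # 0.7 (half decayed)
print(round(adjusted_load(0.2, 1.0, 420, 420), 2))  # 0.2 (fully decayed)
```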
   maxujobs
       The maximum number of jobs any user may have running in a Grid
       Engine cluster at the same time. If set to 0 (the default), users
       may run an arbitrary number of jobs.

   schedule_interval
       At the time sge_schedd(8) initially registers with sge_qmaster(8),
       schedule_interval is used to set the time interval in which
       sge_qmaster(8) sends scheduling event updates to sge_schedd(8). A
       scheduling event is a status change that has occurred within
       sge_qmaster(8) which may trigger or affect scheduler decisions (e.g.
       a job has finished and thus the allocated resources are available
       again).
       In the Grid Engine default scheduler the arrival of a scheduling
       event report triggers a scheduler run; otherwise the scheduler waits
       for event reports.
       schedule_interval is a time value (see queue_conf(5) for a
       definition of the syntax of time values).

   queue_sort_method
       This parameter determines in which order several criteria are taken
       into account to produce a sorted queue list. Currently, two settings
       are valid: seqno and load. In both cases, however, Grid Engine
       attempts to maximize the number of soft requests (see the qsub(1)
       -soft option) fulfilled by the queues for a particular job as the
       primary criterion.
       Then, if the queue_sort_method parameter is set to seqno, Grid
       Engine will use the seq_no parameter as configured in the current
       queue configurations (see queue_conf(5)) as the next criterion to
       sort the queue list. The load_formula (see above) only has a meaning
       if two queues have equal sequence numbers. If queue_sort_method is
       set to load, the load according to the load_formula is the criterion
       after maximizing a job's soft requests, and the sequence number is
       only used if two hosts have the same load. The sequence number
       sorting is most useful if you want to define a fixed order in which
       queues are to be filled (e.g. the cheapest resource first).

       The default for this parameter is load.

   halftime
       When executing under a share-based policy, the scheduler "ages"
       (i.e. decreases) usage to implement a sliding window for achieving
       the share entitlements as defined by the share tree. The halftime
       defines the time interval in which accumulated usage will have
       decayed to half its original value. Valid values are specified in
       hours or according to the time format as specified in queue_conf(5).
       If the value is set to 0, the usage is not decayed.

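       The halftime decay can be sketched as follows; this is an
       illustration of the documented behavior, not Grid Engine source
       code:

```python
def decayed_usage(usage, hours_elapsed, halftime_hours):
    """Accumulated usage after aging under the share-based policy.

    A halftime of 0 means usage is not decayed at all.
    """
    if halftime_hours == 0:
        return usage
    # Usage halves once per halftime interval.
    return usage * 0.5 ** (hours_elapsed / halftime_hours)

# With a halftime of 24 hours, usage halves once per day:
print(decayed_usage(1000.0, 24, 24))  # 500.0
print(decayed_usage(1000.0, 48, 24))  # 250.0
```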
   usage_weight_list
       Grid Engine accounts for the consumption of the resources CPU time,
       memory and IO to determine the usage which is imposed on a system by
       a job. A single usage value is computed from these three input
       parameters by multiplying the individual values by weights and
       adding them up. The weights are defined in the usage_weight_list.
       The format of the list is

          cpu=wcpu,mem=wmem,io=wio

       where wcpu, wmem and wio are the configurable weights. The weights
       are real numbers. The sum of all three weights should be 1.

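       The combination described above can be sketched as follows; this is
       an illustration of the documented behavior, not Grid Engine source
       code:

```python
def combined_usage(cpu, mem, io, weights):
    """Single usage value derived from the three usage inputs.

    weights is a mapping like {"cpu": wcpu, "mem": wmem, "io": wio};
    the three weights should sum to 1.
    """
    return weights["cpu"] * cpu + weights["mem"] * mem + weights["io"] * io

# Weighting CPU usage twice as heavily as memory and IO:
w = {"cpu": 0.5, "mem": 0.25, "io": 0.25}
print(combined_usage(100.0, 40.0, 20.0, w))  # 65.0
```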
   compensation_factor
       Determines how fast Grid Engine should compensate for past usage
       below or above the share entitlement defined in the share tree.
       Recommended values are between 2 and 10, where 10 means faster
       compensation.

   weight_user
       The relative importance of the user shares in the functional policy.
       Values are of type real.

   weight_project
       The relative importance of the project shares in the functional
       policy. Values are of type real.

   weight_department
       The relative importance of the department shares in the functional
       policy. Values are of type real.

   weight_job
       The relative importance of the job shares in the functional policy.
       Values are of type real.

   weight_tickets_functional
       The maximum number of functional tickets available for distribution
       by Grid Engine. Determines the relative importance of the functional
       policy. See sge_priority(5) for an overview of job priorities.

   weight_tickets_share
       The maximum number of share-based tickets available for distribution
       by Grid Engine. Determines the relative importance of the share tree
       policy. See sge_priority(5) for an overview of job priorities.

   weight_deadline
       The weight applied to the remaining time until a job's latest start
       time. Determines the relative importance of the deadline. See
       sge_priority(5) for an overview of job priorities.

   weight_waiting_time
       The weight applied to a job's waiting time since submission.
       Determines the relative importance of the waiting time. See
       sge_priority(5) for an overview of job priorities.

   weight_urgency
       The weight applied to a job's normalized urgency when determining
       the priority finally used. Determines the relative importance of
       urgency. See sge_priority(5) for an overview of job priorities.

   weight_ticket
       The weight applied to the normalized ticket amount when determining
       the priority finally used. Determines the relative importance of the
       ticket policies. See sge_priority(5) for an overview of job
       priorities.

   flush_finish_sec
       This parameter is provided for tuning the system's scheduling
       behavior. By default, a scheduler run is triggered once per
       schedule_interval. When this parameter is set to 1 or larger, the
       scheduler will instead be triggered the given number of seconds
       after a job has finished. Setting this parameter to 0 disables the
       flush after a job has finished.

   flush_submit_sec
       This parameter is provided for tuning the system's scheduling
       behavior. By default, a scheduler run is triggered once per
       schedule_interval. When this parameter is set to 1 or larger, the
       scheduler will instead be triggered the given number of seconds
       after a job was submitted to the system. Setting this parameter to 0
       disables the flush after a job was submitted.

   schedd_job_info
       The default scheduler can keep track of why jobs could not be
       scheduled during the last scheduler run. This parameter enables or
       disables the observation. The value true enables the monitoring;
       false turns it off.

       It is also possible to activate the observation only for certain
       jobs. This is done by setting the parameter to job_list followed by
       a comma-separated list of job ids.

       The user can obtain the collected information with the command
       qstat -j.

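       For example, to collect scheduling information only for two specific
       jobs (the job ids 3298 and 3342 are of course placeholders):

          schedd_job_info           job_list 3298,3342
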
   params
       This setting is provided for passing additional parameters to the
       Grid Engine scheduler. The following values are recognized:

       DURATION_OFFSET
              If set, overrides the default value of 60 seconds. This
              parameter is used by the Grid Engine scheduler, when planning
              resource utilization, as the delta between net job runtimes
              and the total time until resources become available again.
              The net job runtime, as specified with -l h_rt=...,
              -l s_rt=... or default_duration, always differs from the
              total job runtime due to delays before and after the actual
              job start and finish. The delays before job start include the
              time until the end of a schedule_interval, the time it takes
              to deliver a job to sge_execd(8), and the delays caused by
              prolog in queue_conf(5), start_proc_args in sge_pe(5) and
              starter_method in queue_conf(5). The delays around job finish
              include notify, terminate_method or checkpointing delays,
              procedures run after the actual job finish, such as
              stop_proc_args in sge_pe(5) or epilog in queue_conf(5), and
              the delay until a new schedule_interval.
              If the offset is too low, resource reservations (see
              max_reservation) can be delayed repeatedly due to an overly
              optimistic job circulation time.

       JC_FILTER
              If set to true, the scheduler limits the number of jobs it
              looks at during a scheduling run. At the beginning of the
              scheduling run it assigns each job a specific category, which
              is based on the job's requests, priority settings, and the
              job owner. All scheduling policies will assign the same
              importance to each job in one category. Therefore the jobs in
              a category have a FIFO order, and their number can be limited
              to the number of free slots in the system.

              An exception are jobs which request a resource reservation.
              They are included regardless of the number of jobs in a
              category.

              This setting is turned off by default, because in very rare
              cases the scheduler can make a wrong decision. It is also
              advised to turn report_pjob_tickets off. Otherwise qstat -ext
              can report outdated ticket amounts. The information shown by
              qstat -j for a job that was excluded from a scheduling run is
              very limited.

       PROFILE
              If set equal to 1, the scheduler logs profiling information
              summarizing each scheduling run.

       MONITOR
              If set equal to 1, the scheduler records information for each
              scheduling run, allowing job resource utilization to be
              reproduced, in the file $SGE_ROOT/$SGE_CELL/common/schedule.

       SELECT_PE_RANGE_ALG
              This parameter sets the algorithm for the PE range
              computation. The default is auto, which means that the
              scheduler will select the best one, and it should not be
              necessary to change it to a different setting in normal
              operation. If a custom setting is needed, the following
              values are available:

              auto    : the scheduler selects the best algorithm
              least   : starts the resource matching with the lowest
                        slot amount first
              bin     : starts the resource matching in the middle of
                        the PE slot range
              highest : starts the resource matching with the highest
                        slot amount first

       Changing params takes immediate effect. The default for params is
       none.

   reprioritize_interval
       Interval (HH:MM:SS) at which jobs on the execution hosts are
       reprioritized based on the current ticket amount of the running
       jobs. If the interval is set to 00:00:00 the reprioritization is
       turned off. The default value is 00:00:00.

   report_pjob_tickets
       This parameter allows the system's scheduling run time to be tuned.
       It is used to enable or disable the reporting of pending job tickets
       to the qmaster. It does not influence the ticket calculation. When
       the reporting is turned off, the sort order of jobs in qstat and
       qmon is based only on the submit time.
       In a system with a very large number of jobs, the reporting should
       be turned off by setting this parameter to "false".

   halflife_decay_list
       The halflife_decay_list allows different decay rates to be
       configured for the finished-job usage, which is used in the pending
       job ticket calculation to account for jobs which have just ended.
       This allows the pending-jobs algorithm to count finished jobs
       against a user or project for a configurable, decayed time period.
       This feature is turned off by default, and the halftime is used
       instead.
       The halflife_decay_list also allows one to configure different decay
       rates for each usage type tracked (cpu, io, and mem). The list is
       specified in the following format:

          <USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>]]

       <USAGE_TYPE> can be one of the following: cpu, io, or mem.
       <TIME> can be -1, 0 or a timespan specified in minutes. If <TIME> is
       -1, only the usage of currently running jobs is used. 0 means that
       the usage is not decayed.

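       For example, the following setting (the values are purely
       illustrative) decays finished-job CPU usage with a halflife of 60
       minutes, never decays memory usage, and counts only the usage of
       currently running jobs for IO:

          halflife_decay_list       cpu=60:mem=0:io=-1
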
   policy_hierarchy
       This parameter sets up a dependency chain of ticket-based policies.
       Each ticket-based policy in the dependency chain is influenced by
       the previous policies and influences the following policies. A
       typical scenario is to assign precedence for the override policy
       over the share-based policy. The override policy determines in such
       a case how share-based tickets are assigned among jobs of the same
       user or project. Note that all policies contribute to the ticket
       amount assigned to a particular job regardless of the policy
       hierarchy definition. Yet the tickets calculated in each of the
       policies can be different depending on "POLICY_HIERARCHY".

       The "POLICY_HIERARCHY" parameter can be a combination of up to three
       letters, the first letters of the three ticket-based policies
       S(hare-based), F(unctional) and O(verride). So the value "OFS" means
       that the override policy takes precedence over the functional
       policy, which finally influences the share-based policy. Fewer than
       three letters mean that some of the policies do not influence other
       policies and also are not influenced by other policies. So the value
       "FS" means that the functional policy influences the share-based
       policy and that there is no interference with the other policies.

       The special value "NONE" switches off policy hierarchies.

   share_override_tickets
       If set to "true" or "1", override tickets of any override object
       instance are shared equally among all running jobs associated with
       the object. Pending jobs will get as many override tickets as they
       would have if they were running. If set to "false" or "0", each job
       gets the full value of the override tickets associated with the
       object. The default value is "true".

   share_functional_shares
       If set to "true" or "1", functional shares of any functional object
       instance are shared among all the jobs associated with the object.
       If set to "false" or "0", each job associated with a functional
       object gets the full functional shares of that object. The default
       value is "true".

   max_functional_jobs_to_schedule
       The maximum number of pending jobs to schedule in the functional
       policy. The default value is 200.

   max_pending_tasks_per_job
       The maximum number of subtasks per pending array job to schedule.
       This parameter exists in order to reduce scheduling overhead. The
       default value is 50.

   max_reservation
       The maximum number of reservations scheduled within a schedule
       interval. When a runnable job cannot be started due to a shortage of
       resources, a reservation can be scheduled instead. A reservation can
       cover consumable resources of the global host, any execution host
       and any queue. For parallel jobs, reservations are also made for the
       slots resource as specified in sge_pe(5). As the job runtime, the
       maximum of the times specified with -l h_rt=... or -l s_rt=... is
       assumed. For jobs that have neither of them, the default_duration is
       assumed. Reservations prevent jobs of lower priority as specified in
       sge_priority(5) from utilizing the reserved resource quota for the
       duration of the reservation. Jobs of lower priority are allowed to
       utilize those reserved resources only if their prospective job end
       is before the start of the reservation (backfilling). Reservation is
       done only for non-immediate jobs (-now no) that request reservation
       (-R y). If max_reservation is set to "0", no job reservation is
       done.

       Note that reservation scheduling can be performance-intensive and is
       therefore switched off by default. Since the performance consumption
       of reservation scheduling is known to grow with the number of
       pending jobs, use of the -R y option is recommended only for those
       jobs actually queuing for bottleneck resources. Together with the
       max_reservation parameter, this technique can be used to narrow down
       performance impacts.

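       The backfilling condition described above can be sketched as
       follows; this is an illustration of the documented rule, not Grid
       Engine source code:

```python
def may_backfill(now, runtime_limit, reservation_start):
    """A lower-priority job may use reserved resources only if its
    prospective end lies before the start of the reservation."""
    # Prospective job end is now plus the assumed runtime (h_rt/s_rt or
    # default_duration).
    return now + runtime_limit < reservation_start

# A 10-minute job fits in front of a reservation starting in 15 minutes:
print(may_backfill(now=0, runtime_limit=600, reservation_start=900))   # True
# A 20-minute job does not:
print(may_backfill(now=0, runtime_limit=1200, reservation_start=900))  # False
```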
   default_duration
       When job reservation is enabled through the max_reservation
       parameter, the default duration is assumed as the runtime for jobs
       that have neither -l h_rt=... nor -l s_rt=... specified. In contrast
       to an h_rt/s_rt time limit, the default_duration is not enforced.

FILES
       $SGE_ROOT/$SGE_CELL/common/sched_configuration
              sge_schedd configuration

SEE ALSO
       sge_intro(1), qalter(1), qconf(1), qstat(1), qsub(1), complex(5),
       queue_conf(5), sge_execd(8), sge_qmaster(8), sge_schedd(8), Grid
       Engine Installation and Administration Guide.

COPYRIGHT
       See sge_intro(1) for a full statement of rights and permissions.



GE 6.1                $Date: 2007/07/19 08:17:18 $             SCHED_CONF(5)