io_uring_setup(2)          Linux Programmer's Manual          io_uring_setup(2)

NAME
io_uring_setup - set up a context for performing asynchronous I/O
7
SYNOPSIS
#include <liburing.h>

int io_uring_setup(u32 entries, struct io_uring_params *p);
12
DESCRIPTION
The io_uring_setup(2) system call sets up a submission queue (SQ) and
completion queue (CQ) with at least entries entries, and returns a file
descriptor which can be used to perform subsequent operations on the
io_uring instance. The submission and completion queues are shared
between userspace and the kernel, which eliminates the need to copy data
when initiating and completing I/O.

The params structure, passed as the p argument, is used by the
application to pass options to the kernel, and by the kernel to convey
information about the ring buffers.
23
    struct io_uring_params {
        __u32 sq_entries;
        __u32 cq_entries;
        __u32 flags;
        __u32 sq_thread_cpu;
        __u32 sq_thread_idle;
        __u32 features;
        __u32 wq_fd;
        __u32 resv[3];
        struct io_sqring_offsets sq_off;
        struct io_cqring_offsets cq_off;
    };
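As a rough illustration of how this structure is typically prepared, the sketch below zeroes it and invokes the raw system call. The helper name setup_ring is hypothetical, and the sketch assumes that <linux/io_uring.h> and __NR_io_uring_setup are available on the build host. Because syscall(2) reports failure as -1 with errno set, the helper converts that into a negative error code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper (not part of any library): call the raw system
 * call and return either a ring file descriptor or a negative errno.
 */
static int setup_ring(unsigned entries, struct io_uring_params *p)
{
    /* Zero everything so that flags and resv[] start out as 0. */
    memset(p, 0, sizeof(*p));

    long ret = syscall(__NR_io_uring_setup, entries, p);
    if (ret < 0)
        return -errno;  /* syscall(2) itself reports failure via errno */
    return (int)ret;
}
```

A caller would pass a power-of-two entries value such as 8, check for a negative return, and close(2) the descriptor when done. On kernels or sandboxes without io_uring support the call is expected to fail, for example with -ENOSYS or -EPERM.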
36
37 The flags, sq_thread_cpu, and sq_thread_idle fields are used to config‐
38 ure the io_uring instance. flags is a bit mask of 0 or more of the
39 following values ORed together:
40
41 IORING_SETUP_IOPOLL
42 Perform busy-waiting for an I/O completion, as opposed to get‐
43 ting notifications via an asynchronous IRQ (Interrupt Request).
44 The file system (if any) and block device must support polling
45 in order for this to work. Busy-waiting provides lower latency,
46 but may consume more CPU resources than interrupt driven I/O.
47 Currently, this feature is usable only on a file descriptor
48 opened using the O_DIRECT flag. When a read or write is submit‐
49 ted to a polled context, the application must poll for comple‐
50 tions on the CQ ring by calling io_uring_enter(2). It is ille‐
51 gal to mix and match polled and non-polled I/O on an io_uring
52 instance.
53
54
55 IORING_SETUP_SQPOLL
56 When this flag is specified, a kernel thread is created to per‐
57 form submission queue polling. An io_uring instance configured
58 in this way enables an application to issue I/O without ever
59 context switching into the kernel. By using the submission
60 queue to fill in new submission queue entries and watching for
61 completions on the completion queue, the application can submit
62 and reap I/Os without doing a single system call.
63
64 If the kernel thread is idle for more than sq_thread_idle mil‐
65 liseconds, it will set the IORING_SQ_NEED_WAKEUP bit in the
66 flags field of the struct io_sq_ring. When this happens, the
67 application must call io_uring_enter(2) to wake the kernel
68 thread. If I/O is kept busy, the kernel thread will never
69 sleep. An application making use of this feature will need to
70 guard the io_uring_enter(2) call with the following code se‐
71 quence:
72
    /*
     * Ensure that the wakeup flag is read after the tail pointer
     * has been written. It's important to use memory load acquire
     * semantics for the flags read, as otherwise the application
     * and the kernel might not agree on the consistency of the
     * wakeup flag.
     */
    unsigned flags = atomic_load_explicit(sq_ring->flags,
                                          memory_order_acquire);
    if (flags & IORING_SQ_NEED_WAKEUP)
        io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP, NULL);
83
84 where sq_ring is a submission queue ring setup using the struct
85 io_sqring_offsets described below.
86
87 Before version 5.11 of the Linux kernel, to successfully use
88 this feature, the application must register a set of files to be
89 used for IO through io_uring_register(2) using the IORING_REGIS‐
90 TER_FILES opcode. Failure to do so will result in submitted IO
91 being errored with EBADF. The presence of this feature can be
92 detected by the IORING_FEAT_SQPOLL_NONFIXED feature flag. In
93 version 5.11 and later, it is no longer necessary to register
94 files to use this feature. 5.11 also allows using this as non-
95 root, if the user has the CAP_SYS_NICE capability.
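The wakeup check described for IORING_SETUP_SQPOLL can be sketched with C11 atomics as follows. The struct sq_ring_view type and the function name are hypothetical. The full fence between the tail store and the flags load is an assumption about what store-to-load ordering requires on weakly ordered CPUs (liburing does something similar, to my knowledge). It is not something the text above mandates.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Value of IORING_SQ_NEED_WAKEUP in <linux/io_uring.h>; defined here
 * only so the sketch compiles without kernel headers. */
#ifndef IORING_SQ_NEED_WAKEUP
#define IORING_SQ_NEED_WAKEUP (1U << 0)
#endif

/* Hypothetical view of the two mapped SQ ring fields used below. */
struct sq_ring_view {
    _Atomic uint32_t *tail;   /* ptr + sq_off.tail  */
    _Atomic uint32_t *flags;  /* ptr + sq_off.flags */
};

/*
 * Publish a new tail, then report whether io_uring_enter(2) must be
 * called with IORING_ENTER_SQ_WAKEUP to wake the polling thread.
 */
static bool sq_publish_and_check_wakeup(struct sq_ring_view *sq,
                                        uint32_t new_tail)
{
    /* Release: SQE contents become visible before the new tail does. */
    atomic_store_explicit(sq->tail, new_tail, memory_order_release);

    /* Assumed requirement: keep the flags load from being reordered
     * before the tail store. */
    atomic_thread_fence(memory_order_seq_cst);

    uint32_t flags = atomic_load_explicit(sq->flags, memory_order_acquire);
    return (flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

When the function returns true, the application would follow up with an io_uring_enter(2) call carrying IORING_ENTER_SQ_WAKEUP. Otherwise no system call is needed for this submission.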
96
97 IORING_SETUP_SQ_AFF
If this flag is specified, then the poll thread will be bound to
the CPU set in the sq_thread_cpu field of the struct
io_uring_params. This flag is only meaningful when
IORING_SETUP_SQPOLL is specified. When the cgroup setting cpuset.cpus
changes (typically in a container environment), the set of CPUs the
thread is bound to may change as well.
104
105 IORING_SETUP_CQSIZE
106 Create the completion queue with struct io_uring_params.cq_en‐
107 tries entries. The value must be greater than entries, and may
108 be rounded up to the next power-of-two.
109
110 IORING_SETUP_CLAMP
If this flag is specified, and if entries exceeds
IORING_MAX_ENTRIES, then entries will be clamped at
IORING_MAX_ENTRIES. If the flag IORING_SETUP_CQSIZE is set, and if
the value of struct io_uring_params.cq_entries exceeds
IORING_MAX_CQ_ENTRIES, then it will be clamped at
IORING_MAX_CQ_ENTRIES.
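The sizing rules for IORING_SETUP_CQSIZE and IORING_SETUP_CLAMP can be modelled in user space roughly as follows. This is a sketch, not kernel code. The MODEL_* limits and flag bits are hypothetical stand-ins, the numeric limits are assumed values for the kernel-internal constants, and the default of a CQ twice the size of the SQ is likewise an assumption about current kernels.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values of the kernel-internal limits (illustrative only). */
#define MODEL_MAX_ENTRIES     32768u
#define MODEL_MAX_CQ_ENTRIES  (2u * MODEL_MAX_ENTRIES)

/* Hypothetical stand-ins for IORING_SETUP_CQSIZE / IORING_SETUP_CLAMP. */
#define MODEL_CQSIZE  (1u << 0)
#define MODEL_CLAMP   (1u << 1)

/* Round up to a power of two; callers pass values <= 65536. */
static uint32_t round_up_pow2(uint32_t v)
{
    uint32_t r = 1;
    while (r < v)
        r <<= 1;
    return r;
}

/* Returns 0 and fills *sq_out / *cq_out, or -EINVAL on a bad request. */
static int model_ring_sizes(uint32_t entries, uint32_t cq_request,
                            unsigned flags,
                            uint32_t *sq_out, uint32_t *cq_out)
{
    if (entries == 0)
        return -EINVAL;
    if (entries > MODEL_MAX_ENTRIES) {
        if (!(flags & MODEL_CLAMP))
            return -EINVAL;
        entries = MODEL_MAX_ENTRIES;        /* clamped, not rejected */
    }
    uint32_t sq = round_up_pow2(entries);
    uint32_t cq;

    if (flags & MODEL_CQSIZE) {
        if (cq_request == 0)
            return -EINVAL;
        if (cq_request > MODEL_MAX_CQ_ENTRIES) {
            if (!(flags & MODEL_CLAMP))
                return -EINVAL;
            cq_request = MODEL_MAX_CQ_ENTRIES;
        }
        cq = round_up_pow2(cq_request);
        if (cq < sq)                        /* assumed lower bound */
            return -EINVAL;
    } else {
        cq = 2 * sq;                        /* assumed default sizing */
    }
    *sq_out = sq;
    *cq_out = cq;
    return 0;
}
```

For instance, a request for 100 entries with no flags yields a 128-entry SQ in this model, while a request for 40000 entries is rejected unless the clamp flag is given.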
116
117 IORING_SETUP_ATTACH_WQ
118 This flag should be set in conjunction with struct io_ur‐
119 ing_params.wq_fd being set to an existing io_uring ring file de‐
120 scriptor. When set, the io_uring instance being created will
121 share the asynchronous worker thread backend of the specified
122 io_uring ring, rather than create a new separate thread pool.
123
124 IORING_SETUP_R_DISABLED
125 If this flag is specified, the io_uring ring starts in a dis‐
126 abled state. In this state, restrictions can be registered, but
127 submissions are not allowed. See io_uring_register(2) for de‐
128 tails on how to enable the ring. Available since 5.10.
129
130 IORING_SETUP_SUBMIT_ALL
Normally io_uring stops submitting a batch of requests if one of
these requests results in an error. This can cause fewer requests
to be submitted than expected if a request fails while being
submitted. If the ring is created with this flag,
io_uring_enter(2) will continue submitting requests even if it
encounters an error submitting a request. CQEs are still posted
for failed requests regardless of whether this flag is set at
ring creation time; the only difference is whether the submit
sequence is halted or continued when an error is observed.
Available since 5.18.
141
142 IORING_SETUP_COOP_TASKRUN
By default, io_uring will interrupt a task running in userspace
when a completion event comes in. This is to ensure that
completions run in a timely manner. For a lot of use cases, this is
overkill and can reduce performance through the inter-processor
interrupt used to do this, the kernel/user transition, the
needless interruption of the task's userspace activities, and
reduced batching if completions come in at a rapid rate. Most
applications don't need the forceful interruption, as the events
are processed at any kernel/user transition. The exception is
setups where the application uses multiple threads operating on
the same ring, where the thread waiting on completions isn't the
one that submitted them. For most other use cases, setting this
flag will improve performance. Available since 5.19.
157
158 IORING_SETUP_TASKRUN_FLAG
159 Used in conjunction with IORING_SETUP_COOP_TASKRUN, this pro‐
160 vides a flag, IORING_SQ_TASKRUN, which is set in the SQ ring
161 flags whenever completions are pending that should be processed.
162 liburing will check for this flag even when doing io_ur‐
163 ing_peek_cqe(3) and enter the kernel to process them, and appli‐
164 cations can do the same. This makes IORING_SETUP_TASKRUN_FLAG
165 safe to use even when applications rely on a peek style opera‐
166 tion on the CQ ring to see if anything might be pending to reap.
167 Available since 5.19.
168
169 IORING_SETUP_SQE128
170 If set, io_uring will use 128-byte SQEs rather than the normal
171 64-byte sized variant. This is a requirement for using certain
172 request types, as of 5.19 only the IORING_OP_URING_CMD
173 passthrough command for NVMe passthrough needs this. Available
174 since 5.19.
175
176 IORING_SETUP_CQE32
177 If set, io_uring will use 32-byte CQEs rather than the normal
178 16-byte sized variant. This is a requirement for using certain
179 request types, as of 5.19 only the IORING_OP_URING_CMD
180 passthrough command for NVMe passthrough needs this. Available
181 since 5.19.
182
183 IORING_SETUP_SINGLE_ISSUER
184 A hint to the kernel that only a single task (or thread) will
185 submit requests, which is used for internal optimisations. The
186 submission task is either the task that created the ring, or if
187 IORING_SETUP_R_DISABLED is specified then it is the task that
188 enables the ring through io_uring_register(2). The kernel en‐
189 forces this rule, failing requests with -EEXIST if the restric‐
190 tion is violated. Note that when IORING_SETUP_SQPOLL is set it
191 is considered that the polling task is doing all submissions on
192 behalf of the userspace and so it always complies with the rule
193 disregarding how many userspace tasks do io_uring_enter(2).
194 Available since 6.0.
195
196 IORING_SETUP_DEFER_TASKRUN
197 By default, io_uring will process all outstanding work at the
198 end of any system call or thread interrupt. This can delay the
199 application from making other progress. Setting this flag will
200 hint to io_uring that it should defer work until an io_uring_en‐
201 ter(2) call with the IORING_ENTER_GETEVENTS flag set. This al‐
202 lows the application to request work to run just before it wants
203 to process completions. This flag requires the IOR‐
204 ING_SETUP_SINGLE_ISSUER flag to be set, and also enforces that
205 the call to io_uring_enter(2) is called from the same thread
206 that submitted requests. Note that if this flag is set then it
207 is the application's responsibility to periodically trigger work
208 (for example via any of the CQE waiting functions) or else com‐
209 pletions may not be delivered. Available since 6.1.
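Two of the flag dependencies described above can be checked before calling io_uring_setup(2), as in the sketch below. The function name is hypothetical. The flag values are those of <linux/io_uring.h> to the best of my knowledge and are defined only as a fallback. Returning -EINVAL for a missing IORING_SETUP_SINGLE_ISSUER is an assumption, since the text only states the requirement.

```c
#include <errno.h>

/* Fallback definitions for builds without recent kernel headers. */
#ifndef IORING_SETUP_SQPOLL
#define IORING_SETUP_SQPOLL         (1U << 1)
#endif
#ifndef IORING_SETUP_SQ_AFF
#define IORING_SETUP_SQ_AFF         (1U << 2)
#endif
#ifndef IORING_SETUP_SINGLE_ISSUER
#define IORING_SETUP_SINGLE_ISSUER  (1U << 12)
#endif
#ifndef IORING_SETUP_DEFER_TASKRUN
#define IORING_SETUP_DEFER_TASKRUN  (1U << 13)
#endif

/* Hypothetical pre-flight check: 0 if the combination looks valid. */
static int check_flag_dependencies(unsigned flags)
{
    /* IORING_SETUP_SQ_AFF is only meaningful with IORING_SETUP_SQPOLL. */
    if ((flags & IORING_SETUP_SQ_AFF) && !(flags & IORING_SETUP_SQPOLL))
        return -EINVAL;

    /* IORING_SETUP_DEFER_TASKRUN requires IORING_SETUP_SINGLE_ISSUER. */
    if ((flags & IORING_SETUP_DEFER_TASKRUN) &&
        !(flags & IORING_SETUP_SINGLE_ISSUER))
        return -EINVAL;

    return 0;
}
```

Such a check only mirrors the documented rules. The kernel remains the final authority and may reject other combinations.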
210
If no flags are specified, the io_uring instance is set up for
interrupt driven I/O. I/O may be submitted using io_uring_enter(2) and
can be reaped by polling the completion queue.
214
215 The resv array must be initialized to zero.
216
features is filled in by the kernel and specifies the features
supported by the running kernel version.
219
220 IORING_FEAT_SINGLE_MMAP
221 If this flag is set, the two SQ and CQ rings can be mapped with
222 a single mmap(2) call. The SQEs must still be allocated sepa‐
223 rately. This brings the necessary mmap(2) calls down from three
224 to two. Available since kernel 5.4.
225
226 IORING_FEAT_NODROP
If this flag is set, io_uring supports almost never dropping
completion events. If a completion event occurs and the CQ ring
is full, the kernel stores the event internally until such a
time that the CQ ring has room for more entries. If this
overflow condition is entered, attempting to submit more IO will
fail with the -EBUSY error value if the kernel can't flush the
overflown events to the CQ ring. If this happens, the application
must reap events from the CQ ring and attempt the submit again.
If the kernel has no free memory to store the event internally,
the loss is visible as an increase in the overflow value on the
cqring. Additionally, since kernel 5.19, io_uring_enter(2) will
return -EBADR the next time it would otherwise sleep waiting for
completions. Available since kernel 5.5.
240
241
242 IORING_FEAT_SUBMIT_STABLE
243 If this flag is set, applications can be certain that any data
244 for async offload has been consumed when the kernel has consumed
245 the SQE. Available since kernel 5.5.
246
247 IORING_FEAT_RW_CUR_POS
248 If this flag is set, applications can specify offset == -1 with
249 IORING_OP_{READV,WRITEV} , IORING_OP_{READ,WRITE}_FIXED , and
250 IORING_OP_{READ,WRITE} to mean current file position, which be‐
251 haves like preadv2(2) and pwritev2(2) with offset == -1. It'll
252 use (and update) the current file position. This obviously comes
253 with the caveat that if the application has multiple reads or
254 writes in flight, then the end result will not be as expected.
255 This is similar to threads sharing a file descriptor and doing
256 IO using the current file position. Available since kernel 5.6.
257
258 IORING_FEAT_CUR_PERSONALITY
If this flag is set, then io_uring guarantees that both sync and
async execution of a request assumes the credentials of the task
that called io_uring_enter(2) to queue the requests. If this
flag isn't set, then requests are issued with the credentials of
the task that originally registered the io_uring. If only one
task is using a ring, then this flag doesn't matter as the
credentials will always be the same. Note that this is only the
default behavior; tasks can still register different personalities
through io_uring_register(2) with IORING_REGISTER_PERSONALITY
and specify the personality to use in the sqe. Available since
kernel 5.6.
270
271 IORING_FEAT_FAST_POLL
272 If this flag is set, then io_uring supports using an internal
273 poll mechanism to drive data/space readiness. This means that
274 requests that cannot read or write data to a file no longer need
275 to be punted to an async thread for handling, instead they will
276 begin operation when the file is ready. This is similar to doing
277 poll + read/write in userspace, but eliminates the need to do
278 so. If this flag is set, requests waiting on space/data consume
279 a lot less resources doing so as they are not blocking a thread.
280 Available since kernel 5.7.
281
282 IORING_FEAT_POLL_32BITS
283 If this flag is set, the IORING_OP_POLL_ADD command accepts the
284 full 32-bit range of epoll based flags. Most notably EPOLLEXCLU‐
285 SIVE which allows exclusive (waking single waiters) behavior.
286 Available since kernel 5.9.
287
288 IORING_FEAT_SQPOLL_NONFIXED
289 If this flag is set, the IORING_SETUP_SQPOLL feature no longer
290 requires the use of fixed files. Any normal file descriptor can
291 be used for IO commands without needing registration. Available
292 since kernel 5.11.
293
294 IORING_FEAT_ENTER_EXT_ARG
If this flag is set, then the io_uring_enter(2) system call
supports passing in an extended argument instead of just the
sigset_t of earlier kernels. This extended argument is of type
struct io_uring_getevents_arg and allows the caller to pass in
both a sigset_t and a timeout argument for waiting on events.
The struct layout is as follows:
301
    struct io_uring_getevents_arg {
        __u64 sigmask;
        __u32 sigmask_sz;
        __u32 pad;
        __u64 ts;
    };
308
309 and a pointer to this struct must be passed in if IORING_EN‐
310 TER_EXT_ARG is set in the flags for the enter system call.
311 Available since kernel 5.11.
312
313 IORING_FEAT_NATIVE_WORKERS
314 If this flag is set, io_uring is using native workers for its
315 async helpers. Previous kernels used kernel threads that as‐
316 sumed the identity of the original io_uring owning task, but
317 later kernels will actively create what looks more like regular
318 process threads instead. Available since kernel 5.12.
319
320 IORING_FEAT_RSRC_TAGS
321 If this flag is set, then io_uring supports a variety of fea‐
322 tures related to fixed files and buffers. In particular, it in‐
323 dicates that registered buffers can be updated in-place, whereas
324 before the full set would have to be unregistered first. Avail‐
325 able since kernel 5.13.
326
327 IORING_FEAT_CQE_SKIP
328 If this flag is set, then io_uring supports setting
329 IOSQE_CQE_SKIP_SUCCESS in the submitted SQE, indicating that no
330 CQE should be generated for this SQE if it executes normally. If
331 an error happens processing the SQE, a CQE with the appropriate
332 error value will still be generated. Available since kernel
333 5.17.
334
335 IORING_FEAT_LINKED_FILE
If this flag is set, then io_uring supports sane assignment of
files for SQEs that have dependencies. For example, if a chain
of SQEs is submitted with IOSQE_IO_LINK, then kernels without
this flag will prepare the file for each link upfront. If a
previous link opens a file with a known index, e.g. if direct
descriptors are used with open or accept, then file assignment
needs to happen post execution of that SQE. If this flag is set,
then the kernel will defer file assignment until execution of a
given request is started. Available since kernel 5.17.
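After a successful setup call, the features mask can be decoded for logging or capability checks. The sketch below maps bit positions to the names used in this section. It assumes that the bits are assigned in the order in which the flags are listed here, which matches <linux/io_uring.h> to the best of my knowledge. The table and function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Names in assumed bit order: index 0 is bit 0, and so on. */
static const char *const feature_names[] = {
    "IORING_FEAT_SINGLE_MMAP",
    "IORING_FEAT_NODROP",
    "IORING_FEAT_SUBMIT_STABLE",
    "IORING_FEAT_RW_CUR_POS",
    "IORING_FEAT_CUR_PERSONALITY",
    "IORING_FEAT_FAST_POLL",
    "IORING_FEAT_POLL_32BITS",
    "IORING_FEAT_SQPOLL_NONFIXED",
    "IORING_FEAT_ENTER_EXT_ARG",
    "IORING_FEAT_NATIVE_WORKERS",
    "IORING_FEAT_RSRC_TAGS",
    "IORING_FEAT_CQE_SKIP",
    "IORING_FEAT_LINKED_FILE",
};

/* Print every recognised feature bit; return how many were set. */
static int print_features(FILE *out, uint32_t features)
{
    const unsigned n = sizeof(feature_names) / sizeof(feature_names[0]);
    int known = 0;

    for (unsigned bit = 0; bit < n; bit++) {
        if (features & (UINT32_C(1) << bit)) {
            fprintf(out, "%s\n", feature_names[bit]);
            known++;
        }
    }
    return known;
}
```

A typical call would be print_features(stderr, p.features) right after io_uring_setup(2) returns. Bits beyond the table are silently ignored by this sketch.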
345
346
347 The rest of the fields in the struct io_uring_params are filled in by
348 the kernel, and provide the information necessary to memory map the
349 submission queue, completion queue, and the array of submission queue
350 entries. sq_entries specifies the number of submission queue entries
351 allocated. sq_off describes the offsets of various ring buffer fields:
352
    struct io_sqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 flags;
        __u32 dropped;
        __u32 array;
        __u32 resv[3];
    };
363
364 Taken together, sq_entries and sq_off provide all of the information
365 necessary for accessing the submission queue ring buffer and the sub‐
366 mission queue entry array. The submission queue can be mapped with a
367 call like:
368
    ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
               PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
               ring_fd, IORING_OFF_SQ_RING);
372
where sq_off is the io_sqring_offsets structure, and ring_fd is the
file descriptor returned from io_uring_setup(2). The addition of
sq_off.array to the length of the region accounts for the fact that the
ring is located at the end of the data structure. As an example, the
ring buffer head pointer can be accessed by adding sq_off.head to the
address returned from mmap(2):
379
    head = ptr + sq_off.head;
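The same pattern applies to every field described by sq_off. As a sketch, the offsets can be resolved once into typed pointers. The two struct types and the function name below are hypothetical, and the offsets structure is a local mirror so that the example compiles without kernel headers. A byte pointer is used for the arithmetic because adding to a void pointer is a compiler extension.

```c
#include <stdint.h>

/* Local mirror of the offset fields of struct io_sqring_offsets. */
struct sq_offsets_sketch {
    uint32_t head, tail, ring_mask, ring_entries, flags, dropped, array;
};

/* Typed pointers into the mapped SQ ring region. */
struct sq_pointers {
    uint32_t *head, *tail, *ring_mask, *ring_entries;
    uint32_t *flags, *dropped, *array;
};

/* Resolve byte offsets against the address returned by mmap(2). */
static struct sq_pointers sq_resolve(void *ptr,
                                     const struct sq_offsets_sketch *off)
{
    uint8_t *base = ptr;    /* offsets are expressed in bytes */
    struct sq_pointers sq = {
        .head         = (uint32_t *)(base + off->head),
        .tail         = (uint32_t *)(base + off->tail),
        .ring_mask    = (uint32_t *)(base + off->ring_mask),
        .ring_entries = (uint32_t *)(base + off->ring_entries),
        .flags        = (uint32_t *)(base + off->flags),
        .dropped      = (uint32_t *)(base + off->dropped),
        .array        = (uint32_t *)(base + off->array),
    };
    return sq;
}
```

In real use the fields shared with the kernel should be accessed with suitable atomic loads and stores. The plain pointers here only illustrate the address computation.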
381
382 The flags field is used by the kernel to communicate state information
383 to the application. Currently, it is used to inform the application
384 when a call to io_uring_enter(2) is necessary. See the documentation
385 for the IORING_SETUP_SQPOLL flag above. The dropped member is incre‐
386 mented for each invalid submission queue entry encountered in the ring
387 buffer.
388
389 The head and tail track the ring buffer state. The tail is incremented
390 by the application when submitting new I/O, and the head is incremented
391 by the kernel when the I/O has been successfully submitted. Determin‐
392 ing the index of the head or tail into the ring is accomplished by ap‐
393 plying a mask:
394
    index = tail & ring_mask;
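Because the mask is applied only when indexing, head and tail can be treated as free-running 32-bit counters, and unsigned subtraction gives the number of pending entries even after the counters wrap. The helper names in this sketch are hypothetical, and it assumes that the ring size is a power of two with ring_mask equal to ring_entries minus one.

```c
#include <stdint.h>

/* Slot in the ring that a free-running counter refers to. */
static uint32_t ring_index(uint32_t counter, uint32_t ring_mask)
{
    return counter & ring_mask;
}

/* Entries published by the application but not yet consumed. */
static uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;     /* well defined across wraparound */
}

/* Slots still available to the application. */
static uint32_t ring_free(uint32_t head, uint32_t tail,
                          uint32_t ring_entries)
{
    return ring_entries - ring_pending(head, tail);
}
```

For example, with ring_mask equal to 7, a tail of 10 maps to slot 2, and a head of 0xFFFFFFFE with a tail of 1 still reports three pending entries.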
396
397 The array of submission queue entries is mapped with:
398
    sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                     PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                     ring_fd, IORING_OFF_SQES);
402
403 The completion queue is described by cq_entries and cq_off shown here:
404
    struct io_cqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 overflow;
        __u32 cqes;
        __u32 flags;
        __u32 resv[3];
    };
415
416 The completion queue is simpler, since the entries are not separated
417 from the queue itself, and can be mapped with:
418
    ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
               PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
               IORING_OFF_CQ_RING);
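The three mapping lengths used above can be computed in one place, as in the following sketch. The type and function names are hypothetical. Entry sizes are passed in so that the 64- or 128-byte SQE and 16- or 32-byte CQE variants can both be handled. Using the larger of the two ring sizes when IORING_FEAT_SINGLE_MMAP is available is an assumption modelled on what liburing does, to my knowledge.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Lengths to pass to mmap(2) for each region. */
struct ring_map_sizes {
    size_t sq_ring;   /* IORING_OFF_SQ_RING */
    size_t cq_ring;   /* IORING_OFF_CQ_RING */
    size_t sqes;      /* IORING_OFF_SQES    */
};

static struct ring_map_sizes
compute_map_sizes(uint32_t sq_array_off, uint32_t sq_entries,
                  uint32_t cq_cqes_off, uint32_t cq_entries,
                  size_t sqe_size, size_t cqe_size, bool single_mmap)
{
    struct ring_map_sizes m;

    /* The SQ ring ends with an array of 32-bit indices. */
    m.sq_ring = sq_array_off + (size_t)sq_entries * sizeof(uint32_t);
    /* The CQ ring ends with the CQEs themselves. */
    m.cq_ring = cq_cqes_off + (size_t)cq_entries * cqe_size;
    /* The SQE array is always a separate mapping. */
    m.sqes = (size_t)sq_entries * sqe_size;

    if (single_mmap) {
        /* One mapping covers both rings, so size it for the larger. */
        size_t larger = m.sq_ring > m.cq_ring ? m.sq_ring : m.cq_ring;
        m.sq_ring = larger;
        m.cq_ring = larger;
    }
    return m;
}
```

The offsets and entry counts would come from the sq_off, cq_off, sq_entries and cq_entries fields that the kernel fills in.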
422
423 Closing the file descriptor returned by io_uring_setup(2) will free all
424 resources associated with the io_uring context.
425
RETURN VALUE
io_uring_setup(2) returns a new file descriptor on success. The
application may then provide the file descriptor in a subsequent mmap(2)
call to map the submission and completion queues, or to the
io_uring_register(2) or io_uring_enter(2) system calls.

On error, a negative error code is returned. The caller should not rely
on the errno variable.
434
ERRORS
EFAULT params is outside your accessible address space.

EINVAL The resv array contains non-zero data, p->flags contains an
unsupported flag, entries is out of bounds, IORING_SETUP_SQ_AFF
was specified but IORING_SETUP_SQPOLL was not, or
IORING_SETUP_CQSIZE was specified but io_uring_params.cq_entries
was invalid.
443
444 EMFILE The per-process limit on the number of open file descriptors has
445 been reached (see the description of RLIMIT_NOFILE in getr‐
446 limit(2)).
447
448 ENFILE The system-wide limit on the total number of open files has been
449 reached.
450
451 ENOMEM Insufficient kernel resources are available.
452
453 EPERM IORING_SETUP_SQPOLL was specified, but the effective user ID of
454 the caller did not have sufficient privileges.
455
SEE ALSO
io_uring_register(2), io_uring_enter(2)

Linux                          2019-01-29                io_uring_setup(2)