io_uring_setup(2)          Linux Programmer's Manual          io_uring_setup(2)
2
3
4
io_uring_setup - set up a context for performing asynchronous I/O
7
9 #include <liburing.h>
10
11 int io_uring_setup(u32 entries, struct io_uring_params *p);
12
14 The io_uring_setup(2) system call sets up a submission queue (SQ) and
15 completion queue (CQ) with at least entries entries, and returns a file
16 descriptor which can be used to perform subsequent operations on the
17 io_uring instance. The submission and completion queues are shared be‐
18 tween userspace and the kernel, which eliminates the need to copy data
19 when initiating and completing I/O.
20
The p argument points to a struct io_uring_params, which is used by the
application to pass options to the kernel, and by the kernel to convey
information about the ring buffers.
23
24 struct io_uring_params {
25 __u32 sq_entries;
26 __u32 cq_entries;
27 __u32 flags;
28 __u32 sq_thread_cpu;
29 __u32 sq_thread_idle;
30 __u32 features;
31 __u32 wq_fd;
32 __u32 resv[3];
33 struct io_sqring_offsets sq_off;
34 struct io_cqring_offsets cq_off;
35 };
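
As an illustration of the call described above, the following is a
minimal sketch that invokes the system call directly through syscall(2)
instead of going through liburing. The helper name ring_setup is
hypothetical, and the sketch assumes that the installed kernel headers
provide <linux/io_uring.h> and __NR_io_uring_setup. Because syscall(2)
reports failures through errno, the helper converts the failure into a
negative error code.

```c
#include <assert.h>
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: zero the parameter block (which also clears the
 * resv array), request `entries` submission queue entries with no
 * flags, and return the ring file descriptor. On failure, return the
 * negative errno value.
 */
static int ring_setup(unsigned entries, struct io_uring_params *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));

    long fd = syscall(__NR_io_uring_setup, entries, p);
    return fd < 0 ? -errno : (int)fd;
}
```

On success the kernel has filled in p->sq_entries, p->cq_entries,
p->features and the two offset structures, and the descriptor should
eventually be released with close(2).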
36
37 The flags, sq_thread_cpu, and sq_thread_idle fields are used to config‐
38 ure the io_uring instance. flags is a bit mask of 0 or more of the
39 following values ORed together:
40
41 IORING_SETUP_IOPOLL
42 Perform busy-waiting for an I/O completion, as opposed to get‐
43 ting notifications via an asynchronous IRQ (Interrupt Request).
44 The file system (if any) and block device must support polling
45 in order for this to work. Busy-waiting provides lower latency,
46 but may consume more CPU resources than interrupt driven I/O.
47 Currently, this feature is usable only on a file descriptor
48 opened using the O_DIRECT flag. When a read or write is submit‐
49 ted to a polled context, the application must poll for comple‐
50 tions on the CQ ring by calling io_uring_enter(2). It is ille‐
51 gal to mix and match polled and non-polled I/O on an io_uring
52 instance.
53
54 This is only applicable for storage devices for now, and the
55 storage device must be configured for polling. How to do that
56 depends on the device type in question. For NVMe devices, the
57 nvme driver must be loaded with the poll_queues parameter set to
58 the desired number of polling queues. The polling queues will be
59 shared appropriately between the CPUs in the system, if the num‐
60 ber is less than the number of online CPU threads.
61
62
63 IORING_SETUP_SQPOLL
64 When this flag is specified, a kernel thread is created to per‐
65 form submission queue polling. An io_uring instance configured
66 in this way enables an application to issue I/O without ever
67 context switching into the kernel. By using the submission
68 queue to fill in new submission queue entries and watching for
69 completions on the completion queue, the application can submit
70 and reap I/Os without doing a single system call.
71
72 If the kernel thread is idle for more than sq_thread_idle mil‐
73 liseconds, it will set the IORING_SQ_NEED_WAKEUP bit in the
74 flags field of the struct io_sq_ring. When this happens, the
75 application must call io_uring_enter(2) to wake the kernel
76 thread. If I/O is kept busy, the kernel thread will never
77 sleep. An application making use of this feature will need to
78 guard the io_uring_enter(2) call with the following code se‐
79 quence:
80
81 /*
82 * Ensure that the wakeup flag is read after the tail pointer
83 * has been written. It's important to use memory load acquire
84 * semantics for the flags read, as otherwise the application
85 * and the kernel might not agree on the consistency of the
86 * wakeup flag.
87 */
unsigned flags = atomic_load_explicit(sq_ring->flags, memory_order_acquire);
if (flags & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP, NULL);
91
92 where sq_ring is a submission queue ring setup using the struct
93 io_sqring_offsets described below.
94
Note that, when using a ring set up with IORING_SETUP_SQPOLL
through liburing, you normally do not call the io_uring_enter(2)
system call directly. That is usually taken care of by liburing's
io_uring_submit(3) function. It automatically determines whether
you are using polling mode, and calls io_uring_enter(2) only when
your program needs it.
101
102 Before version 5.11 of the Linux kernel, to successfully use
103 this feature, the application must register a set of files to be
104 used for IO through io_uring_register(2) using the IORING_REGIS‐
105 TER_FILES opcode. Failure to do so will result in submitted IO
106 being errored with EBADF. The presence of this feature can be
107 detected by the IORING_FEAT_SQPOLL_NONFIXED feature flag. In
108 version 5.11 and later, it is no longer necessary to register
109 files to use this feature. 5.11 also allows using this as non-
110 root, if the user has the CAP_SYS_NICE capability. In 5.13 this
111 requirement was also relaxed, and no special privileges are
112 needed for SQPOLL in newer kernels. Certain stable kernels older
113 than 5.13 may also support unprivileged SQPOLL.
114
115 IORING_SETUP_SQ_AFF
116 If this flag is specified, then the poll thread will be bound to
117 the cpu set in the sq_thread_cpu field of the struct io_ur‐
118 ing_params. This flag is only meaningful when IOR‐
119 ING_SETUP_SQPOLL is specified. When cgroup setting cpuset.cpus
120 changes (typically in container environment), the bounded cpu
121 set may be changed as well.
122
123 IORING_SETUP_CQSIZE
124 Create the completion queue with struct io_uring_params.cq_en‐
125 tries entries. The value must be greater than entries, and may
126 be rounded up to the next power-of-two.
127
128 IORING_SETUP_CLAMP
If this flag is specified, and if entries exceeds
IORING_MAX_ENTRIES, then entries will be clamped at
IORING_MAX_ENTRIES. If the flag IORING_SETUP_CQSIZE is set, and if
the value of struct io_uring_params.cq_entries exceeds
IORING_MAX_CQ_ENTRIES, then it will be clamped at
IORING_MAX_CQ_ENTRIES.
134
135 IORING_SETUP_ATTACH_WQ
136 This flag should be set in conjunction with struct io_ur‐
137 ing_params.wq_fd being set to an existing io_uring ring file de‐
138 scriptor. When set, the io_uring instance being created will
139 share the asynchronous worker thread backend of the specified
140 io_uring ring, rather than create a new separate thread pool.
141
142 IORING_SETUP_R_DISABLED
143 If this flag is specified, the io_uring ring starts in a dis‐
144 abled state. In this state, restrictions can be registered, but
145 submissions are not allowed. See io_uring_register(2) for de‐
146 tails on how to enable the ring. Available since 5.10.
147
148 IORING_SETUP_SUBMIT_ALL
Normally io_uring stops submitting a batch of requests if one of
these requests results in an error. This can cause fewer requests
to be submitted than expected if a request ends in error while
being submitted. If the ring is created with this flag,
io_uring_enter(2) will continue submitting requests even if it
encounters an error submitting a request. CQEs are still posted
for errored requests regardless of whether or not this flag is set
at ring creation time; the only difference is whether the submit
sequence is halted or continued when an error is observed.
Available since 5.18.
159
160 IORING_SETUP_COOP_TASKRUN
By default, io_uring will interrupt a task running in userspace
when a completion event comes in. This is to ensure that
completions run in a timely manner. For many use cases this is
overkill and can reduce performance because of the inter-processor
interrupt used to do this, the kernel/user transition, the
needless interruption of the task's userspace activities, and
reduced batching if completions come in at a rapid rate. Most
applications don't need the forceful interruption, as the events
are processed at any kernel/user transition. The exception is a
setup where the application uses multiple threads operating on the
same ring, and the thread waiting on completions isn't the one
that submitted them. For most other use cases, setting this flag
will improve performance. Available since 5.19.
175
176 IORING_SETUP_TASKRUN_FLAG
177 Used in conjunction with IORING_SETUP_COOP_TASKRUN, this pro‐
178 vides a flag, IORING_SQ_TASKRUN, which is set in the SQ ring
179 flags whenever completions are pending that should be processed.
180 liburing will check for this flag even when doing io_ur‐
181 ing_peek_cqe(3) and enter the kernel to process them, and appli‐
182 cations can do the same. This makes IORING_SETUP_TASKRUN_FLAG
183 safe to use even when applications rely on a peek style opera‐
184 tion on the CQ ring to see if anything might be pending to reap.
185 Available since 5.19.
186
187 IORING_SETUP_SQE128
If set, io_uring will use 128-byte SQEs rather than the normal
64-byte sized variant. This is a requirement for using certain
request types; as of 5.19, only the IORING_OP_URING_CMD
passthrough command for NVMe passthrough needs this. Available
since 5.19.
193
194 IORING_SETUP_CQE32
If set, io_uring will use 32-byte CQEs rather than the normal
16-byte sized variant. This is a requirement for using certain
request types; as of 5.19, only the IORING_OP_URING_CMD
passthrough command for NVMe passthrough needs this. Available
since 5.19.
200
201 IORING_SETUP_SINGLE_ISSUER
202 A hint to the kernel that only a single task (or thread) will
203 submit requests, which is used for internal optimisations. The
204 submission task is either the task that created the ring, or if
205 IORING_SETUP_R_DISABLED is specified then it is the task that
206 enables the ring through io_uring_register(2). The kernel en‐
207 forces this rule, failing requests with -EEXIST if the restric‐
208 tion is violated. Note that when IORING_SETUP_SQPOLL is set it
209 is considered that the polling task is doing all submissions on
210 behalf of the userspace and so it always complies with the rule
211 disregarding how many userspace tasks do io_uring_enter(2).
212 Available since 6.0.
213
214 IORING_SETUP_DEFER_TASKRUN
215 By default, io_uring will process all outstanding work at the
216 end of any system call or thread interrupt. This can delay the
217 application from making other progress. Setting this flag will
218 hint to io_uring that it should defer work until an io_uring_en‐
219 ter(2) call with the IORING_ENTER_GETEVENTS flag set. This al‐
220 lows the application to request work to run just before it wants
221 to process completions. This flag requires the IOR‐
222 ING_SETUP_SINGLE_ISSUER flag to be set, and also enforces that
223 the call to io_uring_enter(2) is called from the same thread
224 that submitted requests. Note that if this flag is set then it
225 is the application's responsibility to periodically trigger work
226 (for example via any of the CQE waiting functions) or else com‐
227 pletions may not be delivered. Available since 6.1.
228
229 IORING_SETUP_NO_MMAP
230 By default, io_uring allocates kernel memory that callers must
231 subsequently mmap(2). If this flag is set, io_uring instead
232 uses caller-allocated buffers; p->cq_off.user_addr must point to
233 the memory for the sq/cq rings, and p->sq_off.user_addr must
234 point to the memory for the sqes. Each allocation must be con‐
235 tiguous memory. Typically, callers should allocate this memory
236 by using mmap(2) to allocate a huge page. If this flag is set,
237 a subsequent attempt to mmap(2) the io_uring file descriptor
238 will fail. Available since 6.5.
239
240 IORING_SETUP_REGISTERED_FD_ONLY
241 If this flag is set, io_uring will register the ring file de‐
242 scriptor, and return the registered descriptor index, without
243 ever allocating an unregistered file descriptor. The caller will
244 need to use IORING_REGISTER_USE_REGISTERED_RING when calling
245 io_uring_register(2).
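
Two of the flag dependencies described above (IORING_SETUP_SQ_AFF is
only meaningful together with IORING_SETUP_SQPOLL, and
IORING_SETUP_DEFER_TASKRUN requires IORING_SETUP_SINGLE_ISSUER) could be
checked by an application before making the call. The following is a
sketch only: the EX_ macros are local stand-ins whose bit positions are
assumed to match <linux/io_uring.h>, check_setup_flags is a hypothetical
helper, and the kernel performs its own, more complete validation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local stand-ins; bit positions assumed to match <linux/io_uring.h>. */
#define EX_SETUP_SQPOLL        (1U << 1)
#define EX_SETUP_SQ_AFF        (1U << 2)
#define EX_SETUP_SINGLE_ISSUER (1U << 12)
#define EX_SETUP_DEFER_TASKRUN (1U << 13)

/* Hypothetical pre-flight check: returns 0 if the two rules hold, else -EINVAL. */
static int check_setup_flags(uint32_t flags)
{
    /* SQ_AFF binds the poll thread, so it needs SQPOLL. */
    if ((flags & EX_SETUP_SQ_AFF) && !(flags & EX_SETUP_SQPOLL))
        return -EINVAL;

    /* DEFER_TASKRUN requires SINGLE_ISSUER. */
    if ((flags & EX_SETUP_DEFER_TASKRUN) && !(flags & EX_SETUP_SINGLE_ISSUER))
        return -EINVAL;

    return 0;
}
```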
246
247
If no flags are specified, the io_uring instance is set up for interrupt
driven I/O. I/O may be submitted using io_uring_enter(2) and can be
reaped by polling the completion queue.
251
252 The resv array must be initialized to zero.
253
features is filled in by the kernel. It is a bit mask that specifies the
features supported by the running kernel version:
256
257 IORING_FEAT_SINGLE_MMAP
258 If this flag is set, the two SQ and CQ rings can be mapped with
259 a single mmap(2) call. The SQEs must still be allocated sepa‐
260 rately. This brings the necessary mmap(2) calls down from three
261 to two. Available since kernel 5.4.
262
263 IORING_FEAT_NODROP
264 If this flag is set, io_uring supports almost never dropping
265 completion events. If a completion event occurs and the CQ ring
266 is full, the kernel stores the event internally until such a
267 time that the CQ ring has room for more entries. If this over‐
268 flow condition is entered, attempting to submit more IO will
269 fail with the -EBUSY error value, if it can't flush the over‐
270 flown events to the CQ ring. If this happens, the application
271 must reap events from the CQ ring and attempt the submit again.
272 If the kernel has no free memory to store the event internally
273 it will be visible by an increase in the overflow value on the
274 cqring. Available since kernel 5.5. Additionally io_uring_en‐
275 ter(2) will return -EBADR the next time it would otherwise sleep
276 waiting for completions (since kernel 5.19).
277
278
279 IORING_FEAT_SUBMIT_STABLE
280 If this flag is set, applications can be certain that any data
281 for async offload has been consumed when the kernel has consumed
282 the SQE. Available since kernel 5.5.
283
284 IORING_FEAT_RW_CUR_POS
If this flag is set, applications can specify offset == -1 with
IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and
IORING_OP_{READ,WRITE} to mean current file position, which
behaves like preadv2(2) and pwritev2(2) with offset == -1. It'll
use (and update) the current file position. This obviously comes
with the caveat that if the application has multiple reads or
writes in flight, then the end result will not be as expected.
This is similar to threads sharing a file descriptor and doing
IO using the current file position. Available since kernel 5.6.

295 IORING_FEAT_CUR_PERSONALITY
If this flag is set, then io_uring guarantees that both sync and
async execution of a request assume the credentials of the task
that called io_uring_enter(2) to queue the requests. If this
flag isn't set, then requests are issued with the credentials of
the task that originally registered the io_uring. If only one
task is using a ring, then this flag doesn't matter as the
credentials will always be the same. Note that this is the default
behavior; tasks can still register different personalities
through io_uring_register(2) with IORING_REGISTER_PERSONALITY
and specify the personality to use in the sqe. Available since
kernel 5.6.
307
308 IORING_FEAT_FAST_POLL
309 If this flag is set, then io_uring supports using an internal
310 poll mechanism to drive data/space readiness. This means that
311 requests that cannot read or write data to a file no longer need
312 to be punted to an async thread for handling, instead they will
313 begin operation when the file is ready. This is similar to doing
314 poll + read/write in userspace, but eliminates the need to do
315 so. If this flag is set, requests waiting on space/data consume
316 a lot less resources doing so as they are not blocking a thread.
317 Available since kernel 5.7.
318
319 IORING_FEAT_POLL_32BITS
320 If this flag is set, the IORING_OP_POLL_ADD command accepts the
321 full 32-bit range of epoll based flags. Most notably EPOLLEXCLU‐
322 SIVE which allows exclusive (waking single waiters) behavior.
323 Available since kernel 5.9.
324
325 IORING_FEAT_SQPOLL_NONFIXED
326 If this flag is set, the IORING_SETUP_SQPOLL feature no longer
327 requires the use of fixed files. Any normal file descriptor can
328 be used for IO commands without needing registration. Available
329 since kernel 5.11.
330
331 IORING_FEAT_ENTER_EXT_ARG
332 If this flag is set, then the io_uring_enter(2) system call sup‐
333 ports passing in an extended argument instead of just the
sigset_t of earlier kernels. This extended argument is of type
335 struct io_uring_getevents_arg and allows the caller to pass in
336 both a sigset_t and a timeout argument for waiting on events.
337 The struct layout is as follows:
338
339 struct io_uring_getevents_arg {
340 __u64 sigmask;
341 __u32 sigmask_sz;
342 __u32 pad;
343 __u64 ts;
344 };
345
346 and a pointer to this struct must be passed in if IORING_EN‐
347 TER_EXT_ARG is set in the flags for the enter system call.
348 Available since kernel 5.11.
349
350 IORING_FEAT_NATIVE_WORKERS
351 If this flag is set, io_uring is using native workers for its
352 async helpers. Previous kernels used kernel threads that as‐
353 sumed the identity of the original io_uring owning task, but
354 later kernels will actively create what looks more like regular
355 process threads instead. Available since kernel 5.12.
356
357 IORING_FEAT_RSRC_TAGS
358 If this flag is set, then io_uring supports a variety of fea‐
359 tures related to fixed files and buffers. In particular, it in‐
360 dicates that registered buffers can be updated in-place, whereas
361 before the full set would have to be unregistered first. Avail‐
362 able since kernel 5.13.
363
364 IORING_FEAT_CQE_SKIP
365 If this flag is set, then io_uring supports setting
366 IOSQE_CQE_SKIP_SUCCESS in the submitted SQE, indicating that no
367 CQE should be generated for this SQE if it executes normally. If
368 an error happens processing the SQE, a CQE with the appropriate
369 error value will still be generated. Available since kernel
370 5.17.
371
372 IORING_FEAT_LINKED_FILE
373 If this flag is set, then io_uring supports sane assignment of
374 files for SQEs that have dependencies. For example, if a chain
375 of SQEs are submitted with IOSQE_IO_LINK, then kernels without
376 this flag will prepare the file for each link upfront. If a
377 previous link opens a file with a known index, eg if direct de‐
378 scriptors are used with open or accept, then file assignment
379 needs to happen post execution of that SQE. If this flag is set,
380 then the kernel will defer file assignment until execution of a
381 given request is started. Available since kernel 5.17.
382
383 IORING_FEAT_REG_REG_RING
384 If this flag is set, then io_uring supports calling io_ur‐
385 ing_register(2) using a registered ring fd, via IORING_REGIS‐
386 TER_USE_REGISTERED_RING. Available since kernel 6.3.
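
An application can test individual bits of the features field after a
successful call. The following sketch covers two of the decisions
described above: how many mmap(2) calls are needed, and whether
IORING_SETUP_SQPOLL still requires registered files. The EX_ macros are
local stand-ins assumed to match the bit positions in
<linux/io_uring.h>, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins; bit positions assumed to match <linux/io_uring.h>. */
#define EX_FEAT_SINGLE_MMAP     (1U << 0)
#define EX_FEAT_SQPOLL_NONFIXED (1U << 7)

/* mmap(2) calls needed for the SQ ring, the CQ ring and the SQE array. */
static int mmap_calls_needed(uint32_t features)
{
    return (features & EX_FEAT_SINGLE_MMAP) ? 2 : 3;
}

/* True if SQPOLL I/O must use files registered with io_uring_register(2). */
static bool sqpoll_needs_registered_files(uint32_t features)
{
    return !(features & EX_FEAT_SQPOLL_NONFIXED);
}
```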
387
388
389 The rest of the fields in the struct io_uring_params are filled in by
390 the kernel, and provide the information necessary to memory map the
391 submission queue, completion queue, and the array of submission queue
392 entries. sq_entries specifies the number of submission queue entries
393 allocated. sq_off describes the offsets of various ring buffer fields:
394
395 struct io_sqring_offsets {
396 __u32 head;
397 __u32 tail;
398 __u32 ring_mask;
399 __u32 ring_entries;
400 __u32 flags;
401 __u32 dropped;
402 __u32 array;
403 __u32 resv1;
404 __u64 user_addr;
405 };
406
407 Taken together, sq_entries and sq_off provide all of the information
408 necessary for accessing the submission queue ring buffer and the sub‐
409 mission queue entry array. The submission queue can be mapped with a
410 call like:
411
412 ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
413 PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
414 ring_fd, IORING_OFF_SQ_RING);
415
416 where sq_off is the io_sqring_offsets structure, and ring_fd is the
417 file descriptor returned from io_uring_setup(2). The addition of
418 sq_off.array to the length of the region accounts for the fact that the
419 ring is located at the end of the data structure. As an example, the
420 ring buffer head pointer can be accessed by adding sq_off.head to the
421 address returned from mmap(2):
422
423 head = ptr + sq_off.head;
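
The length computation and the pointer arithmetic shown above can be
written as small helpers. This is a sketch: sq_ring_len and ring_field
are hypothetical names, and the offsets are passed in as plain integers
so that the helpers do not depend on the kernel headers. The cast
through char * is used because arithmetic on void * is not standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Length to pass to mmap(2) for the SQ ring: the index array starts at
 * array_off and holds one 32-bit slot per submission queue entry.
 */
static size_t sq_ring_len(uint32_t array_off, uint32_t sq_entries)
{
    return (size_t)array_off + (size_t)sq_entries * sizeof(uint32_t);
}

/* Turn a byte offset from io_sqring_offsets into a typed pointer. */
static uint32_t *ring_field(void *ring_base, uint32_t offset)
{
    assert(ring_base != NULL);
    return (uint32_t *)((char *)ring_base + offset);
}
```

With a mapping at ptr, the head pointer from the example above would be
obtained as ring_field(ptr, sq_off.head).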
424
425 The flags field is used by the kernel to communicate state information
426 to the application. Currently, it is used to inform the application
427 when a call to io_uring_enter(2) is necessary. See the documentation
428 for the IORING_SETUP_SQPOLL flag above. The dropped member is incre‐
429 mented for each invalid submission queue entry encountered in the ring
430 buffer.
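
The wakeup check described for IORING_SETUP_SQPOLL can be expressed
with C11 atomics. The following is a sketch under stated assumptions:
EX_SQ_NEED_WAKEUP is a local stand-in assumed to have the same value as
IORING_SQ_NEED_WAKEUP, the helper name is hypothetical, and in a real
program the two pointers would be derived from the mapped SQ ring and
the tail and flags offsets. The full fence between the tail store and
the flags load is an extra precaution that this page does not
explicitly call for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Local stand-in; assumed to match IORING_SQ_NEED_WAKEUP. */
#define EX_SQ_NEED_WAKEUP (1U << 0)

/*
 * Publish a new SQ tail, then report whether the kernel thread has gone
 * idle and io_uring_enter(2) with IORING_ENTER_SQ_WAKEUP is required.
 */
static bool publish_tail_needs_wakeup(_Atomic unsigned *sq_tail,
                                      _Atomic unsigned *sq_flags,
                                      unsigned new_tail)
{
    /* Make the new entries visible before the tail moves. */
    atomic_store_explicit(sq_tail, new_tail, memory_order_release);

    /* Keep the flags load from being reordered before the tail store. */
    atomic_thread_fence(memory_order_seq_cst);

    unsigned flags = atomic_load_explicit(sq_flags, memory_order_acquire);
    return (flags & EX_SQ_NEED_WAKEUP) != 0;
}
```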
431
432 The head and tail track the ring buffer state. The tail is incremented
433 by the application when submitting new I/O, and the head is incremented
434 by the kernel when the I/O has been successfully submitted. Determin‐
435 ing the index of the head or tail into the ring is accomplished by ap‐
436 plying a mask:
437
438 index = tail & ring_mask;
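
Because head and tail are free-running 32-bit counters, the number of
queued entries can be obtained with unsigned subtraction even after the
counters wrap around. The helpers below are a sketch with hypothetical
names. The usage values assume an 8-entry ring whose ring_mask is 7.

```c
#include <assert.h>
#include <stdint.h>

/* Slot in the ring that a head or tail counter refers to. */
static uint32_t ring_index(uint32_t counter, uint32_t ring_mask)
{
    return counter & ring_mask;
}

/* Entries queued but not yet consumed; correct across counter wraparound. */
static uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}

/* Free slots left for the producer. */
static uint32_t ring_space(uint32_t head, uint32_t tail, uint32_t ring_entries)
{
    return ring_entries - ring_pending(head, tail);
}
```

For example, with head = 6 and tail = 9 on an 8-entry ring, three
entries are pending, five slots are free, and the next entry goes into
slot ring_index(9, 7), which is 1.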
439
440 The array of submission queue entries is mapped with:
441
442 sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
443 PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
444 ring_fd, IORING_OFF_SQES);
445
446 The completion queue is described by cq_entries and cq_off shown here:
447
448 struct io_cqring_offsets {
449 __u32 head;
450 __u32 tail;
451 __u32 ring_mask;
452 __u32 ring_entries;
453 __u32 overflow;
454 __u32 cqes;
455 __u32 flags;
456 __u32 resv1;
457 __u64 user_addr;
458 };
459
460 The completion queue is simpler, since the entries are not separated
461 from the queue itself, and can be mapped with:
462
463 ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
464 PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
465 IORING_OFF_CQ_RING);
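
Once the completion queue is mapped, completions can be reaped by
walking from head to tail. The following is a sketch, not the liburing
implementation. struct ex_cqe is a local stand-in assumed to match the
16-byte struct io_uring_cqe, struct ex_cq_view is a hypothetical
container for pointers computed from cq_off, and the sketch assumes
that the kernel advances tail while the application advances head.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Local stand-in; layout assumed to match the 16-byte struct io_uring_cqe. */
struct ex_cqe {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

/* Hypothetical view of a mapped CQ ring, built from the cq_off offsets. */
struct ex_cq_view {
    _Atomic uint32_t *head;   /* advanced by the application */
    _Atomic uint32_t *tail;   /* advanced by the kernel */
    uint32_t ring_mask;
    struct ex_cqe *cqes;
};

/* Copy up to max completions into out and return how many were reaped. */
static unsigned cq_reap(struct ex_cq_view *cq, struct ex_cqe *out, unsigned max)
{
    uint32_t head = atomic_load_explicit(cq->head, memory_order_relaxed);
    /* Acquire pairs with the producer's release store of the tail. */
    uint32_t tail = atomic_load_explicit(cq->tail, memory_order_acquire);
    unsigned n = 0;

    while (head != tail && n < max)
        out[n++] = cq->cqes[head++ & cq->ring_mask];

    /* Hand the consumed slots back to the producer. */
    atomic_store_explicit(cq->head, head, memory_order_release);
    return n;
}
```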
466
467 Closing the file descriptor returned by io_uring_setup(2) will free all
468 resources associated with the io_uring context. Note that this may hap‐
469 pen asynchronously within the kernel, so it is not guaranteed that re‐
470 sources are freed immediately.
471
473 io_uring_setup(2) returns a new file descriptor on success. The appli‐
474 cation may then provide the file descriptor in a subsequent mmap(2)
475 call to map the submission and completion queues, or to the io_ur‐
476 ing_register(2) or io_uring_enter(2) system calls.
477
On error, a negative error code is returned. The caller should not rely
on the errno variable.
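
Since the error is carried in the return value, it can be turned into a
message without consulting errno. A minimal sketch, with a hypothetical
helper name:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* ret is the value returned by io_uring_setup(); negative means -errno. */
static const char *setup_result_str(int ret)
{
    return ret < 0 ? strerror(-ret) : "ok";
}
```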
480
482 EFAULT params is outside your accessible address space.
483
484 EINVAL The resv array contains non-zero data, p.flags contains an un‐
485 supported flag, entries is out of bounds, IORING_SETUP_SQ_AFF
486 was specified, but IORING_SETUP_SQPOLL was not, or IOR‐
487 ING_SETUP_CQSIZE was specified, but io_uring_params.cq_entries
488 was invalid.
489
490 EMFILE The per-process limit on the number of open file descriptors has
491 been reached (see the description of RLIMIT_NOFILE in getr‐
492 limit(2)).
493
494 ENFILE The system-wide limit on the total number of open files has been
495 reached.
496
497 ENOMEM Insufficient kernel resources are available.
498
499 EPERM IORING_SETUP_SQPOLL was specified, but the effective user ID of
500 the caller did not have sufficient privileges.
501
EPERM /proc/sys/kernel/io_uring_disabled has the value 2, or it has
the value 1 and the calling process neither holds the
CAP_SYS_ADMIN capability nor is a member of the group specified in
/proc/sys/kernel/io_uring_group.
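
To diagnose the second EPERM case, an application might read the sysctl
file before reporting the failure. The helper below is a sketch with a
hypothetical name. It returns -1 when the file cannot be read, which is
expected on kernels that do not provide this sysctl.

```c
#include <assert.h>
#include <stdio.h>

/* Return the value stored in the file at path, or -1 if it cannot be read. */
static int read_io_uring_disabled(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

It would typically be called with the path
"/proc/sys/kernel/io_uring_disabled".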
506
508 io_uring_register(2), io_uring_enter(2)
509
510
511
Linux                         2019-01-29                 io_uring_setup(2)