io_uring_setup(2)          Linux Programmer's Manual         io_uring_setup(2)

NAME

       io_uring_setup - set up a context for performing asynchronous I/O

SYNOPSIS

       #include <liburing.h>

       int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION

       The io_uring_setup(2) system call sets up a submission queue (SQ) and
       completion queue (CQ) with at least entries entries, and returns a file
       descriptor which can be used to perform subsequent operations on the
       io_uring instance.  The submission and completion queues are shared
       between userspace and the kernel, which eliminates the need to copy
       data when initiating and completing I/O.

       The p argument is used by the application to pass options to the
       kernel, and by the kernel to convey information about the ring buffers.

           struct io_uring_params {
               __u32 sq_entries;
               __u32 cq_entries;
               __u32 flags;
               __u32 sq_thread_cpu;
               __u32 sq_thread_idle;
               __u32 features;
               __u32 wq_fd;
               __u32 resv[3];
               struct io_sqring_offsets sq_off;
               struct io_cqring_offsets cq_off;
           };

       The flags, sq_thread_cpu, and sq_thread_idle fields are used to
       configure the io_uring instance.  flags is a bit mask of 0 or more of
       the following values ORed together:

       IORING_SETUP_IOPOLL
              Perform busy-waiting for an I/O completion, as opposed to
              getting notifications via an asynchronous IRQ (Interrupt
              Request).  The file system (if any) and block device must
              support polling in order for this to work.  Busy-waiting
              provides lower latency, but may consume more CPU resources than
              interrupt-driven I/O.  Currently, this feature is usable only on
              a file descriptor opened using the O_DIRECT flag.  When a read
              or write is submitted to a polled context, the application must
              poll for completions on the CQ ring by calling
              io_uring_enter(2).  It is illegal to mix and match polled and
              non-polled I/O on an io_uring instance.

              This is only applicable for storage devices for now, and the
              storage device must be configured for polling.  How to do that
              depends on the device type in question.  For NVMe devices, the
              nvme driver must be loaded with the poll_queues parameter set to
              the desired number of polling queues.  If that number is less
              than the number of online CPU threads, the polling queues are
              shared appropriately between the CPUs in the system.

       IORING_SETUP_SQPOLL
              When this flag is specified, a kernel thread is created to
              perform submission queue polling.  An io_uring instance
              configured in this way enables an application to issue I/O
              without ever context switching into the kernel.  By using the
              submission queue to fill in new submission queue entries and
              watching for completions on the completion queue, the
              application can submit and reap I/Os without doing a single
              system call.

              If the kernel thread is idle for more than sq_thread_idle
              milliseconds, it will set the IORING_SQ_NEED_WAKEUP bit in the
              flags field of the struct io_sq_ring.  When this happens, the
              application must call io_uring_enter(2) to wake the kernel
              thread.  If I/O is kept busy, the kernel thread will never
              sleep.  An application making use of this feature will need to
              guard the io_uring_enter(2) call with the following code
              sequence:

                  /*
                   * Ensure that the wakeup flag is read after the tail pointer
                   * has been written. It's important to use memory load acquire
                   * semantics for the flags read, as otherwise the application
                   * and the kernel might not agree on the consistency of the
                   * wakeup flag.
                   */
                  unsigned flags = atomic_load_relaxed(sq_ring->flags);
                  if (flags & IORING_SQ_NEED_WAKEUP)
                      io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

              where sq_ring is a submission queue ring set up using the struct
              io_sqring_offsets described below.

              Note that, when using a ring set up with IORING_SETUP_SQPOLL,
              you normally do not call the io_uring_enter(2) system call
              directly.  That is usually taken care of by liburing's
              io_uring_submit(3) function.  It automatically determines
              whether you are using polling mode and handles the cases where
              your program needs to call io_uring_enter(2), without you having
              to bother about it.

              Before version 5.11 of the Linux kernel, to successfully use
              this feature, the application had to register a set of files to
              be used for IO through io_uring_register(2) using the
              IORING_REGISTER_FILES opcode.  Failure to do so resulted in
              submitted IO being errored with EBADF.  In version 5.11 and
              later, it is no longer necessary to register files to use this
              feature; this relaxation can be detected by the
              IORING_FEAT_SQPOLL_NONFIXED feature flag.  5.11 also allows
              using this feature as non-root, if the user has the CAP_SYS_NICE
              capability.  In 5.13 this requirement was also relaxed, and no
              special privileges are needed for SQPOLL in newer kernels.
              Certain stable kernels older than 5.13 may also support
              unprivileged SQPOLL.

       IORING_SETUP_SQ_AFF
              If this flag is specified, then the poll thread will be bound to
              the CPU set in the sq_thread_cpu field of the struct
              io_uring_params.  This flag is only meaningful when
              IORING_SETUP_SQPOLL is specified.  When the cgroup setting
              cpuset.cpus changes (typically in a container environment), the
              bound CPU set may change as well.

       IORING_SETUP_CQSIZE
              Create the completion queue with struct
              io_uring_params.cq_entries entries.  The value must be greater
              than entries, and may be rounded up to the next power-of-two.

       IORING_SETUP_CLAMP
              If this flag is specified, and if entries exceeds
              IORING_MAX_ENTRIES, then entries will be clamped at
              IORING_MAX_ENTRIES.  If the flag IORING_SETUP_CQSIZE is set, and
              if the value of struct io_uring_params.cq_entries exceeds
              IORING_MAX_CQ_ENTRIES, then it will be clamped at
              IORING_MAX_CQ_ENTRIES.

       IORING_SETUP_ATTACH_WQ
              This flag should be set in conjunction with struct
              io_uring_params.wq_fd being set to an existing io_uring ring
              file descriptor.  When set, the io_uring instance being created
              will share the asynchronous worker thread backend of the
              specified io_uring ring, rather than create a new separate
              thread pool.

       IORING_SETUP_R_DISABLED
              If this flag is specified, the io_uring ring starts in a
              disabled state.  In this state, restrictions can be registered,
              but submissions are not allowed.  See io_uring_register(2) for
              details on how to enable the ring.  Available since 5.10.

       IORING_SETUP_SUBMIT_ALL
              Normally io_uring stops submitting a batch of requests if one of
              these requests results in an error.  This can cause fewer
              requests to be submitted than expected, if a request ends in
              error while being submitted.  If the ring is created with this
              flag, io_uring_enter(2) will continue submitting requests even
              if it encounters an error submitting a request.  CQEs are still
              posted for errored requests regardless of whether or not this
              flag is set at ring creation time; the only difference is
              whether the submit sequence is halted or continued when an error
              is observed.  Available since 5.18.

       IORING_SETUP_COOP_TASKRUN
              By default, io_uring will interrupt a task running in userspace
              when a completion event comes in.  This is to ensure that
              completions run in a timely manner.  For a lot of use cases,
              this is overkill and can cause reduced performance from the
              inter-processor interrupt used to do this, the kernel/user
              transition, the needless interruption of the task's userspace
              activities, and reduced batching if completions come in at a
              rapid rate.  Most applications don't need the forceful
              interruption, as the events are processed at any kernel/user
              transition.  The exception is setups where the application uses
              multiple threads operating on the same ring, where the
              application waiting on completions isn't the one that submitted
              them.  For most other use cases, setting this flag will improve
              performance.  Available since 5.19.

       IORING_SETUP_TASKRUN_FLAG
              Used in conjunction with IORING_SETUP_COOP_TASKRUN, this
              provides a flag, IORING_SQ_TASKRUN, which is set in the SQ ring
              flags whenever completions are pending that should be processed.
              liburing will check for this flag even when doing
              io_uring_peek_cqe(3) and enter the kernel to process them, and
              applications can do the same.  This makes
              IORING_SETUP_TASKRUN_FLAG safe to use even when applications
              rely on a peek-style operation on the CQ ring to see if anything
              might be pending to reap.  Available since 5.19.

       IORING_SETUP_SQE128
              If set, io_uring will use 128-byte SQEs rather than the normal
              64-byte sized variant.  This is a requirement for using certain
              request types; as of 5.19, only the IORING_OP_URING_CMD
              passthrough command for NVMe passthrough needs this.  Available
              since 5.19.

       IORING_SETUP_CQE32
              If set, io_uring will use 32-byte CQEs rather than the normal
              16-byte sized variant.  This is a requirement for using certain
              request types; as of 5.19, only the IORING_OP_URING_CMD
              passthrough command for NVMe passthrough needs this.  Available
              since 5.19.

       IORING_SETUP_SINGLE_ISSUER
              A hint to the kernel that only a single task (or thread) will
              submit requests, which is used for internal optimisations.  The
              submission task is either the task that created the ring, or if
              IORING_SETUP_R_DISABLED is specified then it is the task that
              enables the ring through io_uring_register(2).  The kernel
              enforces this rule, failing requests with -EEXIST if the
              restriction is violated.  Note that when IORING_SETUP_SQPOLL is
              set, the polling task is considered to be doing all submissions
              on behalf of userspace, so the ring always complies with the
              rule regardless of how many userspace tasks call
              io_uring_enter(2).  Available since 6.0.

       IORING_SETUP_DEFER_TASKRUN
              By default, io_uring will process all outstanding work at the
              end of any system call or thread interrupt.  This can delay the
              application from making other progress.  Setting this flag will
              hint to io_uring that it should defer work until an
              io_uring_enter(2) call with the IORING_ENTER_GETEVENTS flag set.
              This allows the application to request work to run just before
              it wants to process completions.  This flag requires the
              IORING_SETUP_SINGLE_ISSUER flag to be set, and also enforces
              that the call to io_uring_enter(2) is made from the same thread
              that submitted requests.  Note that if this flag is set then it
              is the application's responsibility to periodically trigger work
              (for example via any of the CQE waiting functions) or else
              completions may not be delivered.  Available since 6.1.

       IORING_SETUP_NO_MMAP
              By default, io_uring allocates kernel memory that callers must
              subsequently mmap(2).  If this flag is set, io_uring instead
              uses caller-allocated buffers; p->cq_off.user_addr must point to
              the memory for the sq/cq rings, and p->sq_off.user_addr must
              point to the memory for the sqes.  Each allocation must be
              contiguous memory.  Typically, callers should allocate this
              memory by using mmap(2) to allocate a huge page.  If this flag
              is set, a subsequent attempt to mmap(2) the io_uring file
              descriptor will fail.  Available since 6.5.

       IORING_SETUP_REGISTERED_FD_ONLY
              If this flag is set, io_uring will register the ring file
              descriptor, and return the registered descriptor index, without
              ever allocating an unregistered file descriptor.  The caller
              will need to use IORING_REGISTER_USE_REGISTERED_RING when
              calling io_uring_register(2).

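       The sizing rules described for IORING_SETUP_CQSIZE and
       IORING_SETUP_CLAMP can be sketched as follows.  This is an
       illustration, not kernel code: the helper names are hypothetical, and
       the limits 32768 and 65536 are assumptions taken from the kernel
       source, where IORING_MAX_ENTRIES and IORING_MAX_CQ_ENTRIES are internal
       constants that may change between versions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limits; kernel-internal, not part of the user-space headers. */
#define SKETCH_MAX_ENTRIES    32768u
#define SKETCH_MAX_CQ_ENTRIES (2u * SKETCH_MAX_ENTRIES)

/* Round v up to the next power of two (intended for 1 <= v <= 2^31). */
static uint32_t round_up_pow2(uint32_t v)
{
    uint32_t p = 1;

    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Hypothetical helper: return the queue size that would result from a
 * request, or 0 if the request would be rejected. "clamp" models whether
 * IORING_SETUP_CLAMP was passed.
 */
static uint32_t effective_entries(uint32_t requested, uint32_t limit,
                                  int clamp)
{
    if (requested == 0)
        return 0;               /* zero entries is out of bounds */
    if (requested > limit) {
        if (!clamp)
            return 0;           /* too large and clamping not requested */
        requested = limit;      /* clamp to the limit */
    }
    return round_up_pow2(requested);
}
```

       With these assumptions, a request for 100 entries yields a ring of 128
       entries, while a request for 100000 entries is rejected unless clamping
       is requested, in which case it yields 32768 entries.
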
       If no flags are specified, the io_uring instance is set up for
       interrupt-driven I/O.  I/O may be submitted using io_uring_enter(2) and
       can be reaped by polling the completion queue.

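       Some of the flags above only make sense in combination with others.
       The following sketch checks two dependencies stated in the flag
       descriptions (IORING_SETUP_SQ_AFF needs IORING_SETUP_SQPOLL, and
       IORING_SETUP_DEFER_TASKRUN needs IORING_SETUP_SINGLE_ISSUER) before the
       system call is made.  The function name is hypothetical, the kernel may
       enforce additional rules, and the bit values are assumed to mirror
       <linux/io_uring.h>.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed to mirror <linux/io_uring.h>; redefined here only so that the
 * sketch compiles without kernel headers.
 */
#ifndef IORING_SETUP_SQPOLL
#define IORING_SETUP_SQPOLL        (1U << 1)
#define IORING_SETUP_SQ_AFF        (1U << 2)
#define IORING_SETUP_SINGLE_ISSUER (1U << 12)
#define IORING_SETUP_DEFER_TASKRUN (1U << 13)
#endif

/*
 * Hypothetical pre-check: return 1 if the flag combination satisfies the
 * two documented dependencies, 0 otherwise.
 */
static int setup_flags_consistent(uint32_t flags)
{
    /* SQ_AFF binds the poll thread, so it needs SQPOLL. */
    if ((flags & IORING_SETUP_SQ_AFF) && !(flags & IORING_SETUP_SQPOLL))
        return 0;

    /* DEFER_TASKRUN requires SINGLE_ISSUER. */
    if ((flags & IORING_SETUP_DEFER_TASKRUN) &&
        !(flags & IORING_SETUP_SINGLE_ISSUER))
        return 0;

    return 1;
}
```

       Such a check is only a convenience for earlier diagnostics; the return
       value of io_uring_setup(2) remains the authoritative answer.
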
       The resv array must be initialized to zero.

       features is filled in by the kernel, and specifies various features
       supported by the current kernel version.

       IORING_FEAT_SINGLE_MMAP
              If this flag is set, the SQ and CQ rings can be mapped with a
              single mmap(2) call.  The SQEs must still be allocated
              separately.  This brings the necessary mmap(2) calls down from
              three to two.  Available since kernel 5.4.

       IORING_FEAT_NODROP
              If this flag is set, io_uring supports almost never dropping
              completion events.  If a completion event occurs and the CQ ring
              is full, the kernel stores the event internally until such a
              time that the CQ ring has room for more entries.  If this
              overflow condition is entered, attempting to submit more IO will
              fail with the -EBUSY error value, if it can't flush the
              overflown events to the CQ ring.  If this happens, the
              application must reap events from the CQ ring and attempt the
              submit again.  If the kernel has no free memory to store the
              event internally, this will be visible as an increase in the
              overflow value on the cqring.  Additionally, io_uring_enter(2)
              will return -EBADR the next time it would otherwise sleep
              waiting for completions (since kernel 5.19).  Available since
              kernel 5.5.

       IORING_FEAT_SUBMIT_STABLE
              If this flag is set, applications can be certain that any data
              for async offload has been consumed when the kernel has consumed
              the SQE.  Available since kernel 5.5.

       IORING_FEAT_RW_CUR_POS
              If this flag is set, applications can specify offset == -1 with
              IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and
              IORING_OP_{READ,WRITE} to mean current file position, which
              behaves like preadv2(2) and pwritev2(2) with offset == -1.  It
              will use (and update) the current file position.  This comes
              with the caveat that if the application has multiple reads or
              writes in flight, then the end result will not be as expected.
              This is similar to threads sharing a file descriptor and doing
              IO using the current file position.  Available since kernel 5.6.

       IORING_FEAT_CUR_PERSONALITY
              If this flag is set, then io_uring guarantees that both sync and
              async execution of a request assumes the credentials of the task
              that called io_uring_enter(2) to queue the requests.  If this
              flag isn't set, then requests are issued with the credentials of
              the task that originally registered the io_uring.  If only one
              task is using a ring, then this flag doesn't matter as the
              credentials will always be the same.  Note that this is the
              default behavior; tasks can still register different
              personalities through io_uring_register(2) with
              IORING_REGISTER_PERSONALITY and specify the personality to use
              in the sqe.  Available since kernel 5.6.

       IORING_FEAT_FAST_POLL
              If this flag is set, then io_uring supports using an internal
              poll mechanism to drive data/space readiness.  This means that
              requests that cannot read or write data to a file no longer need
              to be punted to an async thread for handling; instead, they will
              begin operation when the file is ready.  This is similar to
              doing poll + read/write in userspace, but eliminates the need to
              do so.  If this flag is set, requests waiting on space/data
              consume far fewer resources doing so, as they are not blocking a
              thread.  Available since kernel 5.7.

       IORING_FEAT_POLL_32BITS
              If this flag is set, the IORING_OP_POLL_ADD command accepts the
              full 32-bit range of epoll-based flags, most notably
              EPOLLEXCLUSIVE, which allows exclusive (waking single waiters)
              behavior.  Available since kernel 5.9.

       IORING_FEAT_SQPOLL_NONFIXED
              If this flag is set, the IORING_SETUP_SQPOLL feature no longer
              requires the use of fixed files.  Any normal file descriptor can
              be used for IO commands without needing registration.  Available
              since kernel 5.11.

       IORING_FEAT_ENTER_EXT_ARG
              If this flag is set, then the io_uring_enter(2) system call
              supports passing in an extended argument instead of just the
              sigset_t of earlier kernels.  This extended argument is of type
              struct io_uring_getevents_arg and allows the caller to pass in
              both a sigset_t and a timeout argument for waiting on events.
              The struct layout is as follows:

                  struct io_uring_getevents_arg {
                      __u64 sigmask;
                      __u32 sigmask_sz;
                      __u32 pad;
                      __u64 ts;
                  };

              and a pointer to this struct must be passed in if
              IORING_ENTER_EXT_ARG is set in the flags for the enter system
              call.  Available since kernel 5.11.

       IORING_FEAT_NATIVE_WORKERS
              If this flag is set, io_uring is using native workers for its
              async helpers.  Previous kernels used kernel threads that
              assumed the identity of the original io_uring owning task, but
              later kernels will actively create what looks more like regular
              process threads instead.  Available since kernel 5.12.

       IORING_FEAT_RSRC_TAGS
              If this flag is set, then io_uring supports a variety of
              features related to fixed files and buffers.  In particular, it
              indicates that registered buffers can be updated in-place,
              whereas before the full set would have to be unregistered first.
              Available since kernel 5.13.

       IORING_FEAT_CQE_SKIP
              If this flag is set, then io_uring supports setting
              IOSQE_CQE_SKIP_SUCCESS in the submitted SQE, indicating that no
              CQE should be generated for this SQE if it executes normally.
              If an error happens processing the SQE, a CQE with the
              appropriate error value will still be generated.  Available
              since kernel 5.17.

       IORING_FEAT_LINKED_FILE
              If this flag is set, then io_uring supports sane assignment of
              files for SQEs that have dependencies.  For example, if a chain
              of SQEs is submitted with IOSQE_IO_LINK, then kernels without
              this flag will prepare the file for each link upfront.  If a
              previous link opens a file with a known index, e.g. if direct
              descriptors are used with open or accept, then file assignment
              needs to happen post execution of that SQE.  If this flag is
              set, then the kernel will defer file assignment until execution
              of a given request is started.  Available since kernel 5.17.

       IORING_FEAT_REG_REG_RING
              If this flag is set, then io_uring supports calling
              io_uring_register(2) using a registered ring fd, via
              IORING_REGISTER_USE_REGISTERED_RING.  Available since kernel
              6.3.

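       An application would typically test bits of the features field after a
       successful call and adapt its behavior.  The following is a minimal
       sketch of such tests for three of the flags listed above; the helper
       names are hypothetical and the bit values are assumed to mirror
       <linux/io_uring.h>.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed to mirror <linux/io_uring.h>; redefined here only so that the
 * sketch compiles without kernel headers.
 */
#ifndef IORING_FEAT_SINGLE_MMAP
#define IORING_FEAT_SINGLE_MMAP     (1U << 0)
#define IORING_FEAT_NODROP          (1U << 1)
#define IORING_FEAT_SQPOLL_NONFIXED (1U << 7)
#endif

/* Number of mmap(2) calls needed to map both rings and the SQE array. */
static int mmap_calls_needed(uint32_t features)
{
    return (features & IORING_FEAT_SINGLE_MMAP) ? 2 : 3;
}

/* Whether the kernel buffers completions when the CQ ring is full. */
static int cq_overflow_is_buffered(uint32_t features)
{
    return (features & IORING_FEAT_NODROP) != 0;
}

/* Whether SQPOLL still requires files registered in advance. */
static int sqpoll_needs_registered_files(uint32_t features)
{
    return (features & IORING_FEAT_SQPOLL_NONFIXED) == 0;
}
```

       Here features stands for the value of the features field of struct
       io_uring_params as filled in by the kernel.
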
       The rest of the fields in the struct io_uring_params are filled in by
       the kernel, and provide the information necessary to memory map the
       submission queue, completion queue, and the array of submission queue
       entries.  sq_entries specifies the number of submission queue entries
       allocated.  sq_off describes the offsets of various ring buffer fields:

           struct io_sqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 flags;
               __u32 dropped;
               __u32 array;
               __u32 resv1;
               __u64 user_addr;
           };

       Taken together, sq_entries and sq_off provide all of the information
       necessary for accessing the submission queue ring buffer and the
       submission queue entry array.  The submission queue can be mapped with
       a call like:

           ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                      ring_fd, IORING_OFF_SQ_RING);

       where sq_off is the io_sqring_offsets structure, and ring_fd is the
       file descriptor returned from io_uring_setup(2).  The addition of
       sq_off.array to the length of the region accounts for the fact that the
       ring is located at the end of the data structure.  As an example, the
       ring buffer head pointer can be accessed by adding sq_off.head to the
       address returned from mmap(2):

           head = ptr + sq_off.head;

       The flags field is used by the kernel to communicate state information
       to the application.  Currently, it is used to inform the application
       when a call to io_uring_enter(2) is necessary.  See the documentation
       for the IORING_SETUP_SQPOLL flag above.  The dropped member is
       incremented for each invalid submission queue entry encountered in the
       ring buffer.

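       As a rough sketch of how an application might act on
       IORING_SQ_NEED_WAKEUP using C11 atomics: the new tail is published, a
       full fence orders that store before the flags load, and the flags word
       is then tested.  The function name is hypothetical, and treating the
       mapped tail and flags words as _Atomic unsigned objects is an
       assumption of this example (liburing uses its own barrier helpers
       instead).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed to mirror <linux/io_uring.h>. */
#ifndef IORING_SQ_NEED_WAKEUP
#define IORING_SQ_NEED_WAKEUP (1U << 0)
#endif

/*
 * Hypothetical helper: publish new_tail, then report whether the SQ poll
 * thread has gone idle and must be woken with io_uring_enter(2) and
 * IORING_ENTER_SQ_WAKEUP.
 */
static bool publish_tail_needs_wakeup(_Atomic unsigned *tail,
                                      unsigned new_tail,
                                      _Atomic unsigned *flags)
{
    /* Make the filled-in entries visible before the tail moves. */
    atomic_store_explicit(tail, new_tail, memory_order_release);

    /* Order the tail store before the flags load. */
    atomic_thread_fence(memory_order_seq_cst);

    return (atomic_load_explicit(flags, memory_order_relaxed) &
            IORING_SQ_NEED_WAKEUP) != 0;
}
```

       In a real application, tail and flags would point into the mapped SQ
       ring at the offsets given by sq_off.tail and sq_off.flags, and a true
       result would be followed by the io_uring_enter(2) wakeup call.
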
       The head and tail track the ring buffer state.  The tail is incremented
       by the application when submitting new I/O, and the head is incremented
       by the kernel when the I/O has been successfully submitted.
       Determining the index of the head or tail into the ring is accomplished
       by applying a mask:

           index = tail & ring_mask;

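       The pointer arithmetic and masking described above can be sketched in
       portable C as follows.  The helper names are hypothetical.  The sketch
       assumes that head and tail are free-running 32-bit counters that wrap
       around, and that ring_mask equals the number of ring entries minus one.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a pointer to the 32-bit ring field located off bytes into the
 * mapping that starts at base. Going through char * avoids arithmetic on
 * void *, which standard C does not define.
 */
static uint32_t *ring_field(void *base, uint32_t off)
{
    return (uint32_t *)((char *)base + off);
}

/* Convert a free-running head or tail counter into a ring index. */
static uint32_t ring_index(uint32_t counter, uint32_t ring_mask)
{
    return counter & ring_mask;
}

/*
 * Number of entries between head and tail. Unsigned subtraction gives
 * the right answer even after the counters have wrapped around.
 */
static uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}
```

       For example, with ptr and sq_off as described above,
       ring_field(ptr, sq_off.head) yields the address of the SQ head counter.
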
       The array of submission queue entries is mapped with:

           sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                            ring_fd, IORING_OFF_SQES);

       The completion queue is described by cq_entries and cq_off shown here:

           struct io_cqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 overflow;
               __u32 cqes;
               __u32 flags;
               __u32 resv1;
               __u64 user_addr;
           };

       The completion queue is simpler, since the entries are not separated
       from the queue itself, and can be mapped with:

           ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
                      IORING_OFF_CQ_RING);

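       The three mapping lengths used in the mmap(2) calls above can be
       computed as in the following sketch.  The helper names are
       hypothetical, and the entry sizes assume the default 64-byte SQE and
       16-byte CQE formats (that is, neither IORING_SETUP_SQE128 nor
       IORING_SETUP_CQE32 is in use).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes of the default SQE and CQE formats, in bytes. */
#define SKETCH_SQE_SIZE 64u
#define SKETCH_CQE_SIZE 16u

/* Length of the SQ ring mapping: header fields plus the index array. */
static size_t sq_ring_map_size(uint32_t array_off, uint32_t sq_entries)
{
    return (size_t)array_off + (size_t)sq_entries * sizeof(uint32_t);
}

/* Length of the CQ ring mapping: header fields plus the CQEs themselves. */
static size_t cq_ring_map_size(uint32_t cqes_off, uint32_t cq_entries)
{
    return (size_t)cqes_off + (size_t)cq_entries * SKETCH_CQE_SIZE;
}

/* Length of the separate SQE array mapping. */
static size_t sqes_map_size(uint32_t sq_entries)
{
    return (size_t)sq_entries * SKETCH_SQE_SIZE;
}
```

       The offset arguments correspond to sq_off.array and cq_off.cqes as
       returned by the kernel; real code would take the entry sizes from
       sizeof(struct io_uring_sqe) and sizeof(struct io_uring_cqe).
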
       Closing the file descriptor returned by io_uring_setup(2) will free all
       resources associated with the io_uring context.  Note that this may
       happen asynchronously within the kernel, so it is not guaranteed that
       resources are freed immediately.

RETURN VALUE

       io_uring_setup(2) returns a new file descriptor on success.  The
       application may then provide the file descriptor in a subsequent
       mmap(2) call to map the submission and completion queues, or to the
       io_uring_register(2) or io_uring_enter(2) system calls.

       On error, a negative error code is returned.  The caller should not
       rely on the errno variable.
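
       When the system call is invoked through syscall(2) rather than through
       liburing, the C library instead returns -1 and sets errno.  The
       following sketch wraps that call so that it follows the negative error
       code convention described above.  The helper name ring_setup is
       hypothetical; the sketch assumes that <linux/io_uring.h> is installed
       and that the system call number is 425 where the C library headers do
       not define it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425 /* assumption: not provided by old headers */
#endif

/*
 * Hypothetical helper: call io_uring_setup via syscall(2) and fold errno
 * into the return value. Returns a file descriptor (>= 0) on success or a
 * negative error code on failure.
 */
static int ring_setup(unsigned entries, struct io_uring_params *p)
{
    long ret = syscall(__NR_io_uring_setup, entries, p);

    return ret < 0 ? -errno : (int)ret;
}
```

       A caller would zero the parameter block with memset(3) before the
       call, since the resv array must be zero, and could pass the negated
       return value to strerror(3) when reporting a failure.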

ERRORS

       EFAULT p is outside your accessible address space.

       EINVAL The resv array contains non-zero data, p->flags contains an
              unsupported flag, entries is out of bounds, IORING_SETUP_SQ_AFF
              was specified but IORING_SETUP_SQPOLL was not, or
              IORING_SETUP_CQSIZE was specified but io_uring_params.cq_entries
              was invalid.

       EMFILE The per-process limit on the number of open file descriptors has
              been reached (see the description of RLIMIT_NOFILE in
              getrlimit(2)).

       ENFILE The system-wide limit on the total number of open files has been
              reached.

       ENOMEM Insufficient kernel resources are available.

       EPERM  IORING_SETUP_SQPOLL was specified, but the effective user ID of
              the caller did not have sufficient privileges.

       EPERM  /proc/sys/kernel/io_uring_disabled has the value 2, or it has
              the value 1 and the calling process does not hold the
              CAP_SYS_ADMIN capability or is not a member of
              /proc/sys/kernel/io_uring_group.

SEE ALSO

       io_uring_register(2), io_uring_enter(2)

Linux                             2019-01-29                 io_uring_setup(2)