IO_URING_SETUP(2)          Linux Programmer's Manual         IO_URING_SETUP(2)

NAME

       io_uring_setup - set up a context for performing asynchronous I/O

SYNOPSIS

       #include <linux/io_uring.h>

       int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION

       The io_uring_setup() system call sets up a submission queue (SQ) and
       completion queue (CQ) with at least entries entries, and returns a file
       descriptor which can be used to perform subsequent operations on the
       io_uring instance.  The submission and completion queues are shared
       between userspace and the kernel, which eliminates the need to copy
       data when initiating and completing I/O.

       The p argument is used by the application to pass options to the
       kernel, and by the kernel to convey information about the ring buffers.

           struct io_uring_params {
               __u32 sq_entries;
               __u32 cq_entries;
               __u32 flags;
               __u32 sq_thread_cpu;
               __u32 sq_thread_idle;
               __u32 features;
               __u32 wq_fd;
               __u32 resv[3];
               struct io_sqring_offsets sq_off;
               struct io_cqring_offsets cq_off;
           };

       The flags, sq_thread_cpu, and sq_thread_idle fields are used to
       configure the io_uring instance.  flags is a bit mask of zero or more
       of the following values ORed together:

       IORING_SETUP_IOPOLL
              Perform busy-waiting for an I/O completion, as opposed to
              getting notifications via an asynchronous IRQ (Interrupt
              Request).  The file system (if any) and block device must
              support polling in order for this to work.  Busy-waiting
              provides lower latency, but may consume more CPU resources than
              interrupt-driven I/O.  Currently, this feature is usable only on
              a file descriptor opened using the O_DIRECT flag.  When a read
              or write is submitted to a polled context, the application must
              poll for completions on the CQ ring by calling
              io_uring_enter(2).  It is illegal to mix and match polled and
              non-polled I/O on an io_uring instance.

       IORING_SETUP_SQPOLL
              When this flag is specified, a kernel thread is created to
              perform submission queue polling.  An io_uring instance
              configured in this way enables an application to issue I/O
              without ever context switching into the kernel.  By using the
              submission queue to fill in new submission queue entries and
              watching for completions on the completion queue, the
              application can submit and reap I/Os without doing a single
              system call.

              If the kernel thread is idle for more than sq_thread_idle
              milliseconds, it will set the IORING_SQ_NEED_WAKEUP bit in the
              flags field of the struct io_sq_ring.  When this happens, the
              application must call io_uring_enter(2) to wake the kernel
              thread.  If I/O is kept busy, the kernel thread will never
              sleep.  An application making use of this feature will need to
              guard the io_uring_enter(2) call with the following code
              sequence:

                  /*
                   * Ensure that the wakeup flag is read after the tail pointer
                   * has been written. It's important to use memory load acquire
                   * semantics for the flags read, as otherwise the application
                   * and the kernel might not agree on the consistency of the
                   * wakeup flag.
                   */
                  /* sq_ring->flags: pointer to the ring's atomic flags word */
                  unsigned flags = atomic_load_explicit(sq_ring->flags,
                                                        memory_order_acquire);
                  if (flags & IORING_SQ_NEED_WAKEUP)
                      io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP, NULL);

              where sq_ring is a submission queue ring set up using the struct
              io_sqring_offsets described below.

              Before version 5.11 of the Linux kernel, to successfully use
              this feature, the application must register a set of files to be
              used for I/O through io_uring_register(2) using the
              IORING_REGISTER_FILES opcode.  Failure to do so will result in
              submitted I/O failing with EBADF.  In version 5.11 and later,
              it is no longer necessary to register files to use this feature;
              the IORING_FEAT_SQPOLL_NONFIXED feature flag indicates that the
              kernel has lifted this requirement.  Version 5.11 also allows
              non-root use of this feature, provided the user has the
              CAP_SYS_NICE capability.

       IORING_SETUP_SQ_AFF
              If this flag is specified, then the poll thread will be bound to
              the CPU set in the sq_thread_cpu field of the struct
              io_uring_params.  This flag is only meaningful when
              IORING_SETUP_SQPOLL is specified.

       IORING_SETUP_CQSIZE
              Create the completion queue with struct
              io_uring_params.cq_entries entries.  The value must be greater
              than entries, and may be rounded up to the next power of two.

       IORING_SETUP_CLAMP
              If this flag is specified, and if entries exceeds
              IORING_MAX_ENTRIES, then entries will be clamped at
              IORING_MAX_ENTRIES.  If the flag IORING_SETUP_CQSIZE is set, and
              if the value of struct io_uring_params.cq_entries exceeds
              IORING_MAX_CQ_ENTRIES, then it will be clamped at
              IORING_MAX_CQ_ENTRIES.

       IORING_SETUP_ATTACH_WQ
              This flag should be set in conjunction with struct
              io_uring_params.wq_fd being set to an existing io_uring ring
              file descriptor.  When set, the io_uring instance being created
              will share the asynchronous worker thread backend of the
              specified io_uring ring, rather than create a new separate
              thread pool.

       IORING_SETUP_R_DISABLED
              If this flag is specified, the io_uring ring starts in a
              disabled state.  In this state, restrictions can be registered,
              but submissions are not allowed.  See io_uring_register(2) for
              details on how to enable the ring.  Available since 5.10.

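       The power-of-two rounding mentioned under IORING_SETUP_CQSIZE can be
       pictured with a small helper.  This is only an illustrative sketch:
       round_up_pow2() is a hypothetical name, not a kernel or library
       interface, and the sizes actually chosen by the kernel should always
       be read back from sq_entries and cq_entries.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest power of two that is >= n (for n >= 1). */
static uint32_t round_up_pow2(uint32_t n)
{
    uint32_t p = 1;

    /* Larger inputs have no 32-bit power of two above them. */
    assert(n >= 1 && n <= (UINT32_C(1) << 31));
    while (p < n)
        p <<= 1;
    return p;
}
```

       Under this model, a request for 100 completion queue entries would be
       expected to produce a ring of 128 entries.
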
       If no flags are specified, the io_uring instance is set up for
       interrupt-driven I/O.  I/O may be submitted using io_uring_enter(2)
       and can be reaped by polling the completion queue.

       The resv array must be initialized to zero.

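       Because the C library may not provide a wrapper for this system call,
       applications often invoke it through syscall(2) or a helper library.
       The following is a minimal sketch of the former, assuming the
       installed headers define __NR_io_uring_setup; ring_setup() is a
       hypothetical helper name.  It zeroes the whole parameter block first,
       which also satisfies the requirement that resv be zero.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: zero the parameter block, apply the caller's
 * setup flags, and invoke the raw system call.
 * Returns the ring file descriptor, or -1 with errno set.
 */
static int ring_setup(unsigned entries, unsigned flags,
                      struct io_uring_params *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
    p->flags = flags;
    return (int)syscall(__NR_io_uring_setup, entries, p);
}
```

       A caller would typically check for -1, report the error from errno,
       and otherwise go on to mmap(2) the rings using the offsets that the
       kernel filled in.
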
       features is filled in by the kernel and specifies the features
       supported by the running kernel version:

       IORING_FEAT_SINGLE_MMAP
              If this flag is set, the SQ and CQ rings can be mapped with a
              single mmap(2) call.  The SQEs must still be allocated
              separately.  This brings the necessary mmap(2) calls down from
              three to two.

       IORING_FEAT_NODROP
              If this flag is set, io_uring supports never dropping completion
              events.  If a completion event occurs and the CQ ring is full,
              the kernel stores the event internally until such a time that
              the CQ ring has room for more entries.  If this overflow
              condition is entered, attempting to submit more I/O will fail
              with the -EBUSY error value if the kernel cannot flush the
              overflowed events to the CQ ring.  If this happens, the
              application must reap events from the CQ ring and attempt the
              submit again.

       IORING_FEAT_SUBMIT_STABLE
              If this flag is set, applications can be certain that any data
              for async offload has been consumed when the kernel has consumed
              the SQE.

       IORING_FEAT_RW_CUR_POS
              If this flag is set, applications can specify offset == -1 with
              IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and
              IORING_OP_{READ,WRITE} to mean current file position, which
              behaves like preadv2(2) and pwritev2(2) with offset == -1.  It
              will use (and update) the current file position.  This comes
              with the caveat that if the application has multiple reads or
              writes in flight, then the end result will not be as expected.
              This is similar to threads sharing a file descriptor and doing
              I/O using the current file position.

       IORING_FEAT_CUR_PERSONALITY
              If this flag is set, then io_uring guarantees that both sync and
              async execution of a request assumes the credentials of the task
              that called io_uring_enter(2) to queue the requests.  If this
              flag isn't set, then requests are issued with the credentials of
              the task that originally registered the io_uring.  If only one
              task is using a ring, then this flag doesn't matter as the
              credentials will always be the same.  Note that this is the
              default behavior; tasks can still register different
              personalities through io_uring_register(2) with
              IORING_REGISTER_PERSONALITY and specify the personality to use
              in the sqe.

       IORING_FEAT_FAST_POLL
              If this flag is set, then io_uring supports using an internal
              poll mechanism to drive data/space readiness.  This means that
              requests that cannot read or write data to a file no longer need
              to be punted to an async thread for handling; instead, they will
              begin operation when the file is ready.  This is similar to
              doing poll + read/write in userspace, but eliminates the need to
              do so.  If this flag is set, requests waiting on space/data
              consume far fewer resources, as they are not blocking a thread.

       IORING_FEAT_POLL_32BITS
              If this flag is set, the IORING_OP_POLL_ADD command accepts the
              full 32-bit range of epoll-based flags, most notably
              EPOLLEXCLUSIVE, which allows exclusive (waking single waiters)
              behavior.

       IORING_FEAT_SQPOLL_NONFIXED
              If this flag is set, the IORING_SETUP_SQPOLL feature no longer
              requires the use of fixed files.  Any normal file descriptor can
              be used for I/O commands without needing registration.

       The rest of the fields in the struct io_uring_params are filled in by
       the kernel, and provide the information necessary to memory map the
       submission queue, completion queue, and the array of submission queue
       entries.  sq_entries specifies the number of submission queue entries
       allocated.  sq_off describes the offsets of various ring buffer fields:

           struct io_sqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 flags;
               __u32 dropped;
               __u32 array;
               __u32 resv[3];
           };

       Taken together, sq_entries and sq_off provide all of the information
       necessary for accessing the submission queue ring buffer and the
       submission queue entry array.  The submission queue can be mapped with
       a call like:

           ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                      ring_fd, IORING_OFF_SQ_RING);

       where sq_off is the io_sqring_offsets structure, and ring_fd is the
       file descriptor returned from io_uring_setup(2).  The addition of
       sq_off.array to the length of the region accounts for the fact that the
       ring is located at the end of the data structure.  As an example, the
       ring buffer head pointer can be accessed by adding sq_off.head to the
       address returned from mmap(2):

           head = ptr + sq_off.head;

       The flags field is used by the kernel to communicate state information
       to the application.  Currently, it is used to inform the application
       when a call to io_uring_enter(2) is necessary.  See the documentation
       for the IORING_SETUP_SQPOLL flag above.  The dropped member is
       incremented for each invalid submission queue entry encountered in the
       ring buffer.

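       The length computation and the offset arithmetic described above can
       be wrapped in two small helpers.  This is a sketch only; sq_ring_len()
       and ring_field() are hypothetical names, and the pointer passed to
       ring_field() is assumed to be the address returned by the SQ ring
       mmap(2) call.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stddef.h>

/* Length to pass to mmap(2) for the SQ ring, per the formula above. */
static size_t sq_ring_len(const struct io_uring_params *p)
{
    assert(p != NULL);
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/* Hypothetical helper: address of the 32-bit ring field at byte offset off. */
static unsigned *ring_field(void *ring, __u32 off)
{
    /* Byte arithmetic, so convert to char * before adding the offset. */
    return (unsigned *)((char *)ring + off);
}
```

       With these, the head and tail pointers would be obtained as
       ring_field(ptr, sq_off.head) and ring_field(ptr, sq_off.tail).
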
       The head and tail track the ring buffer state.  The tail is incremented
       by the application when submitting new I/O, and the head is incremented
       by the kernel when the I/O has been successfully submitted.
       Determining the index of the head or tail into the ring is accomplished
       by applying a mask:

           index = tail & ring_mask;

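       The masking step can be sketched as follows.  The helpers assume that
       head and tail are free-running 32-bit counters that are only reduced
       to a slot number when the ring is accessed, which is why a mask is
       needed at all; ring_index() and ring_pending() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Slot in the ring for a free-running head or tail counter. */
static uint32_t ring_index(uint32_t counter, uint32_t ring_mask)
{
    /* A valid mask is (power of two) - 1, so mask + 1 has one bit set. */
    assert(((ring_mask + 1) & ring_mask) == 0);
    return counter & ring_mask;
}

/* Entries between head and tail; unsigned math survives wraparound. */
static uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}
```

       For a ring of eight entries (ring_mask of 7), a tail of 9 maps to slot
       1, and the pending count stays correct even after the counters
       overflow 32 bits.
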
       The array of submission queue entries is mapped with:

           sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                            ring_fd, IORING_OFF_SQES);

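       Once both the SQ ring and the SQE array are mapped, queuing a request
       means filling an SQE, recording its index in the ring's array, and then
       advancing the tail.  The sketch below illustrates that order for a
       no-op request.  struct sq_view and sq_push_nop() are hypothetical, the
       pointers are assumed to have been derived from sq_off as shown earlier,
       and the atomics use the GCC/Clang __atomic builtins.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <string.h>

/* Hypothetical view of a mapped SQ ring plus the mapped SQE array. */
struct sq_view {
    unsigned *head, *tail, *array;
    unsigned ring_mask, ring_entries;
    struct io_uring_sqe *sqes;
};

/* Queue one no-op request; returns 0, or -1 if the ring is full. */
static int sq_push_nop(struct sq_view *sq, __u64 user_data)
{
    unsigned head, tail, idx;

    assert(sq != NULL);
    head = __atomic_load_n(sq->head, __ATOMIC_ACQUIRE);
    tail = *sq->tail;               /* written only by the application */
    if (tail - head >= sq->ring_entries)
        return -1;

    idx = tail & sq->ring_mask;
    memset(&sq->sqes[idx], 0, sizeof(sq->sqes[idx]));
    sq->sqes[idx].opcode = IORING_OP_NOP;
    sq->sqes[idx].user_data = user_data;
    sq->array[idx] = idx;           /* ring slot -> SQE index */

    /* Publish: the SQE contents must be visible before the new tail. */
    __atomic_store_n(sq->tail, tail + 1, __ATOMIC_RELEASE);
    return 0;
}
```

       The request is only handed to the kernel by a later
       io_uring_enter(2) call, or by the polling thread when
       IORING_SETUP_SQPOLL is in use.
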
       The completion queue is described by cq_entries and cq_off shown here:

           struct io_cqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 overflow;
               __u32 cqes;
               __u32 flags;
               __u32 resv[3];
           };

       The completion queue is simpler, since the entries are not separated
       from the queue itself, and can be mapped with:

           ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
                      IORING_OFF_CQ_RING);

       Closing the file descriptor returned by io_uring_setup(2) will free all
       resources associated with the io_uring context.

RETURN VALUE

       io_uring_setup(2) returns a new file descriptor on success.  The
       application may then provide the file descriptor in a subsequent
       mmap(2) call to map the submission and completion queues, or to the
       io_uring_register(2) or io_uring_enter(2) system calls.

       On error, -1 is returned and errno is set appropriately.

ERRORS

       EFAULT p is outside your accessible address space.

       EINVAL The resv array contains non-zero data, p->flags contains an
              unsupported flag, entries is out of bounds,
              IORING_SETUP_SQ_AFF was specified but IORING_SETUP_SQPOLL was
              not, or IORING_SETUP_CQSIZE was specified but
              io_uring_params.cq_entries was invalid.

       EMFILE The per-process limit on the number of open file descriptors has
              been reached (see the description of RLIMIT_NOFILE in
              getrlimit(2)).

       ENFILE The system-wide limit on the total number of open files has been
              reached.

       ENOMEM Insufficient kernel resources are available.

       EPERM  IORING_SETUP_SQPOLL was specified, but the effective user ID of
              the caller did not have sufficient privileges.

SEE ALSO

       io_uring_register(2), io_uring_enter(2)

Linux                             2019-01-29                 IO_URING_SETUP(2)