IO_URING_ENTER(2)          Linux Programmer's Manual         IO_URING_ENTER(2)

NAME

       io_uring_enter - initiate and/or complete asynchronous I/O

SYNOPSIS

       #include <linux/io_uring.h>

       int io_uring_enter(unsigned int fd, unsigned int to_submit,
                          unsigned int min_complete, unsigned int flags,
                          sigset_t *sig);

DESCRIPTION

       io_uring_enter() is used to initiate and complete I/O using the shared
       submission and completion queues set up by a call to io_uring_setup(2).
       A single call can both submit new I/O and wait for completions of I/O
       initiated by this call or previous calls to io_uring_enter().

       fd is the file descriptor returned by io_uring_setup(2). to_submit
       specifies the number of I/Os to submit from the submission queue. If
       the IORING_ENTER_GETEVENTS bit is set in flags, then the system call
       will attempt to wait for min_complete event completions before
       returning. If the io_uring instance was configured for polling, by
       specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(2), then
       min_complete has a slightly different meaning. Passing a value of 0
       instructs the kernel to return any events which are already complete,
       without blocking. If min_complete is a non-zero value, the kernel will
       still return immediately if any completion events are available. If no
       event completions are available, then the call will poll either until
       one or more completions become available, or until the process has
       exceeded its scheduler time slice.

       Note that, for interrupt-driven I/O (where IORING_SETUP_IOPOLL was not
       specified in the call to io_uring_setup(2)), an application may check
       the completion queue for event completions without entering the kernel
       at all.

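       A check of the completion queue from user space could be sketched as
       follows. This is a minimal sketch, not a definitive implementation:
       struct cq_view and the functions cq_peek() and cq_advance() are
       hypothetical names, and the pointers are assumed to have been derived
       from the ring memory mapped after io_uring_setup(2).

```c
#include <stddef.h>
#include <linux/io_uring.h>

/*
 * Hypothetical view of the mmap'ed completion ring. The field names are
 * illustrative; a real application fills them in from the offsets that
 * io_uring_setup(2) reports.
 */
struct cq_view {
    unsigned *head;             /* consumer index, written by the application */
    unsigned *tail;             /* producer index, written by the kernel */
    unsigned  ring_mask;        /* number of entries minus one */
    struct io_uring_cqe *cqes;  /* array of completion queue entries */
};

/* Return the oldest unconsumed CQE, or NULL if the ring is empty. */
struct io_uring_cqe *cq_peek(struct cq_view *cq)
{
    unsigned head = *cq->head;
    /* Acquire pairs with the kernel's publication of the new tail. */
    unsigned tail = __atomic_load_n(cq->tail, __ATOMIC_ACQUIRE);

    if (head == tail)
        return NULL;
    return &cq->cqes[head & cq->ring_mask];
}

/* Mark one CQE as consumed so that its slot can be reused. */
void cq_advance(struct cq_view *cq)
{
    __atomic_store_n(cq->head, *cq->head + 1, __ATOMIC_RELEASE);
}
```

       No system call is made on this path; the application only compares
       the head and tail indexes of the shared ring.
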
       When the system call returns that a certain number of SQEs have been
       consumed and submitted, it's safe to reuse SQE entries in the ring.
       This is true even if the actual I/O submission had to be punted to
       async context, which means that the SQE may in fact not have been
       submitted yet. If the kernel requires later use of a particular SQE
       entry, it will have made a private copy of it.

       sig is a pointer to a signal mask (see sigprocmask(2)); if sig is not
       NULL, io_uring_enter() first replaces the current signal mask by the
       one pointed to by sig, then waits for events to become available in the
       completion queue, and then restores the original signal mask. The
       following io_uring_enter() call:

           ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, &sig);

       is equivalent to atomically executing the following calls:

           pthread_sigmask(SIG_SETMASK, &sig, &orig);
           ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, NULL);
           pthread_sigmask(SIG_SETMASK, &orig, NULL);

       See the description of pselect(2) for an explanation of why the sig
       parameter is necessary.

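       Where the C library provides no wrapper for this system call, it can
       be invoked through syscall(2). The following is a sketch under stated
       assumptions: the fallback system call number 426 and the trailing
       signal set size argument (_NSIG / 8) are assumptions that should be
       checked against the target kernel and architecture, and
       wait_for_one_completion() is a hypothetical helper.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

#ifndef __NR_io_uring_enter
#define __NR_io_uring_enter 426   /* assumption: generic syscall number */
#endif

/*
 * Thin wrapper around the raw system call. The last argument passed to
 * syscall() is assumed to be the size in bytes of the kernel's signal
 * set, as for other system calls that take a signal mask.
 */
int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig)
{
    return (int) syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
                         flags, sig, _NSIG / 8);
}

/* Submit nothing; wait until at least one completion is available. */
int wait_for_one_completion(unsigned int ring_fd, sigset_t *sig)
{
    return io_uring_enter(ring_fd, 0, 1, IORING_ENTER_GETEVENTS, sig);
}
```

       On failure the wrapper returns -1 and leaves the error in errno, as
       described under RETURN VALUE.
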
       Submission queue entries are represented using the following data
       structure:

           /*
            * IO submission data structure (Submission Queue Entry)
            */
           struct io_uring_sqe {
               __u8    opcode;         /* type of operation for this sqe */
               __u8    flags;          /* IOSQE_ flags */
               __u16   ioprio;         /* ioprio for the request */
               __s32   fd;             /* file descriptor to do IO on */
               union {
                   __u64   off;            /* offset into file */
                   __u64   addr2;
               };
               __u64   addr;           /* pointer to buffer or iovecs */
               __u32   len;            /* buffer size or number of iovecs */
               union {
                   __kernel_rwf_t  rw_flags;
                   __u32    fsync_flags;
                   __u16    poll_events;
                   __u32    sync_range_flags;
                   __u32    msg_flags;
                   __u32    timeout_flags;
                   __u32    accept_flags;
                   __u32    cancel_flags;
               };
               __u64    user_data;     /* data to be passed back at completion time */
               union {
                   struct {
                       /* index into fixed buffers, if used */
                       __u16    buf_index;
                       /* personality to use, if used */
                       __u16    personality;
                   };
                   __u64    __pad2[3];
               };
           };

       The opcode describes the operation to be performed. It can be one of:

       IORING_OP_NOP
              Do not perform any I/O. This is useful for testing the
              performance of the io_uring implementation itself.

       IORING_OP_READV

       IORING_OP_WRITEV
              Vectored read and write operations, similar to preadv2(2) and
              pwritev2(2).

       IORING_OP_READ_FIXED

       IORING_OP_WRITE_FIXED
              Read from or write to pre-mapped buffers. See
              io_uring_register(2) for details on how to set up a context for
              fixed reads and writes.

       IORING_OP_FSYNC
              File sync. See also fsync(2). Note that, while I/O is initiated
              in the order in which it appears in the submission queue,
              completions are unordered. For example, an application which
              places a write I/O followed by an fsync in the submission queue
              cannot expect the fsync to apply to the write. The two
              operations execute in parallel, so the fsync may complete before
              the write is issued to the storage. The same is also true for
              previously issued writes that have not completed prior to the
              fsync.

       IORING_OP_POLL_ADD
              Poll the fd specified in the submission queue entry for the
              events specified in the poll_events field. Unlike poll(2) or
              epoll(7) without EPOLLONESHOT, this interface always works in
              one-shot mode. That is, once the poll operation is completed, it
              will have to be resubmitted.

       IORING_OP_POLL_REMOVE
              Remove an existing poll request. If found, the res field of the
              struct io_uring_cqe will contain 0. If not found, res will
              contain -ENOENT.

       IORING_OP_SYNC_FILE_RANGE
              Issue the equivalent of a sync_file_range(2) on the file
              descriptor. The fd field is the file descriptor to sync, the off
              field holds the offset in bytes, the len field holds the length
              in bytes, and the sync_range_flags field holds the flags for the
              command. See also sync_file_range(2) for the general description
              of the related system call. Available since 5.2.

       IORING_OP_SENDMSG
              Issue the equivalent of a sendmsg(2) system call. fd must be
              set to the socket file descriptor, addr must contain a pointer
              to the msghdr structure, and msg_flags holds the flags
              associated with the system call. See also sendmsg(2) for the
              general description of the related system call. Available since
              5.3.

       IORING_OP_RECVMSG
              Works just like IORING_OP_SENDMSG, except for recvmsg(2)
              instead. See the description of IORING_OP_SENDMSG. Available
              since 5.3.

       IORING_OP_SEND
              Issue the equivalent of a send(2) system call. fd must be set
              to the socket file descriptor, addr must contain a pointer to
              the buffer, len denotes the length of the buffer to send, and
              msg_flags holds the flags associated with the system call. See
              also send(2) for the general description of the related system
              call. Available since 5.6.

       IORING_OP_RECV
              Works just like IORING_OP_SEND, except for recv(2) instead. See
              the description of IORING_OP_SEND. Available since 5.6.

       IORING_OP_TIMEOUT
              This command will register a timeout operation. The addr field
              must contain a pointer to a struct timespec64 structure, len
              must contain 1 to signify one timespec64 structure, and
              timeout_flags may contain IORING_TIMEOUT_ABS for an absolute
              timeout value, or 0 for a relative timeout. off may contain a
              completion event count. If not set, this defaults to 1. A
              timeout will trigger a wakeup event on the completion ring for
              anyone waiting for events. A timeout condition is met when
              either the specified timeout expires, or the specified number of
              events have completed. Either condition will trigger the event.
              io_uring timeouts use the CLOCK_MONOTONIC clock source. The
              request will complete with -ETIME if the timeout got completed
              through expiration of the timer, or 0 if the timeout got
              completed through requests completing on their own. If the
              timeout was cancelled before it expired, the request will
              complete with -ECANCELED. Available since 5.4.

       IORING_OP_TIMEOUT_REMOVE
              Attempt to remove an existing timeout operation. addr must
              contain the user_data field of the previously issued timeout
              operation. If the specified timeout request is found and
              cancelled successfully, this request will terminate with a
              result value of 0. If the timeout request was found but
              expiration was already in progress, this request will terminate
              with a result value of -EBUSY. If the timeout request wasn't
              found, the request will terminate with a result value of
              -ENOENT. Available since 5.5.

       IORING_OP_ACCEPT
              Issue the equivalent of an accept4(2) system call. fd must be
              set to the socket file descriptor, addr must contain the pointer
              to the sockaddr structure, and addr2 must contain a pointer to
              the socklen_t addrlen field. See also accept4(2) for the general
              description of the related system call. Available since 5.5.

       IORING_OP_ASYNC_CANCEL
              Attempt to cancel an already issued request. addr must contain
              the user_data field of the request that should be cancelled. The
              cancellation request will complete with one of the following
              result codes. If found, the res field of the cqe will contain 0.
              If not found, res will contain -ENOENT. If found and
              cancellation was attempted, the res field will contain
              -EALREADY. In this case, the request may or may not terminate.
              In general, requests that are interruptible (like socket I/O)
              will get cancelled, while disk I/O requests cannot be cancelled
              if already started. Available since 5.5.

       IORING_OP_LINK_TIMEOUT
              This request must be linked with another request through
              IOSQE_IO_LINK, which is described below. Unlike
              IORING_OP_TIMEOUT, IORING_OP_LINK_TIMEOUT acts on the linked
              request, not the completion queue. The format of the command is
              otherwise like IORING_OP_TIMEOUT, except there's no completion
              event count as it's tied to a specific request. If used, the
              timeout specified in the command will cancel the linked command,
              unless the linked command completes before the timeout. The
              timeout will complete with -ETIME if the timer expired and
              cancellation of the linked request was attempted, or -ECANCELED
              if the timer got cancelled because of completion of the linked
              request. Like IORING_OP_TIMEOUT, the clock source used is
              CLOCK_MONOTONIC. Available since 5.5.

       IORING_OP_CONNECT
              Issue the equivalent of a connect(2) system call. fd must be
              set to the socket file descriptor, addr must contain the pointer
              to the sockaddr structure, and off must contain the socklen_t
              addrlen field. See also connect(2) for the general description
              of the related system call. Available since 5.5.

       IORING_OP_FALLOCATE
              Issue the equivalent of a fallocate(2) system call. fd must be
              set to the file descriptor, off must contain the offset on which
              to operate, and len must contain the length. See also
              fallocate(2) for the general description of the related system
              call. Available since 5.6.

       IORING_OP_FADVISE
              Issue the equivalent of a posix_fadvise(2) system call. fd must
              be set to the file descriptor, off must contain the offset on
              which to operate, len must contain the length, and
              fadvise_advice must contain the advice associated with the
              operation. See also posix_fadvise(2) for the general description
              of the related system call. Available since 5.6.

       IORING_OP_MADVISE
              Issue the equivalent of a madvise(2) system call. addr must
              contain the address to operate on, len must contain the length
              on which to operate, and fadvise_advice must contain the advice
              associated with the operation. See also madvise(2) for the
              general description of the related system call. Available since
              5.6.

       IORING_OP_OPENAT
              Issue the equivalent of an openat(2) system call. fd is the
              dirfd argument, addr must contain a pointer to the *pathname
              argument, open_flags should contain any flags passed in, and
              mode is the access mode of the file. See also openat(2) for the
              general description of the related system call. Available since
              5.6.

       IORING_OP_CLOSE
              Issue the equivalent of a close(2) system call. fd is the file
              descriptor to be closed. See also close(2) for the general
              description of the related system call. Available since 5.6.

       IORING_OP_STATX
              Issue the equivalent of a statx(2) system call. fd is the dirfd
              argument, addr must contain a pointer to the *pathname string,
              statx_flags is the flags argument, len should be the mask
              argument, and off must contain a pointer to the statxbuf to be
              filled in. See also statx(2) for the general description of the
              related system call. Available since 5.6.

       IORING_OP_READ

       IORING_OP_WRITE
              Issue the equivalent of a read(2) or write(2) system call. fd
              is the file descriptor to be operated on, addr contains the
              buffer in question, and len contains the length of the I/O
              operation. These are non-vectored versions of the
              IORING_OP_READV and IORING_OP_WRITEV opcodes. See also read(2)
              and write(2) for the general description of the related system
              call. Available since 5.6.

       IORING_OP_FILES_UPDATE
              This command is an alternative to using
              IORING_REGISTER_FILES_UPDATE which then works in an async
              fashion, like the rest of the io_uring commands. The arguments
              passed in are the same. addr must contain a pointer to the array
              of file descriptors, len must contain the length of the array,
              and off must contain the offset at which to operate. Note that
              the array of file descriptors pointed to in addr must remain
              valid until this operation has completed. Available since 5.6.

       The flags field is a bit mask. The supported flags are:

       IOSQE_FIXED_FILE
              When this flag is specified, fd is an index into the files array
              registered with the io_uring instance (see the
              IORING_REGISTER_FILES section of the io_uring_register(2) man
              page). Available since 5.1.

       IOSQE_IO_DRAIN
              When this flag is specified, the SQE will not be started before
              previously submitted SQEs have completed, and new SQEs will not
              be started before this one completes. Available since 5.2.

       IOSQE_IO_LINK
              When this flag is specified, it forms a link with the next SQE
              in the submission ring. That next SQE will not be started before
              this one completes. This, in effect, forms a chain of SQEs,
              which can be arbitrarily long. The tail of the chain is denoted
              by the first SQE that does not have this flag set. This flag
              has no effect on previous SQE submissions, nor does it impact
              SQEs that are outside of the chain tail. This means that
              multiple chains can be executing in parallel, or chains and
              individual SQEs. Only members inside the chain are serialized. A
              chain of SQEs will be broken if any request in that chain ends
              in error. io_uring considers any unexpected result an error.
              This means that, e.g., a short read will also terminate the
              remainder of the chain. If a chain of SQE links is broken, the
              remaining unstarted part of the chain will be terminated and
              completed with -ECANCELED as the error code. Available since
              5.3.

       IOSQE_IO_HARDLINK
              Like IOSQE_IO_LINK, but the link is not severed regardless of
              the completion result. Note that the link will still be severed
              if submitting the parent request fails; hard links are only
              resilient in the presence of completion results for requests
              that did submit correctly. IOSQE_IO_HARDLINK implies
              IOSQE_IO_LINK. Available since 5.5.

       IOSQE_ASYNC
              Normal operation for io_uring is to try and issue an sqe as
              non-blocking first, and if that fails, execute it in an async
              manner. To support more efficient overlapped operation of
              requests that the application knows/assumes will always (or most
              of the time) block, the application can ask for an sqe to be
              issued async from the start. Available since 5.6.

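       The ordering caveat described under IORING_OP_FSYNC can be addressed
       with IOSQE_IO_LINK. The following sketch fills two consecutive SQEs
       so that the fsync is started only after the write has completed. The
       function name prep_write_then_fsync() and the user_data tags 1 and 2
       are illustrative assumptions, not part of the interface.

```c
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

/*
 * Prepare a two-entry chain: a vectored write followed by a data sync.
 * The caller is assumed to place both entries in the submission ring,
 * in this order, before calling io_uring_enter(2).
 */
void prep_write_then_fsync(struct io_uring_sqe sqe[2], int fd,
                           const struct iovec *iov, unsigned nr_iov,
                           uint64_t offset)
{
    memset(sqe, 0, 2 * sizeof(*sqe));

    sqe[0].opcode = IORING_OP_WRITEV;
    sqe[0].fd = fd;
    sqe[0].off = offset;
    sqe[0].addr = (uint64_t)(uintptr_t) iov;  /* pointer to iovec array */
    sqe[0].len = nr_iov;                      /* number of iovecs */
    sqe[0].flags = IOSQE_IO_LINK;             /* link to the next SQE */
    sqe[0].user_data = 1;

    sqe[1].opcode = IORING_OP_FSYNC;
    sqe[1].fd = fd;
    sqe[1].fsync_flags = IORING_FSYNC_DATASYNC;
    sqe[1].user_data = 2;                     /* flag unset: tail of chain */
}
```

       As noted above, a short write counts as an error for the purposes of
       the chain, in which case the fsync completes with -ECANCELED.
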
       ioprio specifies the I/O priority. See ioprio_get(2) for a description
       of Linux I/O priorities.

       fd specifies the file descriptor against which the operation will be
       performed, with the exception noted above.

       If the operation is one of IORING_OP_READ_FIXED or
       IORING_OP_WRITE_FIXED, addr and len must fall within the buffer located
       at buf_index in the fixed buffer array. If the operation is either
       IORING_OP_READV or IORING_OP_WRITEV, then addr points to an iovec array
       of len entries.

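       The range requirement for fixed buffers can be checked in user space
       before an SQE is queued, for example as a debugging aid. This is only
       a sketch: fixed_range_ok() is a hypothetical helper, and bufs is
       assumed to be the same iovec array that the application registered
       through io_uring_register(2). The kernel performs its own validation
       regardless.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Return true if [addr, addr + len) lies entirely inside the buffer
 * registered at buf_index.
 */
bool fixed_range_ok(const struct iovec *bufs, unsigned nr_bufs,
                    unsigned buf_index, uint64_t addr, uint32_t len)
{
    if (buf_index >= nr_bufs)
        return false;

    uint64_t base = (uint64_t)(uintptr_t) bufs[buf_index].iov_base;
    uint64_t size = bufs[buf_index].iov_len;

    if (addr < base)
        return false;
    /* Compare offsets rather than end addresses to avoid overflow. */
    return addr - base <= size && len <= size - (addr - base);
}
```
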
       rw_flags, specified for read and write operations, contains a bitwise
       OR of per-I/O flags, as described in the preadv2(2) man page.

       The fsync_flags bit mask may contain either 0, for a normal file
       integrity sync, or IORING_FSYNC_DATASYNC to provide data sync only
       semantics. See the descriptions of O_SYNC and O_DSYNC in the open(2)
       manual page for more information.

       The bits that may be set in poll_events are defined in <poll.h>, and
       documented in poll(2).

       user_data is an application-supplied value that will be copied into the
       completion queue entry (see below). buf_index is an index into an
       array of fixed buffers, and is only valid if fixed buffers were
       registered. personality is the credentials id to use for this
       operation. See io_uring_register(2) for how to register personalities
       with io_uring. If set to 0, the current personality of the submitting
       task is used.

       Once the submission queue entry is initialized, I/O is submitted by
       placing the index of the submission queue entry into the tail of the
       submission queue. After one or more indexes are added to the queue,
       and the queue tail is advanced, the io_uring_enter(2) system call can
       be invoked to initiate the I/O.

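       The queueing step described above could be sketched as follows.
       struct sq_view and sq_push() are hypothetical names, and the pointers
       are assumed to refer to the ring memory mapped after
       io_uring_setup(2); this is an illustration of the index handling, not
       a definitive implementation.

```c
#include <linux/io_uring.h>

/* Hypothetical view of the mmap'ed submission ring. */
struct sq_view {
    unsigned *head;          /* consumer index, written by the kernel */
    unsigned *tail;          /* producer index, written by the application */
    unsigned  ring_mask;     /* number of entries minus one */
    unsigned  ring_entries;  /* number of entries in the ring */
    unsigned *array;         /* ring of indexes into sqes[] */
    struct io_uring_sqe *sqes;
};

/* Queue sqes[sqe_index]. Returns 0 on success, -1 if the ring is full. */
int sq_push(struct sq_view *sq, unsigned sqe_index)
{
    unsigned tail = *sq->tail;
    unsigned head = __atomic_load_n(sq->head, __ATOMIC_ACQUIRE);

    if (tail - head >= sq->ring_entries)
        return -1;

    /* Place the index of the prepared SQE at the tail of the queue. */
    sq->array[tail & sq->ring_mask] = sqe_index;
    /* Release: the SQE contents must be visible before the new tail. */
    __atomic_store_n(sq->tail, tail + 1, __ATOMIC_RELEASE);
    return 0;
}
```

       After one or more successful calls, the application would pass the
       number of queued entries as to_submit.
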
       Completions use the following data structure:

           /*
            * IO completion data structure (Completion Queue Entry)
            */
           struct io_uring_cqe {
               __u64    user_data; /* sqe->user_data submission passed back */
               __s32    res;       /* result code for this event */
               __u32    flags;
           };

       user_data is copied from the field of the same name in the submission
       queue entry. The primary use case is to store data that the
       application will need to access upon completion of this particular I/O.
       The flags field is reserved for future use. res is the
       operation-specific result.

       For read and write opcodes, the return values match those documented in
       the preadv2(2) and pwritev2(2) man pages. Return codes for the
       io_uring-specific opcodes are documented in the description of the
       opcodes above.

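       Since errors for individual requests arrive as negative values in
       res (for example -ENOENT or -ECANCELED above), an application may want
       to translate them back to the errno convention. A minimal sketch for
       read and write completions, with cqe_result() as a hypothetical
       helper name:

```c
#include <errno.h>
#include <linux/io_uring.h>

/*
 * Return the byte count of a completed read or write. On failure,
 * return -1 and store the error code in errno, mirroring the
 * convention of the corresponding system calls.
 */
long cqe_result(const struct io_uring_cqe *cqe)
{
    if (cqe->res < 0) {
        errno = -cqe->res;
        return -1;
    }
    return cqe->res;
}
```
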

RETURN VALUE

       io_uring_enter() returns the number of I/Os successfully consumed.
       This can be zero if to_submit was zero or if the submission queue was
       empty. The errors below that refer to an error in a submission queue
       entry will be returned through a completion queue entry, rather than
       through the system call itself.

       Errors that occur not on behalf of a submission queue entry are
       returned via the system call directly. On such an error, -1 is returned
       and errno is set appropriately.

ERRORS

       EAGAIN The kernel was unable to allocate memory for the request, or
              otherwise ran out of resources to handle it. The application
              should wait for some completions and try again.

       EBUSY  The application is attempting to overcommit the number of
              requests it can have pending. The application should wait for
              some completions and try again. May occur if the application
              tries to queue more requests than there is room for in the CQ
              ring.

       EBADF  The fd field in the submission queue entry is invalid, or the
              IOSQE_FIXED_FILE flag was set in the submission queue entry, but
              no files were registered with the io_uring instance.

       EFAULT The buffer is outside of the process's accessible address space.

       EFAULT IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was specified in
              the opcode field of the submission queue entry, but either
              buffers were not registered for this io_uring instance, or the
              address range described by addr and len does not fit within the
              buffer registered at buf_index.

       EINVAL The index member of the submission queue entry is invalid.

       EINVAL The flags field or opcode in a submission queue entry is
              invalid.

       EINVAL IORING_OP_NOP was specified in the submission queue entry, but
              the io_uring context was set up for polling (IORING_SETUP_IOPOLL
              was specified in the call to io_uring_setup(2)).

       EINVAL IORING_OP_READV or IORING_OP_WRITEV was specified in the
              submission queue entry, but the io_uring instance has fixed
              buffers registered.

       EINVAL IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was specified in
              the submission queue entry, and the buf_index is invalid.

       EINVAL IORING_OP_READV, IORING_OP_WRITEV, IORING_OP_READ_FIXED,
              IORING_OP_WRITE_FIXED or IORING_OP_FSYNC was specified in the
              submission queue entry, but the io_uring instance was configured
              for IOPOLLing, or any of addr, ioprio, off, len, or buf_index
              was set in the submission queue entry.

       EINVAL IORING_OP_POLL_ADD or IORING_OP_POLL_REMOVE was specified in the
              opcode field of the submission queue entry, but the io_uring
              instance was configured for busy-wait polling
              (IORING_SETUP_IOPOLL), or any of ioprio, off, len, or buf_index
              was non-zero in the submission queue entry.

       EINVAL IORING_OP_POLL_ADD was specified in the opcode field of the
              submission queue entry, and the addr field was non-zero.

       ENXIO  The io_uring instance is in the process of being torn down.

       EOPNOTSUPP
              fd does not refer to an io_uring instance.

       EOPNOTSUPP
              opcode is valid, but not supported by this kernel.

Linux                             2019-01-22                 IO_URING_ENTER(2)