io_uring_enter(2)          Linux Programmer's Manual         io_uring_enter(2)

NAME

       io_uring_enter - initiate and/or complete asynchronous I/O

SYNOPSIS

       #include <liburing.h>

       int io_uring_enter(unsigned int fd, unsigned int to_submit,
                          unsigned int min_complete, unsigned int flags,
                          sigset_t *sig);

       int io_uring_enter2(unsigned int fd, unsigned int to_submit,
                           unsigned int min_complete, unsigned int flags,
                           sigset_t *sig, size_t sz);

DESCRIPTION

       io_uring_enter(2) is used to initiate and complete I/O using the shared
       submission and completion queues set up by a call to io_uring_setup(2).
       A single call can both submit new I/O and wait for completions of I/O
       initiated by this call or previous calls to io_uring_enter(2).

       fd is the file descriptor returned by io_uring_setup(2). to_submit
       specifies the number of I/Os to submit from the submission queue.
       flags is a bitmask of the following values:

       IORING_ENTER_GETEVENTS
              If this flag is set, then the system call will wait for the
              specified number of events in min_complete before returning.
              This flag can be set along with to_submit to both submit and
              complete events in a single system call.

       IORING_ENTER_SQ_WAKEUP
              If the ring has been created with IORING_SETUP_SQPOLL, then this
              flag asks the kernel to wake up the SQ kernel thread to submit
              IO.

       IORING_ENTER_SQ_WAIT
              If the ring has been created with IORING_SETUP_SQPOLL, then the
              application has no real insight into when the SQ kernel thread
              has consumed entries from the SQ ring. This can lead to a
              situation where the application can no longer get a free SQE
              to submit, without knowing when one becomes available as the
              SQ kernel thread consumes them. If the system call is used with
              this flag set, then it will wait until at least one entry is
              free in the SQ ring.

       IORING_ENTER_EXT_ARG
              Since kernel 5.11, the system call's arguments have been
              modified to look like the following:

              int io_uring_enter2(unsigned int fd, unsigned int to_submit,
                                  unsigned int min_complete, unsigned int flags,
                                  const void *arg, size_t argsz);

              which behaves just like the original definition by default.
              However, if IORING_ENTER_EXT_ARG is set, then a pointer to a
              struct io_uring_getevents_arg is passed in instead of a
              sigset_t, and argsz must be set to the size of this structure.
              The definition is as follows:

              struct io_uring_getevents_arg {
                      __u64   sigmask;
                      __u32   sigmask_sz;
                      __u32   pad;
                      __u64   ts;
              };

              which allows passing in both a signal mask and a pointer to
              a struct __kernel_timespec timeout value. If ts is set to a
              valid pointer, then this time value indicates the timeout for
              waiting on events. If an application is waiting on events and
              wishes to stop waiting after a specified amount of time, then
              this can be accomplished directly in version 5.11 and newer by
              using this feature.

       IORING_ENTER_REGISTERED_RING
              If the ring file descriptor has been registered through use of
              IORING_REGISTER_RING_FDS, then setting this flag will tell the
              kernel that the fd passed in is the registered ring offset
              rather than a normal file descriptor.
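
       The timeout form of IORING_ENTER_EXT_ARG described above can be
       sketched as follows. This is only an illustration under stated
       assumptions: it uses the raw system call through syscall(2) instead
       of liburing, it needs kernel headers from 5.11 or newer, and the
       helper names fill_timeout_arg() and wait_cqes_timeout() are
       hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>
#include <linux/time_types.h>

/* Hypothetical helper: describe a wait with a timeout and no signal mask. */
static void fill_timeout_arg(struct io_uring_getevents_arg *arg,
                             const struct __kernel_timespec *ts)
{
    memset(arg, 0, sizeof(*arg));        /* sigmask and sigmask_sz stay 0 */
    arg->ts = (uint64_t)(uintptr_t)ts;   /* pointer carried as a __u64 */
}

/*
 * Hypothetical helper: wait for min_complete completions or until *ts
 * elapses. Returns the syscall(2) result, so -1 with errno set on failure.
 */
static int wait_cqes_timeout(int ring_fd, unsigned int min_complete,
                             const struct __kernel_timespec *ts)
{
    struct io_uring_getevents_arg arg;

    fill_timeout_arg(&arg, ts);
    return (int)syscall(__NR_io_uring_enter, ring_fd, 0, min_complete,
                        IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG,
                        &arg, sizeof(arg));
}
```

       Nothing is submitted here (to_submit is 0); the call only waits. A
       caller should treat any -1 return as "no guarantee about
       completions" and inspect errno and the completion queue.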
       
       If the io_uring instance was configured for polling, by specifying
       IORING_SETUP_IOPOLL in the call to io_uring_setup(2), then min_complete
       has a slightly different meaning. Passing a value of 0 instructs the
       kernel to return any events which are already complete, without
       blocking. If min_complete is a non-zero value, the kernel will still
       return immediately if any completion events are available. If no event
       completions are available, then the call will poll either until one or
       more completions become available, or until the process has exceeded
       its scheduler time slice.

       Note that, for interrupt-driven I/O (where IORING_SETUP_IOPOLL was not
       specified in the call to io_uring_setup(2)), an application may check
       the completion queue for event completions without entering the kernel
       at all.

       When the system call returns that a certain number of SQEs have been
       consumed and submitted, it's safe to reuse SQE entries in the ring.
       This is true even if the actual IO submission had to be punted to async
       context, which means that the SQE may in fact not have been submitted
       yet. If the kernel requires later use of a particular SQE entry, it
       will have made a private copy of it.

       sig is a pointer to a signal mask (see sigprocmask(2)); if sig is not
       NULL, io_uring_enter(2) first replaces the current signal mask by the
       one pointed to by sig, then waits for events to become available in the
       completion queue, and then restores the original signal mask. The
       following io_uring_enter(2) call:

           ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, &sig);

       is equivalent to atomically executing the following calls:

           pthread_sigmask(SIG_SETMASK, &sig, &orig);
           ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, NULL);
           pthread_sigmask(SIG_SETMASK, &orig, NULL);

       See the description of pselect(2) for an explanation of why the sig
       parameter is necessary.
       
       Submission queue entries are represented using the following data
       structure:

           /*
            * IO submission data structure (Submission Queue Entry)
            */
           struct io_uring_sqe {
               __u8    opcode;         /* type of operation for this sqe */
               __u8    flags;          /* IOSQE_ flags */
               __u16   ioprio;         /* ioprio for the request */
               __s32   fd;             /* file descriptor to do IO on */
               union {
                   __u64   off;        /* offset into file */
                   __u64   addr2;
               };
               union {
                   __u64   addr;       /* pointer to buffer or iovecs */
                   __u64   splice_off_in;
               };
               __u32   len;            /* buffer size or number of iovecs */
               union {
                   __kernel_rwf_t  rw_flags;
                   __u32    fsync_flags;
                   __u16    poll_events;   /* compatibility */
                   __u32    poll32_events; /* word-reversed for BE */
                   __u32    sync_range_flags;
                   __u32    msg_flags;
                   __u32    timeout_flags;
                   __u32    accept_flags;
                   __u32    cancel_flags;
                   __u32    open_flags;
                   __u32    statx_flags;
                   __u32    fadvise_advice;
                   __u32    splice_flags;
                   __u32    rename_flags;
                   __u32    unlink_flags;
                   __u32    hardlink_flags;
               };
               __u64    user_data;     /* data to be passed back at completion time */
               union {
                   struct {
                       union {
                           /* index into fixed buffers, if used */
                           __u16    buf_index;
                           /* for grouped buffer selection */
                           __u16    buf_group;
                       };
                       /* personality to use, if used */
                       __u16    personality;
                       union {
                           __s32    splice_fd_in;
                           __u32    file_index;
                       };
                   };
                   __u64    __pad2[3];
               };
           };

       The opcode describes the operation to be performed. It can be one of:

       IORING_OP_NOP
              Do not perform any I/O. This is useful for testing the
              performance of the io_uring implementation itself.

       IORING_OP_READV

       IORING_OP_WRITEV
              Vectored read and write operations, similar to preadv2(2) and
              pwritev2(2). If the file is not seekable, off must be set to
              zero or -1.
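
              As an illustration of how the struct io_uring_sqe fields map
              onto a vectored read, the sketch below fills one entry by
              hand. The helper name prep_readv() is hypothetical; it only
              prepares the entry and does not place it in the ring or call
              io_uring_enter(2).

```c
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

/* Hypothetical helper: fill one SQE for a vectored read at a file offset. */
static void prep_readv(struct io_uring_sqe *sqe, int fd,
                       const struct iovec *iov, unsigned int nr_iov,
                       uint64_t offset, uint64_t user_data)
{
    memset(sqe, 0, sizeof(*sqe));                /* clear unused fields */
    sqe->opcode    = IORING_OP_READV;
    sqe->fd        = fd;
    sqe->addr      = (uint64_t)(uintptr_t)iov;   /* pointer to the iovecs */
    sqe->len       = nr_iov;                     /* number of iovecs */
    sqe->off       = offset;                     /* offset into the file */
    sqe->user_data = user_data;                  /* echoed in the CQE */
}
```

              Swapping the opcode for IORING_OP_WRITEV gives the matching
              write under the same field layout.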
       
       IORING_OP_READ_FIXED

       IORING_OP_WRITE_FIXED
              Read from or write to pre-mapped buffers. See
              io_uring_register(2) for details on how to set up a context for
              fixed reads and writes.

       IORING_OP_FSYNC
              File sync. See also fsync(2). Note that, while I/O is initiated
              in the order in which it appears in the submission queue,
              completions are unordered. For example, an application which
              places a write I/O followed by an fsync in the submission queue
              cannot expect the fsync to apply to the write. The two
              operations execute in parallel, so the fsync may complete
              before the write is issued to the storage. The same is also
              true for previously issued writes that have not completed prior
              to the fsync.
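
              One way to get the ordering that plain submission does not
              give is to chain the two requests with the IOSQE_IO_LINK flag
              in the SQE flags field, so that the fsync is started only
              after the write has completed. The sketch below shows the two
              entries; the helper name prep_write_then_fsync() is
              hypothetical, and the two SQEs are assumed to occupy
              consecutive submission queue slots.

```c
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper: fill a write SQE and the fsync SQE that follows it.
 * IOSQE_IO_LINK on the write chains it to the next SQE in the queue.
 */
static void prep_write_then_fsync(struct io_uring_sqe *write_sqe,
                                  struct io_uring_sqe *fsync_sqe,
                                  int fd, const void *buf,
                                  unsigned int len, uint64_t offset)
{
    memset(write_sqe, 0, sizeof(*write_sqe));
    write_sqe->opcode = IORING_OP_WRITE;
    write_sqe->flags  = IOSQE_IO_LINK;           /* chain to the next SQE */
    write_sqe->fd     = fd;
    write_sqe->addr   = (uint64_t)(uintptr_t)buf;
    write_sqe->len    = len;
    write_sqe->off    = offset;

    memset(fsync_sqe, 0, sizeof(*fsync_sqe));
    fsync_sqe->opcode = IORING_OP_FSYNC;         /* end of chain: no flag */
    fsync_sqe->fd     = fd;
}
```

              If the write fails, the linked fsync is not executed and
              completes with -ECANCELED, so both completions need to be
              checked.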
       
       IORING_OP_POLL_ADD
              Poll the fd specified in the submission queue entry for the
              events specified in the poll_events field. Unlike poll or epoll
              without EPOLLONESHOT, by default this interface always works in
              one shot mode. That is, once the poll operation is completed,
              it will have to be resubmitted.

              If IORING_POLL_ADD_MULTI is set in the SQE len field, then the
              poll will work in multi shot mode instead. That means it will
              repeatedly trigger when the requested event becomes true, and
              hence multiple CQEs can be generated from this single SQE. The
              CQE flags field will have IORING_CQE_F_MORE set on completion if
              the application should expect further CQE entries from the
              original request. If this flag isn't set on completion, then the
              poll request has been terminated and no further events will be
              generated. This mode is available since 5.13.

              If IORING_POLL_UPDATE_EVENTS is set in the SQE len field, then
              the request will update an existing poll request with the mask
              of events passed in with this request. The lookup is based on
              the user_data field of the original SQE submitted, and this
              value is passed in the addr field of the SQE. This mode is
              available since 5.13.

              If IORING_POLL_UPDATE_USER_DATA is set in the SQE len field,
              then the request will update the user_data of an existing poll
              request based on the value passed in the off field. This mode is
              available since 5.13.

              This command works like an async poll(2) and the completion
              event result is the returned mask of events. For the variants
              that update user_data or events, the completion result will be
              similar to IORING_OP_POLL_REMOVE.
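
              The multi shot mode described above can be sketched as
              follows. The helper names prep_poll_multishot() and
              poll_is_terminated() are hypothetical, and the fallback
              definitions are only there for headers older than 5.13.

```c
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

#ifndef IORING_POLL_ADD_MULTI
#define IORING_POLL_ADD_MULTI (1U << 0)   /* fallback for older headers */
#endif
#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE (1U << 1)       /* fallback for older headers */
#endif

/* Hypothetical helper: fill an SQE for a multi shot poll on fd. */
static void prep_poll_multishot(struct io_uring_sqe *sqe, int fd,
                                unsigned short events, uint64_t user_data)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode      = IORING_OP_POLL_ADD;
    sqe->fd          = fd;
    sqe->poll_events = events;                 /* e.g. POLLIN */
    sqe->len         = IORING_POLL_ADD_MULTI;  /* mode goes in len */
    sqe->user_data   = user_data;
}

/* Returns nonzero when no further CQEs will arrive for this poll request. */
static int poll_is_terminated(const struct io_uring_cqe *cqe)
{
    return (cqe->flags & IORING_CQE_F_MORE) == 0;
}
```

              An application would check poll_is_terminated() on every
              completion for the request and resubmit the poll once it
              returns nonzero.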
       
       IORING_OP_POLL_REMOVE
              Remove an existing poll request. If found, the res field of the
              struct io_uring_cqe will contain 0. If not found, res will
              contain -ENOENT, or -EALREADY if the poll request was in the
              process of completing already.

       IORING_OP_EPOLL_CTL
              Add, remove or modify entries in the interest list of epoll(7).
              See epoll_ctl(2) for details of the system call. fd holds the
              file descriptor that represents the epoll instance, off holds
              the file descriptor to add, remove or modify, len holds the
              operation (EPOLL_CTL_ADD, EPOLL_CTL_DEL, EPOLL_CTL_MOD) to
              perform, and addr holds a pointer to the epoll_event structure.
              Available since 5.6.

       IORING_OP_SYNC_FILE_RANGE
              Issue the equivalent of a sync_file_range(2) on the file
              descriptor. The fd field is the file descriptor to sync, the
              off field holds the offset in bytes, the len field holds the
              length in bytes, and the sync_range_flags field holds the flags
              for the command. See also sync_file_range(2) for the general
              description of the related system call. Available since 5.2.

       IORING_OP_SENDMSG
              Issue the equivalent of a sendmsg(2) system call. fd must be
              set to the socket file descriptor, addr must contain a pointer
              to the msghdr structure, and msg_flags holds the flags
              associated with the system call. See also sendmsg(2) for the
              general description of the related system call. Available since
              5.3.

              This command also supports the following modifiers in ioprio:

                   IORING_RECVSEND_POLL_FIRST
                          If set, io_uring will assume the socket is
                          currently full and that attempting to send data
                          will be unsuccessful. For this case, io_uring will
                          arm internal poll and trigger a send of the data
                          when there is enough space available. The initial
                          send attempt can be wasteful when the socket is
                          expected to be full; setting this flag bypasses
                          that attempt and goes straight to arming poll. If
                          poll does indicate that data can be sent, the
                          operation will proceed.

       IORING_OP_RECVMSG
              Works just like IORING_OP_SENDMSG, except for recvmsg(2)
              instead. See the description of IORING_OP_SENDMSG. Available
              since 5.3.

              This command also supports the following modifiers in ioprio:

                   IORING_RECVSEND_POLL_FIRST
                          If set, io_uring will assume the socket is
                          currently empty and that attempting to receive
                          data will be unsuccessful. For this case, io_uring
                          will arm internal poll and trigger a receive of
                          the data when the socket has data to be read. The
                          initial receive attempt can be wasteful when the
                          socket is expected to be empty; setting this flag
                          bypasses that attempt and goes straight to arming
                          poll. If poll does indicate that data is ready to
                          be received, the operation will proceed.

       IORING_OP_SEND
              Issue the equivalent of a send(2) system call. fd must be set
              to the socket file descriptor, addr must contain a pointer to
              the buffer, len denotes the length of the buffer to send, and
              msg_flags holds the flags associated with the system call. See
              also send(2) for the general description of the related system
              call. Available since 5.6.

              This command also supports the following modifiers in ioprio:

                   IORING_RECVSEND_POLL_FIRST
                          If set, io_uring will assume the socket is
                          currently full and that attempting to send data
                          will be unsuccessful. For this case, io_uring will
                          arm internal poll and trigger a send of the data
                          when there is enough space available. The initial
                          send attempt can be wasteful when the socket is
                          expected to be full; setting this flag bypasses
                          that attempt and goes straight to arming poll. If
                          poll does indicate that data can be sent, the
                          operation will proceed.
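
              A send request that uses the ioprio modifier above might be
              prepared as in the following sketch. The helper name
              prep_send_poll_first() is hypothetical, and the fallback
              definition of IORING_RECVSEND_POLL_FIRST is only there for
              older headers.

```c
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

#ifndef IORING_RECVSEND_POLL_FIRST
#define IORING_RECVSEND_POLL_FIRST (1U << 0)  /* fallback for older headers */
#endif

/* Hypothetical helper: fill a send SQE that arms poll before sending. */
static void prep_send_poll_first(struct io_uring_sqe *sqe, int sockfd,
                                 const void *buf, unsigned int len,
                                 unsigned int msg_flags, uint64_t user_data)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode    = IORING_OP_SEND;
    sqe->fd        = sockfd;
    sqe->addr      = (uint64_t)(uintptr_t)buf;    /* data to send */
    sqe->len       = len;                         /* length in bytes */
    sqe->msg_flags = msg_flags;                   /* as for send(2) */
    sqe->ioprio    = IORING_RECVSEND_POLL_FIRST;  /* modifier in ioprio */
    sqe->user_data = user_data;
}
```

              The same field layout applies to IORING_OP_RECV, with addr
              pointing at the receive buffer.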
       
       IORING_OP_RECV
              Works just like IORING_OP_SEND, except for recv(2) instead. See
              the description of IORING_OP_SEND. Available since 5.6.

              This command also supports the following modifiers in ioprio:

                   IORING_RECVSEND_POLL_FIRST
                          If set, io_uring will assume the socket is
                          currently empty and that attempting to receive
                          data will be unsuccessful. For this case, io_uring
                          will arm internal poll and trigger a receive of
                          the data when the socket has data to be read. The
                          initial receive attempt can be wasteful when the
                          socket is expected to be empty; setting this flag
                          bypasses that attempt and goes straight to arming
                          poll. If poll does indicate that data is ready to
                          be received, the operation will proceed.

       IORING_OP_TIMEOUT
              This command will register a timeout operation. The addr field
              must contain a pointer to a struct timespec64 structure, len
              must contain 1 to signify one timespec64 structure, and
              timeout_flags may contain IORING_TIMEOUT_ABS for an absolute
              timeout value, or 0 for a relative timeout. off may contain a
              completion event count. A timeout will trigger a wakeup event
              on the completion ring for anyone waiting for events. A timeout
              condition is met when either the specified timeout expires, or
              the specified number of events have completed. Either condition
              will trigger the event. If off is set to 0, completed events
              are not counted, which effectively acts like a timer. io_uring
              timeouts use the CLOCK_MONOTONIC clock source. The request will
              complete with -ETIME if the timeout got completed through
              expiration of the timer, or 0 if the timeout got completed
              through requests completing on their own. If the timeout was
              canceled before it expired, the request will complete with
              -ECANCELED. Available since 5.4.

              Since 5.15, this command also supports the following modifiers
              in timeout_flags:

                   IORING_TIMEOUT_BOOTTIME
                          If set, then the clocksource used is
                          CLOCK_BOOTTIME instead of CLOCK_MONOTONIC. This
                          clocksource differs in that it includes time
                          elapsed if the system was suspended while having a
                          timeout request in-flight.

                   IORING_TIMEOUT_REALTIME
                          If set, then the clocksource used is
                          CLOCK_REALTIME instead of CLOCK_MONOTONIC.
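
              The field usage for a timeout request can be sketched as
              follows. The helper name prep_timeout() is hypothetical. The
              sketch passes a struct __kernel_timespec from
              <linux/time_types.h>, on the assumption that this is the
              user-space type matching the time structure described above,
              and sets fd to -1 because no file is involved.

```c
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>
#include <linux/time_types.h>

/*
 * Hypothetical helper: fill a timeout SQE. count is the completion event
 * count (0 makes it a pure timer); flags is 0 for a relative timeout or
 * IORING_TIMEOUT_ABS for an absolute one.
 */
static void prep_timeout(struct io_uring_sqe *sqe,
                         const struct __kernel_timespec *ts,
                         unsigned int count, unsigned int flags,
                         uint64_t user_data)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode        = IORING_OP_TIMEOUT;
    sqe->fd            = -1;                        /* assumption: unused */
    sqe->addr          = (uint64_t)(uintptr_t)ts;   /* the time value */
    sqe->len           = 1;                         /* one time structure */
    sqe->off           = count;                     /* completion count */
    sqe->timeout_flags = flags;
    sqe->user_data     = user_data;
}
```

              The completion for such a request then carries -ETIME, 0 or
              -ECANCELED in res, as described above.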
       
       IORING_OP_TIMEOUT_REMOVE
              If timeout_flags are zero, then it attempts to remove an
              existing timeout operation. addr must contain the user_data
              field of the previously issued timeout operation. If the
              specified timeout request is found and canceled successfully,
              this request will terminate with a result value of 0. If the
              timeout request was found but expiration was already in
              progress, this request will terminate with a result value of
              -EBUSY. If the timeout request wasn't found, the request will
              terminate with a result value of -ENOENT. Available since 5.5.

              If timeout_flags contain IORING_TIMEOUT_UPDATE, instead of
              removing an existing operation, it updates it. addr and return
              values are the same as before. The addr2 field must contain a
              pointer to a struct timespec64 structure. timeout_flags may
              also contain IORING_TIMEOUT_ABS, in which case the value given
              is an absolute one, not a relative one. Available since 5.11.

       IORING_OP_ACCEPT
              Issue the equivalent of an accept4(2) system call. fd must be
              set to the socket file descriptor, addr must contain the pointer
              to the sockaddr structure, and addr2 must contain a pointer to
              the socklen_t addrlen field. Flags can be passed using the
              accept_flags field. See also accept4(2) for the general
              description of the related system call. Available since 5.5.

              If the file_index field is set to a positive number, the file
              won't be installed into the normal file table as usual but will
              be placed into the fixed file table at index file_index - 1. In
              this case, instead of returning a file descriptor, the result
              will contain either 0 on success or an error. If the index
              points to a valid empty slot, the installation is guaranteed to
              not fail. If there is already a file in the slot, it will be
              replaced, similar to IORING_OP_FILES_UPDATE. Please note that
              only io_uring has access to such files and no other syscall can
              use them. See IOSQE_FIXED_FILE and IORING_REGISTER_FILES.

              Available since 5.15.
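
              The off-by-one encoding of file_index is easy to get wrong,
              so a short sketch may help. The helper name
              prep_accept_direct() is hypothetical, slot is the zero-based
              index into the fixed file table, and headers from 5.15 or
              newer are assumed.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper: fill an accept SQE that installs the new socket
 * into fixed file table entry `slot` instead of the normal file table.
 */
static void prep_accept_direct(struct io_uring_sqe *sqe, int listen_fd,
                               struct sockaddr *addr, socklen_t *addrlen,
                               int flags, unsigned int slot)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode       = IORING_OP_ACCEPT;
    sqe->fd           = listen_fd;
    sqe->addr         = (uint64_t)(uintptr_t)addr;
    sqe->addr2        = (uint64_t)(uintptr_t)addrlen;
    sqe->accept_flags = (uint32_t)flags;
    sqe->file_index   = slot + 1;   /* 0 would mean "use a normal fd" */
}
```

              On success the completion result is 0 rather than a file
              descriptor, and later requests refer to the socket through
              its fixed file table index.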
       
       IORING_OP_ASYNC_CANCEL
              Attempt to cancel an already issued request. addr must contain
              the user_data field of the request that should be canceled. The
              cancelation request will complete with one of the following
              result codes. If found, the res field of the cqe will contain 0.
              If not found, res will contain -ENOENT. If found and
              cancelation was attempted, the res field will contain
              -EALREADY. In this case, the request may or may not terminate.
              In general, requests that are interruptible (like socket IO)
              will get canceled, while disk IO requests cannot be canceled if
              already started. Available since 5.5.
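
              The request and its three documented outcomes can be sketched
              as follows. The helper names prep_cancel() and
              cancel_result_str() are hypothetical, and fd is set to -1 on
              the assumption that no file is involved.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: fill an SQE that cancels the request whose
 * user_data equals target_user_data. */
static void prep_cancel(struct io_uring_sqe *sqe, uint64_t target_user_data,
                        uint64_t user_data)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode    = IORING_OP_ASYNC_CANCEL;
    sqe->fd        = -1;                 /* assumption: unused */
    sqe->addr      = target_user_data;   /* key of the request to cancel */
    sqe->user_data = user_data;          /* key of the cancel itself */
}

/* Map the res field of the cancel completion to a description. */
static const char *cancel_result_str(int res)
{
    switch (res) {
    case 0:
        return "canceled";
    case -ENOENT:
        return "not found";
    case -EALREADY:
        return "in progress, may or may not terminate";
    default:
        return "unexpected result";
    }
}
```

              In the -EALREADY case the outcome of the original request is
              only known once its own completion arrives.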
       
       IORING_OP_LINK_TIMEOUT
              This request must be linked with another request through
              IOSQE_IO_LINK which is described below. Unlike
              IORING_OP_TIMEOUT, IORING_OP_LINK_TIMEOUT acts on the linked
              request, not the completion queue. The format of the command is
              otherwise like IORING_OP_TIMEOUT, except there's no completion
              event count as it's tied to a specific request. If used, the
              timeout specified in the command will cancel the linked command,
              unless the linked command completes before the timeout. The
              timeout will complete with -ETIME if the timer expired and
              cancelation of the linked request was attempted, or -ECANCELED
              if the timer got canceled because of completion of the linked
              request. Like IORING_OP_TIMEOUT, the clock source used is
              CLOCK_MONOTONIC. Available since 5.5.
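
              A request paired with its linked timeout might be prepared as
              in the sketch below. The helper name prep_read_with_timeout()
              is hypothetical, the two SQEs are assumed to occupy
              consecutive submission queue slots, and the time value is
              passed as a struct __kernel_timespec on the same assumption
              as for IORING_OP_TIMEOUT.

```c
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>
#include <linux/time_types.h>

/*
 * Hypothetical helper: fill a read SQE and the link timeout SQE that
 * follows it. IOSQE_IO_LINK goes on the read, which is the request the
 * timeout acts on.
 */
static void prep_read_with_timeout(struct io_uring_sqe *read_sqe,
                                   struct io_uring_sqe *timeout_sqe,
                                   int fd, void *buf, unsigned int len,
                                   const struct __kernel_timespec *ts)
{
    memset(read_sqe, 0, sizeof(*read_sqe));
    read_sqe->opcode = IORING_OP_READ;
    read_sqe->flags  = IOSQE_IO_LINK;            /* link to the timeout */
    read_sqe->fd     = fd;
    read_sqe->addr   = (uint64_t)(uintptr_t)buf;
    read_sqe->len    = len;
    read_sqe->off    = (uint64_t)-1;             /* use the file position */

    memset(timeout_sqe, 0, sizeof(*timeout_sqe));
    timeout_sqe->opcode = IORING_OP_LINK_TIMEOUT;
    timeout_sqe->fd     = -1;                    /* assumption: unused */
    timeout_sqe->addr   = (uint64_t)(uintptr_t)ts;
    timeout_sqe->len    = 1;                     /* one time structure */
}
```

              Both requests produce a completion, so an application would
              normally give them distinct user_data values before
              submitting.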
       
       IORING_OP_CONNECT
              Issue the equivalent of a connect(2) system call. fd must be
              set to the socket file descriptor, addr must contain the const
              pointer to the sockaddr structure, and off must contain the
              socklen_t addrlen field. See also connect(2) for the general
              description of the related system call. Available since 5.5.

       IORING_OP_FALLOCATE
              Issue the equivalent of a fallocate(2) system call. fd must be
              set to the file descriptor, len must contain the mode associated
              with the operation, off must contain the offset on which to
              operate, and addr must contain the length. See also
              fallocate(2) for the general description of the related system
              call. Available since 5.6.

       IORING_OP_FADVISE
              Issue the equivalent of a posix_fadvise(2) system call. fd must
              be set to the file descriptor, off must contain the offset on
              which to operate, len must contain the length, and
              fadvise_advice must contain the advice associated with the
              operation. See also posix_fadvise(2) for the general
              description of the related system call. Available since 5.6.

       IORING_OP_MADVISE
              Issue the equivalent of a madvise(2) system call. addr must
              contain the address to operate on, len must contain the length
              on which to operate, and fadvise_advice must contain the advice
              associated with the operation. See also madvise(2) for the
              general description of the related system call. Available since
              5.6.

       IORING_OP_OPENAT
              Issue the equivalent of an openat(2) system call. fd is the
              dirfd argument, addr must contain a pointer to the *pathname
              argument, open_flags should contain any flags passed in, and
              len is the access mode of the file. See also openat(2) for the
              general description of the related system call. Available since
              5.6.

              If the file_index field is set to a positive number, the file
              won't be installed into the normal file table as usual but will
              be placed into the fixed file table at index file_index - 1. In
              this case, instead of returning a file descriptor, the result
              will contain either 0 on success or an error. If the index
              points to a valid empty slot, the installation is guaranteed to
              not fail. If there is already a file in the slot, it will be
              replaced, similar to IORING_OP_FILES_UPDATE. Please note that
              only io_uring has access to such files and no other syscall can
              use them. See IOSQE_FIXED_FILE and IORING_REGISTER_FILES.

              Available since 5.15.

       IORING_OP_OPENAT2
              Issue the equivalent of an openat2(2) system call. fd is the
              dirfd argument, addr must contain a pointer to the *pathname
              argument, len should contain the size of the open_how structure,
              and off should be set to the address of the open_how structure.
              See also openat2(2) for the general description of the related
              system call. Available since 5.6.

              If the file_index field is set to a positive number, the file
              won't be installed into the normal file table as usual but will
              be placed into the fixed file table at index file_index - 1. In
              this case, instead of returning a file descriptor, the result
              will contain either 0 on success or an error. If the index
              points to a valid empty slot, the installation is guaranteed to
              not fail. If there is already a file in the slot, it will be
              replaced, similar to IORING_OP_FILES_UPDATE. Please note that
              only io_uring has access to such files and no other syscall can
              use them. See IOSQE_FIXED_FILE and IORING_REGISTER_FILES.

              Available since 5.15.

       IORING_OP_CLOSE
              Issue the equivalent of a close(2) system call. fd is the file
              descriptor to be closed. See also close(2) for the general
              description of the related system call. Available since 5.6.

              If the file_index field is set to a positive number, this
              command can be used to close files that were direct opened
              through IORING_OP_OPENAT, IORING_OP_OPENAT2, or
              IORING_OP_ACCEPT using the io_uring specific direct
              descriptors. Note that only one of the descriptor fields may be
              set. The direct close feature is available since the 5.15
              kernel, where direct descriptors were introduced.

       IORING_OP_STATX
              Issue the equivalent of a statx(2) system call. fd is the dirfd
              argument, addr must contain a pointer to the *pathname string,
              statx_flags is the flags argument, len should be the mask
              argument, and off must contain a pointer to the statxbuf to be
              filled in. See also statx(2) for the general description of the
              related system call. Available since 5.6.

       IORING_OP_READ

       IORING_OP_WRITE
              Issue the equivalent of a pread(2) or pwrite(2) system call. fd
              is the file descriptor to be operated on, addr contains the
              buffer in question, len contains the length of the IO
              operation, and off contains the read or write offset. If fd
              does not refer to a seekable file, off must be set to zero or
              -1. If off is set to -1, the offset will use (and advance) the
              file position, like the read(2) and write(2) system calls.
              These are non-vectored versions of the IORING_OP_READV and
              IORING_OP_WRITEV opcodes. See also read(2) and write(2) for the
              general description of the related system call. Available since
              5.6.

       IORING_OP_SPLICE
575              Issue the equivalent of a splice(2) system  call.   splice_fd_in
576              is  the file descriptor to read from, splice_off_in is an offset
577              to read from, fd is the file descriptor to write to, off  is  an
578              offset from which to start writing to. A sentinel value of -1 is
579              used to pass the  equivalent  of  a  NULL  for  the  offsets  to
580              splice(2).    len   contains   the  number  of  bytes  to  copy.
581              splice_flags contains a bit mask for the flag  field  associated
582              with the system call.  Please note that one of the file descrip‐
583              tors must refer to a pipe.  See also splice(2) for  the  general
584              description of the related system call. Available since 5.7.
585
586
587       IORING_OP_TEE
588              Issue  the  equivalent of a tee(2) system call.  splice_fd_in is
589              the file descriptor to read from, fd is the file  descriptor  to
590              write  to,  len  contains  the  number  of  bytes  to  copy, and
591              splice_flags contains a bit mask for the flag  field  associated
592              with  the  system  call.   Please note that both of the file de‐
593              scriptors must refer to a pipe.  See also tee(2) for the general
594              description of the related system call. Available since 5.8.
595
596
597       IORING_OP_FILES_UPDATE
598              This   command   is   an   alternative  to  using  IORING_REGIS‐
599              TER_FILES_UPDATE which then works in an async fashion, like  the
600              rest  of the io_uring commands.  The arguments passed in are the
601              same.  addr must contain a pointer to the array of file descrip‐
602              tors,  len  must  contain  the length of the array, and off must
603              contain the offset at which to operate. Note that the  array  of
604              file descriptors pointed to in addr must remain valid until this
605              operation has completed. Available since 5.6.
606
607
608       IORING_OP_PROVIDE_BUFFERS
609              This command allows an application to register a group  of  buf‐
610              fers  to  be used by commands that read/receive data. Using buf‐
611              fers in this manner can eliminate the need to separate the  poll
612              +  read, which provides a convenient point in time to allocate a
613              buffer for a given request. It's often  infeasible  to  have  as
614              many  buffers available as pending reads or receives. With this
615              feature, the application can have its pool of buffers  ready  in
616              the kernel, and when the file or socket is ready to read/receive
617              data, a buffer can be selected for the operation.  fd must  con‐
618              tain  the  number  of  buffers to provide, addr must contain the
619              starting address to add  buffers  from,  len  must  contain  the
620              length of each buffer to add from the range, buf_group must con‐
621              tain the group ID of this range of buffers, and off must contain
622              the  starting buffer ID of this range of buffers. With that set,
623              the kernel adds buffers starting  with  the  memory  address  in
624              addr,  each  with a length of len.  Hence the application should
625              provide len * fd worth of memory in addr.  Buffers  are  grouped
626              by the group ID, and each buffer within this group will be iden‐
627              tical in size according to the above arguments. This allows  the
628              application  to provide different groups of buffers, and this is
629              often used to have differently sized buffers available depending
630              on  what  the  expectations  are of the individual request. When
631              submitting a request that should  use  a  provided  buffer,  the
632              IOSQE_BUFFER_SELECT  flag must be set, and buf_group must be set
633              to the desired buffer group ID where the buffer  should  be  se‐
634              lected from. Available since 5.7.
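
              The argument mapping above can be sketched as follows.  Both
              helper names are hypothetical.  The sketch assumes kernel
              headers from 5.7 or later, and it assumes that buffer IDs are
              assigned consecutively, starting at the value passed in off, as
              the kernel walks the region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: describe nr_bufs buffers of buf_len bytes each,
 * carved from one contiguous region starting at base. The region must
 * span buf_len * nr_bufs bytes. */
static void prep_provide_buffers(struct io_uring_sqe *sqe, void *base,
                                 unsigned int buf_len, int nr_bufs,
                                 uint16_t group_id, uint16_t first_bid)
{
    assert(nr_bufs > 0 && buf_len > 0);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_PROVIDE_BUFFERS;
    sqe->fd = nr_bufs;                      /* number of buffers */
    sqe->addr = (uint64_t)(uintptr_t)base;  /* start of the region */
    sqe->len = buf_len;                     /* length of each buffer */
    sqe->buf_group = group_id;              /* buffer group ID */
    sqe->off = first_bid;                   /* first buffer ID */
}

/* Address of buffer bid, assuming consecutive IDs from first_bid. */
static void *provided_buffer_addr(void *base, unsigned int buf_len,
                                  uint16_t first_bid, uint16_t bid)
{
    return (char *)base + (size_t)(bid - first_bid) * buf_len;
}
```

              With four 64-byte buffers starting at ID 10, buffer 12 begins
              128 bytes into the region under these assumptions.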
635
636
637       IORING_OP_REMOVE_BUFFERS
638              Remove buffers previously registered with IORING_OP_PROVIDE_BUF‐
639              FERS.  fd must contain the number  of  buffers  to  remove,  and
640              buf_group  must contain the buffer group ID from which to remove
641              the buffers. Available since 5.7.
642
643
644       IORING_OP_SHUTDOWN
645              Issue the equivalent of a shutdown(2) system call.   fd  is  the
646              file descriptor to the socket being shut down, and len must  be
647              set to the how argument. No other fields should be set.
648              Available since 5.11.
649
650
651       IORING_OP_RENAMEAT
652              Issue  the  equivalent of a renameat2(2) system call.  fd should
653              be set to the olddirfd, addr should be set to the  oldpath,  len
654              should  be  set to the newdirfd, addr2 should be set to the new‐
655              path, and rename_flags should be set to the flags passed  in  to
656              renameat2(2).  Available since 5.11.
658
659
660       IORING_OP_UNLINKAT
661              Issue the equivalent of an unlinkat(2) system call.   fd  should
662              be set to the dirfd, addr should be set to the pathname, and un‐
663              link_flags should be set to the flags being  passed  in  to  un‐
664              linkat(2).  Available since 5.11.
665
666
667       IORING_OP_MKDIRAT
668              Issue the equivalent of a mkdirat(2) system call.  fd should  be
669              set to the dirfd, addr should be set to the  pathname,  and  len
670              should be set to the mode being passed in to mkdirat(2).  Avail‐
671              able since 5.15.
672
673
674       IORING_OP_SYMLINKAT
675              Issue the equivalent of a symlinkat(2) system call.  fd   should
676              be  set  to  the  newdirfd, addr should be set to the target and
677              addr2 should be set to the linkpath  being  passed  in  to  sym‐
678              linkat(2).  Available since 5.15.
679
680
681       IORING_OP_LINKAT
682              Issue  the  equivalent of a linkat(2) system call.  fd should be
683              set to the olddirfd, addr should be  set  to  the  oldpath,  len
684              should  be  set to the newdirfd, addr2 should be set to the new‐
685              path, and hardlink_flags should be set to the flags being passed
686              in to linkat(2).  Available since 5.15.
687
688
689       IORING_OP_MSG_RING
690              Send  a  message  to  an io_uring.  fd must be set to a file de‐
691              scriptor of a ring that the application has access to,  len  can
692              be  set  to any 32-bit value that the application wishes to pass
693              on, and off should be set to any 64-bit value that the application
694              wishes  to  send.  On the target ring, a CQE will be posted with
695              the res field matching the len set, and a user_data field match‐
696              ing the off value being passed in. This request type can be used
697              to either just wake or interrupt anyone waiting for  completions
698              on  the  target ring, or it can be used to pass messages via the
699              two fields. Available since 5.18.
700
701
702       IORING_OP_SOCKET
703              Issue the equivalent of a socket(2) system call.  fd  must  con‐
704              tain  the  communication domain, off must contain the communica‐
705              tion type, len must contain the protocol, and rw_flags  is  cur‐
706              rently  unused  and  must be set to zero. See also socket(2) for
707              the general description of the related  system  call.  Available
708              since 5.19.
709
710              If  the  file_index  field is set to a positive number, the file
711              won't be installed into the normal file table as usual but  will
712              be placed into the fixed file table at index file_index - 1.  In
713              this case, instead of returning a file  descriptor,  the  result
714              will  contain  either  0  on  success  or an error. If the index
715              points to a valid empty slot, the installation is guaranteed  to
716              not fail. If there is already a file in the slot, it will be re‐
717              placed, similar to  IORING_OP_FILES_UPDATE.   Please  note  that
718              only  io_uring has access to such files and no other syscall can
719              use them. See IOSQE_FIXED_FILE and IORING_REGISTER_FILES.
720
721              Available since 5.19.
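
              The file_index encoding described above is off by one relative
              to the fixed file table slot.  A small sketch of the conversion
              in both directions follows; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* file_index is 1-based: 0 requests a regular file descriptor, and a
 * value of N selects fixed file table slot N - 1. */
static uint32_t slot_to_file_index(uint32_t slot)
{
    assert(slot < UINT32_MAX);  /* slot + 1 must not wrap to 0 */
    return slot + 1;
}

/* Returns the slot selected by file_index, or -1 for a regular fd. */
static int64_t file_index_to_slot(uint32_t file_index)
{
    return file_index == 0 ? -1 : (int64_t)file_index - 1;
}
```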
722
723
724       IORING_OP_SEND_ZC
725              Issue the zerocopy equivalent of a send(2) system call.  Similar
726              to IORING_OP_SEND, but tries to avoid making intermediate copies
727              of data. Zerocopy execution is not guaranteed and may fall  back
728              to  copying. The request may also fail with -EOPNOTSUPP when a
729              protocol doesn't support zerocopy, in which case users are  rec‐
730              ommended to use copying sends instead.
731
732              The flags field of the first struct io_uring_cqe will  typically
733              contain IORING_CQE_F_MORE, which means there will be a second
734              completion  event  /  notification  for  the  request,  with the
735              user_data field set to the same value. The user must not  modify
736              the  data buffer until the notification is posted. The first cqe
737              follows the usual rules and so its res field  will  contain  the
738              number  of  bytes  sent  or a negative error code. The notifica‐
739              tion's res field will be set to zero and the  flags  field  will
740              contain IORING_CQE_F_NOTIF.  The two-step model is  needed  be‐
741              cause the kernel may hold on to buffers for a  long  time,  e.g.
742              waiting  for  a  TCP  ACK, and having a separate cqe for request
743              completions allows userspace to push more data without extra de‐
744              lays.  Note,  notifications are only responsible for controlling
745              the lifetime of the buffers, and as  such  don't  mean  anything
746              about whether the data has actually been sent out or received by
747              the other end. Even errored requests may  generate  a  notifica‐
748              tion,  and the user must check for IORING_CQE_F_MORE rather than
749              relying on the result.
750
751              fd must be set to the socket file descriptor, addr must  contain
752              a pointer to the buffer, len denotes the length of the buffer to
753              send, and msg_flags holds the flags associated with  the  system
754              call.  When  addr2  is  non-zero it points to the address of the
755              target with addr_len specifying its size,  turning  the  request
756              into a sendto(2) system call equivalent.
757
758              Available since 6.0.
759
760              This command also supports the following modifiers in ioprio:
761
762
763                   IORING_RECVSEND_POLL_FIRST If set, io_uring will assume the
764                   socket is currently full and that attempting to  send  data
765                   will  be  unsuccessful.  For  this  case, io_uring will arm
766                   internal poll and trigger a send of the data when there  is
767                   enough space available. An initial send attempt is wasteful
768                   when the socket is expected to be full; setting  this  flag
769                   bypasses  that  attempt  and goes straight to arming poll.
770                   Once poll indicates that data can be  sent,  the  operation
771                   will proceed.
772
773                   IORING_RECVSEND_FIXED_BUF If set, instructs io_uring to use
774                   a pre-mapped buffer. The buf_index field should contain  an
775                   index  into  an array of fixed buffers. See io_uring_regis‐
776                   ter(2) for details on how to setup a context for fixed buf‐
777                   fer I/O.
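
              For IORING_OP_SEND_ZC, the two-step completion model described
              above can be sketched as a small classifier over the CQE flags.
              The struct, macro, and function names are local stand-ins
              rather than kernel definitions, and the flag values are assumed
              to mirror <linux/io_uring.h> on 6.0 or later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the UAPI flag values (assumed, see lead-in). */
#define CQE_F_MORE  (1U << 1)  /* IORING_CQE_F_MORE */
#define CQE_F_NOTIF (1U << 3)  /* IORING_CQE_F_NOTIF */

/* Same layout as struct io_uring_cqe. */
struct cqe { uint64_t user_data; int32_t res; uint32_t flags; };

enum zc_event {
    ZC_RESULT_FINAL,          /* result CQE, no notification follows */
    ZC_RESULT_NOTIF_PENDING,  /* result CQE, buffer still in use */
    ZC_BUFFER_RELEASED        /* notification CQE, buffer reusable */
};

static enum zc_event classify_send_zc_cqe(const struct cqe *cqe)
{
    assert(cqe != NULL);
    if (cqe->flags & CQE_F_NOTIF)
        return ZC_BUFFER_RELEASED;
    /* Check the flag, not res: failed requests may also notify. */
    if (cqe->flags & CQE_F_MORE)
        return ZC_RESULT_NOTIF_PENDING;
    return ZC_RESULT_FINAL;
}
```

              An application would keep the send buffer untouched until it
              sees ZC_BUFFER_RELEASED for the same user_data, or
              ZC_RESULT_FINAL if no notification was announced.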
778
779       The flags field is a bit mask. The supported flags are:
780
781       IOSQE_FIXED_FILE
782              When this flag is specified, fd is an index into the files array
783              registered with the io_uring  instance  (see  the  IORING_REGIS‐
784              TER_FILES  section  of  the io_uring_register(2) man page). Note
785              that this isn't always available for all commands. If used on  a
786              command  that  doesn't  support  fixed files, the SQE will error
787              with -EBADF.  Available since 5.1.
788
789       IOSQE_IO_DRAIN
790              When this flag is specified, the SQE will not be started  before
791              previously  submitted SQEs have completed, and new SQEs will not
792              be started before this one completes. Available since 5.2.
793
794       IOSQE_IO_LINK
795              When this flag is specified, the SQE forms a link with the  next
796              SQE  in  the  submission ring. That next SQE will not be started
797              before the previous request completes. This, in effect, forms  a
798              chain  of  SQEs,  which can be arbitrarily long. The tail of the
799              chain is denoted by the first SQE that does not have  this  flag
800              set. Chains are not supported across submission boundaries. Even
801              if the last SQE in a submission has this flag set, it will still
802              terminate the current chain. This flag has no effect on previous
803              SQE submissions, nor does it impact SQEs that are outside of the
804              chain  tail. This means that multiple chains can be executing in
805              parallel, or chains and individual SQEs. Only members inside the
806              chain are serialized. A chain of SQEs will be broken if any  re‐
807              quest in that chain ends in error. io_uring considers any  unex‐
808              pected result an error. This means that, e.g., a short read will
809              also terminate the remainder of the chain.  If a  chain  of  SQE
810              links  is broken, the remaining unstarted part of the chain will
811              be terminated and completed with -ECANCELED as the  error  code.
812              Available since 5.3.
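
              A minimal sketch of building such a chain follows: every entry
              except the last carries IOSQE_IO_LINK, so the first entry
              without the flag is the tail.  The helper name link_chain is
              hypothetical, and the SQEs are assumed to be prepared already
              and submitted together in one call.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/io_uring.h>

/* Hypothetical helper: link n prepared SQEs into a single chain. */
static void link_chain(struct io_uring_sqe *sqes, size_t n)
{
    assert(sqes != NULL && n > 0);
    for (size_t i = 0; i + 1 < n; i++)
        sqes[i].flags |= IOSQE_IO_LINK;  /* next SQE waits for this one */
    sqes[n - 1].flags &= ~IOSQE_IO_LINK; /* tail terminates the chain */
}
```

              When reaping completions for a chain, a res of -ECANCELED marks
              a request that was never started because an earlier link
              failed.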
813
814       IOSQE_IO_HARDLINK
815              Like  IOSQE_IO_LINK, but it doesn't sever regardless of the com‐
816              pletion result. Note that the link will still sever if submitting
817              the parent request fails; hard links are only resilient  in  the
818              presence  of  completion  results  for requests that did submit
819              correctly.  IOSQE_IO_HARDLINK  implies IOSQE_IO_LINK.  Available
820              since 5.5.
821
822       IOSQE_ASYNC
823              Normal operation for io_uring is to try and issue an sqe as non-
824              blocking  first,  and if that fails, execute it in an async man‐
825              ner. To support more efficient overlapped operation of  requests
826              that  the  application knows/assumes will always (or most of the
827              time) block, the application can ask for an  sqe  to  be  issued
828              async from the start. Available since 5.6.
829
830       IOSQE_BUFFER_SELECT
831              Used  in conjunction with the IORING_OP_PROVIDE_BUFFERS command,
832              which registers a pool of buffers to be used  by  commands  that
833              read  or  receive data. When buffers are registered for this use
834              case, and this flag is set in the command, io_uring will grab  a
835              buffer  from  this  pool when the request is ready to receive or
836              read data. If successful,  the  resulting  CQE  will  have  IOR‐
837              ING_CQE_F_BUFFER  set  in  the flags part of the struct, and the
838              upper 16 bits (flags >> IORING_CQE_BUFFER_SHIFT) contain the  ID
839              of  the  selected  buffer,  so  the application knows exactly
840              which buffer was selected for the operation. If no  buffers  are
841              available  and this flag is set, then the request will fail with
842              -ENOBUFS as the error code. Once a buffer has been used,  it  is
843              no longer available in the kernel pool. The application must re-
844              register the given buffer again when it is ready to  recycle  it
845              (e.g., once it has finished using it). Available since 5.7.
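
              Extracting the selected buffer ID from a completion can be
              sketched as below.  The macro and function names are local
              stand-ins, with values assumed to mirror IORING_CQE_F_BUFFER
              and IORING_CQE_BUFFER_SHIFT in <linux/io_uring.h>.

```c
#include <stdint.h>

/* Local mirrors of the UAPI values (assumed, see lead-in). */
#define CQE_F_BUFFER     (1U << 0)  /* IORING_CQE_F_BUFFER */
#define CQE_BUFFER_SHIFT 16         /* IORING_CQE_BUFFER_SHIFT */

/* Returns the selected buffer ID, or -1 if no provided buffer was used. */
static int cqe_buffer_id(uint32_t cqe_flags)
{
    if (!(cqe_flags & CQE_F_BUFFER))
        return -1;
    return (int)(cqe_flags >> CQE_BUFFER_SHIFT);
}
```

              The returned ID tells the application which buffer to hand back
              with IORING_OP_PROVIDE_BUFFERS once it is done with the data.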
846
847       IOSQE_CQE_SKIP_SUCCESS
848              Don't  generate  a CQE if the request completes successfully. If
849              the request fails, an appropriate CQE will be  posted  as  usual
850              and  if  there  is no IOSQE_IO_HARDLINK, CQEs for all linked re‐
851              quests will be omitted. The notion of failure/success is  opcode
852              specific   and   is   the   same  as  with  breaking  chains  of
853              IOSQE_IO_LINK.  One special case  is  when  the  request  has  a
854              linked  timeout,  then the CQE generation for the linked timeout
855              is decided solely by whether it has IOSQE_CQE_SKIP_SUCCESS  set,
856              regardless of whether it timed out or was canceled. In other words,
857              if a linked timeout has the flag set,  it's  guaranteed  to  not
858              post a CQE.
859
860              The  semantics  are  chosen  to  accommodate  several use cases.
861              First, when all but the last request of a  normal  link  without
862              linked timeouts are marked with the flag, only one CQE per  link
863              is posted. Additionally, it enables suppression of CQEs in cases
864              where the side effects of a successfully executed operation  are
865              enough for userspace to know the state of the system.  One  such
866              example would be writing to a synchronisation file.
867
868              This  flag  is  incompatible with IOSQE_IO_DRAIN.  Using both of
869              them in a single ring is undefined behavior, even when they  are
870              not  used  together  in  a  single request. Currently, after the
871              first request with IOSQE_CQE_SKIP_SUCCESS,  all  subsequent  re‐
872              quests  marked  with  drain  will  be failed at submission time.
873              Note that the error reporting is best effort only, and  restric‐
874              tions may change in the future.
875
876              Available since 5.17.
877
878
879       ioprio specifies the I/O priority.  See ioprio_get(2) for a description
880       of Linux I/O priorities.
881
882       fd specifies the file descriptor against which the  operation  will  be
883       performed, with the exception noted above.
884
885       If   the   operation   is   one   of   IORING_OP_READ_FIXED   or   IOR‐
886       ING_OP_WRITE_FIXED, addr and len must fall within the buffer located at
887       buf_index  in  the fixed buffer array.  If the operation is either IOR‐
888       ING_OP_READV or IORING_OP_WRITEV, then addr points to an iovec array of
889       len entries.
890
891       rw_flags,  specified  for read and write operations, contains a bitwise
892       OR of per-I/O flags, as described in the preadv2(2) man page.
893
894       The fsync_flags bit mask may contain either 0, for a normal file integ‐
895       rity  sync,  or  IORING_FSYNC_DATASYNC to provide data sync only seman‐
896       tics.  See the descriptions of O_SYNC and O_DSYNC in the open(2) manual
897       page for more information.
898
899       The  bits  that  may be set in poll_events are defined in <poll.h>, and
900       documented in poll(2).
901
902       user_data is an application-supplied value that will be copied into the
903       completion  queue entry (see below).  buf_index is an index into an ar‐
904       ray of fixed buffers, and is only valid if fixed  buffers  were  regis‐
905       tered.   personality  is  the credentials id to use for this operation.
906       See io_uring_register(2) for how to register personalities with  io_ur‐
907       ing.  If  set  to  0, the current personality of the submitting task is
908       used.
909
910       Once the submission queue entry is initialized,  I/O  is  submitted  by
911       placing  the  index  of the submission queue entry into the tail of the
912       submission queue.  After one or more indexes are added  to  the  queue,
913       and  the  queue tail is advanced, the io_uring_enter(2) system call can
914       be invoked to initiate the I/O.
915
916       Completions use the following data structure:
917
918           /*
919            * IO completion data structure (Completion Queue Entry)
920            */
921           struct io_uring_cqe {
922               __u64    user_data; /* sqe->user_data value passed back */
923               __s32    res;       /* result code for this event */
924               __u32    flags;     /* opcode-specific flags, see above */
925           };
926
927       user_data is copied from the field of the same name in  the  submission
928       queue  entry.   The primary use case is to store data that the applica‐
929       tion will need to access upon completion of this particular  I/O.   The
930       flags field is used for certain commands, like IORING_OP_POLL_ADD, or in
931       conjunction with IOSQE_BUFFER_SELECT or IORING_OP_MSG_RING;  see  those
932       entries  for details.  res is the operation-specific result, but io_ur‐
933       ing-specific errors (e.g. flags or opcode invalid) are returned through
934       this field.  They are described in section CQE ERRORS.
935
936       For  read and write opcodes, the return values match errno values docu‐
937       mented in the preadv2(2) and pwritev2(2) man pages,  with  res  holding
938       the  equivalent of -errno for error cases, or the transferred number of
939       bytes in case the operation is successful. Hence both error and success
940       return  can be found in that field in the CQE. For other request types,
941       the return values are documented in the  matching  man  page  for  that
942       type, or in the opcodes section above for io_uring-specific opcodes.
943

RETURN VALUE

945       io_uring_enter(2)  returns  the  number  of I/Os successfully consumed.
946       This can be zero if to_submit was zero or if the submission  queue  was
947       empty. Note that if the ring was created with IORING_SETUP_SQPOLL spec‐
948       ified, then the return value will generally be the same as to_submit as
949       submission happens outside the context of the system call.
950
951       The errors related to a submission queue entry will be returned through
952       a completion queue entry (see section CQE ERRORS), rather than  through
953       the system call itself.
954
955       Errors  that  occur  not  on behalf of a submission queue entry are re‐
956       turned via the system call directly. On such an error, a negative error
957       code is returned. The caller should not rely on the errno variable.
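
       A minimal sketch of this return convention follows, assuming the
       application invokes the raw system call through syscall(2) rather than
       through liburing.  The wrapper folds errno into a negative return
       value, and a second helper retries when a wait is interrupted by a
       signal.  Both function names are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns the number of I/Os consumed, or a negative error code. */
static int enter_raw(int ring_fd, unsigned int to_submit,
                     unsigned int min_complete, unsigned int flags)
{
    /* No signal mask is passed; the size argument is then unused. */
    long ret = syscall(__NR_io_uring_enter, ring_fd, to_submit,
                       min_complete, flags, NULL, (size_t)(_NSIG / 8));
    return ret < 0 ? -errno : (int)ret;
}

/* Retry for as long as the call is interrupted by a signal (EINTR). */
static int enter_retry_eintr(int ring_fd, unsigned int to_submit,
                             unsigned int min_complete, unsigned int flags)
{
    int ret;
    do {
        ret = enter_raw(ring_fd, to_submit, min_complete, flags);
    } while (ret == -EINTR);
    return ret;
}
```

       Other negative results, such as -EAGAIN or -EBUSY, call for reaping
       completions before trying again rather than an immediate retry.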
958

ERRORS

960       These are the errors returned by the io_uring_enter(2) system call.
961
962       EAGAIN The  kernel  was  unable  to allocate memory for the request, or
963              otherwise ran out of resources to  handle  it.  The  application
964              should wait for some completions and try again.
965
966       EBADF  fd is not a valid file descriptor.
967
968       EBADFD fd  is  a valid file descriptor, but the io_uring ring is not in
969              the right state (enabled). See io_uring_register(2) for  details
970              on how to enable the ring.
971
972       EBADR  At  least  one  CQE was dropped even with the IORING_FEAT_NODROP
973              feature, and there are no otherwise available CQEs. This  clears
974              the  error  state  and so with no other changes the next call to
975              io_uring_enter(2) will not have this error. This error should be
976              extremely  rare  and indicates the machine is running critically
977              low on memory.  It may be reasonable  for  the  application  to
978              terminate running unless it is able to safely handle any CQE be‐
979              ing lost.
980
981       EBUSY  If the IORING_FEAT_NODROP feature flag is set, then  EBUSY  will
982              be   returned   if   there  were  overflow  entries,  IORING_EN‐
983              TER_GETEVENTS flag is set and not all of  the  overflow  entries
984              were able to be flushed to the CQ ring.
985
986              Without  IORING_FEAT_NODROP  the  application  is  attempting to
987              overcommit the number of requests it can have pending.  The  ap‐
988              plication  should  wait  for some completions and try again. May
989              occur if the application tries to queue more  requests  than  we
990              have  room for in the CQ ring, or if the application attempts to
991              wait for more events without  having  reaped  the  ones  already
992              present in the CQ ring.
993
994       EEXIST The thread submitting the work is invalid.
995
996       EINVAL Some bits in the flags argument are invalid.
997
998       EFAULT An  invalid  user  space address was specified for the sig argu‐
999              ment.
1000
1001       ENXIO  The io_uring instance is in the process of being torn down.
1002
1003       EOPNOTSUPP
1004              fd does not refer to an io_uring instance.
1005
1006       EINTR  The operation was interrupted by a delivery of a  signal  before
1007              it  could complete; see signal(7).  Can happen while waiting for
1008              events with IORING_ENTER_GETEVENTS.
1009
1010

CQE ERRORS

1012       These io_uring-specific errors are returned as a negative value in  the
1013       res field of the completion queue entry.
1014
1015       EACCES The flags field or opcode in a submission queue entry is not al‐
1016              lowed due to registered restrictions.  See  io_uring_register(2)
1017              for details on how restrictions work.
1018
1019       EBADF  The  fd  field  in the submission queue entry is invalid, or the
1020              IOSQE_FIXED_FILE flag was set in the submission queue entry, but
1021              no files were registered with the io_uring instance.
1022
1023       EFAULT The buffer is outside of the process's accessible address space.
1024
1025       EFAULT IORING_OP_READ_FIXED  or  IORING_OP_WRITE_FIXED was specified in
1026              the opcode field of the submission queue entry, but either  buf‐
1027              fers  were not registered for this io_uring instance, or the ad‐
1028              dress range described by addr and len does not  fit  within  the
1029              buffer registered at buf_index.
1030
1031       EINVAL The  flags  field  or  opcode in a submission queue entry is in‐
1032              valid.
1033
1034       EINVAL The buf_index member of the submission queue entry is invalid.
1035
1036       EINVAL The personality field in a submission queue entry is invalid.
1037
1038       EINVAL IORING_OP_NOP was specified in the submission queue  entry,  but
1039              the  io_uring context was setup for polling (IORING_SETUP_IOPOLL
1040              was specified in the call to io_uring_setup).
1041
1042       EINVAL IORING_OP_READV or IORING_OP_WRITEV was specified in the submis‐
1043              sion  queue  entry,  but the io_uring instance has fixed buffers
1044              registered.
1045
1046       EINVAL IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was  specified  in
1047              the submission queue entry, and the buf_index is invalid.
1048
1049       EINVAL IORING_OP_READV,  IORING_OP_WRITEV,  IORING_OP_READ_FIXED,  IOR‐
1050              ING_OP_WRITE_FIXED or IORING_OP_FSYNC was specified in the  sub‐
1051              mission  queue  entry,  but the io_uring instance was configured
1052              for IOPOLLing, or any of addr, ioprio, off,  len,  or  buf_index
1053              was set in the submission queue entry.
1054
1055       EINVAL IORING_OP_POLL_ADD or IORING_OP_POLL_REMOVE was specified in the
1056              opcode field of the submission queue entry, but the io_uring in‐
1057              stance    was    configured    for   busy-wait   polling   (IOR‐
1058              ING_SETUP_IOPOLL), or any of ioprio, off, len, or buf_index  was
1059              non-zero in the submission queue entry.
1060
1061       EINVAL IORING_OP_POLL_ADD was specified in the opcode field of the sub‐
1062              mission queue entry, and the addr field was non-zero.
1063
1064       EOPNOTSUPP
1065              opcode is valid, but not supported by this kernel.
1066
1067       EOPNOTSUPP
1068              IOSQE_BUFFER_SELECT was set in the flags field of the submission
1069              queue entry, but the opcode doesn't support buffer selection.
1070
1071
1072
1073Linux                             2019-01-22                 io_uring_enter(2)