IO_URING_ENTER(2)          Linux Programmer's Manual          IO_URING_ENTER(2)

NAME
io_uring_enter - initiate and/or complete asynchronous I/O

SYNOPSIS
#include <linux/io_uring.h>

int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig);

DESCRIPTION
io_uring_enter() is used to initiate and complete I/O using the shared
submission and completion queues set up by a call to io_uring_setup(2).
A single call can both submit new I/O and wait for completions of I/O
initiated by this call or previous calls to io_uring_enter().
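
The C library may not provide a wrapper for this system call, so it is
often invoked through syscall(2). The following is a minimal sketch under
stated assumptions: the helper name uring_enter is hypothetical, the
fallback system call number 426 is assumed to match the target
architecture, and the raw call takes a sixth argument giving the size of
the kernel signal set in bytes (_NSIG / 8).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_io_uring_enter
#define __NR_io_uring_enter 426 /* assumption: number used on most architectures */
#endif

/* Hypothetical helper; not part of any library. Returns -1 and sets
 * errno on failure, like any other syscall(2) based wrapper. */
static int uring_enter(unsigned int fd, unsigned int to_submit,
                       unsigned int min_complete, unsigned int flags,
                       sigset_t *sig)
{
    /* The kernel expects the size of its own signal set, which is
     * smaller than the C library's sizeof(sigset_t). */
    return (int) syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
                         flags, sig, (size_t) (_NSIG / 8));
}
```

Calling the helper with a descriptor that is not open fails with -1, and
errno then indicates the reason.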

fd is the file descriptor returned by io_uring_setup(2). to_submit
specifies the number of I/Os to submit from the submission queue. If
the IORING_ENTER_GETEVENTS bit is set in flags, then the system call
will attempt to wait for min_complete event completions before
returning. If the io_uring instance was configured for polling, by
specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(2), then
min_complete has a slightly different meaning. Passing a value of 0
instructs the kernel to return any events which are already complete,
without blocking. If min_complete is a non-zero value, the kernel will
still return immediately if any completion events are available. If no
event completions are available, then the call will poll either until
one or more completions become available, or until the process has
exceeded its scheduler time slice.

Note that, for interrupt driven I/O (where IORING_SETUP_IOPOLL was not
specified in the call to io_uring_setup(2)), an application may check
the completion queue for event completions without entering the kernel
at all.
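
Checking the completion queue from user space amounts to comparing the
ring's head and tail indexes. The sketch below assumes the application
has already mapped the completion ring as described in io_uring_setup(2)
and that default-sized completion entries are in use; struct cq_view and
cq_reap_one are hypothetical names introduced only for illustration.

```c
#include <linux/io_uring.h>

/* Hypothetical view of an already-mapped completion ring. */
struct cq_view {
    unsigned *head;             /* consumer index, advanced by the application */
    unsigned *tail;             /* producer index, advanced by the kernel */
    unsigned ring_mask;         /* number of entries minus one */
    struct io_uring_cqe *cqes;  /* the completion entry array */
};

/* Copy out one completion if there is one.
 * Returns 1 if *out was filled in, 0 if the queue is empty. */
static int cq_reap_one(struct cq_view *cq, struct io_uring_cqe *out)
{
    unsigned head = *cq->head;
    /* Acquire pairs with the producer's release store of the tail. */
    unsigned tail = __atomic_load_n(cq->tail, __ATOMIC_ACQUIRE);

    if (head == tail)
        return 0;

    *out = cq->cqes[head & cq->ring_mask];
    /* Release so the slot is handed back only after the copy is done. */
    __atomic_store_n(cq->head, head + 1, __ATOMIC_RELEASE);
    return 1;
}
```

No system call is made on this path; io_uring_enter() is only needed when
the application wants to block until completions arrive.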

When the system call returns that a certain number of SQEs have been
consumed and submitted, it's safe to reuse SQE entries in the ring.
This is true even if the actual I/O submission had to be punted to
async context, which means that the SQE may in fact not have been
submitted yet. If the kernel requires later use of a particular SQE
entry, it will have made a private copy of it.

sig is a pointer to a signal mask (see sigprocmask(2)); if sig is not
NULL, io_uring_enter() first replaces the current signal mask by the
one pointed to by sig, then waits for events to become available in the
completion queue, and then restores the original signal mask. The
following io_uring_enter() call:

    ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, &sig);

is equivalent to atomically executing the following calls:

    pthread_sigmask(SIG_SETMASK, &sig, &orig);
    ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, NULL);
    pthread_sigmask(SIG_SETMASK, &orig, NULL);

See the description of pselect(2) for an explanation of why the sig
parameter is necessary.

Submission queue entries are represented using the following data
structure:

    /*
     * IO submission data structure (Submission Queue Entry)
     */
    struct io_uring_sqe {
        __u8    opcode;         /* type of operation for this sqe */
        __u8    flags;          /* IOSQE_ flags */
        __u16   ioprio;         /* ioprio for the request */
        __s32   fd;             /* file descriptor to do IO on */
        union {
            __u64   off;        /* offset into file */
            __u64   addr2;
        };
        union {
            __u64   addr;       /* pointer to buffer or iovecs */
            __u64   splice_off_in;
        };
        __u32   len;            /* buffer size or number of iovecs */
        union {
            __kernel_rwf_t  rw_flags;
            __u32   fsync_flags;
            __u16   poll_events;    /* compatibility */
            __u32   poll32_events;  /* word-reversed for BE */
            __u32   sync_range_flags;
            __u32   msg_flags;
            __u32   timeout_flags;
            __u32   accept_flags;
            __u32   cancel_flags;
            __u32   open_flags;
            __u32   statx_flags;
            __u32   fadvise_advice;
            __u32   splice_flags;
        };
        __u64   user_data;      /* data to be passed back at completion time */
        union {
            struct {
                union {
                    /* index into fixed buffers, if used */
                    __u16   buf_index;
                    /* for grouped buffer selection */
                    __u16   buf_group;
                };
                /* personality to use, if used */
                __u16   personality;
                __s32   splice_fd_in;
            };
            __u64   __pad2[3];
        };
    };

The opcode describes the operation to be performed. It can be one of:

IORING_OP_NOP
       Do not perform any I/O. This is useful for testing the
       performance of the io_uring implementation itself.

IORING_OP_READV
IORING_OP_WRITEV
       Vectored read and write operations, similar to preadv2(2) and
       pwritev2(2).

IORING_OP_READ_FIXED
IORING_OP_WRITE_FIXED
       Read from or write to pre-mapped buffers. See
       io_uring_register(2) for details on how to set up a context for
       fixed reads and writes.

IORING_OP_FSYNC
       File sync. See also fsync(2). Note that, while I/O is initiated
       in the order in which it appears in the submission queue,
       completions are unordered. For example, an application which
       places a write I/O followed by an fsync in the submission queue
       cannot expect the fsync to apply to the write. The two
       operations execute in parallel, so the fsync may complete before
       the write is issued to the storage. The same is also true for
       previously issued writes that have not completed prior to the
       fsync.

IORING_OP_POLL_ADD
       Poll the fd specified in the submission queue entry for the
       events specified in the poll_events field. Unlike poll or epoll
       without EPOLLONESHOT, this interface always works in one shot
       mode. That is, once the poll operation is completed, it will
       have to be resubmitted.

IORING_OP_POLL_REMOVE
       Remove an existing poll request. If found, the res field of the
       struct io_uring_cqe will contain 0. If not found, res will
       contain -ENOENT.

IORING_OP_EPOLL_CTL
       Add, remove or modify entries in the interest list of epoll(7).
       See epoll_ctl(2) for details of the system call. fd holds the
       file descriptor that represents the epoll instance, addr holds
       the file descriptor to add, remove or modify, len holds the
       operation (EPOLL_CTL_ADD, EPOLL_CTL_DEL, EPOLL_CTL_MOD) to
       perform, and off holds a pointer to the epoll_event structure.
       Available since 5.6.

IORING_OP_SYNC_FILE_RANGE
       Issue the equivalent of a sync_file_range(2) on the file
       descriptor. The fd field is the file descriptor to sync, the off
       field holds the offset in bytes, the len field holds the length
       in bytes, and the sync_range_flags field holds the flags for the
       command. See also sync_file_range(2) for the general description
       of the related system call. Available since 5.2.

IORING_OP_SENDMSG
       Issue the equivalent of a sendmsg(2) system call. fd must be
       set to the socket file descriptor, addr must contain a pointer
       to the msghdr structure, and msg_flags holds the flags
       associated with the system call. See also sendmsg(2) for the
       general description of the related system call. Available since
       5.3.

IORING_OP_RECVMSG
       Works just like IORING_OP_SENDMSG, except for recvmsg(2)
       instead. See the description of IORING_OP_SENDMSG. Available
       since 5.3.

IORING_OP_SEND
       Issue the equivalent of a send(2) system call. fd must be set
       to the socket file descriptor, addr must contain a pointer to
       the buffer, len denotes the length of the buffer to send, and
       msg_flags holds the flags associated with the system call. See
       also send(2) for the general description of the related system
       call. Available since 5.6.

IORING_OP_RECV
       Works just like IORING_OP_SEND, except for recv(2) instead. See
       the description of IORING_OP_SEND. Available since 5.6.

IORING_OP_TIMEOUT
       This command will register a timeout operation. The addr field
       must contain a pointer to a struct timespec64 structure, len
       must contain 1 to signify one timespec64 structure, and
       timeout_flags may contain IORING_TIMEOUT_ABS for an absolute
       timeout value, or 0 for a relative timeout. off may contain a
       completion event count. A timeout will trigger a wakeup event on
       the completion ring for anyone waiting for events. A timeout
       condition is met when either the specified timeout expires, or
       the specified number of events have completed. Either condition
       will trigger the event. If off is set to 0, completed events are
       not counted, which effectively acts like a timer. io_uring
       timeouts use the CLOCK_MONOTONIC clock source. The request will
       complete with -ETIME if the timeout got completed through
       expiration of the timer, or 0 if the timeout got completed
       through requests completing on their own. If the timeout was
       cancelled before it expired, the request will complete with
       -ECANCELED. Available since 5.4.

IORING_OP_TIMEOUT_REMOVE
       Attempt to remove an existing timeout operation. addr must
       contain the user_data field of the previously issued timeout
       operation. If the specified timeout request is found and
       cancelled successfully, this request will terminate with a
       result value of 0. If the timeout request was found but
       expiration was already in progress, this request will terminate
       with a result value of -EBUSY. If the timeout request wasn't
       found, the request will terminate with a result value of
       -ENOENT. Available since 5.5.

IORING_OP_ACCEPT
       Issue the equivalent of an accept4(2) system call. fd must be
       set to the socket file descriptor, addr must contain the pointer
       to the sockaddr structure, and addr2 must contain a pointer to
       the socklen_t addrlen field. See also accept4(2) for the general
       description of the related system call. Available since 5.5.

IORING_OP_ASYNC_CANCEL
       Attempt to cancel an already issued request. addr must contain
       the user_data field of the request that should be cancelled. The
       cancellation request will complete with one of the following
       result codes. If found, the res field of the cqe will contain 0.
       If not found, res will contain -ENOENT. If found and
       cancellation was attempted, the res field will contain
       -EALREADY. In this case, the request may or may not terminate.
       In general, requests that are interruptible (like socket I/O)
       will get cancelled, while disk I/O requests cannot be cancelled
       if already started. Available since 5.5.

IORING_OP_LINK_TIMEOUT
       This request must be linked with another request through
       IOSQE_IO_LINK, which is described below. Unlike
       IORING_OP_TIMEOUT, IORING_OP_LINK_TIMEOUT acts on the linked
       request, not the completion queue. The format of the command is
       otherwise like IORING_OP_TIMEOUT, except there's no completion
       event count as it's tied to a specific request. If used, the
       timeout specified in the command will cancel the linked command,
       unless the linked command completes before the timeout. The
       timeout will complete with -ETIME if the timer expired and
       cancellation of the linked request was attempted, or -ECANCELED
       if the timer got cancelled because of completion of the linked
       request. Like IORING_OP_TIMEOUT, the clock source used is
       CLOCK_MONOTONIC. Available since 5.5.

IORING_OP_CONNECT
       Issue the equivalent of a connect(2) system call. fd must be
       set to the socket file descriptor, addr must contain the const
       pointer to the sockaddr structure, and off must contain the
       socklen_t addrlen field. See also connect(2) for the general
       description of the related system call. Available since 5.5.

IORING_OP_FALLOCATE
       Issue the equivalent of a fallocate(2) system call. fd must be
       set to the file descriptor, len must contain the mode associated
       with the operation, off must contain the offset on which to
       operate, and addr must contain the length. See also fallocate(2)
       for the general description of the related system call.
       Available since 5.6.

IORING_OP_FADVISE
       Issue the equivalent of a posix_fadvise(2) system call. fd must
       be set to the file descriptor, off must contain the offset on
       which to operate, len must contain the length, and
       fadvise_advice must contain the advice associated with the
       operation. See also posix_fadvise(2) for the general description
       of the related system call. Available since 5.6.

IORING_OP_MADVISE
       Issue the equivalent of a madvise(2) system call. addr must
       contain the address to operate on, len must contain the length
       on which to operate, and fadvise_advice must contain the advice
       associated with the operation. See also madvise(2) for the
       general description of the related system call. Available since
       5.6.

IORING_OP_OPENAT
       Issue the equivalent of an openat(2) system call. fd is the
       dirfd argument, addr must contain a pointer to the *pathname
       argument, open_flags should contain any flags passed in, and
       len holds the mode argument, the access mode of the file. See
       also openat(2) for the general description of the related system
       call. Available since 5.6.

IORING_OP_OPENAT2
       Issue the equivalent of an openat2(2) system call. fd is the
       dirfd argument, addr must contain a pointer to the *pathname
       argument, len should contain the size of the open_how structure,
       and off should be set to the address of the open_how structure.
       See also openat2(2) for the general description of the related
       system call. Available since 5.6.

IORING_OP_CLOSE
       Issue the equivalent of a close(2) system call. fd is the file
       descriptor to be closed. See also close(2) for the general
       description of the related system call. Available since 5.6.

IORING_OP_STATX
       Issue the equivalent of a statx(2) system call. fd is the dirfd
       argument, addr must contain a pointer to the *pathname string,
       statx_flags is the flags argument, len should be the mask
       argument, and off must contain a pointer to the statxbuf to be
       filled in. See also statx(2) for the general description of the
       related system call. Available since 5.6.

IORING_OP_READ
IORING_OP_WRITE
       Issue the equivalent of a read(2) or write(2) system call. fd
       is the file descriptor to be operated on, addr contains the
       buffer in question, and len contains the length of the I/O
       operation. These are non-vectored versions of the
       IORING_OP_READV and IORING_OP_WRITEV opcodes. See also read(2)
       and write(2) for the general description of the related system
       call. Available since 5.6.

IORING_OP_SPLICE
       Issue the equivalent of a splice(2) system call. splice_fd_in
       is the file descriptor to read from, splice_off_in is the offset
       to read from, fd is the file descriptor to write to, and off is
       the offset at which to start writing. A sentinel value of -1 is
       used to pass the equivalent of a NULL for the offsets to
       splice(2). len contains the number of bytes to copy.
       splice_flags contains a bit mask for the flag field associated
       with the system call. Please note that one of the file
       descriptors must refer to a pipe. See also splice(2) for the
       general description of the related system call. Available since
       5.7.

IORING_OP_TEE
       Issue the equivalent of a tee(2) system call. splice_fd_in is
       the file descriptor to read from, fd is the file descriptor to
       write to, len contains the number of bytes to copy, and
       splice_flags contains a bit mask for the flag field associated
       with the system call. Please note that both of the file
       descriptors must refer to a pipe. See also tee(2) for the
       general description of the related system call. Available since
       5.8.

IORING_OP_FILES_UPDATE
       This command is an alternative to using
       IORING_REGISTER_FILES_UPDATE which then works in an async
       fashion, like the rest of the io_uring commands. The arguments
       passed in are the same. addr must contain a pointer to the array
       of file descriptors, len must contain the length of the array,
       and off must contain the offset at which to operate. Note that
       the array of file descriptors pointed to in addr must remain
       valid until this operation has completed. Available since 5.6.
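
As an illustration of how the opcode descriptions above map onto SQE
fields, the following sketch fills in a vectored read and a pure timer.
The helper names prep_readv and prep_timeout are hypothetical, and it is
an assumption of this sketch that user space spells the timespec64
structure as struct __kernel_timespec from <linux/time_types.h>.

```c
#include <linux/io_uring.h>
#include <linux/time_types.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical helper: describe a vectored read in *sqe. */
static void prep_readv(struct io_uring_sqe *sqe, int fd,
                       const struct iovec *iov, unsigned nr_iov,
                       uint64_t offset, uint64_t tag)
{
    memset(sqe, 0, sizeof(*sqe));       /* leave unused fields zeroed */
    sqe->opcode = IORING_OP_READV;
    sqe->fd = fd;
    sqe->off = offset;                  /* offset into the file */
    sqe->addr = (uintptr_t) iov;        /* pointer to the iovec array */
    sqe->len = nr_iov;                  /* number of iovecs, not bytes */
    sqe->user_data = tag;               /* echoed back in the CQE */
}

/* Hypothetical helper: describe a relative timeout in *sqe.
 * A count of 0 makes it behave like a plain timer. */
static void prep_timeout(struct io_uring_sqe *sqe,
                         const struct __kernel_timespec *ts,
                         unsigned count, uint64_t tag)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_TIMEOUT;
    sqe->fd = -1;                       /* no file is involved */
    sqe->addr = (uintptr_t) ts;         /* pointer to the timespec */
    sqe->len = 1;                       /* exactly one timespec */
    sqe->off = count;                   /* completion event count */
    sqe->timeout_flags = 0;             /* 0 selects a relative timeout */
    sqe->user_data = tag;
}
```

The iovec array and the timespec are referenced by address, so a careful
caller keeps them valid until the request has been consumed.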

The flags field is a bit mask. The supported flags are:

IOSQE_FIXED_FILE
       When this flag is specified, fd is an index into the files array
       registered with the io_uring instance (see the
       IORING_REGISTER_FILES section of the io_uring_register(2) man
       page). Available since 5.1.

IOSQE_IO_DRAIN
       When this flag is specified, the SQE will not be started before
       previously submitted SQEs have completed, and new SQEs will not
       be started before this one completes. Available since 5.2.

IOSQE_IO_LINK
       When this flag is specified, it forms a link with the next SQE
       in the submission ring. That next SQE will not be started before
       this one completes. This, in effect, forms a chain of SQEs,
       which can be arbitrarily long. The tail of the chain is denoted
       by the first SQE that does not have this flag set. This flag
       has no effect on previous SQE submissions, nor does it impact
       SQEs that are outside of the chain tail. This means that
       multiple chains can be executing in parallel, or chains and
       individual SQEs. Only members inside the chain are serialized. A
       chain of SQEs will be broken if any request in that chain ends
       in error. io_uring considers any unexpected result an error.
       This means that, e.g., a short read will also terminate the
       remainder of the chain. If a chain of SQE links is broken, the
       remaining unstarted part of the chain will be terminated and
       completed with -ECANCELED as the error code. Available since
       5.3.

IOSQE_IO_HARDLINK
       Like IOSQE_IO_LINK, but the link is not severed regardless of
       the completion result. Note that the link will still be severed
       if submitting the parent request fails; hard links are only
       resilient in the presence of completion results for requests
       that did submit correctly. IOSQE_IO_HARDLINK implies
       IOSQE_IO_LINK. Available since 5.5.

IOSQE_ASYNC
       Normal operation for io_uring is to try and issue an sqe as
       non-blocking first, and if that fails, execute it in an async
       manner. To support more efficient overlapped operation of
       requests that the application knows/assumes will always (or most
       of the time) block, the application can ask for an sqe to be
       issued async from the start. Available since 5.6.
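
One common use of IOSQE_IO_LINK is to order a write before the fsync that
should cover it. The sketch below fills two consecutive SQEs to that
effect; prep_write_then_fsync and the user_data values 1 and 2 are
hypothetical choices made for illustration, and the two entries are
assumed to be placed back to back in the submission ring.

```c
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical helper: sqes must point at two consecutive entries. */
static void prep_write_then_fsync(struct io_uring_sqe *sqes, int fd,
                                  const struct iovec *iov, unsigned nr_iov,
                                  uint64_t offset)
{
    struct io_uring_sqe *wr = &sqes[0];
    struct io_uring_sqe *fs = &sqes[1];

    memset(sqes, 0, 2 * sizeof(*sqes));

    wr->opcode = IORING_OP_WRITEV;
    wr->flags = IOSQE_IO_LINK;      /* the next SQE waits for this one */
    wr->fd = fd;
    wr->off = offset;
    wr->addr = (uintptr_t) iov;
    wr->len = nr_iov;
    wr->user_data = 1;

    fs->opcode = IORING_OP_FSYNC;
    fs->flags = 0;                  /* flag not set: tail of the chain */
    fs->fd = fd;
    fs->fsync_flags = IORING_FSYNC_DATASYNC;
    fs->user_data = 2;
}
```

Because any unexpected result breaks the chain, a short write would cause
the fsync to complete with -ECANCELED; an application using this pattern
would need to check both completions.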

ioprio specifies the I/O priority. See ioprio_get(2) for a description
of Linux I/O priorities.

fd specifies the file descriptor against which the operation will be
performed, with the exception noted above.

If the operation is one of IORING_OP_READ_FIXED or
IORING_OP_WRITE_FIXED, addr and len must fall within the buffer located
at buf_index in the fixed buffer array. If the operation is either
IORING_OP_READV or IORING_OP_WRITEV, then addr points to an iovec array
of len entries.
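
The containment rule for fixed buffers can be checked on the application
side before submitting. This is only a sketch of that arithmetic, with
the hypothetical helper fixed_range_ok taking the iovec that the
application registered at buf_index; the kernel performs its own
validation regardless.

```c
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical helper: returns 1 if [addr, addr + len) lies entirely
 * inside the registered buffer described by *reg, and 0 otherwise. */
static int fixed_range_ok(const struct iovec *reg, uint64_t addr, uint32_t len)
{
    uint64_t base = (uint64_t) (uintptr_t) reg->iov_base;
    uint64_t size = reg->iov_len;

    if (addr < base)
        return 0;
    /* Compare against the remaining space rather than computing
     * addr + len, which could wrap around. */
    return addr - base <= size && len <= size - (addr - base);
}
```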

rw_flags, specified for read and write operations, contains a bitwise
OR of per-I/O flags, as described in the preadv2(2) man page.

The fsync_flags bit mask may contain either 0, for a normal file
integrity sync, or IORING_FSYNC_DATASYNC to provide data sync only
semantics. See the descriptions of O_SYNC and O_DSYNC in the open(2)
manual page for more information.

The bits that may be set in poll_events are defined in <poll.h>, and
documented in poll(2).

user_data is an application-supplied value that will be copied into the
completion queue entry (see below). buf_index is an index into an
array of fixed buffers, and is only valid if fixed buffers were
registered. personality is the credentials id to use for this
operation. See io_uring_register(2) for how to register personalities
with io_uring. If set to 0, the current personality of the submitting
task is used.

Once the submission queue entry is initialized, I/O is submitted by
placing the index of the submission queue entry into the tail of the
submission queue. After one or more indexes are added to the queue,
and the queue tail is advanced, the io_uring_enter(2) system call can
be invoked to initiate the I/O.
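
The publish step described above can be sketched as follows. The code
assumes the submission ring has already been mapped as described in
io_uring_setup(2); struct sq_view and sq_push are hypothetical names used
only for this illustration.

```c
/* Hypothetical view of an already-mapped submission ring. */
struct sq_view {
    unsigned *head;      /* consumer index, advanced by the kernel */
    unsigned *tail;      /* producer index, advanced by the application */
    unsigned ring_mask;  /* number of entries minus one */
    unsigned *array;     /* ring of indexes into the SQE array */
};

/* Place sqe_index at the tail of the submission queue.
 * Returns 1 on success, 0 if the ring has no free slot. */
static int sq_push(struct sq_view *sq, unsigned sqe_index)
{
    unsigned tail = *sq->tail;
    unsigned head = __atomic_load_n(sq->head, __ATOMIC_ACQUIRE);

    if (tail - head > sq->ring_mask)
        return 0;

    sq->array[tail & sq->ring_mask] = sqe_index;
    /* Release: the index, and the SQE it names, must be visible
     * before the new tail is. */
    __atomic_store_n(sq->tail, tail + 1, __ATOMIC_RELEASE);
    return 1;
}
```

After one or more successful pushes, the application would call
io_uring_enter() with to_submit set to the number of entries added.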

Completions use the following data structure:

    /*
     * IO completion data structure (Completion Queue Entry)
     */
    struct io_uring_cqe {
        __u64   user_data;  /* sqe->user_data submission passed back */
        __s32   res;        /* result code for this event */
        __u32   flags;
    };

user_data is copied from the field of the same name in the submission
queue entry. The primary use case is to store data that the
application will need to access upon completion of this particular I/O.
The flags field is reserved for future use. res is the
operation-specific result.

For read and write opcodes, the return values match those documented in
the preadv2(2) and pwritev2(2) man pages. Return codes for the
io_uring-specific opcodes are documented in the description of the
opcodes above.
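
Since errors arrive in res as negated errno values (for example -ENOENT
or -ECANCELED above), an application may find it convenient to translate
a completion back into the familiar system call convention. A minimal
sketch, using the hypothetical helper name cqe_to_syscall_result:

```c
#include <errno.h>
#include <linux/io_uring.h>

/* Hypothetical helper: return the non-negative result of a completion,
 * or -1 with errno set when res carries a negated error code. */
static long cqe_to_syscall_result(const struct io_uring_cqe *cqe)
{
    if (cqe->res < 0) {
        errno = -cqe->res;
        return -1;
    }
    return cqe->res;
}
```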

RETURN VALUE
io_uring_enter() returns the number of I/Os successfully consumed.
This can be zero if to_submit was zero or if the submission queue was
empty. The errors below that refer to an error in a submission queue
entry will be returned through a completion queue entry, rather than
through the system call itself.

Errors that occur not on behalf of a submission queue entry are
returned via the system call directly. On such an error, -1 is returned
and errno is set appropriately.

ERRORS
EAGAIN The kernel was unable to allocate memory for the request, or
       otherwise ran out of resources to handle it. The application
       should wait for some completions and try again.

EBUSY  The application is attempting to overcommit the number of
       requests it can have pending. The application should wait for
       some completions and try again. May occur if the application
       tries to queue more requests than there is room for in the CQ
       ring.

EBADF  The fd field in the submission queue entry is invalid, or the
       IOSQE_FIXED_FILE flag was set in the submission queue entry, but
       no files were registered with the io_uring instance.

EFAULT The buffer is outside of the process's accessible address space.

EFAULT IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was specified in
       the opcode field of the submission queue entry, but either
       buffers were not registered for this io_uring instance, or the
       address range described by addr and len does not fit within the
       buffer registered at buf_index.

EINVAL The index member of the submission queue entry is invalid.

EINVAL The flags field or opcode in a submission queue entry is
       invalid.

EINVAL IORING_OP_NOP was specified in the submission queue entry, but
       the io_uring context was set up for polling (IORING_SETUP_IOPOLL
       was specified in the call to io_uring_setup(2)).

EINVAL IORING_OP_READV or IORING_OP_WRITEV was specified in the
       submission queue entry, but the io_uring instance has fixed
       buffers registered.

EINVAL IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was specified in
       the submission queue entry, and the buf_index is invalid.

EINVAL IORING_OP_READV, IORING_OP_WRITEV, IORING_OP_READ_FIXED,
       IORING_OP_WRITE_FIXED or IORING_OP_FSYNC was specified in the
       submission queue entry, but the io_uring instance was configured
       for IOPOLLing, or any of addr, ioprio, off, len, or buf_index
       was set in the submission queue entry.

EINVAL IORING_OP_POLL_ADD or IORING_OP_POLL_REMOVE was specified in the
       opcode field of the submission queue entry, but the io_uring
       instance was configured for busy-wait polling
       (IORING_SETUP_IOPOLL), or any of ioprio, off, len, or buf_index
       was non-zero in the submission queue entry.

EINVAL IORING_OP_POLL_ADD was specified in the opcode field of the
       submission queue entry, and the addr field was non-zero.
566 ENXIO The io_uring instance is in the process of being torn down.
567
568 EOPNOTSUPP
569 fd does not refer to an io_uring instance.
570
571 EOPNOTSUPP
572 opcode is valid, but not supported by this kernel.
573
574 EINTR The operation was interrupted by a delivery of a signal before
575 it could complete; see signal(7). Can happen while waiting for
576 events with IORING_ENTER_GETEVENTS.
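
An application typically has to decide what to do when the system call
itself fails. The sketch below encodes one possible policy and is not the
only reasonable one: retrying immediately on EINTR is an assumption of
this example, while draining completions before retrying on EAGAIN and
EBUSY follows the descriptions above. The enum and function names are
hypothetical.

```c
#include <errno.h>

/* Hypothetical policy for a failed io_uring_enter() call. */
enum enter_action {
    ENTER_RETRY_NOW,        /* call again right away */
    ENTER_REAP_THEN_RETRY,  /* process some completions, then call again */
    ENTER_FAIL              /* report the error to the caller */
};

static enum enter_action classify_enter_errno(int err)
{
    switch (err) {
    case EINTR:
        return ENTER_RETRY_NOW;
    case EAGAIN:
    case EBUSY:
        return ENTER_REAP_THEN_RETRY;
    default:
        return ENTER_FAIL;
    }
}
```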

Linux                             2019-01-22                IO_URING_ENTER(2)