io_uring_enter(2) Linux Programmer's Manual io_uring_enter(2)

NAME
io_uring_enter - initiate and/or complete asynchronous I/O

SYNOPSIS
#include <liburing.h>

int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig);

int io_uring_enter2(unsigned int fd, unsigned int to_submit,
                    unsigned int min_complete, unsigned int flags,
                    sigset_t *sig, size_t sz);

DESCRIPTION
io_uring_enter(2) is used to initiate and complete I/O using the shared
submission and completion queues set up by a call to io_uring_setup(2).
A single call can both submit new I/O and wait for completions of I/O
initiated by this call or previous calls to io_uring_enter(2).
24
25 fd is the file descriptor returned by io_uring_setup(2). to_submit
26 specifies the number of I/Os to submit from the submission queue.
27 flags is a bitmask of the following values:
28
29 IORING_ENTER_GETEVENTS
30 If this flag is set, then the system call will wait for the
31 specified number of events in min_complete before returning.
32 This flag can be set along with to_submit to both submit and
33 complete events in a single system call.
34
IORING_ENTER_SQ_WAKEUP
If the ring has been created with IORING_SETUP_SQPOLL, then this
flag asks the kernel to wake up the SQ kernel thread to submit
IO.

IORING_ENTER_SQ_WAIT
If the ring has been created with IORING_SETUP_SQPOLL, then the
application has no real insight into when the SQ kernel thread
has consumed entries from the SQ ring. This can lead to a situation
where the application can no longer get a free SQE entry to
submit, without knowing when one becomes available as the SQ
kernel thread consumes them. If the system call is used with
this flag set, then it will wait until at least one entry is
free in the SQ ring.
49
IORING_ENTER_EXT_ARG
Since kernel 5.11, the system call's arguments have been modified
to look like the following:

int io_uring_enter2(unsigned int fd, unsigned int to_submit,
                    unsigned int min_complete, unsigned int flags,
                    const void *arg, size_t argsz);

which behaves just like the original definition by default. However,
if IORING_ENTER_EXT_ARG is set, then a pointer to a
struct io_uring_getevents_arg is passed in instead of a sigset_t,
and argsz must be set to the size of this structure. The
definition is as follows:

struct io_uring_getevents_arg {
        __u64   sigmask;
        __u32   sigmask_sz;
        __u32   pad;
        __u64   ts;
};

which allows passing in both a signal mask and a pointer to
a struct __kernel_timespec timeout value. If ts is set to a
valid pointer, then this time value indicates the timeout for
waiting on events. If an application is waiting on events and
wishes to stop waiting after a specified amount of time, then
this can be accomplished directly in version 5.11 and newer by
using this feature.
78
79 IORING_ENTER_REGISTERED_RING
80 If the ring file descriptor has been registered through use of
81 IORING_REGISTER_RING_FDS, then setting this flag will tell the
82 kernel that the ring_fd passed in is the registered ring offset
83 rather than a normal file descriptor.
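
As an illustration of IORING_ENTER_EXT_ARG, the following is a minimal
sketch of waiting for completions with a timeout through the raw system
call. The names getevents_arg, fill_wait_arg and wait_cqes_timeout are
hypothetical and not part of liburing; the structure is a local mirror
of the definition shown above, and the flag values are fallbacks that
are assumed to match the kernel header.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/time_types.h>

#ifndef IORING_ENTER_GETEVENTS
#define IORING_ENTER_GETEVENTS (1U << 0)
#endif
#ifndef IORING_ENTER_EXT_ARG
#define IORING_ENTER_EXT_ARG   (1U << 3)
#endif

/* Local mirror of struct io_uring_getevents_arg (hypothetical name), so
 * the sketch also builds against kernel headers older than 5.11. */
struct getevents_arg {
    uint64_t sigmask;
    uint32_t sigmask_sz;
    uint32_t pad;
    uint64_t ts;
};

/* Fill the extended argument: optional signal mask, optional timeout. */
void fill_wait_arg(struct getevents_arg *arg, const sigset_t *mask,
                   const struct __kernel_timespec *ts)
{
    assert(arg != NULL);
    memset(arg, 0, sizeof(*arg));          /* pad is left at zero */
    if (mask) {
        arg->sigmask = (uint64_t)(uintptr_t)mask;
        arg->sigmask_sz = _NSIG / 8;
    }
    arg->ts = (uint64_t)(uintptr_t)ts;
}

/* Wait for min_complete completions or until ts elapses. Returns the
 * raw syscall(2) result: -1 with errno set on failure. */
int wait_cqes_timeout(int ring_fd, unsigned min_complete,
                      const struct __kernel_timespec *ts)
{
    struct getevents_arg arg;

    fill_wait_arg(&arg, NULL, ts);
    return (int)syscall(__NR_io_uring_enter, ring_fd, 0, min_complete,
                        IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG,
                        &arg, sizeof(arg));
}
```

On kernels older than 5.11 the call is expected to fail, so a caller
would need a fallback such as a separate IORING_OP_TIMEOUT request.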
84
85
86 If the io_uring instance was configured for polling, by specifying IOR‐
87 ING_SETUP_IOPOLL in the call to io_uring_setup(2), then min_complete
88 has a slightly different meaning. Passing a value of 0 instructs the
89 kernel to return any events which are already complete, without block‐
90 ing. If min_complete is a non-zero value, the kernel will still return
91 immediately if any completion events are available. If no event com‐
92 pletions are available, then the call will poll either until one or
93 more completions become available, or until the process has exceeded
94 its scheduler time slice.
95
96 Note that, for interrupt driven I/O (where IORING_SETUP_IOPOLL was not
97 specified in the call to io_uring_setup(2)), an application may check
98 the completion queue for event completions without entering the kernel
99 at all.
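
Checking the completion queue from user space can be sketched as
follows. This is an illustration only: struct cq_view and cq_peek are
hypothetical names, the pointers are assumed to have been derived from
the mmap'ed ring and the offsets reported by io_uring_setup(2), and the
default CQE size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/io_uring.h>

/* Hypothetical view of the mmap'ed completion ring. */
struct cq_view {
    unsigned *khead;            /* consumer index, written by the application */
    unsigned *ktail;            /* producer index, written by the kernel */
    unsigned ring_mask;         /* number of CQ entries minus one */
    struct io_uring_cqe *cqes;  /* start of the CQE array */
};

/* Copy the oldest pending CQE into *out and return 1, or return 0 if
 * the completion queue is empty. No system call is made. */
int cq_peek(struct cq_view *cq, struct io_uring_cqe *out)
{
    assert(cq != NULL && out != NULL);

    unsigned head = *cq->khead;
    /* Acquire pairs with the kernel's release store of the tail, so the
     * CQE contents are visible before they are read. */
    unsigned tail = __atomic_load_n(cq->ktail, __ATOMIC_ACQUIRE);

    if (head == tail)
        return 0;

    const struct io_uring_cqe *cqe = &cq->cqes[head & cq->ring_mask];
    out->user_data = cqe->user_data;
    out->res = cqe->res;
    out->flags = cqe->flags;

    /* Hand the slot back to the kernel only after the copy is done. */
    __atomic_store_n(cq->khead, head + 1, __ATOMIC_RELEASE);
    return 1;
}
```

Applications using liburing would normally rely on its own helpers for
this instead of reading the ring by hand.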
100
When the system call returns that a certain number of SQEs have been
consumed and submitted, it's safe to reuse those SQEs in the ring.
This is true even if the actual IO submission had to be punted to async
context, which means that the SQE may in fact not have been submitted
yet. If the kernel requires later use of a particular SQE, it
will have made a private copy of it.
107
sig is a pointer to a signal mask (see sigprocmask(2)); if sig is not
NULL, io_uring_enter(2) first replaces the current signal mask with the
one pointed to by sig, then waits for events to become available in the
completion queue, and then restores the original signal mask. The
following io_uring_enter(2) call:

    ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, &sig);

is equivalent to atomically executing the following calls:

    pthread_sigmask(SIG_SETMASK, &sig, &orig);
    ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, NULL);
    pthread_sigmask(SIG_SETMASK, &orig, NULL);

See the description of pselect(2) for an explanation of why the sig
parameter is necessary.
124
Submission queue entries are represented using the following data
structure:

/*
 * IO submission data structure (Submission Queue Entry)
 */
struct io_uring_sqe {
    __u8    opcode;         /* type of operation for this sqe */
    __u8    flags;          /* IOSQE_ flags */
    __u16   ioprio;         /* ioprio for the request */
    __s32   fd;             /* file descriptor to do IO on */
    union {
        __u64   off;            /* offset into file */
        __u64   addr2;
    };
    union {
        __u64   addr;           /* pointer to buffer or iovecs */
        __u64   splice_off_in;
    };
    __u32   len;            /* buffer size or number of iovecs */
    union {
        __kernel_rwf_t  rw_flags;
        __u32    fsync_flags;
        __u16    poll_events;   /* compatibility */
        __u32    poll32_events; /* word-reversed for BE */
        __u32    sync_range_flags;
        __u32    msg_flags;
        __u32    timeout_flags;
        __u32    accept_flags;
        __u32    cancel_flags;
        __u32    open_flags;
        __u32    statx_flags;
        __u32    fadvise_advice;
        __u32    splice_flags;
        __u32    rename_flags;
        __u32    unlink_flags;
        __u32    hardlink_flags;
    };
    __u64    user_data;     /* data to be passed back at completion time */
    union {
        struct {
            union {
                /* index into fixed buffers, if used */
                __u16    buf_index;
                /* for grouped buffer selection */
                __u16    buf_group;
            };
            /* personality to use, if used */
            __u16    personality;
            union {
                __s32    splice_fd_in;
                __u32    file_index;
            };
        };
        __u64    __pad2[3];
    };
};
183
184 The opcode describes the operation to be performed. It can be one of:
185
186 IORING_OP_NOP
187 Do not perform any I/O. This is useful for testing the perfor‐
188 mance of the io_uring implementation itself.
189
190 IORING_OP_READV
191
192 IORING_OP_WRITEV
193 Vectored read and write operations, similar to preadv2(2) and
194 pwritev2(2). If the file is not seekable, off must be set to
195 zero or -1.
196
197
198 IORING_OP_READ_FIXED
199
200 IORING_OP_WRITE_FIXED
Read from or write to pre-mapped buffers. See io_uring_register(2)
for details on how to set up a context for fixed reads and
writes.
204
205
206 IORING_OP_FSYNC
207 File sync. See also fsync(2). Note that, while I/O is initi‐
208 ated in the order in which it appears in the submission queue,
209 completions are unordered. For example, an application which
210 places a write I/O followed by an fsync in the submission queue
211 cannot expect the fsync to apply to the write. The two opera‐
212 tions execute in parallel, so the fsync may complete before the
213 write is issued to the storage. The same is also true for pre‐
214 viously issued writes that have not completed prior to the
215 fsync.
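
One way an application might get the ordering that plain submission
does not provide is to link the two requests with IOSQE_IO_LINK, which
this page refers to under IORING_OP_LINK_TIMEOUT. The sketch below only
fills two SQEs in memory; prep_write_then_fsync is a hypothetical
helper, the user_data values are arbitrary, and the two SQEs are
assumed to be submitted back to back in this order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

/* Fill two consecutive SQEs so that the fsync is started only after
 * the vectored write has completed. */
void prep_write_then_fsync(struct io_uring_sqe sqes[2], int fd,
                           const struct iovec *iov, unsigned nr_iov,
                           uint64_t offset)
{
    assert(sqes != NULL && iov != NULL);
    memset(sqes, 0, 2 * sizeof(sqes[0]));

    sqes[0].opcode = IORING_OP_WRITEV;
    sqes[0].flags = IOSQE_IO_LINK;      /* the next SQE waits for this one */
    sqes[0].fd = fd;
    sqes[0].off = offset;
    sqes[0].addr = (uint64_t)(uintptr_t)iov;
    sqes[0].len = nr_iov;
    sqes[0].user_data = 1;              /* arbitrary tag for the write */

    sqes[1].opcode = IORING_OP_FSYNC;   /* last in the chain: no link flag */
    sqes[1].fd = fd;
    sqes[1].user_data = 2;              /* arbitrary tag for the fsync */
}
```

If the write fails, the linked fsync is expected to complete with
-ECANCELED rather than run, so both completions should be checked.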
216
217
218 IORING_OP_POLL_ADD
219 Poll the fd specified in the submission queue entry for the
220 events specified in the poll_events field. Unlike poll or epoll
221 without EPOLLONESHOT, by default this interface always works in
222 one shot mode. That is, once the poll operation is completed,
223 it will have to be resubmitted.
224
If IORING_POLL_ADD_MULTI is set in the SQE len field, then the
poll will work in multi shot mode instead. That means it will
repeatedly trigger when the requested event becomes true, and
hence multiple CQEs can be generated from this single SQE. The
CQE flags field will have IORING_CQE_F_MORE set on completion if
the application should expect further CQE entries from the original
request. If this flag isn't set on completion, then the
poll request has been terminated and no further events will be
generated. This mode is available since 5.13.
234
If IORING_POLL_UPDATE_EVENTS is set in the SQE len field, then
the request will update an existing poll request with the mask
of events passed in with this request. The lookup is based on
the user_data field of the original SQE submitted, and this value
is passed in the addr field of the SQE. This mode is available
since 5.13.
241
242 If IORING_POLL_UPDATE_USER_DATA is set in the SQE len field,
243 then the request will update the user_data of an existing poll
244 request based on the value passed in the off field. This mode is
245 available since 5.13.
246
This command works like an async poll(2) and the completion
event result is the returned mask of events. For the variants
that update user_data or events, the completion result will be
similar to IORING_OP_POLL_REMOVE.
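
A multi shot poll and the check for IORING_CQE_F_MORE might be sketched
as below. prep_poll_multishot and poll_still_armed are hypothetical
helpers; the fallback flag values are assumed to match the kernel
header, and the compatibility poll_events field is used, which assumes
a little-endian machine.

```c
#include <assert.h>
#include <poll.h>       /* POLLIN and friends, for callers */
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

#ifndef IORING_POLL_ADD_MULTI
#define IORING_POLL_ADD_MULTI (1U << 0)
#endif
#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE     (1U << 1)
#endif

/* Fill an SQE that keeps posting CQEs each time events become true. */
void prep_poll_multishot(struct io_uring_sqe *sqe, int fd,
                         unsigned short events, uint64_t user_data)
{
    assert(sqe != NULL);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_POLL_ADD;
    sqe->fd = fd;
    sqe->poll_events = events;          /* little-endian assumption */
    sqe->len = IORING_POLL_ADD_MULTI;   /* multi shot is selected via len */
    sqe->user_data = user_data;
}

/* Returns nonzero while further CQEs are expected from the same SQE;
 * zero means the poll has terminated and must be resubmitted. */
int poll_still_armed(const struct io_uring_cqe *cqe)
{
    assert(cqe != NULL);
    return (cqe->flags & IORING_CQE_F_MORE) != 0;
}
```

A completion handler would typically call poll_still_armed() on every
CQE carrying this user_data and re-arm the poll once it returns zero.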
251
252
253 IORING_OP_POLL_REMOVE
254 Remove an existing poll request. If found, the res field of the
255 struct io_uring_cqe will contain 0. If not found, res will con‐
256 tain -ENOENT, or -EALREADY if the poll request was in the
257 process of completing already.
258
259
260 IORING_OP_EPOLL_CTL
Add, remove or modify entries in the interest list of epoll(7).
See epoll_ctl(2) for details of the system call. fd holds the
file descriptor that represents the epoll instance, addr holds
the file descriptor to add, remove or modify, len holds the
operation (EPOLL_CTL_ADD, EPOLL_CTL_DEL, EPOLL_CTL_MOD) to perform,
and off holds a pointer to the epoll_event structure. Available
since 5.6.
268
269
270 IORING_OP_SYNC_FILE_RANGE
Issue the equivalent of a sync_file_range(2) on the file
descriptor. The fd field is the file descriptor to sync, the off
field holds the offset in bytes, the len field holds the length
in bytes, and the sync_range_flags field holds the flags for the
command. See also sync_file_range(2) for the general description
of the related system call. Available since 5.2.
277
278
279 IORING_OP_SENDMSG
280 Issue the equivalent of a sendmsg(2) system call. fd must be
281 set to the socket file descriptor, addr must contain a pointer
282 to the msghdr structure, and msg_flags holds the flags associ‐
283 ated with the system call. See also sendmsg(2) for the general
284 description of the related system call. Available since 5.3.
285
286 This command also supports the following modifiers in ioprio:
287
288
IORING_RECVSEND_POLL_FIRST
If set, io_uring will assume the socket is currently full and
that attempting to send data will be unsuccessful. For this
case, io_uring will arm internal poll and trigger a send of the
data when there is enough space available. The initial send
attempt can be wasteful when the socket is expected to be full;
setting this flag bypasses that attempt and goes straight to
arming poll. If poll does indicate that data can be sent, the
operation will proceed.
298
299 IORING_OP_RECVMSG
300 Works just like IORING_OP_SENDMSG, except for recvmsg(2) in‐
301 stead. See the description of IORING_OP_SENDMSG. Available since
302 5.3.
303
304 This command also supports the following modifiers in ioprio:
305
306
IORING_RECVSEND_POLL_FIRST
If set, io_uring will assume the socket is currently empty and
that attempting to receive data will be unsuccessful. For this
case, io_uring will arm internal poll and trigger a receive of
the data when the socket has data to be read. The initial
receive attempt can be wasteful when the socket is expected to
be empty; setting this flag bypasses that attempt and goes
straight to arming poll. If poll does indicate that data is
ready to be received, the operation will proceed.
317
318 IORING_OP_SEND
319 Issue the equivalent of a send(2) system call. fd must be set
320 to the socket file descriptor, addr must contain a pointer to
321 the buffer, len denotes the length of the buffer to send, and
322 msg_flags holds the flags associated with the system call. See
323 also send(2) for the general description of the related system
324 call. Available since 5.6.
325
326 This command also supports the following modifiers in ioprio:
327
328
IORING_RECVSEND_POLL_FIRST
If set, io_uring will assume the socket is currently full and
that attempting to send data will be unsuccessful. For this
case, io_uring will arm internal poll and trigger a send of the
data when there is enough space available. The initial send
attempt can be wasteful when the socket is expected to be full;
setting this flag bypasses that attempt and goes straight to
arming poll. If poll does indicate that data can be sent, the
operation will proceed.
338
339 IORING_OP_RECV
340 Works just like IORING_OP_SEND, except for recv(2) instead. See
341 the description of IORING_OP_SEND. Available since 5.6.
342
343 This command also supports the following modifiers in ioprio:
344
345
IORING_RECVSEND_POLL_FIRST
If set, io_uring will assume the socket is currently empty and
that attempting to receive data will be unsuccessful. For this
case, io_uring will arm internal poll and trigger a receive of
the data when the socket has data to be read. The initial
receive attempt can be wasteful when the socket is expected to
be empty; setting this flag bypasses that attempt and goes
straight to arming poll. If poll does indicate that data is
ready to be received, the operation will proceed.
356
357 IORING_OP_TIMEOUT
358 This command will register a timeout operation. The addr field
359 must contain a pointer to a struct timespec64 structure, len
360 must contain 1 to signify one timespec64 structure, time‐
361 out_flags may contain IORING_TIMEOUT_ABS for an absolute timeout
362 value, or 0 for a relative timeout. off may contain a comple‐
363 tion event count. A timeout will trigger a wakeup event on the
364 completion ring for anyone waiting for events. A timeout condi‐
365 tion is met when either the specified timeout expires, or the
specified number of events have completed. Either condition will
trigger the event. If off is set to 0, completed events are not
counted, which effectively acts like a timer. io_uring timeouts
369 use the CLOCK_MONOTONIC clock source. The request will complete
370 with -ETIME if the timeout got completed through expiration of
371 the timer, or 0 if the timeout got completed through requests
372 completing on their own. If the timeout was canceled before it
373 expired, the request will complete with -ECANCELED. Available
374 since 5.4.
375
376 Since 5.15, this command also supports the following modifiers
377 in timeout_flags:
378
379
IORING_TIMEOUT_BOOTTIME
If set, then the clock source used is CLOCK_BOOTTIME instead of
CLOCK_MONOTONIC. This clock source differs in that it includes
time elapsed if the system was suspended while having a timeout
request in-flight.

IORING_TIMEOUT_REALTIME
If set, then the clock source used is CLOCK_REALTIME instead of
CLOCK_MONOTONIC.
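
The field usage for a plain relative timer, and the three documented
completion results, can be sketched as follows. prep_timer and
timeout_result are hypothetical helpers, and setting fd to -1 is only a
convention assumed here for an opcode that does not operate on a file.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>
#include <linux/time_types.h>

/* Fill an SQE for a relative timer that does not count completions. */
void prep_timer(struct io_uring_sqe *sqe,
                const struct __kernel_timespec *ts, uint64_t user_data)
{
    assert(sqe != NULL && ts != NULL);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_TIMEOUT;
    sqe->fd = -1;                           /* no file is involved */
    sqe->addr = (uint64_t)(uintptr_t)ts;    /* pointer to the time value */
    sqe->len = 1;                           /* exactly one time structure */
    sqe->off = 0;                           /* 0: act as a pure timer */
    sqe->timeout_flags = 0;                 /* relative, CLOCK_MONOTONIC */
    sqe->user_data = user_data;
}

/* Map the res field of a timeout CQE to the cases described above. */
const char *timeout_result(int res)
{
    switch (res) {
    case -ETIME:     return "expired";
    case 0:          return "event count reached";
    case -ECANCELED: return "canceled";
    default:         return "error";
    }
}
```

Note that -ETIME is the normal outcome for a timer, so treating every
negative res as a failure would misreport expired timeouts.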
387
388 IORING_OP_TIMEOUT_REMOVE
If timeout_flags are zero, then it attempts to remove an existing
timeout operation. addr must contain the user_data field of
the previously issued timeout operation. If the specified timeout
request is found and canceled successfully, this request
will terminate with a result value of 0. If the timeout request
was found but expiration was already in progress, this request
will terminate with a result value of -EBUSY. If the timeout
request wasn't found, the request will terminate with a result
value of -ENOENT. Available since 5.5.
398
If timeout_flags contain IORING_TIMEOUT_UPDATE, instead of
removing an existing operation, it updates it. addr and return
values are the same as before. The addr2 field must contain a
pointer to a struct timespec64 structure. timeout_flags may also
contain IORING_TIMEOUT_ABS, in which case the value given is an
absolute one, not a relative one. Available since 5.11.
405
406
407 IORING_OP_ACCEPT
408 Issue the equivalent of an accept4(2) system call. fd must be
409 set to the socket file descriptor, addr must contain the pointer
410 to the sockaddr structure, and addr2 must contain a pointer to
411 the socklen_t addrlen field. Flags can be passed using the ac‐
412 cept_flags field. See also accept4(2) for the general descrip‐
413 tion of the related system call. Available since 5.5.
414
415 If the file_index field is set to a positive number, the file
416 won't be installed into the normal file table as usual but will
417 be placed into the fixed file table at index file_index - 1. In
418 this case, instead of returning a file descriptor, the result
419 will contain either 0 on success or an error. If the index
420 points to a valid empty slot, the installation is guaranteed to
421 not fail. If there is already a file in the slot, it will be re‐
422 placed, similar to IORING_OP_FILES_UPDATE. Please note that
423 only io_uring has access to such files and no other syscall can
424 use them. See IOSQE_FIXED_FILE and IORING_REGISTER_FILES.
425
Available since 5.15.
427
428
429 IORING_OP_ASYNC_CANCEL
Attempt to cancel an already issued request. addr must contain
the user_data field of the request that should be canceled. The
cancelation request will complete with one of the following
result codes. If found and canceled, the res field of the cqe
will contain 0. If not found, res will contain -ENOENT. If found
and cancelation was attempted, the res field will contain
-EALREADY. In this case, the request may or may not terminate.
In general, requests that are interruptible (like socket IO)
will get canceled, while disk IO requests cannot be canceled if
already started. Available since 5.5.
440
441
442 IORING_OP_LINK_TIMEOUT
443 This request must be linked with another request through
444 IOSQE_IO_LINK which is described below. Unlike IORING_OP_TIME‐
445 OUT, IORING_OP_LINK_TIMEOUT acts on the linked request, not the
446 completion queue. The format of the command is otherwise like
447 IORING_OP_TIMEOUT, except there's no completion event count as
448 it's tied to a specific request. If used, the timeout specified
449 in the command will cancel the linked command, unless the linked
command completes before the timeout. The timeout will complete
with -ETIME if the timer expired and cancelation of the linked
request was attempted, or -ECANCELED if the timer got canceled
because of completion of the linked request. Like
IORING_OP_TIMEOUT, the clock source used is CLOCK_MONOTONIC.
Available since 5.5.
456
457
458
459 IORING_OP_CONNECT
460 Issue the equivalent of a connect(2) system call. fd must be
461 set to the socket file descriptor, addr must contain the const
462 pointer to the sockaddr structure, and off must contain the
463 socklen_t addrlen field. See also connect(2) for the general de‐
464 scription of the related system call. Available since 5.5.
465
466
467 IORING_OP_FALLOCATE
468 Issue the equivalent of a fallocate(2) system call. fd must be
469 set to the file descriptor, len must contain the mode associated
470 with the operation, off must contain the offset on which to op‐
471 erate, and addr must contain the length. See also fallocate(2)
472 for the general description of the related system call. Avail‐
473 able since 5.6.
474
475
476 IORING_OP_FADVISE
477 Issue the equivalent of a posix_fadvise(2) system call. fd must
478 be set to the file descriptor, off must contain the offset on
479 which to operate, len must contain the length, and fadvise_ad‐
480 vice must contain the advice associated with the operation. See
481 also posix_fadvise(2) for the general description of the related
482 system call. Available since 5.6.
483
484
485 IORING_OP_MADVISE
486 Issue the equivalent of a madvise(2) system call. addr must
487 contain the address to operate on, len must contain the length
488 on which to operate, and fadvise_advice must contain the advice
489 associated with the operation. See also madvise(2) for the gen‐
490 eral description of the related system call. Available since
491 5.6.
492
493
494 IORING_OP_OPENAT
Issue the equivalent of an openat(2) system call. fd is the
dirfd argument, addr must contain a pointer to the *pathname
argument, open_flags should contain any flags passed in, and len
is the access mode of the file. See also openat(2) for the general
description of the related system call. Available since 5.6.
500
501 If the file_index field is set to a positive number, the file
502 won't be installed into the normal file table as usual but will
503 be placed into the fixed file table at index file_index - 1. In
504 this case, instead of returning a file descriptor, the result
505 will contain either 0 on success or an error. If the index
506 points to a valid empty slot, the installation is guaranteed to
507 not fail. If there is already a file in the slot, it will be re‐
508 placed, similar to IORING_OP_FILES_UPDATE. Please note that
509 only io_uring has access to such files and no other syscall can
510 use them. See IOSQE_FIXED_FILE and IORING_REGISTER_FILES.
511
512 Available since 5.15.
513
514
515 IORING_OP_OPENAT2
Issue the equivalent of an openat2(2) system call. fd is the
517 dirfd argument, addr must contain a pointer to the *pathname ar‐
518 gument, len should contain the size of the open_how structure,
519 and off should be set to the address of the open_how structure.
520 See also openat2(2) for the general description of the related
521 system call. Available since 5.6.
522
523 If the file_index field is set to a positive number, the file
524 won't be installed into the normal file table as usual but will
525 be placed into the fixed file table at index file_index - 1. In
526 this case, instead of returning a file descriptor, the result
527 will contain either 0 on success or an error. If the index
528 points to a valid empty slot, the installation is guaranteed to
529 not fail. If there is already a file in the slot, it will be re‐
530 placed, similar to IORING_OP_FILES_UPDATE. Please note that
531 only io_uring has access to such files and no other syscall can
532 use them. See IOSQE_FIXED_FILE and IORING_REGISTER_FILES.
533
534 Available since 5.15.
535
536
537 IORING_OP_CLOSE
Issue the equivalent of a close(2) system call. fd is the file
descriptor to be closed. See also close(2) for the general
description of the related system call. Available since 5.6. If
the file_index field is set to a positive number, this command
can be used to close files that were direct opened through
IORING_OP_OPENAT, IORING_OP_OPENAT2, or IORING_OP_ACCEPT using
the io_uring specific direct descriptors. Note that only one of
the descriptor fields may be set. The direct close feature is
available since the 5.15 kernel, where direct descriptors were
introduced.
548
549
550 IORING_OP_STATX
551 Issue the equivalent of a statx(2) system call. fd is the dirfd
552 argument, addr must contain a pointer to the *pathname string,
553 statx_flags is the flags argument, len should be the mask argu‐
554 ment, and off must contain a pointer to the statxbuf to be
555 filled in. See also statx(2) for the general description of the
556 related system call. Available since 5.6.
557
558
559 IORING_OP_READ
560
561 IORING_OP_WRITE
Issue the equivalent of a pread(2) or pwrite(2) system call. fd
is the file descriptor to be operated on, addr contains the
buffer in question, len contains the length of the IO operation,
and off contains the read or write offset. If fd does not refer
to a seekable file, off must be set to zero or -1. If off is
set to -1, the offset will use (and advance) the file position,
like the read(2) and write(2) system calls. These are
non-vectored versions of the IORING_OP_READV and IORING_OP_WRITEV
opcodes. See also read(2) and write(2) for the general description
of the related system call. Available since 5.6.
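
As a small illustration of the offset convention, the sketch below fills
an IORING_OP_READ SQE. prep_read is a hypothetical helper; passing -1
as the offset stores the all-ones value in off, which requests use of
the current file position as described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

/* Fill an SQE for a non-vectored read of len bytes into buf.
 * offset == -1 means "use and advance the file position". */
void prep_read(struct io_uring_sqe *sqe, int fd, void *buf,
               unsigned len, int64_t offset)
{
    assert(sqe != NULL);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READ;
    sqe->fd = fd;
    sqe->addr = (uint64_t)(uintptr_t)buf;
    sqe->len = len;
    sqe->off = (uint64_t)offset;    /* -1 becomes 0xffffffffffffffff */
}
```

The buffer must remain valid until the matching completion has been
reaped, since the kernel may fill it long after submission returns.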
572
573
574 IORING_OP_SPLICE
575 Issue the equivalent of a splice(2) system call. splice_fd_in
576 is the file descriptor to read from, splice_off_in is an offset
577 to read from, fd is the file descriptor to write to, off is an
578 offset from which to start writing to. A sentinel value of -1 is
579 used to pass the equivalent of a NULL for the offsets to
580 splice(2). len contains the number of bytes to copy.
581 splice_flags contains a bit mask for the flag field associated
582 with the system call. Please note that one of the file descrip‐
583 tors must refer to a pipe. See also splice(2) for the general
584 description of the related system call. Available since 5.7.
585
586
587 IORING_OP_TEE
588 Issue the equivalent of a tee(2) system call. splice_fd_in is
589 the file descriptor to read from, fd is the file descriptor to
590 write to, len contains the number of bytes to copy, and
591 splice_flags contains a bit mask for the flag field associated
592 with the system call. Please note that both of the file de‐
593 scriptors must refer to a pipe. See also tee(2) for the general
594 description of the related system call. Available since 5.8.
595
596
597 IORING_OP_FILES_UPDATE
598 This command is an alternative to using IORING_REGIS‐
599 TER_FILES_UPDATE which then works in an async fashion, like the
600 rest of the io_uring commands. The arguments passed in are the
601 same. addr must contain a pointer to the array of file descrip‐
602 tors, len must contain the length of the array, and off must
603 contain the offset at which to operate. Note that the array of
604 file descriptors pointed to in addr must remain valid until this
605 operation has completed. Available since 5.6.
606
607
608 IORING_OP_PROVIDE_BUFFERS
609 This command allows an application to register a group of buf‐
610 fers to be used by commands that read/receive data. Using buf‐
611 fers in this manner can eliminate the need to separate the poll
612 + read, which provides a convenient point in time to allocate a
buffer for a given request. It's often infeasible to have as
many buffers available as pending reads or receives. With this
615 feature, the application can have its pool of buffers ready in
616 the kernel, and when the file or socket is ready to read/receive
617 data, a buffer can be selected for the operation. fd must con‐
618 tain the number of buffers to provide, addr must contain the
619 starting address to add buffers from, len must contain the
620 length of each buffer to add from the range, buf_group must con‐
621 tain the group ID of this range of buffers, and off must contain
622 the starting buffer ID of this range of buffers. With that set,
623 the kernel adds buffers starting with the memory address in
624 addr, each with a length of len. Hence the application should
625 provide len * fd worth of memory in addr. Buffers are grouped
626 by the group ID, and each buffer within this group will be iden‐
627 tical in size according to the above arguments. This allows the
628 application to provide different groups of buffers, and this is
629 often used to have differently sized buffers available depending
630 on what the expectations are of the individual request. When
631 submitting a request that should use a provided buffer, the
632 IOSQE_BUFFER_SELECT flag must be set, and buf_group must be set
633 to the desired buffer group ID where the buffer should be se‐
634 lected from. Available since 5.7.
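
The field mapping above might be sketched like this. The three helpers
are hypothetical; cqe_buffer_id relies on IORING_CQE_F_BUFFER and
IORING_CQE_BUFFER_SHIFT from the kernel header, which report the chosen
buffer ID in the CQE flags and are not described in the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hand nr_bufs buffers of buf_len bytes each, laid out back to back
 * starting at base, to the kernel as group `group`. The caller must
 * own buf_len * nr_bufs bytes at base. */
void prep_provide_buffers(struct io_uring_sqe *sqe, void *base,
                          unsigned buf_len, int nr_bufs,
                          unsigned short group, unsigned short first_id)
{
    assert(sqe != NULL && base != NULL && nr_bufs > 0);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_PROVIDE_BUFFERS;
    sqe->fd = nr_bufs;                      /* number of buffers */
    sqe->addr = (uint64_t)(uintptr_t)base;  /* start of the memory range */
    sqe->len = buf_len;                     /* length of each buffer */
    sqe->off = first_id;                    /* ID of the first buffer */
    sqe->buf_group = group;
}

/* Receive into whichever buffer the kernel picks from `group`. */
void prep_recv_select(struct io_uring_sqe *sqe, int sockfd,
                      unsigned max_len, unsigned short group)
{
    assert(sqe != NULL);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_RECV;
    sqe->flags = IOSQE_BUFFER_SELECT;       /* addr stays 0: kernel selects */
    sqe->fd = sockfd;
    sqe->len = max_len;
    sqe->buf_group = group;
}

/* Returns the selected buffer ID, or -1 if no buffer was selected. */
int cqe_buffer_id(const struct io_uring_cqe *cqe)
{
    assert(cqe != NULL);
    if (!(cqe->flags & IORING_CQE_F_BUFFER))
        return -1;
    return (int)(cqe->flags >> IORING_CQE_BUFFER_SHIFT);
}
```

A consumed buffer is not returned to the group automatically, so an
application would provide it again once it has processed the data.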
635
636
637 IORING_OP_REMOVE_BUFFERS
638 Remove buffers previously registered with IORING_OP_PROVIDE_BUF‐
639 FERS. fd must contain the number of buffers to remove, and
640 buf_group must contain the buffer group ID from which to remove
641 the buffers. Available since 5.7.
642
643
644 IORING_OP_SHUTDOWN
Issue the equivalent of a shutdown(2) system call. fd is the
file descriptor of the socket being shut down, and len must be
set to the how argument. No other fields should be set.
Available since 5.11.
649
650
651 IORING_OP_RENAMEAT
Issue the equivalent of a renameat2(2) system call. fd should
be set to the olddirfd, addr should be set to the oldpath, len
should be set to the newdirfd, addr2 should be set to the
newpath, and finally rename_flags should be set to the flags
passed in to renameat2(2). Available since 5.11.
658
659
660 IORING_OP_UNLINKAT
Issue the equivalent of an unlinkat(2) system call. fd should
be set to the dirfd, addr should be set to the pathname, and
unlink_flags should be set to the flags being passed in to
unlinkat(2). Available since 5.11.
665
666
667 IORING_OP_MKDIRAT
Issue the equivalent of a mkdirat(2) system call. fd should be
set to the dirfd, addr should be set to the pathname, and len
should be set to the mode being passed in to mkdirat(2).
Available since 5.15.
672
673
674 IORING_OP_SYMLINKAT
Issue the equivalent of a symlinkat(2) system call. fd should
be set to the newdirfd, addr should be set to the target, and
addr2 should be set to the linkpath being passed in to
symlinkat(2). Available since 5.15.
679
680
681 IORING_OP_LINKAT
Issue the equivalent of a linkat(2) system call. fd should be
set to the olddirfd, addr should be set to the oldpath, len
should be set to the newdirfd, addr2 should be set to the
newpath, and hardlink_flags should be set to the flags being
passed in to linkat(2). Available since 5.15.
687
688
689 IORING_OP_MSG_RING
690 Send a message to an io_uring. fd must be set to a file de‐
691 scriptor of a ring that the application has access to, len can
692 be set to any 32-bit value that the application wishes to pass
on, and off should be set to any 64-bit value that the application
694 wishes to send. On the target ring, a CQE will be posted with
695 the res field matching the len set, and a user_data field match‐
696 ing the off value being passed in. This request type can be used
697 to either just wake or interrupt anyone waiting for completions
698 on the target ring, or it can be used to pass messages via the
699 two fields. Available since 5.18.
700
701
702 IORING_OP_SOCKET
703 Issue the equivalent of a socket(2) system call. fd must con‐
704 tain the communication domain, off must contain the communica‐
705 tion type, len must contain the protocol, and rw_flags is cur‐
706 rently unused and must be set to zero. See also socket(2) for
707 the general description of the related system call. Available
708 since 5.19.
709
710 If the file_index field is set to a positive number, the file
711 won't be installed into the normal file table as usual but will
712 be placed into the fixed file table at index file_index - 1. In
713 this case, instead of returning a file descriptor, the result
714 will contain either 0 on success or an error. If the index
715 points to a valid empty slot, the installation is guaranteed to
716 not fail. If there is already a file in the slot, it will be re‐
717 placed, similar to IORING_OP_FILES_UPDATE. Please note that
718 only io_uring has access to such files and no other syscall can
719 use them. See IOSQE_FIXED_FILE and IORING_REGISTER_FILES.
720
721 Available since 5.19.
722
723
724 IORING_OP_SEND_ZC
725 Issue the zerocopy equivalent of a send(2) system call. Similar
726 to IORING_OP_SEND, but tries to avoid making intermediate copies
727 of data. Zerocopy execution is not guaranteed and may fall back
728 to copying. The request may also fail with -EOPNOTSUPP when a
729 protocol doesn't support zerocopy, in which case users are
730 recommended to use copying sends instead.
731
732 The flags field of the first struct io_uring_cqe will likely
733 contain IORING_CQE_F_MORE, which means that there will be a
734 second completion event (a notification) for the request, with
735 the user_data field set to the same value. The user must not
736 modify the data buffer until the notification is posted. The
737 first CQE follows the usual rules, so its res field will contain
738 the number of bytes sent or a negative error code. The
739 notification's res field will be set to zero and its flags field
740 will contain IORING_CQE_F_NOTIF. The two-step model is needed
741 because the kernel may hold on to buffers for a long time, e.g.
742 waiting for a TCP ACK, and having a separate CQE for request
743 completion allows userspace to push more data without extra
744 delays. Note that notifications are only responsible for
745 controlling the lifetime of the buffers, and as such say nothing
746 about whether the data has actually been sent out or received by
747 the other end. Even errored requests may generate a
748 notification, so the user must check for IORING_CQE_F_MORE
749 rather than relying on the result.
750
751 fd must be set to the socket file descriptor, addr must contain
752 a pointer to the buffer, len denotes the length of the buffer to
753 send, and msg_flags holds the flags associated with the system
754 call. When addr2 is non-zero it points to the address of the
755 target with addr_len specifying its size, turning the request
756 into a sendto(2) system call equivalent.
757
758 Available since 6.0.
759
760 This command also supports the following modifiers in ioprio:
761
762
763 IORING_RECVSEND_POLL_FIRST If set, io_uring will assume the
764 socket is currently full and that attempting to send data
765 will be unsuccessful. For this case, io_uring will arm
766 internal poll and trigger a send of the data when there is
767 enough space available. The initial send attempt can be
768 wasteful when the socket is expected to be full; setting
769 this flag will bypass the initial send attempt and go
770 straight to arming poll. If poll does indicate that data
771 can be sent, the operation will proceed.
772
773 IORING_RECVSEND_FIXED_BUF If set, instructs io_uring to use
774 a pre-mapped buffer. The buf_index field should contain an
775 index into an array of fixed buffers. See
776 io_uring_register(2) for details on how to set up a context
777 for fixed buffer I/O.
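The two-CQE model of IORING_OP_SEND_ZC can be tracked with a small
per-request state machine. The sketch below is one possible way to do
it, not a prescribed method: the struct and function names are
hypothetical, and the two flag values are defined locally on the
assumption that they match <linux/io_uring.h>. Real code should take
them from that header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror IORING_CQE_F_MORE and IORING_CQE_F_NOTIF. */
#define MODEL_CQE_F_MORE  (1U << 1)
#define MODEL_CQE_F_NOTIF (1U << 3)

/* Hypothetical per-request bookkeeping kept by the application. */
struct zc_send_state {
    bool    result_seen;  /* first CQE (bytes sent or -errno) arrived */
    bool    buffer_free;  /* data buffer may be modified or reused */
    int32_t result;       /* res of the first CQE */
};

/* Feed every CQE whose user_data matches this request into here. */
static void zc_send_on_cqe(struct zc_send_state *st, int32_t res,
                           uint32_t flags)
{
    if (flags & MODEL_CQE_F_NOTIF) {
        /* Notification: the kernel no longer references the buffer. */
        st->buffer_free = true;
        return;
    }

    /* First CQE: usual result semantics. */
    st->result_seen = true;
    st->result = res;

    /* Without F_MORE no notification will follow, so the buffer is
     * already free. This holds for errored requests as well, which is
     * why the flag is checked rather than the sign of res. */
    if (!(flags & MODEL_CQE_F_MORE))
        st->buffer_free = true;
}
```

An application would zero-initialize one zc_send_state per request and
only recycle the data buffer once buffer_free is set.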
778
779 The flags field is a bit mask. The supported flags are:
780
781 IOSQE_FIXED_FILE
782 When this flag is specified, fd is an index into the files array
783 registered with the io_uring instance (see the
784 IORING_REGISTER_FILES section of the io_uring_register(2) man
785 page). Note that not every command supports this flag. If used
786 on a command that doesn't support fixed files, the SQE will
787 error with -EBADF. Available since 5.1.
788
789 IOSQE_IO_DRAIN
790 When this flag is specified, the SQE will not be started before
791 previously submitted SQEs have completed, and new SQEs will not
792 be started before this one completes. Available since 5.2.
793
794 IOSQE_IO_LINK
795 When this flag is specified, the SQE forms a link with the next
796 SQE in the submission ring. That next SQE will not be started
797 before the previous request completes. This, in effect, forms a
798 chain of SQEs, which can be arbitrarily long. The tail of the
799 chain is denoted by the first SQE that does not have this flag
800 set. Chains are not supported across submission boundaries. Even
801 if the last SQE in a submission has this flag set, it will still
802 terminate the current chain. This flag has no effect on previous
803 SQE submissions, nor does it impact SQEs that are outside of the
804 chain tail. This means that multiple chains can be executing in
805 parallel, or chains and individual SQEs. Only members inside the
806 chain are serialized. A chain of SQEs will be broken if any
807 request in that chain ends in error. io_uring considers any
808 unexpected result an error. This means that, e.g., a short read
809 will also terminate the remainder of the chain. If a chain of
810 SQE links is broken, the remaining unstarted part of the chain
811 will be terminated and completed with -ECANCELED as the error
812 code. Available since 5.3.
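The chain rules above can be illustrated with a small userspace model.
This is a simplified sketch under stated assumptions, not kernel code:
all names are hypothetical, the flag value is assumed to mirror
IOSQE_IO_LINK, and any negative result is treated as a failure (the
"unexpected result" case, such as a short read, is not modeled).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror IOSQE_IO_LINK from <linux/io_uring.h>. */
#define MODEL_SQE_IO_LINK (1U << 2)

/* Hypothetical request: its SQE flags and the result it would
 * produce if it were allowed to run. */
struct model_req {
    uint8_t flags;
    int32_t would_res;
};

/* Compute the res each request of one submission would report.
 * out must have room for n entries. */
static void model_link_results(const struct model_req *reqs, size_t n,
                               int32_t *out)
{
    bool linked_from_prev = false;  /* previous SQE had IO_LINK set */
    bool broken = false;            /* earlier chain member failed */

    for (size_t i = 0; i < n; i++) {
        if (!linked_from_prev)
            broken = false;         /* new chain or standalone SQE */

        if (broken) {
            out[i] = -ECANCELED;    /* unstarted remainder of chain */
        } else {
            out[i] = reqs[i].would_res;
            if (out[i] < 0)
                broken = true;
        }
        linked_from_prev = (reqs[i].flags & MODEL_SQE_IO_LINK) != 0;
    }
}
```

In this model a failure only affects later members of the same chain.
A request that follows the chain tail starts with a clean state, which
matches the statement that SQEs outside the chain are not impacted.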
813
814 IOSQE_IO_HARDLINK
815 Like IOSQE_IO_LINK, but the link is not severed regardless of
816 the completion result. Note that the link will still sever if
817 submitting the parent request fails; hard links are only
818 resilient in the presence of completion results for requests
819 that did submit correctly. IOSQE_IO_HARDLINK implies
820 IOSQE_IO_LINK. Available since 5.5.
821
822 IOSQE_ASYNC
823 Normal operation for io_uring is to try and issue an sqe as non-
824 blocking first, and if that fails, execute it in an async man‐
825 ner. To support more efficient overlapped operation of requests
826 that the application knows/assumes will always (or most of the
827 time) block, the application can ask for an sqe to be issued
828 async from the start. Available since 5.6.
829
830 IOSQE_BUFFER_SELECT
831 Used in conjunction with the IORING_OP_PROVIDE_BUFFERS command,
832 which registers a pool of buffers to be used by commands that
833 read or receive data. When buffers are registered for this use
834 case, and this flag is set in the command, io_uring will grab a
835 buffer from this pool when the request is ready to receive or
836 read data. If successful, the resulting CQE will have
837 IORING_CQE_F_BUFFER set in the flags part of the struct, and the
838 upper bits (flags >> IORING_CQE_BUFFER_SHIFT) will contain the
839 ID of the selected buffer. This allows the application to know
840 exactly which buffer was selected for the operation. If no
841 buffers are available and this flag is set, then the request
842 will fail with -ENOBUFS as the error code. Once a buffer has
843 been used, it is no longer available in the kernel pool. The
844 application must re-register the given buffer when it is ready
845 to recycle it (e.g. has completed using it). Available since 5.7.
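Extracting the selected buffer ID from a CQE can be sketched as
follows. The helper name is hypothetical, and the two constants are
defined locally on the assumption that they match IORING_CQE_F_BUFFER
and IORING_CQE_BUFFER_SHIFT in <linux/io_uring.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the values in <linux/io_uring.h>. */
#define MODEL_CQE_F_BUFFER     (1U << 0)
#define MODEL_CQE_BUFFER_SHIFT 16

/* Returns true and stores the buffer ID in *bid if the CQE flags say
 * that a buffer was selected; returns false otherwise. */
static bool cqe_buffer_id(uint32_t cqe_flags, uint16_t *bid)
{
    if (!(cqe_flags & MODEL_CQE_F_BUFFER))
        return false;

    *bid = (uint16_t)(cqe_flags >> MODEL_CQE_BUFFER_SHIFT);
    return true;
}
```

Checking the flag first matters because a buffer ID of zero is valid,
so the shifted value alone cannot tell whether a buffer was selected.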
846
847 IOSQE_CQE_SKIP_SUCCESS
848 Don't generate a CQE if the request completes successfully. If
849 the request fails, an appropriate CQE will be posted as usual
850 and if there is no IOSQE_IO_HARDLINK, CQEs for all linked re‐
851 quests will be omitted. The notion of failure/success is opcode
852 specific and is the same as with breaking chains of
853 IOSQE_IO_LINK. One special case is when the request has a
854 linked timeout, then the CQE generation for the linked timeout
855 is decided solely by whether it has IOSQE_CQE_SKIP_SUCCESS set,
856 regardless of whether it timed out or was canceled. In other words,
857 if a linked timeout has the flag set, it's guaranteed to not
858 post a CQE.
859
860 The semantics are chosen to accommodate several use cases.
861 First, when all but the last request of a normal link without
862 linked timeouts are marked with the flag, only one CQE per link
863 is posted. Additionally, it enables suppression of CQEs in cases
864 where the side effects of a successfully executed operation are
865 enough for userspace to know the state of the system. One such
866 example would be writing to a synchronisation file.
867
868 This flag is incompatible with IOSQE_IO_DRAIN. Using both of
869 them in a single ring is undefined behavior, even when they are
870 not used together in a single request. Currently, after the
871 first request with IOSQE_CQE_SKIP_SUCCESS, all subsequent re‐
872 quests marked with drain will be failed at submission time.
873 Note that the error reporting is best effort only, and restric‐
874 tions may change in the future.
875
876 Available since 5.17.
877
878
879 ioprio specifies the I/O priority. See ioprio_get(2) for a description
880 of Linux I/O priorities.
881
882 fd specifies the file descriptor against which the operation will be
883 performed, with the exception noted above.
884
885 If the operation is one of IORING_OP_READ_FIXED or IOR‐
886 ING_OP_WRITE_FIXED, addr and len must fall within the buffer located at
887 buf_index in the fixed buffer array. If the operation is either IOR‐
888 ING_OP_READV or IORING_OP_WRITEV, then addr points to an iovec array of
889 len entries.
890
891 rw_flags, specified for read and write operations, contains a bitwise
892 OR of per-I/O flags, as described in the preadv2(2) man page.
893
894 The fsync_flags bit mask may contain either 0, for a normal file integ‐
895 rity sync, or IORING_FSYNC_DATASYNC to provide data sync only seman‐
896 tics. See the descriptions of O_SYNC and O_DSYNC in the open(2) manual
897 page for more information.
898
899 The bits that may be set in poll_events are defined in <poll.h>, and
900 documented in poll(2).
901
902 user_data is an application-supplied value that will be copied into the
903 completion queue entry (see below). buf_index is an index into an ar‐
904 ray of fixed buffers, and is only valid if fixed buffers were regis‐
905 tered. personality is the credentials id to use for this operation.
906 See io_uring_register(2) for how to register personalities with io_ur‐
907 ing. If set to 0, the current personality of the submitting task is
908 used.
909
910 Once the submission queue entry is initialized, I/O is submitted by
911 placing the index of the submission queue entry into the tail of the
912 submission queue. After one or more indexes are added to the queue,
913 and the queue tail is advanced, the io_uring_enter(2) system call can
914 be invoked to initiate the I/O.
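The submission step described above can be sketched with a userspace
model of the SQ index ring. This is an illustration under stated
assumptions, not the real shared-memory layout: the struct, its fixed
size, and the function name are hypothetical, and a real application
would operate on the ring mapped after io_uring_setup(2) or use a
helper library.

```c
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_SQ_ENTRIES 8U  /* ring size, must be a power of two */

/* Hypothetical stand-in for the mmap'ed SQ ring. */
struct model_sq {
    _Atomic uint32_t head;   /* advanced by the consumer (kernel) */
    _Atomic uint32_t tail;   /* advanced by the application */
    uint32_t mask;           /* MODEL_SQ_ENTRIES - 1 */
    uint32_t array[MODEL_SQ_ENTRIES];  /* indexes into the SQE array */
};

/* Place one SQE index at the tail. Returns 0 on success, -1 if the
 * ring is full. */
static int model_sq_push(struct model_sq *sq, uint32_t sqe_index)
{
    uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&sq->head, memory_order_acquire);

    if (tail - head >= MODEL_SQ_ENTRIES)
        return -1;

    sq->array[tail & sq->mask] = sqe_index;

    /* Publish the entry: the index must be visible before the new
     * tail is, hence the release store. */
    atomic_store_explicit(&sq->tail, tail + 1, memory_order_release);
    return 0;
}
```

After one or more successful pushes, the application would call
io_uring_enter(2) with to_submit set to the number of entries added.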
915
916 Completions use the following data structure:
917
918 /*
919 * IO completion data structure (Completion Queue Entry)
920 */
921 struct io_uring_cqe {
922 __u64 user_data; /* sqe->data submission passed back */
923 __s32 res; /* result code for this event */
924 __u32 flags;
925 };
926
927 user_data is copied from the field of the same name in the submission
928 queue entry. The primary use case is to store data that the application
929 will need to access upon completion of this particular I/O. The flags
930 field is used for certain commands, like IORING_OP_POLL_ADD, or in
931 conjunction with IOSQE_BUFFER_SELECT or IORING_OP_MSG_RING; see those
932 entries for details. res is the operation-specific result, but
933 io_uring-specific errors (e.g. flags or opcode invalid) are
934 returned through this field. They are described in section CQE ERRORS.
935
936 For read and write opcodes, the return values match errno values docu‐
937 mented in the preadv2(2) and pwritev2(2) man pages, with res holding
938 the equivalent of -errno for error cases, or the transferred number of
939 bytes in case the operation is successful. Hence both error and success
940 return can be found in that field in the CQE. For other request types,
941 the return values are documented in the matching man page for that
942 type, or in the opcodes section above for io_uring-specific opcodes.
943
944 RETURN VALUE
945 io_uring_enter(2) returns the number of I/Os successfully consumed.
946 This can be zero if to_submit was zero or if the submission queue was
947 empty. Note that if the ring was created with IORING_SETUP_SQPOLL spec‐
948 ified, then the return value will generally be the same as to_submit as
949 submission happens outside the context of the system call.
950
951 The errors related to a submission queue entry will be returned through
952 a completion queue entry (see section CQE ERRORS), rather than through
953 the system call itself.
954
955 Errors that occur not on behalf of a submission queue entry are re‐
956 turned via the system call directly. On such an error, a negative error
957 code is returned. The caller should not rely on the errno variable.
958
959 ERRORS
960 These are the errors returned by the io_uring_enter(2) system call.
961
962 EAGAIN The kernel was unable to allocate memory for the request, or
963 otherwise ran out of resources to handle it. The application
964 should wait for some completions and try again.
965
966 EBADF fd is not a valid file descriptor.
967
968 EBADFD fd is a valid file descriptor, but the io_uring ring is not in
969 the right state (enabled). See io_uring_register(2) for details
970 on how to enable the ring.
971
972 EBADR At least one CQE was dropped even with the IORING_FEAT_NODROP
973 feature, and there are no otherwise available CQEs. This clears
974 the error state and so with no other changes the next call to
975 io_uring_enter(2) will not have this error. This error should be
976 extremely rare and indicates the machine is running critically
977 low on memory. It may be reasonable for the application to
978 terminate unless it is able to safely handle any CQE being
979 lost.
980
981 EBUSY If the IORING_FEAT_NODROP feature flag is set, then EBUSY will
982 be returned if there were overflow entries, the
983 IORING_ENTER_GETEVENTS flag is set, and not all of the overflow
984 entries could be flushed to the CQ ring.
985
986 Without IORING_FEAT_NODROP, EBUSY means the application is
987 attempting to overcommit the number of requests it can have
988 pending. The application should wait for some completions and
989 try again. This may occur if the application tries to queue more
990 requests than there is room for in the CQ ring, or if the
991 application attempts to wait for more events without having
992 reaped the ones already present in the CQ ring.
993
994 EINVAL Some bits in the flags argument are invalid.
995
996 EFAULT An invalid user space address was specified for the sig argu‐
997 ment.
998
999 ENXIO The io_uring instance is in the process of being torn down.
1000
1001 EOPNOTSUPP
1002 fd does not refer to an io_uring instance.
1003
1004 EINTR The operation was interrupted by delivery of a signal before
1005 it could complete; see signal(7). This can happen while waiting
1006 for events with IORING_ENTER_GETEVENTS.
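One way an application might act on these return values is sketched
below. The enum, the function name, and the retry policy are
assumptions made for illustration. The text above only states that
waiting for completions and retrying is appropriate for EAGAIN and
EBUSY. The sketch expects the negative error code convention described
in the RETURN VALUE section.

```c
#include <errno.h>

/* Hypothetical set of follow-up actions for the caller. */
enum enter_action {
    ENTER_OK,               /* ret is the number of I/Os consumed */
    ENTER_RETRY,            /* interrupted, simply call again */
    ENTER_REAP_THEN_RETRY,  /* process completions, then call again */
    ENTER_FATAL             /* not recoverable by retrying */
};

/* Map a return value (>= 0 or a negative error code) to an action. */
static enum enter_action classify_enter_ret(int ret)
{
    if (ret >= 0)
        return ENTER_OK;

    switch (-ret) {
    case EINTR:
        return ENTER_RETRY;
    case EAGAIN:
    case EBUSY:
        return ENTER_REAP_THEN_RETRY;
    default:
        /* EBADF, EBADFD, EINVAL, EFAULT, ENXIO, EOPNOTSUPP, ... */
        return ENTER_FATAL;
    }
}
```

EBADR is left in the fatal group here because it signals that a CQE
was lost. An application that can tolerate lost completions could
choose to handle it differently.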
1007
1008
1009 CQE ERRORS
1010 These io_uring-specific errors are returned as a negative value in the
1011 res field of the completion queue entry.
1012
1013 EACCES The flags field or opcode in a submission queue entry is not al‐
1014 lowed due to registered restrictions. See io_uring_register(2)
1015 for details on how restrictions work.
1016
1017 EBADF The fd field in the submission queue entry is invalid, or the
1018 IOSQE_FIXED_FILE flag was set in the submission queue entry, but
1019 no files were registered with the io_uring instance.
1020
1021 EFAULT The buffer is outside of the process's accessible address space.
1022
1023 EFAULT IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was specified in
1024 the opcode field of the submission queue entry, but either buf‐
1025 fers were not registered for this io_uring instance, or the ad‐
1026 dress range described by addr and len does not fit within the
1027 buffer registered at buf_index.
1028
1029 EINVAL The flags field or opcode in a submission queue entry is in‐
1030 valid.
1031
1032 EINVAL The buf_index member of the submission queue entry is invalid.
1033
1034 EINVAL The personality field in a submission queue entry is invalid.
1035
1036 EINVAL IORING_OP_NOP was specified in the submission queue entry, but
1037 the io_uring context was set up for polling (IORING_SETUP_IOPOLL
1038 was specified in the call to io_uring_setup(2)).
1039
1040 EINVAL IORING_OP_READV or IORING_OP_WRITEV was specified in the submis‐
1041 sion queue entry, but the io_uring instance has fixed buffers
1042 registered.
1043
1044 EINVAL IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was specified in
1045 the submission queue entry, and the buf_index is invalid.
1046
1047 EINVAL IORING_OP_READV, IORING_OP_WRITEV, IORING_OP_READ_FIXED, IOR‐
1048 ING_OP_WRITE_FIXED or IORING_OP_FSYNC was specified in the sub‐
1049 mission queue entry, but the io_uring instance was configured
1050 for IOPOLLing, or any of addr, ioprio, off, len, or buf_index
1051 was set in the submission queue entry.
1052
1053 EINVAL IORING_OP_POLL_ADD or IORING_OP_POLL_REMOVE was specified in the
1054 opcode field of the submission queue entry, but the io_uring in‐
1055 stance was configured for busy-wait polling (IOR‐
1056 ING_SETUP_IOPOLL), or any of ioprio, off, len, or buf_index was
1057 non-zero in the submission queue entry.
1058
1059 EINVAL IORING_OP_POLL_ADD was specified in the opcode field of the sub‐
1060 mission queue entry, and the addr field was non-zero.
1061
1062 EOPNOTSUPP
1063 opcode is valid, but not supported by this kernel.
1064
1065 EOPNOTSUPP
1066 IOSQE_BUFFER_SELECT was set in the flags field of the submission
1067 queue entry, but the opcode doesn't support buffer selection.
1068
1069
1070
1071Linux 2019-01-22 io_uring_enter(2)