IO_URING_ENTER(2)          Linux Programmer's Manual          IO_URING_ENTER(2)

NAME
io_uring_enter - initiate and/or complete asynchronous I/O

SYNOPSIS
    #include <linux/io_uring.h>

    int io_uring_enter(unsigned int fd, unsigned int to_submit,
                       unsigned int min_complete, unsigned int flags,
                       sigset_t *sig);

DESCRIPTION
io_uring_enter() is used to initiate and complete I/O using the shared
submission and completion queues set up by a call to io_uring_setup(2).
A single call can both submit new I/O and wait for completions of I/O
initiated by this call or previous calls to io_uring_enter().

fd is the file descriptor returned by io_uring_setup(2). to_submit
specifies the number of I/Os to submit from the submission queue. If
the IORING_ENTER_GETEVENTS bit is set in flags, then the system call
will attempt to wait for min_complete event completions before
returning. If the io_uring instance was configured for polling, by
specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(2), then
min_complete has a slightly different meaning. Passing a value of 0
instructs the kernel to return any events which are already complete,
without blocking. If min_complete is a non-zero value, the kernel will
still return immediately if any completion events are available. If no
event completions are available, then the call will poll either until
one or more completions become available, or until the process has
exceeded its scheduler time slice.
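The C library may not provide a wrapper for this call, so the following is a minimal sketch of invoking it through syscall(2). The helper names sys_io_uring_enter() and wait_for_completions() are hypothetical. The sketch also assumes that the raw system call takes the size of the signal set as a sixth argument (passed here as _NSIG / 8), which the prototype in this page does not show.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for the syscall(2) declaration */
#endif
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Hypothetical thin wrapper around the raw system call. */
int sys_io_uring_enter(unsigned int fd, unsigned int to_submit,
                       unsigned int min_complete, unsigned int flags,
                       sigset_t *sig)
{
    /* Assumption: the kernel expects the signal set size in bytes last. */
    return (int)syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
                        flags, sig, _NSIG / 8);
}

/* Submit nothing and wait for at least min_complete completions. */
int wait_for_completions(unsigned int ring_fd, unsigned int min_complete)
{
    return sys_io_uring_enter(ring_fd, 0, min_complete,
                              IORING_ENTER_GETEVENTS, NULL);
}
```

On failure both helpers return -1 with errno set, following the usual syscall(2) convention.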

Note that, for interrupt-driven I/O (where IORING_SETUP_IOPOLL was not
specified in the call to io_uring_setup(2)), an application may check
the completion queue for event completions without entering the kernel
at all.

When the system call returns that a certain number of SQEs have been
consumed and submitted, it is safe to reuse those SQE entries in the
ring. This is true even if the actual I/O submission had to be punted
to async context, which means that the SQE may in fact not have been
submitted yet. If the kernel requires later use of a particular SQE
entry, it will have made a private copy of it.

sig is a pointer to a signal mask (see sigprocmask(2)); if sig is not
NULL, io_uring_enter() first replaces the current signal mask by the
one pointed to by sig, then waits for events to become available in the
completion queue, and then restores the original signal mask. The
following io_uring_enter() call:

    ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, &sig);

is equivalent to atomically executing the following calls:

    pthread_sigmask(SIG_SETMASK, &sig, &orig);
    ret = io_uring_enter(fd, 0, 1, IORING_ENTER_GETEVENTS, NULL);
    pthread_sigmask(SIG_SETMASK, &orig, NULL);

See the description of pselect(2) for an explanation of why the sig
parameter is necessary.
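One way to prepare the mask passed in sig, following the pselect(2) pattern, is sketched below. The helper name block_sigint_except_in_wait() and the choice of SIGINT are assumptions made for illustration only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>

/*
 * Hypothetical helper: block SIGINT for normal execution and compute the
 * mask to hand to io_uring_enter(), so that SIGINT can be delivered only
 * while the call is waiting. Returns 0 on success, -1 on error.
 */
int block_sigint_except_in_wait(sigset_t *wait_mask)
{
    sigset_t block;

    if (sigemptyset(&block) == -1 || sigaddset(&block, SIGINT) == -1)
        return -1;
    /* Block SIGINT now; wait_mask receives the previous mask. */
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) == -1)
        return -1;
    /* Make sure SIGINT is not blocked in the mask used during the wait. */
    return sigdelset(wait_mask, SIGINT);
}
```

The resulting wait_mask would then be passed as the sig argument together with IORING_ENTER_GETEVENTS.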

Submission queue entries are represented using the following data
structure:

    /*
     * IO submission data structure (Submission Queue Entry)
     */
    struct io_uring_sqe {
        __u8    opcode;         /* type of operation for this sqe */
        __u8    flags;          /* IOSQE_ flags */
        __u16   ioprio;         /* ioprio for the request */
        __s32   fd;             /* file descriptor to do IO on */
        union {
            __u64   off;        /* offset into file */
            __u64   addr2;
        };
        __u64   addr;           /* pointer to buffer or iovecs */
        __u32   len;            /* buffer size or number of iovecs */
        union {
            __kernel_rwf_t  rw_flags;
            __u32           fsync_flags;
            __u16           poll_events;
            __u32           sync_range_flags;
            __u32           msg_flags;
            __u32           timeout_flags;
            __u32           accept_flags;
            __u32           cancel_flags;
            __u32           open_flags;
            __u32           statx_flags;
            __u32           fadvise_advice;
        };
        __u64   user_data;      /* data to be passed back at completion time */
        union {
            struct {
                /* index into fixed buffers, if used */
                __u16   buf_index;
                /* personality to use, if used */
                __u16   personality;
            };
            __u64   __pad2[3];
        };
    };
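As an illustration of how the fields of this structure are filled in, the following sketch prepares an entry for a vectored read. The helper name prep_readv() is hypothetical. Clearing the whole entry first is a defensive choice, since stray values in unused fields can be rejected with EINVAL.

```c
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

/* Hypothetical helper: fill *sqe for an IORING_OP_READV request. */
void prep_readv(struct io_uring_sqe *sqe, int fd, const struct iovec *iov,
                unsigned int nr_iov, __u64 offset, __u64 user_data)
{
    memset(sqe, 0, sizeof(*sqe));       /* clear every unused field */
    sqe->opcode = IORING_OP_READV;
    sqe->fd = fd;                       /* file to read from */
    sqe->off = offset;                  /* offset into the file */
    sqe->addr = (__u64)(uintptr_t)iov;  /* pointer to the iovec array */
    sqe->len = nr_iov;                  /* number of iovec entries */
    sqe->user_data = user_data;         /* echoed back in the completion */
}
```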

The opcode describes the operation to be performed. It can be one of:

IORING_OP_NOP
    Do not perform any I/O. This is useful for testing the performance
    of the io_uring implementation itself.

IORING_OP_READV
IORING_OP_WRITEV
    Vectored read and write operations, similar to preadv2(2) and
    pwritev2(2).

IORING_OP_READ_FIXED
IORING_OP_WRITE_FIXED
    Read from or write to pre-mapped buffers. See io_uring_register(2)
    for details on how to set up a context for fixed reads and writes.

IORING_OP_FSYNC
    File sync. See also fsync(2). Note that, while I/O is initiated in
    the order in which it appears in the submission queue, completions
    are unordered. For example, an application which places a write I/O
    followed by an fsync in the submission queue cannot expect the fsync
    to apply to the write. The two operations execute in parallel, so
    the fsync may complete before the write is issued to the storage.
    The same is also true for previously issued writes that have not
    completed prior to the fsync.

IORING_OP_POLL_ADD
    Poll the fd specified in the submission queue entry for the events
    specified in the poll_events field. Unlike poll or epoll without
    EPOLLONESHOT, this interface always works in one-shot mode. That is,
    once the poll operation is completed, it will have to be
    resubmitted.

IORING_OP_POLL_REMOVE
    Remove an existing poll request. If found, the res field of the
    struct io_uring_cqe will contain 0. If not found, res will contain
    -ENOENT.

IORING_OP_SYNC_FILE_RANGE
    Issue the equivalent of a sync_file_range(2) on the file descriptor.
    The fd field is the file descriptor to sync, the off field holds the
    offset in bytes, the len field holds the length in bytes, and the
    sync_range_flags field holds the flags for the command. See also
    sync_file_range(2) for the general description of the related system
    call. Available since 5.2.

IORING_OP_SENDMSG
    Issue the equivalent of a sendmsg(2) system call. fd must be set to
    the socket file descriptor, addr must contain a pointer to the
    msghdr structure, and msg_flags holds the flags associated with the
    system call. See also sendmsg(2) for the general description of the
    related system call. Available since 5.3.

IORING_OP_RECVMSG
    Works just like IORING_OP_SENDMSG, except for recvmsg(2) instead.
    See the description of IORING_OP_SENDMSG. Available since 5.3.

IORING_OP_SEND
    Issue the equivalent of a send(2) system call. fd must be set to the
    socket file descriptor, addr must contain a pointer to the buffer,
    len denotes the length of the buffer to send, and msg_flags holds
    the flags associated with the system call. See also send(2) for the
    general description of the related system call. Available since 5.6.

IORING_OP_RECV
    Works just like IORING_OP_SEND, except for recv(2) instead. See the
    description of IORING_OP_SEND. Available since 5.6.

IORING_OP_TIMEOUT
    This command will register a timeout operation. The addr field must
    contain a pointer to a struct timespec64 structure, len must contain
    1 to signify one timespec64 structure, and timeout_flags may contain
    IORING_TIMEOUT_ABS for an absolute timeout value, or 0 for a
    relative timeout. off may contain a completion event count. If not
    set, this defaults to 1. A timeout will trigger a wakeup event on
    the completion ring for anyone waiting for events. A timeout
    condition is met when either the specified timeout expires, or the
    specified number of events have completed. Either condition will
    trigger the event. io_uring timeouts use the CLOCK_MONOTONIC clock
    source. The request will complete with -ETIME if the timeout got
    completed through expiration of the timer, or 0 if the timeout got
    completed through requests completing on their own. If the timeout
    was cancelled before it expired, the request will complete with
    -ECANCELED. Available since 5.4.

IORING_OP_TIMEOUT_REMOVE
    Attempt to remove an existing timeout operation. addr must contain
    the user_data field of the previously issued timeout operation. If
    the specified timeout request is found and cancelled successfully,
    this request will terminate with a result value of 0. If the timeout
    request was found but expiration was already in progress, this
    request will terminate with a result value of -EBUSY. If the timeout
    request wasn't found, the request will terminate with a result value
    of -ENOENT. Available since 5.5.

IORING_OP_ACCEPT
    Issue the equivalent of an accept4(2) system call. fd must be set to
    the socket file descriptor, addr must contain the pointer to the
    sockaddr structure, and addr2 must contain a pointer to the
    socklen_t addrlen field. See also accept4(2) for the general
    description of the related system call. Available since 5.5.

IORING_OP_ASYNC_CANCEL
    Attempt to cancel an already issued request. addr must contain the
    user_data field of the request that should be cancelled. The
    cancellation request will complete with one of the following result
    codes. If found, the res field of the cqe will contain 0. If not
    found, res will contain -ENOENT. If found and cancellation was
    attempted, the res field will contain -EALREADY. In this case, the
    request may or may not terminate. In general, requests that are
    interruptible (like socket I/O) will get cancelled, while disk I/O
    requests cannot be cancelled if already started. Available since
    5.5.

IORING_OP_LINK_TIMEOUT
    This request must be linked with another request through
    IOSQE_IO_LINK, which is described below. Unlike IORING_OP_TIMEOUT,
    IORING_OP_LINK_TIMEOUT acts on the linked request, not the
    completion queue. The format of the command is otherwise like
    IORING_OP_TIMEOUT, except there's no completion event count as it's
    tied to a specific request. If used, the timeout specified in the
    command will cancel the linked command, unless the linked command
    completes before the timeout. The timeout will complete with -ETIME
    if the timer expired and cancellation of the linked request was
    attempted, or -ECANCELED if the timer got cancelled because of
    completion of the linked request. Like IORING_OP_TIMEOUT, the clock
    source used is CLOCK_MONOTONIC. Available since 5.5.

IORING_OP_CONNECT
    Issue the equivalent of a connect(2) system call. fd must be set to
    the socket file descriptor, addr must contain the pointer to the
    sockaddr structure, and off must contain the socklen_t addrlen
    field. See also connect(2) for the general description of the
    related system call. Available since 5.5.

IORING_OP_FALLOCATE
    Issue the equivalent of a fallocate(2) system call. fd must be set
    to the file descriptor, off must contain the offset on which to
    operate, and len must contain the length. See also fallocate(2) for
    the general description of the related system call. Available since
    5.6.

IORING_OP_FADVISE
    Issue the equivalent of a posix_fadvise(2) system call. fd must be
    set to the file descriptor, off must contain the offset on which to
    operate, len must contain the length, and fadvise_advice must
    contain the advice associated with the operation. See also
    posix_fadvise(2) for the general description of the related system
    call. Available since 5.6.

IORING_OP_MADVISE
    Issue the equivalent of a madvise(2) system call. addr must contain
    the address to operate on, len must contain the length on which to
    operate, and fadvise_advice must contain the advice associated with
    the operation. See also madvise(2) for the general description of
    the related system call. Available since 5.6.

IORING_OP_OPENAT
    Issue the equivalent of an openat(2) system call. fd is the dirfd
    argument, addr must contain a pointer to the pathname argument,
    open_flags should contain any flags passed in, and mode is the
    access mode of the file. See also openat(2) for the general
    description of the related system call. Available since 5.6.

IORING_OP_CLOSE
    Issue the equivalent of a close(2) system call. fd is the file
    descriptor to be closed. See also close(2) for the general
    description of the related system call. Available since 5.6.

IORING_OP_STATX
    Issue the equivalent of a statx(2) system call. fd is the dirfd
    argument, addr must contain a pointer to the pathname string,
    statx_flags is the flags argument, len should be the mask argument,
    and off must contain a pointer to the statxbuf to be filled in. See
    also statx(2) for the general description of the related system
    call. Available since 5.6.

IORING_OP_READ
IORING_OP_WRITE
    Issue the equivalent of a read(2) or write(2) system call. fd is the
    file descriptor to be operated on, addr contains the buffer in
    question, and len contains the length of the I/O operation. These
    are non-vectored versions of the IORING_OP_READV and
    IORING_OP_WRITEV opcodes. See also read(2) and write(2) for the
    general description of the related system calls. Available since
    5.6.

IORING_OP_FILES_UPDATE
    This command is an alternative to IORING_REGISTER_FILES_UPDATE that
    works asynchronously, like the rest of the io_uring commands. The
    arguments passed in are the same. addr must contain a pointer to the
    array of file descriptors, len must contain the length of the array,
    and off must contain the offset at which to operate. Note that the
    array of file descriptors pointed to in addr must remain valid until
    this operation has completed. Available since 5.6.
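As an illustration of the IORING_OP_TIMEOUT and IORING_OP_TIMEOUT_REMOVE descriptions in the list above, the sketch below fills in the two entries. The helper names are hypothetical. It assumes that user space passes the time value as a struct __kernel_timespec from <linux/time_types.h>, and it sets fd to -1 because no file is involved; both points are assumptions rather than statements from this page.

```c
#include <stdint.h>
#include <string.h>
#include <linux/time_types.h>
#include <linux/io_uring.h>

/* Hypothetical helper: timeout that fires after *ts or count completions. */
void prep_timeout(struct io_uring_sqe *sqe, const struct __kernel_timespec *ts,
                  unsigned int count, unsigned int timeout_flags,
                  __u64 user_data)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_TIMEOUT;
    sqe->fd = -1;                       /* assumption: no file involved */
    sqe->addr = (__u64)(uintptr_t)ts;   /* pointer to the time value */
    sqe->len = 1;                       /* exactly one time structure */
    sqe->off = count;                   /* completion event count */
    sqe->timeout_flags = timeout_flags; /* 0 or IORING_TIMEOUT_ABS */
    sqe->user_data = user_data;
}

/* Hypothetical helper: remove the timeout tagged with target_user_data. */
void prep_timeout_remove(struct io_uring_sqe *sqe, __u64 target_user_data,
                         __u64 user_data)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_TIMEOUT_REMOVE;
    sqe->fd = -1;
    sqe->addr = target_user_data;       /* user_data of the timeout */
    sqe->user_data = user_data;
}
```

The completion of the timeout itself would then carry -ETIME, 0, or -ECANCELED in res, as described for IORING_OP_TIMEOUT.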

The flags field is a bit mask. The supported flags are:

IOSQE_FIXED_FILE
    When this flag is specified, fd is an index into the files array
    registered with the io_uring instance (see the IORING_REGISTER_FILES
    section of the io_uring_register(2) man page). Available since 5.1.

IOSQE_IO_DRAIN
    When this flag is specified, the SQE will not be started before
    previously submitted SQEs have completed, and new SQEs will not be
    started before this one completes. Available since 5.2.

IOSQE_IO_LINK
    When this flag is specified, it forms a link with the next SQE in
    the submission ring. That next SQE will not be started before this
    one completes. This, in effect, forms a chain of SQEs, which can be
    arbitrarily long. The tail of the chain is denoted by the first SQE
    that does not have this flag set. This flag has no effect on
    previous SQE submissions, nor does it impact SQEs that are outside
    of the chain tail. This means that multiple chains can be executing
    in parallel, or chains and individual SQEs. Only members inside the
    chain are serialized. A chain of SQEs will be broken if any request
    in that chain ends in error. io_uring considers any unexpected
    result an error. This means that, for example, a short read will
    also terminate the remainder of the chain. If a chain of SQE links
    is broken, the remaining unstarted part of the chain will be
    terminated and completed with -ECANCELED as the error code.
    Available since 5.3.

IOSQE_IO_HARDLINK
    Like IOSQE_IO_LINK, but the link is not severed regardless of the
    completion result. Note that the link will still be severed if
    submitting the parent request fails; hard links are only resilient
    in the presence of completion results for requests that did submit
    correctly. IOSQE_IO_HARDLINK implies IOSQE_IO_LINK. Available since
    5.5.

IOSQE_ASYNC
    Normal operation for io_uring is to try to issue an sqe as
    non-blocking first, and if that fails, execute it in an async
    manner. To support more efficient overlapped operation of requests
    that the application knows or assumes will always (or most of the
    time) block, the application can ask for an sqe to be issued async
    from the start. Available since 5.6.
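The IOSQE_IO_LINK flag described above can be used to order a write before a sync, which plain submission order does not guarantee. The following is a minimal sketch under the assumption that the two entries are queued consecutively in the submission ring; the helper name and the user_data tags 1 and 2 are arbitrary choices for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

/* Hypothetical helper: a write followed by a data sync, as one chain. */
void prep_linked_write_fsync(struct io_uring_sqe *write_sqe,
                             struct io_uring_sqe *fsync_sqe, int fd,
                             const struct iovec *iov, unsigned int nr_iov,
                             __u64 offset)
{
    memset(write_sqe, 0, sizeof(*write_sqe));
    write_sqe->opcode = IORING_OP_WRITEV;
    write_sqe->flags = IOSQE_IO_LINK;   /* next SQE waits for this one */
    write_sqe->fd = fd;
    write_sqe->off = offset;
    write_sqe->addr = (__u64)(uintptr_t)iov;
    write_sqe->len = nr_iov;
    write_sqe->user_data = 1;           /* arbitrary tag for this sketch */

    memset(fsync_sqe, 0, sizeof(*fsync_sqe));
    fsync_sqe->opcode = IORING_OP_FSYNC;
    fsync_sqe->fd = fd;
    fsync_sqe->fsync_flags = IORING_FSYNC_DATASYNC;
    fsync_sqe->user_data = 2;           /* no link flag: ends the chain */
}
```

If the write ends in an error or a short write, the sync would be expected to complete with -ECANCELED, per the chain-breaking rule above.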

ioprio specifies the I/O priority. See ioprio_get(2) for a description
of Linux I/O priorities.

fd specifies the file descriptor against which the operation will be
performed, with the exception noted above.

If the operation is one of IORING_OP_READ_FIXED or
IORING_OP_WRITE_FIXED, addr and len must fall within the buffer located
at buf_index in the fixed buffer array. If the operation is either
IORING_OP_READV or IORING_OP_WRITEV, then addr points to an iovec array
of len entries.
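The range requirement for fixed buffers can be checked in the application before submitting. The sketch below is one possible check; the helper name fits_fixed_buffer() is hypothetical, and registered stands for the iovec that the application registered at buf_index.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/uio.h>

/* Returns true if [addr, addr + len) lies inside the registered buffer. */
bool fits_fixed_buffer(const struct iovec *registered, uint64_t addr,
                       uint32_t len)
{
    uint64_t base = (uint64_t)(uintptr_t)registered->iov_base;
    uint64_t size = registered->iov_len;

    if (addr < base)
        return false;
    /* Compare against the remaining space to avoid overflow in addr + len. */
    return addr - base <= size && len <= size - (addr - base);
}
```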

rw_flags, specified for read and write operations, contains a bitwise
OR of per-I/O flags, as described in the preadv2(2) man page.

The fsync_flags bit mask may contain either 0, for a normal file
integrity sync, or IORING_FSYNC_DATASYNC to provide data sync only
semantics. See the descriptions of O_SYNC and O_DSYNC in the open(2)
manual page for more information.

The bits that may be set in poll_events are defined in <poll.h>, and
documented in poll(2).

user_data is an application-supplied value that will be copied into the
completion queue entry (see below). buf_index is an index into an array
of fixed buffers, and is only valid if fixed buffers were registered.
personality is the credentials id to use for this operation. See
io_uring_register(2) for how to register personalities with io_uring.
If set to 0, the current personality of the submitting task is used.

Once the submission queue entry is initialized, I/O is submitted by
placing the index of the submission queue entry into the tail of the
submission queue. After one or more indexes are added to the queue, and
the queue tail is advanced, the io_uring_enter(2) system call can be
invoked to initiate the I/O.
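The publishing step described above can be sketched as follows. The struct sq_view type and the sq_publish() helper are hypothetical; they assume the application has already mapped the submission ring and located its head, tail, mask, and index array as described in io_uring_setup(2). The __atomic builtins are GCC/Clang extensions, used here so that the index is written before the new tail becomes visible.

```c
/* Hypothetical view of the mmap()ed submission ring. */
struct sq_view {
    unsigned int *head;       /* advanced by the kernel */
    unsigned int *tail;       /* advanced by the application */
    unsigned int *ring_mask;  /* number of entries minus one */
    unsigned int *array;      /* indexes into the SQE array */
};

/* Queue SQE number sqe_index; returns 0, or -1 if the ring is full. */
int sq_publish(struct sq_view *sq, unsigned int sqe_index)
{
    unsigned int head = __atomic_load_n(sq->head, __ATOMIC_ACQUIRE);
    unsigned int tail = *sq->tail;      /* only the application writes it */
    unsigned int entries = *sq->ring_mask + 1;

    if (tail - head == entries)
        return -1;                      /* no free slot */

    sq->array[tail & *sq->ring_mask] = sqe_index;
    /* Release store: the index must be visible before the new tail. */
    __atomic_store_n(sq->tail, tail + 1, __ATOMIC_RELEASE);
    return 0;
}
```

After one or more successful sq_publish() calls, the application would call io_uring_enter() with to_submit set to the number of entries queued.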

Completions use the following data structure:

    /*
     * IO completion data structure (Completion Queue Entry)
     */
    struct io_uring_cqe {
        __u64   user_data;  /* sqe->user_data submission passed back */
        __s32   res;        /* result code for this event */
        __u32   flags;
    };

user_data is copied from the field of the same name in the submission
queue entry. The primary use case is to store data that the application
will need to access upon completion of this particular I/O. The flags
field is reserved for future use. res is the operation-specific result.

For read and write opcodes, the return values match those documented in
the preadv2(2) and pwritev2(2) man pages. Return codes for the
io_uring-specific opcodes are documented in the description of the
opcodes above.
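Consuming completion queue entries can be sketched as below. The struct cq_view type, cq_reap(), and the sample handler count_failed() are hypothetical; they assume the completion ring has been mapped as described in io_uring_setup(2). A negative res is treated as a negated errno value, matching the result codes listed for the opcodes.

```c
#include <linux/io_uring.h>

/* Hypothetical view of the mmap()ed completion ring. */
struct cq_view {
    unsigned int *head;         /* advanced by the application */
    unsigned int *tail;         /* advanced by the kernel */
    unsigned int *ring_mask;    /* number of entries minus one */
    struct io_uring_cqe *cqes;  /* the completion entries */
};

typedef void (*cqe_handler)(const struct io_uring_cqe *cqe, void *ctx);

/* Sample handler: counts completions that carry an error in res. */
void count_failed(const struct io_uring_cqe *cqe, void *ctx)
{
    if (cqe->res < 0)
        ++*(int *)ctx;
}

/* Hand every available CQE to fn; returns how many were consumed. */
unsigned int cq_reap(struct cq_view *cq, cqe_handler fn, void *ctx)
{
    unsigned int head = *cq->head;
    /* Acquire load: entries up to tail are fully written. */
    unsigned int tail = __atomic_load_n(cq->tail, __ATOMIC_ACQUIRE);
    unsigned int seen = 0;

    while (head != tail) {
        fn(&cq->cqes[head & *cq->ring_mask], ctx);
        head++;
        seen++;
    }
    /* Release store: tell the kernel the slots can be reused. */
    __atomic_store_n(cq->head, head, __ATOMIC_RELEASE);
    return seen;
}
```

For interrupt-driven I/O, a loop like this can run without entering the kernel at all.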

RETURN VALUE
io_uring_enter() returns the number of I/Os successfully consumed. This
can be zero if to_submit was zero or if the submission queue was empty.
The errors below that refer to an error in a submission queue entry
will be returned through a completion queue entry, rather than through
the system call itself.

Errors that occur not on behalf of a submission queue entry are
returned via the system call directly. On such an error, -1 is returned
and errno is set appropriately.

ERRORS
EAGAIN The kernel was unable to allocate memory for the request, or
       otherwise ran out of resources to handle it. The application
       should wait for some completions and try again.

EBUSY  The application is attempting to overcommit the number of
       requests it can have pending. The application should wait for
       some completions and try again. May occur if the application
       tries to queue more requests than there is room for in the CQ
       ring.

EBADF  The fd field in the submission queue entry is invalid, or the
       IOSQE_FIXED_FILE flag was set in the submission queue entry, but
       no files were registered with the io_uring instance.

EFAULT A buffer is outside of the process's accessible address space.

EFAULT IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was specified in
       the opcode field of the submission queue entry, but either
       buffers were not registered for this io_uring instance, or the
       address range described by addr and len does not fit within the
       buffer registered at buf_index.

EINVAL The index member of the submission queue entry is invalid.

EINVAL The flags field or opcode in a submission queue entry is
       invalid.

EINVAL IORING_OP_NOP was specified in the submission queue entry, but
       the io_uring context was set up for polling (IORING_SETUP_IOPOLL
       was specified in the call to io_uring_setup).

EINVAL IORING_OP_READV or IORING_OP_WRITEV was specified in the
       submission queue entry, but the io_uring instance has fixed
       buffers registered.

EINVAL IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED was specified in
       the submission queue entry, and the buf_index is invalid.

EINVAL IORING_OP_READV, IORING_OP_WRITEV, IORING_OP_READ_FIXED,
       IORING_OP_WRITE_FIXED or IORING_OP_FSYNC was specified in the
       submission queue entry, but the io_uring instance was configured
       for IOPOLLing, or any of addr, ioprio, off, len, or buf_index
       was set in the submission queue entry.

EINVAL IORING_OP_POLL_ADD or IORING_OP_POLL_REMOVE was specified in the
       opcode field of the submission queue entry, but the io_uring
       instance was configured for busy-wait polling
       (IORING_SETUP_IOPOLL), or any of ioprio, off, len, or buf_index
       was non-zero in the submission queue entry.

EINVAL IORING_OP_POLL_ADD was specified in the opcode field of the
       submission queue entry, and the addr field was non-zero.

ENXIO  The io_uring instance is in the process of being torn down.

EOPNOTSUPP
       fd does not refer to an io_uring instance.

EOPNOTSUPP
       opcode is valid, but not supported by this kernel.
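Of the errors above, EAGAIN and EBUSY are the two for which the text recommends waiting for completions and trying again. A caller might encode that distinction as in the following sketch; the helper name enter_should_retry() is hypothetical, and treating every other error as final is a simplifying assumption.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true if a failed io_uring_enter() call is worth
 * repeating after some completions have been reaped.
 */
bool enter_should_retry(int err)
{
    return err == EAGAIN || err == EBUSY;
}
```

It would be called with the value of errno after io_uring_enter() returns -1.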

Linux                         2019-01-22                 IO_URING_ENTER(2)