io_uring(7)               Linux Programmer's Manual              io_uring(7)

NAME
       io_uring - Asynchronous I/O facility

SYNOPSIS
       #include <linux/io_uring.h>

DESCRIPTION
       io_uring is a Linux-specific API for asynchronous I/O. It allows the user
       to submit one or more I/O requests, which are processed asynchronously
       without blocking the calling process. io_uring gets its name from ring
       buffers which are shared between user space and kernel space. This
       arrangement allows for efficient I/O, while avoiding the overhead of
       copying buffers between them, where possible. This makes io_uring
       different from other UNIX I/O APIs: rather than communicating between
       kernel and user space only through system calls, ring buffers are used as
       the main mode of communication. This arrangement has various performance
       benefits which are discussed in a separate section below. This man page
       uses the terms shared buffers, shared ring buffers and queues
       interchangeably.

       The general programming model you need to follow for io_uring is outlined
       below:

       •  Set up shared buffers with io_uring_setup(2) and mmap(2), mapping into
          user space shared buffers for the submission queue (SQ) and the
          completion queue (CQ). You place I/O requests you want to make on the
          SQ, while the kernel places the results of those operations on the CQ.

       •  For every I/O request you need to make (such as reading a file,
          writing a file, accepting a socket connection, etc.), you create a
          submission queue entry, or SQE, describe the I/O operation you need to
          get done and add it to the tail of the submission queue (SQ). Each I/O
          operation is, in essence, the equivalent of a system call you would
          have made otherwise, had you not been using io_uring. You can add more
          than one SQE to the queue depending on the number of operations you
          want to request.

       •  After you add one or more SQEs, you need to call io_uring_enter(2) to
          tell the kernel to dequeue your I/O requests off the SQ and begin
          processing them.

       •  For each SQE you submit, once the kernel is done processing the
          request, it places a completion queue event or CQE at the tail of the
          completion queue or CQ. The kernel places exactly one matching CQE in
          the CQ for every SQE you submit on the SQ. After you retrieve a CQE,
          minimally, you might be interested in checking the res field of the
          CQE structure, which corresponds to the return value of the system
          call's equivalent, had you used it directly without io_uring. For
          instance, a read operation under io_uring, started with the
          IORING_OP_READ operation, issues the equivalent of the read(2) system
          call. In practice, it mixes the semantics of pread(2) and preadv2(2)
          in that it takes an explicit offset, and supports using -1 for the
          offset to indicate that the current file position should be used
          instead of passing in an explicit offset. See the opcode documentation
          for more details. Given that io_uring is an async interface, errno is
          never used for passing back error information. Instead, res will
          contain what the equivalent system call would have returned in case of
          success, and in case of error res will contain -errno. For example, if
          the normal read system call would have returned -1 and set errno to
          EINVAL, then res would contain -EINVAL. If the normal system call
          would have returned a read size of 1024, then res would contain 1024.
          (A sketch of how an application might inspect res is shown after this
          list.)

       •  Optionally, io_uring_enter(2) can also wait for a specified number of
          requests to be processed by the kernel before it returns. If you
          specify a number of completions to wait for, the kernel will have
          placed at least that many CQEs on the CQ, which you can then read
          right after the return from io_uring_enter(2).

       •  It is important to remember that I/O requests submitted to the kernel
          can complete in any order. It is not necessary for the kernel to
          process one request after another, in the order you placed them.
          Given that the interface is a ring, the requests are attempted in
          order; however, that doesn't imply any sort of ordering on their
          completion. When more than one request is in flight, it is not
          possible to determine which one will complete first. When you dequeue
          CQEs off the CQ, you should always check which submitted request each
          corresponds to. The most common method for doing so is utilizing the
          user_data field in the request, which is passed back on the
          completion side.

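       As an illustration of the points above, here is a minimal, hypothetical
       sketch of how an application might inspect a completed CQE. The
       handle_cqe() helper and the use of user_data as an application-chosen tag
       are assumptions made for this sketch and are not part of the io_uring API.

           #include <stdio.h>
           #include <string.h>
           #include <linux/io_uring.h>

           /* Hypothetical helper: interpret a single completion. */
           static void handle_cqe(const struct io_uring_cqe *cqe)
           {
               if (cqe->res < 0) {
                   /* Failure: res holds the negated errno value. */
                   fprintf(stderr, "request %llu failed: %s\n",
                           (unsigned long long) cqe->user_data,
                           strerror(-cqe->res));
               } else {
                   /* Success: res holds what the equivalent system call would
                      have returned, e.g. the number of bytes transferred. */
                   printf("request %llu completed: %d\n",
                          (unsigned long long) cqe->user_data, cqe->res);
               }
           }

       Note that errno itself is never set by io_uring; all error reporting
       happens through res.
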
       Adding to and reading from the queues:

       •  You add SQEs to the tail of the SQ. The kernel reads SQEs off the head
          of the queue.

       •  The kernel adds CQEs to the tail of the CQ. You read CQEs off the head
          of the queue.

   Submission queue polling
       One of the goals of io_uring is to provide a means for efficient I/O. To
       this end, io_uring supports a polling mode that lets you avoid the call
       to io_uring_enter(2), which you use to inform the kernel that you have
       queued SQEs onto the SQ. With SQ Polling, io_uring starts a kernel thread
       that polls the submission queue for any I/O requests you submit by adding
       SQEs. With SQ Polling enabled, there is no need for you to call
       io_uring_enter(2), letting you avoid the overhead of system calls. A
       designated kernel thread dequeues SQEs off the SQ as you add them and
       dispatches them for asynchronous processing.

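       As a rough sketch (not part of the example program below), SQ Polling is
       requested at setup time through the io_uring_params structure. The
       io_uring_setup() wrapper and QUEUE_DEPTH constant used here are the ones
       defined in the example program, and the idle timeout value is an
       arbitrary choice for illustration.

           struct io_uring_params p;

           memset(&p, 0, sizeof(p));
           /* Ask the kernel to create the SQ polling thread. */
           p.flags |= IORING_SETUP_SQPOLL;
           /* Let the thread busy-poll for ~2 seconds of idle time before it
              sleeps (value chosen arbitrarily for this sketch). */
           p.sq_thread_idle = 2000;

           int ring_fd = io_uring_setup(QUEUE_DEPTH, &p);

       If the polling thread does go idle, the kernel sets IORING_SQ_NEED_WAKEUP
       in the SQ ring flags field and the application must wake it with a single
       io_uring_enter(2) call carrying IORING_ENTER_SQ_WAKEUP; see
       io_uring_setup(2) and io_uring_enter(2) for the details.
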
   Setting up io_uring
       The main steps in setting up io_uring consist of mapping in the shared
       buffers with mmap(2) calls. In the example program included in this man
       page, the function app_setup_uring() sets up io_uring with a submission
       queue of depth QUEUE_DEPTH. Pay attention to the two mmap(2) calls that
       set up the shared submission and completion queues. If your kernel is
       older than version 5.4, three mmap(2) calls are required.

   Submitting I/O requests
       The process of submitting a request consists of describing the I/O
       operation you need to get done using an io_uring_sqe structure instance.
       These details describe the equivalent system call and its parameters.
       Because the range of I/O operations Linux supports is very varied and the
       io_uring_sqe structure needs to be able to describe them, it has several
       fields, some packed into unions for space efficiency. Here is a
       simplified version of struct io_uring_sqe with some of the most often
       used fields:

           struct io_uring_sqe {
                   __u8    opcode;         /* type of operation for this sqe */
                   __s32   fd;             /* file descriptor to do IO on */
                   __u64   off;            /* offset into file */
                   __u64   addr;           /* pointer to buffer or iovecs */
                   __u32   len;            /* buffer size or number of iovecs */
                   __u64   user_data;      /* data to be passed back at completion time */
                   __u8    flags;          /* IOSQE_ flags */
                   ...
           };

       Here is struct io_uring_sqe in full:

           struct io_uring_sqe {
                   __u8    opcode;         /* type of operation for this sqe */
                   __u8    flags;          /* IOSQE_ flags */
                   __u16   ioprio;         /* ioprio for the request */
                   __s32   fd;             /* file descriptor to do IO on */
                   union {
                           __u64   off;    /* offset into file */
                           __u64   addr2;
                   };
                   union {
                           __u64   addr;   /* pointer to buffer or iovecs */
                           __u64   splice_off_in;
                   };
                   __u32   len;            /* buffer size or number of iovecs */
                   union {
                           __kernel_rwf_t  rw_flags;
                           __u32           fsync_flags;
                           __u16           poll_events;    /* compatibility */
                           __u32           poll32_events;  /* word-reversed for BE */
                           __u32           sync_range_flags;
                           __u32           msg_flags;
                           __u32           timeout_flags;
                           __u32           accept_flags;
                           __u32           cancel_flags;
                           __u32           open_flags;
                           __u32           statx_flags;
                           __u32           fadvise_advice;
                           __u32           splice_flags;
                   };
                   __u64   user_data;      /* data to be passed back at completion time */
                   union {
                           struct {
                                   /* pack this to avoid bogus arm OABI complaints */
                                   union {
                                           /* index into fixed buffers, if used */
                                           __u16   buf_index;
                                           /* for grouped buffer selection */
                                           __u16   buf_group;
                                   } __attribute__((packed));
                                   /* personality to use, if used */
                                   __u16   personality;
                                   __s32   splice_fd_in;
                           };
                           __u64   __pad2[3];
                   };
           };

       To submit an I/O request to io_uring, you need to acquire a submission
       queue entry (SQE) from the submission queue (SQ), fill it up with details
       of the operation you want to submit and call io_uring_enter(2). There are
       helper functions of the form io_uring_prep_X to enable proper setup of
       the SQE. If you want to avoid calling io_uring_enter(2), you have the
       option of setting up Submission Queue Polling.

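       The io_uring_prep_X helpers are provided by liburing, a user-space
       library that wraps the raw interface documented in this page. The
       following is only a brief sketch of that higher-level route;
       io_uring_queue_init(), io_uring_get_sqe(), io_uring_prep_read(),
       io_uring_submit(), io_uring_wait_cqe() and io_uring_cqe_seen() come from
       liburing, and the read_with_liburing() function is a hypothetical
       example.

           #include <liburing.h>

           static int read_with_liburing(int fd, void *buf, unsigned len,
                                         off_t off)
           {
               struct io_uring ring;
               struct io_uring_sqe *sqe;
               struct io_uring_cqe *cqe;
               int ret;

               if (io_uring_queue_init(8, &ring, 0) < 0)
                   return -1;

               /* Grab a free SQE and describe a read, much like filling in
                  struct io_uring_sqe by hand. */
               sqe = io_uring_get_sqe(&ring);
               io_uring_prep_read(sqe, fd, buf, len, off);

               io_uring_submit(&ring);               /* calls io_uring_enter(2) */
               ret = io_uring_wait_cqe(&ring, &cqe); /* wait for the completion */
               if (ret == 0) {
                   ret = cqe->res;
                   io_uring_cqe_seen(&ring, cqe);    /* advance the CQ head */
               }
               io_uring_queue_exit(&ring);
               return ret;
           }

       The rest of this page, including the EXAMPLES section, sticks to the raw
       system call interface.
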
       SQEs are added to the tail of the submission queue. The kernel picks up
       SQEs off the head of the SQ. The general algorithm to get the next
       available SQE and update the tail is as follows.

           struct io_uring_sqe *sqe;
           unsigned tail, index;
           tail = *sqring->tail;
           index = tail & (*sqring->ring_mask);
           sqe = &sqring->sqes[index];
           /* fill up details about this I/O request */
           describe_io(sqe);
           /* fill the sqe index into the SQ ring array */
           sqring->array[index] = index;
           tail++;
           atomic_store_explicit(sqring->tail, tail, memory_order_release);

       To get the index of an entry, the application must mask the current tail
       index with the size mask of the ring. This holds true for both SQs and
       CQs. Once the SQE is acquired, the necessary fields are filled in,
       describing the request. While the CQ ring directly indexes the shared
       array of CQEs, the submission side has an indirection array between
       them. The submission side ring buffer is an index into this array, which
       in turn contains the index into the SQEs.

       The following code snippet demonstrates how a read operation, the
       equivalent of a preadv2(2) system call, is described by filling up an SQE
       with the necessary parameters.

           struct iovec iovecs[16];
           ...
           sqe->opcode = IORING_OP_READV;
           sqe->fd = fd;
           sqe->addr = (unsigned long) iovecs;
           sqe->len = 16;
           sqe->off = offset;
           sqe->flags = 0;

   Memory ordering
       Modern compilers and CPUs freely reorder reads and writes without
       affecting the program's outcome to optimize performance. Some aspects of
       this need to be kept in mind on SMP systems since io_uring involves
       buffers shared between kernel and user space. These buffers are both
       visible and modifiable from kernel and user space. As heads and tails
       belonging to these shared buffers are updated by kernel and user space,
       changes need to be coherently visible on either side, irrespective of
       whether a CPU switch took place after the kernel-user mode switch
       happened. We use memory barriers to enforce this coherency. Memory
       barriers are a large subject of their own, and a detailed discussion of
       them is beyond the scope of this man page.

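       For orientation only, the pattern relied upon throughout this page can be
       sketched with C11 atomics: the side that publishes ring entries updates
       the tail with a store-release, while the other side reads the tail with a
       load-acquire before touching the entries. The helper names below are
       hypothetical; the example program further down wraps the same calls in
       its io_uring_smp_store_release() and io_uring_smp_load_acquire() macros.

           #include <stdatomic.h>

           /* Producer side: fill in the ring entries first, then publish the
              new tail so the consumer cannot see the tail before the data. */
           static void publish_tail(unsigned *tail_ptr, unsigned new_tail)
           {
               atomic_store_explicit((_Atomic unsigned *) tail_ptr, new_tail,
                                     memory_order_release);
           }

           /* Consumer side: read the tail with a load-acquire before reading
              any entries up to that index. */
           static unsigned read_tail(const unsigned *tail_ptr)
           {
               return atomic_load_explicit((const _Atomic unsigned *) tail_ptr,
                                           memory_order_acquire);
           }
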
   Letting the kernel know about I/O submissions
       Once you place one or more SQEs onto the SQ, you need to let the kernel
       know that you've done so. You can do this by calling the io_uring_enter(2)
       system call. This system call is also capable of waiting for a specified
       count of events to complete. This way, you can be sure to find completion
       events in the completion queue without having to poll it for events
       later.

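       A minimal sketch of such a call follows, assuming ring_fd is the file
       descriptor returned by io_uring_setup(2) and that one SQE has just been
       queued. The raw syscall(2) invocation is shown because glibc has
       historically provided no wrapper; submit_and_wait() is a hypothetical
       helper, and the example program below defines a similar io_uring_enter()
       wrapper.

           #include <stdio.h>
           #include <sys/syscall.h>
           #include <unistd.h>
           #include <linux/io_uring.h>

           /* Submit 'to_submit' queued SQEs and wait for at least
              'min_complete' of them to complete before returning. */
           static int submit_and_wait(int ring_fd, unsigned to_submit,
                                      unsigned min_complete)
           {
               int ret = syscall(__NR_io_uring_enter, ring_fd, to_submit,
                                 min_complete, IORING_ENTER_GETEVENTS, NULL, 0);
               if (ret < 0)
                   perror("io_uring_enter");
               return ret;
           }
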
   Reading completion events
       Similar to the submission queue (SQ), the completion queue (CQ) is a
       shared buffer between the kernel and user space. Whereas you placed
       submission queue entries on the tail of the SQ and the kernel read off
       the head, when it comes to the CQ, the kernel places completion queue
       events or CQEs on the tail of the CQ and you read off its head.

       Submission is flexible (and thus a bit more complicated) since it needs
       to be able to encode different types of system calls that take various
       parameters. Completion, on the other hand, is simpler since we're looking
       only for a return value back from the kernel. This is easily understood
       by looking at the completion queue event structure, struct io_uring_cqe:

           struct io_uring_cqe {
                   __u64   user_data;  /* sqe->data submission passed back */
                   __s32   res;        /* result code for this event */
                   __u32   flags;
           };

       Here, user_data is custom data that is passed unchanged from submission
       to completion, that is, from SQEs to CQEs. This field can be used to set
       context, uniquely identifying submissions that got completed. Given that
       I/O requests can complete in any order, this field can be used to
       correlate a submission with a completion. res is the result of the system
       call that was performed as part of the submission, i.e. its return value.

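       A minimal sketch of how user_data can be used for that correlation is
       shown below. Storing a pointer to an application-defined request
       structure is a common convention, not something the kernel interprets;
       struct app_request and both helpers are hypothetical.

           #include <stdint.h>
           #include <linux/io_uring.h>

           /* Hypothetical per-request bookkeeping kept by the application. */
           struct app_request {
               int   fd;
               void *buf;
           };

           static void tag_sqe(struct io_uring_sqe *sqe, struct app_request *req)
           {
               /* The kernel passes this value back verbatim in the CQE. */
               sqe->user_data = (uintptr_t) req;
           }

           static struct app_request *request_of(const struct io_uring_cqe *cqe)
           {
               /* Recover the request this completion belongs to. */
               return (struct app_request *) (uintptr_t) cqe->user_data;
           }
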
       The flags field carries request-specific information. As of the 6.0
       kernel, the following flags are defined:

       IORING_CQE_F_BUFFER
          If set, the upper 16 bits of the flags field carry the buffer ID that
          was chosen for this request. The request must have been issued with
          IOSQE_BUFFER_SELECT set, and used with a request type that supports
          buffer selection. Additionally, buffers must have been provided
          upfront either via the IORING_OP_PROVIDE_BUFFERS or the
          IORING_REGISTER_PBUF_RING methods.

       IORING_CQE_F_MORE
          If set, the application should expect more completions from the
          request. This is used for requests that can generate multiple
          completions, such as multi-shot requests, receive, or accept.

       IORING_CQE_F_SOCK_NONEMPTY
          If set, the socket still had data available to read when this request
          completed.

       IORING_CQE_F_NOTIF
          Set for notification CQEs, as seen with the zero-copy networking send
          and receive support.

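       A small sketch of how these flags might be tested is shown below,
       assuming kernel headers recent enough to define them;
       IORING_CQE_BUFFER_SHIFT is the constant the header provides for the
       16-bit shift.

           #include <stdio.h>
           #include <linux/io_uring.h>

           static void inspect_cqe_flags(const struct io_uring_cqe *cqe)
           {
               if (cqe->flags & IORING_CQE_F_BUFFER) {
                   /* The upper 16 bits hold the ID of the selected buffer. */
                   unsigned buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
                   printf("used provided buffer %u\n", buf_id);
               }
               if (cqe->flags & IORING_CQE_F_MORE)
                   printf("more completions will follow for this request\n");
           }
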
       The general sequence to read completion events off the completion queue
       is as follows:

           unsigned head;
           head = *cqring->head;
           if (head != atomic_load_acquire(cqring->tail)) {
               struct io_uring_cqe *cqe;
               unsigned index;
               index = head & (cqring->mask);
               cqe = &cqring->cqes[index];
               /* process completed CQE */
               process_cqe(cqe);
               /* CQE consumption complete */
               head++;
           }
           atomic_store_explicit(cqring->head, head, memory_order_release);

       Remember that the kernel adds CQEs to the tail of the CQ, while you need
       to dequeue them off the head. To get the index of an entry at the head,
       the application must mask the current head index with the size mask of
       the ring. Once the CQE has been consumed or processed, the head needs to
       be updated to reflect the consumption of the CQE. Attention should be
       paid to the read and write barriers to ensure successful read and update
       of the head.

   io_uring performance
       Because of the shared ring buffers between kernel and user space,
       io_uring can be a zero-copy system. Copying buffers between user space
       and the kernel becomes necessary when system calls that transfer data
       between them are involved. But since the bulk of the communication in
       io_uring is via buffers shared between the kernel and user space, this
       huge performance overhead is completely avoided.

       While system calls may not seem like a significant overhead, in high
       performance applications, making a lot of them will begin to matter. The
       workarounds the operating system has in place to deal with Spectre and
       Meltdown add overhead around the system call interface, making system
       calls not as cheap as before on affected hardware. While newer hardware
       should not need these workarounds, hardware with these vulnerabilities
       can be expected to be in the wild for a long time. With both synchronous
       and other asynchronous programming interfaces under Linux, there is at
       least one system call involved in the submission of each request. In
       io_uring, on the other hand, you can batch several requests in one go,
       simply by queueing up multiple SQEs, each describing an I/O operation you
       want, and making a single call to io_uring_enter(2). This is possible due
       to io_uring's design based on shared buffers.

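       A hedged sketch of what such batching can look like with the raw
       interface is shown below. It reuses the io_uring_enter() wrapper defined
       in the example program further down, while queue_one_request() is a
       hypothetical helper standing in for the SQE-filling steps shown earlier.

           /* Provided by the example program below. */
           int io_uring_enter(int ring_fd, unsigned int to_submit,
                              unsigned int min_complete, unsigned int flags);
           /* Hypothetical: fills in the next SQE and advances the SQ tail. */
           void queue_one_request(unsigned i);

           /* Queue several requests, then submit them all with one system
              call. Passing 0 for min_complete means the call does not wait
              for any completions. */
           static int submit_batch(int ring_fd, unsigned nr)
           {
               for (unsigned i = 0; i < nr; i++)
                   queue_one_request(i);

               return io_uring_enter(ring_fd, nr, 0, 0);
           }
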
       While this batching in itself can avoid the overhead associated with
       potentially multiple and frequent system calls, you can reduce this
       overhead even further with Submission Queue Polling, by having the kernel
       poll and pick up your SQEs for processing as you add them to the
       submission queue. This avoids the io_uring_enter(2) call you need to make
       to tell the kernel to pick SQEs up. For high-performance applications,
       this means even fewer system call overheads.

CONFORMING TO
       io_uring is Linux-specific.

EXAMPLES
       The following example uses io_uring to copy stdin to stdout. Using shell
       redirection, you should be able to copy files with this example. Because
       it uses a queue depth of only one, this example processes I/O requests
       one after the other. It is purposefully kept this way to aid
       understanding. In real-world scenarios, however, you'll want to have a
       larger queue depth to parallelize I/O request processing so as to gain
       the kind of performance benefits io_uring provides with its asynchronous
       processing of requests.

       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/stat.h>
       #include <sys/ioctl.h>
       #include <sys/syscall.h>
       #include <sys/mman.h>
       #include <sys/uio.h>
       #include <linux/fs.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <string.h>
       #include <stdatomic.h>

       #include <linux/io_uring.h>

       #define QUEUE_DEPTH 1
       #define BLOCK_SZ    1024

       /* Macros for barriers needed by io_uring */
       #define io_uring_smp_store_release(p, v)                        \
           atomic_store_explicit((_Atomic typeof(*(p)) *)(p), (v),     \
                                 memory_order_release)
       #define io_uring_smp_load_acquire(p)                            \
           atomic_load_explicit((_Atomic typeof(*(p)) *)(p),           \
                                memory_order_acquire)

       int ring_fd;
       unsigned *sring_tail, *sring_mask, *sring_array,
                *cring_head, *cring_tail, *cring_mask;
       struct io_uring_sqe *sqes;
       struct io_uring_cqe *cqes;
       char buff[BLOCK_SZ];
       off_t offset;

       /*
        * System call wrappers provided since glibc does not yet
        * provide wrappers for io_uring system calls.
        * */

       int io_uring_setup(unsigned entries, struct io_uring_params *p)
       {
           return (int) syscall(__NR_io_uring_setup, entries, p);
       }

       int io_uring_enter(int ring_fd, unsigned int to_submit,
                          unsigned int min_complete, unsigned int flags)
       {
           return (int) syscall(__NR_io_uring_enter, ring_fd, to_submit,
                                min_complete, flags, NULL, 0);
       }

       int app_setup_uring(void) {
           struct io_uring_params p;
           void *sq_ptr, *cq_ptr;

           /* See io_uring_setup(2) for io_uring_params.flags you can set */
           memset(&p, 0, sizeof(p));
           ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
           if (ring_fd < 0) {
               perror("io_uring_setup");
               return 1;
           }

           /*
            * io_uring communication happens via 2 shared kernel-user space ring
            * buffers, which can be jointly mapped with a single mmap() call in
            * kernels >= 5.4.
            */

           int sring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
           int cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

           /* Rather than check for kernel version, the recommended way is to
            * check the features field of the io_uring_params structure, which is a
            * bitmask. If IORING_FEAT_SINGLE_MMAP is set, we can do away with the
            * second mmap() call to map in the completion ring separately.
            */
           if (p.features & IORING_FEAT_SINGLE_MMAP) {
               if (cring_sz > sring_sz)
                   sring_sz = cring_sz;
               cring_sz = sring_sz;
           }

           /* Map in the submission and completion queue ring buffers.
            * Kernels < 5.4 only map in the submission queue, though.
            */
           sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE,
                         ring_fd, IORING_OFF_SQ_RING);
           if (sq_ptr == MAP_FAILED) {
               perror("mmap");
               return 1;
           }

           if (p.features & IORING_FEAT_SINGLE_MMAP) {
               cq_ptr = sq_ptr;
           } else {
               /* Map in the completion queue ring buffer in older kernels separately */
               cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_POPULATE,
                             ring_fd, IORING_OFF_CQ_RING);
               if (cq_ptr == MAP_FAILED) {
                   perror("mmap");
                   return 1;
               }
           }
           /* Save useful fields for later easy reference */
           sring_tail = sq_ptr + p.sq_off.tail;
           sring_mask = sq_ptr + p.sq_off.ring_mask;
           sring_array = sq_ptr + p.sq_off.array;

           /* Map in the submission queue entries array */
           sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                       ring_fd, IORING_OFF_SQES);
           if (sqes == MAP_FAILED) {
               perror("mmap");
               return 1;
           }

           /* Save useful fields for later easy reference */
           cring_head = cq_ptr + p.cq_off.head;
           cring_tail = cq_ptr + p.cq_off.tail;
           cring_mask = cq_ptr + p.cq_off.ring_mask;
           cqes = cq_ptr + p.cq_off.cqes;

           return 0;
       }

       /*
        * Read from completion queue.
        * In this function, we read completion events from the completion queue.
        * We dequeue the CQE, update the head and return the result of the operation.
        * */

       int read_from_cq() {
           struct io_uring_cqe *cqe;
           unsigned head;

           /* Read barrier */
           head = io_uring_smp_load_acquire(cring_head);
           /*
            * Remember, this is a ring buffer. If head == tail, it means that the
            * buffer is empty.
            * */
           if (head == *cring_tail)
               return -1;

           /* Get the entry */
           cqe = &cqes[head & (*cring_mask)];
           if (cqe->res < 0)
               fprintf(stderr, "Error: %s\n", strerror(abs(cqe->res)));

           head++;

           /* Write barrier so that updates to the head are made visible */
           io_uring_smp_store_release(cring_head, head);

           return cqe->res;
       }

       /*
        * Submit a read or a write request to the submission queue.
        * */

       int submit_to_sq(int fd, int op) {
           unsigned index, tail;

           /* Add our submission queue entry to the tail of the SQE ring buffer */
           tail = *sring_tail;
           index = tail & *sring_mask;
           struct io_uring_sqe *sqe = &sqes[index];
           /* Fill in the parameters required for the read or write operation */
           sqe->opcode = op;
           sqe->fd = fd;
           sqe->addr = (unsigned long) buff;
           if (op == IORING_OP_READ) {
               memset(buff, 0, sizeof(buff));
               sqe->len = BLOCK_SZ;
           }
           else {
               sqe->len = strlen(buff);
           }
           sqe->off = offset;

           sring_array[index] = index;
           tail++;

           /* Update the tail */
           io_uring_smp_store_release(sring_tail, tail);

           /*
            * Tell the kernel we have submitted events with the io_uring_enter()
            * system call. We also pass in the IORING_ENTER_GETEVENTS flag which
            * causes the io_uring_enter() call to wait until min_complete
            * (the 3rd param) events complete.
            * */
           int ret = io_uring_enter(ring_fd, 1, 1,
                                    IORING_ENTER_GETEVENTS);
           if (ret < 0) {
               perror("io_uring_enter");
               return -1;
           }

           return ret;
       }

       int main(int argc, char *argv[]) {
           int res;

           /* Setup io_uring for use */
           if (app_setup_uring()) {
               fprintf(stderr, "Unable to setup uring!\n");
               return 1;
           }

           /*
            * A while loop that reads from stdin and writes to stdout.
            * Breaks on EOF.
            */
           while (1) {
               /* Initiate read from stdin and wait for it to complete */
               submit_to_sq(STDIN_FILENO, IORING_OP_READ);
               /* Read completion queue entry */
               res = read_from_cq();
               if (res > 0) {
                   /* Read successful. Write to stdout. */
                   submit_to_sq(STDOUT_FILENO, IORING_OP_WRITE);
                   read_from_cq();
               } else if (res == 0) {
                   /* reached EOF */
                   break;
               }
               else if (res < 0) {
                   /* Error reading file */
                   fprintf(stderr, "Error: %s\n", strerror(abs(res)));
                   break;
               }
               offset += res;
           }

           return 0;
       }

SEE ALSO
       io_uring_enter(2), io_uring_register(2), io_uring_setup(2)

Linux                            2020-07-26                       io_uring(7)