IO_URING(7)              Linux Programmer's Manual             IO_URING(7)

NAME
       io_uring - Asynchronous I/O facility

SYNOPSIS
       #include <linux/io_uring.h>

DESCRIPTION
       io_uring is a Linux-specific API for asynchronous I/O.  It allows the
       user to submit one or more I/O requests, which are processed
       asynchronously without blocking the calling process.  io_uring gets
       its name from ring buffers which are shared between user space and
       kernel space.  This arrangement allows for efficient I/O, while
       avoiding the overhead of copying buffers between them, where
       possible.  This sets io_uring apart from other UNIX I/O APIs: rather
       than communicating between kernel and user space only through system
       calls, ring buffers are used as the main mode of communication.  This
       arrangement has various performance benefits which are discussed in a
       separate section below.  This man page uses the terms shared buffers,
       shared ring buffers and queues interchangeably.

       The general programming model you need to follow for io_uring is
       outlined below:

       •  Set up shared buffers with io_uring_setup(2) and mmap(2), mapping
          into user space shared buffers for the submission queue (SQ) and
          the completion queue (CQ).  You place I/O requests you want to
          make on the SQ, while the kernel places the results of those
          operations on the CQ.

       •  For every I/O request you need to make (like to read a file, write
          a file, accept a socket connection, etc), you create a submission
          queue entry, or SQE, describe the I/O operation you need to get
          done and add it to the tail of the submission queue (SQ).  Each
          I/O operation is, in essence, the equivalent of a system call you
          would have made otherwise, if you were not using io_uring.  You
          can add more than one SQE to the queue depending on the number of
          operations you want to request.

       •  After you add one or more SQEs, you need to call io_uring_enter(2)
          to tell the kernel to dequeue your I/O requests off the SQ and
          begin processing them.

       •  For each SQE you submit, once it is done processing the request,
          the kernel places a completion queue event or CQE at the tail of
          the completion queue or CQ.  The kernel places exactly one
          matching CQE in the CQ for every SQE you submit on the SQ.  After
          you retrieve a CQE, minimally, you might be interested in checking
          the res field of the CQE structure, which corresponds to the
          return value of the system call's equivalent, had you used it
          directly without using io_uring.  For instance, a read operation
          under io_uring, started with the IORING_OP_READ operation, which
          issues the equivalent of the read(2) system call, would return as
          part of res what read(2) would have returned if called directly,
          without using io_uring.

       •  Optionally, io_uring_enter(2) can also wait for a specified number
          of requests to be processed by the kernel before it returns.  If
          you specified a certain number of completions to wait for, the
          kernel would have placed at least that many CQEs on the CQ, which
          you can then readily read, right after the return from
          io_uring_enter(2).

       •  It is important to remember that I/O requests submitted to the
          kernel can complete in any order.  It is not necessary for the
          kernel to process one request after another, in the order you
          placed them.  Given that the interface is a ring, the requests are
          attempted in order, however that doesn't imply any sort of
          ordering on their completion.  When more than one request is in
          flight, it is not possible to determine which one will complete
          first.  When you dequeue CQEs off the CQ, you should always check
          which submitted request it corresponds to.  The most common method
          for doing so is utilizing the user_data field in the request,
          which is passed back on the completion side (a minimal sketch of
          this follows these lists).

       Adding to and reading from the queues:

       •  You add SQEs to the tail of the SQ.  The kernel reads SQEs off the
          head of the queue.

       •  The kernel adds CQEs to the tail of the CQ.  You read CQEs off the
          head of the queue.

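       As an illustration, an application can tag each SQE with a pointer or
       a request number in user_data and use that value to route the
       matching CQE back to its originating request.  The sketch below is
       not part of the example program but reuses its ring globals and
       barrier macros (sqes, sring_tail, sring_mask, sring_array, defined in
       EXAMPLES below); struct my_request is purely illustrative.

           struct my_request {
               int   id;       /* application-defined request identifier */
               void *buf;      /* buffer this request reads into */
           };

           /* Queue a read and remember which request it belongs to. */
           void queue_tagged_read(int fd, struct my_request *req, unsigned len)
           {
               unsigned tail = *sring_tail;
               unsigned index = tail & *sring_mask;
               struct io_uring_sqe *sqe = &sqes[index];

               memset(sqe, 0, sizeof(*sqe));
               sqe->opcode = IORING_OP_READ;
               sqe->fd = fd;
               sqe->addr = (unsigned long) req->buf;
               sqe->len = len;
               sqe->user_data = (unsigned long) req;  /* passed back verbatim */

               sring_array[index] = index;
               io_uring_smp_store_release(sring_tail, tail + 1);
           }

           /* On the completion side, recover the request from the CQE. */
           struct my_request *request_of(struct io_uring_cqe *cqe)
           {
               return (struct my_request *) (unsigned long) cqe->user_data;
           }
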
   Submission queue polling
       One of the goals of io_uring is to provide a means for efficient I/O.
       To this end, io_uring supports a polling mode that lets you avoid the
       call to io_uring_enter(2), which you use to inform the kernel that
       you have queued SQEs on to the SQ.  With SQ Polling, io_uring starts
       a kernel thread that polls the submission queue for any I/O requests
       you submit by adding SQEs.  With SQ Polling enabled, there is no need
       for you to call io_uring_enter(2), letting you avoid the overhead of
       system calls.  A designated kernel thread dequeues SQEs off the SQ as
       you add them and dispatches them for asynchronous processing.

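       SQ polling is requested at io_uring_setup(2) time.  The sketch below
       is a minimal illustration of that setup and of the wakeup handshake
       it implies; it reuses the io_uring_setup() and io_uring_enter()
       wrappers, QUEUE_DEPTH and ring_fd from the example program in
       EXAMPLES, and sq_ring_flags stands for a pointer the application
       would have saved to the SQ ring's flags field (offset p.sq_off.flags)
       when mapping the ring.

           int setup_sqpoll_ring(void)
           {
               struct io_uring_params p;

               memset(&p, 0, sizeof(p));
               p.flags = IORING_SETUP_SQPOLL;  /* kernel thread polls the SQ */
               p.sq_thread_idle = 2000;        /* thread may idle after 2000 ms */
               return io_uring_setup(QUEUE_DEPTH, &p);
           }

           /*
            * After adding SQEs, io_uring_enter(2) is only needed when the
            * polling thread has gone idle, which the kernel indicates by
            * setting IORING_SQ_NEED_WAKEUP in the SQ ring's flags field.
            */
           void wake_sq_thread_if_needed(unsigned *sq_ring_flags)
           {
               if (io_uring_smp_load_acquire(sq_ring_flags) & IORING_SQ_NEED_WAKEUP)
                   io_uring_enter(ring_fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
           }
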
   Setting up io_uring
       The main steps in setting up io_uring consist of mapping in the
       shared buffers with mmap(2) calls.  In the example program included
       in this man page, the function app_setup_uring() sets up io_uring
       with a QUEUE_DEPTH deep submission queue.  Pay attention to the 2
       mmap(2) calls that set up the shared submission and completion
       queues.  If your kernel is older than version 5.4, three mmap(2)
       calls are required.

   Submitting I/O requests
       The process of submitting a request consists of describing the I/O
       operation you need to get done using an io_uring_sqe structure
       instance.  These details describe the equivalent system call and its
       parameters.  Because the range of I/O operations Linux supports is
       very varied and the io_uring_sqe structure needs to be able to
       describe them, it has several fields, some packed into unions for
       space efficiency.  Here is a simplified version of struct
       io_uring_sqe with some of the most often used fields:

           struct io_uring_sqe {
                   __u8    opcode;         /* type of operation for this sqe */
                   __s32   fd;             /* file descriptor to do IO on */
                   __u64   off;            /* offset into file */
                   __u64   addr;           /* pointer to buffer or iovecs */
                   __u32   len;            /* buffer size or number of iovecs */
                   __u64   user_data;      /* data to be passed back at completion time */
                   __u8    flags;          /* IOSQE_ flags */
                   ...
           };

       Here is struct io_uring_sqe in full:

           struct io_uring_sqe {
                   __u8    opcode;         /* type of operation for this sqe */
                   __u8    flags;          /* IOSQE_ flags */
                   __u16   ioprio;         /* ioprio for the request */
                   __s32   fd;             /* file descriptor to do IO on */
                   union {
                           __u64   off;    /* offset into file */
                           __u64   addr2;
                   };
                   union {
                           __u64   addr;   /* pointer to buffer or iovecs */
                           __u64   splice_off_in;
                   };
                   __u32   len;            /* buffer size or number of iovecs */
                   union {
                           __kernel_rwf_t  rw_flags;
                           __u32           fsync_flags;
                           __u16           poll_events;    /* compatibility */
                           __u32           poll32_events;  /* word-reversed for BE */
                           __u32           sync_range_flags;
                           __u32           msg_flags;
                           __u32           timeout_flags;
                           __u32           accept_flags;
                           __u32           cancel_flags;
                           __u32           open_flags;
                           __u32           statx_flags;
                           __u32           fadvise_advice;
                           __u32           splice_flags;
                   };
                   __u64   user_data;      /* data to be passed back at completion time */
                   union {
                           struct {
                                   /* pack this to avoid bogus arm OABI complaints */
                                   union {
                                           /* index into fixed buffers, if used */
                                           __u16   buf_index;
                                           /* for grouped buffer selection */
                                           __u16   buf_group;
                                   } __attribute__((packed));
                                   /* personality to use, if used */
                                   __u16   personality;
                                   __s32   splice_fd_in;
                           };
                           __u64   __pad2[3];
                   };
           };

       To submit an I/O request to io_uring, you need to acquire a
       submission queue entry (SQE) from the submission queue (SQ), fill it
       up with details of the operation you want to submit and call
       io_uring_enter(2).  If you want to avoid calling io_uring_enter(2),
       you have the option of setting up Submission Queue Polling.

       SQEs are added to the tail of the submission queue.  The kernel picks
       up SQEs off the head of the SQ.  The general algorithm to get the
       next available SQE and update the tail is as follows.

           struct io_uring_sqe *sqe;
           unsigned tail, index;
           tail = *sqring->tail;
           index = tail & (*sqring->ring_mask);
           sqe = &sqring->sqes[index];
           /* fill up details about this I/O request */
           describe_io(sqe);
           /* fill the sqe index into the SQ ring array */
           sqring->array[index] = index;
           tail++;
           atomic_store_release(sqring->tail, tail);

       To get the index of an entry, the application must mask the current
       tail index with the size mask of the ring.  This holds true for both
       SQs and CQs.  Once the SQE is acquired, the necessary fields are
       filled in, describing the request.  While the CQ ring directly
       indexes the shared array of CQEs, the submission side has an
       indirection array between them.  The submission side ring buffer is
       an index into this array, which in turn contains the index into the
       SQEs.

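       As an illustration of this indirection, an application could fill any
       free entry of the SQE array and publish it at the current ring
       position.  The helper below is hypothetical and reuses the ring
       globals and barrier macros from the example program in EXAMPLES; note
       that sqe_index need not equal the ring position, since array[]
       provides the indirection.

           /* Publish an already-filled SQE, identified by its index into the
            * sqes[] array, at the current SQ ring tail. */
           void publish_sqe(unsigned sqe_index)
           {
               unsigned tail = *sring_tail;

               sring_array[tail & *sring_mask] = sqe_index;
               io_uring_smp_store_release(sring_tail, tail + 1);
           }
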
       The following code snippet demonstrates how a read operation, an
       equivalent of a preadv2(2) system call, is described by filling up an
       SQE with the necessary parameters.

           struct iovec iovecs[16];
           ...
           sqe->opcode = IORING_OP_READV;
           sqe->fd = fd;
           sqe->addr = (unsigned long) iovecs;
           sqe->len = 16;
           sqe->off = offset;
           sqe->flags = 0;

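       Non-vectored variants are filled in the same way.  The fragment below
       is a sketch in the same style for IORING_OP_READ, which takes a plain
       buffer and length rather than an iovec array, and additionally tags
       the request via user_data; the variables mirror those of the snippet
       above.

           char buf[4096];
           ...
           sqe->opcode = IORING_OP_READ;
           sqe->fd = fd;
           sqe->addr = (unsigned long) buf;   /* plain buffer, not iovecs */
           sqe->len = sizeof(buf);            /* buffer size in bytes */
           sqe->off = offset;
           sqe->flags = 0;
           sqe->user_data = 1;                /* arbitrary tag, echoed in the CQE */
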
   Memory ordering
       Modern compilers and CPUs freely reorder reads and writes without
       affecting the program's outcome to optimize performance.  Some
       aspects of this need to be kept in mind on SMP systems since io_uring
       involves buffers shared between kernel and user space.  These buffers
       are both visible and modifiable from kernel and user space.  As heads
       and tails belonging to these shared buffers are updated by kernel and
       user space, changes need to be coherently visible on either side,
       irrespective of whether a CPU switch took place after the kernel-user
       mode switch happened.  We use memory barriers to enforce this
       coherency.  Memory barriers are a significant subject in their own
       right and are beyond the scope of this man page.

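       In practice, on the user-space side the required ordering can be
       expressed with C11 acquire/release atomics.  The example program in
       EXAMPLES wraps them in two macros, reproduced here with explanatory
       comments as a sketch:

           #include <stdatomic.h>

           /* Store with release semantics: publish a new tail (or head) so
            * that all prior writes to the ring are visible before it. */
           #define io_uring_smp_store_release(p, v)                        \
                   atomic_store_explicit((_Atomic typeof(*(p)) *)(p), (v), \
                                         memory_order_release)

           /* Load with acquire semantics: read a head (or tail) such that
            * subsequent reads of the ring see the other side's writes. */
           #define io_uring_smp_load_acquire(p)                      \
                   atomic_load_explicit((_Atomic typeof(*(p)) *)(p), \
                                        memory_order_acquire)
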
   Letting the kernel know about I/O submissions
       Once you place one or more SQEs on to the SQ, you need to let the
       kernel know that you've done so.  You can do this by calling the
       io_uring_enter(2) system call.  This system call is also capable of
       waiting for a specified count of events to complete.  This way, you
       can be sure to find completion events in the completion queue without
       having to poll it for events later.

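       For example, to submit the SQEs just queued and wait for at least one
       of them to complete in the same call, a sketch using the raw-syscall
       wrapper and ring_fd from the example program in EXAMPLES might look
       like this, with to_submit standing for the number of SQEs queued:

           /* Submit to_submit queued SQEs; with IORING_ENTER_GETEVENTS the
            * call also waits until at least min_complete (here 1) CQEs are
            * available. */
           int ret = io_uring_enter(ring_fd, to_submit, 1, IORING_ENTER_GETEVENTS);
           if (ret < 0)
               perror("io_uring_enter");
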
   Reading completion events
       Similar to the submission queue (SQ), the completion queue (CQ) is a
       shared buffer between the kernel and user space.  Whereas you placed
       submission queue entries on the tail of the SQ and the kernel read
       off the head, when it comes to the CQ, the kernel places completion
       queue events or CQEs on the tail of the CQ and you read off its head.

       Submission is flexible (and thus a bit more complicated) since it
       needs to be able to encode different types of system calls that take
       various parameters.  Completion, on the other hand, is simpler since
       we're looking only for a return value back from the kernel.  This is
       easily understood by looking at the completion queue event structure,
       struct io_uring_cqe:

           struct io_uring_cqe {
                   __u64   user_data;  /* sqe->data submission passed back */
                   __s32   res;        /* result code for this event */
                   __u32   flags;
           };

       Here, user_data is custom data that is passed unchanged from
       submission to completion.  That is, from SQEs to CQEs.  This field
       can be used to set context, uniquely identifying submissions that got
       completed.  Given that I/O requests can complete in any order, this
       field can be used to correlate a submission with a completion.  res
       is the result from the system call that was performed as part of the
       submission; its return value.  The flags field could carry
       request-specific metadata in the future, but is currently unused.

       The general sequence to read completion events off the completion
       queue is as follows:

           unsigned head;
           head = *cqring->head;
           if (head != atomic_load_acquire(cqring->tail)) {
                   struct io_uring_cqe *cqe;
                   unsigned index;
                   index = head & (cqring->mask);
                   cqe = &cqring->cqes[index];
                   /* process completed CQE */
                   process_cqe(cqe);
                   /* CQE consumption complete */
                   head++;
           }
           atomic_store_release(cqring->head, head);

       It helps to be reminded that the kernel adds CQEs to the tail of the
       CQ, while you need to dequeue them off the head.  To get the index of
       an entry at the head, the application must mask the current head
       index with the size mask of the ring.  Once the CQE has been consumed
       or processed, the head needs to be updated to reflect the consumption
       of the CQE.  Attention should be paid to the read and write barriers
       to ensure successful read and update of the head.

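       The snippet above consumes at most one CQE per invocation.  A common
       variant, sketched below against the globals and barrier macros of the
       example program in EXAMPLES, keeps looping until the ring is empty so
       that a single wakeup can reap any number of completions:

           /* Drain every CQE currently available; returns how many were seen. */
           int reap_cqes(void)
           {
               unsigned head = *cring_head;
               int reaped = 0;

               while (head != io_uring_smp_load_acquire(cring_tail)) {
                   struct io_uring_cqe *cqe = &cqes[head & *cring_mask];

                   if (cqe->res < 0)
                       fprintf(stderr, "request %llu failed: %s\n",
                               (unsigned long long) cqe->user_data,
                               strerror(-cqe->res));
                   head++;
                   reaped++;
               }
               /* Publish the new head so the kernel can reuse the slots. */
               io_uring_smp_store_release(cring_head, head);
               return reaped;
           }
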
   io_uring performance
       Because of the shared ring buffers between kernel and user space,
       io_uring can be a zero-copy system.  Copying buffers to and fro
       becomes necessary when system calls that transfer data between kernel
       and user space are involved.  But since the bulk of the communication
       in io_uring is via buffers shared between the kernel and user space,
       this huge performance overhead is completely avoided.

       While system calls may not seem like a significant overhead, in
       high-performance applications, making a lot of them will begin to
       matter.  Ideally, the workarounds the operating system has in place
       to deal with Spectre and Meltdown would not be needed, but
       unfortunately some of them sit around the system call interface,
       making system calls not as cheap as before on affected hardware.
       While newer hardware should not need these workarounds, hardware with
       these vulnerabilities can be expected to be in the wild for a long
       time.  While using synchronous programming interfaces or even when
       using asynchronous programming interfaces under Linux, there is at
       least one system call involved in the submission of each request.  In
       io_uring, on the other hand, you can batch several requests in one
       go, simply by queueing up multiple SQEs, each describing an I/O
       operation you want, and make a single call to io_uring_enter(2).
       This is possible due to io_uring's shared buffers based design.

       While this batching in itself can avoid the overhead associated with
       potentially multiple and frequent system calls, you can reduce even
       this overhead further with Submission Queue Polling, by having the
       kernel poll and pick up your SQEs for processing as you add them to
       the submission queue.  This avoids the io_uring_enter(2) call you
       need to make to tell the kernel to pick SQEs up.  For
       high-performance applications, this means even lower system call
       overhead.
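
       As a concrete illustration of this batching, the sketch below queues
       several read requests and submits them all with a single
       io_uring_enter(2) call.  It is not part of the example program: it
       reuses that program's ring globals, wrappers and BLOCK_SZ (see
       EXAMPLES), assumes the ring was set up with a queue depth of at least
       BATCH, and the buffer array is purely illustrative.

           #define BATCH 8

           char bufs[BATCH][BLOCK_SZ];

           /* Queue BATCH reads at consecutive offsets, then submit them all
            * with one system call. */
           int submit_read_batch(int fd, off_t start)
           {
               unsigned tail = *sring_tail;

               for (int i = 0; i < BATCH; i++) {
                   unsigned index = tail & *sring_mask;
                   struct io_uring_sqe *sqe = &sqes[index];

                   memset(sqe, 0, sizeof(*sqe));
                   sqe->opcode = IORING_OP_READ;
                   sqe->fd = fd;
                   sqe->addr = (unsigned long) bufs[i];
                   sqe->len = BLOCK_SZ;
                   sqe->off = start + (off_t) i * BLOCK_SZ;
                   sqe->user_data = i;      /* identifies the slice on completion */

                   sring_array[index] = index;
                   tail++;
               }
               io_uring_smp_store_release(sring_tail, tail);

               /* One io_uring_enter(2) call submits all BATCH requests. */
               return io_uring_enter(ring_fd, BATCH, 0, 0);
           }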

CONFORMING TO
       io_uring is Linux-specific.

EXAMPLES
       The following example uses io_uring to copy stdin to stdout.  Using
       shell redirection, you should be able to copy files with this
       example.  Because it uses a queue depth of only one, this example
       processes I/O requests one after the other.  It is purposefully kept
       this way to aid understanding.  In real-world scenarios, however,
       you'll want to have a larger queue depth to parallelize I/O request
       processing so as to gain the kind of performance benefits io_uring
       provides with its asynchronous processing of requests.

       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/stat.h>
       #include <sys/ioctl.h>
       #include <sys/syscall.h>
       #include <sys/mman.h>
       #include <sys/uio.h>
       #include <linux/fs.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <string.h>
       #include <stdatomic.h>

       #include <linux/io_uring.h>

       #define QUEUE_DEPTH 1
       #define BLOCK_SZ    1024

       /* Macros for barriers needed by io_uring */
       #define io_uring_smp_store_release(p, v)                        \
               atomic_store_explicit((_Atomic typeof(*(p)) *)(p), (v), \
                                     memory_order_release)
       #define io_uring_smp_load_acquire(p)                      \
               atomic_load_explicit((_Atomic typeof(*(p)) *)(p), \
                                    memory_order_acquire)

       int ring_fd;
       unsigned *sring_tail, *sring_mask, *sring_array,
                *cring_head, *cring_tail, *cring_mask;
       struct io_uring_sqe *sqes;
       struct io_uring_cqe *cqes;
       char buff[BLOCK_SZ];
       off_t offset;

       /*
        * System call wrappers provided since glibc does not yet
        * provide wrappers for io_uring system calls.
        * */

       int io_uring_setup(unsigned entries, struct io_uring_params *p)
       {
           return (int) syscall(__NR_io_uring_setup, entries, p);
       }

       int io_uring_enter(int ring_fd, unsigned int to_submit,
                          unsigned int min_complete, unsigned int flags)
       {
           return (int) syscall(__NR_io_uring_enter, ring_fd, to_submit,
                                min_complete, flags, NULL, 0);
       }

       int app_setup_uring(void) {
           struct io_uring_params p;
           void *sq_ptr, *cq_ptr;

           /* See io_uring_setup(2) for io_uring_params.flags you can set */
           memset(&p, 0, sizeof(p));
           ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
           if (ring_fd < 0) {
               perror("io_uring_setup");
               return 1;
           }

           /*
            * io_uring communication happens via 2 shared kernel-user space ring
            * buffers, which can be jointly mapped with a single mmap() call in
            * kernels >= 5.4.
            */

           int sring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
           int cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

           /* Rather than check for kernel version, the recommended way is to
            * check the features field of the io_uring_params structure, which is a
            * bitmask. If IORING_FEAT_SINGLE_MMAP is set, we can do away with the
            * second mmap() call to map in the completion ring separately.
            */
           if (p.features & IORING_FEAT_SINGLE_MMAP) {
               if (cring_sz > sring_sz)
                   sring_sz = cring_sz;
               cring_sz = sring_sz;
           }

           /* Map in the submission and completion queue ring buffers.
            * Kernels < 5.4 only map in the submission queue, though.
            */
           sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE,
                         ring_fd, IORING_OFF_SQ_RING);
           if (sq_ptr == MAP_FAILED) {
               perror("mmap");
               return 1;
           }

           if (p.features & IORING_FEAT_SINGLE_MMAP) {
               cq_ptr = sq_ptr;
           } else {
               /* Map in the completion queue ring buffer in older kernels separately */
               cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_POPULATE,
                             ring_fd, IORING_OFF_CQ_RING);
               if (cq_ptr == MAP_FAILED) {
                   perror("mmap");
                   return 1;
               }
           }
           /* Save useful fields for later easy reference */
           sring_tail = sq_ptr + p.sq_off.tail;
           sring_mask = sq_ptr + p.sq_off.ring_mask;
           sring_array = sq_ptr + p.sq_off.array;

           /* Map in the submission queue entries array */
           sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                       ring_fd, IORING_OFF_SQES);
           if (sqes == MAP_FAILED) {
               perror("mmap");
               return 1;
           }

           /* Save useful fields for later easy reference */
           cring_head = cq_ptr + p.cq_off.head;
           cring_tail = cq_ptr + p.cq_off.tail;
           cring_mask = cq_ptr + p.cq_off.ring_mask;
           cqes = cq_ptr + p.cq_off.cqes;

           return 0;
       }

       /*
        * Read from completion queue.
        * In this function, we read completion events from the completion queue.
        * We dequeue the CQE, update the head and return the result of the
        * operation.
        * */

       int read_from_cq() {
           struct io_uring_cqe *cqe;
           unsigned head;
           int res;

           /* Read barrier */
           head = io_uring_smp_load_acquire(cring_head);
           /*
            * Remember, this is a ring buffer. If head == tail, it means that the
            * buffer is empty.
            * */
           if (head == *cring_tail)
               return -1;

           /* Get the entry and save its result before releasing the slot */
           cqe = &cqes[head & (*cring_mask)];
           res = cqe->res;
           if (res < 0)
               fprintf(stderr, "Error: %s\n", strerror(abs(res)));

           head++;

           /* Write barrier so that updates to the head are made visible */
           io_uring_smp_store_release(cring_head, head);

           return res;
       }

       /*
        * Submit a read or a write request to the submission queue.
        * */

       int submit_to_sq(int fd, int op) {
           unsigned index, tail;

           /* Add our submission queue entry to the tail of the SQE ring buffer */
           tail = *sring_tail;
           index = tail & *sring_mask;
           struct io_uring_sqe *sqe = &sqes[index];
           /* Fill in the parameters required for the read or write operation */
           sqe->opcode = op;
           sqe->fd = fd;
           sqe->addr = (unsigned long) buff;
           if (op == IORING_OP_READ) {
               memset(buff, 0, sizeof(buff));
               sqe->len = BLOCK_SZ;
           }
           else {
               sqe->len = strlen(buff);
           }
           sqe->off = offset;

           sring_array[index] = index;
           tail++;

           /* Update the tail */
           io_uring_smp_store_release(sring_tail, tail);

           /*
            * Tell the kernel we have submitted events with the io_uring_enter()
            * system call. We also pass in the IORING_ENTER_GETEVENTS flag which
            * causes the io_uring_enter() call to wait until min_complete
            * (the 3rd param) events complete.
            * */
           int ret = io_uring_enter(ring_fd, 1, 1,
                                    IORING_ENTER_GETEVENTS);
           if (ret < 0) {
               perror("io_uring_enter");
               return -1;
           }

           return ret;
       }

       int main(int argc, char *argv[]) {
           int res;

           /* Setup io_uring for use */
           if (app_setup_uring()) {
               fprintf(stderr, "Unable to setup uring!\n");
               return 1;
           }

           /*
            * A while loop that reads from stdin and writes to stdout.
            * Breaks on EOF.
            */
           while (1) {
               /* Initiate read from stdin and wait for it to complete */
               submit_to_sq(STDIN_FILENO, IORING_OP_READ);
               /* Read completion queue entry */
               res = read_from_cq();
               if (res > 0) {
                   /* Read successful. Write to stdout. */
                   submit_to_sq(STDOUT_FILENO, IORING_OP_WRITE);
                   read_from_cq();
               } else if (res == 0) {
                   /* reached EOF */
                   break;
               }
               else if (res < 0) {
                   /* Error reading file */
                   fprintf(stderr, "Error: %s\n", strerror(abs(res)));
                   break;
               }
               offset += res;
           }

           return 0;
       }

SEE ALSO
       io_uring_enter(2) io_uring_register(2) io_uring_setup(2)



Linux                             2020-07-26                      IO_URING(7)