IO_URING(7)              Linux Programmer's Manual             IO_URING(7)

NAME
       io_uring - Asynchronous I/O facility

SYNOPSIS
       #include <linux/io_uring.h>

DESCRIPTION
       io_uring is a Linux-specific API for asynchronous I/O.  It allows the
       user to submit one or more I/O requests, which are processed
       asynchronously without blocking the calling process.  io_uring gets
       its name from ring buffers which are shared between user space and
       kernel space.  This arrangement allows for efficient I/O, while
       avoiding the overhead of copying buffers between them, where
       possible.  This sets io_uring apart from other UNIX I/O APIs: rather
       than communicating between kernel and user space only through system
       calls, ring buffers are used as the main mode of communication.  This
       arrangement has various performance benefits which are discussed in a
       separate section below.  This man page uses the terms shared buffers,
       shared ring buffers and queues interchangeably.

       The general programming model you need to follow for io_uring is
       outlined below:

       •  Set up shared buffers with io_uring_setup(2) and mmap(2), mapping
          into user space shared buffers for the submission queue (SQ) and
          the completion queue (CQ).  You place I/O requests you want to
          make on the SQ, while the kernel places the results of those
          operations on the CQ.

       •  For every I/O request you need to make (like to read a file, write
          a file, accept a socket connection, etc), you create a submission
          queue entry, or SQE, describe the I/O operation you need to get
          done and add it to the tail of the submission queue (SQ).  Each
          I/O operation is, in essence, the equivalent of a system call you
          would have made otherwise, if you were not using io_uring.  You
          can add more than one SQE to the queue depending on the number of
          operations you want to request.

       •  After you add one or more SQEs, you need to call io_uring_enter(2)
          to tell the kernel to dequeue your I/O requests off the SQ and
          begin processing them.

       •  For each SQE you submit, once it is done processing the request,
          the kernel places a completion queue event or CQE at the tail of
          the completion queue or CQ.  The kernel places exactly one
          matching CQE in the CQ for every SQE you submit on the SQ.  After
          you retrieve a CQE, minimally, you might be interested in checking
          the res field of the CQE structure, which corresponds to the
          return value of the system call's equivalent, had you used it
          directly without using io_uring.  For instance, a read operation
          under io_uring, started with the IORING_OP_READ operation, which
          issues the equivalent of the read(2) system call, would return as
          part of res what read(2) would have returned if called directly,
          without using io_uring.

       •  Optionally, io_uring_enter(2) can also wait for a specified number
          of requests to be processed by the kernel before it returns.  If
          you specified a certain number of completions to wait for, the
          kernel would have placed at least that many CQEs on the CQ, which
          you can then readily read, right after the return from
          io_uring_enter(2).

       •  It is important to remember that I/O requests submitted to the
          kernel can complete in any order.  It is not necessary for the
          kernel to process one request after another, in the order you
          placed them.  Given that the interface is a ring, the requests are
          attempted in order, however that doesn't imply any sort of
          ordering on their completion.  When more than one request is in
          flight, it is not possible to determine which one will complete
          first.  When you dequeue CQEs off the CQ, you should always check
          which submitted request it corresponds to.  The most common method
          for doing so is utilizing the user_data field in the request,
          which is passed back on the completion side (a minimal sketch of
          this follows these lists).

       Adding to and reading from the queues:

       •  You add SQEs to the tail of the SQ.  The kernel reads SQEs off the
          head of the queue.

       •  The kernel adds CQEs to the tail of the CQ.  You read CQEs off the
          head of the queue.

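       As an illustration, an application can tag each SQE with a pointer or
       a request number in user_data and use that value to route the
       matching CQE back to its originating request.  The sketch below is
       not part of the example program but reuses its ring globals and
       barrier macros (sqes, sring_tail, sring_mask, sring_array, defined in
       EXAMPLES below); struct my_request is purely illustrative.

           struct my_request {
               int   id;       /* application-defined request identifier */
               void *buf;      /* buffer this request reads into */
           };

           /* Queue a read and remember which request it belongs to. */
           void queue_tagged_read(int fd, struct my_request *req, unsigned len)
           {
               unsigned tail = *sring_tail;
               unsigned index = tail & *sring_mask;
               struct io_uring_sqe *sqe = &sqes[index];

               memset(sqe, 0, sizeof(*sqe));
               sqe->opcode = IORING_OP_READ;
               sqe->fd = fd;
               sqe->addr = (unsigned long) req->buf;
               sqe->len = len;
               sqe->user_data = (unsigned long) req;  /* passed back verbatim */

               sring_array[index] = index;
               io_uring_smp_store_release(sring_tail, tail + 1);
           }

           /* On the completion side, recover the request from the CQE. */
           struct my_request *request_of(struct io_uring_cqe *cqe)
           {
               return (struct my_request *) (unsigned long) cqe->user_data;
           }
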
   Submission queue polling
       One of the goals of io_uring is to provide a means for efficient I/O.
       To this end, io_uring supports a polling mode that lets you avoid the
       call to io_uring_enter(2), which you use to inform the kernel that
       you have queued SQEs on to the SQ.  With SQ Polling, io_uring starts
       a kernel thread that polls the submission queue for any I/O requests
       you submit by adding SQEs.  With SQ Polling enabled, there is no need
       for you to call io_uring_enter(2), letting you avoid the overhead of
       system calls.  A designated kernel thread dequeues SQEs off the SQ as
       you add them and dispatches them for asynchronous processing.

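       SQ polling is requested at io_uring_setup(2) time.  The sketch below
       is a minimal illustration of that setup and of the wakeup handshake
       it implies; it reuses the io_uring_setup() and io_uring_enter()
       wrappers, QUEUE_DEPTH and ring_fd from the example program in
       EXAMPLES, and sq_ring_flags stands for a pointer the application
       would have saved to the SQ ring's flags field (offset p.sq_off.flags)
       when mapping the ring.

           int setup_sqpoll_ring(void)
           {
               struct io_uring_params p;

               memset(&p, 0, sizeof(p));
               p.flags = IORING_SETUP_SQPOLL;  /* kernel thread polls the SQ */
               p.sq_thread_idle = 2000;        /* thread may idle after 2000 ms */
               return io_uring_setup(QUEUE_DEPTH, &p);
           }

           /*
            * After adding SQEs, io_uring_enter(2) is only needed when the
            * polling thread has gone idle, which the kernel indicates by
            * setting IORING_SQ_NEED_WAKEUP in the SQ ring's flags field.
            */
           void wake_sq_thread_if_needed(unsigned *sq_ring_flags)
           {
               if (io_uring_smp_load_acquire(sq_ring_flags) & IORING_SQ_NEED_WAKEUP)
                   io_uring_enter(ring_fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
           }
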
   Setting up io_uring
       The main steps in setting up io_uring consist of mapping in the
       shared buffers with mmap(2) calls.  In the example program included
       in this man page, the function app_setup_uring() sets up io_uring
       with a QUEUE_DEPTH deep submission queue.  Pay attention to the 2
       mmap(2) calls that set up the shared submission and completion
       queues.  If your kernel is older than version 5.4, three mmap(2)
       calls are required.

   Submitting I/O requests
       The process of submitting a request consists of describing the I/O
       operation you need to get done using an io_uring_sqe structure
       instance.  These details describe the equivalent system call and its
       parameters.  Because the range of I/O operations Linux supports is
       very varied and the io_uring_sqe structure needs to be able to
       describe them, it has several fields, some packed into unions for
       space efficiency.  Here is a simplified version of struct
       io_uring_sqe with some of the most often used fields:

           struct io_uring_sqe {
                   __u8    opcode;         /* type of operation for this sqe */
                   __s32   fd;             /* file descriptor to do IO on */
                   __u64   off;            /* offset into file */
                   __u64   addr;           /* pointer to buffer or iovecs */
                   __u32   len;            /* buffer size or number of iovecs */
                   __u64   user_data;      /* data to be passed back at completion time */
                   __u8    flags;          /* IOSQE_ flags */
                   ...
           };

       Here is struct io_uring_sqe in full:

           struct io_uring_sqe {
                   __u8    opcode;         /* type of operation for this sqe */
                   __u8    flags;          /* IOSQE_ flags */
                   __u16   ioprio;         /* ioprio for the request */
                   __s32   fd;             /* file descriptor to do IO on */
                   union {
                           __u64   off;    /* offset into file */
                           __u64   addr2;
                   };
                   union {
                           __u64   addr;   /* pointer to buffer or iovecs */
                           __u64   splice_off_in;
                   };
                   __u32   len;            /* buffer size or number of iovecs */
                   union {
                           __kernel_rwf_t  rw_flags;
                           __u32           fsync_flags;
                           __u16           poll_events;    /* compatibility */
                           __u32           poll32_events;  /* word-reversed for BE */
                           __u32           sync_range_flags;
                           __u32           msg_flags;
                           __u32           timeout_flags;
                           __u32           accept_flags;
                           __u32           cancel_flags;
                           __u32           open_flags;
                           __u32           statx_flags;
                           __u32           fadvise_advice;
                           __u32           splice_flags;
                   };
                   __u64   user_data;      /* data to be passed back at completion time */
                   union {
                           struct {
                                   /* pack this to avoid bogus arm OABI complaints */
                                   union {
                                           /* index into fixed buffers, if used */
                                           __u16   buf_index;
                                           /* for grouped buffer selection */
                                           __u16   buf_group;
                                   } __attribute__((packed));
                                   /* personality to use, if used */
                                   __u16   personality;
                                   __s32   splice_fd_in;
                           };
                           __u64   __pad2[3];
                   };
           };

       To submit an I/O request to io_uring, you need to acquire a
       submission queue entry (SQE) from the submission queue (SQ), fill it
       up with details of the operation you want to submit and call
       io_uring_enter(2).  If you want to avoid calling io_uring_enter(2),
       you have the option of setting up Submission Queue Polling.

       SQEs are added to the tail of the submission queue.  The kernel picks
       up SQEs off the head of the SQ.  The general algorithm to get the
       next available SQE and update the tail is as follows.

           struct io_uring_sqe *sqe;
           unsigned tail, index;
           tail = *sqring->tail;
           index = tail & (*sqring->ring_mask);
           sqe = &sqring->sqes[index];
           /* fill up details about this I/O request */
           describe_io(sqe);
           /* fill the sqe index into the SQ ring array */
           sqring->array[index] = index;
           tail++;
           atomic_store_release(sqring->tail, tail);

       To get the index of an entry, the application must mask the current
       tail index with the size mask of the ring.  This holds true for both
       SQs and CQs.  Once the SQE is acquired, the necessary fields are
       filled in, describing the request.  While the CQ ring directly
       indexes the shared array of CQEs, the submission side has an
       indirection array between them.  The submission side ring buffer is
       an index into this array, which in turn contains the index into the
       SQEs.

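       As an illustration of this indirection, an application could fill any
       free entry of the SQE array and publish it at the current ring
       position.  The helper below is hypothetical and reuses the ring
       globals and barrier macros from the example program in EXAMPLES; note
       that sqe_index need not equal the ring position, since array[]
       provides the indirection.

           /* Publish an already-filled SQE, identified by its index into the
            * sqes[] array, at the current SQ ring tail. */
           void publish_sqe(unsigned sqe_index)
           {
               unsigned tail = *sring_tail;

               sring_array[tail & *sring_mask] = sqe_index;
               io_uring_smp_store_release(sring_tail, tail + 1);
           }
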
       The following code snippet demonstrates how a read operation, an
       equivalent of a preadv2(2) system call, is described by filling up an
       SQE with the necessary parameters.

           struct iovec iovecs[16];
           ...
           sqe->opcode = IORING_OP_READV;
           sqe->fd = fd;
           sqe->addr = (unsigned long) iovecs;
           sqe->len = 16;
           sqe->off = offset;
           sqe->flags = 0;

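       Non-vectored variants are filled in the same way.  The fragment below
       is a sketch in the same style for IORING_OP_READ, which takes a plain
       buffer and length rather than an iovec array, and additionally tags
       the request via user_data; the variables mirror those of the snippet
       above.

           char buf[4096];
           ...
           sqe->opcode = IORING_OP_READ;
           sqe->fd = fd;
           sqe->addr = (unsigned long) buf;   /* plain buffer, not iovecs */
           sqe->len = sizeof(buf);            /* buffer size in bytes */
           sqe->off = offset;
           sqe->flags = 0;
           sqe->user_data = 1;                /* arbitrary tag, echoed in the CQE */
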
   Memory ordering
       Modern compilers and CPUs freely reorder reads and writes without
       affecting the program's outcome to optimize performance.  Some
       aspects of this need to be kept in mind on SMP systems since io_uring
       involves buffers shared between kernel and user space.  These buffers
       are both visible and modifiable from kernel and user space.  As heads
       and tails belonging to these shared buffers are updated by kernel and
       user space, changes need to be coherently visible on either side,
       irrespective of whether a CPU switch took place after the kernel-user
       mode switch happened.  We use memory barriers to enforce this
       coherency.  Memory barriers are a significant subject in their own
       right and are beyond the scope of this man page.

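       In practice, on the user-space side the required ordering can be
       expressed with C11 acquire/release atomics.  The example program in
       EXAMPLES wraps them in two macros, reproduced here with explanatory
       comments as a sketch:

           #include <stdatomic.h>

           /* Store with release semantics: publish a new tail (or head) so
            * that all prior writes to the ring are visible before it. */
           #define io_uring_smp_store_release(p, v)                        \
                   atomic_store_explicit((_Atomic typeof(*(p)) *)(p), (v), \
                                         memory_order_release)

           /* Load with acquire semantics: read a head (or tail) such that
            * subsequent reads of the ring see the other side's writes. */
           #define io_uring_smp_load_acquire(p)                      \
                   atomic_load_explicit((_Atomic typeof(*(p)) *)(p), \
                                        memory_order_acquire)
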
   Letting the kernel know about I/O submissions
       Once you place one or more SQEs on to the SQ, you need to let the
       kernel know that you've done so.  You can do this by calling the
       io_uring_enter(2) system call.  This system call is also capable of
       waiting for a specified count of events to complete.  This way, you
       can be sure to find completion events in the completion queue without
       having to poll it for events later.

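       For example, to submit the SQEs just queued and wait for at least one
       of them to complete in the same call, a sketch using the raw-syscall
       wrapper and ring_fd from the example program in EXAMPLES might look
       like this, with to_submit standing for the number of SQEs queued:

           /* Submit to_submit queued SQEs; with IORING_ENTER_GETEVENTS the
            * call also waits until at least min_complete (here 1) CQEs are
            * available. */
           int ret = io_uring_enter(ring_fd, to_submit, 1, IORING_ENTER_GETEVENTS);
           if (ret < 0)
               perror("io_uring_enter");
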
   Reading completion events
       Similar to the submission queue (SQ), the completion queue (CQ) is a
       shared buffer between the kernel and user space.  Whereas you placed
       submission queue entries on the tail of the SQ and the kernel read
       off the head, when it comes to the CQ, the kernel places completion
       queue events or CQEs on the tail of the CQ and you read off its head.

       Submission is flexible (and thus a bit more complicated) since it
       needs to be able to encode different types of system calls that take
       various parameters.  Completion, on the other hand, is simpler since
       we're looking only for a return value back from the kernel.  This is
       easily understood by looking at the completion queue event structure,
       struct io_uring_cqe:

           struct io_uring_cqe {
                   __u64   user_data;  /* sqe->data submission passed back */
                   __s32   res;        /* result code for this event */
                   __u32   flags;
           };

       Here, user_data is custom data that is passed unchanged from
       submission to completion.  That is, from SQEs to CQEs.  This field
       can be used to set context, uniquely identifying submissions that got
       completed.  Given that I/O requests can complete in any order, this
       field can be used to correlate a submission with a completion.  res
       is the result from the system call that was performed as part of the
       submission; its return value.  The flags field could carry
       request-specific metadata in the future, but is currently unused.

       The general sequence to read completion events off the completion
       queue is as follows:

           unsigned head;
           head = *cqring->head;
           if (head != atomic_load_acquire(cqring->tail)) {
                   struct io_uring_cqe *cqe;
                   unsigned index;
                   index = head & (cqring->mask);
                   cqe = &cqring->cqes[index];
                   /* process completed CQE */
                   process_cqe(cqe);
                   /* CQE consumption complete */
                   head++;
           }
           atomic_store_release(cqring->head, head);

       It helps to be reminded that the kernel adds CQEs to the tail of the
       CQ, while you need to dequeue them off the head.  To get the index of
       an entry at the head, the application must mask the current head
       index with the size mask of the ring.  Once the CQE has been consumed
       or processed, the head needs to be updated to reflect the consumption
       of the CQE.  Attention should be paid to the read and write barriers
       to ensure successful read and update of the head.

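       The snippet above consumes at most one CQE per invocation.  A common
       variant, sketched below against the globals and barrier macros of the
       example program in EXAMPLES, keeps looping until the ring is empty so
       that a single wakeup can reap any number of completions:

           /* Drain every CQE currently available; returns how many were seen. */
           int reap_cqes(void)
           {
               unsigned head = *cring_head;
               int reaped = 0;

               while (head != io_uring_smp_load_acquire(cring_tail)) {
                   struct io_uring_cqe *cqe = &cqes[head & *cring_mask];

                   if (cqe->res < 0)
                       fprintf(stderr, "request %llu failed: %s\n",
                               (unsigned long long) cqe->user_data,
                               strerror(-cqe->res));
                   head++;
                   reaped++;
               }
               /* Publish the new head so the kernel can reuse the slots. */
               io_uring_smp_store_release(cring_head, head);
               return reaped;
           }
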
   io_uring performance
       Because of the shared ring buffers between kernel and user space,
       io_uring can be a zero-copy system.  Copying buffers to and fro
       becomes necessary when system calls that transfer data between kernel
       and user space are involved.  But since the bulk of the communication
       in io_uring is via buffers shared between the kernel and user space,
       this huge performance overhead is completely avoided.

       While system calls may not seem like a significant overhead, in
       high-performance applications, making a lot of them will begin to
       matter.  Ideally, the workarounds the operating system has in place
       to deal with Spectre and Meltdown would not be needed, but
       unfortunately some of them sit around the system call interface,
       making system calls not as cheap as before on affected hardware.
       While newer hardware should not need these workarounds, hardware with
       these vulnerabilities can be expected to be in the wild for a long
       time.  While using synchronous programming interfaces or even when
       using asynchronous programming interfaces under Linux, there is at
       least one system call involved in the submission of each request.  In
       io_uring, on the other hand, you can batch several requests in one
       go, simply by queueing up multiple SQEs, each describing an I/O
       operation you want, and make a single call to io_uring_enter(2).
       This is possible due to io_uring's shared buffers based design.

       While this batching in itself can avoid the overhead associated with
       potentially multiple and frequent system calls, you can reduce even
       this overhead further with Submission Queue Polling, by having the
       kernel poll and pick up your SQEs for processing as you add them to
       the submission queue.  This avoids the io_uring_enter(2) call you
       need to make to tell the kernel to pick SQEs up.  For
       high-performance applications, this means even lower system call
       overhead.
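
       As a concrete illustration of this batching, the sketch below queues
       several read requests and submits them all with a single
       io_uring_enter(2) call.  It is not part of the example program: it
       reuses that program's ring globals, wrappers and BLOCK_SZ (see
       EXAMPLES), assumes the ring was set up with a queue depth of at least
       BATCH, and the buffer array is purely illustrative.

           #define BATCH 8

           char bufs[BATCH][BLOCK_SZ];

           /* Queue BATCH reads at consecutive offsets, then submit them all
            * with one system call. */
           int submit_read_batch(int fd, off_t start)
           {
               unsigned tail = *sring_tail;

               for (int i = 0; i < BATCH; i++) {
                   unsigned index = tail & *sring_mask;
                   struct io_uring_sqe *sqe = &sqes[index];

                   memset(sqe, 0, sizeof(*sqe));
                   sqe->opcode = IORING_OP_READ;
                   sqe->fd = fd;
                   sqe->addr = (unsigned long) bufs[i];
                   sqe->len = BLOCK_SZ;
                   sqe->off = start + (off_t) i * BLOCK_SZ;
                   sqe->user_data = i;      /* identifies the slice on completion */

                   sring_array[index] = index;
                   tail++;
               }
               io_uring_smp_store_release(sring_tail, tail);

               /* One io_uring_enter(2) call submits all BATCH requests. */
               return io_uring_enter(ring_fd, BATCH, 0, 0);
           }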

CONFORMING TO
       io_uring is Linux-specific.

EXAMPLES
       The following example uses io_uring to copy stdin to stdout.  Using
       shell redirection, you should be able to copy files with this
       example.  Because it uses a queue depth of only one, this example
       processes I/O requests one after the other.  It is purposefully kept
       this way to aid understanding.  In real-world scenarios, however,
       you'll want to have a larger queue depth to parallelize I/O request
       processing so as to gain the kind of performance benefits io_uring
       provides with its asynchronous processing of requests.

       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/stat.h>
       #include <sys/ioctl.h>
       #include <sys/syscall.h>
       #include <sys/mman.h>
       #include <sys/uio.h>
       #include <linux/fs.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <string.h>
       #include <stdatomic.h>

       #include <linux/io_uring.h>

       #define QUEUE_DEPTH 1
       #define BLOCK_SZ    1024

       /* Macros for barriers needed by io_uring */
       #define io_uring_smp_store_release(p, v)                        \
               atomic_store_explicit((_Atomic typeof(*(p)) *)(p), (v), \
                                     memory_order_release)
       #define io_uring_smp_load_acquire(p)                      \
               atomic_load_explicit((_Atomic typeof(*(p)) *)(p), \
                                    memory_order_acquire)

       int ring_fd;
       unsigned *sring_tail, *sring_mask, *sring_array,
                *cring_head, *cring_tail, *cring_mask;
       struct io_uring_sqe *sqes;
       struct io_uring_cqe *cqes;
       char buff[BLOCK_SZ];
       off_t offset;

       /*
        * System call wrappers provided since glibc does not yet
        * provide wrappers for io_uring system calls.
        * */

       int io_uring_setup(unsigned entries, struct io_uring_params *p)
       {
           return (int) syscall(__NR_io_uring_setup, entries, p);
       }

       int io_uring_enter(int ring_fd, unsigned int to_submit,
                          unsigned int min_complete, unsigned int flags)
       {
           return (int) syscall(__NR_io_uring_enter, ring_fd, to_submit,
                                min_complete, flags, NULL, 0);
       }

       int app_setup_uring(void) {
           struct io_uring_params p;
           void *sq_ptr, *cq_ptr;

           /* See io_uring_setup(2) for io_uring_params.flags you can set */
           memset(&p, 0, sizeof(p));
           ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
           if (ring_fd < 0) {
               perror("io_uring_setup");
               return 1;
           }

           /*
            * io_uring communication happens via 2 shared kernel-user space ring
            * buffers, which can be jointly mapped with a single mmap() call in
            * kernels >= 5.4.
            */

           int sring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
           int cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

           /* Rather than check for kernel version, the recommended way is to
            * check the features field of the io_uring_params structure, which is a
            * bitmask. If IORING_FEAT_SINGLE_MMAP is set, we can do away with the
            * second mmap() call to map in the completion ring separately.
            */
           if (p.features & IORING_FEAT_SINGLE_MMAP) {
               if (cring_sz > sring_sz)
                   sring_sz = cring_sz;
               cring_sz = sring_sz;
           }

           /* Map in the submission and completion queue ring buffers.
            * Kernels < 5.4 only map in the submission queue, though.
            */
           sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE,
                         ring_fd, IORING_OFF_SQ_RING);
           if (sq_ptr == MAP_FAILED) {
               perror("mmap");
               return 1;
           }

           if (p.features & IORING_FEAT_SINGLE_MMAP) {
               cq_ptr = sq_ptr;
           } else {
               /* Map in the completion queue ring buffer in older kernels separately */
               cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_POPULATE,
                             ring_fd, IORING_OFF_CQ_RING);
               if (cq_ptr == MAP_FAILED) {
                   perror("mmap");
                   return 1;
               }
           }
           /* Save useful fields for later easy reference */
           sring_tail = sq_ptr + p.sq_off.tail;
           sring_mask = sq_ptr + p.sq_off.ring_mask;
           sring_array = sq_ptr + p.sq_off.array;

           /* Map in the submission queue entries array */
           sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                       ring_fd, IORING_OFF_SQES);
           if (sqes == MAP_FAILED) {
               perror("mmap");
               return 1;
           }

           /* Save useful fields for later easy reference */
           cring_head = cq_ptr + p.cq_off.head;
           cring_tail = cq_ptr + p.cq_off.tail;
           cring_mask = cq_ptr + p.cq_off.ring_mask;
           cqes = cq_ptr + p.cq_off.cqes;

           return 0;
       }

       /*
        * Read from completion queue.
        * In this function, we read completion events from the completion queue.
        * We dequeue the CQE, update the head and return the result of the
        * operation.
        * */

       int read_from_cq() {
           struct io_uring_cqe *cqe;
           unsigned head;
           int res;

           /* Read barrier */
           head = io_uring_smp_load_acquire(cring_head);
           /*
            * Remember, this is a ring buffer. If head == tail, it means that the
            * buffer is empty.
            * */
           if (head == *cring_tail)
               return -1;

           /* Get the entry and save its result before releasing the slot */
           cqe = &cqes[head & (*cring_mask)];
           res = cqe->res;
           if (res < 0)
               fprintf(stderr, "Error: %s\n", strerror(abs(res)));

           head++;

           /* Write barrier so that updates to the head are made visible */
           io_uring_smp_store_release(cring_head, head);

           return res;
       }

       /*
        * Submit a read or a write request to the submission queue.
        * */

       int submit_to_sq(int fd, int op) {
           unsigned index, tail;

           /* Add our submission queue entry to the tail of the SQE ring buffer */
           tail = *sring_tail;
           index = tail & *sring_mask;
           struct io_uring_sqe *sqe = &sqes[index];
           /* Fill in the parameters required for the read or write operation */
           sqe->opcode = op;
           sqe->fd = fd;
           sqe->addr = (unsigned long) buff;
           if (op == IORING_OP_READ) {
               memset(buff, 0, sizeof(buff));
               sqe->len = BLOCK_SZ;
           }
           else {
               sqe->len = strlen(buff);
           }
           sqe->off = offset;

           sring_array[index] = index;
           tail++;

           /* Update the tail */
           io_uring_smp_store_release(sring_tail, tail);

           /*
            * Tell the kernel we have submitted events with the io_uring_enter()
            * system call. We also pass in the IORING_ENTER_GETEVENTS flag which
            * causes the io_uring_enter() call to wait until min_complete
            * (the 3rd param) events complete.
            * */
           int ret = io_uring_enter(ring_fd, 1, 1,
                                    IORING_ENTER_GETEVENTS);
           if (ret < 0) {
               perror("io_uring_enter");
               return -1;
           }

           return ret;
       }

       int main(int argc, char *argv[]) {
           int res;

           /* Setup io_uring for use */
           if (app_setup_uring()) {
               fprintf(stderr, "Unable to setup uring!\n");
               return 1;
           }

           /*
            * A while loop that reads from stdin and writes to stdout.
            * Breaks on EOF.
            */
           while (1) {
               /* Initiate read from stdin and wait for it to complete */
               submit_to_sq(STDIN_FILENO, IORING_OP_READ);
               /* Read completion queue entry */
               res = read_from_cq();
               if (res > 0) {
                   /* Read successful. Write to stdout. */
                   submit_to_sq(STDOUT_FILENO, IORING_OP_WRITE);
                   read_from_cq();
               } else if (res == 0) {
                   /* reached EOF */
                   break;
               }
               else if (res < 0) {
                   /* Error reading file */
                   fprintf(stderr, "Error: %s\n", strerror(abs(res)));
                   break;
               }
               offset += res;
           }

           return 0;
       }

SEE ALSO
       io_uring_enter(2) io_uring_register(2) io_uring_setup(2)



Linux                             2020-07-26                      IO_URING(7)