IO_URING_SETUP(2)          Linux Programmer's Manual         IO_URING_SETUP(2)

NAME
io_uring_setup - set up a context for performing asynchronous I/O

SYNOPSIS
#include <linux/io_uring.h>

int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION
The io_uring_setup() system call sets up a submission queue (SQ) and
completion queue (CQ) with at least entries entries, and returns a file
descriptor which can be used to perform subsequent operations on the
io_uring instance. The submission and completion queues are shared
between userspace and the kernel, which eliminates the need to copy
data when initiating and completing I/O.

The params structure, passed as the p argument, is used by the
application to pass options to the kernel, and by the kernel to convey
information about the ring buffers.

    struct io_uring_params {
        __u32 sq_entries;
        __u32 cq_entries;
        __u32 flags;
        __u32 sq_thread_cpu;
        __u32 sq_thread_idle;
        __u32 features;
        __u32 resv[4];
        struct io_sqring_offsets sq_off;
        struct io_cqring_offsets cq_off;
    };

The flags, sq_thread_cpu, and sq_thread_idle fields are used to
configure the io_uring instance. flags is a bit mask of 0 or more of the
following values ORed together:

IORING_SETUP_IOPOLL
    Perform busy-waiting for an I/O completion, as opposed to getting
    notifications via an asynchronous IRQ (Interrupt Request). The
    file system (if any) and block device must support polling in
    order for this to work. Busy-waiting provides lower latency, but
    may consume more CPU resources than interrupt-driven I/O.
    Currently, this feature is usable only on a file descriptor
    opened using the O_DIRECT flag. When a read or write is submitted
    to a polled context, the application must poll for completions on
    the CQ ring by calling io_uring_enter(2). It is illegal to mix
    and match polled and non-polled I/O on an io_uring instance.

IORING_SETUP_SQPOLL
    When this flag is specified, a kernel thread is created to
    perform submission queue polling. An io_uring instance configured
    in this way enables an application to issue I/O without ever
    context switching into the kernel. By using the submission queue
    to fill in new submission queue entries and watching for
    completions on the completion queue, the application can submit
    and reap I/Os without doing a single system call.

    If the kernel thread is idle for more than sq_thread_idle
    milliseconds, it will set the IORING_SQ_NEED_WAKEUP bit in the
    flags field of the struct io_sq_ring. When this happens, the
    application must call io_uring_enter(2) to wake the kernel
    thread. If I/O is kept busy, the kernel thread will never sleep.
    An application making use of this feature will need to guard the
    io_uring_enter(2) call with the following code sequence:

        /*
         * Ensure that the wakeup flag is read after the tail pointer has been
         * written.
         */
        smp_mb();
        if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
            io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

    where sq_ring is a submission queue ring set up using the struct
    io_sqring_offsets described below.

    To successfully use this feature, the application must register
    a set of files to be used for I/O through io_uring_register(2)
    using the IORING_REGISTER_FILES opcode. Failure to do so will
    result in submitted I/O failing with EBADF.

IORING_SETUP_SQ_AFF
    If this flag is specified, then the poll thread will be bound to
    the CPU set in the sq_thread_cpu field of the struct
    io_uring_params. This flag is only meaningful when
    IORING_SETUP_SQPOLL is specified.

IORING_SETUP_CQSIZE
    Create the completion queue with struct
    io_uring_params.cq_entries entries. The value must be greater
    than entries, and may be rounded up to the next power of two.

If no flags are specified, the io_uring instance is set up for
interrupt-driven I/O. I/O may be submitted using io_uring_enter(2) and
can be reaped by polling the completion queue.

The resv array must be initialized to zero.
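
The setup-side fields described above can be prepared along the following lines. This is a minimal sketch rather than a definitive implementation: init_params is a hypothetical helper name, and the code assumes kernel headers recent enough to define IORING_SETUP_CQSIZE.

```c
#include <string.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper: prepare params before calling io_uring_setup().
 * Zeroing the whole structure satisfies the requirement that resv be
 * zero. A non-zero cq_entries requests an explicit completion queue
 * size through IORING_SETUP_CQSIZE.
 */
static void init_params(struct io_uring_params *p, unsigned int cq_entries)
{
    memset(p, 0, sizeof(*p));
    if (cq_entries != 0) {
        p->flags |= IORING_SETUP_CQSIZE;
        p->cq_entries = cq_entries;
    }
}
```

Calling init_params(&p, 0) leaves flags at zero, which selects the interrupt-driven default described above.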

features is filled in by the kernel and specifies the features
supported by the current kernel version.

IORING_FEAT_SINGLE_MMAP
    If this flag is set, the SQ and CQ rings can be mapped with a
    single mmap(2) call. The SQEs must still be allocated separately.
    This brings the necessary mmap(2) calls down from three to two.

IORING_FEAT_NODROP
    If this flag is set, io_uring supports never dropping completion
    events. If a completion event occurs and the CQ ring is full,
    the kernel stores the event internally until such a time that
    the CQ ring has room for more entries. If this overflow
    condition is entered, attempting to submit more I/O will fail
    with the -EBUSY error value if the kernel cannot flush the
    overflowed events to the CQ ring. If this happens, the
    application must reap events from the CQ ring and attempt the
    submit again.

IORING_FEAT_SUBMIT_STABLE
    If this flag is set, applications can be certain that any data
    for async offload has been consumed when the kernel has consumed
    the SQE.

IORING_FEAT_RW_CUR_POS
    If this flag is set, applications can specify offset == -1 with
    IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and
    IORING_OP_{READ,WRITE} to mean the current file position, which
    behaves like preadv2(2) and pwritev2(2) with offset == -1. The
    request uses (and updates) the current file position. The caveat
    is that if the application has multiple reads or writes in
    flight, the end result will not be as expected. This is similar
    to threads sharing a file descriptor and doing I/O using the
    current file position.

IORING_FEAT_CUR_PERSONALITY
    If this flag is set, then io_uring guarantees that both sync and
    async execution of a request assume the credentials of the task
    that called io_uring_enter(2) to queue the requests. If this
    flag isn't set, then requests are issued with the credentials of
    the task that originally registered the io_uring. If only one
    task is using a ring, then this flag doesn't matter as the
    credentials will always be the same. Note that this is the
    default behavior; tasks can still register different
    personalities through io_uring_register(2) with
    IORING_REGISTER_PERSONALITY and specify the personality to use
    in the SQE.
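
An application would typically test the features mask after io_uring_setup() returns. The following is a small sketch of such checks; the helper names ring_mmap_calls and ring_never_drops are hypothetical, and the code assumes kernel headers that define the IORING_FEAT_* constants listed above.

```c
#include <linux/io_uring.h>

/*
 * Hypothetical helper: number of mmap(2) calls needed to map the SQ
 * ring, the CQ ring, and the SQE array, given the features mask that
 * the kernel filled in.
 */
static int ring_mmap_calls(unsigned int features)
{
    return (features & IORING_FEAT_SINGLE_MMAP) ? 2 : 3;
}

/*
 * Hypothetical helper: non-zero if the kernel keeps completions that
 * do not fit in the CQ ring instead of dropping them.
 */
static int ring_never_drops(unsigned int features)
{
    return (features & IORING_FEAT_NODROP) != 0;
}
```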

The rest of the fields in the struct io_uring_params are filled in by
the kernel, and provide the information necessary to memory map the
submission queue, completion queue, and the array of submission queue
entries. sq_entries specifies the number of submission queue entries
allocated. sq_off describes the offsets of various ring buffer fields:

    struct io_sqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 flags;
        __u32 dropped;
        __u32 array;
        __u32 resv[3];
    };

Taken together, sq_entries and sq_off provide all of the information
necessary for accessing the submission queue ring buffer and the
submission queue entry array. The submission queue can be mapped with a
call like:

    ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
               PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
               ring_fd, IORING_OFF_SQ_RING);

where sq_off is the io_sqring_offsets structure, and ring_fd is the
file descriptor returned from io_uring_setup(2). The addition of
sq_off.array to the length of the region accounts for the fact that the
ring is located at the end of the data structure. As an example, the
ring buffer head pointer can be accessed by adding sq_off.head to the
address returned from mmap(2):

    head = ptr + sq_off.head;
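
The same offset arithmetic applies to every field of the SQ ring. The sketch below computes the mapping length used in the mmap(2) call and resolves each field from the mapping base. The struct sq_ring_view type and both function names are hypothetical and exist only for illustration; the byte-wise pointer arithmetic avoids relying on arithmetic on void pointers, which is a compiler extension.

```c
#include <stddef.h>
#include <linux/io_uring.h>

/* Hypothetical container for pointers into a mapped SQ ring. */
struct sq_ring_view {
    unsigned int *head;
    unsigned int *tail;
    unsigned int *ring_mask;
    unsigned int *ring_entries;
    unsigned int *flags;
    unsigned int *dropped;
    unsigned int *array;
};

/* Length of the SQ ring mapping: the index array sits at the end. */
static size_t sq_ring_map_len(const struct io_uring_params *p)
{
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/* Resolve each field by adding its byte offset to the mapping base. */
static void sq_ring_resolve(struct sq_ring_view *v, void *base,
                            const struct io_sqring_offsets *off)
{
    char *b = base;

    v->head         = (unsigned int *)(b + off->head);
    v->tail         = (unsigned int *)(b + off->tail);
    v->ring_mask    = (unsigned int *)(b + off->ring_mask);
    v->ring_entries = (unsigned int *)(b + off->ring_entries);
    v->flags        = (unsigned int *)(b + off->flags);
    v->dropped      = (unsigned int *)(b + off->dropped);
    v->array        = (unsigned int *)(b + off->array);
}
```

Here base would be the address returned from mmap(2), and off would be the sq_off member of the params structure filled in by the kernel.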

The flags field is used by the kernel to communicate state information
to the application. Currently, it is used to inform the application
when a call to io_uring_enter(2) is necessary. See the documentation
for the IORING_SETUP_SQPOLL flag above. The dropped member is
incremented for each invalid submission queue entry encountered in the
ring buffer.
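
In userspace, the smp_mb() guard from the IORING_SETUP_SQPOLL description has to be expressed with compiler facilities. One possible sketch, assuming a GCC or Clang compiler that provides the __atomic builtins, is shown here; sq_needs_wakeup is a hypothetical helper name.

```c
#include <linux/io_uring.h>

/*
 * Hypothetical helper: returns non-zero if the SQ poll thread has gone
 * idle and must be woken with io_uring_enter(2). sq_flags points at
 * the flags word located through sq_off.flags. The full fence orders
 * the caller's earlier tail update before the load of the flags word,
 * which is the role smp_mb() plays in the guard sequence.
 */
static int sq_needs_wakeup(const unsigned int *sq_flags)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    return (__atomic_load_n(sq_flags, __ATOMIC_RELAXED)
            & IORING_SQ_NEED_WAKEUP) != 0;
}
```

A caller would issue io_uring_enter(2) with IORING_ENTER_SQ_WAKEUP only when this helper returns non-zero.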

The head and tail track the ring buffer state. The tail is incremented
by the application when submitting new I/O, and the head is incremented
by the kernel when the I/O has been successfully submitted. Determining
the index of the head or tail into the ring is accomplished by applying
a mask:

    index = tail & ring_mask;
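
The masking step works because head and tail are free-running counters that are only reduced to a slot index when the ring is accessed. The sketch below illustrates this under the assumption that the ring size is a power of two, so that ring_mask equals ring_entries minus one, and that unsigned int is 32 bits wide. Both helper names are hypothetical.

```c
/* Slot index for a free-running head or tail counter. */
static unsigned int ring_index(unsigned int counter, unsigned int ring_mask)
{
    return counter & ring_mask;
}

/*
 * Number of entries between head and tail. Unsigned subtraction stays
 * correct even after the counters wrap around.
 */
static unsigned int ring_pending(unsigned int head, unsigned int tail)
{
    return tail - head;
}
```

With an 8-entry ring (ring_mask of 7), a tail value of 9 refers to slot 1.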

The array of submission queue entries is mapped with:

    sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                     PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                     ring_fd, IORING_OFF_SQES);

The completion queue is described by cq_entries and cq_off shown here:

    struct io_cqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 overflow;
        __u32 cqes;
        __u32 flags;
        __u32 resv[3];
    };

The completion queue is simpler, since the entries are not separated
from the queue itself, and can be mapped with:

    ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
               PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
               IORING_OFF_CQ_RING);
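
Once the completion queue is mapped, reaping completions amounts to walking from head to tail. The following sketch assumes the usual division of labor for the CQ ring, in which the kernel advances the tail and the application advances the head, and it assumes a GCC or Clang compiler for the __atomic builtins. The function name cq_drain and the sum parameter are hypothetical; summing res stands in for whatever per-completion handling an application needs.

```c
#include <linux/io_uring.h>

/*
 * Hypothetical helper: consume every completion currently in the CQ
 * ring. head, tail, and cqes point into the CQ mapping (base plus
 * cq_off.head, cq_off.tail, and cq_off.cqes). Returns the number of
 * entries consumed and adds each result code to *sum.
 */
static unsigned int cq_drain(unsigned int *head, const unsigned int *tail,
                             unsigned int ring_mask,
                             const struct io_uring_cqe *cqes, long *sum)
{
    unsigned int h = *head;
    /* Acquire: see the entries published before the tail update. */
    unsigned int t = __atomic_load_n(tail, __ATOMIC_ACQUIRE);
    unsigned int n = 0;

    while (h != t) {
        const struct io_uring_cqe *cqe = &cqes[h & ring_mask];

        *sum += cqe->res;
        h++;
        n++;
    }
    /* Release: the consumed slots may now be reused. */
    __atomic_store_n(head, h, __ATOMIC_RELEASE);
    return n;
}
```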

Closing the file descriptor returned by io_uring_setup(2) will free all
resources associated with the io_uring context.

RETURN VALUE
io_uring_setup(2) returns a new file descriptor on success. The
application may then provide the file descriptor in a subsequent mmap(2)
call to map the submission and completion queues, or to the
io_uring_register(2) or io_uring_enter(2) system calls.

On error, -1 is returned and errno is set appropriately.
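
A call and its error check might look like the sketch below. It assumes that the C library in use provides no io_uring_setup() wrapper, so the call goes through syscall(2), and that the installed headers define __NR_io_uring_setup. The wrapper name setup_ring is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/*
 * Hypothetical wrapper: zero params, then invoke the system call
 * through syscall(2). Returns a ring file descriptor on success, or
 * -1 with errno set on failure.
 */
static int setup_ring(unsigned int entries, struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));
    return (int)syscall(__NR_io_uring_setup, entries, p);
}
```

On success, the caller owns the returned descriptor and releases the ring with close(2). On failure, errno holds one of the error codes for this system call, or another value such as ENOSYS on kernels without io_uring support.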

ERRORS
EFAULT
    params is outside your accessible address space.

EINVAL
    The resv array contains non-zero data; p.flags contains an
    unsupported flag; entries is out of bounds; IORING_SETUP_SQ_AFF
    was specified but IORING_SETUP_SQPOLL was not; or
    IORING_SETUP_CQSIZE was specified but io_uring_params.cq_entries
    was invalid.

EMFILE
    The per-process limit on the number of open file descriptors has
    been reached (see the description of RLIMIT_NOFILE in
    getrlimit(2)).

ENFILE
    The system-wide limit on the total number of open files has been
    reached.

ENOMEM
    Insufficient kernel resources are available.

EPERM
    IORING_SETUP_SQPOLL was specified, but the effective user ID of
    the caller did not have sufficient privileges.

SEE ALSO
io_uring_register(2), io_uring_enter(2)

Linux                             2019-01-29                 IO_URING_SETUP(2)