IO_URING_SETUP(2)          Linux Programmer's Manual          IO_URING_SETUP(2)

NAME
io_uring_setup - set up a context for performing asynchronous I/O

SYNOPSIS
#include <linux/io_uring.h>

int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION
The io_uring_setup() system call sets up a submission queue (SQ) and
completion queue (CQ) with at least entries entries, and returns a file
descriptor which can be used to perform subsequent operations on the
io_uring instance. The submission and completion queues are shared
between userspace and the kernel, which eliminates the need to copy data
when initiating and completing I/O.

The p argument is used by the application to pass options to the kernel,
and by the kernel to convey information about the ring buffers.
23
    struct io_uring_params {
        __u32 sq_entries;
        __u32 cq_entries;
        __u32 flags;
        __u32 sq_thread_cpu;
        __u32 sq_thread_idle;
        __u32 features;
        __u32 wq_fd;
        __u32 resv[3];
        struct io_sqring_offsets sq_off;
        struct io_cqring_offsets cq_off;
    };
35
36 The flags, sq_thread_cpu, and sq_thread_idle fields are used to config‐
37 ure the io_uring instance. flags is a bit mask of 0 or more of the
38 following values ORed together:
39
40 IORING_SETUP_IOPOLL
41 Perform busy-waiting for an I/O completion, as opposed to get‐
42 ting notifications via an asynchronous IRQ (Interrupt Request).
43 The file system (if any) and block device must support polling
44 in order for this to work. Busy-waiting provides lower latency,
45 but may consume more CPU resources than interrupt driven I/O.
46 Currently, this feature is usable only on a file descriptor
47 opened using the O_DIRECT flag. When a read or write is submit‐
48 ted to a polled context, the application must poll for comple‐
49 tions on the CQ ring by calling io_uring_enter(2). It is ille‐
50 gal to mix and match polled and non-polled I/O on an io_uring
51 instance.
52
53
54 IORING_SETUP_SQPOLL
55 When this flag is specified, a kernel thread is created to per‐
56 form submission queue polling. An io_uring instance configured
57 in this way enables an application to issue I/O without ever
58 context switching into the kernel. By using the submission
59 queue to fill in new submission queue entries and watching for
60 completions on the completion queue, the application can submit
61 and reap I/Os without doing a single system call.
62
63 If the kernel thread is idle for more than sq_thread_idle mil‐
64 liseconds, it will set the IORING_SQ_NEED_WAKEUP bit in the
65 flags field of the struct io_sq_ring. When this happens, the
66 application must call io_uring_enter(2) to wake the kernel
67 thread. If I/O is kept busy, the kernel thread will never
68 sleep. An application making use of this feature will need to
69 guard the io_uring_enter(2) call with the following code se‐
70 quence:
71
    /*
     * Ensure that the wakeup flag is read after the tail pointer
     * has been written. It's important to use memory load acquire
     * semantics for the flags read, as otherwise the application
     * and the kernel might not agree on the consistency of the
     * wakeup flag.
     */
    unsigned flags = atomic_load_explicit(sq_ring->flags,
                                          memory_order_acquire);
    if (flags & IORING_SQ_NEED_WAKEUP)
        io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
82
where sq_ring is a submission queue ring set up using the struct
io_sqring_offsets described below.
85
Before version 5.11 of the Linux kernel, an application using this
feature had to register the set of files to be used for I/O through
io_uring_register(2) with the IORING_REGISTER_FILES opcode. Failure
to do so caused submitted I/O to fail with EBADF. In version 5.11
and later, it is no longer necessary to register files to use this
feature; this relaxation can be detected by the
IORING_FEAT_SQPOLL_NONFIXED feature flag. Version 5.11 also allows
using this feature as non-root, if the user has the CAP_SYS_NICE
capability.
95
96 IORING_SETUP_SQ_AFF
If this flag is specified, then the poll thread will be bound to
the CPU specified in the sq_thread_cpu field of struct
io_uring_params. This flag is only meaningful when
IORING_SETUP_SQPOLL is specified.
101
102 IORING_SETUP_CQSIZE
103 Create the completion queue with struct io_uring_params.cq_en‐
104 tries entries. The value must be greater than entries, and may
105 be rounded up to the next power-of-two.
106
107 IORING_SETUP_CLAMP
If this flag is specified, and if entries exceeds
IORING_MAX_ENTRIES, then entries will be clamped at
IORING_MAX_ENTRIES. If the flag IORING_SETUP_CQSIZE is set, and if
the value of struct io_uring_params.cq_entries exceeds
IORING_MAX_CQ_ENTRIES, then it will be clamped at
IORING_MAX_CQ_ENTRIES.
113
114 IORING_SETUP_ATTACH_WQ
115 This flag should be set in conjunction with struct io_ur‐
116 ing_params.wq_fd being set to an existing io_uring ring file de‐
117 scriptor. When set, the io_uring instance being created will
118 share the asynchronous worker thread backend of the specified
119 io_uring ring, rather than create a new separate thread pool.
120
121 IORING_SETUP_R_DISABLED
122 If this flag is specified, the io_uring ring starts in a dis‐
123 abled state. In this state, restrictions can be registered, but
124 submissions are not allowed. See io_uring_register(2) for de‐
125 tails on how to enable the ring. Available since 5.10.
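
The wakeup check described under IORING_SETUP_SQPOLL above can be
sketched with C11 atomics. This is only an illustration, not the
kernel's or liburing's code: the helper name sq_publish_needs_wakeup is
hypothetical, and ktail and kflags are assumed to point at the tail and
flags words inside the mapped SQ ring. The sketch uses a full fence
between the tail store and the flags load, which is one way to keep the
two from being reordered.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper: publish a new SQ tail, then report whether the
 * kernel poll thread has gone idle and must be woken.
 * ktail and kflags are assumed to be ptr + sq_off.tail and
 * ptr + sq_off.flags in the mapped SQ ring.
 */
static int sq_publish_needs_wakeup(_Atomic unsigned *ktail,
                                   _Atomic unsigned *kflags,
                                   unsigned new_tail)
{
    assert(ktail != NULL && kflags != NULL);

    /* Make the filled-in entries visible before the new tail. */
    atomic_store_explicit(ktail, new_tail, memory_order_release);

    /* Keep the flags load below from moving ahead of the tail store. */
    atomic_thread_fence(memory_order_seq_cst);

    unsigned flags = atomic_load_explicit(kflags, memory_order_relaxed);
    return (flags & IORING_SQ_NEED_WAKEUP) != 0;
}
```

When the helper returns nonzero, the application would call
io_uring_enter(2) with IORING_ENTER_SQ_WAKEUP; otherwise no system call
is needed.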
126
If no flags are specified, the io_uring instance is set up for
interrupt-driven I/O. I/O may be submitted using io_uring_enter(2) and
can be reaped by polling the completion queue.
130
131 The resv array must be initialized to zero.
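
As a minimal sketch of the points above, the following zeroes the
parameter block (so that resv is zero), sets the requested flags, and
invokes the system call through syscall(2). It assumes the C library
provides no wrapper for io_uring_setup() and that the installed kernel
headers define __NR_io_uring_setup; the helper name ring_setup is
hypothetical.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the ring file descriptor on success,
 * or -1 with errno set on failure. On success the kernel has filled
 * in sq_entries, cq_entries, features, sq_off and cq_off in *p.
 */
static int ring_setup(unsigned entries, unsigned flags,
                      struct io_uring_params *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));   /* reserved fields must be zero */
    p->flags = flags;
    return (int)syscall(__NR_io_uring_setup, entries, p);
}
```

A caller might use ring_setup(8, 0, &params) for a small
interrupt-driven ring and then inspect params.features before mapping
the queues.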
132
features is filled in by the kernel and is a bit mask of the features
supported by the running kernel:
135
136 IORING_FEAT_SINGLE_MMAP
137 If this flag is set, the two SQ and CQ rings can be mapped with
138 a single mmap(2) call. The SQEs must still be allocated sepa‐
139 rately. This brings the necessary mmap(2) calls down from three
140 to two.
141
142 IORING_FEAT_NODROP
If this flag is set, io_uring supports never dropping completion
events. If a completion event occurs and the CQ ring is full,
the kernel stores the event internally until the CQ ring has
room for more entries. If this overflow condition is entered,
attempting to submit more I/O will fail with the -EBUSY error
value if the kernel cannot flush the overflowed events to the
CQ ring. If this happens, the application must reap events
from the CQ ring and attempt the submit again.
151
152 IORING_FEAT_SUBMIT_STABLE
153 If this flag is set, applications can be certain that any data
154 for async offload has been consumed when the kernel has consumed
155 the SQE.
156
157 IORING_FEAT_RW_CUR_POS
If this flag is set, applications can specify offset == -1 with
IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and
IORING_OP_{READ,WRITE} to mean the current file position, which
behaves like preadv2(2) and pwritev2(2) with offset == -1. The
request uses (and updates) the current file position. The caveat
is that if the application has multiple reads or writes in
flight, the end result may not be what the application expects.
This is similar to threads sharing a file descriptor and doing
I/O using the current file position.
167
168 IORING_FEAT_CUR_PERSONALITY
If this flag is set, then io_uring guarantees that both sync and
async execution of a request assume the credentials of the task
that called io_uring_enter(2) to queue the requests. If this
flag isn't set, then requests are issued with the credentials of
the task that originally registered the io_uring. If only one
task is using a ring, then this flag doesn't matter, as the
credentials will always be the same. Note that this is only the
default behavior: tasks can still register different
personalities through io_uring_register(2) with
IORING_REGISTER_PERSONALITY and specify the personality to use
in the SQE.
179
180 IORING_FEAT_FAST_POLL
If this flag is set, then io_uring supports using an internal
poll mechanism to drive data/space readiness. This means that
requests that cannot read or write data to a file no longer need
to be punted to an async thread for handling; instead, they will
begin operation when the file is ready. This is similar to doing
poll + read/write in userspace, but eliminates the need to do
so. If this flag is set, requests waiting on space/data consume
far fewer resources, as they are not blocking a thread.
189
190 IORING_FEAT_POLL_32BITS
If this flag is set, the IORING_OP_POLL_ADD command accepts the
full 32-bit range of epoll-based flags, most notably
EPOLLEXCLUSIVE, which allows exclusive (waking single waiters)
behavior.
194
195 IORING_FEAT_SQPOLL_NONFIXED
196 If this flag is set, the IORING_SETUP_SQPOLL feature no longer
197 requires the use of fixed files. Any normal file descriptor can
198 be used for IO commands without needing registration.
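
Since features is a plain bit mask, an application can test it before
relying on any of the behaviors listed above. The sketch below is
illustrative only; the helper names ring_has_feature and
ring_mmap_calls are hypothetical. The second helper encodes the
statement under IORING_FEAT_SINGLE_MMAP that the number of mmap(2)
calls drops from three to two.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stddef.h>

/* Hypothetical helper: true if every bit in feat is set in p->features. */
static int ring_has_feature(const struct io_uring_params *p, unsigned feat)
{
    assert(p != NULL);
    return (p->features & feat) == feat;
}

/*
 * Hypothetical helper: number of mmap(2) calls needed to map the SQ
 * ring, the CQ ring and the SQE array. With IORING_FEAT_SINGLE_MMAP
 * the two rings share one mapping; the SQE array is always separate.
 */
static int ring_mmap_calls(const struct io_uring_params *p)
{
    return ring_has_feature(p, IORING_FEAT_SINGLE_MMAP) ? 2 : 3;
}
```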
199
200
201 The rest of the fields in the struct io_uring_params are filled in by
202 the kernel, and provide the information necessary to memory map the
203 submission queue, completion queue, and the array of submission queue
204 entries. sq_entries specifies the number of submission queue entries
205 allocated. sq_off describes the offsets of various ring buffer fields:
206
    struct io_sqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 flags;
        __u32 dropped;
        __u32 array;
        __u32 resv[3];
    };
217
218 Taken together, sq_entries and sq_off provide all of the information
219 necessary for accessing the submission queue ring buffer and the sub‐
220 mission queue entry array. The submission queue can be mapped with a
221 call like:
222
    ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
               PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
               ring_fd, IORING_OFF_SQ_RING);
226
where sq_off is the io_sqring_offsets structure, and ring_fd is the
file descriptor returned from io_uring_setup(). The addition of
sq_off.array to the length of the region accounts for the fact that the
ring is located at the end of the data structure. As an example, the
ring buffer head pointer can be accessed by adding sq_off.head to the
address returned from mmap(2):
233
    head = ptr + sq_off.head;
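
The length calculation and the pointer arithmetic above can be written
as two small helpers. This is a sketch under stated assumptions: the
names sq_ring_bytes and ring_field are hypothetical, and ring_ptr is
assumed to be the address returned by the mmap(2) call shown earlier.
The cast through char * avoids relying on arithmetic on void pointers,
which is a compiler extension.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stddef.h>

/* Hypothetical helper: length of the SQ ring mapping in bytes. */
static size_t sq_ring_bytes(const struct io_uring_params *p)
{
    assert(p != NULL);
    return p->sq_off.array + (size_t)p->sq_entries * sizeof(__u32);
}

/*
 * Hypothetical helper: address of a 32-bit ring field that lives
 * off bytes into the mapping, for example off = sq_off.head.
 */
static unsigned *ring_field(void *ring_ptr, unsigned off)
{
    assert(ring_ptr != NULL);
    return (unsigned *)((char *)ring_ptr + off);
}
```

For instance, ring_field(ptr, p.sq_off.tail) would yield the location
of the SQ tail.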
235
236 The flags field is used by the kernel to communicate state information
237 to the application. Currently, it is used to inform the application
238 when a call to io_uring_enter(2) is necessary. See the documentation
239 for the IORING_SETUP_SQPOLL flag above. The dropped member is incre‐
240 mented for each invalid submission queue entry encountered in the ring
241 buffer.
242
243 The head and tail track the ring buffer state. The tail is incremented
244 by the application when submitting new I/O, and the head is incremented
245 by the kernel when the I/O has been successfully submitted. Determin‐
246 ing the index of the head or tail into the ring is accomplished by ap‐
247 plying a mask:
248
    index = tail & ring_mask;
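
The masking works because head and tail are free-running 32-bit
counters and the ring size is a power of two, so ring_mask equals
ring_entries - 1. The following sketch, with hypothetical helper names,
shows the index calculation and how the number of free submission slots
can be derived from the same counters even after they wrap around.

```c
#include <assert.h>

/* Hypothetical helper: slot index for a free-running head or tail. */
static unsigned ring_index(unsigned pos, unsigned ring_mask)
{
    /* A valid mask has the form 2^n - 1. */
    assert((ring_mask & (ring_mask + 1)) == 0);
    return pos & ring_mask;
}

/*
 * Hypothetical helper: free SQ slots. Unsigned subtraction gives the
 * number of entries in use even when tail has wrapped past zero.
 */
static unsigned sq_free_slots(unsigned head, unsigned tail,
                              unsigned ring_entries)
{
    return ring_entries - (tail - head);
}
```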
250
251 The array of submission queue entries is mapped with:
252
    sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                     PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                     ring_fd, IORING_OFF_SQES);
256
257 The completion queue is described by cq_entries and cq_off shown here:
258
    struct io_cqring_offsets {
        __u32 head;
        __u32 tail;
        __u32 ring_mask;
        __u32 ring_entries;
        __u32 overflow;
        __u32 cqes;
        __u32 flags;
        __u32 resv[3];
    };
269
270 The completion queue is simpler, since the entries are not separated
271 from the queue itself, and can be mapped with:
272
    ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
               PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
               IORING_OFF_CQ_RING);
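
Once the completion queue is mapped, reaping completions amounts to
walking the entries between head and tail and then advancing head. The
sketch below is an assumption-laden illustration rather than a
reference implementation: cq_drain is a hypothetical name, khead and
ktail are assumed to be ptr + cq_off.head and ptr + cq_off.tail, and
cqes is assumed to be ptr + cq_off.cqes. It merely sums the res field
of each entry to show the traversal.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper: consume every available completion, add each
 * cqe->res to *res_sum, and return the number of entries consumed.
 */
static unsigned cq_drain(_Atomic unsigned *khead, _Atomic unsigned *ktail,
                         unsigned ring_mask,
                         const struct io_uring_cqe *cqes, long *res_sum)
{
    assert(khead != NULL && ktail != NULL);
    assert(cqes != NULL && res_sum != NULL);

    unsigned head = atomic_load_explicit(khead, memory_order_relaxed);
    /* Acquire: see the entries the kernel wrote before moving tail. */
    unsigned tail = atomic_load_explicit(ktail, memory_order_acquire);
    unsigned n = 0;

    while (head != tail) {
        const struct io_uring_cqe *cqe = &cqes[head & ring_mask];
        *res_sum += cqe->res;
        head++;
        n++;
    }

    /* Release: hand the consumed slots back to the kernel. */
    atomic_store_explicit(khead, head, memory_order_release);
    return n;
}
```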
276
277 Closing the file descriptor returned by io_uring_setup(2) will free all
278 resources associated with the io_uring context.

RETURN VALUE
io_uring_setup() returns a new file descriptor on success. The
application may then provide the file descriptor in a subsequent mmap(2)
call to map the submission and completion queues, or to the
io_uring_register(2) or io_uring_enter(2) system calls.
285
286 On error, -1 is returned and errno is set appropriately.

ERRORS
EFAULT p points outside your accessible address space.
290
EINVAL The resv array contains non-zero data; p->flags contains an
       unsupported flag; entries is out of bounds; IORING_SETUP_SQ_AFF
       was specified but IORING_SETUP_SQPOLL was not; or
       IORING_SETUP_CQSIZE was specified but
       io_uring_params.cq_entries was invalid.
296
297 EMFILE The per-process limit on the number of open file descriptors has
298 been reached (see the description of RLIMIT_NOFILE in getr‐
299 limit(2)).
300
301 ENFILE The system-wide limit on the total number of open files has been
302 reached.
303
304 ENOMEM Insufficient kernel resources are available.
305
306 EPERM IORING_SETUP_SQPOLL was specified, but the effective user ID of
307 the caller did not have sufficient privileges.

SEE ALSO
io_uring_register(2), io_uring_enter(2)



Linux                             2019-01-29                 IO_URING_SETUP(2)