RDS zerocopy(7)     Miscellaneous Information Manual     RDS zerocopy(7)

NAME
RDS zerocopy - Interface for RDMA over RDS

DESCRIPTION
This manual page describes the zerocopy interface of RDS, which was
added in RDSv3. For a description of the basic RDS interface, please
refer to rds(7).

The principal mode of operation for RDS zerocopy is as follows: one
participant (the client) wishes to initiate a direct transfer to or from
some area of memory in its process address space. This memory does not
have to be aligned.

The client obtains a handle for this region of memory, and passes it to
the other participant (the server). This handle is called the RDMA
cookie. To the application, the cookie is an opaque 64-bit data type.

The client sends this handle to the server application, along with
other details of the RDMA request (such as which data to transfer to
that memory area). Throughout the following discussion, we will refer
to this message as the RDMA request.

The server uses this RDMA cookie to initiate the requested RDMA
transfer. The RDMA transfer is combined atomically with a normal RDS
message, which is delivered to the client. This message is called the
RDMA ACK in the rest of this page. Atomic in this context means that
either both the RDMA succeeds and the RDMA ACK is delivered, or neither
succeeds.

Thus, when the client receives the RDMA ACK, it knows that the RDMA has
completed successfully. It can then release the RDMA cookie for this
memory region, if it wishes to.

RDMA operations are not reliable in the sense that, unlike normal RDS
messages, RDS RDMA operations may fail and get dropped.

INTERFACE
The interface is currently based on control messages (ancillary data)
sent or received via the sendmsg(2) and recvmsg(2) system calls.
Optionally, an older interface can be used that is based on the
setsockopt(2) system call. However, we recommend using control messages,
as this reduces the number of system calls required.

Control message interface
With the control message interface, the RDMA cookie is passed to the
server out-of-band, included in an extension header attached to the RDS
message.

The following outlines the mode of operation; the data types used will
be specified in detail in a subsequent section.

Initially, the client will send RDMA requests along with an
RDS_CMSG_RDMA_MAP control message. The control message contains the
address and length of the memory region for which to obtain a handle,
some flags, and a pointer to a memory location (in the caller's address
space) where the kernel will store the RDMA cookie.

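The client side of this step can be sketched as follows. This is only
an illustration: attach_rdma_map() is a hypothetical helper, the struct
layouts are copied from the DATA TYPES section of this page (with
uint64_t standing in for u_int64_t), and the numeric fallback values
for SOL_RDS and RDS_CMSG_RDMA_MAP are assumptions that should be taken
from the installed RDS header instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallback values: assumptions, prefer the definitions in the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_MAP
#define RDS_CMSG_RDMA_MAP 3
#endif

/* Layouts as documented in this page. */
struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/*
 * Hypothetical helper: attach an RDS_CMSG_RDMA_MAP control message to msg.
 * ctl must be suitably aligned and hold at least
 * CMSG_SPACE(sizeof(struct rds_get_mr_args)) bytes.
 * Returns 0 on success, -1 if the control buffer is too small.
 */
static int attach_rdma_map(struct msghdr *msg, void *ctl, size_t ctl_len,
                           void *buf, size_t len, uint64_t flags,
                           uint64_t *cookie)
{
    struct rds_get_mr_args args;
    struct cmsghdr *cmsg;

    if (ctl_len < CMSG_SPACE(sizeof(args)))
        return -1;

    memset(ctl, 0, ctl_len);
    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)buf;        /* region to register */
    args.vec.bytes = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;  /* where the cookie goes */
    args.flags = flags;

    msg->msg_control = ctl;
    msg->msg_controllen = CMSG_SPACE(sizeof(args));

    cmsg = CMSG_FIRSTHDR(msg);
    assert(cmsg != NULL);
    cmsg->cmsg_level = SOL_RDS;
    cmsg->cmsg_type = RDS_CMSG_RDMA_MAP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
    return 0;
}
```

The caller would then fill in msg_name and msg_iov as for any RDS
message and pass the msghdr to sendmsg(2); the cookie variable has to
stay valid until that call returns.
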
Alternatively, if the application has already obtained an RDMA cookie
for the memory range it wants to RDMA to/from, it can hand this cookie
to the kernel using the RDS_CMSG_RDMA_DEST control message.

Either way, the kernel will include the resulting RDMA cookie in an
extension header that is transmitted as part of the RDMA request to the
server.

When the server receives the RDMA request, the kernel will deliver the
cookie wrapped inside an RDS_CMSG_RDMA_DEST control message.

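On the server, picking the cookie out of a received message might look
like the sketch below. find_rdma_dest() is a hypothetical helper, and
the fallback values for SOL_RDS and RDS_CMSG_RDMA_DEST are assumptions;
the walk over the ancillary data uses the standard CMSG_FIRSTHDR and
CMSG_NXTHDR macros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallback values: assumptions, prefer the definitions in the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_DEST
#define RDS_CMSG_RDMA_DEST 2
#endif

typedef uint64_t rds_rdma_cookie_t;

/*
 * Hypothetical helper: scan the control messages filled in by recvmsg(2).
 * Returns 1 and stores the cookie if an RDS_CMSG_RDMA_DEST message is
 * present, 0 otherwise.
 */
static int find_rdma_dest(struct msghdr *msg, rds_rdma_cookie_t *cookie)
{
    struct cmsghdr *cmsg;

    assert(msg != NULL && cookie != NULL);
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != SOL_RDS ||
            cmsg->cmsg_type != RDS_CMSG_RDMA_DEST)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
            continue;  /* payload too short to hold a cookie */
        memcpy(cookie, CMSG_DATA(cmsg), sizeof(*cookie));
        return 1;
    }
    return 0;
}
```

The payload is copied with memcpy rather than read through a cast
pointer, which avoids any alignment assumptions about CMSG_DATA.
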
The server then initiates the data transfer by sending the RDMA ACK
message along with an RDS_CMSG_RDMA_ARGS control message. This message
contains the RDMA cookie, and the local memory to copy to or from.

The server process may request a notification when an RDMA operation
completes. Notifications are delivered as RDS_CMSG_RDMA_STATUS control
messages. When an application calls recvmsg(2), it will either
receive a regular RDS message (possibly with other RDMA related control
messages), or an empty message with one or more status control
messages.

In addition, when an RDMA operation fails for some reason and is
discarded, the application can ask to receive notifications for failed
operations as well, regardless of whether it asked for success
notification of an individual message or not. This behavior is turned
on by setting the RDS_RECVERR socket option.

Setsockopt interface
In addition to the control message interface, RDS allows a process to
register and release memory ranges for RDMA through calls to
setsockopt(2).

RDS_GET_MR
       To obtain an RDMA cookie for a given memory range, the
       application can use setsockopt with RDS_GET_MR. This operates
       essentially the same way as the RDS_CMSG_RDMA_MAP control
       message: the argument contains the address and length of the
       memory range to be registered, and a pointer to an RDMA cookie
       variable, in which the system call will store the cookie for
       the registered range.

RDS_FREE_MR
       Memory ranges can be released by calling setsockopt with
       RDS_FREE_MR, giving the RDMA cookie and additional flags as
       arguments.

RDS_RECVERR
       This is a boolean option which can be set as well as queried
       (using getsockopt). When enabled, RDS will send RDMA
       notification messages to the application for any RDMA
       operation that fails. This option defaults to off.

For all of these calls, the level argument to setsockopt is SOL_RDS.

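As an illustration of the boolean option, a sketch of setting and
querying RDS_RECVERR follows. The helper names are hypothetical, the
option value is assumed to be an int, and the fallback values for
SOL_RDS and RDS_RECVERR are assumptions to be replaced by the
definitions from the RDS header.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Fallback values: assumptions, prefer the definitions in the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_RECVERR
#define RDS_RECVERR 5
#endif

/* Turn failure notifications on (enable != 0) or off for an RDS socket. */
static int rds_set_recverr(int fd, int enable)
{
    int val = enable ? 1 : 0;

    return setsockopt(fd, SOL_RDS, RDS_RECVERR, &val, sizeof(val));
}

/* Query the current setting: returns 0 or 1, or -1 with errno set. */
static int rds_get_recverr(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_RDS, RDS_RECVERR, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

Both helpers simply pass through the return value and errno of the
underlying system call, so the caller handles errors as usual.
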
DATA TYPES
RDMA cookie
    typedef u_int64_t rds_rdma_cookie_t;

This encapsulates a memory location in the client process. In
the current implementation, it contains the R_Key of the remote
memory region, and the offset into it (so that the application
does not have to worry about alignment).

The RDMA cookie is used in several struct types described below.
The RDS_CMSG_RDMA_DEST control message contains a
rds_rdma_cookie_t all by itself as payload.

Mapping arguments
The following data type is used with RDS_CMSG_RDMA_MAP control
messages and with the RDS_GET_MR socket option:

    struct rds_iovec {
            u_int64_t       addr;
            u_int64_t       bytes;
    };

    struct rds_get_mr_args {
            struct rds_iovec vec;
            u_int64_t       cookie_addr;
            u_int64_t       flags;
    };

The cookie_addr specifies a memory location in which to store the
RDMA cookie.

The flags value is a bitwise OR of any of the following flags:

RDS_RDMA_USE_ONCE
       This tells the kernel that the allocated RDMA cookie is
       to be used exactly once. When the RDMA ACK message
       arrives, the kernel will automatically unbind the memory
       area and release any resources associated with the
       cookie.

       If this flag is not set, it is the application's
       responsibility to release the memory region at a later time
       using the RDS_FREE_MR socket option.

RDS_RDMA_INVALIDATE
       Normally, RDMA memory mappings are invalidated lazily,
       because immediate invalidation requires some relatively
       costly synchronization with the HCA. However, this means
       that the server application can continue to access the
       registered memory for some indeterminate amount of time. If
       this flag is set, the RDS code will invalidate the mapping at
       the time it is released (either upon arrival of the RDMA ACK,
       if RDS_RDMA_USE_ONCE was specified, or when the application
       destroys it using RDS_FREE_MR).

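Registering a range through the RDS_GET_MR socket option with these
flags could be sketched as shown here. fill_get_mr_args() and
rds_register() are hypothetical helpers; the struct layouts follow this
page, and the numeric fallback values are assumptions that should come
from the RDS header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallback values: assumptions, prefer the definitions in the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_GET_MR
#define RDS_GET_MR 2
#endif
#ifndef RDS_RDMA_INVALIDATE
#define RDS_RDMA_INVALIDATE 0x0004
#endif
#ifndef RDS_RDMA_USE_ONCE
#define RDS_RDMA_USE_ONCE 0x0008
#endif

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/* Describe the range [buf, buf + len) and where to store its cookie. */
static void fill_get_mr_args(struct rds_get_mr_args *args, void *buf,
                             size_t len, uint64_t flags,
                             rds_rdma_cookie_t *cookie)
{
    assert(args != NULL && cookie != NULL);
    memset(args, 0, sizeof(*args));
    args->vec.addr = (uint64_t)(uintptr_t)buf;
    args->vec.bytes = len;
    args->cookie_addr = (uint64_t)(uintptr_t)cookie;
    args->flags = flags;
}

/* Register the range on an RDS socket; returns 0, or -1 with errno set. */
static int rds_register(int fd, void *buf, size_t len, uint64_t flags,
                        rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args;

    fill_get_mr_args(&args, buf, len, flags, cookie);
    return setsockopt(fd, SOL_RDS, RDS_GET_MR, &args, sizeof(args));
}
```

A one-shot registration would pass RDS_RDMA_USE_ONCE, optionally
combined with RDS_RDMA_INVALIDATE; passing 0 leaves the release to a
later RDS_FREE_MR call.
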
RDMA Operation
RDMA operations are initiated by the server using the
RDS_CMSG_RDMA_ARGS control message, which takes the following
data as payload:

    struct rds_rdma_args {
            rds_rdma_cookie_t cookie;
            struct rds_iovec remote_vec;
            u_int64_t       local_vec_addr;
            u_int64_t       nr_local;
            u_int64_t       flags;
            u_int32_t       user_token;
    };

The cookie argument contains the RDMA cookie received from the
client. The local memory is given via an array of rds_iovecs.
The array address is given in local_vec_addr, and its number of
elements is given in nr_local.

The struct member remote_vec specifies a location relative to
the memory area identified by the cookie: remote_vec.addr is an
offset into that region, and remote_vec.bytes is the length of
the memory window to copy to/from. This length must match the
size of the local memory area, i.e. the sum of bytes in all
members of the local iovec array.

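The length rule in the last paragraph can be checked in user space
before the request is sent. The following sketch uses two hypothetical
helpers, local_total() and rdma_sizes_match(); only struct rds_iovec is
taken from this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Sum the bytes of all local segments; returns -1 on 64-bit overflow. */
static int local_total(const struct rds_iovec *vec, size_t nr,
                       uint64_t *total)
{
    uint64_t sum = 0;
    size_t i;

    assert(vec != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        if (vec[i].bytes > UINT64_MAX - sum)
            return -1;
        sum += vec[i].bytes;
    }
    *total = sum;
    return 0;
}

/* Nonzero if remote->bytes equals the total size of the local array. */
static int rdma_sizes_match(const struct rds_iovec *remote,
                            const struct rds_iovec *local, size_t nr_local)
{
    uint64_t total;

    return local_total(local, nr_local, &total) == 0 &&
           total == remote->bytes;
}
```

Checking this up front lets the application report a mismatch itself
instead of waiting for sendmsg(2) to fail with EINVAL.
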
The flags field contains the bitwise OR of any of the following
flags:

RDS_RDMA_READWRITE
       If set, an RDMA WRITE is initiated from the server's
       memory to the client's. If not set, RDS will do an RDMA
       READ from the client's memory to the server's memory.

RDS_RDMA_FENCE
       By default, InfiniBand makes no guarantee about the
       ordering of an RDMA READ with respect to subsequent SEND
       operations. Setting this flag asks that the RDMA READ be
       fenced off from the subsequent RDMA ACK message. Setting
       this flag requires an additional round-trip of the IB
       fabric, but it is a good idea to set this flag by default,
       unless you are really sure you do not want it.

RDS_RDMA_NOTIFY_ME
       This flag requests a notification upon completion of the
       RDMA operation (successful or otherwise). The notification
       will contain the value of the user_token field passed in
       by the application. This allows the application to
       release resources (such as buffers) associated with the
       RDMA transfer.

The user_token can be used to pass an application-specific
identifier to the kernel. This token is returned to the application
when a status notification is generated (see the following
section).

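Putting the fields and flags together, the payload of an
RDS_CMSG_RDMA_ARGS message might be filled in as sketched below.
fill_rdma_args() is a hypothetical helper, the struct layout is the one
shown in this page (the installed RDS header is authoritative and may
differ), and the flag values are assumed fallbacks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fallback values: assumptions, prefer the definitions in the RDS header. */
#ifndef RDS_RDMA_READWRITE
#define RDS_RDMA_READWRITE 0x0001
#endif
#ifndef RDS_RDMA_FENCE
#define RDS_RDMA_FENCE 0x0002
#endif
#ifndef RDS_RDMA_NOTIFY_ME
#define RDS_RDMA_NOTIFY_ME 0x0020
#endif

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Layout as documented in this page. */
struct rds_rdma_args {
    rds_rdma_cookie_t cookie;
    struct rds_iovec remote_vec;
    uint64_t local_vec_addr;
    uint64_t nr_local;
    uint64_t flags;
    uint32_t user_token;
};

/*
 * Hypothetical helper: describe one RDMA operation against the region
 * named by cookie, starting at remote_off. The remote length is set to
 * the sum of the local segments, as the interface requires.
 */
static void fill_rdma_args(struct rds_rdma_args *a, rds_rdma_cookie_t cookie,
                           uint64_t remote_off, const struct rds_iovec *local,
                           uint64_t nr_local, int write_to_client,
                           uint32_t token)
{
    uint64_t i;

    assert(a != NULL && (local != NULL || nr_local == 0));
    memset(a, 0, sizeof(*a));
    a->cookie = cookie;
    a->remote_vec.addr = remote_off;
    for (i = 0; i < nr_local; i++)
        a->remote_vec.bytes += local[i].bytes;
    a->local_vec_addr = (uint64_t)(uintptr_t)local;
    a->nr_local = nr_local;
    /* Fence by default and ask for a completion notification. */
    a->flags = RDS_RDMA_FENCE | RDS_RDMA_NOTIFY_ME;
    if (write_to_client)
        a->flags |= RDS_RDMA_READWRITE;
    a->user_token = token;
}
```

The filled structure is then copied into a control message of type
RDS_CMSG_RDMA_ARGS at level SOL_RDS and sent together with the RDMA
ACK; the local array must remain valid until sendmsg(2) returns.
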
RDMA Notification
The RDS kernel code is able to notify the server application
when an RDMA operation completes. These notifications are
delivered via RDS_CMSG_RDMA_STATUS control messages.

By default, no notifications are generated. There are two ways
an application can request them. On one hand, status
notifications can be enabled on a per-operation basis by setting
the RDS_RDMA_NOTIFY_ME flag in the RDMA arguments. On the other
hand, the application can request notifications for all RDMA
operations that fail by setting the RDS_RECVERR socket option
(see above). In both cases, the format of the notification is
the same, and at most one notification will be sent per
completed operation.

The message format is this:

    struct rds_rdma_notify {
            u_int32_t       user_token;
            int32_t         status;
    };

The user_token field contains the value previously given to the
kernel in the RDS_CMSG_RDMA_ARGS control message. The status
field contains a status value, with 0 indicating success, and
non-zero indicating an error.

The following status codes are currently defined:

RDS_RDMA_SUCCESS
       The RDMA operation succeeded.

RDS_RDMA_REMOTE_ERROR
       The RDMA operation failed due to a remote access error.
       This is usually due to an invalid R_Key, offset or
       transfer size.

RDS_RDMA_CANCELED
       The RDMA operation was canceled by the application.
       (This error code is not yet generated.)

RDS_RDMA_DROPPED
       RDMA operations were discarded after the connection broke
       and was re-established. The RDMA operation may have been
       processed partially.

RDS_RDMA_OTHER_ERROR
       Any other failure.

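Since a single recvmsg(2) call may return one or more status control
messages, a receiver would typically loop over the ancillary data. The
sketch below shows one way to do that; collect_rdma_status() and
rdma_status_name() are hypothetical helpers, and all numeric values
other than 0 for success are assumed fallbacks for the definitions in
the RDS header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallback values: assumptions, prefer the definitions in the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_STATUS
#define RDS_CMSG_RDMA_STATUS 4
#endif
#ifndef RDS_RDMA_SUCCESS
#define RDS_RDMA_SUCCESS      0
#define RDS_RDMA_REMOTE_ERROR 1
#define RDS_RDMA_CANCELED     2
#define RDS_RDMA_DROPPED      3
#define RDS_RDMA_OTHER_ERROR  4
#endif

struct rds_rdma_notify {
    uint32_t user_token;
    int32_t status;
};

/* Short human-readable name for a status code. */
static const char *rdma_status_name(int32_t status)
{
    switch (status) {
    case RDS_RDMA_SUCCESS:      return "success";
    case RDS_RDMA_REMOTE_ERROR: return "remote error";
    case RDS_RDMA_CANCELED:     return "canceled";
    case RDS_RDMA_DROPPED:      return "dropped";
    case RDS_RDMA_OTHER_ERROR:  return "other error";
    default:                    return "unknown";
    }
}

/* Copy up to max status notifications out of msg; returns the count. */
static size_t collect_rdma_status(struct msghdr *msg,
                                  struct rds_rdma_notify *out, size_t max)
{
    struct cmsghdr *cmsg;
    size_t n = 0;

    assert(msg != NULL && out != NULL);
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL && n < max;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != SOL_RDS ||
            cmsg->cmsg_type != RDS_CMSG_RDMA_STATUS)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(out[0])))
            continue;  /* too short to hold a notification */
        memcpy(&out[n++], CMSG_DATA(cmsg), sizeof(out[0]));
    }
    return n;
}
```

The application can then use each user_token to find the buffers that
belong to the completed operation and release or retry them according
to the status.
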
RDMA setsockopt arguments
When using the RDS_GET_MR socket option to register a memory
range, the application passes a pointer to a struct
rds_get_mr_args variable, described above.

The RDS_FREE_MR call takes an argument of type struct
rds_free_mr_args:

    struct rds_free_mr_args {
            rds_rdma_cookie_t cookie;
            u_int64_t       flags;
    };

cookie specifies the RDMA cookie to be released. RDMA access to
the memory range will usually not be revoked instantly, because
the operation is rather costly. However, if the flags argument
contains RDS_RDMA_INVALIDATE, RDS will invalidate the indicated
mapping immediately, as described in section Mapping arguments
above.

If the cookie argument is 0, and RDS_RDMA_INVALIDATE is set, RDS
will invalidate old memory mappings on all devices.

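Both uses of RDS_FREE_MR described above can be sketched with two small
wrappers. rds_release() and rds_flush_stale_mappings() are hypothetical
names, and the numeric fallback values are assumptions standing in for
the definitions in the RDS header.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallback values: assumptions, prefer the definitions in the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_FREE_MR
#define RDS_FREE_MR 3
#endif
#ifndef RDS_RDMA_INVALIDATE
#define RDS_RDMA_INVALIDATE 0x0004
#endif

typedef uint64_t rds_rdma_cookie_t;

struct rds_free_mr_args {
    rds_rdma_cookie_t cookie;
    uint64_t flags;
};

/*
 * Release one registration. With invalidate_now set, the mapping is
 * invalidated immediately instead of lazily.
 */
static int rds_release(int fd, rds_rdma_cookie_t cookie, int invalidate_now)
{
    struct rds_free_mr_args args;

    memset(&args, 0, sizeof(args));
    args.cookie = cookie;
    args.flags = invalidate_now ? RDS_RDMA_INVALIDATE : 0;
    return setsockopt(fd, SOL_RDS, RDS_FREE_MR, &args, sizeof(args));
}

/* Cookie 0 plus RDS_RDMA_INVALIDATE: flush old mappings on all devices. */
static int rds_flush_stale_mappings(int fd)
{
    return rds_release(fd, 0, 1);
}
```

Because immediate invalidation is described as costly, an application
would normally reserve it for memory that is about to be reused for
sensitive data.
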
ERRORS
In addition to the usual error codes returned by sendmsg, recvmsg and
setsockopt, RDS returns the following error codes:

EAGAIN RDS was unable to map a memory range because the limit was
       exceeded (returned by RDS_CMSG_RDMA_MAP and RDS_GET_MR).

EINVAL When sending a message, there were conflicting control
       messages (e.g. two RDMA_MAP messages, or an RDMA_MAP and an
       RDMA_DEST message).

       In an RDS_CMSG_RDMA_MAP or RDS_GET_MR operation, the
       application specified a memory range greater than the maximum
       size supported.

       When setting up an RDMA operation with RDS_CMSG_RDMA_ARGS, the
       size of the local memory (given in the rds_iovec) did not match
       the size of the remote memory range.

EBUSY  RDS was unable to obtain a DMA mapping for the indicated memory.

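How an application reacts to these codes is a policy decision. The
sketch below shows one possible classification; the enum, its names and
the suggested reactions are all hypothetical, and since EAGAIN is also
among the usual sendmsg errors, the caller may need more context to
tell the cases apart.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reactions to a failed RDMA-related call. */
enum rdma_err_action {
    RDMA_ERR_RETRY_LATER,    /* free some registrations, then retry */
    RDMA_ERR_FIX_REQUEST,    /* the request itself is malformed */
    RDMA_ERR_MAPPING_FAILED, /* no DMA mapping for this memory */
    RDMA_ERR_OTHER           /* handle like any other socket error */
};

/* Map an errno value from sendmsg or setsockopt to a suggested action. */
static enum rdma_err_action classify_rdma_errno(int err)
{
    switch (err) {
    case EAGAIN:
        /* The limit on mapped memory ranges was exceeded. */
        return RDMA_ERR_RETRY_LATER;
    case EINVAL:
        /* Conflicting control messages, oversized range, or size mismatch. */
        return RDMA_ERR_FIX_REQUEST;
    case EBUSY:
        return RDMA_ERR_MAPPING_FAILED;
    default:
        return RDMA_ERR_OTHER;
    }
}
```

A caller would pass the errno value saved right after the failing
call, before any other library call can overwrite it.
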
LIMITS
Currently, the following limits apply:

·  The maximum size of a zerocopy transfer is 1 MB. This can be
   adjusted via the fmr_message_size module parameter.

·  The maximum number of memory ranges that can be mapped is
   currently limited to 2048. This can be adjusted via the
   fmr_pool_size module parameter. However, the actual limit
   imposed by the hardware may in fact be lower.

AUTHORS
RDS was written and is Copyright (C) 2007-2008 by Oracle, Inc.