rds-rdma(7)

RDS zerocopy(7)        Miscellaneous Information Manual        RDS zerocopy(7)

NAME

       RDS zerocopy - Interface for RDMA over RDS

DESCRIPTION

       This manual page describes the zerocopy interface of RDS, which was
       added in RDSv3. For a description of the basic RDS interface, please
       refer to rds(7).

       The principal mode of operation for RDS zerocopy is as follows: one
       participant (the client) wishes to initiate a direct transfer to or
       from some area of memory in its process address space. This memory
       does not have to be aligned.

       The client obtains a handle for this region of memory, and passes it
       to the other participant (the server). This handle is called the RDMA
       cookie. To the application, the cookie is an opaque 64-bit data type.

       The client sends this handle to the server application, along with
       other details of the RDMA request (such as which data to transfer to
       that memory area). Throughout the following discussion, we will refer
       to this message as the RDMA request.

       The server uses this RDMA cookie to initiate the requested RDMA
       transfer. The RDMA transfer is combined atomically with a normal RDS
       message, which is delivered to the client. This message is called the
       RDMA ACK in the rest of this page. Atomic in this context means that
       either both the RDMA succeeds and the RDMA ACK is delivered, or
       neither succeeds.

       Thus, when the client receives the RDMA ACK, it knows that the RDMA
       has completed successfully. It can then release the RDMA cookie for
       this memory region, if it wishes to.

       RDMA operations are not reliable in the sense that, unlike normal RDS
       messages, they may fail and get dropped.

INTERFACE

       The interface is currently based on control messages (ancillary data)
       sent or received via the sendmsg(2) and recvmsg(2) system calls.
       Optionally, an older interface can be used that is based on the
       setsockopt(2) system call. However, we recommend using control
       messages, as this reduces the number of system calls required.

   Control message interface
       With the control message interface, the RDMA cookie is passed to the
       server out-of-band, included in an extension header attached to the
       RDS message.

       The following outlines the mode of operation; the data types used are
       specified in detail in a subsequent section.

       Initially, the client will send RDMA requests along with an
       RDS_CMSG_RDMA_MAP control message. The control message contains the
       address and length of the memory region for which to obtain a handle,
       some flags, and a pointer to a memory location (in the caller's
       address space) where the kernel will store the RDMA cookie.
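
       The request step described above can be sketched with the standard
       cmsg(3) macros. This is a minimal illustration, not a reference
       implementation: the ex_ names are hypothetical, the structures are
       local mirrors of the ones listed later in this page, and the numeric
       constants are placeholders for the values a real program would take
       from <linux/rds.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values (assumption); real code uses <linux/rds.h>. */
#define EX_SOL_RDS            276
#define EX_RDS_CMSG_RDMA_MAP  3

/* Local mirrors of the structures shown in this page. */
struct ex_rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct ex_rds_get_mr_args {
    struct ex_rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/*
 * Attach one RDMA_MAP control message to msg, using ctl (ctl_len bytes,
 * aligned for struct cmsghdr) as the control buffer.
 * Returns 0 on success, -1 if the control buffer is too small.
 */
int ex_attach_rdma_map(struct msghdr *msg, void *ctl, size_t ctl_len,
                       void *buf, size_t len, uint64_t *cookie,
                       uint64_t flags)
{
    struct ex_rds_get_mr_args args;
    struct cmsghdr *cmsg;

    assert(msg != NULL && ctl != NULL && cookie != NULL);
    if (ctl_len < CMSG_SPACE(sizeof(args)))
        return -1;

    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)buf;       /* region to map */
    args.vec.bytes = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie; /* kernel stores cookie here */
    args.flags = flags;

    memset(ctl, 0, ctl_len);
    msg->msg_control = ctl;
    msg->msg_controllen = CMSG_SPACE(sizeof(args));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = EX_SOL_RDS;
    cmsg->cmsg_type = EX_RDS_CMSG_RDMA_MAP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
    return 0;
}
```

       The caller would then fill in msg_name and msg_iov with the RDMA
       request and pass the message to sendmsg(2) on an RDS socket.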

       Alternatively, if the application has already obtained an RDMA cookie
       for the memory range it wants to RDMA to/from, it can hand this
       cookie to the kernel using the RDS_CMSG_RDMA_DEST control message.

       Either way, the kernel will include the resulting RDMA cookie in an
       extension header that is transmitted as part of the RDMA request to
       the server.

       When the server receives the RDMA request, the kernel will deliver
       the cookie wrapped inside an RDS_CMSG_RDMA_DEST control message.
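
       On the receiving side, the cookie can be picked out of the ancillary
       data returned by recvmsg(2). The sketch below walks the control
       messages with the cmsg(3) macros; ex_find_rdma_cookie is a
       hypothetical helper, and the constants are placeholders for the
       values defined in <linux/rds.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values (assumption); real code uses <linux/rds.h>. */
#define EX_SOL_RDS             276
#define EX_RDS_CMSG_RDMA_DEST  2

typedef uint64_t ex_rds_rdma_cookie_t;

/*
 * Scan the control messages of a received message for an RDMA_DEST
 * entry. Returns 1 and stores the cookie if one is found, 0 otherwise.
 */
int ex_find_rdma_cookie(struct msghdr *msg, ex_rds_rdma_cookie_t *cookie)
{
    struct cmsghdr *cmsg;

    assert(msg != NULL && cookie != NULL);
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != EX_SOL_RDS ||
            cmsg->cmsg_type != EX_RDS_CMSG_RDMA_DEST)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
            continue;                       /* truncated payload */
        /* Copy out rather than cast: CMSG_DATA may be unaligned. */
        memcpy(cookie, CMSG_DATA(cmsg), sizeof(*cookie));
        return 1;
    }
    return 0;
}
```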

       The server then initiates the data transfer by sending the RDMA ACK
       message along with an RDS_CMSG_RDMA_ARGS control message. This
       control message contains the RDMA cookie and the local memory to copy
       to or from.

       The server process may request a notification when an RDMA operation
       completes. Notifications are delivered as RDS_CMSG_RDMA_STATUS
       control messages. When an application calls recvmsg(2), it will
       either receive a regular RDS message (possibly with other RDMA
       related control messages), or an empty message with one or more
       status control messages.

       In addition, when an RDMA operation fails for some reason and is
       discarded, the application can ask to receive notifications for
       failed operations as well, regardless of whether it asked for success
       notification of an individual message or not. This behavior is turned
       on by setting the RDS_RECVERR socket option.

   Setsockopt interface
       In addition to the control message interface, RDS allows a process to
       register and release memory ranges for RDMA through calls to
       setsockopt(2).

       RDS_GET_MR
              To obtain an RDMA cookie for a given memory range, the
              application can use setsockopt with RDS_GET_MR. This operates
              essentially the same way as the RDS_CMSG_RDMA_MAP control
              message: the argument contains the address and length of the
              memory range to be registered, and a pointer to an RDMA cookie
              variable, in which the system call will store the cookie for
              the registered range.

       RDS_FREE_MR
              Memory ranges can be released by calling setsockopt with
              RDS_FREE_MR, giving the RDMA cookie and additional flags as
              arguments.

       RDS_RECVERR
              This is a boolean option which can be set as well as queried
              (using getsockopt(2)). When enabled, RDS will send RDMA
              notification messages to the application for any RDMA
              operation that fails. This option defaults to off.

       For all of these calls, the level argument to setsockopt is SOL_RDS.
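
       As a rough sketch of the setsockopt path, the two helpers below
       register a memory range and toggle failure notifications. The ex_
       names are hypothetical, the option numbers are placeholders for the
       values in <linux/rds.h>, and passing the boolean as an int is an
       assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values (assumption); real code uses <linux/rds.h>. */
#define EX_SOL_RDS      276
#define EX_RDS_GET_MR   2
#define EX_RDS_RECVERR  5

/* Local mirrors of the structures shown in this page. */
struct ex_rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct ex_rds_get_mr_args {
    struct ex_rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/*
 * Register buf[0..len) for RDMA on RDS socket fd. On success the kernel
 * stores the cookie in *cookie. Returns the setsockopt(2) result.
 */
int ex_get_mr(int fd, void *buf, size_t len, uint64_t *cookie,
              uint64_t flags)
{
    struct ex_rds_get_mr_args args;

    assert(cookie != NULL);
    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)buf;
    args.vec.bytes = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    args.flags = flags;
    return setsockopt(fd, EX_SOL_RDS, EX_RDS_GET_MR, &args, sizeof(args));
}

/* Turn notifications for failed RDMA operations on or off. */
int ex_set_recverr(int fd, int enable)
{
    int on = enable ? 1 : 0;

    return setsockopt(fd, EX_SOL_RDS, EX_RDS_RECVERR, &on, sizeof(on));
}
```

       Both helpers return -1 with errno set when the descriptor is not a
       usable RDS socket, so callers should check the result before relying
       on the cookie.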

RDMA MACROS AND TYPES

       RDMA cookie
              typedef u_int64_t       rds_rdma_cookie_t;

              This encapsulates a memory location in the client process. In
              the current implementation, it contains the R_Key of the
              remote memory region, and the offset into it (so that the
              application does not have to worry about alignment).

              The RDMA cookie is used in several struct types described
              below. The RDS_CMSG_RDMA_DEST control message contains an
              rds_rdma_cookie_t all by itself as payload.

       Mapping arguments
              The following data types are used with RDS_CMSG_RDMA_MAP
              control messages and with the RDS_GET_MR socket option:

              struct rds_iovec {
                      u_int64_t       addr;
                      u_int64_t       bytes;
              };

              struct rds_get_mr_args {
                      struct rds_iovec vec;
                      u_int64_t       cookie_addr;
                      u_int64_t       flags;
              };

              The cookie_addr member specifies the memory location in which
              to store the RDMA cookie.

              The flags value is a bitwise OR of any of the following flags:

              RDS_RDMA_USE_ONCE
                     This tells the kernel that the allocated RDMA cookie is
                     to be used exactly once. When the RDMA ACK message
                     arrives, the kernel will automatically unbind the
                     memory area and release any resources associated with
                     the cookie.

                     If this flag is not set, it is the application's
                     responsibility to release the memory region at a later
                     time using the RDS_FREE_MR socket option.

              RDS_RDMA_INVALIDATE
                     Normally, RDMA memory mappings are invalidated lazily,
                     as invalidation requires some relatively costly
                     synchronization with the HCA. However, this means that
                     the server application can continue to access the
                     registered memory for some indeterminate amount of
                     time. If this flag is set, the RDS code will invalidate
                     the mapping at the time it is released (either upon
                     arrival of the RDMA ACK, if RDS_RDMA_USE_ONCE was
                     specified, or when the application destroys it using
                     RDS_FREE_MR).
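
              The interplay of the two mapping flags can be summarized in a
              few lines of code. This is only an illustration: the helper
              names are hypothetical and the flag values are placeholders
              for the definitions in <linux/rds.h>.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values (assumption); real code uses <linux/rds.h>. */
#define EX_RDS_RDMA_INVALIDATE  UINT64_C(0x0004)
#define EX_RDS_RDMA_USE_ONCE    UINT64_C(0x0008)

/* Build the flags word for a mapping request. */
uint64_t ex_mapping_flags(bool use_once, bool invalidate)
{
    uint64_t flags = 0;

    if (use_once)
        flags |= EX_RDS_RDMA_USE_ONCE;    /* kernel unbinds on RDMA ACK */
    if (invalidate)
        flags |= EX_RDS_RDMA_INVALIDATE;  /* invalidate at release time */
    return flags;
}

/*
 * Without USE_ONCE the application owns the mapping and has to release
 * it later with RDS_FREE_MR.
 */
bool ex_needs_explicit_free(uint64_t flags)
{
    return (flags & EX_RDS_RDMA_USE_ONCE) == 0;
}
```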

       RDMA Operation
              RDMA operations are initiated by the server using the
              RDS_CMSG_RDMA_ARGS control message, which takes the following
              data as payload:

              struct rds_rdma_args {
                      rds_rdma_cookie_t cookie;
                      struct rds_iovec remote_vec;
                      u_int64_t       local_vec_addr;
                      u_int64_t       nr_local;
                      u_int64_t       flags;
                      u_int32_t       user_token;
              };

              The cookie member contains the RDMA cookie received from the
              client. The local memory is given via an array of struct
              rds_iovec elements. The array address is given in
              local_vec_addr, and its number of elements is given in
              nr_local.

              The struct member remote_vec specifies a location relative to
              the memory area identified by the cookie: remote_vec.addr is
              an offset into that region, and remote_vec.bytes is the length
              of the memory window to copy to/from. This length must match
              the size of the local memory area, i.e. the sum of bytes in
              all members of the local iovec.

              The flags field contains the bitwise OR of any of the
              following flags:

              RDS_RDMA_READWRITE
                     If set, an RDMA WRITE is initiated from the server's
                     memory to the client's. If not set, RDS will do an RDMA
                     READ from the client's memory to the server's memory.

              RDS_RDMA_FENCE
                     By default, InfiniBand makes no guarantee about the
                     ordering of an RDMA READ with respect to subsequent
                     SEND operations. Setting this flag asks that the RDMA
                     READ be fenced off from the subsequent RDS ACK message.
                     Setting this flag requires an additional round-trip of
                     the IB fabric, but it is a good idea to set this flag
                     by default, unless you are really sure you do not want
                     it.

              RDS_RDMA_NOTIFY_ME
                     This flag requests a notification upon completion of
                     the RDMA operation (successful or otherwise). The
                     notification will contain the value of the user_token
                     field passed in by the application. This allows the
                     application to release resources (such as buffers)
                     associated with the RDMA transfer.

              The user_token can be used to pass an application-specific
              identifier to the kernel. This token is returned to the
              application when a status notification is generated (see the
              following section).
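
              Putting the fields and flags together, a server might fill the
              argument structure roughly as follows. This is a sketch under
              assumptions: the structure mirrors the layout printed in this
              page, ex_fill_rdma_args is a hypothetical helper, and the flag
              values are placeholders for those in <linux/rds.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values (assumption); real code uses <linux/rds.h>. */
#define EX_RDS_RDMA_READWRITE  UINT64_C(0x0001)
#define EX_RDS_RDMA_FENCE      UINT64_C(0x0002)
#define EX_RDS_RDMA_NOTIFY_ME  UINT64_C(0x0020)

/* Local mirrors of the structures shown in this page. */
struct ex_rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct ex_rds_rdma_args {
    uint64_t cookie;
    struct ex_rds_iovec remote_vec;
    uint64_t local_vec_addr;
    uint64_t nr_local;
    uint64_t flags;
    uint32_t user_token;
};

/*
 * Describe one RDMA operation: write != 0 selects an RDMA WRITE to the
 * client, write == 0 an RDMA READ from it. The operation is fenced and
 * a completion notification carrying token is requested.
 */
void ex_fill_rdma_args(struct ex_rds_rdma_args *args, uint64_t cookie,
                       uint64_t offset, uint64_t bytes,
                       const struct ex_rds_iovec *local, uint64_t nr_local,
                       int write, uint32_t token)
{
    assert(args != NULL && local != NULL);
    memset(args, 0, sizeof(*args));
    args->cookie = cookie;
    args->remote_vec.addr = offset;     /* offset into the remote region */
    args->remote_vec.bytes = bytes;     /* must equal the local total */
    args->local_vec_addr = (uint64_t)(uintptr_t)local;
    args->nr_local = nr_local;
    args->flags = EX_RDS_RDMA_FENCE | EX_RDS_RDMA_NOTIFY_ME;
    if (write)
        args->flags |= EX_RDS_RDMA_READWRITE;
    args->user_token = token;
}
```

              The filled structure would then be sent as the payload of an
              RDS_CMSG_RDMA_ARGS control message attached to the RDMA ACK.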

       RDMA Notification
              The RDS kernel code is able to notify the server application
              when an RDMA operation completes. These notifications are
              delivered via RDS_CMSG_RDMA_STATUS control messages.

              By default, no notifications are generated. There are two
              ways an application can request them. On one hand, status
              notifications can be enabled on a per-operation basis by
              setting the RDS_RDMA_NOTIFY_ME flag in the RDMA arguments. On
              the other hand, the application can request notifications for
              all RDMA operations that fail by setting the RDS_RECVERR
              socket option (see the Setsockopt interface section above).
              In both cases, the format of the notification is the same,
              and at most one notification will be sent per completed
              operation.

              The message format is this:

              struct rds_rdma_notify {
                      u_int32_t       user_token;
                      int32_t         status;
              };

              The user_token field contains the value previously given to
              the kernel in the RDS_CMSG_RDMA_ARGS control message. The
              status field contains a status value, with 0 indicating
              success, and non-zero indicating an error.

              The following status codes are currently defined:

              RDS_RDMA_SUCCESS
                     The RDMA operation succeeded.

              RDS_RDMA_REMOTE_ERROR
                     The RDMA operation failed due to a remote access error.
                     This is usually due to an invalid R_Key, offset or
                     transfer size.

              RDS_RDMA_CANCELED
                     The RDMA operation was canceled by the application.
                     (This error code is not yet generated).

              RDS_RDMA_DROPPED
                     RDMA operations were discarded after the connection
                     broke and was re-established. The RDMA operation may
                     have been processed partially.

              RDS_RDMA_OTHER_ERROR
                     Any other failure.
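
              A receiver could gather the notifications attached to a
              message along these lines. The sketch assumes the structure
              layout printed above; the ex_ names are hypothetical, and the
              numeric constants, including the status values, are
              placeholders for the definitions in <linux/rds.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values (assumption); real code uses <linux/rds.h>. */
#define EX_SOL_RDS               276
#define EX_RDS_CMSG_RDMA_STATUS  4

enum {
    EX_RDS_RDMA_SUCCESS = 0,
    EX_RDS_RDMA_REMOTE_ERROR,
    EX_RDS_RDMA_CANCELED,
    EX_RDS_RDMA_DROPPED,
    EX_RDS_RDMA_OTHER_ERROR
};

/* Local mirror of struct rds_rdma_notify as shown in this page. */
struct ex_rds_rdma_notify {
    uint32_t user_token;
    int32_t status;
};

/* Copy up to max status notifications from msg into out; returns count. */
size_t ex_collect_notifications(struct msghdr *msg,
                                struct ex_rds_rdma_notify *out, size_t max)
{
    struct cmsghdr *cmsg;
    size_t n = 0;

    assert(msg != NULL && out != NULL);
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL && n < max;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != EX_SOL_RDS ||
            cmsg->cmsg_type != EX_RDS_CMSG_RDMA_STATUS)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(out[0])))
            continue;                   /* truncated payload */
        memcpy(&out[n++], CMSG_DATA(cmsg), sizeof(out[0]));
    }
    return n;
}

/* Human-readable name for a status value, for logging. */
const char *ex_status_name(int32_t status)
{
    switch (status) {
    case EX_RDS_RDMA_SUCCESS:      return "success";
    case EX_RDS_RDMA_REMOTE_ERROR: return "remote access error";
    case EX_RDS_RDMA_CANCELED:     return "canceled";
    case EX_RDS_RDMA_DROPPED:      return "dropped";
    default:                       return "other error";
    }
}
```

              The user_token of each collected entry identifies the
              operation whose buffers can now be released or retried.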

       RDMA setsockopt arguments
              When using the RDS_GET_MR socket option to register a memory
              range, the application passes a pointer to a struct
              rds_get_mr_args variable, described above.

              The RDS_FREE_MR call takes an argument of type struct
              rds_free_mr_args:

              struct rds_free_mr_args {
                      rds_rdma_cookie_t cookie;
                      u_int64_t       flags;
              };

              cookie specifies the RDMA cookie to be released. RDMA access
              to the memory range will usually not be revoked instantly,
              because the operation is rather costly. However, if the flags
              argument contains RDS_RDMA_INVALIDATE, RDS will invalidate
              the indicated mapping immediately, as described in section
              Mapping arguments above.

              If the cookie argument is 0, and RDS_RDMA_INVALIDATE is set,
              RDS will invalidate old memory mappings on all devices.
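
              The two release variants described above might be wrapped as
              follows. The helper names are hypothetical, and the numeric
              constants are placeholders for the values in <linux/rds.h>.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Placeholder values (assumption); real code uses <linux/rds.h>. */
#define EX_SOL_RDS              276
#define EX_RDS_FREE_MR          3
#define EX_RDS_RDMA_INVALIDATE  UINT64_C(0x0004)

/* Local mirror of struct rds_free_mr_args as shown in this page. */
struct ex_rds_free_mr_args {
    uint64_t cookie;
    uint64_t flags;
};

/* Build the argument for releasing one cookie. */
struct ex_rds_free_mr_args ex_make_free_args(uint64_t cookie, int invalidate)
{
    struct ex_rds_free_mr_args args;

    memset(&args, 0, sizeof(args));
    args.cookie = cookie;
    if (invalidate)
        args.flags |= EX_RDS_RDMA_INVALIDATE;
    return args;
}

/* Release one mapping; returns the setsockopt(2) result. */
int ex_free_mr(int fd, uint64_t cookie, int invalidate)
{
    struct ex_rds_free_mr_args args = ex_make_free_args(cookie, invalidate);

    return setsockopt(fd, EX_SOL_RDS, EX_RDS_FREE_MR, &args, sizeof(args));
}

/* Cookie 0 plus INVALIDATE asks RDS to flush old mappings on all devices. */
int ex_flush_old_mappings(int fd)
{
    return ex_free_mr(fd, 0, 1);
}
```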

ERRORS

       In addition to the usual error codes returned by sendmsg(2),
       recvmsg(2) and setsockopt(2), RDS returns the following error codes:

       EAGAIN RDS was unable to map a memory range because the limit was
              exceeded (returned by RDS_CMSG_RDMA_MAP and RDS_GET_MR).

       EINVAL When sending a message, there were conflicting control
              messages (e.g. two RDMA_MAP messages, or an RDMA_MAP and an
              RDMA_DEST message).

              In an RDS_CMSG_RDMA_MAP or RDS_GET_MR operation, the
              application specified a memory range greater than the maximum
              size supported.

              When setting up an RDMA operation with RDS_CMSG_RDMA_ARGS, the
              size of the local memory (given in the rds_iovec) did not
              match the size of the remote memory range.

       EBUSY  RDS was unable to obtain a DMA mapping for the indicated
              memory.
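
       One way an application might react to these codes is sketched below.
       The classification is an illustrative policy, not something RDS
       prescribes, and the ex_ names are hypothetical.

```c
#include <errno.h>

enum ex_rdma_action {
    EX_FREE_AND_RETRY,  /* release unused cookies, then try again */
    EX_FIX_REQUEST,     /* the request itself is malformed */
    EX_MAPPING_FAILED,  /* the memory could not be DMA-mapped */
    EX_UNEXPECTED       /* handle like any other socket error */
};

/* Map an errno value from an RDMA-related call to a coarse action. */
enum ex_rdma_action ex_classify_rdma_errno(int err)
{
    switch (err) {
    case EAGAIN:
        return EX_FREE_AND_RETRY;   /* mapping limit exceeded */
    case EINVAL:
        return EX_FIX_REQUEST;      /* conflicting cmsgs or bad sizes */
    case EBUSY:
        return EX_MAPPING_FAILED;   /* no DMA mapping available */
    default:
        return EX_UNEXPECTED;
    }
}
```

       Note that EAGAIN is also returned by non-blocking sockets for
       unrelated reasons, so a real program would need more context than
       this helper has.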

LIMITS

       Currently, the following limits apply:

       ·      The maximum size of a zerocopy transfer is 1MB. This can be
              adjusted via the fmr_message_size module parameter.

       ·      The maximum number of memory ranges that can be mapped is
              limited to 2048 at the moment. This can be adjusted via the
              fmr_pool_size module parameter. However, the actual limit
              imposed by the hardware may in fact be lower.
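
       An application that moves more data than a single transfer allows
       has to split the work. The arithmetic can be sketched as follows;
       the 1MB constant is the default quoted above and may differ on a
       system where fmr_message_size was changed, and the ex_ names are
       hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Default limit quoted in this page; treat it as an assumption. */
#define EX_MAX_RDMA_BYTES  (UINT64_C(1024) * 1024)

/* Number of transfers needed to move total bytes in max_chunk pieces. */
uint64_t ex_chunks_needed(uint64_t total, uint64_t max_chunk)
{
    assert(max_chunk > 0);
    return total / max_chunk + (total % max_chunk != 0);
}

/* Size of chunk number index (0-based); the last one may be shorter. */
uint64_t ex_chunk_bytes(uint64_t total, uint64_t max_chunk, uint64_t index)
{
    uint64_t start = index * max_chunk;

    assert(max_chunk > 0 && start < total);
    return total - start < max_chunk ? total - start : max_chunk;
}
```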

AUTHORS

       RDS was written and is Copyright (C) 2007-2008 by Oracle, Inc.

                                                               RDS zerocopy(7)