RDS(7) Miscellaneous Information Manual RDS(7)

NAME
RDS - Reliable Datagram Sockets

SYNOPSIS
#include <sys/socket.h>
#include <netinet/in.h>

DESCRIPTION
This is an implementation of the RDS socket API. It provides reliable,
in-order datagram delivery between sockets over a variety of
transports.

Currently, RDS can be transported over Infiniband and loopback. RDS
over TCP is disabled, but will be re-enabled in the near future.

RDS uses standard AF_INET addresses as described in ip(7) to identify
end points.

Socket Creation
RDS is still in development and as such does not have a reserved
protocol family constant. Applications must read the string
representation of the protocol family value from the pf_rds sysctl
parameter file described below.

    rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
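
The lookup described above can be sketched as follows. This is a minimal
illustration, not part of the RDS API: the helper names
rds_read_sysctl_int() and rds_open_socket() are hypothetical, and the
path is the pf_rds file under /proc/sys/net/rds named in this page.

```c
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: read a decimal integer from a sysctl-style text
 * file. Returns the value, or -1 if the file cannot be opened or parsed. */
static int rds_read_sysctl_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Hypothetical helper: create an RDS socket using the dynamically
 * assigned protocol family. Returns the descriptor, or -1 on failure. */
static int rds_open_socket(void)
{
    int pf_rds = rds_read_sysctl_int("/proc/sys/net/rds/pf_rds");

    if (pf_rds < 0)
        return -1;
    return socket(pf_rds, SOCK_SEQPACKET, 0);
}
```

The same reader could be pointed at the sol_rds file to obtain the
socket level used with setsockopt(2) and getsockopt(2).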

Socket Options
RDS sockets support a number of socket options through the
setsockopt(2) and getsockopt(2) calls. The following generic options
(with socket level SOL_SOCKET) are of specific importance:

SO_RCVBUF
    Specifies the size of the receive buffer. See the section on
    "Congestion Control" below.

SO_SNDBUF
    Specifies the size of the send buffer. See "Message
    Transmission" below.

SO_SNDTIMEO
    Specifies the send timeout when trying to enqueue a message on a
    socket with a full queue in blocking mode.

In addition to these, RDS supports a number of protocol-specific
options (with socket level SOL_RDS). Just as with the RDS protocol
family, an official value has not been assigned yet, so the kernel will
assign a value dynamically. The assigned value can be retrieved from
the sol_rds sysctl parameter file.

RDS-specific socket options are described in separate sections below.

Binding
A new RDS socket has no local address when it is first returned from
socket(2). It must be bound to a local address by calling bind(2)
before any messages can be sent or received. This will also attach the
socket to a specific transport, based on the type of interface the
local address is attached to. From that point on, the socket can only
reach destinations which are available through this transport.

For instance, when binding to the address of an Infiniband interface
such as ib0, the socket will use the Infiniband transport. If RDS is
not able to associate a transport with the given address, it will
return EADDRNOTAVAIL.

An RDS socket can only be bound to one address and only one socket can
be bound to a given address/port pair. If no port is specified in the
binding address then an unbound port is selected at random.

RDS does not allow the application to bind a previously bound socket to
another address. Binding to the wildcard address INADDR_ANY is not
permitted either.
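
A minimal sketch of preparing a bind address under the rules above. The
helper name rds_make_bind_addr() is hypothetical; it only fills in a
sockaddr_in and rejects the wildcard address up front, leaving the
actual bind(2) call to the caller.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill *addr for bind(2) on an RDS socket.
 * port is in host byte order; 0 leaves the port unspecified so that an
 * unbound port is selected. Returns 0 on success, or -1 if ip is not a
 * valid IPv4 address or is the wildcard address. */
static int rds_make_bind_addr(const char *ip, unsigned short port,
                              struct sockaddr_in *addr)
{
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr->sin_addr) != 1)
        return -1;
    if (addr->sin_addr.s_addr == htonl(INADDR_ANY))
        return -1;
    return 0;
}
```

The result would then be passed as
bind(fd, (struct sockaddr *) &addr, sizeof(addr)), where fd is an RDS
socket; a failure with EADDRNOTAVAIL indicates that no transport could
be associated with the address.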

Connecting
The default mode of operation for RDS is to use unconnected sockets and
specify a destination address as an argument to sendmsg(2). However,
RDS allows sockets to be connected to a remote end point using
connect(2). If a socket is connected, calling sendmsg(2) without
specifying a destination address will use the previously given remote
address.
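
The two modes differ only in whether msg_name is filled in. A minimal
sketch, using a hypothetical helper rds_prepare_msg() that builds the
msghdr for a single-buffer message:

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: prepare *msg to send len bytes from buf.
 * Pass the destination for an unconnected socket, or NULL to use the
 * remote address previously set with connect(2). */
static void rds_prepare_msg(struct msghdr *msg, struct iovec *iov,
                            struct sockaddr_in *dest, void *buf, size_t len)
{
    memset(msg, 0, sizeof(*msg));
    iov->iov_base = buf;
    iov->iov_len = len;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    if (dest != NULL) {
        msg->msg_name = dest;
        msg->msg_namelen = sizeof(*dest);
    }
}
```

The prepared structure is then handed to sendmsg(fd, &msg, 0).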

Congestion Control
RDS does not have explicit congestion control like common streaming
protocols such as TCP. However, sockets have two queue limits
associated with them: the send queue size and the receive queue size.
Messages are accounted based on the number of bytes of payload.

The send queue size limits how much data local processes can queue on a
local socket (see "Message Transmission" below). If that limit is
exceeded, the kernel will not accept further messages until the queue
is drained and messages have been delivered to and acknowledged by the
remote host.

The receive queue size limits how much data RDS will put on the receive
queue of a socket before marking the socket as congested. When a
socket becomes congested, RDS will send a congestion map update to the
other participating hosts, which are then expected to stop sending more
messages to this port.

There is a timing window during which a remote host can still continue
to send messages to a congested port; RDS handles this by accepting
these messages even if the socket's receive queue is already over the
limit.

As the application pulls incoming messages off the receive queue using
recvmsg(2), the number of bytes on the receive queue will eventually
drop below the receive queue size, at which point the port is marked
uncongested, and another congestion update is sent to all
participating hosts. This tells them to allow applications to send
additional messages to this port.

The default values for the send and receive buffer sizes are taken from
the system-wide wmem_default and rmem_default sysctls, respectively.
They can be tuned by the application through the SO_SNDBUF and
SO_RCVBUF socket options.

Blocking Behavior
The sendmsg(2) and recvmsg(2) calls can block in a variety of
situations. Whether a call blocks or returns with an error depends on
the non-blocking setting of the file descriptor and the MSG_DONTWAIT
message flag. If the file descriptor is set to blocking mode (which is
the default), and the MSG_DONTWAIT flag is not given, the call will
block.

In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used
to specify a timeout (passed as a struct timeval, see socket(7)) after
which the call will stop waiting and return an error. The default
timeout is 0, which tells RDS to block indefinitely.
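
A minimal sketch of setting a send timeout. On Linux the value for
SO_SNDTIMEO and SO_RCVTIMEO is a struct timeval (see socket(7)); the
helper names ms_to_timeval() and rds_set_send_timeout_ms() are
hypothetical and exist only for this illustration.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: convert a non-negative millisecond count into a
 * struct timeval. */
static struct timeval ms_to_timeval(long ms)
{
    struct timeval tv;

    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return tv;
}

/* Hypothetical helper: set the send timeout on fd. A value of 0 means
 * block indefinitely. Returns the result of setsockopt(2). */
static int rds_set_send_timeout_ms(int fd, long ms)
{
    struct timeval tv = ms_to_timeval(ms);

    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```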

Message Transmission
Messages may be sent using sendmsg(2) once the RDS socket is bound.
Message length cannot exceed 4 gigabytes as the wire protocol uses an
unsigned 32-bit integer to express the message length.

RDS does not support out-of-band data. Applications are allowed to send
to unicast addresses only; broadcast and multicast are not supported.

A successful sendmsg(2) call puts the message in the socket's transmit
queue, where it will remain until either the destination acknowledges
that the message is no longer in the network or the application removes
the message from the send queue.

Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO
socket option described below.

While a message is in the transmit queue its payload bytes are
accounted for. If an attempt is made to send a message while there is
not sufficient room on the transmit queue, the call will either block
or return EAGAIN.

When sending to a destination that is marked congested (see above),
the call will either block or return ENOBUFS.
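
For non-blocking senders, the two error codes call for different waits.
The sketch below is one way an application might encode that choice; the
function rds_retry_event() is hypothetical, and the mapping of ENOBUFS
to POLLIN follows the "Polling" section of this page.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: given the errno from a failed non-blocking
 * sendmsg(2), return the poll(2) event to wait for before retrying.
 * Returns 0 if the error is not one of the two retryable cases. */
static short rds_retry_event(int err)
{
    switch (err) {
    case EAGAIN:
        /* Local transmit queue is full: wait until it has room. */
        return POLLOUT;
    case ENOBUFS:
        /* Destination is congested: wait for a congestion update. */
        return POLLIN;
    default:
        return 0;
    }
}
```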

A message sent with no payload bytes will not consume any space in the
destination's send buffer but will result in a message receipt on the
destination. The receiver will not get any payload data but will be
able to see the sender's address.

Messages sent to a port to which no socket is bound will be silently
discarded by the destination host. No error messages are reported to
the sender.

Message Receipt
Messages may be received with recvmsg(2) on an RDS socket once it is
bound to a source address. RDS will return messages in order, i.e.
messages from the same sender will arrive in the same order in which
they were sent.

The address of the sender will be returned in the sockaddr_in structure
pointed to by the msg_name field, if set.

If the MSG_PEEK flag is given, the first message on the receive queue
is returned without removing it from the queue.

The memory consumed by messages waiting for delivery does not limit the
number of messages that can be queued for receive. RDS does attempt to
perform congestion control as described in the section above.

If the length of the message exceeds the size of the buffer provided to
recvmsg(2), then the remainder of the bytes in the message are
discarded and the MSG_TRUNC flag is set in the msg_flags field. In this
truncating case recvmsg(2) will still return the number of bytes
copied, not the length of the entire message. If MSG_TRUNC is set in
the flags argument to recvmsg(2), then it will return the number of
bytes in the entire message. Thus one can examine the size of the next
message in the receive queue without incurring a copying overhead by
providing a zero-length buffer and setting MSG_PEEK and MSG_TRUNC in
the flags argument.
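
The size probe described above can be sketched as follows. The helper
name rds_next_message_size() is hypothetical; the code uses only the
standard recvmsg(2) flags and assumes the MSG_TRUNC return-value
behavior stated in this section.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: return the full length of the next queued
 * message without copying or dequeuing it, or -1 on error. */
static ssize_t rds_next_message_size(int fd)
{
    char dummy;
    struct iovec iov;
    struct msghdr msg;

    /* Zero-length buffer: no payload bytes are copied. */
    iov.iov_base = &dummy;
    iov.iov_len = 0;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    /* MSG_PEEK leaves the message queued; MSG_TRUNC makes the call
     * return the length of the entire message. */
    return recvmsg(fd, &msg, MSG_PEEK | MSG_TRUNC);
}
```

An application could use the returned length to allocate a buffer of
the right size before issuing the real recvmsg(2) call.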

The sending address of a zero-length message will still be provided in
the msg_name field.

Control Messages
RDS uses control messages (a.k.a. ancillary data) through the
msg_control and msg_controllen fields in sendmsg(2) and recvmsg(2).
Control messages generated by RDS have a cmsg_level value of sol_rds.
Most control messages are related to the zerocopy interface added in
RDS version 3, and are described in rds-rdma(7).

The only exception is the RDS_CMSG_CONG_UPDATE message, which is
described in the following section.

Polling
RDS supports the poll(2) interface in a limited fashion. POLLIN is
returned when there is a message (either a proper RDS message, or a
control message) waiting in the socket's receive queue. POLLOUT is
always returned while there is room on the socket's send queue.

Sending to congested ports requires special handling. When an
application tries to send to a congested destination, the system call
will return ENOBUFS. However, it cannot poll for POLLOUT, as there is
probably still room on the transmit queue, so the call to poll(2) would
return immediately, even though the destination is still congested.

There are two ways of dealing with this situation. The first is to
simply poll for POLLIN. By default, a process sleeping in poll(2) is
always woken up when the congestion map is updated, and thus the
application can retry any previously congested sends.

The second option is explicit congestion monitoring, which gives the
application more fine-grained control.

With explicit monitoring, the application polls for POLLIN as before,
and additionally uses the RDS_CONG_MONITOR socket option to install a
64-bit mask value in the socket, where each bit corresponds to a group
of ports. When a congestion update arrives, RDS checks the set of ports
that became uncongested against the bit mask installed in the socket.
If they overlap, a control message is enqueued on the socket, and the
application is woken up. When it calls recvmsg(2), it will be given the
control message containing the bitmap.

The congestion monitor bitmask can be set and queried using
setsockopt(2) with RDS_CONG_MONITOR, and a pointer to the 64-bit mask
variable.

Congestion updates are delivered to the application via
RDS_CMSG_CONG_UPDATE control messages. These control messages are
always delivered by themselves (or possibly with additional control
messages), but never along with an RDS data message. The cmsg_data
field of the control message is an 8-byte datum containing the 64-bit
mask value.
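
Extracting the mask from a received message can be sketched with the
standard cmsg(3) macros. The helper rds_find_cong_update() is
hypothetical. It takes the control message level and type as
parameters, on the assumption that the caller supplies the sol_rds
value read from the sysctl file and the RDS_CMSG_CONG_UPDATE constant
from the RDS header.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: scan the ancillary data of *msg for a control
 * message with the given level and type that carries at least 8 bytes.
 * On a match, copy the 64-bit mask into *mask and return 1; otherwise
 * return 0. */
static int rds_find_cong_update(struct msghdr *msg, int level, int type,
                                uint64_t *mask)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level != level || c->cmsg_type != type)
            continue;
        if (c->cmsg_len < CMSG_LEN(sizeof(*mask)))
            continue;
        /* memcpy avoids assuming the data is aligned for uint64_t. */
        memcpy(mask, CMSG_DATA(c), sizeof(*mask));
        return 1;
    }
    return 0;
}
```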

Applications can use the following macros to test for and set bits in
the bitmask:

    #define RDS_CONG_MONITOR_SIZE 64
    #define RDS_CONG_MONITOR_BIT(port) (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
    #define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
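
A minimal sketch of building and testing a monitor mask. The block
defines the three macros itself, using an unsigned 64-bit constant in
the shift so that all 64 bit positions are representable; the helper
functions rds_cong_mask_for_ports() and rds_cong_mask_covers() are
hypothetical.

```c
#include <stdint.h>

#define RDS_CONG_MONITOR_SIZE 64
#define RDS_CONG_MONITOR_BIT(port) (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))

/* Hypothetical helper: build the mask covering every port in ports[]. */
static uint64_t rds_cong_mask_for_ports(const unsigned short *ports, int n)
{
    uint64_t mask = 0;
    int i;

    for (i = 0; i < n; i++)
        mask |= RDS_CONG_MONITOR_MASK(ports[i]);
    return mask;
}

/* Hypothetical helper: return 1 if the bit for port is set in mask.
 * Ports that are equal modulo 64 share a bit, so a match means the
 * port's group is covered, not necessarily that exact port. */
static int rds_cong_mask_covers(uint64_t mask, unsigned short port)
{
    return (mask & RDS_CONG_MONITOR_MASK(port)) != 0;
}
```

The resulting mask would be installed with setsockopt(2) using the
sol_rds level and RDS_CONG_MONITOR, and the same test applied to the
mask carried by an RDS_CMSG_CONG_UPDATE control message.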

Canceling Messages
An application can cancel (flush) messages from the send queue using
the RDS_CANCEL_SENT_TO socket option with setsockopt(2). This call
takes an optional sockaddr_in address structure as argument. If given,
only messages to the destination specified by this address are
discarded. If no address is given, all pending messages are discarded.

Note that this affects messages that have not yet been transmitted as
well as messages that have been transmitted, but for which no
acknowledgment from the remote host has been received yet.

Reliability
If sendmsg(2) succeeds, RDS guarantees that the message will be
visible to recvmsg(2) on a socket bound to the destination address as
long as that destination socket remains open.

If there is no socket bound on the destination, the message is
silently dropped. If the sending RDS can't be sure that there is no
socket bound then it will try to send the message indefinitely until it
can be sure or the sent message is canceled.

If a socket is closed then all pending sent messages on the socket are
canceled and may or may not be seen by the receiver.

The RDS_CANCEL_SENT_TO socket option can be used to cancel all pending
messages to a given destination.

If a receiving socket is closed with pending messages then the sender
considers those messages as having left the network and will not
retransmit them.

A message will only be seen by recvmsg(2) once, unless MSG_PEEK was
specified. Once the message has been delivered it is removed from the
sending socket's transmit queue.

All messages sent from the same socket to the same destination will be
delivered in the order they're sent. Messages sent from different
sockets, or to different destinations, may be delivered in any order.

SYSCTL VALUES
These parameters may only be accessed through their files in
/proc/sys/net/rds. Access through sysctl(2) is not supported.

pf_rds
    This file contains the string representation of the protocol
    family constant passed to socket(2) to create a new RDS socket.

sol_rds
    This file contains the string representation of the socket level
    parameter that is passed to getsockopt(2) and setsockopt(2) to
    manipulate RDS socket options.

max_unacked_bytes and max_unacked_packets
    These parameters are used to tune the generation of
    acknowledgements. By default, the system receiving RDS messages
    does not send back explicit acknowledgements unless it transmits a
    message of its own (in which case the ACK is piggybacked onto the
    outgoing message), or when the sending system requests an ACK.

    However, the sender needs to see an ACK from time to time so
    that it can purge old messages from the send queue. The unacked
    bytes and packet counters are used to keep track of how much
    data has been sent without requesting an ACK. The default is to
    request an acknowledgement every 16 packets, or every 16 MB,
    whichever comes first.
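
The "whichever comes first" rule can be illustrated with a small
counter. This is a sketch of the accounting described above, not the
kernel's implementation; the struct ack_tracker type and the
ack_tracker_on_send() function are hypothetical, as is the choice to
reset both counters whenever an ACK is requested.

```c
#include <stddef.h>

/* Hypothetical model of the sender-side counters. */
struct ack_tracker {
    size_t max_bytes;        /* limit, as in max_unacked_bytes */
    unsigned int max_packets; /* limit, as in max_unacked_packets */
    size_t bytes;            /* bytes sent since the last ACK request */
    unsigned int packets;    /* packets sent since the last ACK request */
};

/* Account for one outgoing message of len bytes. Returns 1 if this
 * message should request an ACK (either limit reached), in which case
 * both counters are reset; returns 0 otherwise. */
static int ack_tracker_on_send(struct ack_tracker *t, size_t len)
{
    t->bytes += len;
    t->packets += 1;
    if (t->packets >= t->max_packets || t->bytes >= t->max_bytes) {
        t->bytes = 0;
        t->packets = 0;
        return 1;
    }
    return 0;
}
```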

reconnect_delay_min_ms and reconnect_delay_max_ms
    RDS uses host-to-host connections to transport RDS messages
    (both for the TCP and the Infiniband transport). If this
    connection breaks, RDS will try to re-establish the connection.
    Because this reconnect may be triggered by both hosts at the
    same time and fail, RDS uses a random backoff before attempting
    a reconnect. These two parameters specify the minimum and
    maximum delay in milliseconds. The default values are 1 and 1000,
    respectively.

SEE ALSO
rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2),
setsockopt(2)