RDS(7)                 Miscellaneous Information Manual                 RDS(7)

NAME

       RDS - Reliable Datagram Sockets

SYNOPSIS

       #include <sys/socket.h>
       #include <netinet/in.h>

DESCRIPTION

       This is an implementation of the RDS socket API. It provides reliable,
       in-order datagram delivery between sockets over a variety of
       transports.

       Currently, RDS can be transported over InfiniBand and loopback. RDS
       over TCP is disabled, but will be re-enabled in the near future.

       RDS uses standard AF_INET addresses as described in ip(7) to identify
       end points.

   Socket Creation
       RDS is still in development and as such does not have a reserved
       protocol family constant. Applications must read the string
       representation of the protocol family value from the pf_rds sysctl
       parameter file described below.

       rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);

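       As a minimal sketch of the steps above, the following C fragment reads
       the protocol family number from /proc/sys/net/rds/pf_rds and passes it
       to socket(2). The helper names read_sysctl_int() and
       rds_socket_create() are illustrative assumptions and not part of the
       RDS API.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Read one non-negative decimal integer from a sysctl-style text file.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int read_sysctl_int(const char *path, int *value)
{
    char buf[32];
    char *end;
    long v;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    errno = 0;
    v = strtol(buf, &end, 10);
    if (errno != 0 || end == buf || v < 0 || v > INT_MAX)
        return -1;
    *value = (int)v;
    return 0;
}

/* Create an RDS socket using the family number published by the kernel.
 * Returns the descriptor, or -1 on failure. (Hypothetical helper.) */
int rds_socket_create(void)
{
    int pf_rds;

    if (read_sysctl_int("/proc/sys/net/rds/pf_rds", &pf_rds) != 0)
        return -1;
    return socket(pf_rds, SOCK_SEQPACKET, 0);
}
```

       The same helper could be pointed at the sol_rds file to obtain the
       socket level used for RDS-specific socket options.
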
   Socket Options
       RDS sockets support a number of socket options through the
       setsockopt(2) and getsockopt(2) calls. The following generic options
       (with socket level SOL_SOCKET) are of specific importance:

       SO_RCVBUF
              Specifies the size of the receive buffer. See the section on
              "Congestion Control" below.

       SO_SNDBUF
              Specifies the size of the send buffer. See "Message
              Transmission" below.

       SO_SNDTIMEO
              Specifies the send timeout when trying to enqueue a message on
              a socket with a full queue in blocking mode.

       In addition to these, RDS supports a number of protocol-specific
       options (with socket level SOL_RDS). Just as with the RDS protocol
       family, an official value has not been assigned yet, so the kernel
       will assign a value dynamically. The assigned value can be retrieved
       from the sol_rds sysctl parameter file.

       RDS-specific socket options are described in separate sections
       below.

   Binding
       A new RDS socket has no local address when it is first returned from
       socket(2). It must be bound to a local address by calling bind(2)
       before any messages can be sent or received. This will also attach
       the socket to a specific transport, based on the type of interface
       the local address is attached to. From that point on, the socket can
       only reach destinations which are available through this transport.

       For instance, when binding to the address of an InfiniBand interface
       such as ib0, the socket will use the InfiniBand transport. If RDS is
       not able to associate a transport with the given address, it will
       return EADDRNOTAVAIL.

       An RDS socket can only be bound to one address, and only one socket
       can be bound to a given address/port pair. If no port is specified
       in the binding address, then an unbound port is selected at random.

       RDS does not allow the application to bind a previously bound socket
       to another address. Binding to the wildcard address INADDR_ANY is
       not permitted either.

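       The binding rules above might be applied as in the following sketch.
       The helper names rds_fill_addr() and rds_bind() are hypothetical, and
       treating a port of 0 as "no port specified" is an assumption carried
       over from other AF_INET sockets.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill in a sockaddr_in for bind(2). Returns 0 on success, -1 if the
 * address is malformed or is the wildcard address, which RDS rejects.
 * (Hypothetical helper.) */
int rds_fill_addr(struct sockaddr_in *sin, const char *ip, unsigned short port)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_port = htons(port);    /* assumption: 0 means "pick a port" */
    if (inet_pton(AF_INET, ip, &sin->sin_addr) != 1)
        return -1;
    if (sin->sin_addr.s_addr == htonl(INADDR_ANY))
        return -1;
    return 0;
}

/* Bind an RDS socket to the address of one local interface.
 * Returns what bind(2) returns, or -1 if the address is unusable. */
int rds_bind(int fd, const char *ip, unsigned short port)
{
    struct sockaddr_in sin;

    if (rds_fill_addr(&sin, ip, port) != 0)
        return -1;
    return bind(fd, (struct sockaddr *)&sin, sizeof(sin));
}
```

       A caller would pass the address assigned to the interface whose
       transport it wants to use, for example the address of ib0.
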
   Connecting
       The default mode of operation for RDS is to use unconnected sockets
       and specify a destination address as an argument to sendmsg(2).
       However, RDS allows sockets to be connected to a remote end point
       using connect(2). If a socket is connected, calling sendmsg(2)
       without specifying a destination address will use the previously
       given remote address.

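       Both modes can be illustrated with one small sketch: the destination
       is supplied through msg_name for an unconnected socket, and left
       unset for a connected one. The helper names rds_prepare_msg() and
       rds_send() are hypothetical.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Prepare a msghdr carrying one payload buffer. If dest is NULL the
 * message names no destination and relies on an earlier connect(2).
 * (Hypothetical helper.) */
void rds_prepare_msg(struct msghdr *msg, struct iovec *iov,
                     struct sockaddr_in *dest, void *buf, size_t len)
{
    memset(msg, 0, sizeof(*msg));
    iov->iov_base = buf;
    iov->iov_len = len;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    if (dest != NULL) {
        msg->msg_name = dest;
        msg->msg_namelen = sizeof(*dest);
    }
}

/* Send one datagram; returns what sendmsg(2) returns. */
ssize_t rds_send(int fd, struct sockaddr_in *dest, void *buf, size_t len)
{
    struct msghdr msg;
    struct iovec iov;

    rds_prepare_msg(&msg, &iov, dest, buf, len);
    return sendmsg(fd, &msg, 0);
}
```
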
   Congestion Control
       RDS does not have explicit congestion control like common streaming
       protocols such as TCP. However, sockets have two queue limits
       associated with them: the send queue size and the receive queue
       size. Messages are accounted based on the number of bytes of
       payload.

       The send queue size limits how much data local processes can queue
       on a local socket (see "Message Transmission" below). If that limit
       is exceeded, the kernel will not accept further messages until the
       queue is drained and messages have been delivered to and
       acknowledged by the remote host.

       The receive queue size limits how much data RDS will put on the
       receive queue of a socket before marking the socket as congested.
       When a socket becomes congested, RDS will send a congestion map
       update to the other participating hosts, which are then expected to
       stop sending more messages to this port.

       There is a timing window during which a remote host can still
       continue to send messages to a congested port; RDS solves this by
       accepting these messages even if the socket's receive queue is
       already over the limit.

       As the application pulls incoming messages off the receive queue
       using recvmsg(2), the number of bytes on the receive queue will
       eventually drop below the receive queue size, at which point the
       port is marked uncongested and another congestion update is sent to
       all participating hosts. This tells them to allow applications to
       send additional messages to this port.

       The default values for the send and receive buffer sizes are
       controlled by the system-wide wmem_default and rmem_default sysctls,
       respectively. They can be tuned by the application through the
       SO_SNDBUF and SO_RCVBUF socket options.

   Blocking Behavior
       The sendmsg(2) and recvmsg(2) calls can block in a variety of
       situations. Whether a call blocks or returns with an error depends
       on the non-blocking setting of the file descriptor and the
       MSG_DONTWAIT message flag. If the file descriptor is set to blocking
       mode (which is the default), and the MSG_DONTWAIT flag is not given,
       the call will block.

       In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be
       used to specify a timeout (in seconds) after which the call will
       stop waiting and return an error. The default timeout is 0, which
       tells RDS to block indefinitely.

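       A sketch of setting both timeouts follows. It assumes that, as for
       other socket families (see socket(7)), the options take a struct
       timeval. The helper names ms_to_timeval() and set_io_timeout() are
       hypothetical.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Convert milliseconds to the struct timeval passed to SO_SNDTIMEO and
 * SO_RCVTIMEO. An all-zero timeval means "block indefinitely". */
struct timeval ms_to_timeval(long ms)
{
    struct timeval tv;

    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return tv;
}

/* Apply the same timeout to blocking sends and receives on fd.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(2). */
int set_io_timeout(int fd, long ms)
{
    struct timeval tv = ms_to_timeval(ms);

    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) != 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

       Passing MSG_DONTWAIT to an individual call remains the way to avoid
       blocking for that call only, without changing the socket's settings.
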
   Message Transmission
       Messages may be sent using sendmsg(2) once the RDS socket is bound.
       Message length cannot exceed 4 gigabytes, as the wire protocol uses
       an unsigned 32-bit integer to express the message length.

       RDS does not support out-of-band data. Applications are allowed to
       send to unicast addresses only; broadcast and multicast are not
       supported.

       A successful sendmsg(2) call puts the message in the socket's
       transmit queue, where it will remain until either the destination
       acknowledges that the message is no longer in the network or the
       application removes the message from the send queue.

       Messages can be removed from the send queue with the
       RDS_CANCEL_SENT_TO socket option described below.

       While a message is in the transmit queue, its payload bytes are
       accounted for. If an attempt is made to send a message while there
       is not sufficient room on the transmit queue, the call will either
       block or return EAGAIN.

       When trying to send to a destination that is marked congested (see
       above), the call will either block or return ENOBUFS.

       A message sent with no payload bytes will not consume any space in
       the destination's send buffer but will result in a message receipt
       on the destination. The receiver will not get any payload data but
       will be able to see the sender's address.

       Messages sent to a port to which no socket is bound will be silently
       discarded by the destination host. No error messages are reported to
       the sender.

   Message Receipt
       Messages may be received with recvmsg(2) on an RDS socket once it is
       bound to a source address. RDS will return messages in order, i.e.
       messages from the same sender will arrive in the same order in which
       they were sent.

       The address of the sender will be returned in the sockaddr_in
       structure pointed to by the msg_name field, if set.

       If the MSG_PEEK flag is given, the first message on the receive
       queue is returned without removing it from the queue.

       The memory consumed by messages waiting for delivery does not limit
       the number of messages that can be queued for receive. RDS does
       attempt to perform congestion control as described in the section
       above.

       If the length of the message exceeds the size of the buffer provided
       to recvmsg(2), then the remainder of the bytes in the message are
       discarded and the MSG_TRUNC flag is set in the msg_flags field. In
       this truncating case recvmsg(2) will still return the number of
       bytes copied, not the length of the entire message. If MSG_TRUNC is
       set in the flags argument to recvmsg(2), then it will return the
       number of bytes in the entire message. Thus one can examine the size
       of the next message in the receive queue without incurring a copying
       overhead by providing a zero-length buffer and setting MSG_PEEK and
       MSG_TRUNC in the flags argument.

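       The size probe described above can be sketched as a single
       recvmsg(2) call with an empty msghdr. The helper name
       next_message_size() is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Return the length of the next queued message without dequeuing it or
 * copying any payload, or -1 on error with errno set by recvmsg(2).
 * (Hypothetical helper.) */
ssize_t next_message_size(int fd)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));   /* no iovec, i.e. a zero-length buffer */
    return recvmsg(fd, &msg, MSG_PEEK | MSG_TRUNC);
}
```

       An application could use the returned length to allocate a buffer of
       the right size before receiving the message for real.
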
       The sending address of a zero-length message will still be provided
       in the msg_name field.

   Control Messages
       RDS uses control messages (a.k.a. ancillary data) through the
       msg_control and msg_controllen fields in sendmsg(2) and recvmsg(2).
       Control messages generated by RDS have a cmsg_level value of
       sol_rds. Most control messages are related to the zerocopy interface
       added in RDS version 3, and are described in rds-rdma(7).

       The only exception is the RDS_CMSG_CONG_UPDATE message, which is
       described in the following section.

   Polling
       RDS supports the poll(2) interface in a limited fashion. POLLIN is
       returned when there is a message (either a proper RDS message or a
       control message) waiting in the socket's receive queue. POLLOUT is
       always returned while there is room on the socket's send queue.

       Sending to congested ports requires special handling. When an
       application tries to send to a congested destination, the system
       call will return ENOBUFS. However, the application cannot poll for
       POLLOUT, as there is probably still room on the transmit queue, so
       the call to poll(2) would return immediately, even though the
       destination is still congested.

       There are two ways of dealing with this situation. The first is to
       simply poll for POLLIN. By default, a process sleeping in poll(2) is
       always woken up when the congestion map is updated, and thus the
       application can retry any previously congested sends.

       The second option is explicit congestion monitoring, which gives the
       application more fine-grained control.

       With explicit monitoring, the application polls for POLLIN as
       before, and additionally uses the RDS_CONG_MONITOR socket option to
       install a 64-bit mask value in the socket, where each bit
       corresponds to a group of ports. When a congestion update arrives,
       RDS checks the set of ports that became uncongested against the bit
       mask installed in the socket. If they overlap, a control message is
       enqueued on the socket, and the application is woken up. When it
       calls recvmsg(2), it will be given the control message containing
       the bitmap.

       The congestion monitor bitmask can be set and queried using
       setsockopt(2) with RDS_CONG_MONITOR, and a pointer to the 64-bit
       mask variable.

       Congestion updates are delivered to the application via
       RDS_CMSG_CONG_UPDATE control messages. These control messages are
       always delivered by themselves (or possibly with additional control
       messages), but never along with an RDS data message. The cmsg_data
       field of the control message is an 8-byte datum containing the
       64-bit mask value.

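       Extracting the mask from a received congestion update might look
       like the sketch below, which walks the ancillary data with the
       standard CMSG_* macros from cmsg(3). The helper name
       find_cong_update() is hypothetical. The level and type are passed in
       by the caller, on the assumption that the level was read from the
       sol_rds file and the type is the RDS_CMSG_CONG_UPDATE value from the
       RDS headers.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Scan the control messages of a received msghdr for one with the given
 * level and type, and copy out its 8-byte congestion bitmask.
 * Returns 1 if found, 0 otherwise. (Hypothetical helper.) */
int find_cong_update(struct msghdr *msg, int level, int type, uint64_t *mask)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != level || cmsg->cmsg_type != type)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(*mask)))
            continue;               /* too short to hold the mask */
        memcpy(mask, CMSG_DATA(cmsg), sizeof(*mask));
        return 1;
    }
    return 0;
}
```

       The payload is copied with memcpy(3) rather than dereferenced through
       a cast, since the data area of a control message is not guaranteed to
       be suitably aligned for a 64-bit load.
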
       Applications can use the following macros to test for and set bits
       in the bitmask:

       #define RDS_CONG_MONITOR_SIZE   64
       #define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
       #define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))

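       As a small illustration of how such a mask is built and tested, the
       sketch below repeats the macros with a 64-bit constant in the shift,
       so that bits 32 to 63 are representable. The helper names
       cong_mask_for_ports() and cong_update_covers() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define RDS_CONG_MONITOR_SIZE       64
#define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
/* 1ULL keeps the shift 64 bits wide; shifting a plain int by 32 or
 * more would be undefined. */
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))

/* Build the monitor mask covering every port in the array. */
uint64_t cong_mask_for_ports(const unsigned short *ports, size_t n)
{
    uint64_t mask = 0;
    size_t i;

    for (i = 0; i < n; i++)
        mask |= RDS_CONG_MONITOR_MASK(ports[i]);
    return mask;
}

/* Non-zero if an update mask may concern the given port. Ports that are
 * equal modulo 64 share a bit, so this can report false positives. */
int cong_update_covers(uint64_t update, unsigned short port)
{
    return (update & RDS_CONG_MONITOR_MASK(port)) != 0;
}
```

       Because each bit stands for a group of ports, a set bit only says
       that a retry is worth attempting; the send can still fail with
       ENOBUFS if a different port in the same group was the one that
       became uncongested.
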
   Canceling Messages
       An application can cancel (flush) messages from the send queue using
       the RDS_CANCEL_SENT_TO socket option with setsockopt(2). This call
       takes an optional sockaddr_in address structure as argument. If
       given, only messages to the destination specified by this address
       are discarded. If no address is given, all pending messages are
       discarded.

       Note that this affects messages that have not yet been transmitted
       as well as messages that have been transmitted, but for which no
       acknowledgment from the remote host has been received yet.

   Reliability
       If sendmsg(2) succeeds, RDS guarantees that the message will be
       visible to recvmsg(2) on a socket bound to the destination address
       as long as that destination socket remains open.

       If there is no socket bound on the destination, the message is
       silently dropped. If the sending RDS can't be sure that there is no
       socket bound, then it will try to send the message indefinitely
       until it can be sure or the sent message is canceled.

       If a socket is closed, then all pending sent messages on the socket
       are canceled and may or may not be seen by the receiver.

       The RDS_CANCEL_SENT_TO socket option can be used to cancel all
       pending messages to a given destination.

       If a receiving socket is closed with pending messages, then the
       sender considers those messages as having left the network and will
       not retransmit them.

       A message will only be seen by recvmsg(2) once, unless MSG_PEEK was
       specified. Once the message has been delivered, it is removed from
       the sending socket's transmit queue.

       All messages sent from the same socket to the same destination will
       be delivered in the order they're sent. Messages sent from different
       sockets, or to different destinations, may be delivered in any
       order.

SYSCTL VALUES

       These parameters may only be accessed through their files in
       /proc/sys/net/rds. Access through sysctl(2) is not supported.

       pf_rds This file contains the string representation of the protocol
              family constant passed to socket(2) to create a new RDS
              socket.

       sol_rds
              This file contains the string representation of the socket
              level parameter that is passed to getsockopt(2) and
              setsockopt(2) to manipulate RDS socket options.

       max_unacked_bytes and max_unacked_packets
              These parameters are used to tune the generation of
              acknowledgements. By default, the system receiving RDS
              messages does not send back explicit acknowledgements unless
              it transmits a message of its own (in which case the ACK is
              piggybacked onto the outgoing message) or the sending system
              requests an ACK.

              However, the sender needs to see an ACK from time to time so
              that it can purge old messages from the send queue. The
              unacked bytes and packet counters are used to keep track of
              how much data has been sent without requesting an ACK. The
              default is to request an acknowledgement every 16 packets, or
              every 16 MB, whichever comes first.

       reconnect_delay_min_ms and reconnect_delay_max_ms
              RDS uses host-to-host connections to transport RDS messages
              (both for the TCP and the InfiniBand transport). If this
              connection breaks, RDS will try to re-establish the
              connection. Because this reconnect may be triggered by both
              hosts at the same time and fail, RDS uses a random backoff
              before attempting a reconnect. These two parameters specify
              the minimum and maximum delay in milliseconds. The default
              values are 1 and 1000, respectively.

SEE ALSO

       rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2),
       setsockopt(2).

                                                                        RDS(7)