fi_msg(3)                      Libfabric v1.17.0                     fi_msg(3)

NAME

       fi_msg - Message data transfer operations

       fi_recv / fi_recvv / fi_recvmsg
              Post a buffer to receive an incoming message

       fi_send / fi_sendv / fi_sendmsg / fi_inject / fi_senddata /
       fi_injectdata
              Initiate an operation to send a message

SYNOPSIS

              #include <rdma/fi_endpoint.h>

              ssize_t fi_recv(struct fid_ep *ep, void *buf, size_t len,
                  void *desc, fi_addr_t src_addr, void *context);

              ssize_t fi_recvv(struct fid_ep *ep, const struct iovec *iov, void **desc,
                  size_t count, fi_addr_t src_addr, void *context);

              ssize_t fi_recvmsg(struct fid_ep *ep, const struct fi_msg *msg,
                  uint64_t flags);

              ssize_t fi_send(struct fid_ep *ep, const void *buf, size_t len,
                  void *desc, fi_addr_t dest_addr, void *context);

              ssize_t fi_sendv(struct fid_ep *ep, const struct iovec *iov,
                  void **desc, size_t count, fi_addr_t dest_addr, void *context);

              ssize_t fi_sendmsg(struct fid_ep *ep, const struct fi_msg *msg,
                  uint64_t flags);

              ssize_t fi_inject(struct fid_ep *ep, const void *buf, size_t len,
                  fi_addr_t dest_addr);

              ssize_t fi_senddata(struct fid_ep *ep, const void *buf, size_t len,
                  void *desc, uint64_t data, fi_addr_t dest_addr, void *context);

              ssize_t fi_injectdata(struct fid_ep *ep, const void *buf, size_t len,
                  uint64_t data, fi_addr_t dest_addr);

ARGUMENTS

       ep     Fabric endpoint on which to initiate send or post receive
              buffer.

       buf    Data buffer to send or receive.

       len    Length of data buffer to send or receive, specified in bytes.
              Valid transfers are from 0 bytes up to the endpoint’s
              max_msg_size.

       iov    Vectored data buffer.

       count  Count of vectored data entries.

       desc   Descriptor associated with the data buffer.  See fi_mr(3).

       data   Remote CQ data to transfer with the sent message.

       dest_addr
              Destination address for connectionless transfers.  Ignored for
              connected endpoints.

       src_addr
              Source address to receive from for connectionless transfers.
              Applies only to connectionless endpoints with the
              FI_DIRECTED_RECV capability enabled; otherwise this field is
              ignored.  If set to FI_ADDR_UNSPEC, any source address may
              match.

       msg    Message descriptor for send and receive operations.

       flags  Additional flags to apply for the send or receive operation.

       context
              User-specified pointer to associate with the operation.  This
              parameter is ignored if the operation will not generate a
              successful completion, unless an op flag specifies the context
              parameter be used for required input.

DESCRIPTION

       The send functions (fi_send, fi_sendv, fi_sendmsg, fi_inject,
       fi_senddata, and fi_injectdata) are used to transmit a message from
       one endpoint to another endpoint.  The main differences between the
       send functions are the number and type of parameters that they accept
       as input.  Otherwise, they perform the same general function.  Messages
       sent using fi_msg operations are received by a remote endpoint into a
       buffer posted to receive such messages.

       The receive functions (fi_recv, fi_recvv, fi_recvmsg) post a data
       buffer to an endpoint to receive inbound messages.  Like the send
       operations, receive operations are asynchronous.  Users should not
       touch the posted data buffer(s) until the receive operation has
       completed.

       An endpoint must be enabled before an application can post send or
       receive operations to it.  For connected endpoints, receive buffers may
       be posted prior to connect or accept being called on the endpoint.
       This ensures that buffers are available to receive incoming data
       immediately after the connection has been established.

       Completed message operations are reported to the user through one or
       more event collectors associated with the endpoint.  Users provide a
       context that is associated with each operation and is returned to the
       user as part of the event completion.  See fi_cq(3) for completion
       event details.

   fi_send
       The fi_send call transfers the data contained in the user-specified
       data buffer to a remote endpoint, with message boundaries being
       maintained.

   fi_sendv
       The fi_sendv call adds support for a scatter-gather list to fi_send.
       It transfers the set of data buffers referenced by the iov parameter
       to a remote endpoint as a single message.

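       As an illustration of how the iov entries combine into one message,
       the following minimal C sketch sums the entry lengths and checks the
       total against a caller-supplied max_msg_size.  The helper names
       iov_total_len and iov_fits are hypothetical and are not part of
       libfabric; only struct iovec comes from the system headers.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Size of the single message that a vectored send of these entries
 * would produce: the sum of all entry lengths. */
size_t iov_total_len(const struct iovec *iov, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += iov[i].iov_len;
    return total;
}

/* Nonzero if the combined length is within max_msg_size (a value the
 * caller is assumed to have read from the endpoint attributes). */
int iov_fits(const struct iovec *iov, size_t count, size_t max_msg_size)
{
    return iov_total_len(iov, count) <= max_msg_size;
}
```

       An application could run a check of this kind before calling
       fi_sendv, since valid transfers range from 0 bytes up to the
       endpoint’s max_msg_size.
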
   fi_sendmsg
       The fi_sendmsg call supports data transfers over both connected and
       connectionless endpoints, with the ability to control the send
       operation per call through the use of flags.  The fi_sendmsg function
       takes a struct fi_msg as input.

              struct fi_msg {
                  const struct iovec *msg_iov; /* scatter-gather array */
                  void               **desc;   /* local request descriptors */
                  size_t             iov_count;/* # elements in iov */
                  fi_addr_t          addr;     /* optional endpoint address */
                  void               *context; /* user-defined context */
                  uint64_t           data;     /* optional message data */
              };

   fi_inject
       The send inject call is an optimized version of fi_send with the
       following characteristics.  The data buffer is available for reuse
       immediately on return from the call, and no CQ entry will be written if
       the transfer completes successfully.

       Conceptually, this means that the fi_inject function behaves as if the
       FI_INJECT transfer flag were set, selective completions are enabled,
       and the FI_COMPLETION flag is not specified.  Note that the CQ entry
       will be suppressed even if the default behavior of the endpoint is to
       write CQ entries for all successful completions.  See the flags
       discussion below for more details.  The message size that can be used
       with fi_inject is limited by inject_size.

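       Because fi_inject is limited by inject_size, applications commonly
       choose between the inject and regular send paths by message length.
       The sketch below shows one way to express that choice.  The enum and
       function names are hypothetical, and the code assumes that
       inject_size is an inclusive upper bound; an application that wants to
       be conservative can use a strict comparison instead.

```c
#include <stddef.h>

/* Hypothetical labels for the two transmit paths. */
enum ex_send_path {
    EX_PATH_INJECT,  /* would call fi_inject: buffer reusable on return */
    EX_PATH_SEND     /* would call fi_send: wait for the completion */
};

/* Pick the inject path only when the message is within inject_size.
 * Assumption: inject_size is an inclusive limit. */
enum ex_send_path ex_choose_send_path(size_t len, size_t inject_size)
{
    return (len <= inject_size) ? EX_PATH_INJECT : EX_PATH_SEND;
}
```

       Messages that take the inject path generate no CQ entry on success,
       so the application must not wait for one.
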
   fi_senddata
       The send data call is similar to fi_send, but allows for the sending of
       remote CQ data (see FI_REMOTE_CQ_DATA flag) as part of the transfer.

   fi_injectdata
       The inject data call is similar to fi_inject, but allows for the
       sending of remote CQ data (see FI_REMOTE_CQ_DATA flag) as part of the
       transfer.

   fi_recv
       The fi_recv call posts a data buffer to the receive queue of the
       corresponding endpoint.  Posted receives are searched in the order in
       which they were posted to find a match for an incoming send.  Message
       boundaries are maintained.  The order in which the receives complete
       is dependent on the endpoint type and protocol.  For connectionless
       endpoints, the src_addr parameter can be used to indicate that a
       buffer should be posted to receive incoming data from a specific
       remote endpoint.

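       The matching rule described for fi_recv can be modelled in a few
       lines of C.  This is only a conceptual sketch of in-order matching
       with a source address filter, assuming FI_DIRECTED_RECV is enabled;
       real matching happens inside the provider.  The ex_ names are
       hypothetical, and EX_ADDR_UNSPEC is a stand-in for FI_ADDR_UNSPEC.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for FI_ADDR_UNSPEC; real code uses the libfabric definition. */
#define EX_ADDR_UNSPEC UINT64_MAX

struct ex_posted_recv {
    uint64_t src_addr;  /* filter supplied when the receive was posted */
    int      consumed;  /* nonzero once a message has matched it */
};

/* Walk the posted receives in posting order and consume the first one
 * whose filter accepts the sender.  Returns its index, or -1 if none. */
long ex_match_recv(struct ex_posted_recv *q, size_t n, uint64_t sender)
{
    for (size_t i = 0; i < n; i++) {
        if (q[i].consumed)
            continue;
        if (q[i].src_addr == EX_ADDR_UNSPEC || q[i].src_addr == sender) {
            q[i].consumed = 1;
            return (long)i;
        }
    }
    return -1;
}
```

       Note that a wildcard receive posted early will match before a more
       specific receive posted later, because the search follows posting
       order.
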
   fi_recvv
       The fi_recvv call adds support for a scatter-gather list to fi_recv.
       It posts the set of data buffers referenced by the iov parameter to
       receive incoming data.

   fi_recvmsg
       The fi_recvmsg call supports posting buffers over both connected and
       connectionless endpoints, with the ability to control the receive
       operation per call through the use of flags.  The fi_recvmsg function
       takes a struct fi_msg as input.

FLAGS

       The fi_recvmsg and fi_sendmsg calls allow the user to specify flags
       which can change the default message handling of the endpoint.  Flags
       specified with fi_recvmsg / fi_sendmsg override most flags previously
       configured with the endpoint, except where noted (see fi_endpoint(3)).
       The following flags are usable with fi_recvmsg and/or fi_sendmsg.

       FI_REMOTE_CQ_DATA
              Applies to fi_sendmsg and fi_senddata.  Indicates that remote CQ
              data is available and should be sent as part of the request.
              See fi_getinfo(3) for additional details on FI_REMOTE_CQ_DATA.

       FI_CLAIM
              Applies to posted receive operations for endpoints configured
              for FI_BUFFERED_RECV or FI_VARIABLE_MSG.  This flag is used to
              retrieve a message that was buffered by the provider.  See the
              Buffered Receives section for details.

       FI_COMPLETION
              Indicates that a completion entry should be generated for the
              specified operation.  The endpoint must be bound to a completion
              queue with FI_SELECTIVE_COMPLETION that corresponds to the
              specified operation, or this flag is ignored.

       FI_DISCARD
              Applies to posted receive operations for endpoints configured
              for FI_BUFFERED_RECV or FI_VARIABLE_MSG.  This flag is used to
              free a message that was buffered by the provider.  See the
              Buffered Receives section for details.

       FI_MORE
              Indicates that the user has additional requests that will
              immediately be posted after the current call returns.  Use of
              this flag may improve performance by enabling the provider to
              optimize its access to the fabric hardware.

       FI_INJECT
              Applies to fi_sendmsg.  Indicates that the outbound data buffer
              should be returned to the user immediately after the send call
              returns, even if the operation is handled asynchronously.  This
              may require that the underlying provider implementation copy the
              data into a local buffer and transfer out of that buffer.  This
              flag can only be used with messages smaller than inject_size.

       FI_MULTI_RECV
              Applies to posted receive operations.  This flag allows the user
              to post a single buffer that will receive multiple incoming
              messages.  Received messages will be packed into the receive
              buffer until the buffer has been consumed.  Use of this flag may
              cause a single posted receive operation to generate multiple
              events as messages are placed into the buffer.  The placement of
              received data into the buffer may be subject to provider
              specific alignment restrictions.

              The buffer will be released by the provider when the available
              buffer space falls below the specified minimum (see
              FI_OPT_MIN_MULTI_RECV).  Note that an entry to the associated
              receive completion queue will always be generated when the
              buffer has been consumed, even if other receive completions
              have been suppressed (i.e. the Rx context has been configured
              for FI_SELECTIVE_COMPLETION).  See the FI_MULTI_RECV completion
              flag in fi_cq(3).

       FI_INJECT_COMPLETE
              Applies to fi_sendmsg.  Indicates that a completion should be
              generated when the source buffer(s) may be reused.

       FI_TRANSMIT_COMPLETE
              Applies to fi_sendmsg and fi_recvmsg.  For sends, indicates that
              a completion should not be generated until the operation has
              been successfully transmitted and is no longer being tracked by
              the provider.  For receive operations, indicates that a
              completion may be generated as soon as the message has been
              processed by the local provider, even if the message data may
              not be visible to all processing elements.  See fi_cq(3) for
              target side completion semantics.

       FI_DELIVERY_COMPLETE
              Applies to fi_sendmsg.  Indicates that a completion should be
              generated when the operation has been processed by the
              destination.

       FI_FENCE
              Applies to transmits.  Indicates that the requested operation,
              also known as the fenced operation, and any operation posted
              after the fenced operation will be deferred until all previous
              operations targeting the same peer endpoint have completed.
              Operations posted after the fenced operation will see and/or
              replace the results of any operations initiated prior to the
              fenced operation.

              The ordering of operations starting at the posting of the
              fenced operation (inclusive) to the posting of a subsequent
              fenced operation (exclusive) is controlled by the endpoint’s
              ordering semantics.

       FI_MULTICAST
              Applies to transmits.  This flag indicates that the address
              specified as the data transfer destination is a multicast
              address.  This flag must be used in all multicast transfers, in
              conjunction with a multicast fi_addr_t.

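       The FI_MULTI_RECV release rule described above can be mirrored by
       simple application-side bookkeeping.  The following C sketch tracks
       how much of a multi-receive buffer has been consumed and reports when
       the remaining space has dropped below the minimum.  It is a
       simplified model under stated assumptions: it ignores provider
       specific alignment, the ex_ names are hypothetical, and min_free
       stands in for the value configured through FI_OPT_MIN_MULTI_RECV.

```c
#include <stddef.h>

struct ex_multi_recv {
    size_t len;       /* total size of the posted buffer */
    size_t used;      /* bytes occupied by received messages so far */
    size_t min_free;  /* stand-in for the FI_OPT_MIN_MULTI_RECV value */
};

/* Account for one received message of msg_len bytes.
 * Returns 1 if the free space has now fallen below min_free (the point
 * at which the buffer is expected to be released), 0 if the buffer can
 * still accept messages, and -1 if msg_len does not fit, a case this
 * sketch does not model. */
int ex_multi_recv_consume(struct ex_multi_recv *b, size_t msg_len)
{
    if (msg_len > b->len - b->used)
        return -1;
    b->used += msg_len;
    return (b->len - b->used) < b->min_free;
}
```

       In a real application the authoritative signal is the FI_MULTI_RECV
       completion flag described in fi_cq(3); bookkeeping like this is only
       useful for anticipating when a replacement buffer should be posted.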

Buffered Receives

       Buffered receives indicate that the networking layer allocates and
       manages the data buffers used to receive network data transfers.  As a
       result, received messages must be copied from the network buffers into
       application buffers for processing.  However, applications can avoid
       this copy if they are able to process the message in place (directly
       from the networking buffers).

       Handling buffered receives differs based on the size of the message
       being sent.  In general, smaller messages are passed directly to the
       application for processing.  However, for large messages, an
       application will only receive the start of the message and must claim
       the rest.  The details for how small messages are reported and large
       messages may be claimed are described below.

       When a provider receives a message, it will write an entry to the
       completion queue associated with the receiving endpoint.  For
       discussion purposes, the completion queue is assumed to be configured
       for FI_CQ_FORMAT_DATA.  Since buffered receives are not associated with
       application posted buffers, the CQ entry op_context will point to a
       struct fi_recv_context.

              struct fi_recv_context {
                  struct fid_ep *ep;
                  void *context;
              };

       The `ep' field will point to the receiving endpoint or Rx context, and
       `context' will be NULL.  The CQ entry’s `buf' will point to a provider
       managed buffer where the start of the received message is located, and
       `len' will be set to the total size of the message.

       The maximum sized message that a provider can buffer is limited by
       FI_OPT_BUFFERED_LIMIT.  This threshold can be obtained and may be
       adjusted by the application using the fi_getopt and fi_setopt calls,
       respectively.  Any adjustments must be made prior to enabling the
       endpoint.  The CQ entry `buf' will point to a buffer of received data.
       If the sent message is larger than the buffered amount, the CQ entry
       `flags' will have the FI_MORE bit set.  When the FI_MORE bit is set,
       `buf' will reference at least FI_OPT_BUFFERED_MIN bytes of data (see
       fi_endpoint(3) for more information).

       After being notified that a buffered receive has arrived, applications
       must either claim or discard the message.  Typically, small messages
       are processed and discarded, while large messages are claimed.
       However, an application is free to claim or discard any message
       regardless of message size.

       To claim a message, an application must post a receive operation with
       the FI_CLAIM flag set.  The struct fi_recv_context returned as part of
       the notification must be provided as the receive operation’s context.
       The struct fi_recv_context contains a `context' field.  Applications
       may modify this field prior to claiming the message.  When the claim
       operation completes, a standard receive completion entry will be
       generated on the completion queue.  The `context' of the associated CQ
       entry will be set to the `context' value passed in through the
       fi_recv_context structure, and the CQ entry flags will have the
       FI_CLAIM bit set.

       Buffered receives that are not claimed must be discarded by the
       application when it is done processing the CQ entry data.  To discard a
       message, an application must post a receive operation with the
       FI_DISCARD flag set.  The struct fi_recv_context returned as part of
       the notification must be provided as the receive operation’s context.
       When the FI_DISCARD flag is set for a receive operation, the receive
       input buffer(s) and length parameters are ignored.

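       The typical policy described above, where messages that arrived in
       full are discarded after processing and truncated ones are claimed,
       can be sketched as a small decision function.  The bit values below
       are placeholders chosen for illustration only; real code must use
       FI_MORE, FI_CLAIM, and FI_DISCARD from the libfabric headers.  The
       ex_ names are hypothetical.

```c
#include <stdint.h>

/* Placeholder bit values, NOT the real libfabric constants. */
#define EX_FI_MORE    (1ULL << 0)
#define EX_FI_CLAIM   (1ULL << 1)
#define EX_FI_DISCARD (1ULL << 2)

/* Given the flags of a buffered-receive CQ entry, choose the flag for
 * the follow-up receive operation: claim the message when only its
 * start was delivered (FI_MORE set), otherwise discard it once the
 * data in the CQ entry's buf has been processed. */
uint64_t ex_followup_flag(uint64_t cq_flags)
{
    return (cq_flags & EX_FI_MORE) ? EX_FI_CLAIM : EX_FI_DISCARD;
}
```

       The chosen flag would then be passed to a receive call such as
       fi_recvmsg, with the struct fi_recv_context from the notification
       supplied as the operation’s context.  This is one possible policy;
       an application remains free to claim or discard any message.
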
       IMPORTANT: Buffered receives must be claimed or discarded in a timely
       manner.  Failure to do so may result in increased memory usage for
       network buffering or communication stalls.  Once a buffered receive has
       been claimed or discarded, the original CQ entry `buf' or struct
       fi_recv_context data may no longer be accessed by the application.

       The use of the FI_CLAIM and FI_DISCARD operation flags is also
       described with respect to tagged message transfers in fi_tagged(3).
       Buffered receives of tagged messages will include the message tag as
       part of the CQ entry, if available.

       The handling of buffered receives follows all message ordering
       restrictions assigned to an endpoint.  For example, completions may
       indicate the order in which received messages arrived at the receiver
       based on the endpoint attributes.

Variable Length Messages

       Variable length messages, or simply variable messages, are transfers
       where the size of the message is unknown to the receiver prior to the
       message being sent.  That is, the recipient of a message does not know
       the amount of data to expect prior to the message arriving.  Variable
       messages are most commonly used when the size of message transfers
       varies greatly, with very large messages interspersed with much smaller
       messages, making receive side message buffering difficult to manage.
       Variable messages are not subject to max message length restrictions
       (i.e. struct fi_ep_attr::max_msg_size limits), and may be up to the
       maximum value of size_t (i.e. SIZE_MAX) in length.

       Requesting support for variable length messages asks the provider to
       allocate and manage the network message buffers.  As a result, the
       application requirements and provider behavior are identical to those
       defined for supporting the FI_BUFFERED_RECV mode bit.  See the Buffered
       Receives section above for details.  The main difference is that
       buffered receives are limited by the fi_ep_attr::max_msg_size
       threshold, whereas variable length messages are not.

       Support for variable messages is indicated through the FI_VARIABLE_MSG
       capability bit.

NOTES

       If an endpoint has been configured with FI_MSG_PREFIX, the application
       must include buffer space of size msg_prefix_size, as specified by the
       endpoint attributes.  The prefix buffer must occur at the start of the
       data referenced by the buf parameter, or be referenced by the first IO
       vector.  Message prefix space cannot be split between multiple IO
       vectors.  The size of the prefix buffer should be included as part of
       the total buffer length.

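       The buffer layout implied by FI_MSG_PREFIX can be captured in a
       short C sketch: the prefix occupies the first msg_prefix_size bytes,
       application data follows it, and the length passed to the transfer
       call covers both.  The struct and function names are hypothetical,
       and msg_prefix_size is assumed to have been read from the endpoint
       attributes.

```c
#include <stddef.h>

struct ex_prefix_layout {
    size_t payload_offset;  /* where application data starts in buf */
    size_t total_len;       /* length to pass: prefix plus payload */
};

/* Compute the layout of a single contiguous buffer that carries the
 * provider's message prefix followed by payload_len bytes of data. */
struct ex_prefix_layout ex_prefix_plan(size_t msg_prefix_size,
                                       size_t payload_len)
{
    struct ex_prefix_layout l;
    l.payload_offset = msg_prefix_size;
    l.total_len = msg_prefix_size + payload_len;
    return l;
}
```

       With a vectored call, the same rule means the first IO vector must be
       large enough to hold the whole prefix.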

RETURN VALUE

       Returns 0 on success.  On error, a negative value corresponding to
       fabric errno is returned.  Fabric errno values are defined in
       rdma/fi_errno.h.

       See the discussion below for details on handling FI_EAGAIN.

ERRORS

       -FI_EAGAIN
              Indicates that the underlying provider currently lacks the
              resources needed to initiate the requested operation.  The
              reasons for a provider returning FI_EAGAIN are varied.  However,
              common reasons include insufficient internal buffering or full
              processing queues.

              Insufficient internal buffering is often associated with
              operations that use FI_INJECT.  In such cases, additional
              buffering may become available as posted operations complete.

              Full processing queues may be a temporary state related to
              local processing (for example, a large message is being
              transferred), or may be the result of flow control.  In the
              latter case, the queues may remain blocked until additional
              resources are made available at the remote side of the
              transfer.

              In all cases, the operation may be retried after additional
              resources become available.  It is strongly recommended that
              applications check for transmit and receive completions after
              receiving FI_EAGAIN as a return value, independent of the
              operation which failed.  This is particularly important in
              cases where manual progress is employed, as acknowledgements or
              flow control messages may need to be processed in order to
              resume execution.

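       The recommended handling of FI_EAGAIN, retrying only after checking
       for completions, can be sketched as a generic loop.  The code below
       is an illustration under stated assumptions, not libfabric code: the
       ex_ names are hypothetical, -EAGAIN from errno.h stands in for
       -FI_EAGAIN, the post callback represents a call such as fi_send, and
       the progress callback represents reading the transmit and receive
       completion queues.  A small fake queue is included so that the sketch
       is self-contained.

```c
#include <errno.h>
#include <sys/types.h>

typedef ssize_t (*ex_post_fn)(void *arg);   /* would wrap e.g. fi_send */
typedef void (*ex_progress_fn)(void *arg);  /* would poll the CQs */

/* Try to post an operation up to max_tries times.  After every
 * -EAGAIN, drive progress before retrying so that resources held by
 * earlier operations can be released. */
ssize_t ex_post_with_retry(ex_post_fn post, ex_progress_fn progress,
                           void *arg, unsigned max_tries)
{
    ssize_t ret = -EAGAIN;
    for (unsigned i = 0; i < max_tries; i++) {
        ret = post(arg);
        if (ret != -EAGAIN)
            return ret;     /* success or a non-retryable error */
        progress(arg);
    }
    return ret;
}

/* Fake transmit queue used only to exercise the loop. */
struct ex_fake_queue {
    int free_slots;  /* slots available for new operations */
    int polls;       /* number of times progress was driven */
};

ssize_t ex_fake_post(void *arg)
{
    struct ex_fake_queue *q = arg;
    if (q->free_slots == 0)
        return -EAGAIN;
    q->free_slots--;
    return 0;
}

void ex_fake_progress(void *arg)
{
    struct ex_fake_queue *q = arg;
    q->polls++;
    q->free_slots++;  /* pretend one earlier operation completed */
}
```

       Bounding the number of attempts is a design choice of this sketch;
       an application may instead retry indefinitely or return the error to
       its caller.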

SEE ALSO

       fi_getinfo(3), fi_endpoint(3), fi_domain(3), fi_cq(3)

AUTHORS

       OpenFabrics.



Libfabric Programmer’s Manual     2022-12-11                         fi_msg(3)