fi_efa(7) Libfabric v1.18.1 fi_efa(7)

fi_efa - The Amazon Elastic Fabric Adapter (EFA) Provider

The EFA provider supports the Elastic Fabric Adapter (EFA) device on
Amazon EC2. EFA provides reliable and unreliable datagram send/receive
with direct hardware access from userspace (OS bypass).

The following features are supported:

Endpoint types
The provider supports the FI_EP_DGRAM endpoint type, and the
FI_EP_RDM endpoint type on a new Scalable (unordered) Reliable
Datagram (SRD) protocol. SRD provides reliable datagrams and more
complete error handling than is typically seen in other Reliable
Datagram (RD) implementations. The EFA provider performs
segmentation and reassembly of out-of-order packets to give
applications send-after-send ordering guarantees through its
FI_EP_RDM endpoint.

RDM Endpoint capabilities
The following data transfer interfaces are supported via the
FI_EP_RDM endpoint: FI_MSG, FI_TAGGED, and FI_RMA. The FI_SEND,
FI_RECV, FI_DIRECTED_RECV, FI_MULTI_RECV, and FI_SOURCE capabilities
are supported. The endpoint provides send-after-send guarantees for
data operations. The FI_EP_RDM endpoint does not have a maximum
message size.

DGRAM Endpoint capabilities
The DGRAM endpoint supports only the FI_MSG capability, with a
maximum message size equal to the MTU of the underlying hardware
(approximately 8 KiB).

Address vectors
The provider supports the FI_AV_TABLE and FI_AV_MAP address vector
types. FI_EVENT is unsupported.

Completion events
The provider supports FI_CQ_FORMAT_CONTEXT, FI_CQ_FORMAT_MSG, and
FI_CQ_FORMAT_DATA. FI_CQ_FORMAT_TAGGED is supported on the RDM
endpoint. Wait objects are not currently supported.

Modes
The provider requires the use of FI_MSG_PREFIX when running over the
DGRAM endpoint, and requires FI_MR_LOCAL for all memory registrations
on the DGRAM endpoint.

Memory registration modes
The RDM endpoint does not require memory registration for send and
receive operations, i.e. it does not require FI_MR_LOCAL.
Applications may specify FI_MR_LOCAL in the MR mode flags in order to
use descriptors provided by the application. The FI_EP_DGRAM endpoint
only supports FI_MR_LOCAL.

Progress
The RDM and DGRAM endpoints support FI_PROGRESS_MANUAL. EFA
erroneously claims support for FI_PROGRESS_AUTO, despite not properly
supporting automatic progress. Unfortunately, some Libfabric
consumers also ask for FI_PROGRESS_AUTO when they only require
FI_PROGRESS_MANUAL, and fixing this bug would break those
applications. This will be fixed in a future version of the EFA
provider by adding proper support for FI_PROGRESS_AUTO.

Threading
The RDM endpoint supports FI_THREAD_SAFE. The DGRAM endpoint supports
FI_THREAD_DOMAIN, i.e. the provider is not thread safe when using the
DGRAM endpoint.

The DGRAM endpoint does not support FI_ATOMIC interfaces. For RMA
operations, completion events for RMA targets (FI_RMA_EVENT) are not
supported. The DGRAM endpoint does not fully protect against resource
overruns, so resource management is disabled for this endpoint
(FI_RM_DISABLED).

No support for selective completions.

No support for counters for the DGRAM endpoint.

No support for inject.

When using FI_HMEM for AWS Neuron or Habana SynapseAI buffers, the
provider requires peer-to-peer transaction support between the EFA
device and the FI_HMEM device. Therefore, the FI_HMEM_P2P_DISABLED
option is not supported by the EFA provider for AWS Neuron or Habana
SynapseAI.

FI_OPT_EFA_RNR_RETRY
Defines the number of RNR (receiver not ready) retries. The
application can use it to reset the RNR retry counter through a call
to fi_setopt. Note that this option must be set before the endpoint
is enabled; otherwise, the call will fail. Also note that this option
applies only to the RDM endpoint.

FI_OPT_EFA_EMULATED_READ, FI_OPT_EFA_EMULATED_WRITE, FI_OPT_EFA_EMULATED_ATOMICS - bool
These options apply only to the fi_getopt() call. They are used to
query the EFA provider to determine whether the endpoint is emulating
Read, Write, and Atomic operations (return value is true), or whether
these operations are assisted by hardware support (return value is
false).

FI_OPT_EFA_USE_DEVICE_RDMA - bool
Only available if the application selects a libfabric API version
>= 1.18. This option allows an application to change libfabric’s
behavior with respect to RDMA transfers. Note that there is also an
environment variable, FI_EFA_USE_DEVICE_RDMA, which the user may set.
If the environment variable and the argument provided with this
option are in conflict, then fi_setopt will return -FI_EINVAL, and
the environment variable will be respected. If the hardware does not
support RDMA and the argument is true, then fi_setopt will return
-FI_EOPNOTSUPP. If the application uses API version < 1.18, the
argument is ignored and fi_setopt returns -FI_ENOPROTOOPT. The
default behavior for RDMA transfers depends on the API version. For
API >= 1.18, RDMA is enabled by default on any hardware that supports
it. For API < 1.18, RDMA is enabled by default only on certain newer
hardware revisions.

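The return-code rules for FI_OPT_EFA_USE_DEVICE_RDMA can be summarized
as a small decision function. This is a minimal sketch, not the
provider's code: the function name, the enum, and the order in which
the checks run are assumptions, and the plain errno values stand in for
the -FI_* codes named above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical tri-state view of the FI_EFA_USE_DEVICE_RDMA variable. */
enum env_rdma { ENV_RDMA_UNSET, ENV_RDMA_OFF, ENV_RDMA_ON };

/*
 * Hypothetical helper: returns 0 on success, or a negative errno value
 * standing in for the -FI_* code described in the text. The order of
 * the checks is an assumption.
 */
int use_device_rdma_setopt_result(int api_major, int api_minor,
                                  enum env_rdma env,
                                  bool hw_supports_rdma,
                                  bool requested)
{
    /* API versions older than 1.18 do not offer the option. */
    if (api_major < 1 || (api_major == 1 && api_minor < 18))
        return -ENOPROTOOPT;   /* stands in for -FI_ENOPROTOOPT */

    /* The environment variable wins when it disagrees with the call. */
    if (env != ENV_RDMA_UNSET && (env == ENV_RDMA_ON) != requested)
        return -EINVAL;        /* stands in for -FI_EINVAL */

    /* RDMA cannot be turned on if the hardware lacks it. */
    if (requested && !hw_supports_rdma)
        return -EOPNOTSUPP;    /* stands in for -FI_EOPNOTSUPP */

    return 0;
}
```

For example, with API 1.18, the variable set to off, and an argument of
true, the sketch reports the conflict case.
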
FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES - bool
Forces the endpoint to use in-order send/recv operations for each
128-byte aligned block. Enabling the option guarantees that data
inside each 128-byte aligned block is sent and received in order, and
that data is delivered to the receive buffer only once. If the
endpoint cannot support this feature, the call to fi_setopt() returns
-FI_EOPNOTSUPP.

FI_OPT_EFA_WRITE_IN_ORDER_ALIGNED_128_BYTES - bool
Sets the endpoint to use in-order RDMA write operations for each
128-byte aligned block. Enabling the option guarantees that data
inside each 128-byte aligned block is written in order, and that data
is delivered to the target buffer only once. If the endpoint cannot
support this feature, the call to fi_setopt() returns -FI_EOPNOTSUPP.

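The in-order guarantee is stated per 128-byte aligned block. As an
illustration of what that unit means, the sketch below checks whether a
byte range stays inside a single aligned block. The helper names are
hypothetical and are not part of libfabric.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ORDER_BLOCK_BYTES 128u  /* block size named by the options */

/* Index of the 128-byte aligned block that contains address addr. */
uintptr_t order_block_index(uintptr_t addr)
{
    return addr / ORDER_BLOCK_BYTES;
}

/* True if [addr, addr + len) lies entirely inside one aligned block. */
bool within_one_order_block(uintptr_t addr, size_t len)
{
    if (len == 0)
        return true;
    return order_block_index(addr) == order_block_index(addr + len - 1);
}
```

A range that crosses a 128-byte boundary spans more than one block, so
the per-block guarantee applies to each part separately.
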
FI_EFA_TX_SIZE
Maximum number of transmit operations before the provider returns
-FI_EAGAIN. For the RDM endpoint only, this parameter will cause
transmit operations to be queued when this value is set higher than
the default and the transmit queue is full.

FI_EFA_RX_SIZE
Maximum number of receive operations before the provider returns
-FI_EAGAIN.

FI_EFA_TX_IOV_LIMIT
Maximum number of IOVs for a transmit operation.

FI_EFA_RX_IOV_LIMIT
Maximum number of IOVs for a receive operation.

These OFI runtime parameters apply only to the RDM endpoint.

FI_EFA_RX_WINDOW_SIZE
Maximum number of MTU-sized messages that can be in flight from any
single endpoint as part of a long message data transfer.

FI_EFA_TX_QUEUE_SIZE
Depth of the transmit queue opened with the NIC. This may not be set
to a value greater than what the NIC supports.

FI_EFA_RECVWIN_SIZE
Size of the out-of-order reorder buffer (in messages). Messages
received outside this window will result in an error.

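The reorder window described for FI_EFA_RECVWIN_SIZE can be pictured as
a range of message IDs starting at the next expected one. The sketch
below is illustrative only: the function name, the 32-bit message ID,
and the wraparound handling are assumptions, not the provider's actual
bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if msg_id falls inside the window of recvwin_size message IDs
 * that starts at next_expected. Unsigned subtraction keeps the test
 * correct when the (assumed) 32-bit ID counter wraps around.
 */
bool in_recv_window(uint32_t msg_id, uint32_t next_expected,
                    uint32_t recvwin_size)
{
    return (uint32_t)(msg_id - next_expected) < recvwin_size;
}
```

With a window of 4 and next expected ID 5, IDs 5 through 8 are accepted,
while 4 and 9 fall outside the window.
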
FI_EFA_CQ_SIZE
Size of any CQ created, in number of entries.

FI_EFA_MR_CACHE_ENABLE
Enables using the MR cache and in-line registration instead of a
bounce buffer for IOVs larger than max_memcpy_size. Defaults to true.
When disabled, only a bounce buffer is used.

FI_EFA_MR_MAX_CACHED_COUNT
Sets the maximum number of memory registrations that can be cached at
any time.

FI_EFA_MR_MAX_CACHED_SIZE
Sets the maximum amount of memory that cached memory registrations
can hold onto at any time.

FI_EFA_MAX_MEMCPY_SIZE
Size threshold for switching between a memory copy into a
pre-registered bounce buffer and memory registration of the user
buffer.

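The interplay between FI_EFA_MR_CACHE_ENABLE and FI_EFA_MAX_MEMCPY_SIZE
can be sketched as a single choice per buffer. This is a simplified
reading of the descriptions above; the enum and function names are
hypothetical, and the real provider may consider more factors.

```c
#include <stdbool.h>
#include <stddef.h>

enum tx_buf_strategy {
    TX_COPY_TO_BOUNCE_BUFFER,   /* memcpy into a pre-registered buffer */
    TX_REGISTER_USER_BUFFER     /* register the user buffer (MR cache) */
};

/* Pick a strategy for one IOV of iov_len bytes. */
enum tx_buf_strategy choose_tx_buf_strategy(size_t iov_len,
                                            size_t max_memcpy_size,
                                            bool mr_cache_enabled)
{
    /* With the MR cache disabled, only the bounce buffer is used. */
    if (!mr_cache_enabled)
        return TX_COPY_TO_BOUNCE_BUFFER;

    /* IOVs larger than the threshold are registered instead of copied. */
    if (iov_len > max_memcpy_size)
        return TX_REGISTER_USER_BUFFER;

    return TX_COPY_TO_BOUNCE_BUFFER;
}
```

Raising the threshold trades registration overhead for more copying, so
the best value depends on the message sizes an application sends.
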
FI_EFA_MTU_SIZE
Overrides the default MTU size of the device.

FI_EFA_RX_COPY_UNEXP
Enables the use of a separate pool of bounce buffers to copy
unexpected messages out of the pre-posted receive buffers.

FI_EFA_RX_COPY_OOO
Enables the use of a separate pool of bounce buffers to copy
out-of-order RTS packets out of the pre-posted receive buffers.

FI_EFA_MAX_TIMEOUT
Maximum timeout (us) for backoff to a peer after a receiver not ready
error.

FI_EFA_TIMEOUT_INTERVAL
Time interval (us) for the base timeout to use for exponential
backoff to a peer after a receiver not ready error.

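FI_EFA_TIMEOUT_INTERVAL and FI_EFA_MAX_TIMEOUT together bound an
exponential backoff. The sketch below shows one common form, doubling
the base interval per retry and capping it at the maximum. The doubling
factor and the function name are assumptions; the provider's exact
formula is not given here.

```c
#include <stdint.h>

/*
 * Timeout in microseconds before retry number `attempt` (0-based):
 * the base interval doubled once per earlier attempt, capped at max_us.
 */
uint64_t rnr_backoff_us(uint64_t base_us, uint64_t max_us, unsigned attempt)
{
    uint64_t timeout = base_us;

    for (unsigned i = 0; i < attempt; i++) {
        if (timeout > max_us / 2)
            return max_us;      /* doubling would exceed the cap */
        timeout *= 2;
    }
    return timeout < max_us ? timeout : max_us;
}
```

With a base of 100 us and a cap of 1000 us, successive attempts wait
100, 200, 400, and 800 us, and every later attempt waits 1000 us.
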
FI_EFA_ENABLE_SHM_TRANSFER
Enables the SHM provider to carry communication between all
intra-node processes. SHM transfer is disabled when ptrace protection
is turned on; turn ptrace protection off to enable SHM transfer.

FI_EFA_SHM_AV_SIZE
Defines the maximum number of entries in the SHM provider’s address
vector.

FI_EFA_SHM_MAX_MEDIUM_SIZE
Defines the switch point between small/medium messages and large
messages. Messages larger than this switch point are transferred
with the large message protocol. NOTE: This parameter is now
deprecated.

FI_EFA_INTER_MAX_MEDIUM_MESSAGE_SIZE
The maximum size for inter EFA messages to be sent using the medium
message protocol. Messages that fit in one packet are sent as eager
messages. Messages whose sizes are smaller than this value are sent
using the medium message protocol. Other messages are sent using the
CTS-based long message protocol.

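The three-way choice described for FI_EFA_INTER_MAX_MEDIUM_MESSAGE_SIZE
can be written out as follows. This is a sketch under stated
assumptions: the names are hypothetical, the single-packet payload limit
is passed in as a parameter rather than derived from the MTU, and the
read-based protocols are left out.

```c
#include <stddef.h>

enum msg_protocol { PROTO_EAGER, PROTO_MEDIUM, PROTO_LONG_CTS };

/* Pick a protocol for a message of msg_size bytes. */
enum msg_protocol choose_msg_protocol(size_t msg_size,
                                      size_t max_single_packet_payload,
                                      size_t max_medium_msg_size)
{
    /* A message that fits in one packet is sent eagerly. */
    if (msg_size <= max_single_packet_payload)
        return PROTO_EAGER;

    /* Sizes smaller than the configured value use the medium protocol. */
    if (msg_size < max_medium_msg_size)
        return PROTO_MEDIUM;

    /* Everything else uses the CTS-based long message protocol. */
    return PROTO_LONG_CTS;
}
```
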
FI_EFA_FORK_SAFE
Enables fork() support. This may have a small performance impact and
should only be set when required. Applications that need to register
regions backed by huge pages and also need fork support are not
supported.

FI_EFA_RUNT_SIZE
The maximum number of bytes that will be eagerly sent by in-flight
messages that use the runting read message protocol (Default:
307200).

FI_EFA_SET_CUDA_SYNC_MEMOPS
Sets CU_POINTER_ATTRIBUTE_SYNC_MEMOPS for CUDA pointers (Default: 1).

FI_EFA_INTER_MIN_READ_MESSAGE_SIZE
The minimum message size in bytes for the inter EFA read message
protocol. If the instance supports RDMA read, messages whose size is
larger than this value will be sent using the read message protocol
(Default: 1048576).

FI_EFA_INTER_MIN_READ_WRITE_SIZE
The minimum message size for an inter EFA write to use the read write
protocol. If the firmware supports RDMA read and
FI_EFA_USE_DEVICE_RDMA is 1, write requests whose size is larger than
this value will use the read write protocol (Default: 65536).

FI_EFA_USE_DEVICE_RDMA
Specifies whether to require or ignore the RDMA features of the EFA
device.
- When set to 1/true/yes/on, all RDMA features of the EFA device are
  used. If the EFA device does not support RDMA and
  FI_EFA_USE_DEVICE_RDMA is set to 1/true/yes/on, the user’s
  application is aborted and a warning message is printed.
- When set to 0/false/no/off, libfabric will emulate all fi_rma
  operations instead of offloading them to the EFA network device.
  Libfabric will not use device RDMA to implement send/receive
  operations.
- If not set, RDMA operations will occur when available, based on the
  RDMA device ID/version.

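The accepted values of FI_EFA_USE_DEVICE_RDMA map onto three states: on,
off, and not set. The parser below is a minimal sketch of that mapping.
The function name is hypothetical, and exact case-sensitive matching and
treating unrecognized strings as "not set" are assumptions, not
documented behavior.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 (use RDMA), 0 (do not use RDMA), or -1 (unset/unrecognized). */
int parse_use_device_rdma(const char *value)
{
    static const char *const on[]  = { "1", "true", "yes", "on" };
    static const char *const off[] = { "0", "false", "no", "off" };

    if (value == NULL)
        return -1;

    for (size_t i = 0; i < sizeof(on) / sizeof(on[0]); i++) {
        if (strcmp(value, on[i]) == 0)
            return 1;
        if (strcmp(value, off[i]) == 0)
            return 0;
    }
    return -1;
}
```

An application could pass the result of
getenv("FI_EFA_USE_DEVICE_RDMA") (from <stdlib.h>) to this helper to see
which of the three cases above applies.
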
fabric(7), fi_provider(7), fi_getinfo(3)

OpenFabrics.

Libfabric Programmer’s Manual 2023-03-31 fi_efa(7)