fi_rxm(7)                  Libfabric v1.12.1                 fi_rxm(7)

NAME
fi_rxm - The RxM (RDM over MSG) Utility Provider

OVERVIEW
The RxM provider (ofi_rxm) is a utility provider that supports
FI_EP_RDM endpoints emulated over FI_EP_MSG endpoint(s) of an
underlying core provider. FI_EP_RDM endpoints have a reliable datagram
interface, and RxM emulates this by hiding the connection management
of the underlying FI_EP_MSG endpoints from the user. Additionally, RxM
can hide the memory registration requirements of a core provider such
as verbs from applications that do not support them.
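To make the on-demand connection model concrete, the following C sketch models an RDM send on top of lazily created connections. It is a hypothetical illustration, not RxM source code: conn_table, rdm_send() and conn_progress() are made-up names, and -EAGAIN stands in for libfabric's -FI_EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model, not RxM source: each RDM peer is lazily backed by
 * a MSG connection whose setup is invisible to the caller. */
enum conn_state { CONN_IDLE, CONN_CONNECTING, CONN_CONNECTED };

#define MAX_PEERS 8

struct conn_table {
    enum conn_state state[MAX_PEERS]; /* indexed by peer address */
};

/* Models the MSG provider finishing every pending connection request
 * (in RxM this is driven by progress). */
static void conn_progress(struct conn_table *t)
{
    for (size_t i = 0; i < MAX_PEERS; i++)
        if (t->state[i] == CONN_CONNECTING)
            t->state[i] = CONN_CONNECTED;
}

/* Models an RDM send: the caller never sees connect/accept, only
 * -EAGAIN (standing in for -FI_EAGAIN) while the connection is set up. */
static int rdm_send(struct conn_table *t, size_t peer)
{
    assert(peer < MAX_PEERS);
    switch (t->state[peer]) {
    case CONN_IDLE:
        t->state[peer] = CONN_CONNECTING; /* first send starts setup */
        return -EAGAIN;
    case CONN_CONNECTING:
        return -EAGAIN;                   /* still waiting for the peer */
    case CONN_CONNECTED:
    default:
        return 0;                         /* data would be sent here */
    }
}
```

In this model the first rdm_send() to a peer starts the connection and returns -EAGAIN; only after conn_progress() has run does a retry succeed, which is the behavior applications are asked to handle under Requirements for applications.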

REQUIREMENTS
Requirements for core provider
The RxM provider requires the core provider to support the following
features:

• MSG endpoints (FI_EP_MSG)

• RMA read/write (FI_RMA), used to implement the rendezvous protocol
for large messages.

• FI_OPT_CM_DATA_SIZE of at least 24 bytes.

Requirements for applications
Since RxM emulates RDM endpoints by hiding connection management, and
connections are established only on demand (when the application
first tries to send data), the first several data transfer calls may
return -FI_EAGAIN. Applications should be aware of this and retry
until the operation succeeds.

If an application has chosen manual progress for data progress, it
should also read the CQ so that connection establishment can
progress. Not doing so can result in a stall. See also the ERRORS
section in fi_msg(3).
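A minimal retry loop for the two requirements above might look like the C sketch below. stub_send() and stub_cq_read() are hypothetical stubs that model fi_send() and fi_cq_read() on a manual-progress endpoint whose connection needs a few progress calls to come up; -EAGAIN stands in for -FI_EAGAIN. A real application would also process any completions fi_cq_read() returns instead of discarding them.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins: stub_send() models fi_send() and
 * stub_cq_read() models fi_cq_read() on a manual-progress endpoint.
 * -EAGAIN stands in for -FI_EAGAIN. */
struct stub_ep {
    int connect_steps_left; /* progress calls until the connection is up */
    int cq_reads;           /* number of CQ polls made so far */
};

static long stub_send(struct stub_ep *ep)
{
    return ep->connect_steps_left > 0 ? -EAGAIN : 0;
}

/* With FI_PROGRESS_MANUAL, polling the CQ is what drives connection
 * establishment forward; this stub never returns a completion. */
static long stub_cq_read(struct stub_ep *ep)
{
    ep->cq_reads++;
    if (ep->connect_steps_left > 0)
        ep->connect_steps_left--;
    return -EAGAIN; /* no completion available */
}

/* Retries the send until it is accepted or max_tries is reached,
 * reading the CQ between attempts so the connection can progress. */
static long send_with_retry(struct stub_ep *ep, int max_tries)
{
    long ret = -EAGAIN;
    for (int i = 0; i < max_tries; i++) {
        ret = stub_send(ep);
        if (ret != -EAGAIN)
            break;              /* success, or a real error */
        (void)stub_cq_read(ep); /* real code would handle completions */
    }
    return ret;
}
```

Bounding the loop with max_tries is a design choice for the sketch; an application may prefer to keep retrying while interleaving other work.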

SUPPORTED FEATURES
The RxM provider currently supports FI_MSG, FI_TAGGED, FI_RMA and
FI_ATOMIC capabilities.

Endpoint types
The provider supports only FI_EP_RDM.

Endpoint capabilities
The following data transfer interfaces are supported: FI_MSG,
FI_TAGGED, FI_RMA, FI_ATOMIC.

Progress
The RxM provider supports both FI_PROGRESS_MANUAL and
FI_PROGRESS_AUTO. Manual progress generally gives better connection
scale-up and lower CPU utilization, since there is no separate
auto-progress thread.

Addressing Formats
FI_SOCKADDR, FI_SOCKADDR_IN

Memory Region
The FI_MR_VIRT_ADDR, FI_MR_ALLOCATED and FI_MR_PROV_KEY MR mode bits
are required from the application if the core provider requires
them.
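The Memory Region rule amounts to a bitmask check: of the mode bits RxM passes through, every one the core provider requires must also be supported by the application (in a real program, advertised through hints->domain_attr->mr_mode). The sketch below uses hypothetical HYP_MR_* values in place of the real FI_MR_* constants from <rdma/fabric.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; real code uses the FI_MR_* constants. */
#define HYP_MR_VIRT_ADDR (1u << 0)
#define HYP_MR_ALLOCATED (1u << 1)
#define HYP_MR_PROV_KEY  (1u << 2)

/* MR mode bits that RxM may require from the application when the
 * core provider requires them. */
#define HYP_MR_PASSTHROUGH \
    (HYP_MR_VIRT_ADDR | HYP_MR_ALLOCATED | HYP_MR_PROV_KEY)

/* Returns true if the application supports every pass-through bit the
 * core provider requires. */
static bool app_mr_mode_ok(unsigned core_required, unsigned app_supported)
{
    unsigned needed = core_required & HYP_MR_PASSTHROUGH;
    return (app_supported & needed) == needed;
}
```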

LIMITATIONS
When using the RxM provider, some limitations of the underlying MSG
provider may also show up. Refer to the corresponding MSG provider man
pages to find out about those limitations.

Unsupported features
The RxM provider does not support the following features:

• op_flags: FI_FENCE.

• Scalable endpoints

• Shared contexts

• FABRIC_DIRECT

• FI_MR_SCALABLE

• Authorization keys

• Application error data buffers

• Multicast

• FI_SYNC_ERR

• Reporting unknown source address data as part of completions

• Triggered operations

Progress limitations
When sending large messages, an application calling fi_cq_sread() or
waiting on the CQ file descriptor may not get a completion when it
reads the CQ after being woken up from the wait. The application has
to call fi_cq_sread() or wait on the file descriptor again. This is
needed because RxM uses a rendezvous protocol for large message
sends: the application is woken up from waiting on the CQ fd when the
rendezvous protocol request completes, but it must wait again for the
ACK from the receiver indicating that the large message transfer (a
remote RMA read) has completed.
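The practical consequence is that a wakeup must not be treated as proof of a completion. The C sketch below models this with hypothetical stubs: model_wait() stands in for blocking on the CQ file descriptor (or the wait inside fi_cq_sread()), and model_cq_read() stands in for fi_cq_read(), returning -EAGAIN (in place of -FI_EAGAIN) until the receiver's ACK has arrived.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a large rendezvous send. Each internal event
 * (rendezvous request done, receiver ACK) wakes the waiter, but only
 * the last one produces a user-visible completion. */
struct model_cq {
    int events_left; /* internal events before the completion appears */
    int wakeups;     /* how many times the waiter was woken */
};

/* Stands in for blocking on the CQ fd or inside fi_cq_sread(). */
static void model_wait(struct model_cq *cq)
{
    cq->wakeups++;
    if (cq->events_left > 0)
        cq->events_left--;
}

/* Stands in for fi_cq_read(): 1 completion, or -EAGAIN if none yet. */
static long model_cq_read(const struct model_cq *cq)
{
    return cq->events_left == 0 ? 1 : -EAGAIN;
}

/* A wakeup is not proof of a completion: read, and wait again on
 * -EAGAIN. */
static long wait_for_completion(struct model_cq *cq)
{
    long ret;
    while ((ret = model_cq_read(cq)) == -EAGAIN)
        model_wait(cq);
    return ret;
}
```

With two internal events (rendezvous request, then ACK), wait_for_completion() blocks twice before the completion is returned.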

FI_ATOMIC limitations
The FI_ATOMIC capability is listed in fi_info only if the fi_info
hints parameter specifies FI_ATOMIC. If FI_ATOMIC is requested, the
message orderings FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_WAR,
FI_ORDER_WAW, FI_ORDER_SAR and FI_ORDER_SAW cannot be supported.
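One way to picture the FI_ATOMIC restriction is as a mask applied to the requested message ordering (in a real program, hints->caps and hints->tx_attr->msg_order). The sketch below is hypothetical: the HYP_* bit values are placeholders for the FI_ATOMIC and FI_ORDER_* constants, and it models only which orderings remain usable, not how fi_getinfo() reports the restriction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values; real code uses FI_ATOMIC and FI_ORDER_*. */
#define HYP_CAP_ATOMIC (UINT64_C(1) << 0)

#define HYP_ORDER_RAR (UINT64_C(1) << 0)
#define HYP_ORDER_RAW (UINT64_C(1) << 1)
#define HYP_ORDER_WAR (UINT64_C(1) << 2)
#define HYP_ORDER_WAW (UINT64_C(1) << 3)
#define HYP_ORDER_SAR (UINT64_C(1) << 4)
#define HYP_ORDER_SAW (UINT64_C(1) << 5)
#define HYP_ORDER_SAS (UINT64_C(1) << 6)

/* Orderings listed as unsupported when FI_ATOMIC is requested. */
#define HYP_ORDER_LOST_WITH_ATOMIC \
    (HYP_ORDER_RAR | HYP_ORDER_RAW | HYP_ORDER_WAR | \
     HYP_ORDER_WAW | HYP_ORDER_SAR | HYP_ORDER_SAW)

/* Returns the subset of the requested ordering an application can
 * still rely on, given the capabilities it asked for. */
static uint64_t usable_msg_order(uint64_t caps, uint64_t requested)
{
    if (caps & HYP_CAP_ATOMIC)
        return requested & ~HYP_ORDER_LOST_WITH_ATOMIC;
    return requested;
}
```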

Miscellaneous limitations
• RxM protocol peers must have the same endianness; otherwise
connections will not complete successfully. This restriction enables
better run-time performance, since byte-order translations are
avoided.
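Because no byte swapping takes place, an application running on mixed hardware may want to detect a mismatch itself before connecting. The C sketch below shows a portable host byte-order probe and a hypothetical compatibility check on a one-byte tag exchanged out of band; the tag values and the exchange are assumptions, not part of RxM's wire protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical one-byte tags an application could exchange out of band
 * before creating RxM connections; not part of RxM's wire protocol. */
enum { HYP_BYTES_LE = 1, HYP_BYTES_BE = 2 };

/* Portable probe: the first byte in memory of the value 1 is 1 only on
 * a little-endian host. */
static unsigned char local_byte_order(void)
{
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1 ? HYP_BYTES_LE : HYP_BYTES_BE;
}

/* A peer is compatible only if it uses the same byte order. */
static bool peer_compatible(unsigned char peer_tag)
{
    return peer_tag == local_byte_order();
}
```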

RUNTIME PARAMETERS
The ofi_rxm provider checks for the following environment variables.

FI_OFI_RXM_BUFFER_SIZE
Defines the transmit buffer size / inject size. Messages smaller than
this are transmitted via an eager protocol, and larger ones via a
rendezvous or SAR (Segmentation And Reassembly) protocol. Transmit
data is copied up to this size (default: ~16k).

FI_OFI_RXM_COMP_PER_PROGRESS
Defines the maximum number of MSG provider CQ entries (default: 1)
that are read per progress call (RxM CQ read).

FI_OFI_RXM_ENABLE_DYN_RBUF
Enables support for dynamic receive buffering, if available from the
message endpoint provider. This feature allows direct placement of
received message data into application buffers, bypassing RxM bounce
buffers. It targets providers that provide internal network
buffering, such as the tcp provider (default: false).

FI_OFI_RXM_SAR_LIMIT
Set this environment variable to control the RxM SAR (Segmentation
And Reassembly) protocol. Messages larger than this (default: 128 KB)
are transmitted via the rendezvous protocol.

FI_OFI_RXM_USE_SRX
Set this to 1 to use a shared receive context from the MSG provider,
or 0 to disable using shared receive contexts. Shared receive
contexts reduce overall memory usage, but may increase message
latency. If not set, verbs will not use shared receive contexts by
default, but the tcp provider will.

FI_OFI_RXM_TX_SIZE
Defines the default TX context size (default: 1024).

FI_OFI_RXM_RX_SIZE
Defines the default RX context size (default: 1024).

FI_OFI_RXM_MSG_TX_SIZE
Defines the FI_EP_MSG TX size that is requested (default: 128).

FI_OFI_RXM_MSG_RX_SIZE
Defines the FI_EP_MSG RX size that is requested (default: 128).

FI_UNIVERSE_SIZE
Defines the expected number of ranks / peers an endpoint will
communicate with (default: 256).

FI_OFI_RXM_CM_PROGRESS_INTERVAL
Defines the time, in microseconds, between calls to the RxM CM
progress functions when using manual progress. Higher values may
cause less noise in calls to the fi_cq read functions, but may
increase connection setup time (default: 10000).

FI_OFI_RXM_CQ_EQ_FAIRNESS
Defines the maximum number of message provider CQ entries that can be
read consecutively across progress calls without checking whether the
CM progress interval has been reached (default: 128).
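The interaction between FI_OFI_RXM_BUFFER_SIZE and FI_OFI_RXM_SAR_LIMIT can be summarized as a size-based protocol choice. The C sketch below is a hypothetical model of that choice, not RxM's code; in particular, whether a message exactly equal to a threshold goes one way or the other is an assumption.

```c
#include <assert.h>
#include <stddef.h>

enum rxm_proto { PROTO_EAGER, PROTO_SAR, PROTO_RENDEZVOUS };

/* Hypothetical model of the size-based protocol choice described for
 * FI_OFI_RXM_BUFFER_SIZE and FI_OFI_RXM_SAR_LIMIT. Boundary handling
 * (< versus <=) is an assumption, not taken from RxM. */
static enum rxm_proto pick_protocol(size_t len, size_t buffer_size,
                                    size_t sar_limit)
{
    if (len < buffer_size)
        return PROTO_EAGER;      /* copied into a transmit buffer */
    if (len <= sar_limit)
        return PROTO_SAR;        /* segmented and reassembled */
    return PROTO_RENDEZVOUS;     /* receiver pulls data by RMA read */
}
```

Approximating the documented defaults as 16384 and 131072 bytes, pick_protocol(65536, 16384, 131072) selects SAR, while a 1 MiB message goes through rendezvous.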

Bandwidth
To optimize for bandwidth, use values higher than the defaults for
FI_OFI_RXM_TX_SIZE, FI_OFI_RXM_RX_SIZE, FI_OFI_RXM_MSG_TX_SIZE and
FI_OFI_RXM_MSG_RX_SIZE, subject to the memory limits of the system and
the TX and RX sizes supported by the MSG provider.

FI_OFI_RXM_SAR_LIMIT is another knob that can be experimented with to
optimize for bandwidth.

Memory
To conserve memory, ensure that FI_UNIVERSE_SIZE is set to what is
required. Similarly, check that the FI_OFI_RXM_TX_SIZE,
FI_OFI_RXM_RX_SIZE, FI_OFI_RXM_MSG_TX_SIZE and FI_OFI_RXM_MSG_RX_SIZE
environment variables are set only to the required values.
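Because these knobs are plain environment variables, a launcher or the application itself can set them before libfabric is initialized, which is assumed here to mean before the first fi_getinfo() call. The sketch below uses POSIX setenv() to size FI_UNIVERSE_SIZE to the job; the helper names set_tunable() and configure_universe() are hypothetical, and the code deliberately does not overwrite a value already present so that settings from the job launcher take precedence.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: exports name=value unless the variable is
 * already set, so values given by the job launcher take precedence. */
static int set_tunable(const char *name, unsigned long value)
{
    char buf[32];
    int n = snprintf(buf, sizeof buf, "%lu", value);
    if (n < 0 || (size_t)n >= sizeof buf)
        return -1;
    return setenv(name, buf, 0); /* 0: do not overwrite */
}

/* Sizes FI_UNIVERSE_SIZE to the number of peers this process expects
 * to talk to; assumed to be needed before the first fi_getinfo(). */
static int configure_universe(unsigned long expected_peers)
{
    if (expected_peers == 0)
        return -1;
    return set_tunable("FI_UNIVERSE_SIZE", expected_peers);
}
```

The same set_tunable() helper can be applied to FI_OFI_RXM_TX_SIZE, FI_OFI_RXM_RX_SIZE and the MSG queue sizes once suitable values have been measured.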

The data transfer API may return -FI_EAGAIN during on-demand
connection setup of the core provider FI_EP_MSG endpoint. See
fi_msg(3) for a detailed description of handling FI_EAGAIN.

If an RxM endpoint is expected to communicate with more peers than
the default value of FI_UNIVERSE_SIZE (256), CQ overruns can happen.
To avoid this, set a higher value for FI_UNIVERSE_SIZE. A CQ overrun
can make a MSG endpoint unusable.

At higher numbers of ranks, there may be connection errors due to a
node running out of memory. The workaround is to use shared receive
contexts for the MSG provider (FI_OFI_RXM_USE_SRX=1), or to reduce
the eager message size (FI_OFI_RXM_BUFFER_SIZE) and the MSG provider
TX/RX queue sizes (FI_OFI_RXM_MSG_TX_SIZE / FI_OFI_RXM_MSG_RX_SIZE).
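To see why these workarounds help, a deliberately crude back-of-envelope model is enough. The C sketch below assumes, purely for illustration, that every MSG TX/RX queue slot holds one buffer of FI_OFI_RXM_BUFFER_SIZE bytes, and that a shared receive context replaces the per-peer receive queues with a single pool of a hypothetical srx_slots entries. RxM's real allocation differs; the point is only which terms grow with the number of peers.

```c
#include <assert.h>
#include <stdint.h>

/* Crude, hypothetical memory model for illustration only: every MSG
 * TX/RX queue slot is assumed to hold one buffer of buf_size bytes
 * (FI_OFI_RXM_BUFFER_SIZE). RxM's real allocation differs. */
static uint64_t est_bytes(uint64_t peers, uint64_t msg_tx, uint64_t msg_rx,
                          uint64_t buf_size, int use_srx,
                          uint64_t srx_slots)
{
    uint64_t tx = peers * msg_tx * buf_size;   /* one TX queue per peer */
    uint64_t rx = use_srx
        ? srx_slots * buf_size                 /* one shared RX pool */
        : peers * msg_rx * buf_size;           /* one RX queue per peer */
    return tx + rx;
}
```

Under these assumptions, 1000 peers with 128-entry MSG queues and 16 KiB buffers come to about 4.2 GB, and a shared receive pool of 1024 slots roughly halves that, since only the per-peer TX term still scales with the peer count. Halving the buffer size halves every term.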

SEE ALSO
fabric(7), fi_provider(7), fi_getinfo(3)

AUTHORS
OpenFabrics.

Libfabric Programmer's Manual          2021-01-25          fi_rxm(7)