fi_psm3(7)                    Libfabric v1.12.1                    fi_psm3(7)

NAME
       fi_psm3 - The PSM3 Fabric Provider

OVERVIEW
       The psm3 provider implements a Performance Scaled Messaging capability
       which supports Intel RoCEv2 capable NICs. PSM3 represents an Ethernet
       and standard RoCEv2 enhancement of previous PSM implementations.

SUPPORTED FEATURES
       The psm3 provider supports a subset of all the features defined in the
       libfabric API.

       Endpoint types
              Supports the connectionless endpoint types FI_DGRAM and FI_RDM.

       Endpoint capabilities
              Endpoints can support any combination of the data transfer
              capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA. These
              capabilities can be further refined by FI_SEND, FI_RECV,
              FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit
              the direction of operations.

       FI_MULTI_RECV is supported for the non-tagged message queue only.

       Scalable endpoints are supported if the underlying PSM3 library
       supports multiple endpoints. This condition must be satisfied both
       when the provider is built and when the provider is used. See the
       Scalable endpoints section for more information.

       Other supported capabilities include FI_TRIGGER, FI_REMOTE_CQ_DATA,
       FI_RMA_EVENT, FI_SOURCE, and FI_SOURCE_ERR. Furthermore,
       FI_NAMED_RX_CTX is supported when scalable endpoints are enabled.

       Modes  FI_CONTEXT is required for the FI_TAGGED and FI_MSG
              capabilities. That is, any request in these two categories that
              generates a completion must pass a valid pointer to a struct
              fi_context as the operation context, and the memory referenced
              by the pointer must remain untouched until the request has
              completed. If neither FI_TAGGED nor FI_MSG is requested,
              FI_CONTEXT mode is not required.

       Progress
              The psm3 provider performs best with manual progress. By
              default, the application is expected to call fi_cq_read or
              fi_cntr_read from time to time when no other libfabric function
              is called, to ensure that progress is made in a timely manner.
              The provider does support auto progress mode. However,
              performance can be significantly impacted if the application
              depends purely on the provider to make auto progress.

       Scalable endpoints
              Scalable endpoint support depends on the multi-EP feature of
              the PSM3 library. If the PSM3 library supports this feature,
              its availability is further controlled by the environment
              variable PSM3_MULTI_EP. The psm3 provider automatically sets
              this variable to 1 if it is not set. The feature can be
              disabled explicitly by setting PSM3_MULTI_EP to 0.

       When creating a scalable endpoint, the exact number of contexts
       requested should be set in the "fi_info" structure passed to the
       fi_scalable_ep function. This number should be set in
       "fi_info->ep_attr->tx_ctx_cnt" or "fi_info->ep_attr->rx_ctx_cnt" or
       both; the greater of the two values is used. The psm3 provider
       allocates all requested contexts upfront when the scalable endpoint is
       created. The same context is used for both Tx and Rx.
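
       The sizing rule above can be sketched in a few lines of C. This is
       only an illustration: struct ctx_request and psm3_ctx_count are
       hypothetical names that stand in for the tx_ctx_cnt and rx_ctx_cnt
       fields of fi_info->ep_attr; they are not part of libfabric.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two fi_info->ep_attr fields named above. */
struct ctx_request {
    size_t tx_ctx_cnt;
    size_t rx_ctx_cnt;
};

/*
 * Number of contexts allocated for a scalable endpoint: the larger of the
 * two counts, because one context serves both Tx and Rx.
 */
static size_t psm3_ctx_count(const struct ctx_request *req)
{
    return req->tx_ctx_cnt > req->rx_ctx_cnt ? req->tx_ctx_cnt
                                             : req->rx_ctx_cnt;
}
```

       For example, a request with tx_ctx_cnt = 4 and rx_ctx_cnt = 2 would
       lead to four contexts under this rule.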

       For optimal performance, avoid having multiple threads access the same
       context, either directly by posting send/recv/read/write requests, or
       indirectly by polling the associated completion queues or counters.

       Using the scalable endpoint as a whole in communication functions is
       not supported. Instead, the individual tx or rx contexts of the
       scalable endpoint should be used. Similarly, using the address of the
       scalable endpoint as the source or destination address does not
       collectively address all the tx/rx contexts. Instead, it addresses
       only the first tx/rx context.

LIMITATIONS
       The psm3 provider doesn't support all the features defined in the
       libfabric API. Here are some of the limitations not listed above:

       Unsupported features
              These features are unsupported: connection management, passive
              endpoint, and shared receive context.

RUNTIME PARAMETERS
       The psm3 provider checks for the following environment variables:

       FI_PSM3_UUID
              PSM requires that each job has a unique ID (UUID). All the
              processes in the same job need to use the same UUID in order to
              be able to talk to each other. The PSM reference manual advises
              keeping the UUID unique to each job. In practice, it generally
              works fine to reuse a UUID as long as (1) no two jobs with the
              same UUID are running at the same time; and (2) previous jobs
              with the same UUID have exited normally. If "resource busy" or
              "connection failure" errors occur for no apparent reason, it is
              advisable to manually set the UUID to a value different from
              the default.

       The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.

       It is possible to create endpoints with a UUID different from the one
       set here. To achieve that, set 'info->ep_attr->auth_key' to the UUID
       value and 'info->ep_attr->auth_key_size' to its size (16 bytes) when
       calling fi_endpoint() or fi_scalable_ep(). It is still true that an
       endpoint can only communicate with endpoints with the same UUID.
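
       To fill a 16-byte auth_key, the textual UUID first has to be turned
       into raw bytes. The following is a minimal sketch of such a
       conversion. The helper names hex_val and parse_uuid are hypothetical,
       and storing the bytes in the order they appear in the string is an
       assumption; the byte order the provider expects is not stated here.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Parse a UUID such as "00FF00FF-0000-0000-0000-00FF0F0F00FF" into 16 bytes.
 * Dashes between bytes are skipped. Returns 0 on success, -1 on bad input.
 * Assumption: bytes are stored in the order they appear in the string.
 */
static int parse_uuid(const char *s, uint8_t out[16])
{
    size_t n = 0;   /* number of bytes produced so far */

    while (*s) {
        if (*s == '-') {
            s++;
            continue;
        }
        int hi = hex_val(s[0]);
        if (hi < 0)
            return -1;
        int lo = hex_val(s[1]);   /* s[1] is at worst the terminating NUL */
        if (lo < 0 || n == 16)
            return -1;
        out[n++] = (uint8_t)((hi << 4) | lo);
        s += 2;
    }
    return n == 16 ? 0 : -1;
}
```

       The resulting buffer could then be assigned to
       info->ep_attr->auth_key, with info->ep_attr->auth_key_size set to 16,
       before the endpoint is created.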

       FI_PSM3_NAME_SERVER
              The psm3 provider has a simple built-in name server that can be
              used to resolve an IP address or host name into a transport
              address needed by the fi_av_insert call. The main purpose of
              this name server is to allow simple client-server type
              applications (such as those in fabtests) to be written purely
              with libfabric, without using any out-of-band communication
              mechanism. For such applications, the server would run first so
              that its endpoints can be created and registered with the name
              server, and then the client would call fi_getinfo with the node
              parameter set to the IP address or host name of the server. The
              resulting fi_info structure would have the transport address of
              the endpoint created by the server in the dest_addr field.
              Optionally, the service parameter can be used in addition to
              node. Notice that the service number is interpreted by the
              provider and is not a TCP/IP port number.

       The name server is on by default. It can be turned off by setting the
       variable to 0. This may save a small amount of resources, since a
       separate thread is created when the name server is on.

       The provider detects OpenMPI and MPICH runs and changes the default
       setting to off.
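
       An application that does not need the name server can also turn it
       off from within its own code rather than from the shell. The sketch
       below uses the POSIX setenv call. The function name
       psm3_disable_name_server is hypothetical, and the requirement that the
       variable be set before the first fi_getinfo call is an assumption
       about when the provider reads its environment.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose setenv() under strict ISO C modes */
#endif
#include <stdlib.h>

/*
 * Turn off the psm3 name server unless a value is already present in the
 * environment. Assumed to be called before the first fi_getinfo call.
 * Returns 0 on success and -1 on failure, as setenv() does.
 */
static int psm3_disable_name_server(void)
{
    /* overwrite = 0 keeps a value the user has already chosen */
    return setenv("FI_PSM3_NAME_SERVER", "0", 0);
}
```

       Passing 0 as the last argument lets a value set on the command line
       take precedence over the value chosen by the program.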

       FI_PSM3_TAGGED_RMA
              The RMA functions are implemented on top of the PSM Active
              Message functions. The Active Message functions have a limit on
              the size of data that can be transferred in a single message.
              Large transfers can be divided into small chunks and pipelined.
              However, the bandwidth achieved this way is suboptimal.

       The psm3 provider uses the PSM tag-matching message queue functions to
       achieve higher bandwidth for large RMA transfers. It takes advantage
       of the extra tag bits available in PSM3 to separate the RMA traffic
       from the regular tagged message queue.

       The option is on by default. To turn it off, set the variable to 0.

       FI_PSM3_DELAY
              Time (seconds) to sleep before closing PSM endpoints. This is a
              workaround for a bug in some versions of the PSM library.

       The default setting is 0.

       FI_PSM3_TIMEOUT
              Timeout (seconds) for gracefully closing PSM endpoints. A
              forced close is issued if the timeout expires.

       The default setting is 5.

       FI_PSM3_CONN_TIMEOUT
              Timeout (seconds) for establishing a connection between two PSM
              endpoints.

       The default setting is 5.

       FI_PSM3_PROG_INTERVAL
              When auto progress is enabled (requested via the hints to
              fi_getinfo), a progress thread is created to make progress
              calls from time to time. This option sets the interval
              (microseconds) between progress calls.

       The default setting is 1 if affinity is set, or 1000 if not. See
       FI_PSM3_PROG_AFFINITY.

       FI_PSM3_PROG_AFFINITY
              When set, specifies the set of CPU cores to which the progress
              thread affinity is set. The format is
              <start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*, where
              each triplet <start>:<end>:<stride> defines a block of
              core_ids. Both <start> and <end> can be either the core_id
              (when >=0) or core_id - num_cores (when <0).

       By default affinity is not set.
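
       To make the affinity format concrete, the sketch below expands such a
       string into a list of core ids. It is not the provider's parser. The
       function name expand_affinity is hypothetical, and three behaviors are
       assumptions not stated above: a missing <end> defaults to <start>, a
       missing <stride> defaults to 1, and ids outside [0, num_cores) are
       skipped.

```c
#include <stdio.h>
#include <string.h>

/*
 * Expand an affinity string into core ids. Writes at most max ids to out
 * and returns how many were written, or -1 on malformed input.
 * Uses strtok(), so it is not safe to call from several threads at once.
 */
static int expand_affinity(const char *spec, int num_cores, int *out, int max)
{
    char buf[256];
    int count = 0;

    if (strlen(spec) >= sizeof(buf))
        return -1;
    strcpy(buf, spec);

    for (char *blk = strtok(buf, ","); blk; blk = strtok(NULL, ",")) {
        int start, end, stride;
        int n = sscanf(blk, "%d:%d:%d", &start, &end, &stride);

        if (n < 1)
            return -1;
        if (n < 2)
            end = start;    /* assumed default */
        if (n < 3)
            stride = 1;     /* assumed default */
        if (stride < 1)
            return -1;

        /* A negative value v encodes core_id - num_cores. */
        if (start < 0)
            start += num_cores;
        if (end < 0)
            end += num_cores;

        for (int c = start; c <= end && count < max; c += stride)
            if (c >= 0 && c < num_cores)
                out[count++] = c;
    }
    return count;
}
```

       Under these assumptions, on an 8-core machine the string "0:2,-2:-1"
       selects cores 0, 1, 2, 6, and 7, and "1:7:3" selects cores 1, 4, and
       7.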

       FI_PSM3_INJECT_SIZE
              Maximum message size allowed for fi_inject and fi_tinject
              calls. This is an experimental feature that allows some
              applications to override the default inject size limit. When
              the inject size is larger than the default value, some inject
              calls might block.

       The default setting is 64.

       FI_PSM3_LOCK_LEVEL
              When set, dictates the level of locking used by the provider.
              Level 2 means all locks are enabled. Level 1 disables some
              locks and is suitable for runs that limit the access to each
              PSM3 context to a single thread. Level 0 disables all locks
              and thus is only suitable for single-threaded runs.

       Wait objects and auto progress mode cannot be used with level 0 or
       level 1, because they introduce internal threads that may break the
       conditions needed for these levels.

       The default setting is 2.
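
       The rules for choosing a lock level can be summarized as a small
       decision function. This is a sketch for illustration only: the name
       suggest_lock_level and its three flags are hypothetical, and the
       function merely restates the conditions described above.

```c
#include <stdbool.h>

/*
 * Suggest a value for FI_PSM3_LOCK_LEVEL.
 *   single_threaded:        the whole run uses a single thread
 *   one_thread_per_context: no PSM3 context is accessed by more than one
 *                           thread
 *   internal_threads:       wait objects or auto progress mode are in use
 */
static int suggest_lock_level(bool single_threaded,
                              bool one_thread_per_context,
                              bool internal_threads)
{
    if (internal_threads)
        return 2;   /* levels 0 and 1 cannot be used in this case */
    if (single_threaded)
        return 0;
    if (one_thread_per_context)
        return 1;
    return 2;       /* the default: all locks enabled */
}
```

       When in doubt, level 2 is the safe choice, since it is the only level
       that makes no assumption about how threads use the provider.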

       FI_PSM3_LAZY_CONN
              There are two strategies for when to establish connections
              between the PSM3 endpoints that OFI endpoints are built on top
              of. In eager connection mode, connections are established when
              addresses are inserted into the address vector. In lazy
              connection mode, connections are established when addresses are
              used for the first time in communication. Eager connection mode
              has slightly lower critical path overhead, but lazy connection
              mode scales better.

       This option controls how the two connection modes are used. When set
       to 1, lazy connection mode is always used. When set to 0, eager
       connection mode is used when the required conditions are all met, and
       lazy connection mode is used otherwise. The conditions for eager
       connection mode are: (1) multiple endpoint (and scalable endpoint)
       support is disabled by explicitly setting PSM3_MULTI_EP=0; and (2) the
       address vector type is FI_AV_MAP.

       The default setting is 0.
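
       The selection between the two connection modes can be written as a
       short function. This is a sketch of the rule as described above, not
       the provider's code; the enum conn_mode and the function
       choose_conn_mode are hypothetical names.

```c
#include <stdbool.h>

enum conn_mode { CONN_EAGER, CONN_LAZY };   /* hypothetical names */

/*
 *   lazy_conn:  value of FI_PSM3_LAZY_CONN (0 or 1)
 *   multi_ep:   true unless PSM3_MULTI_EP=0 was set explicitly
 *   av_is_map:  true when the address vector type is FI_AV_MAP
 */
static enum conn_mode choose_conn_mode(int lazy_conn, bool multi_ep,
                                       bool av_is_map)
{
    if (lazy_conn)
        return CONN_LAZY;   /* a setting of 1 always selects lazy mode */

    /* With a setting of 0, eager mode needs both conditions to hold. */
    return (!multi_ep && av_is_map) ? CONN_EAGER : CONN_LAZY;
}
```

       Note that with the default setting of 0, lazy mode is still selected
       unless PSM3_MULTI_EP=0 is set explicitly, because the provider
       otherwise sets PSM3_MULTI_EP to 1.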

       FI_PSM3_DISCONNECT
              The provider has a mechanism to automatically send
              disconnection notifications to all connected peers before the
              local endpoint is closed. In response, the peers call
              psm3_ep_disconnect to clean up the connection state on their
              side. This allows the same PSM3 epid to be used by different
              dynamically started processes (clients) to communicate with the
              same peer (server). This mechanism, however, introduces extra
              overhead in the finalization phase. For applications that never
              reuse epids within the same session, such overhead is
              unnecessary.

       This option controls whether the automatic disconnection notification
       mechanism is enabled. For the client-server applications mentioned
       above, the client side should set this option to 1, but the server
       should set it to 0.

       The default setting is 0.

       FI_PSM3_TAG_LAYOUT
              Select how the 96 bits of the PSM3 tag are organized.
              Currently three choices are available: tag60 means 32-4-60
              partitioning for CQ data, internal protocol flags, and
              application tag. tag64 means 4-28-64 partitioning for internal
              protocol flags, CQ data, and application tag. auto means to
              choose either tag60 or tag64 based on the hints passed to
              fi_getinfo: tag60 is used if remote CQ data support is
              requested explicitly, either by passing a non-zero value via
              hints->domain_attr->cq_data_size or by including
              FI_REMOTE_CQ_DATA in hints->caps; otherwise tag64 is used. If
              tag64 is the result of automatic selection, fi_getinfo also
              returns a second instance of the provider with the tag60
              layout.

       The default setting is auto.

       Notice that if the provider is compiled with the macro
       PSMX3_TAG_LAYOUT defined to 1 (meaning tag60) or 2 (meaning tag64),
       the choice is fixed at compile time and this runtime option is
       disabled.
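
       The two layouts and the auto rule can be captured in a small table
       and a selection function. This is an illustrative sketch: struct
       tag_layout, TAG60, TAG64, and auto_layout are hypothetical names, and
       only the field widths are recorded, since the position of each field
       within the 96 bits is not specified here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Field widths, in bits, of one 96-bit tag layout. */
struct tag_layout {
    const char *name;
    int cq_data_bits;
    int flag_bits;      /* internal protocol flags */
    int app_tag_bits;   /* application tag */
};

static const struct tag_layout TAG60 = { "tag60", 32, 4, 60 };
static const struct tag_layout TAG64 = { "tag64", 28, 4, 64 };

/*
 * Mirror of the "auto" rule: tag60 when remote CQ data support is requested
 * explicitly, tag64 otherwise.
 *   cq_data_size:         hints->domain_attr->cq_data_size
 *   wants_remote_cq_data: FI_REMOTE_CQ_DATA is present in hints->caps
 */
static const struct tag_layout *auto_layout(size_t cq_data_size,
                                            bool wants_remote_cq_data)
{
    return (cq_data_size != 0 || wants_remote_cq_data) ? &TAG60 : &TAG64;
}
```

       In both layouts the three widths add up to 96, so the choice is a
       trade-off between four extra bits of CQ data (tag60) and four extra
       bits of application tag (tag64).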

SEE ALSO
       fabric(7), fi_provider(7), fi_psm(7), fi_psm2(7)

AUTHORS
       OpenFabrics.

Libfabric Programmer's Manual     2021-02-10                      fi_psm3(7)