fi_psm2(7)                  Libfabric v1.17.0                  fi_psm2(7)

NAME
fi_psm2 - The PSM2 Fabric Provider

OVERVIEW
The psm2 provider runs over the PSM 2.x interface that is supported by
the Intel Omni-Path Fabric. PSM 2.x has all the PSM 1.x features plus
a set of new functions with enhanced capabilities. Since PSM 1.x and
PSM 2.x are not ABI compatible, the psm2 provider only works with PSM
2.x and doesn’t support Intel TrueScale Fabric.

LIMITATIONS
The psm2 provider doesn’t support all the features defined in the
libfabric API. Here are some of the limitations:

Endpoint types
Only the non-connection-based types FI_EP_DGRAM and FI_EP_RDM are
supported.

Endpoint capabilities
Endpoints can support any combination of the data transfer
capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA. These
capabilities can be further refined by FI_SEND, FI_RECV, FI_READ,
FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit the direction
of operations.

FI_MULTI_RECV is supported for the non-tagged message queue only.

Scalable endpoints are supported if the underlying PSM2 library
supports multiple endpoints. This condition must be satisfied both when
the provider is built and when the provider is used. See the Scalable
endpoints section for more information.

Other supported capabilities include FI_TRIGGER, FI_REMOTE_CQ_DATA,
FI_RMA_EVENT, FI_SOURCE, and FI_SOURCE_ERR. Furthermore,
FI_NAMED_RX_CTX is supported when scalable endpoints are enabled.

Modes
FI_CONTEXT is required for the FI_TAGGED and FI_MSG capabilities.
That is, any request belonging to these two categories that generates
a completion must pass, as the operation context, a valid pointer to
type struct fi_context, and the space referenced by the pointer must
remain untouched until the request has completed. If neither
FI_TAGGED nor FI_MSG is requested, the FI_CONTEXT mode is not
required.

Progress
The psm2 provider requires manual progress. The application is
expected to call fi_cq_read or fi_cntr_read from time to time when no
other libfabric function is called, to ensure that progress is made in
a timely manner. The provider also supports auto progress mode;
however, performance can be significantly impacted if the application
depends purely on the provider to make auto progress.

Scalable endpoints
Scalable endpoint support depends on the multi-EP feature of the PSM2
library. If the PSM2 library supports this feature, its availability
is further controlled by the environment variable PSM2_MULTI_EP. The
psm2 provider automatically sets this variable to 1 if it is not set.
The feature can be disabled explicitly by setting PSM2_MULTI_EP to 0.

When creating a scalable endpoint, the exact number of contexts
requested should be set in the “fi_info” structure passed to the
fi_scalable_ep function. This number should be set in
“fi_info->ep_attr->tx_ctx_cnt”, “fi_info->ep_attr->rx_ctx_cnt”, or
both; if the two differ, the greater value is used. The psm2 provider
allocates all requested contexts up front when the scalable endpoint
is created. The same context is used for both Tx and Rx.

For optimal performance, avoid having multiple threads access the
same context, either directly by posting send/recv/read/write
requests, or indirectly by polling the associated completion queues
or counters.

Using the scalable endpoint as a whole in communication functions is
not supported. Instead, the individual tx or rx contexts of the
scalable endpoint should be used. Similarly, using the address of the
scalable endpoint as the source or destination address does not
collectively address all the tx/rx contexts; it addresses only the
first tx/rx context.

Shared Tx contexts
To save PSM contexts by using a shared Tx context, the endpoints
bound to the shared Tx context need to be Tx only. The reason is that
Rx capability always requires a PSM context, which can also be used
automatically for Tx. As a result, allocating a shared Tx context for
Rx-capable endpoints actually consumes one extra context instead of
saving any.

Unsupported features
These features are unsupported: connection management, passive
endpoints, and shared receive contexts.

RUNTIME PARAMETERS
The psm2 provider checks for the following environment variables:

FI_PSM2_UUID
PSM requires that each job have a unique ID (UUID). All the processes
in the same job need to use the same UUID in order to be able to talk
to each other. The PSM reference manual advises keeping the UUID
unique to each job. In practice, it generally works fine to reuse a
UUID as long as (1) no two jobs with the same UUID are running at the
same time; and (2) previous jobs with the same UUID have exited
normally. If “resource busy” or “connection failure” issues occur for
no apparent reason, it is advisable to manually set the UUID to a
value different from the default.

The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.

It is possible to create endpoints with a UUID different from the one
set here. To do so, set `info->ep_attr->auth_key' to the UUID value
and `info->ep_attr->auth_key_size' to its size (16 bytes) when calling
fi_endpoint() or fi_scalable_ep(). It is still true that an endpoint
can only communicate with endpoints that have the same UUID.
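
The auth_key field takes the UUID as 16 raw bytes, while the UUID is
usually written as text. The following is a minimal sketch of that
conversion. The helper name psm2_uuid_to_bytes is hypothetical (it is
not part of libfabric), and the sketch assumes that the hex digits map
to bytes in the order they are written; the provider's own
interpretation of the key bytes is not stated here.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper (not a libfabric call): convert a textual UUID such
 * as "00FF00FF-0000-0000-0000-00FF0F0F00FF" into 16 raw bytes.
 * Dashes are skipped wherever they appear. Returns 0 on success, or -1 if
 * the string does not contain exactly 32 hex digits.
 * Assumption: digits map to bytes in written order. */
int psm2_uuid_to_bytes(const char *text, uint8_t out[16])
{
    size_t digits = 0;

    for (; *text != '\0'; text++) {
        int v;

        if (*text == '-')
            continue;                              /* group separator */
        v = hex_value(*text);
        if (v < 0 || digits >= 32)
            return -1;                             /* bad digit or too long */
        if (digits % 2 == 0)
            out[digits / 2] = (uint8_t)(v << 4);   /* high nibble */
        else
            out[digits / 2] |= (uint8_t)v;         /* low nibble */
        digits++;
    }
    return digits == 32 ? 0 : -1;
}
```

The resulting 16-byte buffer would then be assigned to
info->ep_attr->auth_key, with info->ep_attr->auth_key_size set to 16,
before calling fi_endpoint() or fi_scalable_ep().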

FI_PSM2_NAME_SERVER
The psm2 provider has a simple built-in name server that can be used
to resolve an IP address or host name into a transport address needed
by the fi_av_insert call. The main purpose of this name server is to
allow simple client-server type applications (such as those in
fabtests) to be written purely with libfabric, without using any
out-of-band communication mechanism. For such applications, the
server would run first to allow endpoints to be created and
registered with the name server, and then the client would call
fi_getinfo with the node parameter set to the IP address or host name
of the server. The resulting fi_info structure would have the
transport address of the endpoint created by the server in the
dest_addr field. Optionally, the service parameter can be used in
addition to node. Notice that the service number is interpreted by
the provider and is not a TCP/IP port number.

The name server is on by default. It can be turned off by setting the
variable to 0. This may save a small amount of resources, since a
separate thread is created when the name server is on.

The provider detects OpenMPI and MPICH runs and changes the default
setting to off.

FI_PSM2_TAGGED_RMA
The RMA functions are implemented on top of the PSM Active Message
functions. The Active Message functions have a limit on the size of
data that can be transferred in a single message. Large transfers can
be divided into small chunks and pipelined. However, the bandwidth
achieved this way is sub-optimal.

The psm2 provider uses the PSM tag-matching message queue functions to
achieve higher bandwidth for large RMA transfers. It takes advantage
of the extra tag bits available in PSM2 to separate the RMA traffic
from the regular tagged message queue.

The option is on by default. To turn it off, set the variable to 0.

FI_PSM2_DELAY
Time (seconds) to sleep before closing PSM endpoints. This is a
workaround for a bug in some versions of the PSM library.

The default setting is 0.

FI_PSM2_TIMEOUT
Timeout (seconds) for gracefully closing PSM endpoints. A forced
close will be issued if the timeout expires.

The default setting is 5.

FI_PSM2_CONN_TIMEOUT
Timeout (seconds) for establishing a connection between two PSM
endpoints.

The default setting is 5.

FI_PSM2_PROG_INTERVAL
When auto progress is enabled (requested via the hints to
fi_getinfo), a progress thread is created to make progress calls from
time to time. This option sets the interval (microseconds) between
progress calls.

The default setting is 1 if affinity is set, or 1000 if not. See
FI_PSM2_PROG_AFFINITY.

FI_PSM2_PROG_AFFINITY
When set, specifies the set of CPU cores to which the progress thread
is bound. The format is
<start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*, where each
triplet <start>:<end>:<stride> defines a block of core_ids. Both
<start> and <end> can be either the core_id (when >=0) or core_id -
num_cores (when <0).

By default affinity is not set.
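
To make the affinity format concrete, here is a minimal sketch of a
parser for it. It is not the provider's implementation. The function
name parse_prog_affinity is hypothetical, and the sketch assumes that
an omitted <end> means a single core (<end> equals <start>), that an
omitted <stride> means 1, and that out-of-range or reversed blocks are
rejected; none of these details are specified above.

```c
#include <stdlib.h>
#include <string.h>

/* Read one signed decimal field and advance *p past it. */
static int parse_field(const char **p, long *value)
{
    char *end;

    *value = strtol(*p, &end, 10);
    if (end == *p)
        return -1;                  /* no digits consumed */
    *p = end;
    return 0;
}

/* Hypothetical helper: set selected[i] = 1 for every core named by an
 * FI_PSM2_PROG_AFFINITY-style string. Returns 0 on success, -1 on a
 * malformed or out-of-range specification. */
int parse_prog_affinity(const char *spec, int num_cores,
                        unsigned char *selected)
{
    const char *p = spec;

    if (num_cores <= 0 || *p == '\0')
        return -1;
    memset(selected, 0, (size_t)num_cores);

    for (;;) {
        long start, end, stride = 1;   /* assumption: stride defaults to 1 */

        if (parse_field(&p, &start))
            return -1;
        end = start;                   /* assumption: <end> defaults to <start> */
        if (*p == ':') {
            p++;
            if (parse_field(&p, &end))
                return -1;
            if (*p == ':') {
                p++;
                if (parse_field(&p, &stride))
                    return -1;
            }
        }
        /* A negative value is given as core_id - num_cores. */
        if (start < 0)
            start += num_cores;
        if (end < 0)
            end += num_cores;
        if (start < 0 || end >= num_cores || start > end || stride <= 0)
            return -1;
        if (stride > num_cores)
            stride = num_cores;        /* keeps the loop counter in range */
        for (long core = start; core <= end; core += stride)
            selected[core] = 1;

        if (*p == '\0')
            return 0;
        if (*p != ',')
            return -1;
        p++;
    }
}
```

With 8 cores, the string 0:3:2,-1 selects cores 0, 2, and 7 under these
assumptions, and 4:-2 selects cores 4 through 6.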

FI_PSM2_INJECT_SIZE
Maximum message size allowed for fi_inject and fi_tinject calls. This
is an experimental feature that allows some applications to override
the default inject size limit. When the inject size is larger than
the default value, some inject calls might block.

The default setting is 64.

FI_PSM2_LOCK_LEVEL
When set, dictates the level of locking used by the provider. Level 2
means all locks are enabled. Level 1 disables some locks and is
suitable for runs that limit the access to each PSM2 context to a
single thread. Level 0 disables all locks and thus is only suitable
for single-threaded runs.

Levels 0 and 1 cannot be combined with wait objects or auto progress
mode, because those introduce internal threads that may break the
conditions needed for these levels.

The default setting is 2.
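
The rules for choosing a lock level can be summarized as a small
decision function. This is only an illustration of the text above; the
function pick_lock_level and its parameters are hypothetical and are
not part of the provider.

```c
/* Hypothetical helper: return the lowest FI_PSM2_LOCK_LEVEL value that
 * the rules above permit for a given application design.
 *   num_threads            - application threads that call libfabric
 *   one_thread_per_context - nonzero if each PSM2 context is only ever
 *                            accessed by a single thread
 *   uses_wait_or_auto_prog - nonzero if wait objects or auto progress
 *                            mode are used */
int pick_lock_level(int num_threads, int one_thread_per_context,
                    int uses_wait_or_auto_prog)
{
    if (uses_wait_or_auto_prog)
        return 2;   /* internal threads rule out levels 0 and 1 */
    if (num_threads <= 1)
        return 0;   /* single-threaded run: no locks needed */
    if (one_thread_per_context)
        return 1;   /* each context is private to one thread */
    return 2;       /* contexts are shared between threads */
}
```

The chosen value would then be exported as FI_PSM2_LOCK_LEVEL in the
environment of the job.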

FI_PSM2_LAZY_CONN
There are two strategies for when to establish connections between
the PSM2 endpoints that OFI endpoints are built on top of. In eager
connection mode, connections are established when addresses are
inserted into the address vector. In lazy connection mode,
connections are established when addresses are used for the first
time in communication. Eager connection mode has slightly lower
critical path overhead, but lazy connection mode scales better.

This option controls how the two connection modes are used. When set
to 1, lazy connection mode is always used. When set to 0, eager
connection mode is used when the required conditions are all met, and
lazy connection mode is used otherwise. The conditions for eager
connection mode are: (1) multiple endpoint (and scalable endpoint)
support is disabled by explicitly setting PSM2_MULTI_EP=0; and (2)
the address vector type is FI_AV_MAP.

The default setting is 0.
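
The selection rule can be written out as a short function. This is a
sketch of the rule as described, not the provider's code; the enum and
function names are hypothetical stand-ins, with AV_KIND_MAP and
AV_KIND_TABLE standing in for FI_AV_MAP and FI_AV_TABLE so that the
example does not need the libfabric headers.

```c
enum conn_mode { CONN_EAGER, CONN_LAZY };

/* Stand-ins for the libfabric address vector types. */
enum av_kind { AV_KIND_MAP, AV_KIND_TABLE };

/* Illustrative only: which connection mode results from the value of
 * FI_PSM2_LAZY_CONN, the effective PSM2_MULTI_EP setting, and the
 * address vector type. */
enum conn_mode choose_conn_mode(int lazy_conn, int multi_ep_enabled,
                                enum av_kind av)
{
    if (lazy_conn)
        return CONN_LAZY;            /* 1 always forces lazy mode */
    if (!multi_ep_enabled && av == AV_KIND_MAP)
        return CONN_EAGER;           /* both eager conditions are met */
    return CONN_LAZY;                /* fall back to lazy otherwise */
}
```

Because the provider sets PSM2_MULTI_EP to 1 when the variable is
unset, the default configuration ends up in lazy connection mode even
though FI_PSM2_LAZY_CONN defaults to 0.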

FI_PSM2_DISCONNECT
The provider has a mechanism to automatically send disconnection
notifications to all connected peers before the local endpoint is
closed. In response, the peers call psm2_ep_disconnect to clean up
the connection state on their side. This allows the same PSM2 epid to
be used by different dynamically started processes (clients) to
communicate with the same peer (server). This mechanism, however,
introduces extra overhead in the finalization phase. For applications
that never reuse epids within the same session, such overhead is
unnecessary.

This option controls whether the automatic disconnection notification
mechanism should be enabled. For the client-server application
mentioned above, the client side should set this option to 1, but the
server should set it to 0.

The default setting is 0.

FI_PSM2_TAG_LAYOUT
Selects how the 96 PSM2 tag bits are organized. Currently three
choices are available: tag60 means 32-4-60 partitioning for CQ data,
internal protocol flags, and application tag. tag64 means 4-28-64
partitioning for internal protocol flags, CQ data, and application
tag. auto means to choose either tag60 or tag64 based on the hints
passed to fi_getinfo: tag60 is used if remote CQ data support is
requested explicitly, either by passing a non-zero value via
hints->domain_attr->cq_data_size or by including FI_REMOTE_CQ_DATA in
hints->caps; otherwise tag64 is used. If tag64 is the result of
automatic selection, fi_getinfo also returns a second instance of the
provider with the tag60 layout.

The default setting is auto.

Notice that if the provider is compiled with the macro
PSMX2_TAG_LAYOUT defined to 1 (meaning tag60) or 2 (meaning tag64),
the choice is fixed at compile time and this runtime option is
disabled.
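
The two partitionings can be illustrated by packing the three fields
into 96 bits. This sketch is not the provider's in-memory
representation: the struct tag96 type and both functions are
hypothetical, and the sketch assumes the fields are laid out from most
significant to least significant bit in the order listed above.

```c
#include <stdint.h>

/* Illustrative 96-bit tag, split into an upper 32-bit part and a lower
 * 64-bit part. Assumption: not the provider's actual layout in memory. */
struct tag96 {
    uint32_t hi;   /* most significant 32 bits */
    uint64_t lo;   /* least significant 64 bits */
};

/* tag60: 32 bits CQ data | 4 bits flags | 60 bits application tag */
struct tag96 pack_tag60(uint32_t cq_data, unsigned flags, uint64_t app_tag)
{
    struct tag96 t;

    t.hi = cq_data;
    t.lo = ((uint64_t)(flags & 0xFu) << 60) |
           (app_tag & ((UINT64_C(1) << 60) - 1));   /* keep low 60 bits */
    return t;
}

/* tag64: 4 bits flags | 28 bits CQ data | 64 bits application tag */
struct tag96 pack_tag64(uint32_t cq_data, unsigned flags, uint64_t app_tag)
{
    struct tag96 t;

    t.hi = ((uint32_t)(flags & 0xFu) << 28) |
           (cq_data & 0x0FFFFFFFu);                 /* keep low 28 bits */
    t.lo = app_tag;
    return t;
}
```

The arithmetic shows the trade-off between the two layouts: tag60
carries full 32-bit CQ data but only 60 bits of application tag, while
tag64 carries a full 64-bit application tag but only 28 bits of CQ
data.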

PSM2 EXTENSIONS
The psm2 provider supports limited low-level parameter setting
through the fi_set_val() and fi_get_val() functions. Currently the
following parameters can be set via the domain fid:

• FI_PSM2_DISCONNECT
Overrides the global runtime parameter FI_PSM2_DISCONNECT for this
domain. See the RUNTIME PARAMETERS section for details.

Valid parameter names are defined in the header file
rdma/fi_ext_psm2.h.

SEE ALSO
fabric(7), fi_provider(7), fi_psm(7), fi_psm3(7)

AUTHORS
OpenFabrics.

Libfabric Programmer’s Manual      2022-12-11      fi_psm2(7)