fi_psm3(7) Libfabric v1.17.0 fi_psm3(7)

fi_psm3 - The PSM3 Fabric Provider

The psm3 provider implements a Performance Scaled Messaging capability
which supports most verbs UD and sockets devices. Additional features
and optimizations can be enabled when running over Intel’s E810
Ethernet NICs and/or using Intel’s rendezvous kernel module (rv).
PSM 3.x fully integrates the OFI provider and the underlying PSM3
protocols/implementation and only exports the OFI APIs.

The psm3 provider supports a subset of all the features defined in the
libfabric API.

Endpoint types
Supports the non-connection-based types FI_DGRAM and FI_RDM.

Endpoint capabilities
Endpoints can support any combination of the data transfer
capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA. These
capabilities can be further refined by FI_SEND, FI_RECV, FI_READ,
FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit the direction
of operations.

FI_MULTI_RECV is supported for the non-tagged message queue only.

Scalable endpoints are supported if the underlying PSM3 library
supports multiple endpoints. This condition must be satisfied both when
the provider is built and when the provider is used. See the Scalable
endpoints section for more information.

Other supported capabilities include FI_TRIGGER, FI_REMOTE_CQ_DATA,
FI_RMA_EVENT, FI_SOURCE, and FI_SOURCE_ERR. Furthermore,
FI_NAMED_RX_CTX is supported when scalable endpoints are enabled.

Modes
FI_CONTEXT is required for the FI_TAGGED and FI_MSG capabilities. That
is, any request in these two categories that generates a completion
must pass a valid pointer to a struct fi_context as the operation
context, and the space referenced by the pointer must remain untouched
until the request has completed. If neither FI_TAGGED nor FI_MSG is
requested, the FI_CONTEXT mode is not required.

Progress
The psm3 provider performs best with manual progress. By default, the
application is expected to call the fi_cq_read or fi_cntr_read function
from time to time when no other libfabric function is called, to ensure
that progress is made in a timely manner. The provider does support
auto progress mode. However, performance can be significantly impacted
if the application depends purely on the provider to make auto
progress.

Scalable endpoints
Scalable endpoint support depends on the multi-EP feature of the PSM3
library. If the PSM3 library supports this feature, its availability is
further controlled by the environment variable PSM3_MULTI_EP. The psm3
provider automatically sets this variable to 1 if it is not set. The
feature can be disabled explicitly by setting PSM3_MULTI_EP to 0.

When creating a scalable endpoint, the exact number of contexts
requested should be set in the “fi_info” structure passed to the
fi_scalable_ep function. This number should be set in
“fi_info->ep_attr->tx_ctx_cnt” or “fi_info->ep_attr->rx_ctx_cnt” or
both; if both are set, the greater value is used. The psm3 provider
allocates all requested contexts upfront when the scalable endpoint is
created. The same context is used for both Tx and Rx.
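
The rule that the greater of the two counts is used can be sketched in
a few lines of C. This is only an illustration: struct ep_attr_sketch
and psm3_ctx_count_sketch are hypothetical stand-ins invented here, not
libfabric types or provider functions; real code would fill in the
tx_ctx_cnt and rx_ctx_cnt fields of fi_info->ep_attr.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two ep_attr fields named in the text. */
struct ep_attr_sketch {
    size_t tx_ctx_cnt;
    size_t rx_ctx_cnt;
};

/* Number of contexts that would be allocated up front: the greater of
 * the two counts, because one context serves both Tx and Rx. */
static size_t psm3_ctx_count_sketch(const struct ep_attr_sketch *attr)
{
    assert(attr != NULL);
    return attr->tx_ctx_cnt > attr->rx_ctx_cnt ? attr->tx_ctx_cnt
                                               : attr->rx_ctx_cnt;
}
```

For example, requesting 4 tx contexts and 2 rx contexts yields 4
contexts under this rule.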

For optimal performance, avoid having multiple threads access the same
context, either directly by posting send/recv/read/write requests or
indirectly by polling the associated completion queues or counters.

Using the scalable endpoint as a whole in communication functions is
not supported. Instead, the individual tx or rx contexts of the
scalable endpoint should be used. Similarly, using the address of the
scalable endpoint as the source or destination address does not
collectively address all the tx/rx contexts; it addresses only the
first tx/rx context.

The psm3 provider doesn’t support all the features defined in the
libfabric API. Here are some of the limitations not listed above:

Unsupported features
These features are unsupported: connection management, passive
endpoint, and shared receive context.

The psm3 provider checks for the following environment variables:

FI_PSM3_UUID
PSM requires that each job has a unique ID (UUID). All the processes in
the same job need to use the same UUID in order to be able to talk to
each other. The PSM reference manual advises keeping the UUID unique to
each job. In practice, it generally works fine to reuse a UUID as long
as (1) no two jobs with the same UUID are running at the same time; and
(2) previous jobs with the same UUID have exited normally. If you run
into “resource busy” or “connection failure” issues with no known
cause, it is advisable to manually set the UUID to a value different
from the default.

The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.

It is possible to create endpoints with a UUID different from the one
set here. To do so, set “info->ep_attr->auth_key” to the UUID value and
“info->ep_attr->auth_key_size” to its size (16 bytes) when calling
fi_endpoint() or fi_scalable_ep(). It is still true that an endpoint
can only communicate with endpoints that have the same UUID.
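
Since the auth_key is a 16-byte value while the UUID is usually written
as a 36-character string, an application needs some conversion step.
The following is a minimal sketch of one, assuming the bytes are taken
in the order they appear in the text; whether the provider uses the
same byte order internally is an assumption, not something this page
states. The helper names hex_val and uuid_to_bytes are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value; returns -1 for a non-hex character. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Parse "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" into 16 bytes, in textual
 * order. Returns 0 on success and -1 on malformed input. */
static int uuid_to_bytes(const char *s, uint8_t out[16])
{
    size_t n = 0;

    assert(s != NULL && out != NULL);
    if (strlen(s) != 36)
        return -1;
    for (size_t i = 0; i < 36; ) {
        if (i == 8 || i == 13 || i == 18 || i == 23) {
            if (s[i] != '-')        /* dashes must sit at fixed offsets */
                return -1;
            i++;
            continue;
        }
        int hi = hex_val(s[i]);
        int lo = hex_val(s[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[n++] = (uint8_t)((hi << 4) | lo);
        i += 2;
    }
    return n == 16 ? 0 : -1;
}
```

The resulting 16-byte buffer is what would be passed as the auth_key,
with auth_key_size set to 16 as described above.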

FI_PSM3_NAME_SERVER
The psm3 provider has a simple built-in name server that can be used to
resolve an IP address or host name into a transport address needed by
the fi_av_insert call. The main purpose of this name server is to allow
simple client-server type applications (such as those in fabtests) to
be written purely with libfabric, without using any out-of-band
communication mechanism. For such applications, the server would run
first to allow endpoints to be created and registered with the name
server, and then the client would call fi_getinfo with the node
parameter set to the IP address or host name of the server. The
resulting fi_info structure would have the transport address of the
endpoint created by the server in the dest_addr field. Optionally, the
service parameter can be used in addition to node. Notice that the
service number is interpreted by the provider and is not a TCP/IP port
number.

The name server is on by default. It can be turned off by setting the
variable to 0. This may save a small amount of resources, since a
separate thread is created when the name server is on.

The provider detects OpenMPI and MPICH runs and changes the default
setting to off.

FI_PSM3_TAGGED_RMA
The RMA functions are implemented on top of the PSM Active Message
functions. The Active Message functions have a limit on the size of
data that can be transferred in a single message. Large transfers can
be divided into small chunks and pipelined. However, the bandwidth
achieved this way is sub-optimal.

The psm3 provider uses the PSM tag-matching message queue functions to
achieve higher bandwidth for large RMA transfers. It takes advantage of
the extra tag bits available in PSM3 to separate the RMA traffic from
the regular tagged message queue.

The option is on by default. To turn it off, set the variable to 0.

FI_PSM3_DELAY
Time (seconds) to sleep before closing PSM endpoints. This is a
workaround for a bug in some versions of the PSM library.

The default setting is 0.

FI_PSM3_TIMEOUT
Timeout (seconds) for gracefully closing PSM endpoints. A forced close
is issued if the timeout expires.

The default setting is 5.

FI_PSM3_CONN_TIMEOUT
Timeout (seconds) for establishing a connection between two PSM
endpoints.

The default setting is 5.

FI_PSM3_PROG_INTERVAL
When auto progress is enabled (requested via the hints to fi_getinfo),
a progress thread is created to make progress calls from time to time.
This option sets the interval (microseconds) between progress calls.

The default setting is 1 if affinity is set, or 1000 if not. See
FI_PSM3_PROG_AFFINITY.
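
The dependency of the default interval on the affinity setting can be
illustrated with a short C sketch. It mirrors only the rule stated
above; env_long and prog_interval_us are hypothetical helpers, and the
handling of empty or malformed values (falling back to the default) is
an assumption rather than documented provider behavior.

```c
#include <assert.h>
#include <stdlib.h>

/* Read an integer environment variable, falling back to def when the
 * variable is unset, empty, or not a clean decimal number (assumption). */
static long env_long(const char *name, long def)
{
    const char *val;
    char *end;
    long v;

    assert(name != NULL);
    val = getenv(name);
    if (val == NULL || *val == '\0')
        return def;
    v = strtol(val, &end, 10);
    return *end == '\0' ? v : def;
}

/* Effective progress interval in microseconds: an explicit
 * FI_PSM3_PROG_INTERVAL wins; otherwise 1 if affinity is set, else 1000. */
static long prog_interval_us(void)
{
    const char *aff = getenv("FI_PSM3_PROG_AFFINITY");
    long def = (aff != NULL && *aff != '\0') ? 1 : 1000;

    return env_long("FI_PSM3_PROG_INTERVAL", def);
}
```

With neither variable set, the sketch yields 1000; setting only
FI_PSM3_PROG_AFFINITY drops it to 1.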

FI_PSM3_PROG_AFFINITY
When set, specifies the set of CPU cores to which the progress thread
affinity is set. The format is
<start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*, where each
triplet <start>:<end>:<stride> defines a block of core_ids. Both
<start> and <end> can be either the core_id (when >=0) or
core_id - num_cores (when <0).

By default affinity is not set.
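
To make the affinity format concrete, here is a minimal C sketch that
expands such a string into a list of core ids. It is not the provider's
parser. It assumes that an omitted <end> equals <start>, that an omitted
<stride> is 1, and that ids outside the valid range are skipped; the
names norm_core and expand_affinity are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Map a possibly negative value to a core id: a negative value v stands
 * for core_id - num_cores, so the core id is v + num_cores. */
static int norm_core(long v, int num_cores)
{
    return (int)(v < 0 ? v + num_cores : v);
}

/* Expand spec into out[] (capacity cap). Returns the number of core ids
 * written, or -1 on a syntax error. */
static int expand_affinity(const char *spec, int num_cores, int *out, int cap)
{
    const char *p = spec;
    int n = 0;

    assert(spec != NULL && out != NULL && num_cores > 0);
    while (*p != '\0') {
        char *end;
        long start, stop, stride = 1;

        start = strtol(p, &end, 10);
        if (end == p)
            return -1;
        stop = start;                 /* assumption: <end> defaults to <start> */
        p = end;
        if (*p == ':') {
            stop = strtol(p + 1, &end, 10);
            if (end == p + 1)
                return -1;
            p = end;
            if (*p == ':') {
                stride = strtol(p + 1, &end, 10);
                if (end == p + 1 || stride <= 0)
                    return -1;
                p = end;
            }
        }
        int first = norm_core(start, num_cores);
        int last = norm_core(stop, num_cores);
        for (int c = first; c <= last; c += (int)stride)
            if (c >= 0 && c < num_cores && n < cap)
                out[n++] = c;         /* skip ids outside [0, num_cores) */
        if (*p == ',')
            p++;
        else if (*p != '\0')
            return -1;
    }
    return n;
}
```

Under these assumptions, "0:6:2,-1" on an 8-core machine expands to
cores 0, 2, 4, 6, and 7, and "-4:-1" expands to cores 4 through 7.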

FI_PSM3_INJECT_SIZE
Maximum message size allowed for fi_inject and fi_tinject calls. This
is an experimental feature that allows some applications to override
the default inject size limitation. When the inject size is larger than
the default value, some inject calls might block.

The default setting is 64.

FI_PSM3_LOCK_LEVEL
When set, dictates the level of locking used by the provider. Level 2
means all locks are enabled. Level 1 disables some locks and is
suitable for runs that limit the access to each PSM3 context to a
single thread. Level 0 disables all locks and thus is only suitable for
single-threaded runs.

Wait objects and auto progress mode cannot be used with level 0 or
level 1, because they introduce internal threads that may break the
conditions needed for these levels.

The default setting is 2.

FI_PSM3_LAZY_CONN
There are two strategies for when to establish connections between the
PSM3 endpoints that OFI endpoints are built on top of. In eager
connection mode, connections are established when addresses are
inserted into the address vector. In lazy connection mode, connections
are established when addresses are first used in communication. Eager
connection mode has slightly lower critical path overhead, but lazy
connection mode scales better.

This option controls how the two connection modes are used. When set to
1, lazy connection mode is always used. When set to 0, eager connection
mode is used when all the required conditions are met, and lazy
connection mode is used otherwise. The conditions for eager connection
mode are: (1) multiple endpoint (and scalable endpoint) support is
disabled by explicitly setting PSM3_MULTI_EP=0; and (2) the address
vector type is FI_AV_MAP.

The default setting is 0.
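
The selection between the two connection modes can be written as a
single predicate. The C sketch below restates the conditions above and
nothing more; enum av_type_sketch, env_is, and uses_lazy_conn are
hypothetical names, and the enum merely stands in for the address
vector type (FI_AV_MAP versus anything else).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the address vector type. */
enum av_type_sketch { AV_MAP_SKETCH, AV_TABLE_SKETCH };

/* True if the environment variable exists and equals value exactly. */
static bool env_is(const char *name, const char *value)
{
    const char *v;

    assert(name != NULL && value != NULL);
    v = getenv(name);
    return v != NULL && strcmp(v, value) == 0;
}

/* Lazy mode is used when FI_PSM3_LAZY_CONN=1, or when either condition
 * for eager mode (PSM3_MULTI_EP=0 and an FI_AV_MAP address vector) fails. */
static bool uses_lazy_conn(enum av_type_sketch av)
{
    bool multi_ep_off;

    if (env_is("FI_PSM3_LAZY_CONN", "1"))
        return true;
    multi_ep_off = env_is("PSM3_MULTI_EP", "0");
    return !(multi_ep_off && av == AV_MAP_SKETCH);
}
```

Note that with the defaults (FI_PSM3_LAZY_CONN unset and PSM3_MULTI_EP
not explicitly 0), the predicate selects lazy connection mode.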

FI_PSM3_DISCONNECT
The provider has a mechanism to automatically send disconnection
notifications to all connected peers before the local endpoint is
closed. In response, the peers call psm3_ep_disconnect to clean up the
connection state on their side. This allows the same PSM3 epid to be
used by different dynamically started processes (clients) to
communicate with the same peer (server). This mechanism, however,
introduces extra overhead in the finalization phase. For applications
that never reuse epids within the same session, such overhead is
unnecessary.

This option controls whether the automatic disconnection notification
mechanism is enabled. For the client-server applications mentioned
above, the client side should set this option to 1, but the server
should set it to 0.

The default setting is 0.

FI_PSM3_TAG_LAYOUT
Selects how the 96 PSM3 tag bits are organized. Currently three choices
are available: tag60 means 32-4-60 partitioning for CQ data, internal
protocol flags, and application tag. tag64 means 4-28-64 partitioning
for internal protocol flags, CQ data, and application tag. auto means
to choose either tag60 or tag64 based on the hints passed to
fi_getinfo: tag60 is used if remote CQ data support is requested
explicitly, either by passing a non-zero value via
hints->domain_attr->cq_data_size or by including FI_REMOTE_CQ_DATA in
hints->caps; otherwise tag64 is used. If tag64 is the result of
automatic selection, fi_getinfo also returns a second instance of the
provider with the tag60 layout.

The default setting is auto.

Notice that if the provider is compiled with the macro PSMX3_TAG_LAYOUT
defined to 1 (meaning tag60) or 2 (meaning tag64), the choice is fixed
at compile time and this runtime option is disabled.
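
The two partitionings can be pictured with a short C sketch that packs
the three fields into 96 bits. It assumes the fields are laid out from
most significant to least significant in the order listed above; the
actual bit positions used by the provider are not stated on this page,
so struct tag96_sketch, pack_tag60, and pack_tag64 are illustrative
only.

```c
#include <assert.h>
#include <stdint.h>

/* 96-bit tag held as a 32-bit high part and a 64-bit low part (sketch). */
struct tag96_sketch {
    uint32_t hi;
    uint64_t lo;
};

/* tag60 (32-4-60): hi holds 32 bits of CQ data; lo holds 4 flag bits
 * on top of a 60-bit application tag. */
static struct tag96_sketch pack_tag60(uint32_t cq_data, unsigned flags,
                                      uint64_t app_tag)
{
    struct tag96_sketch t;

    assert(flags < 16 && (app_tag >> 60) == 0);
    t.hi = cq_data;
    t.lo = ((uint64_t)flags << 60) | app_tag;
    return t;
}

/* tag64 (4-28-64): hi holds 4 flag bits on top of 28 bits of CQ data;
 * lo holds the full 64-bit application tag. */
static struct tag96_sketch pack_tag64(unsigned flags, uint32_t cq_data,
                                      uint64_t app_tag)
{
    struct tag96_sketch t;

    assert(flags < 16 && (cq_data >> 28) == 0);
    t.hi = ((uint32_t)flags << 28) | cq_data;
    t.lo = app_tag;
    return t;
}
```

The widths show the trade-off between the layouts: tag60 leaves room
for 32 bits of CQ data but only 60 bits of application tag, while tag64
keeps all 64 tag bits and limits CQ data to 28 bits.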

fabric(7), fi_provider(7), fi_psm(7), fi_psm2(7)

OpenFabrics.

Libfabric Programmer’s Manual 2022-12-11 fi_psm3(7)