fi_psm2(7)                     Libfabric v1.17.0                    fi_psm2(7)

NAME

       fi_psm2 - The PSM2 Fabric Provider

OVERVIEW

       The psm2 provider runs over the PSM 2.x interface that is supported by
       the Intel Omni-Path Fabric.  PSM 2.x has all the PSM 1.x features plus
       a set of new functions with enhanced capabilities.  Because PSM 1.x and
       PSM 2.x are not ABI compatible, the psm2 provider works only with PSM
       2.x and doesn’t support the Intel TrueScale Fabric.

LIMITATIONS

       The psm2 provider doesn’t support all the features defined in the
       libfabric API.  Here are some of the limitations:

       Endpoint types
              Only the connectionless endpoint types FI_DGRAM and FI_RDM
              are supported.

       Endpoint capabilities
              Endpoints can support any combination of the data transfer
              capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA.  These
              capabilities can be further refined by FI_SEND, FI_RECV,
              FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to
              limit the direction of operations.

              FI_MULTI_RECV is supported for the non-tagged message queue
              only.

              Scalable endpoints are supported if the underlying PSM2
              library supports multiple endpoints.  This condition must be
              satisfied both when the provider is built and when the
              provider is used.  See the Scalable endpoints section for
              more information.

              Other supported capabilities include FI_TRIGGER,
              FI_REMOTE_CQ_DATA, FI_RMA_EVENT, FI_SOURCE, and FI_SOURCE_ERR.
              Furthermore, FI_NAMED_RX_CTX is supported when scalable
              endpoints are enabled.

       Modes  FI_CONTEXT is required for the FI_TAGGED and FI_MSG
              capabilities.  That means any request belonging to these two
              categories that generates a completion must pass as the
              operation context a valid pointer to type struct fi_context,
              and the space referenced by the pointer must remain untouched
              until the request has completed.  If neither FI_TAGGED nor
              FI_MSG is requested, the FI_CONTEXT mode is not required.

       Progress
              The psm2 provider requires manual progress.  The application
              is expected to call fi_cq_read or fi_cntr_read periodically
              when no other libfabric function is called, to ensure
              progress is made in a timely manner.  The provider does
              support auto progress mode.  However, performance can be
              significantly impacted if the application depends purely on
              the provider to make auto progress.

       Scalable endpoints
              Scalable endpoint support depends on the multi-EP feature of
              the PSM2 library.  If the PSM2 library supports this feature,
              its availability is further controlled by the environment
              variable PSM2_MULTI_EP.  The psm2 provider automatically sets
              this variable to 1 if it is not set.  The feature can be
              disabled explicitly by setting PSM2_MULTI_EP to 0.

              When creating a scalable endpoint, the exact number of
              contexts requested should be set in the “fi_info” structure
              passed to the fi_scalable_ep function.  This number should be
              set in “fi_info->ep_attr->tx_ctx_cnt” or
              “fi_info->ep_attr->rx_ctx_cnt” or both; the greater of the two
              values is used.  The psm2 provider allocates all requested
              contexts upfront when the scalable endpoint is created.  The
              same context is used for both Tx and Rx.

              For optimal performance, avoid having multiple threads access
              the same context, either directly by posting
              send/recv/read/write requests, or indirectly by polling the
              associated completion queues or counters.

              Using the scalable endpoint as a whole in communication
              functions is not supported.  Instead, an individual tx context
              or rx context of the scalable endpoint should be used.
              Similarly, using the address of the scalable endpoint as the
              source address or destination address doesn’t collectively
              address all the tx/rx contexts.  It addresses only the first
              tx/rx context.

       Shared Tx contexts
              To save PSM contexts by using a shared Tx context, the
              endpoints bound to the shared Tx context need to be Tx only.
              The reason is that Rx capability always requires a PSM
              context, which can also be used automatically for Tx.  As a
              result, allocating a shared Tx context for Rx-capable
              endpoints actually consumes one extra context instead of
              saving any.

       Unsupported features
              These features are unsupported: connection management, passive
              endpoint, and shared receive context.

RUNTIME PARAMETERS

       The psm2 provider checks for the following environment variables:

       FI_PSM2_UUID
              PSM requires that each job has a unique ID (UUID).  All the
              processes in the same job need to use the same UUID in order
              to be able to talk to each other.  The PSM reference manual
              advises keeping the UUID unique to each job.  In practice, it
              generally works fine to reuse a UUID as long as (1) no two
              jobs with the same UUID are running at the same time; and (2)
              previous jobs with the same UUID have exited normally.  If
              “resource busy” or “connection failure” errors occur for no
              apparent reason, it is advisable to manually set the UUID to
              a value different from the default.

              The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.

              It is possible to create endpoints with a UUID different from
              the one set here.  To achieve that, set
              `info->ep_attr->auth_key' to the UUID value and
              `info->ep_attr->auth_key_size' to its size (16 bytes) when
              calling fi_endpoint() or fi_scalable_ep().  It is still true
              that an endpoint can only communicate with endpoints that
              have the same UUID.
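
              Because the auth_key is a 16-byte value while the UUID is
              usually written as text, an application needs some way to
              convert between the two.  The following is a minimal sketch,
              not code from the provider: uuid_to_bytes is a hypothetical
              helper, and the byte order it produces (two hex digits per
              byte, left to right) is an assumption that should be checked
              against the provider before use.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of libfabric): convert a textual UUID such
 * as "00FF00FF-0000-0000-0000-00FF0F0F00FF" into 16 raw bytes.
 * Dashes are skipped wherever they appear.
 * Returns 0 on success, -1 unless the string holds exactly 32 hex digits. */
static int uuid_to_bytes(const char *s, uint8_t out[16])
{
    size_t ndigits = 0;

    memset(out, 0, 16);
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        unsigned v;

        if (c == '-')
            continue;
        if (!isxdigit(c) || ndigits == 32)
            return -1;
        v = isdigit(c) ? (unsigned)(c - '0')
                       : (unsigned)(toupper(c) - 'A' + 10);
        if (ndigits % 2 == 0)
            out[ndigits / 2] = (uint8_t)(v << 4);   /* high nibble */
        else
            out[ndigits / 2] |= (uint8_t)v;         /* low nibble */
        ndigits++;
    }
    return ndigits == 32 ? 0 : -1;
}
```

              The resulting buffer could then be assigned to
              `info->ep_attr->auth_key', with `info->ep_attr->auth_key_size'
              set to 16, as described above.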

       FI_PSM2_NAME_SERVER
              The psm2 provider has a simple built-in name server that can
              be used to resolve an IP address or host name into a transport
              address needed by the fi_av_insert call.  The main purpose of
              this name server is to allow simple client-server type
              applications (such as those in fabtests) to be written purely
              with libfabric, without using any out-of-band communication
              mechanism.  For such applications, the server would run first
              to allow endpoints to be created and registered with the name
              server, and then the client would call fi_getinfo with the
              node parameter set to the IP address or host name of the
              server.  The resulting fi_info structure would have the
              transport address of the endpoint created by the server in
              the dest_addr field.  Optionally, the service parameter can be
              used in addition to node.  Notice that the service number is
              interpreted by the provider and is not a TCP/IP port number.

              The name server is on by default.  It can be turned off by
              setting the variable to 0.  This may save a small amount of
              resources, since a separate thread is created when the name
              server is on.

              The provider detects OpenMPI and MPICH runs and changes the
              default setting to off.

       FI_PSM2_TAGGED_RMA
              The RMA functions are implemented on top of the PSM Active
              Message functions.  The Active Message functions have a limit
              on the size of data that can be transferred in a single
              message.  Large transfers can be divided into small chunks
              and pipelined.  However, this approach yields suboptimal
              bandwidth.

              The psm2 provider uses PSM tag-matching message queue
              functions to achieve higher bandwidth for large RMA transfers.
              It takes advantage of the extra tag bits available in PSM2 to
              separate the RMA traffic from the regular tagged message
              queue.

              The option is on by default.  To turn it off, set the variable
              to 0.

       FI_PSM2_DELAY
              Time (seconds) to sleep before closing PSM endpoints.  This is
              a workaround for a bug in some versions of the PSM library.

              The default setting is 0.

       FI_PSM2_TIMEOUT
              Timeout (seconds) for gracefully closing PSM endpoints.  A
              forced close is issued if the timeout expires.

              The default setting is 5.

       FI_PSM2_CONN_TIMEOUT
              Timeout (seconds) for establishing a connection between two
              PSM endpoints.

              The default setting is 5.

       FI_PSM2_PROG_INTERVAL
              When auto progress is enabled (requested via the hints passed
              to fi_getinfo), a progress thread is created to make progress
              calls from time to time.  This option sets the interval
              (microseconds) between progress calls.

              The default setting is 1 if affinity is set, or 1000 if not.
              See FI_PSM2_PROG_AFFINITY.

       FI_PSM2_PROG_AFFINITY
              When set, specifies the set of CPU cores to set the progress
              thread affinity to.  The format is
              <start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*,
              where each triplet <start>:<end>:<stride> defines a block of
              core_ids.  Both <start> and <end> can be either the core_id
              (when >=0) or core_id - num_cores (when <0).

              By default affinity is not set.
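
              To make the affinity format concrete, the sketch below expands
              a specification into a per-core selection array.  It is an
              illustration only and is not the provider's parser:
              parse_core_spec is a hypothetical name, and the defaults it
              applies (an omitted <end> equals <start>, an omitted <stride>
              equals 1, out-of-range cores are rejected) are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only, not the provider's parser.
 * Sets selected[i] = 1 for every core chosen by spec.
 * Assumptions: omitted <end> means <end> = <start>; omitted <stride>
 * means 1; values outside [0, num_cores) after adjustment are rejected.
 * Returns the number of selected cores, or -1 on a malformed spec. */
static int parse_core_spec(const char *spec, int num_cores,
                           unsigned char *selected)
{
    const char *p = spec;
    int count = 0;

    memset(selected, 0, (size_t)num_cores);
    while (*p) {
        char *end;
        long start, stop, stride = 1, c;

        start = strtol(p, &end, 10);
        if (end == p)
            return -1;
        stop = start;
        p = end;
        if (*p == ':') {
            stop = strtol(++p, &end, 10);
            if (end == p)
                return -1;
            p = end;
            if (*p == ':') {
                stride = strtol(++p, &end, 10);
                if (end == p || stride <= 0)
                    return -1;
                p = end;
            }
        }
        /* Negative values encode core_id - num_cores. */
        if (start < 0)
            start += num_cores;
        if (stop < 0)
            stop += num_cores;
        if (start < 0 || stop < 0 || start >= num_cores || stop >= num_cores)
            return -1;
        for (c = start; c <= stop; c += stride) {
            if (!selected[c]) {
                selected[c] = 1;
                count++;
            }
        }
        if (*p == ',') {
            if (!*++p)
                return -1;      /* trailing comma */
        } else if (*p) {
            return -1;          /* unexpected character */
        }
    }
    return count;
}
```

              With 8 cores, for example, the specification 0:3,-2:-1
              selects cores 0 through 3 plus the last two cores, 6 and 7.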

       FI_PSM2_INJECT_SIZE
              Maximum message size allowed for fi_inject and fi_tinject
              calls.  This is an experimental feature that allows some
              applications to override the default inject size limit.  When
              the inject size is larger than the default value, some inject
              calls might block.

              The default setting is 64.

       FI_PSM2_LOCK_LEVEL
              When set, dictates the level of locking used by the provider.
              Level 2 means all locks are enabled.  Level 1 disables some
              locks and is suitable for runs that limit the access to each
              PSM2 context to a single thread.  Level 0 disables all locks
              and thus is only suitable for single-threaded runs.

              To use level 0 or level 1, wait objects and auto progress mode
              cannot be used because they introduce internal threads that
              may break the conditions needed for these levels.

              The default setting is 2.

       FI_PSM2_LAZY_CONN
              There are two strategies for when to establish connections
              between the PSM2 endpoints that OFI endpoints are built on top
              of.  In eager connection mode, connections are established
              when addresses are inserted into the address vector.  In lazy
              connection mode, connections are established when addresses
              are used for the first time in communication.  Eager
              connection mode has slightly lower critical path overhead, but
              lazy connection mode scales better.

              This option controls how the two connection modes are used.
              When set to 1, lazy connection mode is always used.  When set
              to 0, eager connection mode is used when the required
              conditions are all met and lazy connection mode is used
              otherwise.  The conditions for eager connection mode are: (1)
              multiple endpoint (and scalable endpoint) support is disabled
              by explicitly setting PSM2_MULTI_EP=0; and (2) the address
              vector type is FI_AV_MAP.

              The default setting is 0.
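
              The selection rule described above can be summarized as a
              small decision function.  This is a sketch of the documented
              behavior, not the provider's source: the enum names and
              choose_conn_mode are hypothetical, and treating any non-zero
              FI_PSM2_LAZY_CONN value like 1 is an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; the real FI_AV_MAP and
 * FI_AV_TABLE constants are defined by the libfabric headers. */
enum av_type { AV_MAP, AV_TABLE };
enum conn_mode { CONN_EAGER, CONN_LAZY };

/* lazy_conn:         value of FI_PSM2_LAZY_CONN (default 0)
 * multi_ep_disabled: true only if PSM2_MULTI_EP was explicitly set to 0
 * av:                type of the address vector in use */
static enum conn_mode choose_conn_mode(int lazy_conn, bool multi_ep_disabled,
                                       enum av_type av)
{
    if (lazy_conn)      /* assumption: any non-zero value behaves like 1 */
        return CONN_LAZY;
    if (multi_ep_disabled && av == AV_MAP)
        return CONN_EAGER;
    return CONN_LAZY;   /* eager conditions not all met */
}
```

              Note that with the default settings (FI_PSM2_LAZY_CONN=0 and
              PSM2_MULTI_EP set to 1 by the provider), the first eager
              condition is not met, so lazy connection mode results.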

       FI_PSM2_DISCONNECT
              The provider has a mechanism to automatically send
              disconnection notifications to all connected peers before the
              local endpoint is closed.  In response, the peers call
              psm2_ep_disconnect to clean up the connection state on their
              side.  This allows the same PSM2 epid to be used by different
              dynamically started processes (clients) to communicate with
              the same peer (server).  This mechanism, however, introduces
              extra overhead in the finalization phase.  For applications
              that never reuse epids within the same session, such overhead
              is unnecessary.

              This option controls whether the automatic disconnection
              notification mechanism should be enabled.  For the
              client-server application mentioned above, the client side
              should set this option to 1, but the server should set it to
              0.

              The default setting is 0.

       FI_PSM2_TAG_LAYOUT
              Select how the 96-bit PSM2 tag bits are organized.  Currently
              three choices are available: tag60 means 32-4-60 partitioning
              for CQ data, internal protocol flags, and application tag.
              tag64 means 4-28-64 partitioning for internal protocol flags,
              CQ data, and application tag.  auto means to choose either
              tag60 or tag64 based on the hints passed to fi_getinfo: tag60
              is used if remote CQ data support is requested explicitly,
              either by passing a non-zero value via
              hints->domain_attr->cq_data_size or by including
              FI_REMOTE_CQ_DATA in hints->caps; otherwise tag64 is used.  If
              tag64 is the result of automatic selection, fi_getinfo also
              returns a second instance of the provider with the tag60
              layout.

              The default setting is auto.

              Notice that if the provider is compiled with the macro
              PSMX2_TAG_LAYOUT defined to 1 (meaning tag60) or 2 (meaning
              tag64), the choice is fixed at compile time and this runtime
              option is disabled.
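
              The two layouts and the auto rule can be modeled in a few
              lines.  The sketch below is illustrative only: the type and
              function names are hypothetical, the field widths are taken
              from the description above, and nothing here reflects how the
              provider actually stores or packs the tag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum tag_layout { TAG60, TAG64 };

/* Width in bits of each field of the 96-bit PSM2 tag. */
struct tag_fields {
    int cq_data;
    int flags;
    int app_tag;
};

static struct tag_fields fields_for(enum tag_layout layout)
{
    struct tag_fields t60 = { 32, 4, 60 };   /* tag60: 32-4-60 */
    struct tag_fields t64 = { 28, 4, 64 };   /* tag64: 4-28-64 */
    return layout == TAG60 ? t60 : t64;
}

/* The "auto" rule: tag60 if remote CQ data is requested explicitly,
 * either through a non-zero cq_data_size or the FI_REMOTE_CQ_DATA
 * capability; tag64 otherwise. */
static enum tag_layout auto_layout(size_t cq_data_size, bool remote_cq_data)
{
    return (cq_data_size != 0 || remote_cq_data) ? TAG60 : TAG64;
}

/* Mask covering the tag bits left to the application. */
static uint64_t app_tag_mask(enum tag_layout layout)
{
    int n = fields_for(layout).app_tag;
    return n >= 64 ? UINT64_MAX : (UINT64_C(1) << n) - 1;
}
```

              The trade-off is visible in the widths: tag60 gives up four
              application tag bits in exchange for a full 32 bits of CQ
              data, while tag64 keeps all 64 tag bits and leaves 28 bits
              for CQ data.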

PSM2 EXTENSIONS

       The psm2 provider supports limited low-level parameter setting through
       the fi_set_val() and fi_get_val() functions.  Currently the following
       parameters can be set via the domain fid:

       FI_PSM2_DISCONNECT
              Override the global runtime parameter FI_PSM2_DISCONNECT for
              this domain.  See the RUNTIME PARAMETERS section for details.

       Valid parameter names are defined in the header file
       rdma/fi_ext_psm2.h.

SEE ALSO

       fabric(7), fi_provider(7), fi_psm(7), fi_psm3(7)

AUTHORS

       OpenFabrics.



Libfabric Programmer’s Manual     2022-12-11                        fi_psm2(7)