fi_psm3(7)                     Libfabric v1.15.1                    fi_psm3(7)

NAME

       fi_psm3 - The PSM3 Fabric Provider

OVERVIEW

       The psm3 provider implements a Performance Scaled Messaging capability
       which supports most verbs UD and sockets devices. Additional features
       and optimizations can be enabled when running over Intel’s E810
       Ethernet NICs and/or using Intel’s rendezvous kernel module (rv). PSM
       3.x fully integrates the OFI provider and the underlying PSM3
       protocols/implementation and only exports the OFI APIs.

SUPPORTED FEATURES

       The psm3 provider supports a subset of all the features defined in the
       libfabric API.

       Endpoint types
              Supports non-connection based types FI_DGRAM and FI_RDM.

       Endpoint capabilities
              Endpoints can support any combination of the data transfer
              capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA. These
              capabilities can be further refined by FI_SEND, FI_RECV,
              FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit
              the direction of operations.

       FI_MULTI_RECV is supported for the non-tagged message queue only.

       Scalable endpoints are supported if the underlying PSM3 library
       supports multiple endpoints. This condition must be satisfied both
       when the provider is built and when the provider is used. See the
       Scalable endpoints section for more information.

       Other supported capabilities include FI_TRIGGER, FI_REMOTE_CQ_DATA,
       FI_RMA_EVENT, FI_SOURCE, and FI_SOURCE_ERR. Furthermore,
       FI_NAMED_RX_CTX is supported when scalable endpoints are enabled.
       
       Modes
              FI_CONTEXT is required for the FI_TAGGED and FI_MSG
              capabilities. That is, any request in these two categories that
              generates a completion must pass, as the operation context, a
              valid pointer to a struct fi_context, and the memory referenced
              by the pointer must remain untouched until the request has
              completed. If neither FI_TAGGED nor FI_MSG is requested, the
              FI_CONTEXT mode is not required.
       
       Progress
              The psm3 provider performs best with manual progress. By
              default, the application is expected to call fi_cq_read or
              fi_cntr_read from time to time when no other libfabric function
              is called, to ensure that progress is made in a timely manner.
              The provider does support auto progress mode. However,
              performance can be significantly impacted if the application
              depends purely on the provider to make auto progress.
       
       Scalable endpoints
              Scalable endpoint support depends on the multi-EP feature of
              the PSM3 library. If the PSM3 library supports this feature,
              its availability is further controlled by the environment
              variable PSM3_MULTI_EP. The psm3 provider automatically sets
              this variable to 1 if it is not set. The feature can be
              disabled explicitly by setting PSM3_MULTI_EP to 0.

       When creating a scalable endpoint, the exact number of contexts
       requested should be set in the “fi_info” structure passed to the
       fi_scalable_ep function. This number should be set in
       “fi_info->ep_attr->tx_ctx_cnt” or “fi_info->ep_attr->rx_ctx_cnt” or
       both; whichever is greater is used. The psm3 provider allocates all
       requested contexts upfront when the scalable endpoint is created. The
       same context is used for both Tx and Rx.

       For optimal performance, avoid having multiple threads access the same
       context, either directly by posting send/recv/read/write requests or
       indirectly by polling the associated completion queues or counters.

       Using the scalable endpoint as a whole in communication functions is
       not supported. Instead, an individual tx context or rx context of the
       scalable endpoint should be used. Similarly, using the address of the
       scalable endpoint as the source or destination address does not
       collectively address all the tx/rx contexts; it addresses only the
       first tx/rx context.

LIMITATIONS

       The psm3 provider doesn’t support all the features defined in the
       libfabric API. Here are some of the limitations not listed above:

       Unsupported features
              These features are unsupported: connection management, passive
              endpoint, and shared receive context.

RUNTIME PARAMETERS

       The psm3 provider checks for the following environment variables:

       FI_PSM3_UUID
              PSM requires that each job has a unique ID (UUID). All the
              processes in the same job need to use the same UUID in order to
              be able to talk to each other. The PSM reference manual advises
              keeping the UUID unique to each job. In practice, reusing a
              UUID generally works as long as (1) no two jobs with the same
              UUID are running at the same time; and (2) previous jobs with
              the same UUID have exited normally. If “resource busy” or
              “connection failure” errors occur for no apparent reason, it is
              advisable to manually set the UUID to a value different from
              the default.

       The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.

       It is possible to create endpoints with a UUID different from the one
       set here. To achieve that, set “info->ep_attr->auth_key” to the UUID
       value and “info->ep_attr->auth_key_size” to its size (16 bytes) when
       calling fi_endpoint() or fi_scalable_ep(). It is still true that an
       endpoint can only communicate with endpoints with the same UUID.
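       As an illustration of the per-endpoint UUID described above, the
       following is a minimal sketch that converts a textual UUID into 16 raw
       bytes. The helper name parse_uuid is hypothetical and not part of
       libfabric, and taking the hex digits left to right as the byte order
       is an assumption; the provider source is the authority on how the key
       is interpreted.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one hex digit to its value, or return -1 if it is not one. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper (not a libfabric function): parse a textual UUID
 * such as "00FF00FF-0000-0000-0000-00FF0F0F00FF" into 16 raw bytes.
 * Dashes are skipped wherever they appear. Returns 0 on success, or -1
 * if the string does not contain exactly 32 hex digits.
 */
int parse_uuid(const char *str, uint8_t out[16])
{
    size_t n = 0; /* number of hex digits consumed so far */

    for (; *str; str++) {
        if (*str == '-')
            continue;
        int v = hex_val(*str);
        if (v < 0 || n >= 32)
            return -1;
        if (n % 2 == 0)
            out[n / 2] = (uint8_t)(v << 4); /* high nibble */
        else
            out[n / 2] |= (uint8_t)v;       /* low nibble */
        n++;
    }
    return n == 32 ? 0 : -1;
}
```

       The resulting buffer could then be referenced by
       “info->ep_attr->auth_key”, with “info->ep_attr->auth_key_size” set to
       16, before calling fi_endpoint() or fi_scalable_ep().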
       
       FI_PSM3_NAME_SERVER
              The psm3 provider has a simple built-in name server that can be
              used to resolve an IP address or host name into a transport
              address needed by the fi_av_insert call. The main purpose of
              this name server is to allow simple client-server type
              applications (such as those in fabtests) to be written purely
              with libfabric, without using any out-of-band communication
              mechanism. For such applications, the server would run first to
              allow endpoints to be created and registered with the name
              server, and then the client would call fi_getinfo with the node
              parameter set to the IP address or host name of the server. The
              resulting fi_info structure would have the transport address of
              the endpoint created by the server in the dest_addr field.
              Optionally, the service parameter can be used in addition to
              node. Notice that the service number is interpreted by the
              provider and is not a TCP/IP port number.

       The name server is on by default. It can be turned off by setting the
       variable to 0. This may save a small amount of resources, since a
       separate thread is created when the name server is on.

       The provider detects OpenMPI and MPICH runs and changes the default
       setting to off.
       
       FI_PSM3_TAGGED_RMA
              The RMA functions are implemented on top of the PSM Active
              Message functions. The Active Message functions have a limit on
              the size of data that can be transferred in a single message.
              Large transfers can be divided into small chunks and pipelined.
              However, the bandwidth achieved this way is suboptimal.

       The psm3 provider uses PSM tag-matching message queue functions to
       achieve higher bandwidth for large RMA transfers. It takes advantage
       of the extra tag bits available in PSM3 to separate the RMA traffic
       from the regular tagged message queue.

       The option is on by default. To turn it off, set the variable to 0.
       
       FI_PSM3_DELAY
              Time (seconds) to sleep before closing PSM endpoints. This is a
              workaround for a bug in some versions of the PSM library.

       The default setting is 0.

       FI_PSM3_TIMEOUT
              Timeout (seconds) for gracefully closing PSM endpoints. A
              forced close is issued if the timeout expires.

       The default setting is 5.

       FI_PSM3_CONN_TIMEOUT
              Timeout (seconds) for establishing a connection between two PSM
              endpoints.

       The default setting is 5.
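       The timing variables above are plain integers in seconds, so an
       application can supply its own defaults from code. The following is a
       minimal sketch using the POSIX setenv call. The helper name
       set_default_seconds is hypothetical, and the sketch assumes that the
       provider reads these variables when libfabric is initialized, so the
       helper would have to run before the first libfabric call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for setenv() */
#endif
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of libfabric): give an FI_PSM3_* timing
 * variable a value in seconds unless the user already set one in the
 * environment. Returns 0 on success and -1 on error.
 */
int set_default_seconds(const char *name, int seconds)
{
    char buf[16];

    if (seconds < 0)
        return -1; /* negative durations make no sense here */
    snprintf(buf, sizeof buf, "%d", seconds);
    /* overwrite = 0: a value already present in the environment wins */
    return setenv(name, buf, 0);
}
```

       For example, calling set_default_seconds("FI_PSM3_CONN_TIMEOUT", 10)
       early in main() would raise the connection timeout for slow fabrics
       while still letting users override it from the shell.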
       
       FI_PSM3_PROG_INTERVAL
              When auto progress is enabled (asked via the hints to
              fi_getinfo), a progress thread is created to make progress
              calls from time to time. This option sets the interval
              (microseconds) between progress calls.

       The default setting is 1 if affinity is set, or 1000 if not. See
       FI_PSM3_PROG_AFFINITY.

       FI_PSM3_PROG_AFFINITY
              When set, specifies the set of CPU cores to which the progress
              thread affinity is set. The format is
              <start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*, where
              each triplet <start>:<end>:<stride> defines a block of
              core_ids. Both <start> and <end> can be either the core_id
              (when >=0) or core_id - num_cores (when <0).

       By default affinity is not set.
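       To make the affinity format concrete, the following is a minimal
       sketch that expands a specification into a list of core ids. The
       helper name expand_affinity is hypothetical and this is not the
       provider's parser. It assumes that an omitted <end> means a single
       core, that an omitted <stride> means 1, and that <end> is inclusive;
       none of these defaults are stated in the text above.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: expand an affinity spec such as "0:6:2,-1" into
 * core ids. Negative <start>/<end> values are offset by num_cores, as
 * described for FI_PSM3_PROG_AFFINITY. Returns the number of ids written
 * to cores[], or -1 on malformed input or if more than max ids result.
 */
int expand_affinity(const char *spec, int num_cores, int *cores, int max)
{
    int count = 0;
    const char *p = spec;

    while (*p) {
        long v[3];
        int nfields = 0;
        char *end;

        /* Read up to three colon-separated integers. */
        for (;;) {
            v[nfields] = strtol(p, &end, 10);
            if (end == p)
                return -1; /* no number where one was expected */
            nfields++;
            p = end;
            if (*p != ':' || nfields == 3)
                break;
            p++;
        }

        long start = v[0];
        long stop = nfields > 1 ? v[1] : v[0]; /* assumed default */
        long stride = nfields > 2 ? v[2] : 1;  /* assumed default */

        if (start < 0)
            start += num_cores;
        if (stop < 0)
            stop += num_cores;
        if (stride < 1 || start < 0 || stop >= num_cores)
            return -1;

        for (long c = start; c <= stop; c += stride) {
            if (count == max)
                return -1;
            cores[count++] = (int)c;
        }

        if (*p == ',')
            p++;
        else if (*p)
            return -1; /* unexpected character after a block */
    }
    return count;
}
```

       With num_cores set to 8, the specification "0:6:2,-1" expands under
       these assumptions to cores 0, 2, 4, 6, and 7.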
       
       FI_PSM3_LOCK_LEVEL
              When set, dictates the level of locking used by the provider.
              Level 2 means all locks are enabled. Level 1 disables some
              locks and is suitable for runs that limit the access to each
              PSM3 context to a single thread. Level 0 disables all locks and
              thus is only suitable for single-threaded runs.

       To use level 0 or level 1, wait objects and auto progress mode cannot
       be used because they introduce internal threads that may break the
       conditions needed for these levels.

       The default setting is 2.
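       The rules above can be restated as a small decision function. This is
       only a sketch of how an application might pick a value for
       FI_PSM3_LOCK_LEVEL; the function name and its three parameters are
       hypothetical, and the provider itself does not make this choice
       automatically.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: pick the lowest lock level that the text above
 * allows for a given application.
 *   single_threaded      - only one application thread uses libfabric
 *   one_thread_per_ctx   - each PSM3 context is used by one thread only
 *   uses_wait_or_autoprg - wait objects or auto progress mode are used
 */
int choose_lock_level(bool single_threaded, bool one_thread_per_ctx,
                      bool uses_wait_or_autoprg)
{
    /* Internal threads rule out levels 0 and 1. */
    if (uses_wait_or_autoprg)
        return 2;
    if (single_threaded)
        return 0;
    if (one_thread_per_ctx)
        return 1;
    return 2;
}
```

       Falling back to level 2 whenever in doubt matches the default and
       keeps all locks enabled.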
       
       FI_PSM3_LAZY_CONN
              There are two strategies for when to establish connections
              between the PSM3 endpoints that OFI endpoints are built on top
              of. In eager connection mode, connections are established when
              addresses are inserted into the address vector. In lazy
              connection mode, connections are established when addresses are
              used the first time in communication. Eager connection mode has
              slightly lower critical path overhead but lazy connection mode
              scales better.

       This option controls how the two connection modes are used. When set
       to 1, lazy connection mode is always used. When set to 0, eager
       connection mode is used when the required conditions are all met and
       lazy connection mode is used otherwise. The conditions for eager
       connection mode are: (1) multiple endpoint (and scalable endpoint)
       support is disabled by explicitly setting PSM3_MULTI_EP=0; and (2) the
       address vector type is FI_AV_MAP.

       The default setting is 0.
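       The selection rule above can be written as a single predicate. The
       following sketch merely restates the documented conditions; the
       function and parameter names are hypothetical and this is not the
       provider's source code.

```c
#include <stdbool.h>

/*
 * Hypothetical restatement of the documented rule.
 *   lazy_conn_setting - value of FI_PSM3_LAZY_CONN (0 or 1)
 *   multi_ep_disabled - PSM3_MULTI_EP was explicitly set to 0
 *   av_is_map         - the address vector type is FI_AV_MAP
 * Returns true if lazy connection mode is used, false for eager mode.
 */
bool uses_lazy_conn(int lazy_conn_setting, bool multi_ep_disabled,
                    bool av_is_map)
{
    if (lazy_conn_setting == 1)
        return true; /* lazy mode is always used */
    /* Eager mode needs both conditions; otherwise fall back to lazy. */
    return !(multi_ep_disabled && av_is_map);
}
```

       Note the consequence for the default configuration: since the
       provider sets PSM3_MULTI_EP to 1 when it is unset, the default value
       0 of FI_PSM3_LAZY_CONN still results in lazy connection mode unless
       PSM3_MULTI_EP=0 is set explicitly.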
       
       FI_PSM3_DISCONNECT
              The provider has a mechanism to automatically send
              disconnection notifications to all connected peers before the
              local endpoint is closed. In response, the peers call
              psm3_ep_disconnect to clean up the connection state on their
              side. This allows the same PSM3 epid to be used by different
              dynamically started processes (clients) to communicate with the
              same peer (server). This mechanism, however, introduces extra
              overhead in the finalization phase. For applications that never
              reuse epids within the same session, such overhead is
              unnecessary.

       This option controls whether the automatic disconnection notification
       mechanism should be enabled. For the client-server applications
       mentioned above, the client side should set this option to 1, but the
       server should set it to 0.

       The default setting is 0.
       
       FI_PSM3_TAG_LAYOUT
              Select how the 96-bit PSM3 tag bits are organized. Currently
              three choices are available: tag60 means 32-4-60 partitioning
              for CQ data, internal protocol flags, and application tag.
              tag64 means 4-28-64 partitioning for internal protocol flags,
              CQ data, and application tag. auto means to choose either tag60
              or tag64 based on the hints passed to fi_getinfo: tag60 is used
              if remote CQ data support is requested explicitly, either by
              passing a non-zero value via hints->domain_attr->cq_data_size
              or by including FI_REMOTE_CQ_DATA in hints->caps; otherwise
              tag64 is used. If tag64 is the result of automatic selection,
              fi_getinfo also returns a second instance of the provider with
              the tag60 layout.

       The default setting is auto.

       Notice that if the provider is compiled with the macro
       PSMX3_TAG_LAYOUT defined to 1 (means tag60) or 2 (means tag64), the
       choice is fixed at compile time and this runtime option is disabled.
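       The two layouts and the auto rule can be summarized in a few lines of
       code. The following is a sketch only: the enum, struct, and function
       names are hypothetical, the field order within the 96 bits is not
       modeled, and only the field widths and the selection rule come from
       the text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the two layouts described above. */
enum tag_layout { TAG60, TAG64 };

/* Field widths in bits; together they always cover the 96 tag bits. */
struct tag_widths {
    unsigned cq_data_bits;
    unsigned flag_bits;
    unsigned tag_bits;
};

struct tag_widths layout_widths(enum tag_layout layout)
{
    struct tag_widths w60 = { 32, 4, 60 }; /* tag60: 32-4-60 */
    struct tag_widths w64 = { 28, 4, 64 }; /* tag64: 4-28-64 */
    return layout == TAG60 ? w60 : w64;
}

/*
 * The "auto" rule: tag60 if remote CQ data is requested explicitly,
 * either through a non-zero cq_data_size or through FI_REMOTE_CQ_DATA
 * in the requested capabilities; tag64 otherwise.
 */
enum tag_layout auto_select(size_t cq_data_size, bool remote_cq_data_in_caps)
{
    return (cq_data_size != 0 || remote_cq_data_in_caps) ? TAG60 : TAG64;
}

/* Largest application tag value that fits in the chosen layout. */
uint64_t max_app_tag(enum tag_layout layout)
{
    unsigned bits = layout_widths(layout).tag_bits;
    return bits >= 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}
```

       The trade-off is visible in the widths: tag60 leaves the full 32 bits
       for CQ data at the cost of four application tag bits, while tag64
       keeps a full 64-bit tag and limits CQ data to 28 bits.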

SEE ALSO

       fabric(7), fi_provider(7), fi_psm(7), fi_psm2(7)

AUTHORS

       OpenFabrics.

Libfabric Programmer’s Manual     2022-03-30                        fi_psm3(7)