fi_psm3(7)                     Libfabric v1.12.1                    fi_psm3(7)

NAME

       fi_psm3 - The PSM3 Fabric Provider

OVERVIEW

       The psm3 provider implements a Performance Scaled Messaging capability
       that supports Intel RoCEv2-capable NICs.  PSM3 is an Ethernet and
       standard RoCEv2 enhancement of previous PSM implementations.

SUPPORTED FEATURES

       The psm3 provider supports a subset of the features defined in the
       libfabric API.

       Endpoint types
              Supports the non-connection-based types FI_DGRAM and FI_RDM.

       Endpoint capabilities
              Endpoints can support any combination of the data transfer
              capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA.  These
              capabilities can be further refined by FI_SEND, FI_RECV,
              FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit
              the direction of operations.

       FI_MULTI_RECV is supported for the non-tagged message queue only.

       Scalable endpoints are supported if the underlying PSM3 library
       supports multiple endpoints.  This condition must be satisfied both
       when the provider is built and when the provider is used.  See the
       Scalable endpoints section for more information.

       Other supported capabilities include FI_TRIGGER, FI_REMOTE_CQ_DATA,
       FI_RMA_EVENT, FI_SOURCE, and FI_SOURCE_ERR.  Furthermore,
       FI_NAMED_RX_CTX is supported when scalable endpoints are enabled.

       Modes  FI_CONTEXT is required for the FI_TAGGED and FI_MSG
              capabilities.  That is, any request in these two categories
              that generates a completion must pass, as the operation
              context, a valid pointer to a struct fi_context, and the space
              referenced by the pointer must remain untouched until the
              request has completed.  If neither FI_TAGGED nor FI_MSG is
              requested, the FI_CONTEXT mode is not required.

       Progress
              The psm3 provider performs best with manual progress.  By
              default, the application is expected to call fi_cq_read or
              fi_cntr_read from time to time when no other libfabric
              function is called, to ensure that progress is made in a
              timely manner.  The provider does support auto progress mode.
              However, performance can be significantly impacted if the
              application depends purely on the provider to make auto
              progress.
       
       Scalable endpoints
              Scalable endpoint support depends on the multi-EP feature of
              the PSM3 library.  If the PSM3 library supports this feature,
              its availability is further controlled by the environment
              variable PSM3_MULTI_EP.  The psm3 provider automatically sets
              this variable to 1 if it is not set.  The feature can be
              disabled explicitly by setting PSM3_MULTI_EP to 0.

       When creating a scalable endpoint, the exact number of contexts
       requested should be set in the "fi_info" structure passed to the
       fi_scalable_ep function.  This number should be set in
       "fi_info->ep_attr->tx_ctx_cnt" or "fi_info->ep_attr->rx_ctx_cnt" or
       both; the greater of the two is used.  The psm3 provider allocates
       all requested contexts up front when the scalable endpoint is
       created.  The same context is used for both Tx and Rx.

       For optimal performance, avoid having multiple threads access the
       same context, either directly by posting send/recv/read/write
       requests, or indirectly by polling the associated completion queues
       or counters.

       Using the scalable endpoint as a whole in communication functions is
       not supported.  Instead, the individual tx or rx contexts of the
       scalable endpoint should be used.  Similarly, using the address of
       the scalable endpoint as the source or destination address does not
       collectively address all the tx/rx contexts.  It addresses only the
       first tx/rx context.

LIMITATIONS

       The psm3 provider doesn't support all the features defined in the
       libfabric API.  Here are some of the limitations not listed above:

       Unsupported features
              These features are unsupported: connection management, passive
              endpoints, and shared receive contexts.

RUNTIME PARAMETERS

       The psm3 provider checks for the following environment variables:

       FI_PSM3_UUID
              PSM requires that each job has a unique ID (UUID).  All the
              processes in the same job need to use the same UUID in order
              to be able to talk to each other.  The PSM reference manual
              advises keeping the UUID unique to each job.  In practice, it
              generally works fine to reuse a UUID as long as (1) no two
              jobs with the same UUID are running at the same time; and (2)
              previous jobs with the same UUID have exited normally.  If
              "resource busy" or "connection failure" issues occur for no
              known reason, it is advisable to manually set the UUID to a
              value different from the default.

       The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.

       It is possible to create endpoints with a UUID different from the one
       set here.  To do so, set 'info->ep_attr->auth_key' to the UUID value
       and 'info->ep_attr->auth_key_size' to its size (16 bytes) when calling
       fi_endpoint() or fi_scalable_ep().  An endpoint can still only
       communicate with endpoints that have the same UUID.
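       As an illustration of the 16-byte value mentioned above, the
       following is a minimal sketch that converts a UUID string into 16
       raw bytes.  The helper names hex_value and parse_uuid are
       hypothetical and are not part of libfabric or PSM3.  The sketch
       assumes that the bytes are taken in the order in which they are
       written and that dashes may simply be skipped.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one hexadecimal digit to its value, or return -1. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper: parse a UUID string such as
 * "00FF00FF-0000-0000-0000-00FF0F0F00FF" into 16 bytes.
 * Assumption: dashes are skipped wherever they appear and the bytes
 * are stored in the order written.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_uuid(const char *text, uint8_t out[16])
{
    assert(text != NULL && out != NULL);
    size_t n = 0;
    while (*text != '\0') {
        if (*text == '-') { text++; continue; }
        int hi = hex_value(text[0]);
        int lo = (hi < 0) ? -1 : hex_value(text[1]);
        if (hi < 0 || lo < 0 || n == 16) return -1;
        out[n++] = (uint8_t)((hi << 4) | lo);
        text += 2;
    }
    return (n == 16) ? 0 : -1;
}
```

       An application could then point 'info->ep_attr->auth_key' at the
       resulting buffer and set 'info->ep_attr->auth_key_size' to 16 before
       creating the endpoint.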
       
       FI_PSM3_NAME_SERVER
              The psm3 provider has a simple built-in name server that can be
              used to resolve an IP address or host name into a transport
              address needed by the fi_av_insert call.  The main purpose of
              this name server is to allow simple client-server type
              applications (such as those in fabtests) to be written purely
              with libfabric, without using any out-of-band communication
              mechanism.  For such applications, the server would run first
              to allow endpoints to be created and registered with the name
              server, and then the client would call fi_getinfo with the
              node parameter set to the IP address or host name of the
              server.  The resulting fi_info structure would have the
              transport address of the endpoint created by the server in the
              dest_addr field.  Optionally, the service parameter can be
              used in addition to node.  Note that the service number is
              interpreted by the provider and is not a TCP/IP port number.

       The name server is on by default.  It can be turned off by setting
       the variable to 0.  This may save a small amount of resources, since
       a separate thread is created when the name server is on.

       The provider detects OpenMPI and MPICH runs and changes the default
       setting to off.

       FI_PSM3_TAGGED_RMA
              The RMA functions are implemented on top of the PSM Active
              Message functions.  The Active Message functions have a limit
              on the size of data that can be transferred in a single
              message.  Large transfers can be divided into small chunks and
              pipelined.  However, the resulting bandwidth is sub-optimal.

       The psm3 provider uses PSM tag-matching message queue functions to
       achieve higher bandwidth for large RMA transfers.  It takes advantage
       of the extra tag bits available in PSM3 to separate the RMA traffic
       from the regular tagged message queue.

       The option is on by default.  To turn it off, set the variable to 0.
       
       FI_PSM3_DELAY
              Time (seconds) to sleep before closing PSM endpoints.  This is
              a workaround for a bug in some versions of the PSM library.

       The default setting is 0.

       FI_PSM3_TIMEOUT
              Timeout (seconds) for gracefully closing PSM endpoints.  A
              forced close is issued if the timeout expires.

       The default setting is 5.

       FI_PSM3_CONN_TIMEOUT
              Timeout (seconds) for establishing a connection between two
              PSM endpoints.

       The default setting is 5.

       FI_PSM3_PROG_INTERVAL
              When auto progress is enabled (requested via the hints to
              fi_getinfo), a progress thread is created to make progress
              calls from time to time.  This option sets the interval
              (microseconds) between progress calls.

       The default setting is 1 if affinity is set, or 1000 if not.  See
       FI_PSM3_PROG_AFFINITY.

       FI_PSM3_PROG_AFFINITY
              When set, specifies the set of CPU cores to which the progress
              thread affinity is set.  The format is
              <start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*,
              where each triplet <start>:<end>:<stride> defines a block of
              core_ids.  Both <start> and <end> can be either the core_id
              (when >=0) or core_id - num_cores (when <0).

       By default affinity is not set.
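       To make the FI_PSM3_PROG_AFFINITY format concrete, here is a minimal
       sketch that expands such a string into a list of core ids.  The
       function expand_affinity is hypothetical and is not the provider's
       parser.  It assumes that <end> defaults to <start>, that <stride>
       defaults to 1, and that blocks falling outside the valid core range
       are skipped; the text above does not state these details.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: expand an affinity string into core ids.
 * Assumptions: <end> defaults to <start>, <stride> defaults to 1, and
 * out-of-range blocks are skipped.
 * Returns the number of core ids written to cores[] (at most max_cores).
 */
static int expand_affinity(const char *spec, int num_cores,
                           int *cores, int max_cores)
{
    assert(spec != NULL && cores != NULL && num_cores > 0);
    int count = 0;
    const char *p = spec;
    while (*p != '\0') {
        int start = 0, end = 0, stride = 1;
        int n = sscanf(p, "%d:%d:%d", &start, &end, &stride);
        if (n >= 1) {
            if (n < 2) end = start;
            if (n < 3 || stride < 1) stride = 1;
            /* Negative values use the "core_id - num_cores" form. */
            if (start < 0) start += num_cores;
            if (end < 0) end += num_cores;
            if (start >= 0 && end < num_cores) {
                for (int c = start; c <= end && count < max_cores; c += stride)
                    cores[count++] = c;
            }
        }
        /* Move on to the next comma-separated block, if any. */
        const char *comma = strchr(p, ',');
        if (comma == NULL) break;
        p = comma + 1;
    }
    return count;
}
```

       Under these assumptions, on a machine with 8 cores the string
       "0:6:2,-1" selects cores 0, 2, 4, and 6 from the first block and
       core 7 from the second block.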
       
       FI_PSM3_INJECT_SIZE
              Maximum message size allowed for fi_inject and fi_tinject
              calls.  This is an experimental feature that allows some
              applications to override the default inject size limitation.
              When the inject size is larger than the default value, some
              inject calls might block.

       The default setting is 64.

       FI_PSM3_LOCK_LEVEL
              When set, dictates the level of locking used by the provider.
              Level 2 means all locks are enabled.  Level 1 disables some
              locks and is suitable for runs that limit the access to each
              PSM3 context to a single thread.  Level 0 disables all locks
              and thus is only suitable for single-threaded runs.

       To use level 0 or level 1, wait objects and auto progress mode cannot
       be used, because they introduce internal threads that may break the
       conditions needed for these levels.

       The default setting is 2.

       FI_PSM3_LAZY_CONN
              There are two strategies for when to establish connections
              between the PSM3 endpoints that OFI endpoints are built on top
              of.  In eager connection mode, connections are established
              when addresses are inserted into the address vector.  In lazy
              connection mode, connections are established when addresses
              are used for the first time in communication.  Eager
              connection mode has slightly lower critical path overhead, but
              lazy connection mode scales better.

       This option controls how the two connection modes are used.  When set
       to 1, lazy connection mode is always used.  When set to 0, eager
       connection mode is used when all the required conditions are met, and
       lazy connection mode is used otherwise.  The conditions for eager
       connection mode are: (1) multiple endpoint (and scalable endpoint)
       support is disabled by explicitly setting PSM3_MULTI_EP=0; and (2)
       the address vector type is FI_AV_MAP.

       The default setting is 0.
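       The selection rule for FI_PSM3_LAZY_CONN can be sketched as a small
       decision function.  This is an illustration of the rule as described
       above, not the provider's code; the enum av_type and the function
       use_lazy_conn are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the libfabric address vector types. */
enum av_type { AV_MAP, AV_TABLE };

/*
 * Return true when lazy connection mode would be used.
 * lazy_conn:         value of FI_PSM3_LAZY_CONN (0 or 1)
 * multi_ep_disabled: true only if PSM3_MULTI_EP was explicitly set to 0
 * type:              address vector type in use
 */
static bool use_lazy_conn(int lazy_conn, bool multi_ep_disabled,
                          enum av_type type)
{
    assert(lazy_conn == 0 || lazy_conn == 1);
    if (lazy_conn == 1)
        return true;    /* lazy mode is always used */
    /* Eager mode needs both conditions; otherwise fall back to lazy. */
    bool eager_ok = multi_ep_disabled && type == AV_MAP;
    return !eager_ok;
}
```

       Because the provider sets PSM3_MULTI_EP to 1 when it is unset, this
       rule suggests that lazy connection mode is what is used with default
       settings, even though FI_PSM3_LAZY_CONN defaults to 0.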
       
       FI_PSM3_DISCONNECT
              The provider has a mechanism to automatically send
              disconnection notifications to all connected peers before the
              local endpoint is closed.  In response, the peers call
              psm3_ep_disconnect to clean up the connection state on their
              side.  This allows the same PSM3 epid to be used by different
              dynamically started processes (clients) to communicate with
              the same peer (server).  This mechanism, however, introduces
              extra overhead to the finalization phase.  For applications
              that never reuse epids within the same session, such overhead
              is unnecessary.

       This option controls whether the automatic disconnection notification
       mechanism is enabled.  For the client-server applications mentioned
       above, the client side should set this option to 1, but the server
       should set it to 0.

       The default setting is 0.

       FI_PSM3_TAG_LAYOUT
              Selects how the 96 PSM3 tag bits are organized.  Currently
              three choices are available: tag60 means 32-4-60 partitioning
              for CQ data, internal protocol flags, and application tag.
              tag64 means 4-28-64 partitioning for internal protocol flags,
              CQ data, and application tag.  auto means to choose either
              tag60 or tag64 based on the hints passed to fi_getinfo: tag60
              is used if remote CQ data support is requested explicitly,
              either by passing a non-zero value via
              hints->domain_attr->cq_data_size or by including
              FI_REMOTE_CQ_DATA in hints->caps; otherwise tag64 is used.  If
              tag64 is the result of automatic selection, fi_getinfo also
              returns a second instance of the provider with the tag60
              layout.

       The default setting is auto.

       Note that if the provider is compiled with the macro PSMX3_TAG_LAYOUT
       defined to 1 (meaning tag60) or 2 (meaning tag64), the choice is
       fixed at compile time and this runtime option is disabled.
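       The two partitionings of the 96 tag bits can be illustrated with a
       short sketch.  The struct tag96 and the functions pack_tag60 and
       pack_tag64 are hypothetical.  The sketch assumes that the fields are
       laid out from most significant to least significant bit in the order
       listed above; the actual placement used inside PSM3 may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 96-bit tag: a 32-bit high word and a 64-bit low word. */
struct tag96 {
    uint32_t hi;
    uint64_t lo;
};

/* tag60: 32 bits CQ data | 4 bits flags | 60 bits application tag. */
static struct tag96 pack_tag60(uint32_t cq_data, unsigned flags, uint64_t tag)
{
    assert(flags < 16u && tag < (UINT64_C(1) << 60));
    struct tag96 t;
    t.hi = cq_data;                          /* full 32 bits of CQ data */
    t.lo = ((uint64_t)flags << 60) | tag;    /* flags above a 60-bit tag */
    return t;
}

/* tag64: 4 bits flags | 28 bits CQ data | 64 bits application tag. */
static struct tag96 pack_tag64(unsigned flags, uint32_t cq_data, uint64_t tag)
{
    assert(flags < 16u && cq_data < (UINT32_C(1) << 28));
    struct tag96 t;
    t.hi = ((uint32_t)flags << 28) | cq_data;  /* flags above 28-bit data */
    t.lo = tag;                                /* full 64-bit tag */
    return t;
}
```

       The sketch makes the trade-off visible: tag60 leaves room for 32 bits
       of remote CQ data at the cost of 4 application tag bits, while tag64
       keeps the full 64-bit application tag and limits CQ data to 28 bits.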

SEE ALSO

       fabric(7), fi_provider(7), fi_psm(7), fi_psm2(7)

AUTHORS

       OpenFabrics.



Libfabric Programmer's Manual     2021-02-10                        fi_psm3(7)