fi_opx(7)                      Libfabric v1.15.1                     fi_opx(7)

NAME

       fi_opx - The Omni-Path Express Fabric Provider

OVERVIEW

       The OPX provider is a native implementation of the libfabric interfaces
       that makes direct use of Omni-Path fabrics as well as libfabric
       acceleration features.  The purpose of this provider is to show the
       scalability and performance of libfabric, providing an “extreme scale”
       development environment for applications and middleware using the
       libfabric API, and to support a functional and performant version of
       MPI on Omni-Path fabrics.

SUPPORTED FEATURES

       The OPX provider supports most features defined for the libfabric API.

       Key features include:

       Endpoint types
              The Omni-Path HFI hardware is connectionless and reliable.  The
              OPX provider only supports the FI_EP_RDM endpoint type.

       Capabilities
              Supported capabilities include FI_MSG, FI_RMA, FI_TAGGED,
              FI_ATOMIC, FI_NAMED_RX_CTX, FI_SOURCE, FI_SEND, FI_RECV,
              FI_MULTI_RECV, and FI_DIRECTED_RECV.

       Note on the FI_DIRECTED_RECV capability: the immediate data sent in
       the “senddata” call to support FI_DIRECTED_RECV for OPX must be exactly
       4 bytes.  OPX uses these 4 bytes to completely identify the source
       address, up to an exascale-level number of ranks, for tag matching on
       the receive side, and the data can be managed within the MU packet.
       The domain attribute “cq_data_size” is therefore set to 4, which is the
       OFI standard minimum.

       Modes
              Two modes are defined: FI_CONTEXT2 and FI_ASYNC_IOV.  The OPX
              provider requires FI_CONTEXT2.

       Additional features
              Supported additional features include FABRIC_DIRECT, scalable
              endpoints, and counters.

       Progress
              Only FI_PROGRESS_MANUAL is supported.

       Address vector
              Only the FI_AV_MAP address vector format is supported.

       Memory registration modes
              Only FI_MR_SCALABLE is supported.

UNSUPPORTED FEATURES

       Endpoint types
              Unsupported endpoint types include FI_EP_DGRAM and FI_EP_MSG.

       Capabilities
              The OPX provider does not support the FI_RMA_EVENT and
              FI_TRIGGER capabilities.

       Address vector
              The OPX provider does not support the FI_AV_TABLE address vector
              format.  This may be added in the future.

LIMITATIONS

       As OPX is under development, this list of limitations is subject to
       change.

       OPX runs under the following MPI versions:

       - Intel MPI from Parallel Studio 2020, update 4.
       - Intel MPI from OneAPI 2021, update 3.
       - Open MPI 4.1.2a1 (older versions of Open MPI will not work).
       - MPICH 3.4.2.

       Currently, this provider is PIO-only.  SDMA is not supported at this
       time.

       Usage:

       - If using Open MPI 4.1.x, disable the UCX and openib transports.
       - OPX is not compatible with the Open MPI 4.1.x PML/BTL.
       - DMA, RDMA, and SDMA are not implemented.
       - Performance falls off for message sizes larger than 1 MTU (4K max
         size).
       - Shared memory is not cleaned up after an application crashes.  Use
         "rm -rf /dev/shm/*" to remove old shared-memory files.

RUNTIME PARAMETERS

       FI_OPX_UUID
              OPX requires a unique ID for each job.  For all processes in a
              job to communicate with each other, they must use the same
              UUID.  This variable can be set with FI_OPX_UUID=${RANDOM}.
              The default UUID is 00112233445566778899aabbccddeeff.
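
              The requirement that every process in a job share one UUID can
              be sketched as a small launcher helper.  This is an
              illustration only: make_job_env is a hypothetical name, and the
              sketch assumes that a 32-hex-digit string in the same shape as
              the default value is an acceptable FI_OPX_UUID.

```python
import os
import uuid


def make_job_env(base_env=None):
    """Return a copy of the environment with one fresh FI_OPX_UUID.

    Assumption: a 32-hex-digit string, shaped like the documented default,
    is accepted as the job UUID.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FI_OPX_UUID"] = uuid.uuid4().hex  # 32 lowercase hex characters
    return env


# Generate the value once per job, then start every rank with this same
# mapping so that all processes of the job agree on the UUID.
job_env = make_job_env()
```

              Generating the value once in the launcher, rather than in each
              process, is what keeps the UUID identical across the job.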

       FI_OPX_RELIABILITY_SERVICE_USEC_MAX
              This setting controls how frequently the reliability/replay
              function will issue PING requests to a remote connection.
              Reducing this value may improve performance at the expense of
              increased traffic on the OPX fabric.  Default setting is 500.

       FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE
              This setting controls how frequently a receiving rank will send
              ACKs for packets it has received without being prompted through
              a PING request.  A non-zero value N tells the receiving rank to
              send an ACK for the last N packets every Nth packet.  Using
              this setting in conjunction with an increased value for
              FI_OPX_RELIABILITY_SERVICE_USEC_MAX may improve performance.

              Valid values are 0 (disabled) and powers of 2 in the range of
              1-32,768, inclusive.

              Default setting is 64.
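
              The valid-value rule for FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE
              can be checked before launching a job.  The function below is a
              minimal sketch of that rule as stated above;
              is_valid_pre_ack_rate is a hypothetical helper name, not part
              of OPX or libfabric.

```python
def is_valid_pre_ack_rate(value: int) -> bool:
    """Return True if value is 0 (disabled) or a power of 2 in [1, 32768]."""
    if value == 0:
        return True
    # A positive power of 2 has exactly one bit set, so clearing the lowest
    # set bit with value & (value - 1) leaves zero.
    return 1 <= value <= 32768 and (value & (value - 1)) == 0
```

              Under this check the default of 64 is accepted, while a value
              such as 100 is rejected because it is not a power of 2.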

       FI_OPX_HFI_SELECT
              Controls how OPX chooses which HFI to use when opening a
              context.  Has two forms:

              - <hfi-unit>
                Force the OPX provider to use hfi-unit.
              - <selector1>[,<selector2>[,...,<selectorN>]]
                Select the HFI based on the first matching selector.

              Where selector is one of the following forms:

              - default
                Use the default logic.
              - fixed:<hfi-unit>
                Fix to one hfi-unit.
              - <selector-type>:<hfi-unit>:<selector-data>

              The above fields have the following meaning:

              - selector-type
                The selector criteria the caller opening the context is
                evaluated against.
              - hfi-unit
                The HFI to use if the caller matches the selector.
              - selector-data
                Data the caller must match (e.g. NUMA node ID).

              Where selector-type is one of the following:

              - numa
                True when the caller is local to the NUMA node ID given by
                selector-data.
              - core
                True when the caller is local to the CPU core given by
                selector-data.

              And selector-data is one of the following:

              - value
                The specific value to match.
              - <range-start>-<range-end>
                Matches any value in that range.

              In the second form, when opening a context, OPX uses the
              hfi-unit of the first matching selector.  Selectors are
              evaluated left to right.  OPX will return an error if the
              caller does not match any selector.

              In either form, it is an error if the specified or selected HFI
              is not in the Active state.  In this case, OPX will return an
              error and execution will not continue.

              With this option, it is possible to cause OPX to try to open
              more contexts on an HFI than there are free contexts on that
              HFI.  In this case, one or more of the context-opening calls
              will fail and OPX will return an error.  For the second form,
              the selected HFI depends on properties of the caller, so
              deterministic HFI selection requires deterministic caller
              properties.  For example, with the numa selector, if the caller
              can migrate between NUMA nodes, HFI selection will not be
              deterministic.

              The first selector in the list that matches the caller always
              determines the logic used.  For example, default and fixed
              match all callers, so if either appears at the beginning of a
              selector list, only that fixed or default selector is ever
              used, regardless of any selectors that follow it.

              Examples:

              - FI_OPX_HFI_SELECT=0
                All callers will open contexts on HFI 0.
              - FI_OPX_HFI_SELECT=numa:0:0,numa:1:1,numa:0:2,numa:1:3
                Callers local to NUMA nodes 0 and 2 will use HFI 0; callers
                local to NUMA nodes 1 and 3 will use HFI 1.
              - FI_OPX_HFI_SELECT=numa:0:0-3,default
                Callers local to NUMA nodes 0 through 3 (including 0 and 3)
                will use HFI 0, and all others will use the default
                selection logic.
              - FI_OPX_HFI_SELECT=core:1:0,fixed:0
                Callers local to CPU core 0 will use HFI 1, and all others
                will use HFI 0.
              - FI_OPX_HFI_SELECT=default,core:1:0
                All callers will use the default HFI selection logic.
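
              The left-to-right, first-match evaluation described for
              FI_OPX_HFI_SELECT can be modelled in a few lines.  This is a
              sketch of the documented rules only, not the provider's
              implementation: select_hfi and parse_data are hypothetical
              names, the caller's NUMA node and CPU core are passed in
              explicitly, and the Active-state check is not modelled.

```python
def parse_data(text):
    """Parse selector-data: a single value or <range-start>-<range-end>."""
    if "-" in text:
        start, end = text.split("-", 1)
        return int(start), int(end)
    value = int(text)
    return value, value


def select_hfi(setting, numa_node, core):
    """Return the HFI unit chosen for a caller, or the string "default".

    Raises ValueError if a selector is malformed or no selector matches.
    """
    if setting.isdigit():                # first form: <hfi-unit>
        return int(setting)
    caller = {"numa": numa_node, "core": core}
    for selector in setting.split(","):  # evaluated left to right
        if selector == "default":        # matches every caller
            return "default"
        kind, unit, *data = selector.split(":")
        if kind == "fixed":              # matches every caller
            return int(unit)
        if kind not in caller or len(data) != 1:
            raise ValueError(f"bad selector: {selector!r}")
        low, high = parse_data(data[0])
        if low <= caller[kind] <= high:  # range is inclusive at both ends
            return int(unit)
    raise ValueError("caller does not match any selector")
```

              In this model, select_hfi("core:1:0,fixed:0", numa_node=0,
              core=7) returns 0, because the core selector does not match
              and the fixed selector that follows matches every caller.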

SEE ALSO

       fabric(7), fi_provider(7), fi_getinfo(3)

AUTHORS

       OpenFabrics.



Libfabric Programmer’s Manual     2022-03-30                         fi_opx(7)