fi_opx(7) Libfabric v1.18.1 fi_opx(7)

fi_opx - The Omni-Path Express Fabric Provider

The opx provider is a native libfabric provider suitable for use with
Omni-Path fabrics. OPX features great scalability and performance when
running libfabric-enabled message layers. OPX requires three additional
external development libraries to build: libuuid, libnuma, and the Linux
kernel headers.

The OPX provider supports most features defined for the libfabric API.

Key features include:

Endpoint types
       The Omni-Path HFI hardware is connectionless and reliable. The
       OPX provider only supports the FI_EP_RDM endpoint type.

Capabilities
       Supported capabilities include FI_MSG, FI_RMA, FI_TAGGED,
       FI_ATOMIC, FI_NAMED_RX_CTX, FI_SOURCE, FI_SEND, FI_RECV,
       FI_MULTI_RECV, and FI_DIRECTED_RECV.

       Notes on the FI_DIRECTED_RECV capability: the immediate data sent
       within the “senddata” call to support FI_DIRECTED_RECV for OPX
       must be exactly 4 bytes. OPX uses these 4 bytes to completely
       identify the source address, up to an exascale-level number of
       ranks, for tag matching on the receive side, and they can be
       managed within the MU packet. The domain attribute “cq_data_size”
       is therefore set to 4, which is the OFI standard minimum.

Modes
       Two modes are defined: FI_CONTEXT2 and FI_ASYNC_IOV. The OPX
       provider requires FI_CONTEXT2.

Additional features
       Supported additional features include FABRIC_DIRECT, scalable
       endpoints, and counters.

Progress
       FI_PROGRESS_MANUAL and FI_PROGRESS_AUTO are both supported. For
       best performance, use FI_PROGRESS_MANUAL when possible.
       FI_PROGRESS_AUTO spawns one thread per CQ.

Address vector
       FI_AV_MAP and FI_AV_TABLE are both supported. FI_AV_MAP is the
       default.

Memory registration modes
       Only FI_MR_SCALABLE is supported.

Endpoint types
       Unsupported endpoint types include FI_EP_DGRAM and FI_EP_MSG.

Capabilities
       The OPX provider does not support the FI_RMA_EVENT and FI_TRIGGER
       capabilities.

OPX supports the following MPI versions:

- Intel MPI from Parallel Studio 2020, update 4.
- Intel MPI from OneAPI 2021, update 3.
- Open MPI 4.1.2a1 (older versions of Open MPI will not work).
- MPICH 3.4.2 and later.

Usage:

If using OPX with Open MPI 4.1.x, disable the UCX and openib transports.
OPX is not compatible with the Open MPI 4.1.x PML/BTL.

OPX_AV
       OPX supports the option of setting the AV mode to use in a build.
       Three settings are supported:

       - table
       - map
       - runtime

       Using table or map restricts OPX to FI_AV_TABLE or FI_AV_MAP,
       respectively. Using runtime allows OPX to use either AV mode,
       depending on what the application requests. Specifying map or
       table, however, may lead to a slight performance improvement,
       depending on the application.

       To change OPX_AV, add OPX_AV=table, OPX_AV=map, or OPX_AV=runtime
       to the configure command. For example, to create a new build with
       OPX_AV=table:

              OPX_AV=table ./configure
              make install

       There is no way to change OPX_AV after it is set. If OPX_AV is
       not set in the configure command, the default value is runtime.

FI_OPX_UUID
       OPX requires a unique ID for each job. For all processes in a
       job to communicate with each other, they must use the same UUID.
       This variable can be set with FI_OPX_UUID=${RANDOM}. The default
       UUID is 00112233445566778899aabbccddeeff.
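One way to give every process in a job the same UUID is to generate the
value once in the launcher and pass it on through the environment. The
following is a minimal sketch only: the helper name make_job_env is
hypothetical, and the choice of a 32-hex-digit value (the same shape as
the default UUID) is an assumption rather than a documented requirement.

```python
import os
import uuid


def make_job_env(base_env=None):
    """Return a copy of the environment with one FI_OPX_UUID for the job.

    Hypothetical helper: generate the value once, then hand the same
    environment to every process that belongs to the job.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: a 32-hex-digit string, shaped like the documented default.
    env["FI_OPX_UUID"] = uuid.uuid4().hex
    return env


job_env = make_job_env({"PATH": "/usr/bin"})
print(job_env["FI_OPX_UUID"])
```

The returned mapping can be passed as the env argument of
subprocess.run, or its value exported by the job launcher, so that all
ranks of the job see the same FI_OPX_UUID.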

FI_OPX_RELIABILITY_SERVICE_USEC_MAX
       This setting controls how frequently the reliability/replay
       function will issue PING requests to a remote connection.
       Reducing this value may improve performance at the expense of
       increased traffic on the OPX fabric. Default setting is 500.

FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE
       This setting controls how frequently a receiving rank will send
       ACKs for packets it has received without being prompted through
       a PING request. A non-zero value N tells the receiving rank to
       send an ACK for the last N packets every Nth packet. Using this
       setting in conjunction with an increased value for
       FI_OPX_RELIABILITY_SERVICE_USEC_MAX may improve performance.

       Valid values are 0 (disabled) and powers of 2 in the range of
       1-32,768, inclusive.

       Default setting is 64.
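The rule for valid FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE values can be
checked before a job is launched. The sketch below only restates that
rule in code; the function name is hypothetical and is not part of OPX.

```python
def is_valid_pre_ack_rate(value):
    """Return True if value is 0 or a power of 2 in the range 1..32768."""
    if value == 0:
        return True  # 0 disables unprompted ACKs
    in_range = 1 <= value <= 32768
    # A power of 2 has exactly one bit set, so value & (value - 1) is 0.
    return in_range and (value & (value - 1)) == 0


for candidate in (0, 1, 64, 96, 32768, 65536):
    print(candidate, is_valid_pre_ack_rate(candidate))
```

Here 96 is rejected because it is not a power of 2, and 65536 is
rejected because it lies outside the allowed range.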

FI_OPX_PROG_AFFINITY
       This sets the affinity to be used for any progress threads. Set
       as a colon-separated triplet as start:end:stride, where stride
       controls the interval between selected cores. For example, 1:5:2
       will have cores 1, 3, and 5 as valid cores for progress threads.
       Default is 1:4:1.
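The start:end:stride triplet can be expanded into the list of cores it
selects. The following sketch assumes that end is inclusive, as the
1:5:2 example implies; the function name and the error handling are
hypothetical and do not describe how OPX itself parses the value.

```python
def affinity_cores(triplet):
    """Expand a start:end:stride string into the list of selected cores."""
    start, end, stride = (int(field) for field in triplet.split(":"))
    if stride < 1 or end < start:
        # Assumption: such triplets are treated as invalid in this sketch.
        raise ValueError(f"invalid affinity triplet: {triplet!r}")
    # The end core is inclusive, matching the 1:5:2 -> 1, 3, 5 example.
    return list(range(start, end + 1, stride))


print(affinity_cores("1:5:2"))  # [1, 3, 5]
print(affinity_cores("1:4:1"))  # [1, 2, 3, 4]
```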

FI_OPX_AUTO_PROGRESS_INTERVAL_USEC
       This setting controls the time (in microseconds) between polls
       for auto progress threads. Default is 1.

FI_OPX_HFI_SELECT
       Controls how OPX chooses which HFI to use when opening a context.
       Has two forms:

       - <hfi-unit>
         Force the OPX provider to use hfi-unit.
       - <selector1>[,<selector2>[,...,<selectorN>]]
         Select the HFI based on the first matching selector.

       Where selector is one of the following forms:

       - default
         Use the default logic.
       - fixed:<hfi-unit>
         Fix to one hfi-unit.
       - <selector-type>:<hfi-unit>:<selector-data>

       The above fields have the following meaning:

       - selector-type
         The selector criteria the caller opening the context is
         evaluated against.
       - hfi-unit
         The HFI to use if the caller matches the selector.
       - selector-data
         Data the caller must match (e.g. NUMA node ID).

       Where selector-type is one of the following:

       - numa
         True when the caller is local to the NUMA node ID given by
         selector-data.
       - core
         True when the caller is local to the CPU core given by
         selector-data.

       And selector-data is one of the following:

       - value
         The specific value to match.
       - <range-start>-<range-end>
         Matches any value in that range.

       In the second form, when opening a context, OPX uses the hfi-unit
       of the first matching selector. Selectors are evaluated left to
       right. OPX will return an error if the caller does not match any
       selector.

       In either form, it is an error if the specified or selected HFI
       is not in the Active state. In this case, OPX will return an
       error and execution will not continue.

       With this option, it is possible to cause OPX to try to open more
       contexts on an HFI than there are free contexts on that HFI. In
       this case, one or more of the context-opening calls will fail and
       OPX will return an error. For the second form, because the
       selected HFI depends on properties of the caller, deterministic
       HFI selection requires deterministic caller properties. For
       example, with the numa selector, if the caller can migrate
       between NUMA nodes, then HFI selection will not be deterministic.

       The first matching selector in a list always determines the
       logic used. For example, default and fixed match all callers, so
       if either appears at the beginning of a selector list, only that
       selector is used, regardless of any selectors that follow.

       Examples:

       - FI_OPX_HFI_SELECT=0
         All callers will open contexts on HFI 0.
       - FI_OPX_HFI_SELECT=1
         All callers will open contexts on HFI 1.
       - FI_OPX_HFI_SELECT=numa:0:0,numa:1:1,numa:0:2,numa:1:3
         Callers local to NUMA nodes 0 and 2 will use HFI 0; callers
         local to NUMA nodes 1 and 3 will use HFI 1.
       - FI_OPX_HFI_SELECT=numa:0:0-3,default
         Callers local to NUMA nodes 0 through 3 (inclusive) will use
         HFI 0, and all others will use the default selection logic.
       - FI_OPX_HFI_SELECT=core:1:0,fixed:0
         Callers local to CPU core 0 will use HFI 1, and all others
         will use HFI 0.
       - FI_OPX_HFI_SELECT=default,core:1:0
         All callers will use the default HFI selection logic.
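The left-to-right, first-match evaluation described above can be
modelled in a few lines. This is an illustrative sketch, not the OPX
implementation: the names select_hfi, _matches, and DEFAULT are
hypothetical, "local to" is simplified to equality with the caller's
NUMA node or CPU core, and the check that the HFI is in the Active
state is left out.

```python
DEFAULT = "default"  # marker meaning "fall back to the default logic"


def _matches(data, value):
    """Return True if value equals data or lies in a 'start-end' range."""
    if "-" in data:
        low, high = (int(part) for part in data.split("-"))
        return low <= value <= high  # ranges are inclusive
    return int(data) == value


def select_hfi(setting, numa_node, core):
    """Return the HFI unit (int) or DEFAULT chosen for one caller."""
    if setting.isdigit():
        return int(setting)  # first form: a bare <hfi-unit>
    caller = {"numa": numa_node, "core": core}
    for selector in setting.split(","):  # evaluated left to right
        fields = selector.split(":")
        if fields == ["default"]:
            return DEFAULT  # matches every caller
        if fields[0] == "fixed" and len(fields) == 2:
            return int(fields[1])  # matches every caller
        if len(fields) == 3 and fields[0] in caller:
            selector_type, hfi_unit, data = fields
            if _matches(data, caller[selector_type]):
                return int(hfi_unit)
        else:
            raise ValueError(f"unrecognized selector: {selector!r}")
    raise LookupError("caller does not match any selector")


print(select_hfi("numa:0:0-3,default", numa_node=2, core=7))  # 0
print(select_hfi("core:1:0,fixed:0", numa_node=0, core=4))    # 0
```

Because default and fixed return immediately, any selectors listed
after them are never evaluated, which mirrors the last example above.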

fabric(7), fi_provider(7), fi_getinfo(3)

OpenFabrics.

Libfabric Programmer’s Manual 2023-02-15 fi_opx(7)