fi_opx(7) Libfabric v1.17.0 fi_opx(7)

NAME
fi_opx - The Omni-Path Express Fabric Provider

OVERVIEW
The OPX provider is a native implementation of the libfabric interfaces
that makes direct use of Omni-Path fabrics as well as libfabric
acceleration features. The purpose of this provider is to show the
scalability and performance of libfabric, to provide an “extreme scale”
development environment for applications and middleware using the
libfabric API, and to support a functional and performant version of MPI
on Omni-Path fabrics.

SUPPORTED FEATURES
The OPX provider supports most features defined for the libfabric API.

Key features include:

Endpoint types
    The Omni-Path HFI hardware is connectionless and reliable. The OPX
    provider supports only the FI_EP_RDM endpoint type.

Capabilities
    Supported capabilities include FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC,
    FI_NAMED_RX_CTX, FI_SOURCE, FI_SEND, FI_RECV, FI_MULTI_RECV, and
    FI_DIRECTED_RECV.

Notes on the FI_DIRECTED_RECV capability: the immediate data sent with the
“senddata” call to support FI_DIRECTED_RECV on OPX must be exactly 4
bytes. OPX uses these 4 bytes to completely identify the source address,
up to an exascale-level number of ranks, for tag matching on the receive
side, and the data can be managed within the MU packet. The domain
attribute “cq_data_size” is therefore set to 4, which is the OFI standard
minimum.

Modes
    Two modes are defined: FI_CONTEXT2 and FI_ASYNC_IOV. The OPX provider
    requires FI_CONTEXT2.

Additional features
    Supported additional features include FABRIC_DIRECT, scalable
    endpoints, and counters.

Progress
    Only FI_PROGRESS_MANUAL is supported.

Address vector
    Only the FI_AV_MAP address vector format is supported.

Memory registration modes
    Only FI_MR_SCALABLE is supported.

UNSUPPORTED FEATURES
Endpoint types
    Unsupported endpoint types include FI_EP_DGRAM and FI_EP_MSG.

Capabilities
    The OPX provider does not support the FI_RMA_EVENT and FI_TRIGGER
    capabilities.

Address vector
    The OPX provider does not support the FI_AV_TABLE address vector
    format. Support may be added in the future.

LIMITATIONS
Because OPX is under development, this list of limitations is subject to
change.

OPX runs under the following MPI versions:

- Intel MPI from Parallel Studio 2020, update 4.
- Intel MPI from OneAPI 2021, update 3.
- Open MPI 4.1.2a1 (older versions of Open MPI will not work).
- MPICH 3.4.2.

Currently, this provider is PIO-only. SDMA is not supported at this time.

Usage:

- If using Open MPI 4.1.x, disable the UCX and openib transports.
- OPX is not compatible with the Open MPI 4.1.x PML/BTL.
- DMA, RDMA and SDMA are not implemented.
- Performance falls off when using message sizes larger than 1 MTU (4K
  max size).
- Shared memory is not cleaned up after an application crashes. Use
  "rm -rf /dev/shm/*" to remove old shared-memory files; note that this
  removes every file in /dev/shm, not only those left by OPX.

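One possible way to express "disable UCX and openib" on an Open MPI 4.1.x command line is sketched below. It relies on Open MPI's "^component" exclusion syntax for --mca and on libfabric's FI_PROVIDER variable. The application path and rank count are hypothetical placeholders, and the exact MCA settings a given installation needs may differ, so treat this as a starting point rather than a verified recipe.

```shell
APP=${APP:-./my_app}   # hypothetical application binary
NP=${NP:-2}            # hypothetical rank count

# "^name" excludes a component from Open MPI's selection for that
# framework; -x exports an environment variable to every rank.
launch_cmd="mpirun -np ${NP} --mca pml ^ucx --mca btl ^openib -x FI_PROVIDER=opx ${APP}"

# Dry run: print the command instead of executing it.
echo "${launch_cmd}"
```

Printing the command first makes it easy to review the transport selection before launching a job.
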
RUNTIME PARAMETERS
FI_OPX_UUID
    OPX requires a unique ID for each job. For all processes in a job to
    communicate with each other, they must use the same UUID. This
    variable can be set with FI_OPX_UUID=${RANDOM}. The default UUID is
    00112233445566778899aabbccddeeff.

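Because every process in a job must see the same value, the UUID is best chosen once in the launching shell and then exported. This is a minimal sketch under that assumption. It keeps a value that is already set (for example by a job script) and otherwise falls back to ${RANDOM}, as suggested above, or to the shell's process ID in shells that do not provide ${RANDOM}.

```shell
# Choose the value once so that every rank inherits the same UUID.
# An existing FI_OPX_UUID is kept; ${RANDOM} is a bash feature, so the
# process ID ($$) is the fallback in shells that lack it.
: "${FI_OPX_UUID:=${RANDOM:-$$}}"
export FI_OPX_UUID
echo "FI_OPX_UUID=${FI_OPX_UUID}"
```

How the variable reaches remote ranks depends on the launcher. With Open MPI, for instance, "mpirun -x FI_OPX_UUID" forwards it.
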
FI_OPX_RELIABILITY_SERVICE_USEC_MAX
    This setting controls how frequently the reliability/replay function
    issues PING requests to a remote connection. Reducing this value may
    improve performance at the expense of increased traffic on the OPX
    fabric. The default setting is 500.

FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE
    This setting controls how frequently a receiving rank sends ACKs for
    packets it has received without being prompted by a PING request. A
    non-zero value N tells the receiving rank to send an ACK for the last
    N packets every Nth packet. Using this setting together with an
    increased value for FI_OPX_RELIABILITY_SERVICE_USEC_MAX may improve
    performance.

    Valid values are 0 (disabled) and powers of 2 in the range 1-32,768,
    inclusive.

    The default setting is 64.

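The valid-value rule for FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE (0, or a power of 2 between 1 and 32,768) can be checked in a job script before the variable is exported. The helper below is a hypothetical sketch written for this page and is not part of OPX; how OPX itself reacts to an invalid value is not described here.

```shell
# is_valid_pre_ack_rate N: succeed if N is 0 or a power of two in
# 1..32768, fail otherwise. (Hypothetical helper, not part of OPX.)
is_valid_pre_ack_rate() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric
        0?*)         return 1 ;;   # leading zeros such as "064"
        ??????*)     return 1 ;;   # six or more digits exceeds 32768
    esac
    n=$1
    [ "$n" -eq 0 ] && return 0     # 0 means "disabled"
    [ "$n" -gt 32768 ] && return 1
    # Halve while even; a power of two ends at exactly 1.
    while [ $((n % 2)) -eq 0 ]; do n=$((n / 2)); done
    [ "$n" -eq 1 ]
}

rate=64
if is_valid_pre_ack_rate "$rate"; then
    export FI_OPX_RELIABILITY_SERVICE_PRE_ACK_RATE="$rate"
fi
```
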
FI_OPX_HFI_SELECT
    Controls how OPX chooses which HFI to use when opening a context. It
    has two forms:

    - <hfi-unit>  Force the OPX provider to use hfi-unit.
    - <selector1>[,<selector2>[,...,<selectorN>]]  Select the HFI based
      on the first matching selector.

    Each selector takes one of the following forms:

    - default  to use the default logic
    - fixed:<hfi-unit>  to fix to one hfi-unit
    - <selector-type>:<hfi-unit>:<selector-data>

    The above fields have the following meanings:

    - selector-type  The selector criteria the caller opening the context
      is evaluated against.
    - hfi-unit  The HFI to use if the caller matches the selector.
    - selector-data  Data the caller must match (e.g. NUMA node ID).

    selector-type is one of the following:

    - numa  True when the caller is local to the NUMA node ID given by
      selector-data.
    - core  True when the caller is local to the CPU core given by
      selector-data.

    selector-data is one of the following:

    - value  The specific value to match.
    - <range-start>-<range-end>  Matches any value in that range.

    In the second form, when opening a context, OPX uses the hfi-unit of
    the first matching selector. Selectors are evaluated left to right.
    OPX returns an error if the caller does not match any selector.

    In either form, it is an error if the specified or selected HFI is
    not in the Active state. In this case, OPX returns an error and
    execution does not continue.

    With this option, it is possible to cause OPX to try to open more
    contexts on an HFI than there are free contexts on that HFI. In this
    case, one or more of the context-opening calls will fail and OPX will
    return an error. In the second form, because the selected HFI depends
    on properties of the caller, deterministic HFI selection requires
    deterministic caller properties. For example, with the numa selector,
    if the caller can migrate between NUMA nodes, HFI selection will not
    be deterministic.

    OPX always uses the first selector in the list that matches. Because
    default and fixed match all callers, any selectors that follow either
    of them in a list are never evaluated.

    Examples:

    - FI_OPX_HFI_SELECT=0  All callers will open contexts on HFI 0.
    - FI_OPX_HFI_SELECT=numa:0:0,numa:1:1,numa:0:2,numa:1:3  Callers local
      to NUMA nodes 0 and 2 will use HFI 0; callers local to NUMA nodes 1
      and 3 will use HFI 1.
    - FI_OPX_HFI_SELECT=numa:0:0-3,default  Callers local to NUMA nodes 0
      through 3 (inclusive) will use HFI 0; all others will use the
      default selection logic.
    - FI_OPX_HFI_SELECT=core:1:0,fixed:0  Callers local to CPU core 0
      will use HFI 1; all others will use HFI 0.
    - FI_OPX_HFI_SELECT=default,core:1:0  All callers will use the
      default HFI selection logic.

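The left-to-right, first-match rule described above can be modelled in a few lines of shell, which may help when checking a selector list before a run. This is an illustrative sketch of the documented matching rule only, not the OPX implementation. The function names pick_hfi and in_range are hypothetical, and the caller's NUMA node and CPU core are passed in as plain arguments rather than detected.

```shell
# in_range VALUE SPEC: true when VALUE equals SPEC or lies inside the
# inclusive range "start-end".
in_range() {
    case "$2" in
        *-*) [ "$1" -ge "${2%-*}" ] && [ "$1" -le "${2#*-}" ] ;;
        *)   [ "$1" -eq "$2" ] ;;
    esac
}

# pick_hfi SELECT NUMA CORE: print the HFI unit (or "default") chosen for
# a caller on NUMA node NUMA and CPU core CORE. Returns 1 when no
# selector matches and 2 when a selector is malformed.
pick_hfi() {
    list=$1 numa=$2 core=$3
    case "$list" in
        ''|*[!0-9]*) ;;                  # not a bare number: second form
        *) echo "$list"; return 0 ;;     # first form: <hfi-unit>
    esac
    while [ -n "$list" ]; do
        sel=${list%%,*}                  # leftmost selector
        case "$list" in *,*) list=${list#*,} ;; *) list= ;; esac
        case "$sel" in
            default)  echo default; return 0 ;;
            fixed:*)  echo "${sel#fixed:}"; return 0 ;;
            numa:*:*) val=$numa ;;
            core:*:*) val=$core ;;
            *)        return 2 ;;
        esac
        rest=${sel#*:}                   # "<hfi-unit>:<selector-data>"
        if in_range "$val" "${rest#*:}"; then
            echo "${rest%%:*}"; return 0
        fi
    done
    return 1
}

pick_hfi "numa:0:0,numa:1:1,numa:0:2,numa:1:3" 2 0   # prints 0
pick_hfi "numa:0:0-3,default" 5 0                    # prints default
pick_hfi "core:1:0,fixed:0" 0 7                      # prints 0
```

A return status of 1 corresponds to the case where the caller matches no selector, which OPX reports as an error.
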
SEE ALSO
fabric(7), fi_provider(7), fi_getinfo(3)

AUTHORS
OpenFabrics.

Libfabric Programmer’s Manual 2022-12-11 fi_opx(7)