lamssi_rpi(7)             LAM SSI RPI OVERVIEW             lamssi_rpi(7)

LAM SSI RPI - overview of LAM's RPI SSI modules

The "kind" for RPI SSI modules is "rpi". Specifically, the string
"rpi" (without the quotes) should be used to specify which RPI should
be used on the mpirun command line with the -ssi switch. For example:

    mpirun -ssi rpi tcp C my_mpi_program
        Specifies to use the tcp RPI (and to launch a single copy of
        the executable "my_mpi_program" on each node).

The "rpi" string is also used as a prefix to send parameters to
specific RPI modules. For example:

    mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program
        Specifies to use the tcp RPI, and to pass in the value of
        131072 (128K) as the short message length for TCP messages.
        See each RPI section below for a full description of the
        parameters that are accepted by each RPI.

LAM currently supports the following RPI SSI modules, each described
below: crtcp, gm, lamd, tcp, sysv, usysv.

Only one RPI module may be selected per command execution. The
selection of which module to use occurs during MPI_INIT, and that
module is used for the duration of the MPI process. It is erroneous to
select different RPI modules for different processes.

The kind for selecting an RPI is "rpi". For example:

    mpirun -ssi rpi tcp C my_mpi_program
        Selects the tcp RPI and runs a single copy of the
        my_mpi_program executable on each node.

As with all SSI modules, it is possible to pass parameters at run time.
This section discusses the built-in LAM RPI modules, as well as the
run-time parameters that they accept.

In the discussion below, the parameters are described in terms of kind
and name. The kind and name may be specified as command line arguments
to the mpirun command with the -ssi switch, or they may be set in
environment variables of the form LAM_MPI_SSI_name=value. Note that
using the -ssi command line switch will take precedence over any
environment variables.
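As an illustration of the two forms described above, the following
shell sketch selects the tcp RPI and sets its rpi_tcp_short parameter
through environment variables instead of -ssi switches. The variable
names simply follow the LAM_MPI_SSI_name=value pattern; my_mpi_program
is the placeholder program name used earlier, and the mpirun lines are
left commented out so that the fragment can be read (and sourced)
without a running LAM universe.

```shell
# Select the tcp RPI and set its short-message length through the
# environment (pattern: LAM_MPI_SSI_name=value).
export LAM_MPI_SSI_rpi=tcp
export LAM_MPI_SSI_rpi_tcp_short=131072

# With the variables above set, this should behave like
#   mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program
# mpirun C my_mpi_program

# A -ssi switch takes precedence over the environment, so this would
# select the lamd RPI despite LAM_MPI_SSI_rpi=tcp:
# mpirun -ssi rpi lamd C my_mpi_program
```

Environment variables can be convenient for settings shared by many
runs; the -ssi switch is then available for one-off overrides.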

If the RPI that is selected is unable to run (e.g., attempting to use
the gm RPI when gm support was not compiled into LAM, or if no gm
hardware is available on the nodes), an appropriate error message will
be printed and execution will abort.

crtcp RPI
The crtcp RPI is a checkpoint/restart-able version of the tcp RPI (see
below). It is separate from the tcp RPI because the current
implementation imposes a slight performance penalty to enable the
ability to checkpoint and restart MPI jobs. Its tunable parameters are
the same as those of the tcp RPI. This RPI probably only needs to be
used when the ability to checkpoint and restart MPI jobs is required.

See the LAM/MPI User's Guide for more details on the crtcp RPI as well
as the checkpoint/restart capabilities of LAM/MPI. The lamssi_cr(7)
manual page also contains additional information.

gm RPI
The gm RPI is used with native Myrinet networks. Please note that the
gm RPI exists, but has not yet been optimized. It gives significantly
better performance than TCP over Myrinet networks, but has not yet been
properly tuned and instrumented in LAM.

That being said, there are several tunable parameters in the gm RPI:

rpi_gm_maxport N
    If rpi_gm_port is not specified, LAM will attempt to find an open
    GM port to use for MPI communications, starting with port 1 and
    ending with the N value specified by the rpi_gm_maxport parameter.
    If rpi_gm_maxport is also unspecified, LAM will try all existing
    GM ports.

rpi_gm_port N
    LAM will attempt to use gm port N for MPI communications.

rpi_gm_tinymsglen N
    Specifies the maximum message size (in bytes) for "tiny" messages
    (i.e., messages that are sent entirely in one gm message). Tiny
    messages are memcpy'ed into the header before it is sent to the
    destination, and memcpy'ed out of the header into the destination
    buffer on the receiver. Hence, it is not advisable to make this
    value too large.

rpi_gm_fast 1
    Specifies to use the "fast" protocol for sending short gm
    messages. This protocol is unreliable in the presence of GM errors
    or timeouts, so this parameter is not advised for MPI applications
    that do not make continual progress within MPI.

rpi_gm_cr 1
    Enable checkpoint/restart behavior for gm. This can only be
    enabled if the gm rpi module was compiled with support for the
    gm_get() function, which is disabled by default. See the LAM
    Installation and User's Guides for more information on this
    parameter before you use it.
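To show how these parameters fit on one command line, here is a small
shell sketch. The helper name gm_mpirun_cmd is hypothetical, and the
values 4 and 1024 are arbitrary illustrations rather than recommended
settings. The helper only prints the command instead of running it, so
it can be tried on a machine without gm hardware.

```shell
# Hypothetical helper: print (not run) an mpirun command line that
# selects the gm RPI with a port search limit and a tiny-message size.
gm_mpirun_cmd() {
    maxport=$1    # upper bound for the GM port search (rpi_gm_maxport)
    tinylen=$2    # maximum "tiny" message size in bytes (rpi_gm_tinymsglen)
    echo "mpirun -ssi rpi gm -ssi rpi_gm_maxport $maxport" \
         "-ssi rpi_gm_tinymsglen $tinylen C my_mpi_program"
}

# Search GM ports 1 through 4; treat messages up to 1024 bytes as tiny.
gm_mpirun_cmd 4 1024
```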

lamd RPI
The lamd RPI uses LAM's "out-of-band" communication mechanism for
passing MPI messages. Specifically, MPI messages are sent from the user
process to the local LAM daemon, then to the remote LAM daemon (if the
destination process is on a different node), and then to the
destination process.

While this adds latency to message passing because of the extra hops
that each message must travel, it allows for true asynchronous message
passing. Since the LAM daemon is running in its own execution space,
it can make progress on message passing regardless of the state /
status of the user's program. This can be an overall net savings in
performance and execution time for some classes of MPI programs.

It is expected that this RPI will someday become obsolete when LAM
becomes multi-threaded and allows progress to be made on message
passing in separate threads rather than in separate processes.

The lamd RPI has no tunable parameters.

tcp RPI
The tcp RPI uses pure TCP for all MPI message passing. TCP sockets are
opened between MPI processes and are used for all MPI traffic.

The tcp RPI has one tunable parameter:

rpi_tcp_short <bytes>
    Tells the tcp RPI the smallest size (in bytes) for a message to be
    considered "long". Short messages are sent eagerly (even if the
    receiving side is not expecting them). Long messages use a
    rendezvous protocol (i.e., a three-way handshake) such that the
    message is not actually sent until the receiver is expecting it.
    This value defaults to 64k.
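The threshold rule above can be sketched in shell as follows. The
helper name tcp_protocol is hypothetical, and the sketch assumes that
the "64k" default means 65536 bytes. It only models the size
comparison described for rpi_tcp_short; it does not reflect how LAM
implements the protocols internally.

```shell
# Hypothetical helper: report which protocol a message of a given size
# would use, given the rpi_tcp_short threshold (both in bytes).
tcp_protocol() {
    size=$1
    short=${2:-65536}   # assumption: the "64k" default is 65536 bytes
    if [ "$size" -lt "$short" ]; then
        echo eager        # short message: sent right away
    else
        echo rendezvous   # long message: sent after the handshake
    fi
}

tcp_protocol 1024            # prints "eager"
tcp_protocol 65536           # prints "rendezvous"
tcp_protocol 100000 131072   # prints "eager"
```

Because rpi_tcp_short is the smallest size counted as "long", a
message of exactly that size already uses the rendezvous protocol.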

sysv RPI
The sysv RPI uses shared memory for communication between MPI processes
on the same node, and TCP sockets for communication between MPI
processes on different nodes. System V semaphores are used to lock the
shared memory pools. This RPI is best used when running multiple MPI
processes on uniprocessors (or oversubscribed SMPs) because of the
blocking / yielding nature of semaphores.

The sysv RPI has the following tunable parameters:

rpi_tcp_short <bytes>
    Since the sysv RPI uses parts of the tcp RPI for off-node
    communication, this parameter also has relevance to the sysv RPI.
    The meaning of this parameter is discussed in the tcp RPI section.

rpi_sysv_short <bytes>
    Tells the sysv RPI the smallest size (in bytes) for a message to be
    considered "long". Short shared memory messages are sent using a
    small "postbox" protocol; long messages use a more general shared
    memory pool method. This value defaults to 8k.

rpi_sysv_pollyield <bool>
    If set to a nonzero number, force the use of a system call to yield
    the processor. The system call will be yield(), sched_yield(), or
    select() (with a 1ms timeout), depending on what LAM's configure
    script finds at configuration time. This value defaults to 1.

rpi_sysv_shmpoolsize <bytes>
    The size of the shared memory pool that is used for long message
    transfers. It is allocated once on each node for each MPI parallel
    job. Specifically, if multiple MPI processes from the same
    parallel job are spawned on a single node, this pool will only be
    allocated once.

    The configure script will try to determine a default size for the
    pool if none is explicitly specified (you should always check this
    to see if it is reasonable). Larger values should improve
    performance especially when an application passes large messages,
    but will also increase the system resources used by each task.

rpi_sysv_shmmaxalloc <bytes>
    To prevent a single large message transfer from monopolizing the
    global pool, allocations from the pool are actually restricted to a
    maximum of rpi_sysv_shmmaxalloc bytes each. Even with this
    restriction, it is possible for the global pool to temporarily
    become exhausted. In this case, the transport will fall back to
    using the postbox area to transfer the message. Performance will be
    degraded, but the application will progress.

    The configure script will try to determine a default size for the
    maximum atomic transfer size if none is explicitly specified (you
    should always check this to see if it is reasonable). Larger
    values should improve performance especially when an application
    passes large messages, but will also increase the system resources
    used by each task.
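As a rough way to reason about rpi_sysv_shmmaxalloc, the sketch below
estimates how many pool allocations one long message would need. It
assumes that a long message is moved through the pool in pieces of at
most rpi_sysv_shmmaxalloc bytes each; that is an inference from the
description above, not a statement about LAM's internals. The helper
name pool_chunks and the sizes used are hypothetical.

```shell
# Hypothetical helper: number of pool allocations needed for one long
# message, assuming it is moved in pieces of at most maxalloc bytes.
pool_chunks() {
    size=$1       # message size in bytes
    maxalloc=$2   # rpi_sysv_shmmaxalloc in bytes
    echo $(( (size + maxalloc - 1) / maxalloc ))   # ceiling division
}

pool_chunks 1048576 65536   # prints 16
pool_chunks 70000 65536     # prints 2
```

Under that assumption, a larger rpi_sysv_shmmaxalloc means fewer
pieces per message, at the cost of each transfer holding more of the
shared pool at once.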

usysv RPI
The usysv RPI uses shared memory for communication between MPI
processes on the same node, and TCP sockets for communication between
MPI processes on different nodes. Spin locks are used to lock the
shared memory pools. This RPI is best used when the number of MPI
processes on a single node is less than or equal to the number of
processors because it allows LAM to fully occupy the processor while
waiting for a message and never be swapped out.

The usysv RPI has many of the same tunable parameters as the sysv RPI:

rpi_tcp_short <bytes>
    Same meaning as in the sysv RPI.

rpi_usysv_short <bytes>
    Same meaning as rpi_sysv_short in the sysv RPI.

rpi_usysv_pollyield <bool>
    Same meaning as rpi_sysv_pollyield in the sysv RPI.

rpi_usysv_shmpoolsize <bytes>
    Same meaning as rpi_sysv_shmpoolsize in the sysv RPI.

rpi_usysv_shmmaxalloc <bytes>
    Same meaning as rpi_sysv_shmmaxalloc in the sysv RPI.

rpi_usysv_readlockpoll <iterations>
    Number of iterations to spin before yielding the processor while
    waiting to read. This value defaults to 10,000.

rpi_usysv_writelockpoll <iterations>
    Number of iterations to spin before yielding the processor while
    waiting to write. This value defaults to 10.
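The guidance in the sysv and usysv sections on when each module fits
best can be condensed into a small shell sketch. The helper name
pick_shm_rpi is hypothetical, and the rule is only the rule of thumb
stated in those sections: usysv when the processes on a node do not
outnumber its processors, sysv otherwise.

```shell
# Hypothetical helper: suggest a shared-memory RPI from the number of
# MPI processes per node and the number of processors per node.
pick_shm_rpi() {
    procs=$1
    cpus=$2
    if [ "$procs" -le "$cpus" ]; then
        echo usysv   # spin locks: each process can keep its processor
    else
        echo sysv    # semaphores: block and yield when oversubscribed
    fi
}

pick_shm_rpi 2 4   # prints "usysv"
pick_shm_rpi 4 1   # prints "sysv"

# Print (not run) a matching mpirun command line.
echo "mpirun -ssi rpi $(pick_shm_rpi 2 4) C my_mpi_program"
```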

lamssi(7), lamssi_cr(7), mpirun(1), LAM User's Guide

LAM 7.1.2                     March, 2006                  lamssi_rpi(7)