lamssi_rpi(7)

lamssi_rpi(7)                LAM SSI RPI OVERVIEW                lamssi_rpi(7)

NAME

       LAM SSI RPI - overview of LAM's RPI SSI modules

DESCRIPTION

       The "kind" for RPI SSI modules is "rpi".  Specifically, the string
       "rpi" (without the quotes) should be used to specify which RPI should
       be used on the mpirun command line with the -ssi switch.  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Specifies to use the tcp RPI (and to launch a single copy of the
           executable "my_mpi_program" on each node).

       The "rpi" string is also used as a prefix to send parameters to
       specific RPI modules.  For example:

       mpirun -ssi rpi tcp -ssi rpi_tcp_short 131072 C my_mpi_program
           Specifies to use the tcp RPI, and to pass in the value of 131072
           (128K) as the short message length for TCP messages.  See each RPI
           section below for a full description of the parameters that each
           RPI accepts.

       LAM currently supports six different RPI SSI modules: crtcp, gm,
       lamd, tcp, sysv, and usysv.

SELECTING AN RPI MODULE

       Only one RPI module may be selected per command execution.  The
       module is selected during MPI_INIT and is used for the duration of
       the MPI process.  It is erroneous to select different RPI modules
       for different processes.

       The kind for selecting an RPI is "rpi".  For example:

       mpirun -ssi rpi tcp C my_mpi_program
           Selects the tcp RPI and runs a single copy of the my_mpi_program
           executable on each node.

AVAILABLE MODULES

       As with all SSI modules, it is possible to pass parameters at run time.
       This section discusses the built-in LAM RPI modules, as well as the
       run-time parameters that they accept.

       In the discussion below, the parameters are discussed in terms of kind
       and name.  The kind and name may be specified as command line arguments
       to the mpirun command with the -ssi switch, or they may be set in
       environment variables of the form LAM_MPI_SSI_name=value.  Note that
       the -ssi command line switch takes precedence over any environment
       variables.
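       The precedence rule above can be modelled in a few lines of Python.
       This is only an illustrative sketch, not LAM's own parsing code: the
       function name effective_ssi_params is hypothetical, and it assumes
       that the name part of LAM_MPI_SSI_name is the same key that would
       follow the -ssi switch (for example "rpi" or "rpi_tcp_short").

```python
def effective_ssi_params(cli_args, environ):
    """Return SSI parameters, letting -ssi switches override the environment.

    Illustrative model only; this is not LAM's actual parsing code.
    """
    prefix = "LAM_MPI_SSI_"
    # Start from environment variables of the form LAM_MPI_SSI_<name>=<value>.
    params = {key[len(prefix):]: value
              for key, value in environ.items() if key.startswith(prefix)}
    # Each "-ssi <name> <value>" triple on the command line wins.
    i = 0
    while i < len(cli_args):
        if cli_args[i] == "-ssi" and i + 2 < len(cli_args):
            params[cli_args[i + 1]] = cli_args[i + 2]
            i += 3
        else:
            i += 1
    return params


env = {"LAM_MPI_SSI_rpi": "sysv", "LAM_MPI_SSI_rpi_tcp_short": "131072"}
argv = ["-ssi", "rpi", "tcp", "C", "my_mpi_program"]
print(effective_ssi_params(argv, env))
```

       In this sketch the environment asks for sysv, but the -ssi switch
       selects tcp; the rpi_tcp_short value from the environment is kept
       because nothing on the command line overrides it.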

       If the RPI that is selected is unable to run (e.g., attempting to use
       the gm RPI when gm support was not compiled into LAM, or if no gm
       hardware is available on the nodes), an appropriate error message will
       be printed and execution will abort.

   crtcp RPI
       The crtcp RPI is a checkpoint/restart-able version of the tcp RPI (see
       below).  It is separate from the tcp RPI because the current
       implementation imposes a slight performance penalty to enable the
       ability to checkpoint and restart MPI jobs.  Its tunable parameters are
       the same as the tcp RPI.  This RPI probably only needs to be used when
       the ability to checkpoint and restart MPI jobs is required.

       See the LAM/MPI User's Guide for more details on the crtcp RPI as well
       as the checkpoint/restart capabilities of LAM/MPI.  The lamssi_cr(7)
       manual page also contains additional information.

   gm RPI
       The gm RPI is used with native Myrinet networks.  Please note that the
       gm RPI exists, but has not yet been optimized.  It gives significantly
       better performance than TCP over Myrinet networks, but has not yet been
       properly tuned and instrumented in LAM.

       That being said, there are several tunable parameters in the gm RPI:

       rpi_gm_maxport N
           If rpi_gm_port is not specified, LAM will attempt to find an open
           GM port to use for MPI communications starting with port 1 and
           ending with the N value specified by the rpi_gm_maxport parameter.
           If unspecified, LAM will try all existing GM ports.

       rpi_gm_port N
           LAM will attempt to use gm port N for MPI communications.

       rpi_gm_tinymsglen N
           Specifies the maximum message size (in bytes) for "tiny" messages
           (i.e., messages that are sent entirely in one gm message).  Tiny
           messages are memcpy'ed into the header before it is sent to the
           destination, and memcpy'ed out of the header into the destination
           buffer on the receiver.  Hence, it is not advisable to make this
           value too large.

       rpi_gm_fast 1
           Specifies to use the "fast" protocol for sending short gm messages.
           Unreliable in the presence of GM errors or timeouts; this parameter
           is not advised for MPI applications that do not make continual
           progress within MPI.

       rpi_gm_cr 1
           Enable checkpoint/restart behavior for gm.  This can only be
           enabled if the gm rpi module was compiled with support for the
           gm_get() function, which is disabled by default.  See the LAM
           Installation and User's Guides for more information on this
           parameter before you use it.
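       The interaction between rpi_gm_port and rpi_gm_maxport described above
       can be sketched as follows.  This is a model of the documented search
       order only, not the gm RPI's source.  The names pick_gm_port, is_open,
       and highest_port are hypothetical; highest_port stands in for "all
       existing GM ports", whose real count depends on the hardware.

```python
def pick_gm_port(is_open, port=None, maxport=None, highest_port=7):
    """Return the GM port to use, or None if no candidate is open.

    is_open      -- callable(port) -> bool; stands in for probing the hardware
    port         -- models rpi_gm_port
    maxport      -- models rpi_gm_maxport
    highest_port -- hypothetical stand-in for "all existing GM ports"
    """
    if port is not None:
        # rpi_gm_port was given: only that port is attempted.
        return port if is_open(port) else None
    # Otherwise scan upward from port 1 to the configured limit.
    last = maxport if maxport is not None else highest_port
    for candidate in range(1, last + 1):
        if is_open(candidate):
            return candidate
    return None


open_ports = {3, 5}
print(pick_gm_port(lambda p: p in open_ports))             # scan all ports
print(pick_gm_port(lambda p: p in open_ports, maxport=2))  # limited scan
```

       With ports 3 and 5 open, the unrestricted scan settles on the lowest
       open port, while a scan limited to ports 1 and 2 finds nothing.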

   lamd RPI
       The lamd RPI uses LAM's "out-of-band" communication mechanism for
       passing MPI messages.  Specifically, MPI messages are sent from the
       user process to the local LAM daemon, then to the remote LAM daemon (if
       the destination process is on a different node), and then to the
       destination process.

       While this adds latency to message passing because of the extra hops
       that each message must travel, it allows for true asynchronous message
       passing.  Since the LAM daemon is running in its own execution space,
       it can make progress on message passing regardless of the state /
       status of the user's program.  This can be an overall net savings in
       performance and execution time for some classes of MPI programs.

       It is expected that this RPI will someday become obsolete when LAM
       becomes multi-threaded and allows progress to be made on message
       passing in separate threads rather than in separate processes.

       The lamd RPI has no tunable parameters.
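       The message path described for the lamd RPI can be written down as a
       small Python function.  It only lists the hops named in the text; the
       function name lamd_path and the node labels are hypothetical.

```python
def lamd_path(src_node, dst_node):
    """List the hops an MPI message takes under the lamd RPI, per the text."""
    hops = ["sending process", "LAM daemon on " + src_node]
    if dst_node != src_node:
        # The remote daemon is involved only for off-node destinations.
        hops.append("LAM daemon on " + dst_node)
    hops.append("destination process")
    return hops


print(lamd_path("n0", "n0"))
print(lamd_path("n0", "n1"))
```

       An on-node message passes through one daemon and an off-node message
       through two, which is the extra latency the section refers to.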

   tcp RPI
       The tcp RPI uses pure TCP for all MPI message passing.  TCP sockets are
       opened between MPI processes and are used for all MPI traffic.

       The tcp RPI has one tunable parameter:

       rpi_tcp_short <bytes>
           Tells the tcp RPI the smallest size (in bytes) for a message to be
           considered "long".  Short messages are sent eagerly (even if the
           receiving side is not expecting them).  Long messages use a
           rendezvous protocol (i.e., a three-way handshake) such that the
           message is not actually sent until the receiver is expecting it.
           This value defaults to 64k.
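       The effect of rpi_tcp_short can be illustrated with a short Python
       sketch.  It assumes that "64k" means 65536 bytes and that a message
       whose size equals the threshold already counts as long, as the wording
       "smallest size ... to be considered long" suggests.  The function name
       tcp_protocol is hypothetical.

```python
DEFAULT_TCP_SHORT = 64 * 1024  # assumption: "64k" read as 65536 bytes


def tcp_protocol(nbytes, short=DEFAULT_TCP_SHORT):
    """Classify a message as eager (short) or rendezvous (long)."""
    # 'short' is the smallest size that is considered long.
    return "rendezvous" if nbytes >= short else "eager"


for size in (1024, 65535, 65536, 100000):
    print(size, tcp_protocol(size), tcp_protocol(size, short=131072))
```

       Raising the threshold to 131072, as in the mpirun example in the
       DESCRIPTION section, moves a 100000-byte message from the rendezvous
       protocol to the eager one.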

   sysv RPI
       The sysv RPI uses shared memory for communication between MPI processes
       on the same node, and TCP sockets for communication between MPI
       processes on different nodes.  System V semaphores are used to lock the
       shared memory pools.  This RPI is best used when running multiple MPI
       processes on uniprocessors (or oversubscribed SMPs) because of the
       blocking / yielding nature of semaphores.

       The sysv RPI has the following tunable parameters:

       rpi_tcp_short <bytes>
           Since the sysv RPI uses parts of the tcp RPI for off-node
           communication, this parameter also has relevance to the sysv RPI.
           The meaning of this parameter is discussed in the tcp RPI section.

       rpi_sysv_short <bytes>
           Tells the sysv RPI the smallest size (in bytes) for a message to be
           considered "long".  Short shared memory messages are sent using a
           small "postbox" protocol; long messages use a more general shared
           memory pool method.  This value defaults to 8k.

       rpi_sysv_pollyield <bool>
           If set to a nonzero number, force the use of a system call to yield
           the processor.  The system call will be yield(), sched_yield(), or
           select() (with a 1ms timeout), depending on what LAM's configure
           script finds at configuration time.  This value defaults to 1.

       rpi_sysv_shmpoolsize <bytes>
           The size of the shared memory pool that is used for long message
           transfers.  It is allocated once on each node for each MPI parallel
           job.  Specifically, if multiple MPI processes from the same
           parallel job are spawned on a single node, this pool will only be
           allocated once.

           The configure script will try to determine a default size for the
           pool if none is explicitly specified (you should always check this
           to see if it is reasonable).  Larger values should improve
           performance, especially when an application passes large messages,
           but will also increase the system resources used by each task.

       rpi_sysv_shmmaxalloc <bytes>
           To prevent a single large message transfer from monopolizing the
           global pool, allocations from the pool are actually restricted to a
           maximum of rpi_sysv_shmmaxalloc bytes each.  Even with this
           restriction, it is possible for the global pool to temporarily
           become exhausted.  In this case, the transport will fall back to
           using the postbox area to transfer the message.  Performance will
           be degraded, but the application will progress.

           The configure script will try to determine a default size for the
           maximum atomic transfer size if none is explicitly specified (you
           should always check this to see if it is reasonable).  Larger
           values should improve performance, especially when an application
           passes large messages, but will also increase the system resources
           used by each task.
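       How rpi_sysv_short and rpi_sysv_shmmaxalloc interact can be sketched
       as a single decision function.  This is a simplified model under
       stated assumptions, not the sysv RPI's allocator: "8k" is read as 8192
       bytes, the function name next_transfer_step is hypothetical, and the
       caller supplies maxalloc and the free pool space because the real
       defaults are chosen by the configure script.

```python
def next_transfer_step(remaining, pool_free, maxalloc, short=8 * 1024):
    """Decide how the next piece of an on-node message would be moved.

    remaining -- bytes of the message still to transfer
    pool_free -- bytes currently free in the shared memory pool
    maxalloc  -- models rpi_sysv_shmmaxalloc
    short     -- models rpi_sysv_short (assumption: "8k" is 8192 bytes)
    Returns (path, pool_bytes_requested).
    """
    if remaining < short:
        return ("postbox", 0)      # short message: postbox protocol
    want = min(remaining, maxalloc)  # cap each pool allocation
    if want > pool_free:
        return ("postbox", 0)      # pool exhausted: degraded fallback
    return ("pool", want)


print(next_transfer_step(100, pool_free=1 << 20, maxalloc=65536))
print(next_transfer_step(1000000, pool_free=1 << 20, maxalloc=65536))
print(next_transfer_step(1000000, pool_free=1000, maxalloc=65536))
```

       The second call shows the cap at work: a 1000000-byte message asks the
       pool for only maxalloc bytes at a time.  The third call shows the
       fallback to the postbox area when the pool cannot satisfy the request.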

   usysv RPI
       The usysv RPI uses shared memory for communication between MPI
       processes on the same node, and TCP sockets for communication between
       MPI processes on different nodes.  Spin locks are used to lock the
       shared memory pools.  This RPI is best used when the number of MPI
       processes on a single node is less than or equal to the number of
       processors because it allows LAM to fully occupy the processor while
       waiting for a message and never be swapped out.
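       The guidance on when to prefer usysv over sysv reduces to one
       comparison.  The helper below is a hypothetical rule of thumb derived
       only from the two "best used when" statements in this page; it is not
       something LAM applies automatically.

```python
def suggest_shared_memory_rpi(procs_on_node, processors_on_node):
    """Suggest sysv or usysv from the per-node process and processor counts."""
    if procs_on_node <= processors_on_node:
        # Spinning is affordable when every process can own a processor.
        return "usysv"
    # Oversubscribed node: blocking semaphores let other processes run.
    return "sysv"


print(suggest_shared_memory_rpi(2, 4))
print(suggest_shared_memory_rpi(4, 1))
```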

       The usysv RPI has many of the same tunable parameters as the sysv RPI:

       rpi_tcp_short <bytes>
           Same meaning as in the sysv RPI.

       rpi_usysv_short <bytes>
           Same meaning as rpi_sysv_short in the sysv RPI.

       rpi_usysv_pollyield <bool>
           Same meaning as rpi_sysv_pollyield in the sysv RPI.

       rpi_usysv_shmpoolsize <bytes>
           Same meaning as rpi_sysv_shmpoolsize in the sysv RPI.

       rpi_usysv_shmmaxalloc <bytes>
           Same meaning as rpi_sysv_shmmaxalloc in the sysv RPI.

       rpi_usysv_readlockpoll <iterations>
           Number of iterations to spin before yielding the processor while
           waiting to read.  This value defaults to 10,000.

       rpi_usysv_writelockpoll <iterations>
           Number of iterations to spin before yielding the processor while
           waiting to write.  This value defaults to 10.
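       The spin-then-yield behavior controlled by rpi_usysv_readlockpoll and
       rpi_usysv_writelockpoll can be sketched in Python.  This illustrates
       the general technique only and is not the usysv lock implementation.
       The names spin_then_yield and max_yields are hypothetical, and
       max_yields exists only so that the sketch always terminates.

```python
import os
import time


def spin_then_yield(is_ready, spin_iterations, max_yields=1000):
    """Poll is_ready(); after spin_iterations failed polls, yield the CPU.

    Returns how many times the processor was yielded before success.
    """
    yields = 0
    while yields <= max_yields:
        for _ in range(spin_iterations):
            if is_ready():
                return yields
        # Stand-in for the yield()/sched_yield()/select() call in the text.
        if hasattr(os, "sched_yield"):
            os.sched_yield()
        else:
            time.sleep(0.001)  # rough analogue of select() with a 1ms timeout
        yields += 1
    raise TimeoutError("condition never became ready")


READ_LOCK_POLL = 10000   # documented default for rpi_usysv_readlockpoll
WRITE_LOCK_POLL = 10     # documented default for rpi_usysv_writelockpoll

polls = {"count": 0}


def ready_on_25th_poll():
    polls["count"] += 1
    return polls["count"] >= 25


print(spin_then_yield(ready_on_25th_poll, WRITE_LOCK_POLL))
```

       A larger iteration count keeps the process on the processor longer
       before it yields, which suits nodes that are not oversubscribed.  A
       smaller count gives the processor up sooner.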