lamssi_cr(7)                  LAM SSI CR OVERVIEW                 lamssi_cr(7)

NAME

       LAM SSI checkpoint / restart - overview of LAM's MPI checkpoint /
       restart SSI modules

DESCRIPTION

       The "kind" for checkpoint / restart SSI modules is "cr".  Specifically,
       the string "cr" (without the quotes) is the prefix to use with the
       -ssi switch on the mpirun command line.  For example:

       mpirun -ssi cr blcr C my_mpi_program

       LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs.
       Doing so requires that LAM/MPI was compiled with thread support and
       that back-end checkpointing systems are available at run-time.  MPI
       jobs must run with at least MPI_THREAD_SERIALIZED support.  If a job
       elects to run with checkpoint/restart support and an available cr
       module is found, the job's thread level is automatically promoted
       to MPI_THREAD_SERIALIZED.  See the User's Guide for more details.

   Checkpoint Phases
       LAM defines three phases for checkpoint / restart support in each MPI
       process:

       Checkpoint.
           When the checkpoint request arrives, before the actual checkpoint
           occurs.

       Continue.
           After a checkpoint has successfully completed, in the same process
           in which the checkpoint was invoked.

       Restart.
           After a checkpoint has successfully completed, in a new / restarted
           process.

       The Continue and Restart phases are identical except for the process in
       which they are invoked: the Continue phase is invoked in the same
       process in which the Checkpoint phase was invoked, whereas the Restart
       phase is invoked only in newly restarted processes.

AVAILABLE MODULES

       LAM currently has two cr modules: blcr and self.  For an MPI job to be
       checkpointed and restarted, all of its MPI SSI modules must support
       checkpoint/restart.  Currently, this means using the crtcp RPI module,
       or the gm RPI module when compiled with gm_get() support (see the
       User's Guide for more details).

   blcr CR Module
       The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is
       a software system from Lawrence Berkeley Labs.  See the project web
       page for more details: http://www.nersc.gov/research/ftg/checkpoint/.

       The blcr module has one SSI parameter:

       cr_blcr_priority
           blcr's default priority is 50.

   self CR Module
       The self CR module effectively allows application-level checkpointing
       by invoking user-specified functions at the Checkpoint, Continue, and
       Restart phases of LAM/MPI C/R support.

       Multiple SSI parameters are available:

       cr_self_user_prefix
           Specify a string prefix for the names of the checkpoint, continue,
           and restart functions that should be invoked by LAM.  That is,
           specifying "-ssi cr_self_user_prefix foo" means that LAM expects to
           find three functions at run-time: foo_checkpoint(), foo_continue(),
           and foo_restart().  This is a convenience parameter that can be
           used instead of the three parameters listed below.

       cr_self_user_checkpoint
           Name of the user function to invoke during the Checkpoint phase.

       cr_self_user_continue
           Name of the user function to invoke during the Continue phase.

       cr_self_user_restart
           Name of the user function to invoke during the Restart phase.

       If none of these parameters are specified and the self module is
       selected, it will abort.  Finally, the usual priority SSI parameter is
       also available:

       cr_self_priority
           self's default priority is 25.

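       The three callbacks used by the self module can be sketched in C as
       follows.  This is a minimal sketch under stated assumptions, not
       LAM's documented interface: the int (void) signature, the
       0-on-success return convention, the iteration variable, and the
       foo_state.dat file name are illustrative choices that do not come
       from this page.  The function names follow the foo prefix example
       given for cr_self_user_prefix.

```c
#include <stdio.h>

/* Application state that a checkpoint should preserve (illustrative). */
static int iteration = 0;

/* Hypothetical file name; a real job would likely use one file per rank. */
static const char *state_file = "foo_state.dat";

/* Checkpoint phase: save state before the actual checkpoint occurs.
 * ASSUMPTION: int (void) signature, returning 0 on success. */
int foo_checkpoint(void)
{
    FILE *fp = fopen(state_file, "w");
    if (fp == NULL)
        return -1;
    int ok = fprintf(fp, "%d\n", iteration) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Continue phase: runs in the same process that was checkpointed, so the
 * in-memory state is still valid and nothing needs to be reloaded. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: runs in a newly restarted process, so reload the state
 * that foo_checkpoint() wrote. */
int foo_restart(void)
{
    FILE *fp = fopen(state_file, "r");
    if (fp == NULL)
        return -1;
    int ok = fscanf(fp, "%d", &iteration) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

       Assuming these functions are compiled into my_mpi_program, the job
       might then be launched with:

       mpirun -ssi cr self -ssi cr_self_user_prefix foo C my_mpi_program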

SEE ALSO

       lamssi(7), mpirun(1), LAM User's Guide

LAM 7.1.2                         March, 2006                     lamssi_cr(7)