lamssi_cr(7) LAM SSI CR OVERVIEW lamssi_cr(7)

LAM SSI checkpoint / restart - overview of LAM's MPI checkpoint /
restart SSI modules

The "kind" for checkpoint / restart SSI modules is "cr". Specifically,
the string "cr" (without the quotes) is the prefix that should be used
with the -ssi switch on the mpirun command line. For example:

    mpirun -ssi cr blcr C my_mpi_program

LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs.
Doing so requires that LAM/MPI was compiled with thread support and
that back-end checkpointing systems are available at run-time. MPI
jobs must run with at least MPI_THREAD_SERIALIZED support. If a job
elects to run with checkpoint/restart support and an available cr
module is found, the job's thread level will automatically be promoted
to MPI_THREAD_SERIALIZED. See the User's Guide for more details.
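
The promotion rule above can be modelled in a few lines of C. This is only an illustrative sketch, not LAM's implementation: the enum and the function effective_thread_level() are hypothetical stand-ins for the MPI thread-level constants, and the sketch assumes that a level already at or above the serialized level is left unchanged.

```c
/* Hypothetical stand-ins for the MPI thread-level constants, ordered
 * from least to most permissive as in the MPI standard. */
enum thread_level {
    THREAD_SINGLE = 0,
    THREAD_FUNNELED,
    THREAD_SERIALIZED,
    THREAD_MULTIPLE
};

/* Return the thread level a job would end up with. When a cr module
 * has been selected, any level below THREAD_SERIALIZED is raised to
 * THREAD_SERIALIZED; otherwise the requested level is kept as-is. */
enum thread_level effective_thread_level(enum thread_level requested,
                                         int cr_module_selected)
{
    if (cr_module_selected && requested < THREAD_SERIALIZED)
        return THREAD_SERIALIZED;
    return requested;
}
```

For example, effective_thread_level(THREAD_SINGLE, 1) yields THREAD_SERIALIZED, while the same request without a cr module stays at THREAD_SINGLE.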

Checkpoint Phases
LAM defines three phases for checkpoint / restart support in each MPI
process:

Checkpoint
    When the checkpoint request arrives, before the actual checkpoint
    occurs.

Continue
    After a checkpoint has successfully completed, in the same process
    in which the checkpoint was invoked.

Restart
    After a checkpoint has successfully completed, in a new / restarted
    process.

The Continue and Restart phases are identical except for the process in
which they are invoked: the Continue phase is invoked in the same
process in which the Checkpoint phase was invoked, whereas the Restart
phase is invoked only in newly restarted processes.
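
The relationship between the three phases can be sketched as a small C model. The enum cr_phase and both functions are hypothetical names introduced for illustration; they are not part of LAM's API. The sketch only encodes the rule stated above: after a successful checkpoint, the original process sees Continue and a restarted process sees Restart.

```c
/* Hypothetical names for the three phases described above. */
enum cr_phase { CR_CHECKPOINT, CR_CONTINUE, CR_RESTART };

/* Phase that follows a successfully completed checkpoint: Restart in
 * a newly restarted process, Continue in the process that took the
 * checkpoint. */
enum cr_phase phase_after_checkpoint(int in_restarted_process)
{
    return in_restarted_process ? CR_RESTART : CR_CONTINUE;
}

/* Human-readable phase name, e.g. for log messages. */
const char *cr_phase_name(enum cr_phase phase)
{
    switch (phase) {
    case CR_CHECKPOINT: return "Checkpoint";
    case CR_CONTINUE:   return "Continue";
    case CR_RESTART:    return "Restart";
    }
    return "unknown";
}
```

Apart from which of the two values is returned, the model treats Continue and Restart identically, mirroring the description above.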

LAM currently has two cr modules: blcr and self. For an MPI job to be
checkpointed and restarted, all of its MPI SSI modules must support
checkpoint/restart. Currently, this means using the crtcp RPI module,
or the gm RPI module when compiled with gm_get() support (see the
User's Guide for more details).

blcr CR Module
The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is
a software system from Lawrence Berkeley Labs. See the project web
page for more details: http://www.nersc.gov/research/ftg/checkpoint/.

The blcr module has one SSI parameter:

cr_blcr_priority
    blcr's default priority is 50.

self CR Module
The self CR module effectively allows application-level checkpointing
by invoking user-specified functions at the Checkpoint, Continue, and
Restart phases of LAM/MPI C/R support.

Multiple SSI parameters are available:

cr_self_user_prefix
    Specify a string prefix for the names of the checkpoint, continue,
    and restart functions that LAM should invoke. That is, specifying
    "-ssi cr_self_user_prefix foo" means that LAM expects to find
    three functions at run-time: foo_checkpoint(), foo_continue(),
    and foo_restart(). This is a convenience parameter that can be
    used instead of the three parameters listed below.

cr_self_user_checkpoint
    Name of the user function to invoke during the Checkpoint phase.

cr_self_user_continue
    Name of the user function to invoke during the Continue phase.

cr_self_user_restart
    Name of the user function to invoke during the Restart phase.

If the self module is selected and none of these parameters is
specified, it will abort. Finally, the usual priority SSI parameter is
also available:

cr_self_priority
    self's default priority is 25.
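
As a minimal sketch of what user functions for the self module might look like, the C code below saves one piece of application state in the Checkpoint phase and reloads it in the Restart phase. Several things here are assumptions rather than facts from this page: the signature int name(void), the convention of returning 0 on success, the state variable iteration, and the file name foo_state.ckpt. Consult the User's Guide for the prototypes LAM actually expects. The functions are given external linkage on the assumption that LAM must be able to look them up by name at run-time.

```c
#include <stdio.h>

/* Hypothetical application state and checkpoint file name. */
long iteration = 0;
static const char *STATE_FILE = "foo_state.ckpt";

/* Checkpoint phase: write the state out before the checkpoint occurs.
 * Returns 0 on success, -1 on failure (assumed convention). */
int foo_checkpoint(void)
{
    FILE *fp = fopen(STATE_FILE, "w");
    if (fp == NULL)
        return -1;
    int ok = fprintf(fp, "%ld\n", iteration) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Continue phase: same process as the checkpoint, so the state is
 * still in memory and nothing needs to be reloaded. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: newly restarted process, so reload the saved state. */
int foo_restart(void)
{
    FILE *fp = fopen(STATE_FILE, "r");
    if (fp == NULL)
        return -1;
    int ok = fscanf(fp, "%ld", &iteration) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

With functions named this way, a job would presumably select them by combining the switches shown earlier, along the lines of "mpirun -ssi cr self -ssi cr_self_user_prefix foo C my_mpi_program".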

lamssi(7), mpirun(1), LAM User's Guide

LAM 7.1.2 March, 2006 lamssi_cr(7)