DMTCP(1) Distributed MultiThreaded CheckPointing DMTCP(1)

dmtcp - Distributed MultiThreaded Checkpointing

dmtcp_coordinator [port]

dmtcp_launch command [args...]

dmtcp_restart ckpt_FILE1.dmtcp [ckpt_FILE2.dmtcp...]

dmtcp_command coordinatorCommand

DMTCP is a tool for transparently checkpointing the state of an
arbitrary group of programs spread across many machines and connected
by sockets. It modifies neither the user's program nor the operating
system. MTCP is a standalone component of DMTCP, available as a
checkpointing library for a single process.

For each command, the --help or -h flag shows the command-line options.
Most command-line options can also be controlled through environment
variables. These can be set in bash with "export NAME=value" or in tcsh
with "setenv NAME value".

DMTCP_CHECKPOINT_INTERVAL=integer
       Time in seconds between automatic checkpoints. Checkpoints can
       also be initiated manually by typing 'c' on the coordinator's
       standard input. (default: 0, disabled; dmtcp_coordinator only)

DMTCP_HOST=string
       Hostname where the cluster-wide coordinator is running.
       (default: localhost; dmtcp_launch, dmtcp_restart only)

DMTCP_PORT=integer
       The port the cluster-wide coordinator listens on. (default:
       7779)

DMTCP_GZIP=(1|0)
       Set to "0" to disable compression of checkpoint images.
       (default: 1, compression enabled; dmtcp_launch only) WARNING:
       gzip compression adds seconds to each checkpoint. Without gzip,
       checkpoint/restart often takes less than 1 s.

DMTCP_CHECKPOINT_DIR=path
       Directory to store checkpoint images in. (default: ./)

DMTCP_SIGCKPT=integer
       Internal signal number to use for checkpointing. Must not be
       used by the user program. (default: SIGUSR2; dmtcp_launch only)
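
As a rough illustration of the environment-variable route in bash, the settings below point later DMTCP commands at a coordinator on the default host and port, turn off compression, and send checkpoint images to a dedicated directory. The directory name ckpt is a placeholder, and creating it in advance is a precaution assumed here, not a documented requirement.

```shell
# Illustrative values only; ./ckpt is a placeholder directory name.
export DMTCP_HOST=localhost              # host where the coordinator runs
export DMTCP_PORT=7779                   # coordinator port (documented default)
export DMTCP_GZIP=0                      # 0 disables checkpoint compression
export DMTCP_CHECKPOINT_DIR="$PWD/ckpt"  # where checkpoint images are written

# Create the directory up front (an assumption, not a documented requirement).
mkdir -p "$DMTCP_CHECKPOINT_DIR"
```

Commands started from the same shell afterwards inherit these settings.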

Each computation to be checkpointed must include a DMTCP coordinator
process. One can explicitly start a coordinator with dmtcp_coordinator,
or allow one to be started implicitly in the background by either
dmtcp_launch or dmtcp_restart. The address of the unique coordinator
should be given to dmtcp_launch, dmtcp_restart, and dmtcp_command
either through the --host and --port command-line flags or through the
DMTCP_HOST and DMTCP_PORT environment variables. If neither is given,
the host-port pair defaults to localhost-7779. The host-port pair
associated with a particular coordinator is given by the command-line
flags used in the dmtcp_coordinator command, or the environment
variables then in effect, or the default of localhost-7779.
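
The fallback from environment variables to the defaults can be sketched with shell parameter expansion. The helper name coordinator_address is hypothetical and not part of DMTCP, and the sketch does not model the --host and --port flags.

```shell
# coordinator_address is a hypothetical helper, not a DMTCP command.
# It prints the host and port that the environment variables select,
# falling back to the documented defaults (localhost, 7779) when unset.
coordinator_address() {
    printf '%s:%s\n' "${DMTCP_HOST:-localhost}" "${DMTCP_PORT:-7779}"
}

coordinator_address
```

With neither variable set, the helper prints the default pair; setting either variable changes only that half of the pair.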

The coordinator is stateless and is not checkpointed. On restart, one
can use an existing or a new coordinator. Multiple computations under
DMTCP control can coexist by providing a unique coordinator (with a
unique host-port pair) for each such computation.
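
One possible way to keep several computations apart is to give each its own port. The sketch below only prints the commands it would run. The job names and the choice of consecutive ports starting at 7780 are illustrative assumptions.

```shell
# Assign each computation its own coordinator port.
# jobA, jobB, jobC are placeholder program names; 7780 is an arbitrary start.
port=7780
for job in jobA jobB jobC; do
    # A coordinator on this port, then the program pointed at it
    # through the DMTCP_PORT environment variable.
    echo "dmtcp_coordinator $port"
    echo "DMTCP_PORT=$port dmtcp_launch ./$job"
    port=$((port + 1))
done
```

Each printed pair shares one port, so every computation talks only to its own coordinator.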

The coordinator initiates a checkpoint for all processes in its
computation group. Checkpoints can be performed automatically on an
interval (see DMTCP_CHECKPOINT_INTERVAL above), initiated manually on
the standard input of the coordinator (see next paragraph), or
initiated directly under program control by the computation through
the plugin API (see below).

The coordinator accepts the following commands on its standard input.
Each command should be followed by the <return> key. The commands are:
       l : List connected nodes
       s : Print status message
       c : Checkpoint all nodes
       f : Force a restart even if there are missing nodes (debugging)
       k : Kill all nodes
       q : Kill all nodes and quit
       ? : Show this message

Coordinator commands can also be issued remotely using dmtcp_command.

1. In a separate terminal window, start the dmtcp_coordinator.
   (See previous section.)

       dmtcp_coordinator

2. In separate terminal(s), replace each command with "dmtcp_launch
   [command]". The checkpointed program will connect to the
   coordinator specified by DMTCP_HOST and DMTCP_PORT. New threads
   will be checkpointed as part of the process. Child processes will
   automatically be checkpointed. Remote processes started via ssh
   will also be checkpointed automatically. (Internally, DMTCP
   modifies the ssh command line to call dmtcp_launch on the remote
   host.)

       dmtcp_launch ./myprogram

3. To manually initiate a checkpoint, either run the command below or
   type "c" followed by <return> into the coordinator. Checkpoint
   files for each process will be written to DMTCP_CHECKPOINT_DIR.
   The dmtcp_coordinator will write "dmtcp_restart_script.sh" to its
   working directory. This script contains the necessary calls to
   dmtcp_restart to restart the entire computation, including remote
   processes created via ssh.

       dmtcp_command -c
   OR: dmtcp_command --checkpoint

4. To restart, execute dmtcp_restart_script.sh, which is created by
   the dmtcp_coordinator in its working directory at the time of
   checkpoint. One can optionally edit this script to migrate
   processes to different hosts. By default, only one process will be
   restarted in the foreground and receive the standard input. The
   script may be edited to choose which process will be restarted in
   the foreground.

       ./dmtcp_restart_script.sh

The source distribution includes a top-level plugin directory, with
examples of how to write a plugin for DMTCP. Plugins allow a
checkpointed application to disconnect from an external resource, and
then reconnect to it at the time of restart. Further examples are in
the test/plugin directory. In particular, test/plugin/applic-initiated
demonstrates how an application can request a checkpoint under program
control.

The plugin feature adds three new user-programmable capabilities. A
plugin may: add wrappers around system calls; take special actions
during certain events (e.g. pre-checkpoint, resume/post-checkpoint,
restart); and insert key-value pairs into a database at restart time
that is then available to be queried by the restarted processes of a
computation. (The events available to the plugin feature form a
superset of the events available with the dmtcpaware interface.) One
or more plugins are specified as a colon-separated list of absolute
pathnames.

       dmtcp_launch --with-plugin PLUGIN1[:PLUGIN2]...
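
Building the colon-separated list by hand is error-prone, so a small helper can assemble and check it. The sketch below is illustrative only: join_plugins is a hypothetical function, the plugin paths are placeholders, and the final line prints the dmtcp_launch command instead of running it.

```shell
# join_plugins is a hypothetical helper, not part of DMTCP. It joins its
# arguments with ':' and rejects any path that is not absolute.
join_plugins() {
    list=""
    for p in "$@"; do
        case "$p" in
            /*) ;;                                   # absolute path: accept
            *)  echo "not an absolute path: $p" >&2  # anything else: reject
                return 1 ;;
        esac
        list="${list:+$list:}$p"                     # add ':' only between items
    done
    printf '%s\n' "$list"
}

# Placeholder plugin paths, for illustration only.
plugins=$(join_plugins /opt/plugins/libfoo.so /opt/plugins/libbar.so)
echo "dmtcp_launch --with-plugin $plugins ./myprogram"
```

Rejecting relative paths early matches the requirement that plugin pathnames be absolute.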

The dmtcpaware API is now deprecated in favor of DMTCP plugins (see
above). DMTCP provides a programming interface to allow checkpointed
applications to interact with DMTCP. In the source distribution, see
dmtcpaware/dmtcpaware.h for the functions available. See
test/dmtcpaware[123].c for three example applications. For an example
of its usage, try:

       cd test; rm dmtcpaware1; make dmtcpaware1; ./autotest -v dmtcpaware1

The user application should link with libdmtcpaware.so (-ldmtcpaware)
and use the header file dmtcp/dmtcpaware.h.

A target program under DMTCP control normally returns the same return
code as if executed without DMTCP. However, if DMTCP fails (as opposed
to the target program failing), DMTCP returns a DMTCP-specific return
code, rc (or rc+1, rc+2 for two special cases), where rc is the integer
value of the environment variable DMTCP_FAIL_RC if set, or else the
default value, 99.
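
A wrapper script can use this rule to tell a DMTCP failure from an ordinary exit of the target program. The following is a minimal sketch under that reading of the rule. The helper name classify_exit and its two output labels are hypothetical.

```shell
# classify_exit is a hypothetical helper, not part of DMTCP. Given an exit
# status, it reports whether the status lies in the range reserved for
# DMTCP's own failures: rc, rc+1, rc+2, where rc is DMTCP_FAIL_RC or 99.
classify_exit() {
    status=$1
    rc=${DMTCP_FAIL_RC:-99}
    if [ "$status" -ge "$rc" ] && [ "$status" -le $((rc + 2)) ]; then
        echo "dmtcp-failure"
    else
        echo "program-exit"
    fi
}

classify_exit 0      # status 0 is outside the default range 99..101
classify_exit 100    # inside the range when DMTCP_FAIL_RC is unset
```

A target program that itself exits with a status in this range cannot be told apart this way; setting DMTCP_FAIL_RC to an unused value moves the range.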

Full documentation is available at the DMTCP home page:
       http://dmtcp.sourceforge.net/
FAQ: http://dmtcp.sourceforge.net/FAQ.html
Plugins (DMTCP-2.x):
       https://github.com/dmtcp/dmtcp/blob/master/doc/plugin-tutorial.pdf
Examples: https://github.com/dmtcp/dmtcp/tree/master/test/plugin

DMTCP and its standalone single-process component MTCP (MultiThreaded
CheckPointing) were created by the long-term contributors Jason Ansel,
Kapil Arya, Gene Cooperman, Rohan Garg, Artem Y. Polyakov, and Mike
Rieker, together with additional contributors including Alex Brick,
Tyler Denniston, William Enright, Gregory Kerr, Ana-Maria Visan, and
others. For support, see the DMTCP home page.

DMTCP team June 17, 2008 DMTCP(1)