DMTCP(1)            Distributed MultiThreaded CheckPointing           DMTCP(1)

NAME

       dmtcp - Distributed MultiThreaded Checkpointing

SYNOPSIS

       dmtcp_coordinator [port]

       dmtcp_launch command [args...]

       dmtcp_restart ckpt_FILE1.dmtcp [ckpt_FILE2.dmtcp...]

       dmtcp_command coordinatorCommand

DESCRIPTION

       DMTCP is a tool for transparently checkpointing the state of an
       arbitrary group of programs spread across many machines and connected
       by sockets.  It modifies neither the user's program nor the operating
       system.  MTCP is a standalone component of DMTCP, available as a
       checkpointing library for a single process.

OPTIONS

       For each command, the --help or -h flag shows the command-line
       options.  Most command-line options can also be controlled through
       environment variables.  These can be set in bash with "export
       NAME=value" or in tcsh with "setenv NAME value".

       DMTCP_CHECKPOINT_INTERVAL=integer
              Time in seconds between automatic checkpoints.  Checkpoints can
              also be initiated manually by typing 'c' into the coordinator.
              (default: 0, disabled; dmtcp_coordinator only)

       DMTCP_HOST=string
              Hostname where the cluster-wide coordinator is running.
              (default: localhost; dmtcp_launch, dmtcp_restart only)

       DMTCP_PORT=integer
              The port the cluster-wide coordinator listens on.  (default:
              7779)

       DMTCP_GZIP=(1|0)
              Set to "0" to disable compression of checkpoint images.
              (default: 1, compression enabled; dmtcp_launch only)  WARNING:
              gzip adds seconds to checkpoint/restart time.  Without gzip,
              checkpoint/restart often takes less than 1 s.

       DMTCP_CHECKPOINT_DIR=path
              Directory in which to store checkpoint images.  (default: ./)

       DMTCP_SIGCKPT=integer
              Internal signal number to use for checkpointing.  Must not be
              used by the user program.  (default: SIGUSR2; dmtcp_launch only)
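
       As a minimal sketch of the "export NAME=value" form, the variables in
       this section can be set together in a POSIX shell before launching.
       The helper name dmtcp_env_defaults is hypothetical (it is not part of
       DMTCP); it fills in the documented defaults and leaves any value that
       is already set untouched.

```shell
# Hypothetical helper, not part of DMTCP: export the documented variables,
# keeping any value the caller has already set.
dmtcp_env_defaults() {
    export DMTCP_HOST="${DMTCP_HOST:-localhost}"                # coordinator host
    export DMTCP_PORT="${DMTCP_PORT:-7779}"                     # coordinator port
    export DMTCP_CHECKPOINT_DIR="${DMTCP_CHECKPOINT_DIR:-./}"   # image directory
    export DMTCP_GZIP="${DMTCP_GZIP:-1}"                        # 1 = compress images
    export DMTCP_CHECKPOINT_INTERVAL="${DMTCP_CHECKPOINT_INTERVAL:-0}"  # 0 = disabled
}
```

       Once dmtcp_env_defaults has been called, commands started from the
       same shell (for example dmtcp_launch ./myprogram) inherit the exported
       settings.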

DMTCP_COORDINATOR

       Each computation to be checkpointed must include a DMTCP coordinator
       process.  One can start a coordinator explicitly through
       dmtcp_coordinator, or allow one to be started implicitly in the
       background by either dmtcp_launch or dmtcp_restart.  The address of
       the unique coordinator should be specified by dmtcp_launch,
       dmtcp_restart, and dmtcp_command either through the --host and --port
       command-line flags or through the DMTCP_HOST and DMTCP_PORT
       environment variables.  If neither is given, the host-port pair
       defaults to localhost-7779.  The host-port pair associated with a
       particular coordinator is given by the command-line flags used in the
       dmtcp_coordinator command, or the environment variables then in
       effect, or the default of localhost-7779.

       The coordinator is stateless and is not checkpointed.  On restart, one
       can use an existing or a new coordinator.  Multiple computations under
       DMTCP control can coexist by providing a unique coordinator (with a
       unique host-port pair) for each such computation.

       The coordinator initiates a checkpoint for all processes in its
       computation group.  Checkpoints can be: performed automatically on an
       interval (see DMTCP_CHECKPOINT_INTERVAL above); initiated manually on
       the standard input of the coordinator (see next paragraph); or
       initiated directly under program control by the computation through
       the plugin API (see below).

       The coordinator accepts the following commands on its standard input.
       Each command should be followed by the <return> key.  The commands are:
         l : List connected nodes
         s : Print status message
         c : Checkpoint all nodes
         f : Force a restart even if there are missing nodes (debugging)
         k : Kill all nodes
         q : Kill all nodes and quit
         ? : Show this message

       Coordinator commands can also be issued remotely using dmtcp_command.
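
       The address lookup described at the start of this section can be
       sketched as a small shell function.  The name coordinator_address is
       hypothetical, its two optional arguments stand in for the --host and
       --port flags, and the precedence (flags, then environment variables,
       then the default) is an assumption read from the order in which the
       text lists them.

```shell
# Hypothetical helper, not part of DMTCP: print the coordinator address
# as "host:port".
# Assumed precedence: explicit arguments (standing in for --host/--port),
# then DMTCP_HOST/DMTCP_PORT, then the documented default localhost:7779.
coordinator_address() {
    host="${1:-${DMTCP_HOST:-localhost}}"
    port="${2:-${DMTCP_PORT:-7779}}"
    echo "$host:$port"
}
```

       Giving each computation its own port (for example 7779 for one and
       7780 for another) is one way to obtain the unique host-port pairs
       that coexisting computations need.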

EXAMPLE USAGE

       1. In a separate terminal window, start the dmtcp_coordinator.
              (See previous section.)

               dmtcp_coordinator

       2. In separate terminal(s), replace each command with "dmtcp_launch
              [command]".  The checkpointed program will connect to the
              coordinator specified by DMTCP_HOST and DMTCP_PORT.  New threads
              will be checkpointed as part of the process.  Child processes
              will automatically be checkpointed.  Remote processes started
              via ssh will automatically be checkpointed.  (Internally, DMTCP
              modifies the ssh command line to call dmtcp_launch on the remote
              host.)

               dmtcp_launch ./myprogram

       3. To manually initiate a checkpoint, either run the command below
              or type "c" followed by <return> into the coordinator.
              Checkpoint files for each process will be written to
              DMTCP_CHECKPOINT_DIR.  The dmtcp_coordinator will write
              "dmtcp_restart_script.sh" to its working directory.  This script
              contains the necessary calls to dmtcp_restart to restart the
              entire computation, including remote processes created via ssh.

                   dmtcp_command -c
              OR:  dmtcp_command --checkpoint

       4. To restart, execute dmtcp_restart_script.sh, which is created by
              the dmtcp_coordinator in its working directory at the time of
              checkpoint.  One can optionally edit this script to migrate
              processes to different hosts.  By default, only one restarted
              process runs in the foreground and receives the standard input.
              The script may be edited to choose which process is restarted
              in the foreground.

               ./dmtcp_restart_script.sh
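
       Besides the generated script, the SYNOPSIS form of dmtcp_restart takes
       checkpoint images directly.  The following is a minimal sketch for
       collecting them from DMTCP_CHECKPOINT_DIR.  The helper name
       list_checkpoint_images is hypothetical, and the ckpt_*.dmtcp naming
       pattern is an assumption based on the placeholder names in SYNOPSIS.

```shell
# Hypothetical helper, not part of DMTCP: print the checkpoint images found
# in a directory, one per line.
# Assumes images are named ckpt_*.dmtcp, as in the SYNOPSIS placeholders.
list_checkpoint_images() {
    dir="${1:-${DMTCP_CHECKPOINT_DIR:-.}}"
    for f in "$dir"/ckpt_*.dmtcp; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] && echo "$f"
    done
    return 0
}
```

       The output can then be handed to dmtcp_restart, for example as
       dmtcp_restart $(list_checkpoint_images).  This simple form assumes
       that the image paths contain no whitespace.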

DMTCP PLUGIN

       The source distribution includes a top-level plugin directory, with
       examples of how to write a plugin for DMTCP.  Plugins allow a
       checkpointed application to disconnect from an external resource, and
       then reconnect to it at the time of restart.  Further examples are in
       the test/plugin directory.  In particular, test/plugin/applic-initiated
       demonstrates how an application can request a checkpoint under program
       control.

       The plugin feature adds three new user-programmable capabilities.  A
       plugin may: add wrappers around system calls; take special actions
       during certain events (e.g. pre-checkpoint, resume/post-checkpoint,
       restart); and insert key-value pairs into a database at restart time,
       which the restarted processes of a computation can then query.  (The
       events available to the plugin feature form a superset of the events
       available with the dmtcpaware interface.)  One or more plugins are
       invoked via a list of colon-separated absolute pathnames.

         dmtcp_launch --with-plugin PLUGIN1[:PLUGIN2]...
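
       The colon-separated list can be assembled with a short shell function.
       This is a sketch only: the name join_plugins is hypothetical, and
       rejecting relative paths is a choice made here because the text asks
       for absolute pathnames.

```shell
# Hypothetical helper, not part of DMTCP: join plugin paths with ':' for
# use as the --with-plugin argument. Relative paths are rejected.
join_plugins() {
    list=""
    for p in "$@"; do
        case "$p" in
            /*) ;;                                        # absolute: accept
            *)  echo "not an absolute path: $p" >&2
                return 1 ;;
        esac
        list="${list:+$list:}$p"                          # add ':' between items
    done
    echo "$list"
}
```

       With two illustrative plugin paths, the result could be used as
       dmtcp_launch --with-plugin "$(join_plugins /path/one.so /path/two.so)"
       ./myprogram.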

DMTCPAWARE API

       This API is now deprecated in favor of DMTCP plugins (see above).
       DMTCP provides a programming interface to allow checkpointed
       applications to interact with DMTCP.  In the source distribution, see
       dmtcpaware/dmtcpaware.h for the functions available.  See
       test/dmtcpaware[123].c for three example applications.  For an example
       of its usage, try:

        cd test; rm dmtcpaware1; make dmtcpaware1; ./autotest -v dmtcpaware1

       The user application should link with libdmtcpaware.so (-ldmtcpaware)
       and use the header file dmtcp/dmtcpaware.h.

RETURN CODE

       A target program under DMTCP control normally returns the same return
       code as if executed without DMTCP.  However, if DMTCP fails (as opposed
       to the target program failing), DMTCP returns a DMTCP-specific return
       code, rc (or rc+1, rc+2 for two special cases), where rc is the integer
       value of the environment variable DMTCP_FAIL_RC if set, or else the
       default value, 99.
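
       A wrapper script can use this rule to tell a DMTCP failure from a
       failure of the target program.  The sketch below assumes a POSIX
       shell; the function name classify_exit_status and the two labels it
       prints are hypothetical.

```shell
# Hypothetical helper, not part of DMTCP: classify an exit status.
# Prints "dmtcp-failure" if the status is rc, rc+1 or rc+2, where rc is
# DMTCP_FAIL_RC if set, or else the documented default of 99.
# Otherwise prints "program".
classify_exit_status() {
    status="$1"
    rc="${DMTCP_FAIL_RC:-99}"
    if [ "$status" -ge "$rc" ] && [ "$status" -le $((rc + 2)) ]; then
        echo "dmtcp-failure"
    else
        echo "program"
    fi
}
```

       A target program that itself exits with a status in the same range
       cannot be told apart this way; setting DMTCP_FAIL_RC to a value the
       program never returns avoids the ambiguity.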

SEE ALSO

       Full documentation is available at the DMTCP home page:
        http://dmtcp.sourceforge.net/
       FAQ: http://dmtcp.sourceforge.net/FAQ.html
       Plugins: https://github.com/dmtcp/dmtcp/blob/master/doc/plugin-tutorial.pdf
                (DMTCP-2.x)
         examples: https://github.com/dmtcp/dmtcp/tree/master/test/plugin

AUTHORS

       DMTCP and its standalone single-process component MTCP (MultiThreaded
       CheckPointing) were created by the long-term contributors, Jason Ansel,
       Kapil Arya, Gene Cooperman, Rohan Garg, Artem Y. Polyakov, Mike Rieker,
       and a series of additional contributors including Alex Brick, Tyler
       Denniston, William Enright, Gregory Kerr, Ana-Maria Visan, and others.
       For support, see the DMTCP home page.



DMTCP team                       June 17, 2008                        DMTCP(1)