ARCHIVED: On the Research SP, how do I checkpoint a serial job?
The Research SP system supports kernel-level checkpointing. A serial program can be checkpointed without recompilation; the only change you need to make is to the job command file that you submit to LoadLeveler.
The job command file needs to contain the keywords related to
checkpointing, which instruct LoadLeveler to enable checkpointing for
this job. After the job is submitted to LoadLeveler and started, you
can issue the command llckpt to checkpoint the program,
or with appropriate keywords, LoadLeveler can also checkpoint a job
periodically.
The keywords that need to be added are checkpoint,
ckpt_dir, ckpt_file,
ckpt_time_limit, and
restart_from_ckpt. The following table describes these
keywords and gives their default values on the Research SP:
| Keyword | Default Value | Description |
|---|---|---|
| checkpoint | No | Indicates whether a job step should be enabled for checkpoint |
| ckpt_dir | Initial working directory of the job | The location to be used for checkpoint files |
| ckpt_file | [jobname.]job_step_id.ckpt | The base name to be used for the checkpoint file |
| ckpt_time_limit | Unlimited | Amount of time a job step can take to checkpoint |
| restart_from_ckpt | No | Indicates whether a job step is to be restarted from an existing checkpoint file |
It is important to select the value of the ckpt_dir
keyword appropriately. Since the checkpoint file is essentially
recording the running image of a program, it is usually very
big. While the default value of ckpt_dir is the initial
working directory of the program, it is not always the best place to
put a big checkpoint file. For example, if you're running the program
in your home directory, it is not advisable to put the checkpoint file
there since you may easily exceed your quota. Another consideration is
reliability. You need to make sure the checkpoint file is present when
the system needs to restart a job based on it. For example,
/tmp is not an appropriate place to put checkpoint files,
since LoadLeveler will very likely restart the job on a different
node from the one where it previously ran, and the file would not be
found there.
For example, if your username were jdoe and you had a
serial program called a.out, you could use the following
job command file to submit a job with checkpointing enabled:
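The original listing is not reproduced in this copy. The following is a minimal sketch reconstructed from the description below, not the original file. The checkpoint-related values come from that description; the job_type, executable, output, and error lines, and the file names my_test.out and my_test.err, are assumptions added only to make the example complete.

```shell
# Sketch of a LoadLeveler job command file with checkpointing enabled.
# Lines beginning with "# @" are LoadLeveler keywords; other "#" lines are comments.
#
# Assumed lines (not from the original article):
# @ job_type          = serial
# @ executable        = a.out
# @ output            = my_test.out
# @ error             = my_test.err
#
# Checkpoint settings taken from the description in this article:
# @ checkpoint        = interval
# @ ckpt_dir          = /gpfs/jdoe
# @ ckpt_file         = my_test_1
# @ ckpt_time_limit   = rlim_infinity
# @ restart           = no
# @ restart_from_ckpt = no
# @ queue
```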
The line #@ checkpoint=interval instructs the system to
checkpoint the program periodically. By default, the system
checkpoints the program at intervals of between 15 minutes and 2
hours; the specific interval depends on the system load and other
factors. The checkpoint files are written to the /gpfs/jdoe
directory, and are named my_test_1.[tag], where [tag] is the
index of the checkpoint file (0, 1, 2...) and is used to differentiate
the current and previous checkpoint files. New checkpoint files
overwrite the old ones. The line #@ ckpt_time_limit =
rlim_infinity indicates there is no restriction on how long
a checkpoint may take.
The other important keyword is #@restart=no, which
instructs LoadLeveler not to requeue the job if it is vacated
from the executing machine before completion. If LoadLeveler requeued
the job automatically, it would be restarted from the beginning, which
is probably not what you want. With restart=no, you resubmit the job
yourself: if the job aborts, you need to manually restart it from the
checkpoint. Note that when the job is first submitted, the
restart_from_ckpt keyword must be set to no,
so that LoadLeveler will not try to restart it from a non-existent
checkpoint file.
Note: You can also set the
checkpoint keyword to yes to enable
checkpointing. However, in that case, the system won't checkpoint the
program automatically. You will need to run the llckpt
command periodically to invoke checkpointing.
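As a rough sketch of a manual checkpoint, the commands below list the user's job steps with llq and then pass one step ID to llckpt. The username jdoe follows the example in this article; the step ID sp01.1234.0 is a hypothetical placeholder, and the guard around the commands is only there so the script does nothing harmful on a machine without LoadLeveler.

```shell
#!/bin/sh
# Sketch only: jdoe and the step ID are placeholders, not real values.
STEP_ID="${1:-sp01.1234.0}"   # hypothetical job step ID; take the real one from llq output

if command -v llckpt >/dev/null 2>&1; then
    llq -u jdoe           # list jdoe's job steps and their IDs
    llckpt "$STEP_ID"     # ask LoadLeveler to checkpoint that job step now
else
    echo "LoadLeveler commands not found; run this on the SP" >&2
fi
```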
To restart execution of a.out from the checkpoint file
when a job has aborted, use the following job command file:
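The original listing is not reproduced in this copy. The following is a minimal sketch of what the restart file might look like, not the original file. As before, the job_type, executable, output, and error lines and the names my_test.out and my_test.err are assumptions; the checkpoint settings follow the description in this article.

```shell
# Sketch of a LoadLeveler job command file that restarts a.out from its checkpoint.
#
# Assumed lines (not from the original article):
# @ job_type          = serial
# @ executable        = a.out
# @ output            = my_test.out
# @ error             = my_test.err
#
# Checkpoint settings; restart_from_ckpt is now yes:
# @ checkpoint        = interval
# @ ckpt_dir          = /gpfs/jdoe
# @ ckpt_file         = my_test_1
# @ ckpt_time_limit   = rlim_infinity
# @ restart           = no
# @ restart_from_ckpt = yes
# @ queue
```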
The only difference between this file and the one above is that the
value of restart_from_ckpt is changed from
no to yes.
Notes
- Between the time the running program stops, for whatever
reason, and the time the restart job command file is submitted, the
output, error, and checkpoint files should not be changed. If they
are, the restart might fail. In addition, any intermediate temporary
files need to be present and unchanged when the job is restarted (such
as Gaussian's read-write file and other work files).
- Checkpointing will slow down execution of your program, especially for a program that has large memory images. This is because the system has to take time to write that image onto the disk.
Last modified on November 03, 2005.






