Indiana University
University Information Technology Services
  

ARCHIVED: On the Research SP, how do I checkpoint a serial job?

The Research SP system supports kernel-level checkpointing. A serial program can be checkpointed without recompilation; the only change required is to the job command file that you submit to LoadLeveler.

The job command file must contain the checkpointing keywords, which instruct LoadLeveler to enable checkpointing for the job. After the job has been submitted to LoadLeveler and started, you can issue the llckpt command to checkpoint the program; with the appropriate keywords, LoadLeveler can also checkpoint the job periodically.

The keywords to add are checkpoint, ckpt_dir, ckpt_file, ckpt_time_limit, and restart_from_ckpt. Their default values on the Research SP and their meanings are as follows:

Keyword            Default Value                          Description
checkpoint         No                                     Indicates whether a job step should be enabled for checkpoint
ckpt_dir           Initial working directory of the job   The location to be used for checkpoint files
ckpt_file          [jobname.]job_step_id.ckpt             The base name to be used for the checkpoint file
ckpt_time_limit    Unlimited                              Amount of time a job step can take to checkpoint
restart_from_ckpt  No                                     Indicates whether a job step is to be restarted from an existing checkpoint file

It is important to choose the value of the ckpt_dir keyword carefully. Because the checkpoint file records the running image of a program, it is usually very large. The default value of ckpt_dir is the initial working directory of the job, but that is not always the best place for a large checkpoint file. For example, if you run the program in your home directory, writing the checkpoint file there is not advisable, since you may easily exceed your quota. Another consideration is reliability: the checkpoint file must be present when the system needs to restart a job from it. For example, /tmp is not an appropriate place for checkpoint files, since LoadLeveler will very likely restart a job on a different node from the one where the job previously ran.
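The considerations above can be turned into a quick pre-submission check. The following is a minimal sketch, not something LoadLeveler does for you: check_ckpt_dir is a hypothetical helper, and the three rules it applies (not under /tmp, not under the home directory, existing and writable) are assumptions drawn from the advice in this document.

```shell
#!/bin/sh
# Sketch: sanity-check a candidate ckpt_dir before submitting a job.
# check_ckpt_dir is a hypothetical helper; LoadLeveler itself does not
# perform these checks.

check_ckpt_dir() {
    dir=$1

    # /tmp may not be visible if the job restarts on a different node.
    case "$dir" in
        /tmp|/tmp/*)
            echo "reject: $dir is node-local"
            return 1
            ;;
    esac

    # Checkpoint files are large; keep them out of quota-limited home space.
    if [ -n "${HOME:-}" ]; then
        case "$dir" in
            "$HOME"|"$HOME"/*)
                echo "reject: $dir is under the home directory"
                return 1
                ;;
        esac
    fi

    # The directory must already exist and be writable by the submitting user.
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "reject: $dir is missing or not writable"
        return 1
    fi

    echo "ok: $dir"
    return 0
}
```

Calling check_ckpt_dir /gpfs/jdoe before submitting would report whether that directory passes these illustrative checks; a site may have further constraints (such as file system quotas) that this sketch does not cover.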

For example, if your username were jdoe and you had a serial program called a.out, you could use the following job command file to submit a job with checkpointing enabled:

#@ class=b
#@ job_type=serial
#@ restart=no
#@ executable = a.out
#@ checkpoint=interval
#@ ckpt_dir=/gpfs/jdoe
#@ ckpt_file = my_test_1
#@ ckpt_time_limit = rlim_infinity
#@ restart_from_ckpt = no
#@ output = out
#@ error = error
#@ queue

The line #@ checkpoint=interval instructs the system to checkpoint the program periodically. By default, the system checkpoints the program at intervals ranging from 15 minutes to 2 hours; the specific interval depends on the system load and other factors. The checkpoint files are written to the /gpfs/jdoe directory and are named my_test_1.[tag], where [tag] is the index of the checkpoint file (0, 1, 2...) and distinguishes the current checkpoint file from the previous one. New checkpoint files overwrite the old ones. The line #@ ckpt_time_limit = rlim_infinity indicates that there is no limit on how long a checkpoint may take.

The other important keyword is #@ restart=no, which instructs LoadLeveler not to requeue the job if it is vacated from the executing machine before completion. This tells LoadLeveler that you will resubmit the job manually. If LoadLeveler requeued the job automatically, the job would be restarted from the beginning, which is probably not what you want. If the job aborts, you need to restart it manually from the checkpoint. Note that when the job is first submitted, the restart_from_ckpt keyword must be set to no, so that LoadLeveler does not try to restart it from a non-existent checkpoint file.

Note: You can also set the checkpoint keyword to yes to enable checkpointing. In that case, however, the system will not checkpoint the program automatically; you will need to run the llckpt command periodically to trigger each checkpoint.
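If you do use checkpoint=yes, the periodic llckpt calls can be scripted. The following is only a sketch under stated assumptions: periodic_ckpt and the CKPT_CMD variable are hypothetical names, and the sketch assumes llckpt accepts the job step identifier as its argument, so check the llckpt documentation on your system before relying on it.

```shell
#!/bin/sh
# Sketch: request a checkpoint of a running job step at a fixed interval.
# CKPT_CMD defaults to llckpt; setting it to "echo" gives a dry run on a
# machine without LoadLeveler. Assumption: llckpt takes the job step ID
# as its argument.

CKPT_CMD=${CKPT_CMD:-llckpt}

periodic_ckpt() {
    step_id=$1     # job step ID (format depends on the site; assumed here)
    interval=$2    # seconds to wait before each checkpoint request
    count=$3       # number of checkpoints to request

    i=0
    while [ "$i" -lt "$count" ]; do
        sleep "$interval"
        # Stop early if the request fails, e.g. because the step has ended.
        $CKPT_CMD "$step_id" || return 1
        i=$((i + 1))
    done
}
```

For example, periodic_ckpt with an interval of 1800 and a count of 10 would request a checkpoint every 30 minutes for five hours. The job step ID passed in is whatever LoadLeveler reports for your job.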

To restart execution of a.out from the checkpoint file when a job has aborted, use the following job command file:

#@ class=b
#@ job_type=serial
#@ restart=no
#@ executable = a.out
#@ checkpoint=interval
#@ ckpt_dir=/gpfs/jdoe
#@ ckpt_file = my_test_1
#@ ckpt_time_limit = rlim_infinity
#@ restart_from_ckpt = yes
#@ output = out
#@ error = error
#@ queue

The only difference between this file and the one above is that the value of restart_from_ckpt changes from no to yes.
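Because the two job command files differ in a single value, one way to avoid keeping two nearly identical files in sync is to generate them from one template. The following is a minimal sketch; write_jcf is a hypothetical helper, and the keyword values are simply those of the jdoe example.

```shell
#!/bin/sh
# Sketch: emit the example job command file, varying only restart_from_ckpt.
# write_jcf is a hypothetical helper, not a LoadLeveler tool.

write_jcf() {
    restart_flag=$1   # "no" for the first submission, "yes" to resume
    cat <<EOF
#@ class=b
#@ job_type=serial
#@ restart=no
#@ executable = a.out
#@ checkpoint=interval
#@ ckpt_dir=/gpfs/jdoe
#@ ckpt_file = my_test_1
#@ ckpt_time_limit = rlim_infinity
#@ restart_from_ckpt = $restart_flag
#@ output = out
#@ error = error
#@ queue
EOF
}
```

Running write_jcf no and redirecting the output to a file produces the file for the first submission; write_jcf yes produces the restart file. Either file would then be submitted to LoadLeveler in the usual way (typically with llsubmit).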

Notes

  • From the time the running program stops, for whatever reason, until the restart job command file is submitted, the output, error, and checkpoint files must not be changed. If they are, the restart might fail. In addition, any intermediate temporary files (such as Gaussian's read-write file and other work files) must be present and unchanged when the job is restarted.

  • Checkpointing will slow down execution of your program, especially if the program has a large memory image, because the system has to take time to write that image to disk.
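The first note above (files must remain unchanged between the stop and the restart) can be checked mechanically. The following is a sketch only: snapshot_files and verify_files are hypothetical helpers, and the sketch assumes the standard cksum utility is available and that file names contain no newlines.

```shell
#!/bin/sh
# Sketch: detect changes to the files a restart depends on.
# snapshot_files and verify_files are hypothetical helpers built on cksum.

# Record one checksum line per file into the manifest given as $1.
snapshot_files() {
    manifest=$1
    shift
    cksum "$@" > "$manifest"
}

# Recompute checksums for the files listed in the manifest and compare.
# Returns 0 only if every file is still present and byte-identical.
verify_files() {
    manifest=$1
    # Each cksum line is: checksum, size, file name.
    while read -r sum size name; do
        [ -f "$name" ] || { echo "missing: $name"; return 1; }
        current=$(cksum "$name")
        [ "$current" = "$sum $size $name" ] || { echo "changed: $name"; return 1; }
    done < "$manifest"
    echo "unchanged"
}
```

You might run snapshot_files right after the job stops, passing a manifest path followed by the output, error, checkpoint, and work files, and then run verify_files on that manifest just before submitting the restart job command file. Keep the manifest itself somewhere that will not be modified in the meantime.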
This is document aqpa in domain all.
Last modified on November 03, 2005.