Job Checkpoint, Restart, and Migration
Job checkpoint and restart optimizes resource usage by enabling a non-interactive job to restart on a new host from the point at which the job stopped; checkpointed jobs do not have to restart from the beginning. Job migration facilitates load balancing by enabling users to move a job from one host to another while taking advantage of job checkpoint and restart functionality.
- Checkpoint and restart options
- Checkpoint directory and files
- Checkpoint and restart executables
- Job restart
- Job migration
Checkpoint and restart options
You can implement job checkpoint and restart at one of the following levels.
- Kernel level: provided by your operating system, enabled by default
- User level: provided by special LSF libraries that you link to your application object files
- Application level: provided by your site-specific applications and supported by LSF through the use of application-specific echkpnt and erestart executables
Note: For a detailed description of the job checkpoint and restart feature and how to configure it, see the Platform LSF Configuration Reference.
Checkpoint directory and files
The job checkpoint and restart feature requires that a job be made checkpointable at the job, application profile, or queue level. LSF users can make a job checkpointable by submitting it with bsub -k and specifying a checkpoint directory and, optionally, a checkpoint period, an initial checkpoint period, and a checkpoint method. Administrators can make all jobs in a queue or an application profile checkpointable by specifying a checkpoint directory for the queue or application profile.
The following requirements apply to a checkpoint directory specified at the queue or application profile level:
- The specified checkpoint directory must already exist. LSF does not create the checkpoint directory.
- The user account that submits the job must have read and write permissions for the checkpoint directory.
- For the job to restart on another execution host, both the original and new hosts must have network connectivity to the checkpoint directory.
Specifying a checkpoint directory at the queue level or in an application profile enables checkpointing.
- All jobs submitted to the queue or application profile are checkpointable. LSF writes the checkpoint files, which contain job state information, to the checkpoint directory. The checkpoint directory can contain checkpoint files for multiple jobs.
- If the administrator specifies a checkpoint period, in minutes, LSF creates a checkpoint file every chkpnt_period during job execution.
- If the administrator specifies an initial checkpoint period in an application profile, in minutes, the first checkpoint does not happen until the initial period has elapsed. LSF then creates a checkpoint file every chkpnt_period after the initial checkpoint period, during job execution.
- If a user specifies a checkpoint directory, initial checkpoint period, checkpoint method, or checkpoint period at the job level with bsub -k, or modifies the job with bmod, the job-level values override the queue-level and application profile values.
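The job-level options above can be combined in a single bsub -k submission. The following sketch is illustrative, not from the reference: myjob is a hypothetical job, mydir is an existing checkpoint directory that the submitting user can write to, and the init= and method= keywords set the initial checkpoint period and checkpoint method:

```
bsub -k "mydir init=10 method=default 240" myjob
```

This requests checkpoint directory mydir, a first checkpoint after 10 minutes, the default checkpoint method, and a checkpoint every 240 minutes thereafter.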
The brestart command restarts checkpointed jobs that have stopped running.
Precedence of checkpointing options
If checkpoint-related configuration is specified in both the queue and an application profile, the application profile setting overrides queue level configuration.
If checkpoint-related configuration is specified in the queue, application profile, and at job level:
- Application-level and job-level parameters are merged. If the same parameter is defined at both job-level and in the application profile, the job-level value overrides the application profile value.
- The merged result of job-level and application profile settings override queue-level configuration.
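As an illustration of this precedence, consider a hypothetical configuration (the directory and period values here are examples, not defaults):

```
# lsb.queues:         CHKPNT=/share/ckpt 240
# lsb.applications:   CHKPNT_PERIOD=120
# job submission:     bsub -k "/share/ckpt 60" ...
```

With all three defined, the job-level period of 60 minutes applies. If the job is submitted without -k, the application profile's 120-minute period overrides the queue's 240 minutes; the queue value applies only when neither of the other two is set.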
Checkpointing MultiCluster jobs
To enable checkpointing of MultiCluster jobs, define a checkpoint directory in both the send-jobs and receive-jobs queues (CHKPNT in lsb.queues), or in an application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD, and CHKPNT_METHOD in lsb.applications) of both the submission and execution clusters. LSF uses the directory specified in the execution cluster.
Checkpointing is not supported if a job runs on a leased host.
The following example shows a queue configured for periodic checkpointing in lsb.queues:

Begin Queue
...
QUEUE_NAME=checkpoint
CHKPNT=mydir 240
DESCRIPTION=Automatically checkpoints jobs every 4 hours to mydir
...
End Queue
Note: The bqueues command displays the checkpoint period in seconds; the lsb.queues CHKPNT parameter defines the checkpoint period in minutes.
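The unit difference in the note above is a common source of confusion: the CHKPNT value of 240 in the example queue appears as 14400 in bqueues output. A quick sanity check of the conversion:

```shell
# Checkpoint period as configured in lsb.queues (minutes)
chkpnt_minutes=240

# bqueues reports the same period in seconds
chkpnt_seconds=$((chkpnt_minutes * 60))

echo "$chkpnt_seconds"   # 14400
```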
If the command bchkpnt -k 123 is used to checkpoint and kill job 123, you can restart the job using the brestart command as shown in the following example:
brestart -q priority mydir 123
Job <456> is submitted to queue <priority>
LSF assigns a new job ID of 456, submits the job to the queue named "priority," and restarts the job.
Once job 456 is running, you can change the checkpoint period using the bchkpnt command:
bchkpnt -p 360 456
Job <456> is being checkpointed
Note: For a detailed description of the commands used with the job checkpoint and restart feature, see the Platform LSF Configuration Reference.
Checkpoint and restart executables
LSF controls checkpointing and restart by means of interfaces named echkpnt and erestart. By default, when a user specifies a checkpoint directory using bsub -k or bmod -k or submits a job to a queue that has a checkpoint directory specified, echkpnt sends checkpoint instructions to an executable named echkpnt.default.
For application-level job checkpoint and restart, you can specify customized checkpoint and restart executables for each application that you use. The optional parameter LSB_ECHKPNT_METHOD specifies a checkpoint executable used for all jobs in the cluster. An LSF user can override this value when submitting a job.
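For example, assuming a site-specific checkpoint method named myapp (a hypothetical name) and an example directory path, the cluster-wide default can be set in lsf.conf; LSF then looks for executables named echkpnt.myapp and erestart.myapp in LSF_SERVERDIR, unless LSB_ECHKPNT_METHOD_DIR points elsewhere:

```
# lsf.conf
LSB_ECHKPNT_METHOD=myapp
LSB_ECHKPNT_METHOD_DIR=/usr/share/lsf/checkpoint
```

A user can still override this default for a single job, for example with bsub -k "mydir method=default 240", which selects the built-in echkpnt.default instead.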
Note: For a detailed description of how to write and configure application-level checkpoint and restart executables, see the Platform LSF Configuration Reference.
Job restart
LSF can restart a checkpointed job on a host other than the original execution host using the information saved in the checkpoint file to recreate the execution environment. Only jobs that have been checkpointed successfully can be restarted from a checkpoint file. When a job restarts, LSF performs the following actions:
- LSF resubmits the job to its original queue as a new job and assigns a new job ID.
- When a suitable host becomes available, LSF dispatches the job.
- LSF recreates the execution environment from the checkpoint file.
- LSF restarts the job from its last checkpoint. You can restart a job manually from the command line using brestart, automatically through configuration, or by migrating the job to a different host using bmig.
To allow restart of a checkpointed job on a different host than the host on which the job originally ran, both the original and the new hosts must:
- Be binary compatible
- Run the same dot version of the operating system for predictable results
- Have network connectivity and read/execute permissions to the checkpoint and restart executables (in LSF_SERVERDIR by default)
- Have network connectivity and read/write permissions to the checkpoint directory and the checkpoint file
- Have access to all files open during job execution so that LSF can locate them using an absolute path name
Job migration
Job migration is the process of moving a checkpointable or rerunnable job from one host to another. This facilitates load balancing by moving jobs from a heavily loaded host to a lightly loaded host.
You can initiate job migration manually on demand (bmig) or automatically. To initiate job migration automatically, you can configure a migration threshold at job submission, at the host or queue level, or in an application profile.
Note: For a detailed description of the job migration feature and how to configure it, see the Platform LSF Configuration Reference.
Manual job migration
The bmig command migrates checkpointable or rerunnable jobs on demand. Jobs can be migrated manually by the job owner, the queue administrator, or the LSF administrator.
For example, to migrate a job with job ID 123 to the first available host:
bmig 123
Job <123> is being migrated
Automatic job migration
Automatic job migration assumes that if a job is system-suspended (SSUSP) for an extended period of time, the execution host is probably heavily loaded. Specifying a migration threshold at job submission (bsub -mig) or configuring an application profile-level, queue-level or host-level migration threshold allows the job to progress and reduces the load on the host. You can use bmig at any time to override a configured migration threshold, or bmod -mig to change a job-level migration threshold.
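For example, the following command sketch (the job name and job ID are illustrative) sets a 30-minute migration threshold at submission, later raises it, and finally overrides it by migrating immediately:

```
# Submit with a job-level migration threshold of 30 minutes
bsub -mig 30 myjob

# Change the threshold of the submitted job to 45 minutes
bmod -mig 45 123

# Migrate the job now, regardless of any configured threshold
bmig 123
```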
For example, at the queue level, in lsb.queues:

Begin Queue
...
MIG=30    # Migration threshold set to 30 mins
DESCRIPTION=Migrate suspended jobs after 30 mins
...
End Queue
At the host level, in lsb.hosts:

Begin Host
HOST_NAME   r1m   pg   MIG   # Keywords
...
hostA       5.0   18   30
...
End Host
For example, in an application profile, in lsb.applications:

Begin Application
...
MIG=30    # Migration threshold set to 30 mins
DESCRIPTION=Migrate suspended jobs after 30 mins
...
End Application
If you want to requeue migrated jobs instead of restarting or rerunning them, you can define the relevant parameters in lsf.conf; see the Platform LSF Configuration Reference for details.
Platform Computing Inc.