Platform Computing Corp.

Job Checkpoint, Restart, and Migration

Job checkpoint and restart optimizes resource usage by enabling a non-interactive job to restart on a new host from the point at which the job stopped; checkpointed jobs do not have to restart from the beginning. Job migration facilitates load balancing by enabling users to move a job from one host to another while taking advantage of job checkpoint and restart functionality.

Checkpoint and restart options

You can implement job checkpoint and restart at one of the following levels.

note:  
For a detailed description of the job checkpoint and restart feature and how to configure it, see the Platform LSF Configuration Reference.

Checkpoint directory and files

The job checkpoint and restart feature requires that a job be made checkpointable at the job, application profile, or queue level. LSF users can make a job checkpointable by submitting the job using bsub -k and specifying a checkpoint directory, and optional checkpoint period, initial checkpoint period, and checkpoint method. Administrators can make all jobs in a queue or an application profile checkpointable by specifying a checkpoint directory for the queue or application.
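
As a minimal sketch of the job-level form, the following shell helper assembles the quoted argument for bsub -k. The helper name ckpt_spec, the directory /share/ckpt, and the job command my_job are hypothetical, and the field order (directory, init=, period, method=) is an assumption based on the bsub reference; confirm it against your LSF version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: build the argument string for `bsub -k`.
# Assumed field order: dir [init=initial_period] [period] [method=name]
# Periods are in minutes. Empty arguments are skipped.
ckpt_spec() {
    dir=$1
    period=$2
    init=$3
    method=$4
    spec=$dir
    if [ -n "$init" ]; then spec="$spec init=$init"; fi
    if [ -n "$period" ]; then spec="$spec $period"; fi
    if [ -n "$method" ]; then spec="$spec method=$method"; fi
    printf '%s\n' "$spec"
}

# Print the command rather than running it, so the sketch works without LSF.
echo "bsub -k \"$(ckpt_spec /share/ckpt 240 60)\" my_job"
```

Printing the command instead of executing it keeps the sketch usable on a machine without LSF installed; on a cluster, the same string could be passed to bsub directly.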

Requirements

The following requirements apply to a checkpoint directory specified at the queue or application profile level:

Behavior

Specifying a checkpoint directory at the queue level or in an application profile enables checkpointing.

The brestart command restarts checkpointed jobs that have stopped running.

Precedence of checkpointing options

If checkpoint-related configuration is specified in both the queue and an application profile, the application profile setting overrides queue level configuration.
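
The two-level rule above can be modeled with a small shell function. This is an illustrative sketch only: effective_setting and the paths are hypothetical, and it models a single setting; whether LSF applies the override per parameter or to the whole checkpoint configuration is not stated here.

```shell
#!/bin/sh
# Hypothetical model of the rule above: an application profile value,
# when present, overrides the queue-level value.
effective_setting() {
    queue_value=$1
    app_value=$2
    if [ -n "$app_value" ]; then
        printf '%s\n' "$app_value"
    else
        printf '%s\n' "$queue_value"
    fi
}

effective_setting /ckpt/queue /ckpt/app   # application profile value is used
effective_setting /ckpt/queue ""          # falls back to the queue value
```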

If checkpoint-related configuration is specified in the queue, application profile, and at job level:

Checkpointing MultiCluster jobs

To enable checkpointing of MultiCluster jobs, define a checkpoint directory in both the send-jobs and receive-jobs queues (CHKPNT in lsb.queues), or in an application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD, CHKPNT_METHOD in lsb.applications) of both the submission cluster and the execution cluster. LSF uses the directory specified in the execution cluster.

Checkpointing is not supported if a job runs on a leased host.

Example

The following example shows a queue configured for periodic checkpointing in lsb.queues:

Begin Queue 
... 
QUEUE_NAME=checkpoint 
CHKPNT=mydir 240 
DESCRIPTION=Automatically checkpoints jobs every 4 hours to mydir 
... 
End Queue

note:
The bqueues command displays the checkpoint period in seconds; the lsb.queues CHKPNT parameter defines the checkpoint period in minutes.
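
Because the two tools use different units, it can help to convert explicitly when comparing them. The helper below is a small sketch; the function name is hypothetical.

```shell
#!/bin/sh
# Convert a CHKPNT period (minutes, as written in lsb.queues)
# to the seconds value that bqueues reports.
chkpnt_minutes_to_seconds() {
    echo $(( $1 * 60 ))
}

chkpnt_minutes_to_seconds 240   # the 4-hour period from the example queue
```

For the example queue, CHKPNT=mydir 240 corresponds to 14400 seconds.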

If the command bchkpnt -k 123 is used to checkpoint and kill job 123, you can restart the job using the brestart command as shown in the following example:

brestart -q priority mydir 123

Job <456> is submitted to queue <priority> 

LSF assigns a new job ID of 456, submits the job to the queue named "priority," and restarts the job.
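
Because the restarted job gets a new ID, a script that restarts jobs usually needs to capture it. The filter below is a sketch that relies only on the message format shown in the example above; the function name is hypothetical, and it assumes the message is written to standard output.

```shell
#!/bin/sh
# Hypothetical helper: extract the job ID from a line such as
#   Job <456> is submitted to queue <priority>
# Lines that do not match this format produce no output.
job_id_from_output() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

echo 'Job <456> is submitted to queue <priority>' | job_id_from_output
```

On a cluster, the same filter could be applied to the restart command itself, for example new_id=$(brestart -q priority mydir 123 | job_id_from_output), so that later commands can target the new job.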

Once job 456 is running, you can change the checkpoint period using the bchkpnt command:

bchkpnt -p 360 456

Job <456> is being checkpointed

note:
For a detailed description of the commands used with the job checkpoint and restart feature, see the Platform LSF Configuration Reference.

Checkpoint and restart executables

LSF controls checkpointing and restart by means of interfaces named echkpnt and erestart. By default, when a user specifies a checkpoint directory using bsub -k or bmod -k or submits a job to a queue that has a checkpoint directory specified, echkpnt sends checkpoint instructions to an executable named echkpnt.default.

For application-level job checkpoint and restart, you can specify customized checkpoint and restart executables for each application that you use. The optional parameter LSB_ECHKPNT_METHOD specifies a checkpoint executable used for all jobs in the cluster. An LSF user can override this value when submitting a job.
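
The override order described above can be sketched as follows. This is a model, not LSF's implementation: the function name and the method name myapp are hypothetical, and the echkpnt.<method> and erestart.<method> naming is an assumption generalized from echkpnt.default.

```shell
#!/bin/sh
# Hypothetical model of method selection: a method given at submission
# overrides LSB_ECHKPNT_METHOD, which overrides the built-in "default".
# Assumes executables are named echkpnt.<method> and erestart.<method>.
checkpoint_executables() {
    job_method=$1
    method=${job_method:-${LSB_ECHKPNT_METHOD:-default}}
    printf 'echkpnt.%s erestart.%s\n' "$method" "$method"
}

checkpoint_executables            # uses LSB_ECHKPNT_METHOD, else "default"
checkpoint_executables myapp      # "myapp" is a hypothetical method name
```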

note:  
For a detailed description of how to write and configure application-level checkpoint and restart executables, see the Platform LSF Configuration Reference.

Job restart

LSF can restart a checkpointed job on a host other than the original execution host using the information saved in the checkpoint file to recreate the execution environment. Only jobs that have been checkpointed successfully can be restarted from a checkpoint file. When a job restarts, LSF performs the following actions:

  1. LSF resubmits the job to its original queue as a new job and assigns a new job ID.
  2. When a suitable host becomes available, LSF dispatches the job.
  3. LSF recreates the execution environment from the checkpoint file.
  4. LSF restarts the job from its last checkpoint.

You can restart a job manually from the command line using brestart, automatically through configuration, or by migrating the job to a different host using bmig.

Requirements

To allow restart of a checkpointed job on a different host than the host on which the job originally ran, both the original and the new hosts must:

Job migration

Job migration is the process of moving a checkpointable or rerunnable job from one host to another. This facilitates load balancing by moving jobs from a heavily loaded host to a lightly loaded host.

You can initiate job migration manually on demand (bmig) or automatically. To initiate job migration automatically, you can configure a migration threshold at job submission or at the host, queue, or application profile level.

note:  
For a detailed description of the job migration feature and how to configure it, see the Platform LSF Configuration Reference.

Manual job migration

The bmig command migrates checkpointable or rerunnable jobs on demand. Jobs can be manually migrated by the job owner, the queue administrator, or the LSF administrator.

For example, to migrate a job with job ID 123 to the first available host:

bmig 123

Job <123> is being migrated 

Automatic job migration

Automatic job migration assumes that if a job is system-suspended (SSUSP) for an extended period of time, the execution host is probably heavily loaded. Specifying a migration threshold at job submission (bsub -mig) or configuring an application profile-level, queue-level or host-level migration threshold allows the job to progress and reduces the load on the host. You can use bmig at any time to override a configured migration threshold, or bmod -mig to change a job-level migration threshold.

For example, at the queue level, in lsb.queues:

Begin Queue 
  ... 
  MIG=30        # Migration threshold set to 30 mins 
  DESCRIPTION=Migrate suspended jobs after 30 mins
  ... 
End Queue 

At the host level, in lsb.hosts:

Begin Host
  HOST_NAME   r1m   pg   MIG # Keywords
  ...
  hostA       5.0   18   30
  ...
End Host 

In an application profile, in lsb.applications:

Begin Application 
  ... 
  MIG=30        # Migration threshold set to 30 mins 
  DESCRIPTION=Migrate suspended jobs after 30 mins
  ... 
End Application 
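
The effect of a threshold such as MIG=30 can be modeled with a short shell function. This is a sketch under a stated assumption: a job becomes a migration candidate once it has been system-suspended for longer than the threshold, and the exact boundary behavior in LSF may differ. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical model of the automatic migration decision.
# Assumption: a job becomes a migration candidate once it has been
# system-suspended (SSUSP) for longer than the threshold, in minutes.
should_migrate() {
    suspended_minutes=$1
    mig_threshold=$2
    [ "$suspended_minutes" -gt "$mig_threshold" ]
}

if should_migrate 45 30; then
    echo "suspended 45 min, MIG=30: migrate"
fi
if ! should_migrate 10 30; then
    echo "suspended 10 min, MIG=30: keep waiting"
fi
```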

If you want to requeue migrated jobs instead of restarting or rerunning them, you can define the following parameters in lsf.conf:


Platform Computing Inc.
www.platform.com