Job migration

The job migration feature enables you to move checkpointable and rerunnable jobs from one host to another. Job migration makes use of job checkpoint and restart so that a migrated checkpointable job restarts on the new host from the point at which the job stopped on the original host.

Contents

  • About job migration

  • Scope

  • Configuration to enable job migration

  • Job migration behavior

  • Configuration to modify job migration

  • Job migration commands

About job migration

Job migration refers to the process of moving a checkpointable or rerunnable job from one host to another. This facilitates load balancing by moving jobs from a heavily-loaded host to a lightly-loaded host.

You can initiate job migration on demand (bmig) or automatically. To initiate job migration automatically, you configure a migration threshold at the host or queue level.

[Figure: Default behavior (job migration not enabled)]

[Figure: With automatic job migration enabled]

Scope


Applicability

Details

Operating system

  • UNIX

  • Linux

  • Windows

Job types

  • Non-interactive batch jobs submitted with bsub or bmod, including chunk jobs

Dependencies

  • UNIX and Windows user accounts must be valid on all hosts in the cluster, or the correct type of account mapping must be enabled:
    • For a mixed UNIX/Windows cluster, UNIX/Windows user account mapping must be enabled

    • For a cluster with a non-uniform user name space, between-host account mapping must be enabled

    • For a MultiCluster environment with a non-uniform user name space, cross-cluster user account mapping must be enabled

  • Both the original and the new hosts must:
    • Be binary compatible

    • Run the same dot version of the operating system for predictable results

    • Have network connectivity and read/execute permissions to the checkpoint and restart executables (in LSF_SERVERDIR by default)

    • Have network connectivity and read/write permissions to the checkpoint directory and the checkpoint file

    • Have access to all files open during job execution so that LSF can locate them using an absolute path name


Configuration to enable job migration

The job migration feature requires that a job be made checkpointable at the job, application, or queue level, or rerunnable at the job or queue level. An LSF user can make a job:
  • Checkpointable, using bsub -k and specifying a checkpoint directory and checkpoint period, and an optional initial checkpoint period

  • Rerunnable, using bsub -r
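
The two submission options above can be sketched as a small helper that assembles a bsub argument list without running it. This is an illustrative sketch only: the function name bsub_args, the job command ./my_job, and the directory /share/chkpnt are hypothetical, and the init= keyword for the initial checkpoint period reflects the bsub -k syntax as commonly documented, so check it against the bsub reference for your LSF version.

```python
from typing import List, Optional


def bsub_args(
    command: List[str],
    chkpnt_dir: Optional[str] = None,
    chkpnt_period: Optional[int] = None,
    init_period: Optional[int] = None,
    rerunnable: bool = False,
) -> List[str]:
    """Build (but do not run) a bsub argument list for a migratable job."""
    args = ["bsub"]
    if chkpnt_dir is not None:
        # -k takes a single value: the checkpoint directory, followed by the
        # optional initial period and checkpoint period (both in minutes).
        spec = [chkpnt_dir]
        if init_period is not None:
            spec.append(f"init={init_period}")  # keyword assumed, see lead-in
        if chkpnt_period is not None:
            spec.append(str(chkpnt_period))
        args += ["-k", " ".join(spec)]
    if rerunnable:
        args.append("-r")
    return args + command


# Hypothetical usage: a checkpointable job with a 30-minute checkpoint period.
print(bsub_args(["./my_job"], chkpnt_dir="/share/chkpnt", chkpnt_period=30))
```

Returning a list rather than a single string keeps the -k value as one argument, which is what quoting achieves on the command line.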


Configuration file

Parameter and syntax

Behavior

lsb.queues

CHKPNT=chkpnt_dir [chkpnt_period]

  • All jobs submitted to the queue are checkpointable.
    • The specified checkpoint directory must already exist. LSF will not create the checkpoint directory.

    • The user account that submits the job must have read and write permissions for the checkpoint directory.

    • For the job to restart on another execution host, both the original and new hosts must have network connectivity to the checkpoint directory.

  • If the queue administrator specifies a checkpoint period, in minutes, LSF creates a checkpoint file every chkpnt_period minutes during job execution.

  • If a user specifies a checkpoint directory and checkpoint period at the job level with bsub -k, the job-level values override the queue-level values.

RERUNNABLE=Y

  • If the execution host becomes unavailable, LSF reruns the job from the beginning on a different host.

lsb.applications

CHKPNT_DIR=chkpnt_dir

  • Specifies the checkpoint directory for automatic checkpointing for the application. To enable automatic checkpoint for the application profile, administrators must specify a checkpoint directory in the configuration of the application profile.

  • If CHKPNT_PERIOD, CHKPNT_INITPERIOD, or CHKPNT_METHOD is set in an application profile but CHKPNT_DIR is not set, LSF issues a warning message and ignores those settings.

  • The checkpoint directory is the directory where the checkpoint files are created. Specify an absolute path or a path relative to the current working directory for the job. Do not use environment variables in the directory path.

  • If checkpoint-related configuration is specified in both the queue and an application profile, the application profile setting overrides queue level configuration.

CHKPNT_INITPERIOD=init_chkpnt_period

CHKPNT_PERIOD=chkpnt_period

CHKPNT_METHOD=chkpnt_method
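
The dependency of the other application-profile checkpoint parameters on CHKPNT_DIR can be sketched as a small validation function. This is a model of the rule described above, not LSF code: the function name check_app_profile, the dictionary representation of a profile, and the wording of the warning text are all hypothetical.

```python
from typing import Dict, List, Tuple

# Parameters that only take effect when CHKPNT_DIR is also set.
DEPENDENT = ("CHKPNT_PERIOD", "CHKPNT_INITPERIOD", "CHKPNT_METHOD")


def check_app_profile(profile: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return the checkpoint settings that take effect, plus any warnings."""
    chk = {
        key: value
        for key, value in profile.items()
        if key == "CHKPNT_DIR" or key in DEPENDENT
    }
    if "CHKPNT_DIR" in chk:
        return chk, []
    # Without a checkpoint directory the remaining settings are ignored.
    # The warning wording below is illustrative, not the real LSF message.
    warnings = [
        f"{key} ignored: CHKPNT_DIR is not set" for key in DEPENDENT if key in chk
    ]
    return {}, warnings


# Hypothetical profile that sets a period but no directory.
effective, warnings = check_app_profile({"CHKPNT_PERIOD": "30"})
print(effective, warnings)
```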


Configuration to enable automatic job migration

Automatic job migration assumes that if a job is system-suspended (SSUSP) for an extended period of time, the execution host is probably heavily loaded. Configuring a queue-level or host-level migration threshold lets the job resume on another, less loaded host and reduces the load on the original host. You can use bmig at any time to override a configured migration threshold.

Configuration file

Parameter and syntax

Behavior

lsb.queues

lsb.applications

MIG=minutes

  • LSF automatically migrates jobs that have been in the SSUSP state for more than the specified number of minutes

  • Specify a value of 0 to migrate jobs immediately upon suspension

  • Applies to all jobs submitted to the queue

  • The job-level migration threshold set on the command line (bsub -mig) overrides the thresholds configured in the application profile and the queue. Application profile configuration overrides queue-level configuration.

lsb.hosts

HOST_NAME     MIG
host_name     minutes
  • LSF automatically migrates jobs that have been in the SSUSP state for more than the specified number of minutes

  • Specify a value of 0 to migrate jobs immediately upon suspension

  • Applies to all jobs running on the host


Note:

When a host migration threshold is specified, and is lower than the value for the job, the queue, or the application, the host value is used.
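
The precedence rules for migration thresholds can be sketched as follows. This is a minimal model of the rules stated in this section, not LSF's implementation; the function name effective_mig_threshold is hypothetical, and the sketch assumes that a host threshold applies on its own when no job, application, or queue threshold is set.

```python
from typing import Optional


def effective_mig_threshold(
    job: Optional[int] = None,
    application: Optional[int] = None,
    queue: Optional[int] = None,
    host: Optional[int] = None,
) -> Optional[int]:
    """Return the threshold in minutes, or None if none is configured."""
    # Job level overrides the application profile, which overrides the queue.
    # Compare against None so that a threshold of 0 (migrate immediately) counts.
    selected = next(
        (value for value in (job, application, queue) if value is not None), None
    )
    if host is None:
        return selected
    if selected is None:
        return host  # assumption: a host threshold alone still applies
    # The host value is used only when it is lower than the selected value.
    return min(selected, host)


# Hypothetical values: job threshold of 30 minutes, host threshold of 10.
print(effective_mig_threshold(job=30, host=10))
```
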

Job migration behavior

LSF migrates a job by performing the following actions:
  1. Stops the job if it is running

  2. Checkpoints the job if the job is checkpointable

  3. Kills the job on the current host

  4. Restarts or reruns the job on the first available host, bypassing all pending jobs
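
The sequence above can be sketched as a function that lists the actions taken for a given job. The function name migration_actions and the action labels are hypothetical and only mirror the numbered steps; host selection and the actual checkpoint mechanism are not modeled.

```python
from typing import List


def migration_actions(running: bool, checkpointable: bool) -> List[str]:
    """List the migration steps for a job, in the order described above."""
    actions = []
    if running:
        actions.append("stop")          # step 1: only if the job is running
    if checkpointable:
        actions.append("checkpoint")    # step 2: only for checkpointable jobs
    actions.append("kill")              # step 3: always kill on the current host
    # Step 4: checkpointable jobs restart from the checkpoint;
    # jobs that are only rerunnable start again from the beginning.
    actions.append("restart" if checkpointable else "rerun")
    return actions


print(migration_actions(running=True, checkpointable=True))
```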

Configuration to modify job migration

You can configure LSF to requeue a migrating job rather than restart or rerun the job.

Configuration file

Parameter and syntax

Behavior

lsf.conf

LSB_MIG2PEND=1

  • LSF requeues a migrating job rather than restarting or rerunning the job

  • LSF requeues the job as pending in order of the original submission time and priority

  • In a MultiCluster environment, LSF ignores this parameter

LSB_REQUEUE_TO_BOTTOM=1

  • When LSB_MIG2PEND=1, LSF requeues a migrating job to the bottom of the queue, regardless of the original submission time and priority

  • If the queue defines APS scheduling, migrated jobs keep their APS information and compete with other pending jobs based on the APS value
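
The effect of the two parameters on where a migrating job lands among pending jobs can be sketched as follows. This is a simplified model under stated assumptions: the Job class and the function requeue_migrating_job are hypothetical, the pending list is assumed to be ordered by higher priority first and then by earlier submission time, and APS scheduling and MultiCluster behavior are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: int
    priority: int     # assumption: a higher value is scheduled earlier
    submit_time: int  # original submission time, for example epoch seconds


def requeue_migrating_job(
    pending: List[Job], job: Job, mig2pend: bool, requeue_to_bottom: bool
) -> List[Job]:
    """Return the pending list after a migration, leaving the input untouched."""
    if not mig2pend:
        # Without LSB_MIG2PEND=1 the job is restarted or rerun directly,
        # so the pending list does not change.
        return list(pending)
    if requeue_to_bottom:
        # LSB_REQUEUE_TO_BOTTOM=1: ignore submission time and priority.
        return list(pending) + [job]
    # LSB_MIG2PEND=1 only: place the job by priority and submission time.
    return sorted(pending + [job], key=lambda j: (-j.priority, j.submit_time))


pending = [Job(1, 50, 100), Job(2, 50, 300)]
migrating = Job(3, 50, 200)
print([j.job_id for j in requeue_migrating_job(pending, migrating, True, False)])
```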


Job migration commands

Commands for submission

Job migration applies to checkpointable or rerunnable jobs that are submitted with a migration threshold, or that have already started and are either running or suspended.


Command

Description

bsub -mig migration_threshold

  • Submits the job with the specified migration threshold for checkpointable or rerunnable jobs. Enables automatic job migration and specifies the migration threshold, in minutes. A value of 0 (zero) specifies that a suspended job should be migrated immediately.

  • Command-level job migration threshold overrides application profile and queue-level settings.

  • Where a host migration threshold is also specified, and is lower than the job value, the host value is used.


Commands to monitor


Command

Description

bhist -l

  • Displays the actions that LSF took on a completed job, including migration to another host

bjobs -l

  • Displays information about pending, running, and suspended jobs


Commands to control


Command

Description

bmig

  • Migrates one or more running jobs from one host to another. The jobs must be checkpointable or rerunnable

  • Checkpoints, kills, and restarts one or more checkpointable jobs; bmig combines the functionality of the bchkpnt and brestart commands into a single command

  • Migrates the job on demand even if you have configured queue-level or host-level migration thresholds

  • When absolute job priority scheduling (APS) is configured in the queue, LSF schedules migrated jobs before pending jobs; for migrated jobs, LSF maintains the existing job priority

bmod -mig migration_threshold | -mign

  • Modifies or cancels the migration threshold specified at job submission for checkpointable or rerunnable jobs. Enables or disables automatic job migration and specifies the migration threshold, in minutes. A value of 0 (zero) specifies that a suspended job should be migrated immediately.

  • Command-level job migration threshold overrides application profile and queue-level settings.

  • Where a host migration threshold is also specified, and is lower than the job value, the host value is used.
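
The control commands above can be sketched as helpers that assemble the argument lists without running them. The function names bmig_args and bmod_mig_args and the job IDs are hypothetical; the sketch assumes that both commands take job IDs as trailing arguments.

```python
from typing import List, Optional


def bmig_args(job_ids: List[int]) -> List[str]:
    """Build (but do not run) a bmig call that migrates the given jobs on demand."""
    return ["bmig"] + [str(job_id) for job_id in job_ids]


def bmod_mig_args(job_id: int, threshold: Optional[int]) -> List[str]:
    """Build a bmod call that sets or cancels a job's migration threshold."""
    if threshold is None:
        # -mign cancels the threshold that was set at submission.
        return ["bmod", "-mign", str(job_id)]
    if threshold < 0:
        raise ValueError("migration threshold must be 0 or more minutes")
    # A threshold of 0 migrates a suspended job immediately.
    return ["bmod", "-mig", str(threshold), str(job_id)]


# Hypothetical job IDs.
print(bmig_args([1234, 1235]))
print(bmod_mig_args(1234, 0))
print(bmod_mig_args(1234, None))
```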


Commands to display configuration


Command

Description

bhosts -l

  • Displays information about hosts configured in lsb.hosts, including the values defined for migration thresholds in minutes

bqueues -l

  • Displays information about queues configured in lsb.queues, including the values defined for migration thresholds
    Note:

    The bqueues command displays the migration threshold in seconds, whereas the lsb.queues MIG parameter defines the migration threshold in minutes.

badmin showconf

  • Displays all configured parameters and their values set in lsf.conf or ego.conf that affect mbatchd and sbatchd.

    Use a text editor to view other parameters in the lsf.conf or ego.conf configuration files.

  • In a MultiCluster environment, badmin showconf only displays the parameters of daemons on the local cluster.