
Running Parallel Jobs




blaunch Distributed Application Framework

Most MPI implementations and many distributed applications use rsh and ssh as their task launching mechanism. The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.

The following figure illustrates blaunch processing:

About the blaunch command

Similar to the LSF lsrun command, blaunch transparently connects directly to the RES/SBD on the remote host, creates and tracks the remote tasks, and provides the connection back to LSF. There is no need to insert pam or taskstarter into the rsh or ssh calling sequence, or to configure any wrapper scripts.

blaunch supports the same core command line options as rsh and ssh:

Whereas the host name value for rsh and ssh can only be a single host name, you can use the -z option to specify a space-delimited list of hosts where tasks are started in parallel. All other rsh and ssh options are silently ignored.

Important:


You cannot run blaunch directly from the command line as a standalone command.

blaunch only works within an LSF job; it can only be used to launch tasks on remote hosts that are part of a job allocation. On success, blaunch exits with 0.

blaunch is not supported on Windows.

See the Platform LSF Command Reference for more information about the blaunch command.

LSF APIs for the blaunch distributed application framework

LSF provides the following APIs for programming your own applications to use the blaunch distributed application framework:

See the Platform LSF API Reference for more information about these APIs.

The blaunch job environment

blaunch determines from the job environment what job it is running under and what the allocation for the job is by examining the environment variables LSB_JOBID, LSB_JOBINDEX, and LSB_MCPU_HOSTS. If any of these variables is missing, blaunch exits with a non-zero value. Similarly, if blaunch is used to start a task on a host not listed in LSB_MCPU_HOSTS, the command exits with a non-zero value.
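
For example, a job script can verify this environment before invoking blaunch (a minimal sketch; the message text and exit code are illustrative):

#!/bin/sh
# blaunch only works inside an LSF job allocation
if [ -z "$LSB_JOBID" ] || [ -z "$LSB_MCPU_HOSTS" ]; then
    echo "Not running under LSF; blaunch would exit with a non-zero value" >&2
    exit 1
fi
echo "Job $LSB_JOBID allocation: $LSB_MCPU_HOSTS"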

The job submission script contains the blaunch command in place of rsh or ssh. The blaunch command performs a sanity check of the environment for LSB_JOBID and LSB_MCPU_HOSTS. The blaunch command contacts the job RES to validate the information determined from the job environment. When the job RES receives the validation request from blaunch, it registers with the root sbatchd to handle signals for the job.

The job RES periodically requests resource usage for the remote tasks. This message also acts as a heartbeat for the job. If a resource usage request is not made within a certain period of time it is assumed the job is gone and that the remote tasks should be shut down. This timeout is configurable in an application profile in lsb.applications.

The blaunch command also honors the parameters LSB_CMD_LOG_MASK, LSB_DEBUG_CMD, and LSB_CMD_LOGDIR when defined in lsf.conf or as environment variables. The environment variables take precedence over the values in lsf.conf.

To ensure that no other users can run jobs on hosts allocated to tasks launched by blaunch, set LSF_DISABLE_LSRUN=Y in lsf.conf. When LSF_DISABLE_LSRUN=Y is defined, RES refuses remote connections from lsrun and lsgrun unless the user is either an LSF administrator or root. LSF_ROOT_REX must be defined for remote execution by root. Other remote execution commands, such as ch and lsmake, are not affected.

Temporary directory for tasks launched by blaunch

By default, LSF creates a temporary directory for a job only on the first execution host. If LSF_TMPDIR is set in lsf.conf, the path of the job temporary directory on the first execution host is set to LSF_TMPDIR/job_ID.tmpdir.

If LSB_SET_TMPDIR=Y, the environment variable TMPDIR is set to the path specified by LSF_TMPDIR. This value for TMPDIR overrides any value that might be set in the submission environment.

Tasks launched through the blaunch distributed application framework make use of the LSF temporary directory specified by LSF_TMPDIR:

Automatic generation of the job host file

LSF automatically places the allocated hosts for a job into the $LSB_HOSTS and $LSB_MCPU_HOSTS environment variables. Because most MPI implementations and parallel applications expect to read the allocated hosts from a file, LSF creates a host file in the default job output directory $HOME/.lsbatch on the execution host before the job runs, and deletes it after the job has finished running. The name of the host file has the format:

.lsb.<jobID>.hostfile

The host file contains one host per line. For example, if LSB_MCPU_HOSTS="hostA 2 hostB 2 hostC 1", the host file contains:

hostA
hostA
hostB
hostB
hostC

LSF publishes the full path to the host file by setting the environment variable LSB_DJOB_HOSTFILE.
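
For example, a job script can derive the task count and the list of unique hosts from this file (a minimal sketch; the file name my_unique_hosts is illustrative):

#!/bin/sh
# one line per task, so the line count is the number of tasks
NUMPROC=`wc -l < "$LSB_DJOB_HOSTFILE"`
# collapse duplicate lines to get one entry per allocated host
sort -u "$LSB_DJOB_HOSTFILE" > my_unique_hosts
echo "$NUMPROC tasks across `wc -l < my_unique_hosts` hosts"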

Configuring application profiles for the blaunch framework

Handle remote task exit

You can configure an application profile in lsb.applications to control the behavior of a parallel or distributed application when a remote task exits. Specify a value for RTASK_GONE_ACTION in the application profile to define what LSF does when a remote task exits.

The default behavior is:

When ...                            LSF ...
Task exits with zero value          Does nothing
Task exits with non-zero value      Does nothing
Task crashes                        Shuts down the entire job

RTASK_GONE_ACTION has the following syntax:

RTASK_GONE_ACTION="[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]"

Where:

  • KILLJOB_TASKDONE: LSF terminates all tasks in the job when a remote task exits with a zero value.
  • KILLJOB_TASKEXIT: LSF terminates all tasks in the job when a remote task exits with a non-zero value.
  • IGNORE_TASKCRASH: LSF does nothing when a remote task crashes; the job continues to run.

For example:

RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"

RTASK_GONE_ACTION only applies to the blaunch distributed application framework.

When defined in an application profile, the LSB_DJOB_RTASK_GONE_ACTION variable is set when running bsub -app for the specified application.

You can also use the environment variable LSB_DJOB_RTASK_GONE_ACTION to override the value set in the application profile.
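
A minimal application profile sketch in lsb.applications (the profile name blaunch_app is illustrative):

Begin Application
NAME              = blaunch_app
DESCRIPTION       = Example profile for the blaunch framework
RTASK_GONE_ACTION = "IGNORE_TASKCRASH KILLJOB_TASKEXIT"
End Application

Jobs then select this profile with bsub -app blaunch_app.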

Handle communication failure

By default, LSF shuts down the entire job if the connection with the task RES is lost, or if a validation or heartbeat timeout occurs. You can configure an application profile in lsb.applications so that only the current tasks are shut down, not the entire job.

Use DJOB_COMMFAIL_ACTION="KILL_TASKS" to define the behavior of LSF when it detects a communication failure between itself and one or more tasks. If not defined, LSF terminates all tasks, and shuts down the job. If set to KILL_TASKS, LSF tries to kill all the current tasks of a parallel or distributed job associated with the communication failure.

DJOB_COMMFAIL_ACTION only applies to the blaunch distributed application framework.

When defined in an application profile, the LSB_DJOB_COMMFAIL_ACTION environment variable is set when running bsub -app for the specified application.
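
For example, adding the following line to an application profile keeps the job running and kills only the current tasks associated with the communication failure:

DJOB_COMMFAIL_ACTION = "KILL_TASKS"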

Set up job launching environment

LSF can run a script that is responsible for setup and cleanup of the job launching environment. You can specify the name of this script in an application profile in lsb.applications.

Use DJOB_ENV_SCRIPT to define the path to a script that sets the environment for the parallel or distributed job launcher. The script runs as the user, and is part of the job. DJOB_ENV_SCRIPT only applies to the blaunch distributed application framework.

If a full path is specified, LSF uses the path name for the execution. If a full path is not specified, LSF looks for it in LSF_BINDIR.

The specified script must support a setup argument and a cleanup argument. LSF invokes the script with the setup argument before launching the actual job to set up the environment, and with the cleanup argument after the job is finished.

LSF assumes that if setup cannot be performed, the environment to run the job does not exist. If the script returns a non-zero value at setup, an error is printed to stderr of the job, and the job exits.

Regardless of the return value of the script at cleanup, the real job exit value is used. If the return value of the script is non-zero, an error message is printed to stderr of the job.

When defined in an application profile, the LSB_DJOB_ENV_SCRIPT variable is set when running bsub -app for the specified application.

For example, if DJOB_ENV_SCRIPT=mpich.script, LSF runs

$LSF_BINDIR/mpich.script setup

to set up the environment to run an MPICH job. After the job completes, LSF runs

$LSF_BINDIR/mpich.script cleanup

On cleanup, the mpich.script file could, for example, remove any temporary files and release resources used by the job. Changes to the LSB_DJOB_ENV_SCRIPT environment variable made by the script are visible to the job.
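
A skeleton for such a script (a minimal sketch; the setup and cleanup actions in the comments are placeholders for whatever your launcher actually requires):

#!/bin/sh
# LSF invokes this script as "<script> setup" before launching the job
# and as "<script> cleanup" after the job finishes.
case "$1" in
    setup)
        # start helper daemons, create scratch directories, and so on
        echo "setting up launch environment for job $LSB_JOBID"
        ;;
    cleanup)
        # stop daemons, remove temporary files, release resources
        echo "cleaning up launch environment for job $LSB_JOBID"
        ;;
    *)
        echo "usage: $0 setup|cleanup" >&2
        exit 1
        ;;
esac
exit 0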

Update job heartbeat and resource usage

Use DJOB_HB_INTERVAL in an application profile in lsb.applications to configure an interval in seconds used to update the heartbeat between LSF and the tasks of a parallel or distributed job. DJOB_HB_INTERVAL only applies to the blaunch distributed application framework.

When DJOB_HB_INTERVAL is specified, the interval is scaled according to the number of tasks in the job:

max(DJOB_HB_INTERVAL, 10) + host_factor

where

host_factor = 0.01 * number of hosts allocated for the job

When defined in an application profile, the LSB_DJOB_HB_INTERVAL variable is set in the parallel or distributed job environment. You should not manually change the value of LSB_DJOB_HB_INTERVAL.

By default, the interval is equal to SBD_SLEEP_TIME in lsb.params, where the default value of SBD_SLEEP_TIME is 30 seconds.
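
For example, if DJOB_HB_INTERVAL=60 and the job is allocated 200 hosts, the effective heartbeat interval is max(60, 10) + 0.01 * 200 = 62 seconds (an illustrative calculation using the formula above).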

Update job resource usage

Use DJOB_RU_INTERVAL in an application profile in lsb.applications to configure an interval in seconds used to update the resource usage for the tasks of a parallel or distributed job. DJOB_RU_INTERVAL only applies to the blaunch distributed application framework.

When DJOB_RU_INTERVAL is specified, the interval is scaled according to the number of tasks in the job:

max(DJOB_RU_INTERVAL, 10) + host_factor

where

host_factor = 0.01 * number of hosts allocated for the job

When defined in an application profile, the LSB_DJOB_RU_INTERVAL variable is set in the parallel or distributed job environment. You should not manually change the value of LSB_DJOB_RU_INTERVAL.

By default, the interval is equal to SBD_SLEEP_TIME in lsb.params, where the default value of SBD_SLEEP_TIME is 30 seconds.

How blaunch supports task geometry and process group files

The current support for task geometry in LSF requires the user submitting a job to specify the desired task geometry by setting the environment variable LSB_PJL_TASK_GEOMETRY in the submission environment before job submission. LSF checks for LSB_PJL_TASK_GEOMETRY and modifies LSB_MCPU_HOSTS appropriately.

The environment variable LSB_PJL_TASK_GEOMETRY is checked for all parallel jobs. If LSB_PJL_TASK_GEOMETRY is set and the user submits a parallel job (a job that requests more than one slot), LSF attempts to shape LSB_MCPU_HOSTS accordingly.

Resource collection for all commands in a job script

Parallel and distributed jobs are typically launched with a job script. If your job script runs multiple commands, you can ensure that resource usage is collected correctly for all commands in the job script by configuring LSF_HPC_EXTENSIONS=CUMULATIVE_RUSAGE in lsf.conf. Resource usage is then accumulated across all commands in the job script, rather than being overwritten each time a command is executed.
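
For example, in lsf.conf:

LSF_HPC_EXTENSIONS="CUMULATIVE_RUSAGE"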

Submitting jobs with blaunch

Use bsub to call blaunch, or to invoke an execution script that calls blaunch. The blaunch command assumes that bsub -n implies one task per job slot.
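
For example (a minimal sketch; the script name myjob.sh and the hostname command are illustrative):

bsub -n 4 ./myjob.sh

where myjob.sh uses the -z option described above to start one task on each allocated host:

#!/bin/sh
# LSB_MCPU_HOSTS has the form "hostA 2 hostB 2"; keep only the host names
HOSTS=`echo $LSB_MCPU_HOSTS | awk '{for (i = 1; i <= NF; i += 2) printf "%s ", $i}'`
blaunch -z "$HOSTS" hostname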

Example execution scripts

Launching MPICH-P4 tasks

To launch MPICH-P4 tasks through LSF using the blaunch framework, substitute the path to rsh or ssh with the path to blaunch. For example:

Sample mpirun script changes:

...
# Set default variables
AUTOMOUNTFIX="sed -e s@/tmp_mnt/@/@g"
DEFAULT_DEVICE=ch_p4
RSHCOMMAND="$LSF_BINDIR/blaunch"
SYNCLOC=/bin/sync
CC="cc"
...

You must also set special arguments for the ch_p4 device:

#! /bin/sh
#
# mpirun.ch_p4.args
#
# Special args for the ch_p4 device
setrshcmd="yes"
givenPGFile=0
case $arg in
...

Sample job submission script:

#! /bin/sh
#
# job script for MPICH-P4
#
#BSUB -n 2
#BSUB -R'span[ptile=1]'
#BSUB -o %J.out
#BSUB -e %J.err
NUMPROC=`wc -l $LSB_DJOB_HOSTFILE|cut -f 1 -d ' '`
mpirun -n $NUMPROC -machinefile $LSB_DJOB_HOSTFILE ./myjob

Launching ANSYS jobs

To launch an ANSYS job through LSF using the blaunch framework, substitute the path to rsh or ssh with the path to blaunch. For example:

#BSUB -o stdout.txt
#BSUB -e stderr.txt
# Note: This case statement should be used to set up any
# environment variables needed to run the different versions
# of Ansys. All versions in this case statement that have the
# string "version list entry" on the same line will appear as
# choices in the Ansys service submission page.

case $VERSION in
 10.0)  #version list entry
        export ANSYS_DIR=/usr/share/app/ansys_inc/v100/Ansys
        export ANSYSLMD_LICENSE_FILE=1051@licserver.company.com
       export MPI_REMSH=/opt/lsf/bin/blaunch
        program=${ANSYS_DIR}/bin/ansys100
        ;;
  *)
        echo "Invalid version ($VERSION) specified"
        exit 1
        ;;
esac

if [ -z "$JOBNAME" ]; then
    export JOBNAME=ANSYS-$$
fi

if [ $CPUS -eq 1 ]; then
    ${program} -p ansys -j $JOBNAME -s read -l en-us -b -i $INPUT $OPTS
else
    if [ $MEMORY_ARCH = "Distributed" ]; then
        HOSTLIST=`echo $LSB_HOSTS | sed s/" "/":1:"/g`
        ${program} -j $JOBNAME -p ansys -pp -dis -machines \
            ${HOSTLIST}:1 -i $INPUT $OPTS
    else
        ${program} -j $JOBNAME -p ansys -pp -dis -np $CPUS \
            -i $INPUT $OPTS
    fi
fi



OpenMP Jobs

Platform LSF provides the ability to start parallel jobs that use OpenMP to communicate between processes on shared-memory machines and MPI to communicate across networked and non-shared-memory machines.

This implementation allows you to specify the number of machines and to reserve an equal number of processors per machine. When the job is dispatched, PAM only starts one process per machine.

OpenMP specification

The OpenMP specifications are owned and managed by the OpenMP Architecture Review Board. See www.openmp.org for detailed information.

OpenMP esub

An esub for OpenMP jobs, esub.openmp, is installed with Platform LSF. The OpenMP esub sets the environment variable LSF_PAM_HOSTLIST_USE=unique and starts PAM.

Use bsub -a openmp to submit OpenMP jobs.

Submitting OpenMP jobs

To run an OpenMP job with MPI on multiple hosts, specify the number of processors and the number of processes per machine. For example, to reserve 32 processors and run 4 processes per machine:

bsub -a openmp -n 32 -R "span[ptile=4]" myOpenMPJob

myOpenMPJob runs across 8 machines (32/4=8) and PAM starts 1 MPI process per machine.

To run a parallel OpenMP job on a single host, specify the number of processors:

bsub -a openmp -n 4 -R "span[hosts=1]" myOpenMPJob



PVM Jobs

Parallel Virtual Machine (PVM) is a parallel programming system distributed by Oak Ridge National Laboratory. PVM programs are controlled by the PVM hosts file, which contains host names and other information.

PVM esub

An esub for PVM jobs, esub.pvm, is installed with Platform LSF. The PVM esub calls the pvmjob script.

Use bsub -a pvm to submit PVM jobs.

pvmjob script

The pvmjob shell script is invoked by esub.pvm to run PVM programs as parallel LSF jobs. The pvmjob script reads the LSF environment variables, sets up the PVM hosts file and then runs the PVM job. If your PVM job needs special options in the hosts file, you can modify the pvmjob script.

Example

For example, if the command line to run your PVM job is:

myjob data1 -o out1

the following command submits this job to run on 10 processors:

bsub -a pvm -n 10 myjob data1 -o out1

Other parallel programming packages can be supported in the same way.



SGI Vendor MPI Support

Compiling and linking your MPI program

You must use the SGI C compiler (cc by default). You cannot use mpicc to build your programs.

For example, use one of the following compilation commands to build the program mpi_sgi:
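
For instance, a typical SGI MPI link line looks like the following (illustrative; the exact flags depend on your installation):

cc -o mpi_sgi mpi_sgi.c -lmpi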

System requirements

SGI MPI has the following system requirements:

Use one of the following commands to determine your installation:

Configuring LSF to work with SGI MPI

To use 32-bit or 64-bit SGI MPI with Platform LSF, set the following parameters in lsf.conf:
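
For example, the SGI vendor MPI plugin is typically pointed to with LSF_VPLUGIN (an illustrative sketch; the library path depends on your SGI MPI installation):

LSF_VPLUGIN="/usr/lib32/libxmpi.so"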

libxmpi.so file permission

For PAM to access the libxmpi.so library, the file permission mode must be 755 (-rwxr-xr-x).
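
For example (the library path shown is illustrative):

chmod 755 /usr/lib32/libxmpi.so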

Array services authentication (Altix only)

For PAM jobs on Altix, the SGI Array Services daemon arrayd must be running and AUTHENTICATION must be set to NONE in the SGI array services authentication file /usr/lib/array/arrayd.auth (comment out the AUTHENTICATION NOREMOTE method and uncomment the AUTHENTICATION NONE method).

To run multihost MPI applications, you must also enable rsh without a password prompt between hosts:

The pam command

The pam command invokes the Platform Parallel Application Manager (PAM) to run parallel batch jobs in LSF. It uses the mpirun library and SGI array services to spawn the child processes needed for the parallel tasks that make up your MPI application. It starts these tasks on the systems allocated by LSF. The allocation includes the number of execution hosts needed, and the number of child processes needed on each host.

Using the pam -mpi option

The -mpi option on the bsub and pam command line is equivalent to mpirun in the SGI environment.

Using the pam -auto_place option

The -auto_place option on the pam command line tells the mpirun library to launch the MPI application according to the resources allocated by LSF.

Using the pam -n option

The -n option on the pam command line specifies the number of tasks that PAM should start.

You can use both bsub -n and pam -n in the same job submission. The number specified in the pam -n option should be less than or equal to the number specified by bsub -n. If the number of tasks specified with pam -n is greater than the number specified by bsub -n, the pam -n setting is ignored.

For example, you can specify:

bsub -n 5 pam -n 2 -mpi a.out

Here, the job requests 5 processors, but PAM only starts 2 parallel tasks.

Examples

Running a job

To run a job and have LSF select the host, the command:

mpirun -np 4 a.out

is entered as:

bsub -n 4 pam -mpi -auto_place a.out

Running a job on a single host

To run a single-host job and have LSF select the host, the command:

mpirun -np 4 a.out

is entered as:

bsub -n 4 -R "span[hosts=1]" pam -mpi -auto_place a.out

Running a job on multiple hosts

To run a multihost job (5 processors per host) and have LSF select the hosts, the following command:

mpirun hosta -np 5 a.out: hostb -np 5 a.out

is entered as:

bsub -n 10 -R "span[ptile=5]" pam -mpi -auto_place a.out

For a complete list of mpirun options and environment variable controls refer to the SGI mpirun man page.

Limitations



HP Vendor MPI Support

When you use mpirun in stand-alone mode, you specify host names to be used by the MPI job.

Automatic HP MPI library configuration

During installation, lsfinstall sets LSF_VPLUGIN in lsf.conf to the full path to the MPI library libmpirm.sl. For example:

LSF_VPLUGIN="/opt/mpi/lib/pa1.1/libmpirm.sl"

On Linux

On Linux hosts running HP MPI, you must manually set LSF_VPLUGIN in lsf.conf to the full path to the HP vendor MPI library libmpirm.so.

For example, if HP MPI is installed in /opt/hpmpi:

LSF_VPLUGIN="/opt/hpmpi/lib/linux_ia32/libmpirm.so"

The pam command

The pam command invokes the Platform Parallel Application Manager (PAM) to run parallel batch jobs in LSF. It uses the HP mpirun library to spawn the child processes needed for the parallel tasks that make up your MPI application. It starts these tasks on the systems allocated by LSF. The allocation includes the number of execution hosts needed, and the number of child processes needed on each host.

Automatic host allocation by LSF

Using the pam -mpi option

To achieve better resource utilization, you can have LSF manage the allocation of hosts, coordinating the start-up phase with mpirun.

This is done by preceding the regular HP MPI mpirun command with:

bsub pam -mpi

The -mpi option on the bsub and pam command line is equivalent to mpirun in the HP MPI environment. The -mpi option must be the first option of the pam command.

Examples

Running a job on a single host

For example, to run a single-host job and have LSF select the host, the command:

mpirun -np 14 a.out

is entered as:

bsub pam -mpi mpirun -np 14 a.out

Running a job on multiple hosts

For example, to run a multi-host job and have LSF select the hosts, the command:

mpirun -f appfile

is entered as:

bsub pam -mpi mpirun -f appfile

where appfile contains the following entries:

-h host1 -np 4 a.out
-h host2 -np 4 b.out

In this example, the hosts host1 and host2 are treated as symbolic names and refer to the actual hosts that LSF allocates to the job.

The a.out and b.out processes may run on a different host, depending on the resources available and LSF scheduling algorithms.

More details on mpirun

For a complete list of mpirun options and environment variable controls, refer to the mpirun man page and the HP MPI User's Guide.



LSF Generic Parallel Job Launcher Framework

Any parallel execution environment (for example a vendor MPI, or an MPI package like MPICH-GM, MPICH-P4, or LAM/MPI) can be made compatible with LSF using the generic parallel job launcher (PJL) framework.

All LSF Version 7 distributions support running parallel jobs with the generic PJL integration.


Vendor MPIs for SGI MPI and HP MPI are already integrated with Platform LSF.

The generic PJL integration is a framework that allows you to integrate any vendor's parallel job launcher with Platform LSF. PAM does not launch the parallel jobs directly, but manages the job to monitor job resource usage and provide job control over the parallel tasks.

System requirements



How the Generic PJL Framework Works

Terminology

First execution host

The host name at the top of the execution host list as determined by LSF. Starts PAM.

Execution hosts

The most suitable hosts to execute the batch job as determined by LSF

task

A process that runs on a host; the individual process of a parallel application

parallel job

A parallel job consists of multiple tasks that could be executed on different hosts.

PJL

(Parallel Job Launcher) Any executable script or binary capable of starting parallel tasks on all hosts assigned for a parallel job (for example, mpirun).

sbatchd

Slave Batch Daemons (SBDs) are batch job execution agents residing on the execution hosts. sbatchd receives jobs from mbatchd in the form of a job specification and starts RES to run the job according to the specification. sbatchd reports the batch job status to mbatchd whenever the job state changes.

mpirun.lsf

Reads the environment variable LSF_PJL_TYPE, and generates the appropriate pam command line to invoke the PJL. The esub programs provided in LSF_SERVERDIR set this variable to the proper type.

TS

(TaskStarter) An executable responsible for starting a parallel task on a host and reporting the process ID and host name to PAM. TS is located in LSF_BINDIR.

PAM

(Parallel Application Manager) The supervisor of any parallel LSF job. PAM allows LSF to collect resources used by the job and perform job control.

PAM starts the PJL and maintains connections with RES on all execution hosts. It collects resource usage, reports the resource usage of the tasks together with its own PID and PGID to sbatchd, propagates signals to all process groups and individual tasks, and cleans up tasks as needed.

PJL wrapper

A script that starts the PJL. The wrapper is typically used to set up the environment for the parallel job and to invoke the PJL.

RES

(Remote Execution Server) An LSF daemon running on each server host. Accepts remote execution requests to provide transparent and secure remote execution of jobs and tasks.

RES manages all remote tasks and forwards signals, standard I/O, resources consumption data, and parallel job information between PAM and the tasks.

Architecture

Running a parallel job using a non-integrated PJL

Without the generic PJL framework, the PJL starts tasks directly on each host, and manages the job.

Even if the MPI job was submitted through LSF, LSF never receives information about the individual tasks. LSF is not able to track job resource usage or provide job control.

If you simply replace PAM with a parallel job launcher that is not integrated with LSF, LSF loses control of the process and is not able to monitor job resource usage or provide job control. LSF never receives information about the individual tasks.

Using the generic PJL framework

PAM is the resource manager for the job. The key step in the integration is to place TS in the job startup hierarchy, just before the task starts. TS must be the parent process of each task in order to collect the task process ID (PID) and pass it to PAM.

The following figure illustrates the relationship between PAM, PJL, PJL wrapper, TS, and the parallel job tasks.

  1. Instead of starting the PJL directly, PAM starts the specified PJL wrapper on a single host.
  2. The PJL wrapper starts the PJL (for example, mpirun).
  3. Instead of starting tasks directly, PJL starts TS on each host selected to run the parallel job.
  4. TS starts the task.

Each TS reports its task PID and host name back to PAM. Now PAM can perform job control and resource usage collection through RES.

TaskStarter also collects the exit status of the task and reports it to PAM. When PJL exits, PAM exits with the same termination status as the PJL.

Integration methods

There are two ways to integrate the PJL.

Method 1

In this method, PAM rewrites the PJL command line to insert TS in the correct position, and set callback information for TS to communicate with PAM.

Use this method when:

For details, see Integration Method 1.

Method 2

In this method, you rewrite or wrap the PJL to include TS and callback information for TS to communicate with PAM. This method of integration is the most flexible, but may be more difficult to implement.

Use this method when:

For details, see Integration Method 2.

Error handling

  1. If PAM cannot start PJL, no tasks are started and PAM exits.
  2. If PAM does not receive all the TS registration messages (host name and PID) within the timeout specified by LSF_HPC_PJL_LOADENV_TIMEOUT in lsf.conf, it assumes that the job cannot be executed. It kills the PJL, kills all the tasks that have been successfully started (if any), and exits. The default for LSF_HPC_PJL_LOADENV_TIMEOUT is 300 seconds.
  3. If TS cannot start the task, it reports this to PAM and exits. When all tasks have reported, PAM checks to make sure all tasks have started. If any task did not start, PAM kills the PJL, sends a message to kill all the remote tasks that have been successfully started, and exits.
  4. If TS terminates before it can report the exit status of the task to PAM, PAM never receives all the exit statuses. It then exits when the PJL exits.
  5. If the PJL exits before all TS have registered the exit status of the tasks, then PAM assumes the parallel job is completed, and communicates with RES, which signals the tasks.

Using the pam -n option (SGI MPI only)

The -n option on the pam command line specifies the number of tasks that PAM should start.

You can use both bsub -n and pam -n in the same job submission. The number specified in the pam -n option should be less than or equal to the number specified by bsub -n. If the number of tasks specified with pam -n is greater than the number specified by bsub -n, the pam -n setting is ignored.

For example, you can specify:

bsub -n 5 pam -n 2 -mpi a.out

Here, 5 processors are reserved for the job, but PAM only starts 2 parallel tasks.

Custom job controls for parallel jobs

As with sequential LSF jobs, you can use the JOB_CONTROLS parameter in the queue (lsb.queues) to configure custom job controls for your parallel jobs.

If the custom job control contains a signal name (for example, SIGSTOP or SIGTSTP), Platform LSF propagates the signal to the PAM PGID and all parallel tasks.

If the custom job control contains a /bin/sh command line or script, Platform LSF sets all job environment variables for the command action, and sets the following additional environment variables:
  • LSB_JOBPGIDS--a list of current process group IDs of the job
  • LSB_JOBPIDS--a list of current process IDs of the job
  • LSB_PAMPID--the PAM process ID
  • LSB_JOBRES_PID--the process ID of RES for the job
For the SUSPEND action command, Platform LSF also sets the following environment variables:
  • LSB_SUSP_REASONS--an integer representing a bitmap of suspending reasons as defined in lsbatch.h. The suspending reason can allow the command to take different actions based on the reason for suspending the job.
  • LSB_SUSP_SUBREASONS--an integer representing the load index that caused the job to be suspended. When the suspending reason SUSP_LOAD_REASON (suspended by load) is set in LSB_SUSP_REASONS, LSB_SUSP_SUBREASONS is set to one of the load index values defined in lsf.h.

Using the LSB_JOBRES_PID and LSB_PAMPID environment variables

How to use these two variables in your job control scripts:


LSB_PAM_PID may not be available when the job first starts. It takes some time for PAM to register its PID with sbatchd.

For more information

See the Platform LSF Configuration Guide for information about JOB_CONTROLS in the lsb.queues file.

See Administering Platform LSF for information about configuring job controls.

Sample job termination script for queue job control

By default, LSF sends a SIGUSR2 signal to terminate a job that has reached its run limit or deadline. Some applications do not respond to the SIGUSR2 signal (for example, LAM/MPI), so jobs may not exit immediately when a job run limit is reached. You should configure your queues with a custom job termination action specified by the JOB_CONTROLS parameter.

Sample script

Use the following sample job termination control script for the TERMINATE job control in the hpc_linux queue for LAM/MPI jobs:

#!/bin/sh

#JOB_CONTROL_LOG=job.control.log.$LSB_BATCH_JID
JOB_CONTROL_LOG=/dev/null

kill -CONT -$LSB_JOBRES_PID >>$JOB_CONTROL_LOG 2>&1

if [ "$LSB_PAM_PID" != "" -a "$LSB_PAM_PID" != "0" ]; then
    kill -TERM $LSB_PAM_PID >>$JOB_CONTROL_LOG 2>&1

    MACHINETYPE=`uname -a | cut -d" " -f 5`
    while [ "$LSB_PAM_PID" != "0" -a "$LSB_PAM_PID" != "" ] # pam is running
    do
        if [ "$MACHINETYPE" = "CRAY" ]; then
            PIDS=`(ps -ef; ps auxww) 2>/dev/null | egrep ".*[/\[ \t]pam[] 
\t]*$"| sed -n "/grep/d;s/^ *[^ \t]* *\([0-9]*\).*/\1/p" | sort -u`
        else
            PIDS=`(ps -ef; ps auxww) 2>/dev/null | egrep " pam |/pam | 
pam$|/pam$"| sed -n "/grep/d;s/^ *[^ \t]* *\([0-9]*\).*/\1/p" | sort -u`
        fi

        echo PIDS=$PIDS >> $JOB_CONTROL_LOG
        if [ "$PIDS" = "" ]; then # no pam is running
            break;
        fi

        foundPamPid="N"
        for apid in $PIDS
        do
            if [ "$apid" = "$LSB_PAM_PID" ]; then
                # pam is running
                foundPamPid="Y"
                break
            fi
        done

        if [ "$foundPamPid" == "N" ]; then
            break # pam has exited
        fi
        sleep 2
    done
fi

# Use other terminate signals if SIGTERM is
# caught and ignored by your application.
kill -TERM -$LSB_JOBRES_PID >>$JOB_CONTROL_LOG 2>&1
exit 0

To configure the script in the hpc_linux queue

  1. Create a job control script named job_terminate_control.sh.
  2. Make the script executable:
    chmod +x job_terminate_control.sh
    
  3. Edit the hpc_linux queue in lsb.queues to configure your job_terminate_control.sh script as the TERMINATE action in the JOB_CONTROLS parameter. For example:
    Begin Queue
    QUEUE_NAME   = hpc_linux_tv
    PRIORITY     = 30
    NICE         = 20
    # ...
    # Previous inline TERMINATE action, replaced by the script:
    # JOB_CONTROLS = TERMINATE[kill -CONT -$LSB_JOBRES_PID; kill -TERM -$LSB_JOBRES_PID]
    JOB_CONTROLS = TERMINATE [/path/job_terminate_control.sh]
    TERMINATE_WHEN = LOAD PREEMPT WINDOW
    RERUNNABLE = NO
    INTERACTIVE = NO
    DESCRIPTION  = Platform LSF TotalView Debug queue.
    End Queue
    
  4. Reconfigure your cluster to make the change take effect:
    # badmin mbdrestart
    



Integration Method 1

When to use this integration method

In this method, PAM rewrites the PJL command line to insert TS in the correct position, and set callback information for TS to communicate with PAM.

Use this method when:

Using pam to call the PJL

Submit jobs using pam in the following format:

pam [other_pam_options] -g num_args pjl [pjl_options] job [job_options]

The command line includes:

pam options

The -g option is required to use the generic PJL framework. You must specify all the other pam options before -g.

num_args specifies how many space-separated arguments in the command line are related to the PJL, including the PJL itself (after that, the rest of the command line is assumed to be related to the binary application that launches the parallel tasks).

For example:
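
The following command lines are illustrative:

pam -g 1 my_pjl job [job_options]
pam -g 3 my_pjl -b group_name job [job_options]

In the first command the PJL takes no options, so num_args is 1. In the second command, my_pjl -b group_name accounts for three space-separated arguments, so num_args is 3.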

How PAM inserts TaskStarter

Before the PJL is started, PAM automatically modifies the command line and inserts the TS, the host and port for TS to contact PAM, and the LSF_ENVDIR in the correct position before the actual job.

TS is placed between the PJL and the parallel application. In this way, the TS starts each task, and LSF can monitor resource usage and control the task.

For example, if your LSF directory is /usr/share/lsf and you input:

pam [pam_options] -g 3 my_pjl -b group_name job [job_options]

PAM automatically modifies the PJL command line to:

my_pjl -b group_name /usr/share/lsf/TaskStarter -p host_name:port_number 
-c /usr/share/lsf/conf job [job_options]

For more detailed examples

See Example Integration: LAM/MPI



Integration Method 2

When to use this integration method

In this method, you rewrite or wrap the PJL to include TS and callback information for TS to communicate with PAM. This method of integration is the most flexible, but may be more difficult to implement.

Use this method when:

Using pam to call the PJL

Submit jobs using pam in the following format:

pam [other_pam_options] -g pjl_wrap [pjl_wrap_options] job [job_options]

The command line includes:

pam options

The -g option is required to use the generic PJL framework. You must specify all the other pam options before -g.

Placing TaskStarter in your code

Each job task must be started by the TaskStarter binary that is provided by Platform Computing.

When you use this method, PAM does not insert TS for you. You must modify your code to use TS and the LSF_TS_OPTIONS environment variable. LSF_TS_OPTIONS is created by PAM on the first execution host and contains the callback information for TS to contact PAM.


You must insert TS and the PAM callback information directly in front of the executable application that starts the parallel tasks.

To place TS and its options, you can modify either the PJL wrapper or the job script, depending on your implementation. If the package requires the path, specify the full path to TaskStarter.

Example

This example modifies the PJL wrapper. The job script includes both the PJL wrapper and the job itself.

Before

Without the integration, your job submission command line is:

bsub -n 2 jobscript

Your job script is:

#!/bin/sh
if [ -n "$ENV1" ]; then
  pjl -opt1 job1
else
  pjl -opt2 -opt3 job2
fi

After

After the integration, your job submission command line includes the pam command:

bsub -n 2 pam -g new_jobscript

Your new job script inserts TS and LSF_TS_OPTIONS before the jobs:

#!/bin/sh
if [ -n "$ENV1" ]; then
  pjl -opt1 /usr/share/lsf/TaskStarter $LSF_TS_OPTIONS job1
else
  pjl -opt2 -opt3 /usr/share/lsf/TaskStarter $LSF_TS_OPTIONS job2
fi

For more detailed examples

See Example Integration: LAM/MPI



Tuning PAM Scalability and Fault Tolerance

To improve performance and scalability for large parallel jobs, tune the following parameters.

Parameters for PAM (lsf.conf)

For better performance, you can adjust the following parameters in lsf.conf. The user's environment can override these.

LSF_HPC_PJL_LOADENV_TIMEOUT

Timeout value in seconds for PJL to load or unload the environment. For example, the time needed for IBM POE to load or unload adapter windows.

At job startup, the PJL times out if the first task fails to register within the specified timeout value. At job shutdown, the PJL times out if it fails to exit after the last TaskStarter termination report within the specified timeout value.

Default: LSF_HPC_PJL_LOADENV_TIMEOUT=300

LSF_PAM_RUSAGE_UPD_FACTOR

This factor adjusts the update interval according to the following calculation:

RUSAGE_UPDATE_INTERVAL + num_tasks * 1 * LSF_PAM_RUSAGE_UPD_FACTOR

PAM updates resource usage for each task for every SBD_SLEEP_TIME + num_tasks * 1 seconds (by default, SBD_SLEEP_TIME=15). For large parallel jobs, this interval is too long. As the number of parallel tasks increases, LSF_PAM_RUSAGE_UPD_FACTOR causes more frequent updates.

Default: LSF_PAM_RUSAGE_UPD_FACTOR=0.01



Running Jobs with Task Geometry

Specifying task geometry allows you to group tasks of a parallel job step to run together on the same node. Task geometry allows for flexibility in how tasks are grouped for execution on system nodes. You cannot specify the particular nodes that these groups run on; the scheduler decides which nodes run the specified groupings.

Task geometry is supported for all Platform LSF MPI integrations including IBM POE, LAM/MPI, MPICH-GM, MPICH-P4, and Intel® MPI.

Use the LSB_PJL_TASK_GEOMETRY environment variable to specify task geometry for your jobs. LSB_PJL_TASK_GEOMETRY overrides any process group or command file placement options.

The environment variable LSB_PJL_TASK_GEOMETRY is checked for all parallel jobs. If LSB_PJL_TASK_GEOMETRY is set and the user submits a parallel job (a job that requests more than one slot), LSF attempts to shape LSB_MCPU_HOSTS accordingly.

The mpirun.lsf script sets the LSB_MCPU_HOSTS environment variable in the job according to the task geometry specification. The PJL wrapper script controls the actual PJL to start tasks based on the new LSB_MCPU_HOSTS and task geometry.

Syntax

setenv LSB_PJL_TASK_GEOMETRY "{(task_ID,...) ...}"

For example, to submit a job to spawn 8 tasks and span 4 nodes, specify:

setenv LSB_PJL_TASK_GEOMETRY "{(2,5,7)(0,6)(1,3)(4)}"

Each task_ID number corresponds to a task ID in the job, and each set of parentheses contains the task IDs assigned to one node. Tasks can appear in any order, but the entire range of tasks specified must begin with 0 and must include all task ID numbers; you cannot skip a task ID number. Use braces to enclose the entire task geometry specification, parentheses to enclose each group of tasks assigned to one node, and commas to separate task IDs.

For example:

setenv LSB_PJL_TASK_GEOMETRY "{(1)(2)}"

is incorrect because it does not start from task 0.

setenv LSB_PJL_TASK_GEOMETRY "{(0)(3)}"

is incorrect because it does not specify tasks 1 and 2.

LSB_PJL_TASK_GEOMETRY cannot request more hosts than specified by the bsub -n option.

For example:

setenv LSB_PJL_TASK_GEOMETRY "{(0)(1)(2)}"

specifies three nodes, one task per node. A correct job submission must request at least 3 hosts:

bsub -n 3 -R "span[ptile=1]" -I -a mpich_gm mpirun.lsf my_job
Job <564> is submitted to queue <hpc_linux>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
...

Planning your task geometry specification

You should plan your task geometry in advance and specify the job resource requirements so that LSF selects hosts appropriately.

Use bsub -n and -R "span[ptile=]" to make sure LSF selects appropriate hosts to run the job, so that:

LSB_PJL_TASK_GEOMETRY only guarantees the geometry but does not guarantee the host order. You must make sure each host selected by LSF can run any group of tasks specified in LSB_PJL_TASK_GEOMETRY.

You can also use bsub -x to run jobs exclusively on a host. No other jobs share the node once this job is scheduled.

Usage notes and limitations

Examples

For the following task geometry:

setenv LSB_PJL_TASK_GEOMETRY "{(2,5,7)(0,6)(1,3)(4)}"

The job submission should look like:

bsub -n 12 -R "span[ptile=3]" -a poe mpirun.lsf myjob

If task 6 is an OpenMP job that spawns 4 threads, the job submission is:

bsub -n 20 -R "span[ptile=5]" -a poe mpirun.lsf myjob


Do not use -a openmp or set LSF_PAM_HOSTLIST_USE for OpenMP jobs.

A POE job has three tasks: task0, task1, and task2, and the task geometry is "{(0,1)(2)}".

Task task2 spawns 3 threads. Tasks task0 and task1 run on one node, and task2 runs on the other node. The job submission is:

bsub -a poe -n 6 -R "span[ptile=3]" mpirun.lsf -cmdfile mycmdfile

where mycmdfile contains:

task0
task1
task2

The order of the tasks in the task geometry specification must match the order of tasks in mycmdfile:

setenv LSB_PJL_TASK_GEOMETRY "{(0,1)(2)}"

If the order of tasks in mycmdfile changes, you must change the task geometry specification accordingly.

For example, if mycmdfile contains:

task0
task2
task1

the task geometry must be changed to:

setenv LSB_PJL_TASK_GEOMETRY "{(0,2)(1)}"



Enforcing Resource Usage Limits for Parallel Tasks

A typical Platform LSF parallel job launches its tasks across multiple hosts. By default you can enforce limits on the total resources used by all the tasks in the job. Because PAM only reports the sum of parallel task resource usage, LSF does not enforce resource usage limits on individual tasks in a parallel job.

For example, resource usage limits cannot control the memory allocation of a single task of a parallel job to prevent it from allocating excessive memory and bringing down the entire system. For some jobs, the total resource usage may exceed a configured resource usage limit even though no single task does, and the job is terminated when it does not need to be.

Attempting to limit individual tasks by setting a system-level swap hard limit (RLIMIT_AS) in the system limit configuration file (/etc/security/limits.conf) is not satisfactory, because it only prevents tasks running on that host from allocating more memory than they should; other tasks in the job can continue to run, with unpredictable results.

By default, custom job controls (JOB_CONTROL in lsb.queues) apply only to the entire job, not individual parallel tasks.

Enabling resource usage limit enforcement for parallel tasks

Use the LSF_HPC_EXTENSIONS options TASK_SWAPLIMIT and TASK_MEMLIMIT in lsf.conf to enable resource usage limit enforcement and job control for parallel tasks. When TASK_SWAPLIMIT or TASK_MEMLIMIT is set in LSF_HPC_EXTENSIONS, LSF terminates the entire parallel job if any single task exceeds the limit setting for memory and swap limits.

Other resource usage limits (CPU limit, process limit, run limit, and so on) continue to be enforced for the entire job, not for individual tasks.
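
For example, to have LSF terminate the entire job when any single task exceeds the job memory limit, set the following in lsf.conf (a minimal sketch):

LSF_HPC_EXTENSIONS="TASK_MEMLIMIT"

Jobs submitted with a memory limit (for example, with bsub -M) are then terminated as soon as any one of their tasks exceeds that limit.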

For more information

For detailed information about resource usage limits in LSF, see the "Runtime Resource Usage Limits" chapter in Administering Platform LSF.

Assumptions and behavior



Example Integration: LAM/MPI

The script lammpirun_wrapper is the PJL wrapper. Use either Integration Method 1 or Integration Method 2 to call this script:

pam [other_pam_options] -g num_args lammpirun_wrapper job [job_options]
pam [other_pam_options] -g lammpirun_wrapper job [job_options]

Example script

#!/bin/sh
# 
# -----------------------------------------------------
# Source the LSF environment. Optional.
# -----------------------------------------------------
. ${LSF_ENVDIR}/lsf.conf

# -----------------------------------------------------
# Set up the variable LSF_TS representing the TaskStarter.
# -----------------------------------------------------
LSF_TS="$LSF_BINDIR/TaskStarter"

# ---------------------------------------------------------------------
# Define the function to handle external signals: 
# - display the signal received and the shutdown action to the user 
# - log the signal received and the daemon shutdown action 
# - exit gracefully by shutting down the daemon 
# - set the exit code to 1 
# ---------------------------------------------------------------------- 
# 
lammpirun_exit()
{
   trap '' 1 2 3 15
   echo "Signal Received, Terminating the job<${TMP_JOBID}> and run lamhalt 
..."
   echo "Signal Received, Terminating the job<${TMP_JOBID}> and run lamhalt 
..." >>$LOGFILE
   $LAMHALT_CMD >>$LOGFILE 2>&1
   exit 1
} #lammpirun_exit

#-----------------------------------
# Name: who_am_i
# Synopsis: who_am_i 
# Environment Variables:
# Description:
#       It returns the name of the current user.
# Return Value:
#       User name.
#-----------------------------------
who_am_i()
{
if  [ `uname` = ConvexOS ] ; then
    _my_name=`whoami | sed -e "s/[      ]//g"`  
else
    _my_name=`id | sed -e 's/[^(]*(\([^)]*\)).*/\1/' | sed -e "s/[      ]//g"`  
fi

echo $_my_name
} # who_am_i

#
#  -----------------------------------------------------
# Set up the script's log file: 
# - create and set the variable LOGDIR to represent the log file directory 
# - fill in your own choice of directory LOGDIR 
# - the log directory you choose must be accessible by the user from all hosts 
# - create a log file with a unique name, based on the job ID 
# - if the log directory is not specified, the log file is /dev/null 
# - the first entry logs the file creation date and file name 
# - we create and set a second variable DISPLAY_JOBID to format the job 
#   ID properly for writing to the log file  
#  ----------------------------------------------------
#
#
# Please specify your own LOGDIR,
# Your LOGDIR must be accessible by the user from all hosts.
#
LOGDIR=""

TMP_JOBID=""
if [ -z "$LSB_JOBINDEX" -o "$LSB_JOBINDEX" = "0" ]; then
    TMP_JOBID="$LSB_JOBID"
    DISPLAY_JOBID="$LSB_JOBID"
else
    TMP_JOBID="$LSB_JOBID"_"$LSB_JOBINDEX"
    DISPLAY_JOBID="$LSB_JOBID[$LSB_JOBINDEX]"
fi

if [ -z "$LOGDIR" ]; then
    LOGFILE="/dev/null"
else
    LOGFILE="${LOGDIR}/lammpirun_wrapper.job${TMP_JOBID}.log"
fi


#
# -----------------------------------------------------
# Create and set variables to represent the commands used in the script:
#  - to modify this script to use different commands, edit this section 
# ----------------------------------------------------
#
TPING_CMD="tping"
LAMMPIRUN_CMD="mpirun"
LAMBOOT_CMD="lamboot"
LAMHALT_CMD="lamhalt"

#
# -----------------------------------------------------
# Define an exit value to rerun the script if it fails 
# - create and set the variable EXIT_VALUE to represent the requeue exit value
# - we assume you have enabled job requeue in LSF
# - we assume 66 is one of the job requeue values you specified in LSF 
# ----------------------------------------------------
#
# EXIT_VALUE should not be set to 0
EXIT_VALUE="66"

#
# -----------------------------------------------------
# Write the first entry to the script's log file 
# - date of creation 
# - name of log file 
# ----------------------------------------------------
#
my_name=`who_am_i`
echo "`date` $my_name" >>$LOGFILE

# -----------------------------------------------------
# Use the signal handling function to handle specific external signals. 
# ----------------------------------------------------
#
trap lammpirun_exit 1 2 3 15

#
# -----------------------------------------------------
# Set up a hosts file in the specific format required by LAM MPI: 
# - remove any old hosts file
# - create a new hosts file with a unique name using the LSF job ID 
# - write a comment at the start of the hosts file 
# - if the hosts file was not created properly, display an error to 
#   the user and exit  
# - define the variables HOST, NUM_PROC, FLAG, and TOTAL_CPUS to 
#   help with parsing the host information 
# - LSF's selected hosts are described in LSB_MCPU_HOSTS environment variable  
# - parse LSB_MCPU_HOSTS into the components  
# - write the new hosts file using this information 
# - write a comment at the end of the hosts file 
# - log the contents of the new hosts file to the script log file 
# ----------------------------------------------------
#
LAMHOST_FILE=".lsf_${TMP_JOBID}_lammpi.hosts"

if [ -d "$HOME" ]; then
    LAMHOST_FILE="$HOME/$LAMHOST_FILE"
fi

#
#
# start a new host file from scratch
rm -f $LAMHOST_FILE
echo "# LAMMPI host file created by LSF on `date`" >> $LAMHOST_FILE

# check if we were able to start writing the conf file
if [ -f $LAMHOST_FILE ]; then
    :
else
    echo "$0: can't create $LAMHOST_FILE"
    exit 1
fi

HOST=""
NUM_PROC=""
FLAG=""
TOTAL_CPUS=0
for TOKEN in $LSB_MCPU_HOSTS
do
    if [ -z "$FLAG" ]; then
        HOST="$TOKEN"
        FLAG="0"
    else
        NUM_PROC="$TOKEN"
        TOTAL_CPUS=`expr $TOTAL_CPUS + $NUM_PROC`
        FLAG="1"
    fi

    if [ "$FLAG" = "1" ]; then
        _x=0
        while [ $_x -lt $NUM_PROC ]
        do
            echo "$HOST" >>$LAMHOST_FILE
            _x=`expr $_x + 1`
        done

        # get ready for the next host
        FLAG=""
        HOST=""
        NUM_PROC=""
    fi
done

# last thing added to LAMHOST_FILE
echo "# end of LAMHOST file" >> $LAMHOST_FILE

echo "Your lamboot hostfile looks like:" >> $LOGFILE
cat $LAMHOST_FILE >> $LOGFILE



#  -----------------------------------------------------
#  Process the command line: 
# - extract [mpiopts] from the command line
# - extract jobname [jobopts] from the command line
#  -----------------------------------------------------
ARG0=`$LAMMPIRUN_CMD -h 2>&1 | \
      egrep '^[[:space:]]+-[[:alpha:][:digit:]-]+[[:space:]][[:space:]]' | \
      awk '{printf "%s ", $1}'`
# get -ton,t and -w / nw options
TMPARG=`$LAMMPIRUN_CMD -h 2>&1 | \
      egrep '^[[:space:]]+-[[:alpha:]_-]+[[:space:]]*(,|/)[[:space:]]-[[:alpha:]]*' | \
      sed 's/,/ /'| sed 's/\// /' | \
      awk '{printf "%s %s ", $1, $2}'`
ARG0="$ARG0 $TMPARG"

ARG1=`$LAMMPIRUN_CMD -h 2>&1 | \
      egrep '^[[:space:]]+-[[:alpha:]_-]+[[:space:]]+<[[:alpha:][:space:]_]+>[[:space:]]' | \
      awk '{printf "%s ", $1}'`

while [ $# -gt 0 ]
do
     MPIRunOpt="0"

     #single-valued options
     for option in $ARG1
     do
         if [ "$option" = "$1" ]; then  
            MPIRunOpt="1"
             case "$1" in
                 -np|-c)
                     shift
                     shift
                     ;;
                 *)
                     LAMMPI_OPTS="$LAMMPI_OPTS $1" #get option name
                     shift
                     LAMMPI_OPTS="$LAMMPI_OPTS $1" #get option value
                     shift
                     ;;
             esac
            break
         fi
     done

     if [ $MPIRunOpt = "1" ]; then
        : 
     else
        #Non-valued options
        for option in $ARG0
        do
            if [ $option = "$1" ]; then
               MPIRunOpt="1"
                case "$1" in
                    -v)
                        shift
                        ;;
                    *)
                        LAMMPI_OPTS="$LAMMPI_OPTS $1"
                        shift
                        ;;
                esac
                break
            fi
        done
     fi

     if [ $MPIRunOpt = "1" ]; then
        : 
     else 
        JOB_CMDLN="$*"
        break 
     fi

done

# -----------------------------------------------------------------------------
# Set up the CMD_LINE variable representing the integrated section of the
# command line:
# - LSF_TS, script variable representing the TaskStarter binary. 
#   TaskStarter must start each and every job task process.
# - LSF_TS_OPTIONS, LSF environment variable containing all necessary
#   information for TaskStarter to callback to LSF's Parallel Application
#   Manager.
# - JOB_CMDLN, script variable containing the job and job options
#------------------------------------------------------------------------------
if [ -z "$LSF_TS_OPTIONS" ]
then
    echo CMD_LINE="$JOB_CMDLN" >> $LOGFILE
    CMD_LINE="$JOB_CMDLN "
else
    echo CMD_LINE="$LSF_TS $LSF_TS_OPTIONS $JOB_CMDLN" >> $LOGFILE
    CMD_LINE="$LSF_TS $LSF_TS_OPTIONS $JOB_CMDLN "
fi

#
# -----------------------------------------------------
# Pre-execution steps required by LAMMPI:
# - define the variable LAM_MPI_SOCKET_SUFFIX using the LSF 
#   job ID and export it
# - run lamboot command and log the action 
# - append the hosts file to the script log file 
# - run tping command and log the action and output 
# - capture the result of tping and test for success before proceeding 
# - exits with the "requeue" exit value if pre-execution setup failed
# ----------------------------------------------------
#

LAM_MPI_SOCKET_SUFFIX="${LSB_JOBID}_${LSB_JOBINDEX}"
export LAM_MPI_SOCKET_SUFFIX

echo $LAMBOOT_CMD $LAMHOST_FILE >>$LOGFILE
$LAMBOOT_CMD $LAMHOST_FILE >>$LOGFILE 2>&1
echo $TPING_CMD N -c 1 >>$LOGFILE
$TPING_CMD N -c 1 >>$LOGFILE 2>&1
EXIT_VALUE="$?"

if [ "$EXIT_VALUE" = "0" ]; then
#
# -----------------------------------------------------
# Run the parallel job launcher: 
# - log  the action 
# - trap the exit value 
# ----------------------------------------------------
#
    #call mpirun -np # a.out
    echo "Your command line looks like:" >> $LOGFILE
    echo $LAMMPIRUN_CMD $LAMMPI_OPTS -v C $CMD_LINE >> $LOGFILE
    $LAMMPIRUN_CMD $LAMMPI_OPTS -v C $CMD_LINE 
    EXIT_VALUE=$?
#
# -----------------------------------------------------
#  Post-execution steps required by LAMMPI:
# - run lamhalt   
# - log the action 
# ----------------------------------------------------
#
    echo $LAMHALT_CMD >>$LOGFILE
    $LAMHALT_CMD >>$LOGFILE 2>&1
fi

#
# -----------------------------------------------------
# Clean up after running this script:
# - delete the hosts file we created 
# - log the end of the job 
# - log the exit value of the job
# ----------------------------------------------------
#
# cleanup temp and conf file then exit
rm -f $LAMHOST_FILE
echo "Job<${DISPLAY_JOBID}> exits with exit value $EXIT_VALUE." >>$LOGFILE 2>&1
# To support multiple jobs inside one job script
# Sleep one sec to allow next lamd start up, otherwise tping will return error
sleep 1
exit $EXIT_VALUE
#
# -----------------------------------------------------
# End the script.
# ----------------------------------------------------
#



Tips for Writing PJL Wrapper Scripts

A wrapper script is often used to call the PJL. We assume the PJL is not integrated with LSF, so if PAM were to start the PJL directly, the PJL would not automatically use the hosts that LSF selected, or allow LSF to collect resource information.

The wrapper script can set up the environment before starting the actual job.

Script log file

The script should create and use its own log file, for troubleshooting purposes. For example, it should log a message each time it runs a command, and it should also log the result of the command. The first entry might record the successful creation of the log file itself.

Command aliases

Set up aliases for the commands used in the script, and identify the full path to the command. Use the alias throughout the script, instead of calling the command directly. This makes it simple to change the path or the command at a later time, by editing just one line.

Signal handling

If the script is interrupted or terminated before it finishes, it should exit gracefully and undo any work it started. This might include closing files it was using, removing files it created, shutting down daemons it started, and recording the signal event in the log file for troubleshooting purposes.

Requeue exit value

In LSF, job requeue is an optional feature that depends on the job's exit value. PAM exits with the same exit value as PJL, or its wrapper script. Some or all errors in the script can specify a special exit value that causes LSF to requeue the job.

Redirect screen output

Use /dev/null to redirect any screen output to a null file.

Access LSF configuration

Set LSF_ENVDIR and source the lsf.conf file. This gives you access to LSF configuration settings.

Construct host file

The hosts LSF has selected to run the job are described by the environment variable LSB_MCPU_HOSTS. This environment variable specifies a list, in quotes, consisting of one or more host names paired with the number of processors to use on that host:

"host_name number_processors host_name number_processors ..."

Parse this variable into the components and create a host file in the specific format required by the vendor PJL. In this way, the hosts LSF has chosen are passed to the PJL.
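
A minimal parsing sketch (the host file name and the one-host-per-line format are illustrative; write whatever format your vendor PJL expects):

#!/bin/sh
# LSB_MCPU_HOSTS="hostA 2 hostB 1" -> write hostA twice and hostB once
HOSTFILE="$HOME/.lsf_${LSB_JOBID}_pjl.hosts"
rm -f "$HOSTFILE"
set -- $LSB_MCPU_HOSTS
while [ $# -ge 2 ]; do
    host=$1; nproc=$2; shift 2
    i=0
    while [ $i -lt "$nproc" ]; do
        echo "$host" >> "$HOSTFILE"
        i=`expr $i + 1`
    done
done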

Vendor-specific pre-exec

Depending on the vendor, the PJL may require some special pre-execution work, such as initializing environment variables or starting daemons. You should log each pre-exec task in the log file, and also check the result and handle errors if a required task failed.

Double-check external resource

If an external resource is used to identify MPI-enabled hosts, LSF has selected hosts based on the availability of that resource. However, there is some time delay between LSF scheduling the job and the script starting the PJL. It's a good idea to make the script verify that required resources are still available on the selected hosts (and exit if the hosts are no longer able to execute the parallel job). Do this immediately before starting the PJL.

PJL

The most important function of the wrapper script is to start the PJL and have it execute the parallel job on the hosts selected by LSF. Normally, you use a version of the mpirun command.

Vendor-specific post-exec

Depending on the vendor, the PJL may require some special post-execution work, such as stopping daemons. You should log each post-exec task in the log file, and also check the result and handle errors if any task failed.

Script post-exec

The script should exit gracefully. This might include closing files it used, removing files it created, shutting down daemons it started, and recording each action in the log file for troubleshooting purposes.



Other Integration Options

Once the PJL integration is successful, you might be interested in the following LSF features.

For more information about these features, see the LSF documentation.

Using a job starter

A job starter is a wrapper script that can set up the environment before starting the actual job.

Using external resources

You may need to identify MPI-enabled hosts

If all hosts in the LSF cluster can be used to run parallel jobs, with no restrictions, you do not need to differentiate between regular hosts and MPI-enabled hosts.

You can use an external resource to identify suitable hosts for running your parallel jobs.

To identify MPI-enabled hosts, you can configure a static Boolean resource in LSF.
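
A minimal sketch of such a resource definition in lsf.shared (the resource name mpihost is illustrative; you would also add it to the RESOURCES column of the relevant hosts in lsf.cluster.cluster_name):

Begin Resource
RESOURCENAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
mpihost        Boolean   ()         ()           (Host is MPI-enabled)
End Resource

Jobs can then request these hosts with a resource requirement such as bsub -R "mpihost".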

For some integrations, to make sure the parallel jobs are sent to suitable hosts, you must track a dynamic resource (such as free ports). You can use an LSF ELIM to report the availability of these. See Administering Platform LSF for information about writing ELIMs.

Named hosts

Using esub

An esub program can contain logic that modifies a job before submitting it to LSF. The esub can be used to simplify the user input. An example is the LAM/MPI integration in the Platform open source FTP directory.


