last update: 09/24/2009

Load Sharing Facility (LSF)

Running batch jobs using LSF

Load Sharing Facility (LSF) is a product of Platform Computing. LSF is a batch job management subsystem for multi-host, multi-vendor complexes. LSF provides batch job capabilities on bluefire, filling the same role as LoadLeveler on bluesky. This section provides basic information about LSF.

An LSF job is a script or a file containing LSF directives. You submit a batch job to the queues with the bsub command:

bsub < lsf_job_script_file

You can obtain a list of LSF batch queues using the bqueues command. See the bqueues man page for options.

You can list all of your queued and running jobs using the bjobs command. See the bjobs man page for options.

You can get a quick summary of all jobs running on the system using the lsfq command, although you may prefer to use the batchview command instead (see description below).

An LSF man page is available by typing man lsfintro on bluefire.

LSF commands

These LSF commands are essential for running batch jobs on bluefire:

The bhosts command

The bhosts command is typically used with the following options:
bhosts [-w|-l][-R "res_req"][host_name|host_group]
to display information about hosts and platforms.

The following partial bhosts output is what you might see on bluefire. A full listing would be much longer, containing a line for each of the 100 nodes. The bhosts command is useful for learning which nodes are unreachable or unavailable, although no reason is given for either status. For example, a node may be listed as unreachable or unavailable if it is taken out of the pool for servicing. Similarly, a node may be listed as unreachable or unavailable if it is reserved for a special project.

bl1012en$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
bl0101en           ok              -      8      0      0      0      0      0
bl0102en           ok              -      8      0      0      0      0      0
bl0103en           unreach         -      8      0      0      0      0      0
bl0104en           ok              -      8      0      0      0      0      0
bl0105en           unavail         -      8      0      0      0      0      0
bl0106en           ok              -      8      0      0      0      0      0
bl0201en           ok              -      8      0      0      0      0      0
...                
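
One way to pick out the unreachable or unavailable nodes from a long listing is to filter on the STATUS column. The sketch below assumes the column layout shown above and works on a saved copy of the output; the file name bhosts.sample and the helper name not_ok_hosts are made up for illustration and are not part of LSF.

```shell
#!/bin/sh
# Save a few lines of bhosts-style output (copied from the listing above)
# so the filter can be tried without LSF installed.
cat > bhosts.sample << 'EOF'
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
bl0101en           ok              -      8      0      0      0      0      0
bl0103en           unreach         -      8      0      0      0      0      0
bl0105en           unavail         -      8      0      0      0      0      0
EOF

# not_ok_hosts is a hypothetical helper, not an LSF command.
# It prints host name and status for every host whose STATUS is not "ok".
# NR > 1 skips the header line; $2 is the STATUS column.
not_ok_hosts() {
    awk 'NR > 1 && $2 != "ok" { print $1, $2 }' "$1"
}

not_ok_hosts bhosts.sample
```

On a system with LSF, the same awk filter could presumably be fed directly from bhosts through a pipe, provided the header is still the first line of output.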

The lshosts command

The lshosts command is typically used with the following options:
lshosts [-w | -l] [-R "res_req"] [host_name | cluster_name]
lshosts -s [shared_resource_name ...]
to display hosts and their static resource information.

The bqueues command

The bqueues command is typically used with the following options:
bqueues [-w|-l|-r][-m host_name|-m all]
   [-u user_name|-u all][queue_name ...]

to display information about job queues.

By default, the bqueues command returns the following information about all queues:

bv0101en$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
special         500  Open:Active       -    -    -    -     0     0     0     0
premium         300  Open:Active       -    -    -    -     0     0     0     0
regular         200  Open:Active       -    -    -    -  1184   784   400     0
economy         160  Open:Active       -    -    -    -     0     0     0     0
hold            104  Open:Active       -    -    -    -     0     0     0     0
standby         100  Open:Active       -    -    -    -     0     0     0     0
share           100  Open:Active       -    -    -    -     0     0     0     0
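
To see at a glance which queues actually hold jobs, you can filter the default listing on the NJOBS column. This is only a sketch: it assumes the eleven-column layout shown above, and the file name bqueues.sample and the helper name busy_queues are invented for the example.

```shell
#!/bin/sh
# A shortened copy of the bqueues listing above, saved to a file
# so the example runs without LSF installed.
cat > bqueues.sample << 'EOF'
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
premium         300  Open:Active       -    -    -    -     0     0     0     0
regular         200  Open:Active       -    -    -    -  1184   784   400     0
economy         160  Open:Active       -    -    -    -     0     0     0     0
EOF

# busy_queues is a hypothetical helper, not an LSF command.
# Columns: $1 = QUEUE_NAME, $8 = NJOBS, $9 = PEND, $10 = RUN.
# Only queues with at least one job are printed.
busy_queues() {
    awk 'NR > 1 && $8 > 0 { print $1, "NJOBS=" $8, "PEND=" $9, "RUN=" $10 }' "$1"
}

busy_queues bqueues.sample
```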

The bsub command

The bsub command is typically used as follows:
bsub [options] command [cmd_args]
to submit a job for batch execution.

Options for the bsub command include:

-B Sends mail at dispatch and initiation times.
-H Holds job in PSUSP and waits for bresume (Please do not use -H)
-I | -Ip | -Is Submits as batch interactive
-K Submits job and locks cmd line with status updates
-N Sends job report by e-mail (use only with -I | -Is | -Ip or -o)
-r Rerun job on another host if host terminates
-x Exclusive execution mode
-a esub_parameters Specifies parallel job launcher (PJL) to be used
-b [[month:]day:]hour:minute Dispatch date/time
-C core_limit Limits size of core dumps (-C 0 recommended unless corefiles required for debugging)
-c [hours:]minutes[/host_name | /host_model] CPU time limit
-D data_limit Sets per-process data segment size limit
-e err_file File to use as stderr
-E "pre_exec_command [arguments ...]" Pre-exec command invoked before batch stream command processing
-ext[sched] "external_scheduler_options" N/A
-f "local_file operator [remote_file]" ... Files to be copied between local/remote systems
-F file_limit Per process file size limit
-g job_group_name Submits job to a job group
-G user_group Associates job with a specific group
-i input_file | -is input_file Specifies stdin for job
-J job_name | -J "job_name[index_list]%job_slot_limit" Specifies job name
-k "checkpoint_dir [checkpoint_period][method=method_name]" Makes a job checkpointable and specifies checkpoint directory
-L login_shell Uses login_shell for runtime environment
-m "host_name[@cluster_name][+[pref_level]] | host_group[+[pref_level]] ..." Selects and ranks hosts/groups on which to run
-M mem_limit Sets per process memory limit
-n min_proc[,max_proc] Sets min/max number of processors required to run job (Required)
-o out_file Specifies stdout
-P project_name Specifies project name (Required)
-p process_limit Limits total number of processes
-q queue_name Specifies queue for job (default provided by system)
-R "res_req" Specifies resource requirements
-sla service_class_name Specifies service class for job
-sp priority Specifies priority amongst user's jobs
-S stack_limit Sets per-process stack limit
-t [[month:]day:]hour:minute Specifies job termination date
-T thread_limit Sets limit on number of concurrent threads
-U reservation_ID Uses reservation via brsvadd command
-u mail_user Mail-to address
-v swap_limit Sets total process virtual memory limit
-w 'dependency_expression' Defines dependencies to be met before job initiation
-wa '[signal | command | CHKPNT]' Specifies action to be taken before job control step occurs
-wt '[hours:]minutes' Specifies time interval before job control occurs to send warning signal
-W [hours:]minutes[/host_name | /host_model] Specifies run time limit for job (Required)
-Zs Spools command file and runs from there
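
Because three of the options above are marked "(Required)" (-P, -n, and -W), it can be useful to check a job script for them before submitting. The following is a minimal sketch of such a check; check_required and demo_job.lsf are hypothetical names used only for illustration, and the check assumes each directive is written on its own line starting with #BSUB.

```shell
#!/bin/sh
# check_required is a hypothetical helper, not an LSF command.
# It reports which of the options marked "(Required)" in the list above
# (-P, -n, -W) have no matching #BSUB line in a job script.
check_required() {
    script=$1
    missing=""
    for opt in -P -n -W; do
        # Match lines such as "#BSUB -W 1:00" at the start of a line.
        if ! grep -q "^#BSUB[[:space:]][[:space:]]*$opt[[:space:]]" "$script"; then
            missing="$missing $opt"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing required options:$missing"
        return 1
    fi
    echo "all required options present"
}

# Demo script that deliberately omits the -W run time limit.
cat > demo_job.lsf << 'EOF'
#!/bin/csh
#BSUB -P 99999999
#BSUB -n 1
#BSUB -q regular
./samp_f
EOF

check_required demo_job.lsf || echo "add the missing directives before submitting"
```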

Job submission - LSF usage is different from LoadLeveler

The following LSF job submission commands illustrate differences from LoadLeveler. (Note: this list is not exhaustive, and likely will continue to grow.)

LSF command bsub allows submission of an executable, e.g.

  bsub -i infile -o outfile -e errfile a.out
whereas the analogous LoadLeveler command llsubmit requires a job script to run an executable.

LSF can also be used to submit a job script. However, then the LSF command bsub requires redirection of its command file, specifically

  bsub < myscript
If the redirection is missing, LSF ignores all of your #BSUB directives, the job does not submit properly, and you will see it vanish from the queues with no explanation.
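
As an illustration of the script form, the sketch below writes a small job script and submits it with input redirection. The project number 99999999 is the placeholder used in the sample scripts in this document, hostname stands in for a real program, and the check for bsub is only there so the example does nothing harmful on a machine without LSF.

```shell
#!/bin/sh
# Write a minimal job script. The project number is a placeholder and
# "hostname" stands in for the program you actually want to run.
cat > myscript << 'EOF'
#!/bin/csh
#BSUB -P 99999999
#BSUB -n 1
#BSUB -W 0:05
#BSUB -q regular
#BSUB -o myscript.out
#BSUB -e myscript.err
hostname
EOF

# Submit with input redirection so that LSF reads the #BSUB lines.
if command -v bsub > /dev/null 2>&1; then
    bsub < myscript
else
    echo "bsub not found; on an LSF system you would run: bsub < myscript"
fi
```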

Sample LSF job scripts

Serial job

#!/bin/csh
#
# LSF batch script to run a serial code
#
#BSUB -P 99999999		# Project 99999999
#BSUB -n 1			# number of tasks
#BSUB -J seriallsf.test		# job name
#BSUB -o seriallsf.out		# output filename
#BSUB -e seriallsf.err		# error filename
#BSUB -q regular		# queue
#BSUB -W 1:00			# 1 hour wallclock limit (required)
# Fortran example
#xlf90 -o samp_f -Mextend samp.f
./samp_f

# C example
#cc -o samp_c samp.c
./samp_c

# C++ example
#g++  --no_auto_instantiation -o samp_cc samp.cc
./samp_cc

MPI job

Important: To submit MPI jobs, use mpirun.lsf, not mpirun, so that the LSF job scheduler can properly allocate node resources.

#!/bin/csh
#
# LSF batch script to run the test MPI code
#
#BSUB -P 99999999		# Project 99999999
#BSUB -n 64			# number of total (MPI) tasks
#BSUB -R "span[ptile=32]"	# run a max of 32 MPI tasks per node
#BSUB -J mpilsf.test		# job name
#BSUB -o mpilsf.out		# output filename
#BSUB -e mpilsf.err		# error filename
#BSUB -q regular		# queue
#BSUB -W 1:00			# 1 hour wallclock limit (required)

# Fortran example
#mpxlf90 -o mpi_samp_f mpisamp.f
mpirun.lsf ./mpi_samp_f

# C example
#mpcc -o mpi_samp_c mpisamp.c
mpirun.lsf ./mpi_samp_c

# C++ example
#mpCC -o mpi_samp_cc mpisamp.cc
mpirun.lsf ./mpi_samp_cc

Notes for SMT: The above example shows the "default" way to run (without SMT). It will run 64 tasks on two 32-way nodes, placing one task per physical processor, i.e.:
#BSUB -n 64
#BSUB -R span[ptile=32]

On an SMT-enabled node, you could run this same application on a single 32-way node with two tasks per physical processor, and your job script would use the LSF directives:
#BSUB -n 64
#BSUB -R span[ptile=64]
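
The node counts in these two cases follow from dividing the task count (-n) by the tasks per node (ptile) and rounding up. The sketch below shows that arithmetic; nodes_needed is a hypothetical helper, and it assumes that each node is filled up to ptile tasks before another node is used.

```shell
#!/bin/sh
# nodes_needed is a hypothetical helper, not an LSF command.
# $1 = total tasks (-n), $2 = tasks per node (ptile).
# Integer ceiling division: (n + ptile - 1) / ptile.
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

echo "n=64 ptile=32: $(nodes_needed 64 32) node(s)"   # without SMT
echo "n=64 ptile=64: $(nodes_needed 64 64) node(s)"   # with SMT
```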


OpenMP job

#!/bin/csh
#
# LSF script to run the test OMP codes
#
#BSUB -P 99999999		# Project 99999999
#BSUB -n 1			# number of tasks
#BSUB -R "span[hosts=1]"	# keep the job on a single host
#BSUB -J omplsf.test		# job name
#BSUB -o omplsf.out		# output filename
#BSUB -e omplsf.err		# error filename
#BSUB -q regular		# queue
#BSUB -W 1:00			# 1 hour wallclock limit

setenv OMP_NUM_THREADS 32

# Fortran example
#xlf90 -o samp_f -qsmp=omp samp.f
./samp_f

# C example
#cc -qsmp=omp -o samp_c samp.c
./samp_c

# C++ example
#g++ --no_auto_instantiation -qsmp=omp -o samp_cc samp.cc
./samp_cc

MPMD (MPI) job

#!/bin/ksh
#
# LSF batch script to run an MPMD code. This example has two executables
# each requiring one task
#
#BSUB -n 2
#BSUB -R "span[ptile=1]" 	#max number of tasks/node. Job will thus run on
#				#two nodes. Using ptile of 2 to 32 will make job 
#				#run on same node
#BSUB -o mpmdlsf.%J.out		# output filename
#BSUB -e mpmdlsf.%J.err		# error filename
#BSUB -J mpmdlsf.test		# job name
#BSUB -W 0:10			# 10 minutes wall clock time 
#BSUB -q regular		# queue
#BSUB -P xxxxxxxx		#insert correct project number

# Run this executable in MPMD mode:
export MP_PGMMODEL=mpmd
export MP_SHARED_MEMORY=yes
#
# Fortran example
cat << 'EOF' > mpmd.f
      program main
      implicit none
      include 'mpif.h'

      character (len=MPI_MAX_PROCESSOR_NAME) nodename
      integer name_len, ierr

! Establish the number of the process being run with the variable "rank"
      integer i
      integer rank,error,tag,length,status(MPI_STATUS_SIZE)
      integer hz, clock0, clock1, t
      real(kind=8)::  sum, buf(2), elapsed

      call mpi_init(error)
! Assign the process number to the current process with "rank"
      call mpi_comm_rank(MPI_COMM_WORLD,rank,error)
      length=2
      tag =999

! Test for the process being run, then calculate sum (process 1) or
! print sum (process 0)
      call system_clock(count_rate = hz)
      call system_clock(count = clock0)
      if (rank .eq. 1) then
          sum=0.0
          do i=1,1000000
              sum=sum+exp(.00000001*i)
          end do
          call system_clock(count = clock1)
          elapsed = real(clock1 - clock0) / hz
          buf(1)=sum
          buf(2)=elapsed/1000.0
          call mpi_send(buf,length,MPI_REAL8,0,tag,MPI_COMM_WORLD,error)
          call MPI_GET_PROCESSOR_NAME(nodename, name_len, ierr)
          print *,'f90 MPI task ', rank,' sending data from node ', nodename(1:name_len)
      elseif (rank .eq. 0) then
          call  mpi_recv(buf,length,MPI_REAL8,1,tag,MPI_COMM_WORLD,status,error)
          call MPI_GET_PROCESSOR_NAME(nodename, name_len, ierr)
          print *,'f90 MPI task ', rank,' receiving data on node ', nodename(1:name_len)
          print 10, buf(1), buf(2)
10        format(' f90 mpmd Results: Sum = ',1pe12.6,' Loop time =  ',0pf12.8)
      end if

      call mpi_finalize(error)
      stop
      end
EOF
mpxlf_r -qarch=auto -qtune=auto -O3 -qstrict -qrealsize=8 \
                  -qfixed=132 -o mpmdf mpmd.f
cp mpmdf mpmdf2
#
cat << end > cmds
mpmdf
mpmdf2
end
#
mpirun.lsf -cmdfile cmds
#
rm mpmdf mpmdf2 cmds


Hybrid job

#!/bin/ksh
#
# LSF batch script to run the test mixed MPI/OMP codes.
# The following example runs a job that has been "hard-wired"
# for two MPI tasks. Each task will run 32 threads, so the
# job has been set up to run on two blueefire nodes. Note that the
# OpenMP threads are not taken into account in the -n task
# count, but you should ensure that you allow for each
# thread to have its own processor, or your performance
# will suffer

#BSUB -P 99999999                 # Project 99999999
#BSUB -n 2                        # total tasks (MPI) needed
#BSUB -R "span[ptile=1]"          # max number of tasks (MPI) per node
#BSUB -o mixlsf.out               # output filename
#BSUB -e mixlsf.err               # error filename
#BSUB -J mixlsf.test              # job name
#BSUB -q regular                  # queue
#BSUB -W 1:00                     # wallclock limit of 1 hour
#
#
EXE=./mix
#
export OMP_NUM_THREADS=32

# Fortran example
#mpxlf90_r -qsmp=omp -o mix mix.f
mpirun.lsf $EXE

# C example
#mpcc_r -qsmp=omp -o mix mix.c
#mpirun.lsf $EXE

# C++ example
#mpCC_r --no_auto_instantiation -qsmp=omp -o mix mix.cc
#mpirun.lsf $EXE

Notes: In the above application (not using SMT), you have two tasks and each task has 32 OpenMP threads. By using ptile=1 you are specifying that the maximum number of (MPI) tasks per node is 1, so the job will have to run on two nodes. Your job script would use the following LSF commands and set the number of OpenMP threads via an environment variable:
#BSUB -n 2
#BSUB -R "span[ptile=1]"
export OMP_NUM_THREADS=32

Notes for SMT: Since our program is hard-wired for 2 MPI tasks, the only way to take advantage of SMT is to double the number of threads (i.e., to 64). If you have a program in which you can vary the number of MPI tasks, you could modify the above script by treating the nodes as if they had twice the number of processors, that is, doubling the number of tasks (-n) you request. This would keep the same node count:
#BSUB -n 4
#BSUB -R "span[ptile=2]"
export OMP_NUM_THREADS=32
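
A quick way to sanity-check a hybrid layout is to multiply the tasks per node (ptile) by OMP_NUM_THREADS and compare the result with what a node offers. The sketch below assumes, as the notes above do, 32 physical processors per node and twice that many hardware threads with SMT; describe_layout is a hypothetical helper, not an LSF command.

```shell
#!/bin/sh
# Assumptions taken from the notes above: 32 physical processors per node,
# and SMT doubles that to 64 hardware threads per node.
PHYS_PER_NODE=32
SMT_PER_NODE=64

# describe_layout is a hypothetical helper.
# $1 = tasks per node (ptile), $2 = OMP_NUM_THREADS
describe_layout() {
    threads=$(( $1 * $2 ))
    if [ "$threads" -le "$PHYS_PER_NODE" ]; then
        echo "$threads threads per node: fits without SMT"
    elif [ "$threads" -le "$SMT_PER_NODE" ]; then
        echo "$threads threads per node: relies on SMT"
    else
        echo "$threads threads per node: oversubscribed"
    fi
}

describe_layout 1 32   # the hybrid script as written
describe_layout 1 64   # two hard-wired tasks with the thread count doubled
describe_layout 2 32   # task count doubled (-n 4 with ptile=2)
```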


Hybrid MPMD job

#!/bin/ksh
#
# LSF batch script to run a test Fortran mixed MPI/OMP code using task geometry
# Equivalent LoadLeveler directives are given below.

#BSUB -n 2                              # total number of (MPI) tasks
##BSUB -R "span[ptile=1]"               # max tasks per node--not needed
                                        # if task geometry used in LSF
#BSUB -o geomlsf.out.%J                 # output filename
#BSUB -e geomlsf.err.%J                 # error filename
#BSUB -P 99999999                       # account number (project)
#BSUB -J geomlsf                        # job name
#BSUB -q regular                        # queue
#BSUB -W 2:00                           # 2 hour run limit
#------------LoadLeveler equivalents, commented out----------------------------
##@ job_type = parallel
##@ network.MPI = csss,not_shared,us    #Handled by POE elim in LSF
##@ total_tasks = 2                     #Note that total_tasks and
##@ tasks_per_node = 1                  #tasks_per_node are not needed
##@ node_usage = not_shared             #with task geometry in LoadLeveler
##@ output = geomlsf.out.$(jobid)
##@ error = geomlsf.err.$(jobid)
##@ account_no = 99999999
##@ jobname = geomlsf
##@ class = regular
##@ wall_clock_limit = 2:00
##@ task_geometry = {(0) (1)}
##@ ja_report = yes                     #Replaced by CISL Portal

#------------------------------------------------------------------------------
# Set up LSF Task Geometry to run across two nodes
# This is set using LSF environment variable rather than BSUB directive
# We will run program "fast" on 1 node with no threading (one MPI task)
# We will run program "slow" on 1 node with three threads under 1 MPI task
export LSB_PJL_TASK_GEOMETRY="{(0) (1)}"
#
# create SMP OMP executable
# No threads are allowed in the build of the "fast" executable
#
cat > geom.F << 'EOF'
      program main
      use omp_lib
      implicit none
      include 'mpif.h'

      integer i,lct,iam,rank,error,tag,length,status(MPI_STATUS_SIZE)
      real(kind=8):: sum, buf(2), elapsed, rtc
      logical print

      character (len=MPI_MAX_PROCESSOR_NAME) nodename
      integer name_len, ierr

      call mpi_init(error)
      call mpi_comm_rank(MPI_COMM_WORLD,rank,error)
      length=2
      tag=999
      lct = LOOP_LENGTH
      sum=0.0

      elapsed=rtc()
      CALL MPI_GET_PROCESSOR_NAME(NODENAME, name_len, ierr)
      print *, 'Task ', rank, ' loop length=', lct, 'on node ', nodename(1:name_len)
      print=.true.
!$omp  parallel do reduction(+:sum) private(iam)
      do i=1,lct
         if(print .eqv. .true.)then
           iam=omp_get_thread_num()
           print *, 'Task ',rank,' Thread ',iam, 'i = ', i, 'on node ', nodename(1:name_len)
           print=.false.
           print *, 'Parallel loop, Task ', rank, ' nthreads = ', omp_get_num_threads()
         end if
         sum=sum+exp(.00000001*i)
      end do
      elapsed=rtc()-elapsed
      buf(1)=sum
      buf(2)=elapsed

      if( rank .eq. 1) then
      print *, 'Task 1 sending Sum results and Loop time to Task 0'
      call mpi_send(buf,length,MPI_REAL8,0,tag,MPI_COMM_WORLD,error)

      else

      print*,'Rank= 0, Sum=',buf(1), ' Loop time=', buf(2)
      call mpi_recv(buf,length,MPI_REAL8,1,tag,MPI_COMM_WORLD,status,error)
      print *, 'Task 0 received Sum results and Loop time from Task 1'
      print*,'Rank= 1, Sum=',buf(1), ' Loop time=', buf(2)

      end if

      call mpi_finalize(error)
      stop
      end

EOF
mpxlf_r  -WF,-DLOOP_LENGTH=1000000 -qfixed=132 -O3 -qstrict \
         -qrealsize=8 -o fast geom.F
#
# build of the "slow" executable allows the use of threads to speed it up
# so is compiled with -qsmp=omp
mpxlf_r  -WF,-DLOOP_LENGTH=3000000 -qfixed=132 -O3 -qstrict \
          -qrealsize=8 -qsmp=omp -o slow geom.F
#hostfile not used

# create task list (similar to poe command file), 3 threads for the "slow" executable
cat << EOF > cmds
env OMP_NUM_THREADS=1 fast
env OMP_NUM_THREADS=3 slow
EOF
#
#set necessary POE environment variables for MPMD job

export MP_PGMMODEL=mpmd
# run programs "fast" and "slow"
mpirun.lsf -cmdfile cmds
#rm fast slow  cmds

The bhist command

Here are some ways you can use the bhist command to display historical information about jobs:

bhist -J job_name
bhist -C start_time,end_time
bhist -D start_time,end_time
bhist -S start_time,end_time
bhist -T start_time,end_time

The bpeek command

Here is how you can use the bpeek command to display stdout and stderr of a selected, unfinished job. Note that bpeek -f uses tail -f to display output instead of cat:

bpeek [-f] [-q queue_name | -m host_name | -J job_name | job_ID | "job_ID[index_list]"]

The bmod command

Please do not use the bmod command.

The bswitch command

Here are some ways you can use the bswitch command to switch unfinished jobs from one queue to another:

bswitch [-J job_name] [-m host_name | -m host_group]
   [-q queue_name] [-u user_name | -u user_group | -u all]
   destination_queue [0]
bswitch destination_queue [job_ID | "job_ID[index_list]"] ...
bswitch [-h | -V]

The bstop and bresume commands

Please do not use the bstop or bresume commands.

The bkill command

Here are some ways you can use the bkill command to send signals to kill, suspend, or resume unfinished jobs (we omit option -r because we ask you not to use it):

bkill [-l] [-g job_group_name | -sla service_class_name]
   [-J job_name] [-m host_name | -m host_group]
   [-q queue_name] [-s (signal_value | signal_name)]
   [-u user_name | -u user_group | -u all]
   [job_ID ... | 0 | "job_ID[index]" ...]
bkill [-h | -V]

Comparison of LoadLeveler and LSF queue commands

This chart helps you convert your LoadLeveler jobs to LSF jobs.

LoadLeveler       LSF              Description
llsubmit script   bsub < script    Submit a job script for execution.
llq               bjobs            Show status of running and pending jobs.
                  bhist            Display historical information about your jobs.
llcancel          bkill            Kill a job.
llhold            bstop            Hold a job. (Please do not use bstop.)
llclass           bqueues          Show configuration of queues.
                  busers           Display information about users and groups.
                  bpeek            Peek at the stderr and stdout of an unfinished job.
                  bacct            Display accounting information for finished jobs.
llstatus          bhosts           Summarize load on each host.
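
If you are still used to typing LoadLeveler commands, a small lookup function can serve as a reminder of the LSF counterpart. The sketch below simply encodes the rows of the chart above that have a LoadLeveler entry; ll2lsf is a made-up name, and the function only prints the equivalent, it does not run anything.

```shell
#!/bin/sh
# ll2lsf is a hypothetical helper that prints the LSF counterpart of a
# LoadLeveler command, following the comparison chart in this section.
ll2lsf() {
    case "$1" in
        llsubmit) echo "bsub < script" ;;
        llq)      echo "bjobs" ;;
        llcancel) echo "bkill" ;;
        llhold)   echo "bstop (please do not use bstop)" ;;
        llclass)  echo "bqueues" ;;
        llstatus) echo "bhosts" ;;
        *)        echo "no LSF equivalent listed for $1"; return 1 ;;
    esac
}

ll2lsf llq
ll2lsf llclass
```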


If you have questions about this document, please contact us via any of the methods (phone, email, ticket, or in person) described here: CISL Customer Support.

© Copyright 2009. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.

Address of this page: http://www.cisl.ucar.edu/docs/bluefire/lsf.html