last update: 09/24/2009
Load Sharing Facility (LSF) is a product of Platform Computing. LSF is a batch job management subsystem for multi-host, multi-vendor complexes. LSF provides batch job capabilities on blueice, filling the same role as LoadLeveler on bluesky. This section provides basic information about LSF.
An LSF job is a script or a file containing LSF directives. You submit a batch job to the queues with the bsub command, redirecting the job file into it:
bsub < lsf_job_script_file
You can obtain a list of LSF batch queues using the bqueues command. See the bqueues man page for options.
You can list all of your queued and running jobs using the bjobs command. See the bjobs man page for options.
You can get a quick summary of all jobs running on the system using the lsfq command, although you may prefer to use the batchview command instead (see description below).
An LSF man page is available by typing man lsfintro on bluefire.
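As a rough illustration of what such a job file contains, the sketch below prints a minimal job script built from #BSUB directives. The helper name make_job_script is hypothetical, the project number 99999999 is a placeholder, and the choice of ksh for the generated script is an assumption; the directives (-P, -n, -W, -q) and the regular queue are the ones used in the sample scripts on this page.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal LSF job script to stdout.
# usage: make_job_script PROJECT NTASKS WALLCLOCK COMMAND
make_job_script() {
    project=$1
    ntasks=$2
    wallclock=$3
    run_cmd=$4
    cat <<EOF
#!/bin/ksh
#BSUB -P $project
#BSUB -n $ntasks
#BSUB -W $wallclock
#BSUB -q regular
$run_cmd
EOF
}

# Print a one-task, one-hour job for a placeholder project.
make_job_script 99999999 1 1:00 ./samp_f
```

Saving the output to a file and passing that file to bsub with input redirection would submit it.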
LSF commands
These LSF commands are essential for running batch jobs on blueice:
bhosts - shows information about available hosts (lshosts)
bqueues - shows information about available queues
bsub - submits jobs to batch subsystem
bjobs - lists jobs in the batch subsystem
bhist - displays historical information about user's jobs
bpeek - displays stdout and stderr of user's unfinished job
bmod - modifies job submission options for user's job (please do not use bmod)
bkill - kills, suspends, or resumes user's jobs (please do not use argument -r)
The bhosts command
The bhosts command is typically used with the following options:
bhosts [-w|-l][-R "res_req"][host_name|host_group]
to display information about hosts and platforms.
The following partial bhosts output is what you might see on bluefire. A full listing would be much longer, containing a line for each of the 100 nodes. The bhosts command is useful for learning which nodes are unreachable or unavailable, although no reason is given for either status. For example, a node may be listed as unreachable or unavailable if it is taken out of the pool for servicing. Similarly, a node may be listed as unreachable or unavailable if it is reserved for a special project.
bl1012en$ bhosts
HOST_NAME   STATUS    JL/U   MAX  NJOBS   RUN  SSUSP  USUSP   RSV
bl0101en    ok           -     8      0     0      0      0     0
bl0102en    ok           -     8      0     0      0      0     0
bl0103en    unreach      -     8      0     0      0      0     0
bl0104en    ok           -     8      0     0      0      0     0
bl0105en    unavail      -     8      0     0      0      0     0
bl0106en    ok           -     8      0     0      0      0     0
bl0201en    ok           -     8      0     0      0      0     0
...
The commands
lshosts [-w | -l] [-R "res_req"] [host_name | cluster_name]
lshosts -s [shared_resource_name ...]
display hosts and their static resource information.
The bqueues command
The bqueues command is typically used with the following options:
bqueues [-w|-l|-r][-m host_name|-m all]
[-u user_name|-u all][queue_name ...]
to display information about job queues.
By default, the bqueues command returns the following information about all queues:
- queue name
- queue priority
- queue status
- job slot statistics
- job state statistics
bv0101en$ bqueues
QUEUE_NAME   PRIO  STATUS        MAX  JL/U  JL/P  JL/H  NJOBS  PEND   RUN  SUSP
special       500  Open:Active     -     -     -     -      0     0     0     0
premium       300  Open:Active     -     -     -     -      0     0     0     0
regular       200  Open:Active     -     -     -     -   1184   784   400     0
economy       160  Open:Active     -     -     -     -      0     0     0     0
hold          104  Open:Active     -     -     -     -      0     0     0     0
standby       100  Open:Active     -     -     -     -      0     0     0     0
share         100  Open:Active     -     -     -     -      0     0     0     0
The bsub command
The bsub command is typically used as follows:
bsub [options] command [cmd_args]
to submit a job for batch execution.
Options for the bsub command include:
-B
    Sends mail at dispatch and initiation times.
-H
    Holds job in PSUSP and waits for bresume (Please do not use -H)
-I | -Ip | -Is
    Submits as batch interactive
-K
    Submits job and locks cmd line with status updates
-N
    Sends job report by e-mail (use only with -I | -Is | -Ip or -o)
-r
    Rerun job on another host if host terminates
-x
    Exclusive execution mode
-a esub_parameters
    Specifies parallel job launcher (PJL) to be used
-b [[month:]day:]hour:minute
    Dispatch date/time
-C core_limit
    Limits size of core dumps (-C 0 recommended unless corefiles required for debugging)
-c [hours:]minutes[/host_name | /host_model]
    Cpu time limit
-D data_limit
    Sets per-process data segment size limit
-e err_file
    File to use as stderr
-E "pre_exec_command [arguments ...]"
    Pre-exec command invoked before batch stream command processing
-ext[sched] "external_scheduler_options"
    N/A
-f "local_file operator [remote_file]" ...
    Files to be copied between local/remote systems
-F file_limit
    Per process file size limit
-g job_group_name
    Submits job to a job group
-G user_group
    Associates job with a specific group
-i input_file | -is input_file
    Specifies stdin for job
-J job_name | -J "job_name[index_list]%job_slot_limit"
    Specifies job name
-k "checkpoint_dir [checkpoint_period][method=method_name]"
    Makes a job checkpointable and specifies checkpoint directory
-L login_shell
    Uses login_shell for runtime environment
-m "host_name[@cluster_name][+[pref_level]] | host_group[+[pref_level]]"
    Selects and ranks hosts/groups on which to run
-M mem_limit
    Sets per process memory limit
-n min_proc[,max_proc]
    Sets min/max number of processors required to run job (Required)
-o out_file
    Specifies stdout
-P project_name
    Specifies project name (Required)
-p process_limit
    Limits total number of processes
-q queue_name
    Specifies queue for job (default provided by system)
-R "res_req"
    Specifies resource requirements
-sla service_class_name
    Specifies service class for job
-sp priority
    Specifies priority amongst user's jobs
-S stack_limit
    Sets per-process stack limit
-t [[month:]day:]hour:minute
    Specifies job termination date
-T thread_limit
    Sets limit on number of concurrent threads
-U reservation_ID
    Uses reservation via brsvadd command
-u mail_user
    Mail-to address
-v swap_limit
    Sets total process virtual memory limit
-w 'dependency_expression'
    Defines dependencies to be met before job initiation
-wa '[signal | command | CHKPNT]'
    Specifies action to be taken before job control step occurs
-wt '[hours:]minutes'
    Specifies time interval before job control occurs to send warning signal
-W [hours:]minutes[/host_name | /host_model]
    Specifies run time limit for job (Required)
-Zs
    Spools command file and runs from there
Job submission - LSF usage is different from LoadLeveler
The following LSF job submission commands illustrate differences from LoadLeveler. (Note: this list is not exhaustive, and likely will continue to grow.)
LSF command bsub allows submission of an executable, e.g.
bsub -i infile -o outfile -e errfile a.out
whereas the analogous LoadLeveler command llsubmit requires a job script to run an executable.
LSF can also be used to submit a job script. In that case, however, the LSF command bsub requires redirection of its command file, specifically
bsub < myscript
If redirection is missing, the job will not submit properly (ignoring all of your #BSUB directives), and you will see it vanish from the queues with no explanation.
Sample LSF job scripts
#!/bin/csh
#
# LSF batch script to run a serial code
#
#BSUB -P 99999999              # Project 99999999
#BSUB -n 1                     # number of tasks
#BSUB -J seriallsf.test        # job name
#BSUB -o seriallsf.out         # output filename
#BSUB -e seriallsf.err         # error filename
#BSUB -q regular               # queue
#BSUB -W 1:00                  # 1 hour wallclock limit (required)
# Fortran example
#xlf90 -o samp_f -Mextend samp.f
./samp_f
# C example
#cc -o samp_c samp.c
./samp_c
# C++ example
#g++ --no_auto_instantiation -o samp_cc samp.cc
./samp_cc
Important: To submit MPI jobs, use mpirun.lsf, not mpirun, so that the LSF job scheduler can properly allocate node resources.
#!/bin/csh
#
# LSF batch script to run the test MPI code
#
#BSUB -P 99999999              # Project 99999999
#BSUB -n 64                    # number of total (MPI) tasks
#BSUB -R "span[ptile=32]"      # run a max of 32 MPI tasks per node
#BSUB -J mpilsf.test           # job name
#BSUB -o mpilsf.out            # output filename
#BSUB -e mpilsf.err            # error filename
#BSUB -q regular               # queue
#BSUB -W 1:00                  # 1 hour wallclock limit (required)
# Fortran example
#mpxlf90 -o mpi_samp_f mpisamp.f
mpirun.lsf ./mpi_samp_f
# C example
#mpcc -o mpi_samp_c mpisamp.c
mpirun.lsf ./mpi_samp_c
# C++ example
#mpCC -o mpi_samp_cc mpisamp.cc
mpirun.lsf ./mpi_samp_cc
Notes for SMT: The above example shows the "default" way to run (without SMT). It will run 64 tasks on two 32-way nodes, placing one task per physical processor, i.e.:
#BSUB -n 64
#BSUB -R span[ptile=32]
On an SMT-enabled node, you could run this same application on a single 32-way node with two tasks per physical processor, and your job script would use the LSF directives:
#BSUB -n 64
#BSUB -R span[ptile=64]
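The node counts in the two cases above follow from dividing the task count (-n) by the per-node limit (ptile) and rounding up. A minimal sketch of that arithmetic, using a hypothetical helper named nodes_needed that is not an LSF command:

```shell
#!/bin/sh
# Hypothetical helper: number of nodes needed to place NTASKS tasks
# when at most PTILE tasks may share a node (ceiling of NTASKS / PTILE).
# usage: nodes_needed NTASKS PTILE
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 64 32   # without SMT: prints 2
nodes_needed 64 64   # with SMT:    prints 1
```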
#!/bin/csh
#
# LSF script to run the test OMP codes
#
#BSUB -P 99999999              # Project 99999999
#BSUB -n 1                     # number of tasks
#BSUB -R "span[hosts=1]"       # run all tasks on one host
#BSUB -J omplsf.test           # job name
#BSUB -o omplsf.out            # output filename
#BSUB -e omplsf.err            # error filename
#BSUB -q regular               # queue
#BSUB -W 1:00                  # 1 hour wallclock limit
setenv OMP_NUM_THREADS 32
# Fortran example
#xlf90 -o samp_f -qsmp=omp samp.f
./samp_f
# C example
cc -qsmp=omp -o samp_c samp.c
./samp_c
# C++ example
g++ --no_auto_instantiation -qsmp=omp -o samp_cc samp.cc
./samp_cc
#!/bin/ksh
#
# LSF batch script to run an MPMD code. This example has two executables
# each requiring one task
#
#BSUB -n 2
#BSUB -R "span[ptile=1]"       #max number of tasks/node. Job will thus run on
#                              #two nodes. Using ptile of 2 to 32 will make job
#                              #run on same node
#BSUB -o mpmdlsf.%J.out        # output filename
#BSUB -e mpmdlsf.%J.err        # error filename
#BSUB -J mpmdlsf.test          # job name
#BSUB -W 0:10                  # 10 minutes wall clock time
#BSUB -q regular               # queue
#BSUB -P xxxxxxxx              #insert correct project number
# Run this executable in MPMD mode:
export MP_PGMMODEL=mpmd
export MP_SHARED_MEMORY=yes
#
# Fortran example
cat << 'EOF' > mpmd.f
      program main
      implicit none
      include 'mpif.h'
      character (len=MPI_MAX_PROCESSOR_NAME) nodename
      integer name_len, ierr
! Establish the number of the process being run with the variable "rank"
      integer i
      integer rank,error,tag,length,status(MPI_STATUS_SIZE)
      integer hz, clock0, clock1, t
      real(kind=8):: sum, buf(2), elapsed
      call mpi_init(error)
! Assign the process number to the current process with "rank"
      call mpi_comm_rank(MPI_COMM_WORLD,rank,error)
      length=2
      tag =999
! Test for the process being run, then calculate sum (process 1) or
! print sum (process 0)
      call system_clock(count_rate = hz)
      call system_clock(count = clock0)
      if (rank .eq. 1) then
        sum=0.0
        do i=1,1000000
          sum=sum+exp(.00000001*i)
        end do
        call system_clock(count = clock1)
        elapsed = real(clock1 - clock0) / hz
        buf(1)=sum
        buf(2)=elapsed/1000.0
        call mpi_send(buf,length,MPI_REAL8,0,tag,MPI_COMM_WORLD,error)
        call MPI_GET_PROCESSOR_NAME(nodename, name_len, ierr)
        print *,'f90 MPI task ', rank,' sending data from node ', nodename(1:name_len)
      elseif (rank .eq. 0) then
        call mpi_recv(buf,length,MPI_REAL8,1,tag,MPI_COMM_WORLD,status,error)
        call MPI_GET_PROCESSOR_NAME(nodename, name_len, ierr)
        print *,'f90 MPI task ', rank,' receiving data on node ', nodename(1:name_len)
        print 10, buf(1), buf(2)
 10     format(' f90 mpmd Results: Sum = ',1pe12.6,' Loop time = ',0pf12.8)
      end if
      call mpi_finalize(error)
      stop
      end
EOF
mpxlf_r -qarch=auto -qtune=auto -O3 -qstrict -qrealsize=8 \
    -qfixed=132 -o mpmdf mpmd.f
cp mpmdf mpmdf2
#
cat << end > cmds
mpmdf
mpmdf2
end
#
mpirun.lsf -cmdfile cmds
#
rm mpmdf mpmdf2 cmds
#!/bin/ksh
#
# LSF batch script to run the test mixed MPI/OMP codes.
# The following example runs a job that has been "hard-wired"
# for two MPI tasks. Each task will run 32 threads, so the
# job has been set up to run on two bluefire nodes. Note that the
# OpenMP threads are not taken into account in the -n task
# count, but you should ensure that you allow for each
# thread to have its own processor, or your performance
# will suffer
#BSUB -P 99999999              # Project 99999999
#BSUB -n 2                     # total tasks (MPI) needed
#BSUB -R "span[ptile=1]"       # max number of tasks (MPI) per node
#BSUB -o mixlsf.out            # output filename
#BSUB -e mixlsf.err            # error filename
#BSUB -J mixlsf.test           # job name
#BSUB -q regular               # queue
#BSUB -W 1:00                  # wallclock limit of 1 hour
#
#
EXE=./mix
#
export OMP_NUM_THREADS=32
# Fortran example
#mpxlf90_r -qsmp=omp -o mix mix.f
mpirun.lsf $EXE
# C example
#mpcc_r -qsmp=omp -o mix mix.c
#mpirun.lsf $EXE
# C++ example
#mpCC_r --no_auto_instantiation -qsmp=omp -o mix mix.cc
#mpirun.lsf $EXE
Notes: In the above application (not using SMT), you have two tasks and each task has 32 OpenMP threads. By using ptile=1 you are specifying that the maximum number of (MPI) tasks per node is 1, so the job will have to run on two nodes. Your job script would use the following LSF directives and set the number of OpenMP threads via an environment variable:
#BSUB -n 2
#BSUB -R "span[ptile=1]"
export OMP_NUM_THREADS=32
Notes for SMT: Since our program is hard-wired for 2 MPI tasks, the only way to take advantage of SMT is to double the number of threads (i.e., to 64). If you have a program in which you can vary the number of MPI tasks, you could instead modify the above script by treating the nodes as if they had twice the number of processors, that is, doubling the number of tasks (-n) you request and the number of tasks per node (ptile). This would keep the same node count:
#BSUB -n 4
#BSUB -R "span[ptile=2]"
export OMP_NUM_THREADS=32
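Because OpenMP threads are not counted in -n, it can help to check a hybrid layout by hand: the node count is the task count divided by ptile (rounded up), and each node carries ptile times OMP_NUM_THREADS threads. The sketch below does that arithmetic. The helper name check_layout is hypothetical, and the figure of 64 logical CPUs on an SMT-enabled 32-way node is an assumption drawn from the notes above (two tasks per physical processor).

```shell
#!/bin/sh
# Hypothetical helper: report node count and threads per node for a
# mixed MPI/OpenMP job, and whether the threads fit on the node.
# usage: check_layout NTASKS PTILE OMP_THREADS CPUS_PER_NODE
check_layout() {
    ntasks=$1
    ptile=$2
    threads=$3
    cpus=$4
    nodes=$(( (ntasks + ptile - 1) / ptile ))   # ceiling of ntasks / ptile
    per_node=$(( ptile * threads ))             # threads sharing one node
    if [ "$per_node" -le "$cpus" ]; then
        verdict=fits
    else
        verdict=oversubscribed
    fi
    echo "nodes=$nodes threads_per_node=$per_node $verdict"
}

check_layout 2 1 32 32   # no SMT: 2 nodes, 32 threads per node
check_layout 4 2 32 64   # SMT (assumed 64 logical CPUs): 2 nodes, 64 threads per node
```

Running the second layout against 32 CPUs per node instead of 64 reports it as oversubscribed, which is the case the script comments warn about.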
#!/bin/ksh
#
# LSF batch script to run a test Fortran mixed MPI/OMP code using task geometry
# Equivalent LoadLeveler directives are given below.
#BSUB -n 2                     # total number of (MPI) tasks
##BSUB -R "span[ptile=1]"      # max tasks per node--not needed
#                              # if task geometry used in LSF
#BSUB -o geomlsf.out.%J        # output filename
#BSUB -e geomlsf.err.%J        # error filename
#BSUB -P 99999999              # account number (project)
#BSUB -J geomlsf               # job name
#BSUB -q regular               # queue
#BSUB -W 2:00                  # 2 hour run limit
#------------LoadLeveler equivalents, commented out----------------------------
##@ job_type = parallel
##@ network.MPI = csss,not_shared,us   #Handled by POE elim in LSF
##@ total_tasks = 2                    #Note that total_tasks and
##@ tasks_per_node = 1                 #tasks_per_node are not needed
##@ node_usage = not_shared            #with task geometry in LoadLeveler
##@ output = geomlsf.out.$(jobid)
##@ error = geomlsf.err.$(jobid)
##@ account_no = 99999999
##@ jobname = geomlsf
##@ class = regular
##@ wall_clock_limit = 2:00
##@ task_geometry = {(0) (1)}
##@ ja_report = yes                    #Replaced by CISL Portal
#------------------------------------------------------------------------------
# Set up LSF Task Geometry to run across two nodes
# This is set using LSF environment variable rather than BSUB directive
# We will run program "fast" on 1 node with no threading (one MPI task)
# We will run program "slow" on 1 node with three threads under 1 MPI task
export LSB_PJL_TASK_GEOMETRY="{(0) (1)}"
#
# create SMP OMP executable
# No threads are allowed in the build of the "fast" executable
#
cat > geom.F << 'EOF'
      program main
      use omp_lib
      implicit none
      include 'mpif.h'
      integer i,lct,iam,rank,error,tag,length,status(MPI_STATUS_SIZE)
      real(kind=8):: sum, buf(2), elapsed, rtc
      logical print
      character (len=MPI_MAX_PROCESSOR_NAME) nodename
      integer name_len, ierr
      call mpi_init(error)
      call mpi_comm_rank(MPI_COMM_WORLD,rank,error)
      length=2
      tag=999
      lct = LOOP_LENGTH
      sum=0.0
      elapsed=rtc()
      CALL MPI_GET_PROCESSOR_NAME(NODENAME, name_len, ierr)
      print *, 'Task ', rank, ' loop length=', lct, 'on node ', nodename(1:name_len)
      print=.true.
!$omp parallel do reduction(+:sum) private(iam)
      do i=1,lct
        if(print .eqv. .true.)then
          iam=omp_get_thread_num()
          print *, 'Task ',rank,' Thread ',iam, 'i = ', i, 'on node ', nodename(1:name_len)
          print=.false.
          print *, 'Parallel loop, Task ', rank, ' nthreads = ', omp_get_num_threads()
        end if
        sum=sum+exp(.00000001*i)
      end do
      elapsed=rtc()-elapsed
      buf(1)=sum
      buf(2)=elapsed
      if( rank .eq. 1) then
        print *, 'Task 1 sending Sum results and Loop time to Task 0'
        call mpi_send(buf,length,MPI_REAL8,0,tag,MPI_COMM_WORLD,error)
      else
        print*,'Rank= 0, Sum=',buf(1), ' Loop time=', buf(2)
        call mpi_recv(buf,length,MPI_REAL8,1,tag,MPI_COMM_WORLD,status,error)
        print *, 'Task 0 received Sum results and Loop time from Task 1'
        print*,'Rank= 1, Sum=',buf(1), ' Loop time=', buf(2)
      end if
      call mpi_finalize(error)
      stop
      end
EOF
mpxlf_r -WF,-DLOOP_LENGTH=1000000 -qfixed=132 -O3 -qstrict \
    -qrealsize=8 -o fast geom.F
#
# build of the "slow" executable allows the use of threads to speed it up
# so is compiled with -qsmp=omp
mpxlf_r -WF,-DLOOP_LENGTH=3000000 -qfixed=132 -O3 -qstrict \
    -qrealsize=8 -qsmp=omp -o slow geom.F
#hostfile not used
# create task list (similar to poe command file), 3 threads for the "slow" executable
cat << EOF > cmds
env OMP_NUM_THREADS=1 fast
env OMP_NUM_THREADS=3 slow
EOF
#
#set necessary POE environment variables for MPMD job
export MP_PGMMODEL=mpmd
# run programs "fast" and "slow"
mpirun.lsf -cmdfile cmds
#rm fast slow cmds
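The bsub options list marks -n, -P, and -W as required, and each sample script above sets them with #BSUB lines. Before submitting your own script, a quick check along the following lines can catch a missing one. This is only a sketch: the helper name missing_required_options is hypothetical, and it assumes each directive sits on its own line beginning with #BSUB, as in the samples.

```shell
#!/bin/sh
# Hypothetical helper: print each required bsub option (-P, -n, -W)
# that has no matching "#BSUB" line in the given job script.
# usage: missing_required_options FILE
missing_required_options() {
    for opt in -P -n -W; do
        # Lines starting with "##BSUB" (commented out) do not match.
        grep -Eq "^#BSUB[[:space:]]+${opt}[[:space:]]" "$1" ||
            printf '%s\n' "$opt"
    done
}

# Demonstration on a throwaway script that lacks a -W line.
demo=$(mktemp)
printf '%s\n' '#!/bin/csh' '#BSUB -P 99999999' '#BSUB -n 1' './samp_f' > "$demo"
missing_required_options "$demo"   # prints -W
rm -f "$demo"
```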
The bhist command
Here are some ways you can use the bhist command to display historical information about jobs:
bhist -J job_name
bhist -C start_time,end_time
bhist -D start_time,end_time
bhist -S start_time,end_time
bhist -T start_time,end_time
The bpeek command
Here is how you can use the bpeek command to display stdout and stderr of a selected, unfinished job. Note that bpeek -f uses tail -f to display output instead of cat:
bpeek [-f] [-q queue_name | -m host_name | -J job_name | job_ID | "job_ID[index_list]"]
The bmod command
Please do not use the bmod command.
The bswitch command
Here are some ways you can use the bswitch command to switch unfinished jobs from one queue to another:
bswitch [-J job_name] [-m host_name | -m host_group] [-q queue_name] [-u user_name | -u user_group | -u all] destination_queue [0]
bswitch destination_queue [job_ID | "job_ID[index_list]"] ...
bswitch [-h | -V]
The bstop and bresume commands
Please do not use the bstop or bresume commands.
The bkill command
Here are some ways you can use the bkill command to send signals to kill, suspend, or resume unfinished jobs (we omit option -r because we ask you not to use it):
bkill [-l] [-g job_group_name | -sla service_class_name] [-J job_name] [-m host_name | -m host_group] [-q queue_name] [-s (signal_value | signal_name)] [-u user_name | -u user_group | -u all] [job_ID ... | 0 | "job_ID[index]" ...]
bkill [-h | -V]
Comparison of LoadLeveler and LSF queue commands
This chart helps you convert your LoadLeveler jobs to LSF jobs.
LoadLeveler       LSF                                Description
llsubmit script   bsub < script                      Submit a job script for execution.
llq               bjobs                              Show status of running and pending jobs.
llq               bhist                              Display historical information about your jobs.
llcancel          bkill                              Kill a job.
llhold            bstop (Please do not use bstop)    Hold a job.
llclass           bqueues                            Show configuration of queues.
-                 busers                             Display information about users and groups.
-                 bpeek                              Peek at the stderr and stdout of an unfinished job.
-                 bacct                              Display accounting information for finished job.
llstatus          bhosts                             Summarize load on each host.
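If you are converting many scripts or habits at once, the chart can be kept at hand as a small lookup. The following is only a sketch: ll_to_lsf is a hypothetical helper name, it covers just the LoadLeveler commands listed in the chart, and it declines to suggest bstop because this page asks you not to use it.

```shell
#!/bin/sh
# Hypothetical helper: print the LSF counterpart of a LoadLeveler command,
# following the comparison chart. Returns 1 when nothing is suggested.
# usage: ll_to_lsf LOADLEVELER_COMMAND
ll_to_lsf() {
    case $1 in
        llsubmit) echo 'bsub <' ;;      # bsub reads the script from stdin
        llq)      echo 'bjobs' ;;
        llcancel) echo 'bkill' ;;
        llclass)  echo 'bqueues' ;;
        llstatus) echo 'bhosts' ;;
        llhold)   echo 'bstop is the counterpart, but please do not use it' >&2
                  return 1 ;;
        *)        echo "no LSF counterpart listed for $1" >&2
                  return 1 ;;
    esac
}

ll_to_lsf llq        # prints bjobs
ll_to_lsf llstatus   # prints bhosts
```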
Table of contents - Bluefire Quick Start Guide
If you have questions about this document, please contact us via any of the methods (phone, email, ticket, or in person) described here: CISL Customer Support.
© Copyright 2009. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.
Address of this page: http://www.cisl.ucar.edu/docs/bluefire/lsf.html