Getting started on bluevista
last update: 07/04/2008

Running code

This section helps you start running jobs on bluevista.

Running interactive jobs

Short, interactive jobs may be run at the command line for debugging or test purposes. However, parallel jobs cannot be run interactively. See the section on running batch jobs below.

Running batch jobs using LSF

The batch system on bluevista is the Load Sharing Facility (LSF). The product is provided by Platform Computing, Inc., which also publishes a full set of manuals for LSF version 7.0.3 at the Platform Knowledge Centre. The Centre page features a search engine specific to the manuals. We especially recommend the short manual Running Jobs with Platform LSF for basic information about LSF, as an official companion to this bluevista document.

An LSF job is a script or a file containing LSF directives. You submit a batch job to the queues with the bsub command:

bsub < lsf_job_script_file

You can obtain a list of LSF batch queues using the bqueues command. See the bqueues man page for options.

You can list all of your queued and running jobs using the bjobs command. See the bjobs man page for options.

You can get a quick summary of all jobs running on the system using the lsfq command.

The spinfo command provides a quick summary of queue time limits and hardware information.

An LSF man page is available by typing man lsfintro on bluevista. More LSF documentation is installed on bluevista at /usr/local/docs/LSF/6.1/*.pdf

If you need to convert a LoadLeveler job script to run under LSF, see Frequently used options in job scripts below.

More information is provided in LSF for Bluevista Users (293 KB PowerPoint presentation).

LSF commands

These LSF commands are essential for running batch jobs on bluevista:

The bhosts command

The bhosts command is typically used with the following options:
bhosts [-w|-l][-R "res_req"][host_name|host_group]
to display information about hosts and platforms.

The related lshosts command displays hosts and their static resource information:
lshosts [-w | -l] [-R "res_req"] [host_name | cluster_name]
lshosts -s [shared_resource_name ...]

Sample bhosts output is shown below, where the ellipsis indicates host names omitted for brevity.

bv0101en$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
bv0101en           ok              -      8      0      0      0      0      0
bv0102en           ok              -      8      0      0      0      0      0
bv0103en           ok              -      8      0      0      0      0      0
...
bv1404en           ok              -      8      0      0      0      0      0
bv1405en           ok              -      8      0      0      0      0      0
bv1406en           ok              -      8      0      0      0      0      0

The bqueues command

The bqueues command is typically used with the following options:
bqueues [-w|-l|-r][-m host_name|-m all]
[-u user_name|-u all][queue_name ...]

to display information about job queues.

By default, the bqueues command returns the following information about all queues:

bv0101en$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
special         500  Open:Active       -    -    -    -     0     0     0     0
premium         300  Open:Active       -    -    -    -     0     0     0     0
regular         200  Open:Active       -    -    -    -  1184   784   400     0
economy         160  Open:Active       -    -    -    -     0     0     0     0
hold            104  Open:Active       -    -    -    -     0     0     0     0
standby         100  Open:Active       -    -    -    -     0     0     0     0
share           100  Open:Active       -    -    -    -     0     0     0     0

The bsub command

The bsub command is typically used as follows:
bsub [options] command [cmd_args]
to submit a job for batch execution.

Options for the bsub command include:

-B Sends mail at dispatch and initiation times.
-H Holds job in PSUSP and waits for bresume
-I | -Ip | -Is Submits as batch interactive
-K Submits job and locks cmd line with status updates
-N Sends job report by e-mail (use only with -I | -Is | -Ip or -o)
-r Rerun job on another host if host terminates
-x Exclusive execution mode
-a esub_parameters Specifies parallel job launcher (PJL) to be used
-b [[month:]day:]hour:minute Dispatch date/time
-C core_limit Limits size of core dumps (-C 0 recommended?)
-c [hours:]minutes[/host_name | /host_model] Cpu time limit
-D data_limit Sets per-process data segment size limit
-e err_file File to use as stderr
-E "pre_exec_command [arguments ...]" Pre-exec command invoked before batch stream command processing
-ext[sched] "external_scheduler_options" N/A
-f "local_file operator [remote_file]" ... Files to be copied between local/remote systems
-F file_limit Per process file size limit
-g job_group_name Submits job to a job group
-G user_group Associates job with a specific group
-i input_file | -is input_file Specifies stdin for job
-J job_name | -J "job_name[index_list]%job_slot_limit" Specifies job name
-k "checkpoint_dir [checkpoint_period][method=method_name]" Makes a job checkpointable and specifies checkpoint directory
-L login_shell Uses login_shell for runtime environment
-m "host_name[@cluster_name][+[pref_level]] | host_group[+[pref_level]] ..." Selects and ranks hosts/groups on which to run
-M mem_limit Sets per process memory limit
-n min_proc[,max_proc] Sets min/max number of processors required to run job
-o out_file Specifies stdout
-P project_name Specifies project name
-p process_limit Limits total number of processes
-q queue_name Specifies queue for job (default provided by system)
-R "res_req" Specifies resource requirements
-sla service_class_name Specifies service class for job
-sp priority Specifies priority amongst user's jobs
-S stack_limit Sets per-process stack limit
-t [[month:]day:]hour:minute Specifies job termination date
-T thread_limit Sets limit on number of concurrent threads
-U reservation_ID Uses reservation via brsvadd command
-u mail_user Mail-to address
-v swap_limit Sets total process virtual memory limit
-w 'dependency_expression' Defines dependencies to be met before job initiation
-wa '[signal | command | CHKPNT]' Specifies action to be taken before job control step occurs
-wt '[hours:]minutes' Specifies time interval before job control occurs to send warning signal
-W [hours:]minutes[/host_name | /host_model] Specifies run time limit for job
-Zs Spools command file and runs from there

Job submission - LSF usage is different from LoadLeveler

The following LSF job submission commands illustrate differences from LoadLeveler. (Note: this list is not exhaustive, and likely will continue to grow.)

LSF command bsub allows submission of an executable, e.g.

  bsub -i infile -o outfile -e errfile a.out
whereas the analogous LoadLeveler command llsubmit requires a job script to run an executable.

LSF can also be used to submit a job script. However, the LSF command bsub then requires redirection of its command file, specifically

  bsub < myscript
If redirection is missing, the job will not submit properly (ignoring all of your #BSUB directives), and you will see it vanish from the queues with no explanation.
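
Because the failure described above is silent, a small wrapper can make the redirection harder to forget. The following is a minimal sketch, not a site-provided tool: the function names count_bsub_directives and submit_script are hypothetical, and a POSIX-compatible shell such as ksh is assumed.

```shell
# Count the lines in a job script that begin with "#BSUB".
# (Hypothetical helper; not part of LSF.)
count_bsub_directives() {
    grep -c '^#BSUB' "$1"
}

# Submit a job script through standard input so that LSF reads its
# #BSUB directives. Refuses to submit a file that contains none.
submit_script() {
    if [ "$(count_bsub_directives "$1")" -eq 0 ]; then
        echo "$1 contains no #BSUB directives" >&2
        return 1
    fi
    bsub < "$1"
}
```

With these definitions loaded, submit_script myscript behaves like bsub < myscript but stops early if the file has no directives. Lines beginning with ##BSUB (commented-out directives) are not counted.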

Sample LSF job scripts

Serial job

#!/bin/csh
#
# LSF batch script to run a serial code
#
#BSUB -P 99999999                       # Project 99999999
#BSUB -n 1                              # number of tasks
#BSUB -J seriallsf.test                 # job name
#BSUB -o seriallsf.out                  # output filename
#BSUB -e seriallsf.err                  # error filename
#BSUB -q regular                        # queue
#BSUB -W 1:00                           # 1 hour wallclock limit (required)

# Fortran example
#xlf90 -o samp_f -Mextend samp.f
./samp_f

# C example
#cc -o samp_c samp.c
./samp_c

# C++ example
#g++  --no_auto_instantiation -o samp_cc samp.cc
./samp_cc

MPI job

Important: To submit MPI jobs, use mpirun.lsf, not mpirun, so that the LSF job scheduler can properly allocate node resources.

#!/bin/csh
#
# LSF batch script to run the test MPI code
#
#BSUB -P 99999999                       # Project 99999999
#BSUB -a poe                            # select poe
#BSUB -x                                # exclusive use of node (not_shared)
#BSUB -n 16                             # number of total (MPI) tasks
#BSUB -R "span[ptile=8]"                # run a max of 8 tasks per node
#BSUB -J mpilsf.test                    # job name
#BSUB -o mpilsf.out                     # output filename
#BSUB -e mpilsf.err                     # error filename
#BSUB -q regular                        # queue
#BSUB -W 1:00                           # 1 hour wallclock limit (required)

# Fortran example
#mpxlf90 -o mpi_samp_f mpisamp.f
mpirun.lsf ./mpi_samp_f

# C example
#mpcc -o mpi_samp_c mpisamp.c
mpirun.lsf ./mpi_samp_c

# C++ example
#mpCC -o mpi_samp_cc mpisamp.cc
mpirun.lsf ./mpi_samp_cc

Notes for SMT: The above example shows the "normal" way to run (without SMT). It will run 16 tasks on two 8-way nodes, placing one task per physical processor:
#BSUB -n 16
#BSUB -R "span[ptile=8]"

On an SMT-enabled node, you could run this same application on two 8-way nodes with twice the number of tasks per physical processor, and your job script would use the LSF directives:
#BSUB -n 32
#BSUB -R "span[ptile=16]"
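
The node count implied by a given -n and ptile pair is the task count divided by ptile, rounded up. The sketch below is only an illustration of that arithmetic; nodes_needed is a hypothetical helper, not an LSF command, and a POSIX-compatible shell such as ksh is assumed.

```shell
# Number of nodes needed for a given task count (-n) and per-node
# task cap (ptile): the ceiling of ntasks / ptile.
nodes_needed() {
    ntasks=$1
    ptile=$2
    echo $(( ($ntasks + $ptile - 1) / $ptile ))
}

nodes_needed 16 8     # without SMT: 16 tasks, 8 per node
nodes_needed 32 16    # with SMT:    32 tasks, 16 per node
```

Both calls print 2, which matches the two-node layouts described above.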


OpenMP job

#!/bin/csh
#
# LSF script to run the test OMP codes
#
#BSUB -P 99999999               # Project 99999999
#BSUB -x                        # exclusive use of node
#BSUB -n 1                      # number of tasks
#BSUB -R "span[hosts=1]"        # run all tasks on one host
#BSUB -J omplsf.test            # job name
#BSUB -o omplsf.out             # output filename
#BSUB -e omplsf.err             # error filename
#BSUB -q regular                # queue
#BSUB -W 1:00                   # 1 hour wallclock limit

setenv OMP_NUM_THREADS 8

# Fortran example
#xlf90 -o samp_f -qsmp=omp samp.f
./samp_f

# C example
#cc -qsmp=omp -o samp_c samp.c
./samp_c

# C++ example
#g++ --no_auto_instantiation -qsmp=omp -o samp_cc samp.cc
./samp_cc

MPMD (MPI) job

#!/bin/ksh
#
# LSF batch script to run a MPMD code. This example has two executables
# each requiring one task. Note on ptile: Setting ptile=1 as below 
# has the job running on two nodes; setting ptile=2,3,4,5,6,7 or 8 will
# force the job onto one node.
#
#BSUB -n 2                      # number of tasks
#BSUB -R "span[ptile=1]"        # max number of tasks/node (see ptile note above)
#BSUB -o mpmdlsf.%J.out         # output filename
#BSUB -e mpmdlsf.%J.err         # error filename
#BSUB -J mpmdlsf.test           # job name
#BSUB -W 0:10                   # 10 minutes wall clock time 
#BSUB -q regular                # queue
#BSUB -P 99999999               # Project 99999999

# Run this executable in MPMD mode:
export MP_PGMMODEL=mpmd
export MP_SHARED_MEMORY=yes
#
# Fortran example
cat << 'EOF' > mpmd.f
      program main
      implicit none
      include 'mpif.h'

      character (len=MPI_MAX_PROCESSOR_NAME) nodename
      integer name_len, ierr

! Establish the number of the process being run with the variable "rank"
      integer i
      integer rank,error,tag,length,status(MPI_STATUS_SIZE)
      integer hz, clock0, clock1, t
      real(kind=8)::  sum, buf(2), elapsed

      call mpi_init(error)
! Assign the process number to the current process with "rank"
      call mpi_comm_rank(MPI_COMM_WORLD,rank,error)
      length=2
      tag =999

! Test for the process being run, then calculate sum (process 1) or
! print sum (process 0)
      call system_clock(count_rate = hz)
      call system_clock(count = clock0)
      if (rank .eq. 1) then
          sum=0.0
          do i=1,1000000
              sum=sum+exp(.00000001*i)
          end do
          call system_clock(count = clock1)
          elapsed = real(clock1 - clock0) / hz
          buf(1)=sum
          buf(2)=elapsed/1000.0
          call mpi_send(buf,length,MPI_REAL8,0,tag,MPI_COMM_WORLD,error)
          call MPI_GET_PROCESSOR_NAME(nodename, name_len, ierr)
          print *,'f90 MPI task ', rank,' sending data from node ', nodename(1:name_len)
      elseif (rank .eq. 0) then
          call  mpi_recv(buf,length,MPI_REAL8,1,tag,MPI_COMM_WORLD,status,error)
          call MPI_GET_PROCESSOR_NAME(nodename, name_len, ierr)
          print *,'f90 MPI task ', rank,' receiving data on node ', nodename(1:name_len)
          print 10, buf(1), buf(2)
10        format(' f90 mpmd Results: Sum = ',1pe12.6,' Loop time =  ',0pf12.8)
      end if

      call mpi_finalize(error)
      stop
      end
EOF
mpxlf_r -qarch=auto -qtune=auto -O3 -qstrict -qrealsize=8 \
                  -qfixed=132 -o mpmdf mpmd.f
cp mpmdf mpmdf2
#
cat << end > cmds
mpmdf
mpmdf2
end
#
mpirun.lsf -cmdfile cmds
#
rm mpmdf mpmdf2 cmds


Hybrid job

#!/bin/ksh
#
# LSF batch script to run the test mixed MPI/OMP codes
# The following example runs a job that has been "hard-wired"
# for two MPI tasks. Each task will run 8 threads, so the
# job has been set up to run on two nodes. Note that the
# OpenMP threads are not taken into account in the -n task
# count, but you should ensure that you allow for each
# thread to have its own processor, or your performance
# will suffer

#BSUB -P 99999999                 # Project 99999999
#BSUB -a poe                      # use LSF poe elim
#BSUB -x                          # exclusive use of node (not_shared)
#BSUB -n 2                        # total tasks (MPI) needed
#BSUB -R "span[ptile=1]"          # max number of tasks (MPI) per node
#BSUB -o mixlsf.out               # output filename
#BSUB -e mixlsf.err               # error filename
#BSUB -J mixlsf.test              # job name
#BSUB -q regular                  # queue
#BSUB -W 1:00                     # wallclock limit of 1 hour
#
#
EXE=./mix
#
export OMP_NUM_THREADS=8

# Fortran example
#mpxlf90_r -qsmp=omp -o mix mix.f
mpirun.lsf $EXE

# C example
#mpcc_r -qsmp=omp -o mix mix.c
#mpirun.lsf $EXE

# C++ example
#mpCC_r --no_auto_instantiation -qsmp=omp -o mix mix.cc
#mpirun.lsf $EXE

Notes for SMT: In the above application (not using SMT), you have two tasks and each task has eight OpenMP threads. By using ptile=1 you are specifying that the maximum number of (MPI) tasks per node is 1, so the job will have to run on two nodes. Your job script would use the following LSF directives and set the number of OpenMP threads via an environment variable:
#BSUB -n 2
#BSUB -R "span[ptile=1]"
export OMP_NUM_THREADS=8

To take advantage of SMT, you can modify the above script by treating the nodes as if they had twice the number of processors, that is, doubling both the number of tasks (-n) you request and the ptile value. This keeps the same node count:
#BSUB -n 4
#BSUB -R "span[ptile=2]"
export OMP_NUM_THREADS=8


Hybrid MPMD job

#!/bin/ksh
#
# LSF batch script to run a test Fortran mixed MPI/OMP code using task geometry
# Equivalent LoadLeveler directives are given below.

#BSUB -a poe                            # select the POE elim for IBMs
#BSUB -n 2                              # total number of (MPI) tasks
##BSUB -R "span[ptile=1]"               # max tasks per node--not needed
                                        # if task geometry used in LSF
#BSUB -x                                # exclusive use of node (not_shared)
#BSUB -o geomlsf.out.%J                 # output filename
#BSUB -e geomlsf.err.%J                 # error filename
#BSUB -P 99999999                       # account number (project)
#BSUB -J geomlsf                        # job name
#BSUB -q regular                        # queue
#BSUB -W 2:00                           # 2 hour run limit
#------------LoadLeveler equivalents, commented out----------------------------
##@ job_type = parallel
##@ network.MPI = csss,not_shared,us   #Handled by POE elim in LSF
##@ total_tasks = 2                    #Note that total_tasks and
##@ tasks_per_node = 1                 #tasks_per_node are not needed
##@ node_usage = not_shared            #with task geometry in LoadLeveler
##@ output = geomlsf.out.$(jobid)
##@ error = geomlsf.err.$(jobid)
##@ account_no = 99999999
##@ jobname = geomlsf
##@ class = regular
##@ wall_clock_limit = 2:00
##@ task_geometry = {(0) (1)}
##@ ja_report = yes                    #Not available in LSF. Use CISL Portal
         #https://www.portal.scd.ucar.edu:8443/scd-portal/displayMainPage.do

#------------------------------------------------------------------------------
# Set up LSF Task Geometry to run across two nodes
# This is set using LSF environment variable rather than BSUB directive
# We will run program "fast" on 1 node with no threading (one MPI task)
# We will run program "slow" on 1 node with three threads under 1 MPI task
export LSB_PJL_TASK_GEOMETRY="{(0) (1)}"
#
# create SMP OMP executable
# No threads are allowed in the build of the "fast" executable
#
cat > geom.F << 'EOF'
      program main
      use omp_lib
      include 'mpif.h'

      implicit none
      integer i,lct,iam,rank,error,tag,length,status(MPI_STATUS_SIZE)
      real(kind=8):: sum, buf(2), elapsed, rtc
      logical print

      character (len=MPI_MAX_PROCESSOR_NAME) nodename
      integer name_len, ierr

      call mpi_init(error)
      call mpi_comm_rank(MPI_COMM_WORLD,rank,error)
      length=2
      tag=999
      lct = LOOP_LENGTH
      sum=0.0

      elapsed=rtc()
      CALL MPI_GET_PROCESSOR_NAME(NODENAME, name_len, ierr)
      print *, 'Task ', rank, ' loop length=', lct, 'on node ', nodename(1:name_len)
      print=.true.
!$omp  parallel do reduction(+:sum) private(iam)
      do i=1,lct
         if(print .eqv. .true.)then
           iam=omp_get_thread_num()
           print *, 'Task ',rank,' Thread ',iam, 'i = ', i, 'on node ', nodename(1:name_len)
           print=.false.
           print *, 'Parallel loop, Task ', rank, ' nthreads = ', omp_get_num_threads()
         end if
         end if
         sum=sum+exp(.00000001*i)
      end do
      elapsed=rtc()-elapsed
      buf(1)=sum
      buf(2)=elapsed

      if( rank .eq. 1) then
      print *, 'Task 1 sending Sum results and Loop time to Task 0'
      call mpi_send(buf,length,MPI_REAL8,0,tag,MPI_COMM_WORLD,error)

      else

      print*,'Rank= 0, Sum=',buf(1), ' Loop time=', buf(2)
      call mpi_recv(buf,length,MPI_REAL8,1,tag,MPI_COMM_WORLD,status,error)
      print *, 'Task 0 received Sum results and Loop time from Task 1'
      print*,'Rank= 1, Sum=',buf(1), ' Loop time=', buf(2)

      end if

      call mpi_finalize(error)
      stop
      end

EOF
mpxlf_r  -WF,-DLOOP_LENGTH=1000000 -qfixed=132 -O3 -qstrict \
         -qrealsize=8 -o fast geom.F
#
# build of the "slow" executable allows the use of threads to speed it up
# so is compiled with -qsmp=omp
mpxlf_r  -WF,-DLOOP_LENGTH=3000000 -qfixed=132 -O3 -qstrict \
          -qrealsize=8 -qsmp=omp -o slow geom.F
#hostfile not used

# create task list (similar to poe command file), 3 threads for the "slow" executable
cat << EOF > cmds
env OMP_NUM_THREADS=1 fast
env OMP_NUM_THREADS=3 slow
EOF
#
#set necessary POE environment variables for MPMD job

export MP_PGMMODEL=mpmd
# run programs "fast" and "slow"
mpirun.lsf -cmdfile cmds
#rm fast slow  cmds

The bhist command

Here are some ways you can use the bhist command to display historical information about jobs:

bhist -J job_name
bhist -C start_time,end_time
bhist -D start_time,end_time
bhist -S start_time,end_time
bhist -T start_time,end_time

The bpeek command

Here is how you can use the bpeek command to display stdout and stderr of a selected, unfinished job. Note that bpeek -f uses tail -f to display output instead of cat:

bpeek [-q queue_name | -m host_name | -J job_name | job_ID | "job_ID[index_list]"]

The bmod command

Here are some ways you can use the bmod command to modify job submission options of a job:

bmod [bsub options] [job_ID | "job_ID[index]"]
bmod -g job_group_name | -gn [job_ID]
bmod [-sla service_class_name | -slan] [job_ID]
bmod [-h | -V]
Caution: It appears that the LSF bmod command does not remember changes from one bmod to the next. In other words, if you use bmod to change the wallclock limit to, say, two hours, then issue a second bmod to change the queue from, say, standby to regular, the second bmod overrides the first, and the wallclock change is lost. So, if you use multiple bmod commands against a job, each subsequent bmod must contain all of the changes from the previous ones.
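
As an illustration of this caution, the following sketch restates the earlier change in the second bmod. The job ID 12345 is hypothetical; -W and -q are the bsub options listed earlier on this page.

```shell
# 12345 is a hypothetical job ID; substitute the ID reported by bjobs.

# First change: set the wallclock limit to two hours.
bmod -W 2:00 12345

# Second change: move the job to the regular queue. Because the second
# bmod overrides the first, the wallclock limit is restated here.
bmod -W 2:00 -q regular 12345
```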

The bswitch command

Here are some ways you can use the bswitch command to switch unfinished jobs from one queue to another:

bswitch [-J job_name] [-m host_name | -m host_group]
   [-q queue_name] [-u user_name | -u user_group | -u all]
   destination_queue [0]
bswitch destination_queue [job_ID | "job_ID[index_list]"] ...
bswitch [-h | -V]

The bstop and bresume commands

Here are some ways you can use the bstop command to suspend unfinished jobs:

bstop [-a] [-d] [-g job_group_name |-sla service_class_name]
    [-J job_name] [-m host_name | -m host_group]
    [-q queue_name] [-u user_name | -u user_group | -u all] [0]
    [job_ID | "job_ID[index]"] ...
bstop [-h | -V]

Here are some ways you can use the bresume command to resume one or more suspended jobs:

bresume [-g job_group_name] [-J job_name] [-m host_name ]
    [-q queue_name] [-u user_name | -u user_group | -u all ] [0]
bresume [job_ID | "job_ID[index_list]"] ...
bresume [-h | -V]

The bkill command

Here are some ways you can use the bkill command to send signals to kill, suspend, or resume unfinished jobs:

bkill [-l] [-g job_group_name | -sla service_class_name]
   [-J job_name] [-m host_name | -m host_group]
   [-q queue_name] [-r | -s (signal_value | signal_name)]
   [-u user_name | -u user_group | -u all]
   [job_ID ... | 0 | "job_ID[index]" ...]
bkill [-h | -V]

Comparison of LoadLeveler and LSF queue commands

This chart helps you convert your LoadLeveler jobs to LSF jobs.

LoadLeveler      LSF            Description
llsubmit script  bsub < script  Submit a job script for execution.
llq              bjobs          Show status of running and pending jobs.
                 bhist          Display historical information about your jobs.
llcancel         bkill          Kill a job.
llhold           bstop          Hold a job.
llclass          bqueues        Show configuration of queues.
                 busers         Display information about users and groups.
                 bpeek          Peek at the stderr and stdout of an unfinished job.
                 bacct          Display accounting information for finished job.
llstatus         bhosts         Summarize load on each host.

Frequently used options in job scripts

This chart maps frequently used LoadLeveler job script keywords to their LSF equivalents.

LoadLeveler                    LSF                                            Description
#@job_name = jobname           #BSUB -J jobname                               Assign job a name
#@notify_user = login_name     #BSUB -B                                       Send email when job starts
#@notification = start
#@notification = complete      #BSUB -N                                       Email finished job report
#@error = error-file           #BSUB -e error-file                            Redirect stderr to specified file
#@output = out-file            #BSUB -o out-file                              Redirect stdout
                               #BSUB -a application                           E-sub parameter
#@account_no = project-number  #BSUB -P project-number                        Charge job to specified project
#@wall_clock_limit = runtime   #BSUB -W runtime                               Set run (wall clock) limit
#@class = queue_name           #BSUB -q queue_name                            Submit job to specified queue

Nodes needed, total MPI tasks, and maximum MPI tasks per node
#@node = num_nodes                                                            Specify number of nodes to use
#@total_tasks = total_procs    #BSUB -n total_num_tasks                       Specify total number of MPI tasks and give a processor to each
#@tasks_per_node = num_procs   #BSUB -R "span[ptile=max_num_tasks_per_node]"  (used for MPI) Specify maximum number of procs (MPI tasks) used on each node

Using Simultaneous Multi-Threading (SMT)

Simultaneous Multi-Threading (SMT) is a feature that became available under AIX 5.3 and works on Power 5-based systems. No code changes are needed. With simple modifications to your job scripts, you may be able to boost performance by 20% or more on some applications.

Under SMT, the Power 5 doubles the number of active threads on a processor by implementing a second, on-board "virtual" processor that is enabled by the CPU architecture. The basic concept of SMT is that no single process uses all processor execution units at the same time, so a second thread can utilize unused cycles.

To take advantage of SMT, double the value of the ptile parameter, i.e. ptile=16 instead of ptile=8. This establishes 16 virtual processors on the bluevista node, instead of just 8 physical processors.

Pure MPI jobs

An MPI-only, non-SMT job submitted to run on four 8-way nodes (that is, -n 32 and ptile=8) can be modified to use SMT in either of two ways. Specifying -n 32 and ptile=16 runs the same tasks on two 8-way nodes. Specifying -n 64 and ptile=16 continues to use four 8-way nodes with twice as many tasks, assuming the job scales up. The latter method might also be preferable if wallclock time is the primary consideration.

The relative benefit of each of these approaches can then be examined by comparing LSF's report of "Resource usage summary" that is included in the file specified by the -o bsub option.

Hybrid jobs

A non-SMT job that runs 8 MPI tasks across four 8-way bluevista nodes, with each MPI task spawning 4 OpenMP threads, would specify -n 8, ptile=2, and OMP_NUM_THREADS=4. The same job can be run with SMT by keeping -n 8 and OMP_NUM_THREADS=4 but switching to ptile=4; it would then use half the number of 8-way nodes. Alternatively, to keep the node count the same, specify -n 16, ptile=4, and OMP_NUM_THREADS=4.
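
The arithmetic behind these three configurations can be checked with a short shell function. This is an illustrative sketch only: hybrid_layout is a hypothetical helper, not an LSF or bluevista command, and a POSIX-compatible shell such as ksh is assumed.

```shell
# Print the node count and the threads per node for a hybrid MPI/OpenMP
# layout. Arguments: total MPI tasks (-n), ptile, OMP_NUM_THREADS.
hybrid_layout() {
    ntasks=$1
    ptile=$2
    nthreads=$3
    nodes=$(( ($ntasks + $ptile - 1) / $ptile ))    # ceiling of ntasks/ptile
    threads_per_node=$(( $ptile * $nthreads ))
    echo "nodes=$nodes threads_per_node=$threads_per_node"
}

hybrid_layout 8 2 4     # non-SMT:              nodes=4 threads_per_node=8
hybrid_layout 8 4 4     # SMT, half the nodes:  nodes=2 threads_per_node=16
hybrid_layout 16 4 4    # SMT, same node count: nodes=4 threads_per_node=16
```

A threads_per_node value of 8 corresponds to one thread per physical processor on an 8-way node, and 16 corresponds to the doubled (virtual) processor count under SMT.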

Note for hybrid jobs: Under AIX 5.3, a known defect causes performance problems in hybrid applications when the application reads stdin redirected from a file, e.g., cam < namelist. The workaround is to set MP_STDINMODE=0 in the environment. This may be important for getting the best performance under SMT.

Examples of job scripts using SMT are on bluevista under the /usr/local/examples/lsf/batch/smt directory.

MPMD jobs

To run an MPMD program such as CCSM using SMT, an MPI job with 21 tasks can fit on two bluevista nodes instead of three with just these simple changes:

  1. Modify ptile setting (maximum number of tasks per node) in LSF:
    #BSUB -R "span[ptile=16]"    #bluevista default without SMT is 8
    
  2. The number of tasks your job requests remains the same:
    #BSUB -n 21    # number of tasks
    
  3. If your job uses task geometry, modify the LSB_PJL_TASK_GEOMETRY environment variable as if the node had 16 processors rather than 8, for example:

    Old task geometry: export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,5,6,7)(8,9,10,11,12,13,14,15)(16,17,18,19,20)}"

    New task geometry: export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)(16,17,18,19,20)}"
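
For larger task counts, writing the geometry string by hand is error-prone. The sketch below generates a string in the same format by packing tasks onto nodes in rank order. It is an illustration under stated assumptions: make_task_geometry is a hypothetical helper, not part of LSF, it expects at least one task, and it covers only this simple packing (not arbitrary task placement).

```shell
# Build a task geometry string that places ntasks MPI tasks on nodes in
# rank order, with at most per_node tasks on each node.
# (Hypothetical helper; not part of LSF. Assumes ntasks >= 1.)
make_task_geometry() {
    ntasks=$1
    per_node=$2
    geom="{"
    task=0
    while [ "$task" -lt "$ntasks" ]; do
        if [ $(( $task % $per_node )) -eq 0 ]; then
            # First task on a node: close the previous group, open a new one.
            if [ "$task" -gt 0 ]; then
                geom="$geom)"
            fi
            geom="$geom($task"
        else
            geom="$geom,$task"
        fi
        task=$(( $task + 1 ))
    done
    echo "$geom)}"
}

# 21 tasks with 16 tasks per node, as in the SMT example above.
export LSB_PJL_TASK_GEOMETRY="$(make_task_geometry 21 16)"
echo "$LSB_PJL_TASK_GEOMETRY"
```

Calling make_task_geometry 21 8 instead reproduces the three-node layout used without SMT.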

SMT should aid in getting better throughput of your jobs and better performance for your GAU charges. Below are suggestions for testing whether SMT usage will benefit your applications:

Instructions for using SMT with the Community Climate System Model (CCSM) run scripts are given in the document, "Taking advantage of Simultaneous Multi-Threading on bluevista when running CCSM."



If you have questions about this document, please contact CISL Customer Support. You can also reach us by telephone 24 hours a day, seven days a week at 303-497-1200. Additional contact methods: consult1@ucar.edu and during business hours in NCAR Mesa Lab Suite 39.

© Copyright 2005-2006. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.

Address of this page: http://www.scd.ucar.edu/docs/bluevista/run.html