NCAR
Last update: 11/20/2007

Compiling and running on frost

This document is for any user with an NCAR account on frost or with a TeraGrid account. Frost is an IBM BlueGene/L supercomputer designed to run C, C++, and Fortran programs using MPI. This page provides tips on compiling, linking, and running executables. All of the information applies equally to NCAR and TeraGrid users. Many TeraGrid-specific tips are on the User Portal, which is reachable from the TeraGrid Main Page.

Compiling

This section focuses on the cross-compilers that IBM provides in its BlueGene/L software stack for Fortran, C, and C++: the IBM XL Fortran, C, and C++ compilers. They are cross-compilers because you compile code on frost's interactive nodes and then run the resulting executables on the batch nodes. The compilers are not available on frost's batch nodes.

frost cross-compilers

Language  Compiler name
=======================
C         blrts_xlc
C++       blrts_xlC
F77       blrts_xlf
F90       blrts_xlf90
F95       blrts_xlf95

  • The system include files are in /bgl/BlueLight/ppcfloor/bglsys/include
  • The mpi and system driver libraries are under /bgl/BlueLight/ppcfloor/bglsys/lib
    • The following libraries must be linked in this order: -lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts
  • You may also use these scripts in /contrib/bgl/bin to automatically set the include and MPI libraries for you:
    • mpxlc
    • mpxlC
    • mpxlf
    • mpxlf90
  • The mpicc and mpif77 wrappers are also available, but they use the GNU gcc compilers, which do not match the performance of the XL compilers. Please use the XL compilers when compiling your codes.
  • OpenMP directives are not available under any compilers on frost.
  • TeraGrid users may wish to use the softenv command for consistency with other systems where they have used it. To enable softenv and the other TeraGrid commands on frost, remove the file .nosoft from your home directory on frost, then log out. When you log back in, the commands will be in your path. General information on softenv, compilers, and environment variables common to TeraGrid systems is available under the Documentation tab on the User Portal, which is reachable from the TeraGrid Main Page.

Compiler flags

  • Optimization levels:
    • -O: good place to start, use with -qmaxmem=64000
    • -O2: same as -O
    • -O3 -qstrict: tries more aggressive optimization, while strictly obeying program semantics
    • -O3: aggressive, allows re-association, will replace division by multiplication with the inverse
    • -qhot: turns on the high-order transformation module, which adds vector routines unless -qhot=novector is specified
      • check listing: -qreport=hotlist
  • Start with -g -O -qarch=440 -qmaxmem=64000

Example

  • Here's an example of compiling and linking a simple program:
    • bash$ blrts_xlc -c example.c -I/bgl/BlueLight/ppcfloor/bglsys/include
    • bash$ blrts_xlc -o example example.o -I/bgl/BlueLight/ppcfloor/bglsys/include -L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts
  • The same compile using mpxlc from /contrib/bgl/bin:
    • bash$ mpxlc -c example.c
    • bash$ mpxlc -o example example.o

Note: The compile and link flags are the same for Fortran 77/90/95 and C/C++.
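
As a sketch of how the compile and link steps above fit together, the following Python helper assembles both command lines as argument lists. The function names compile_cmd and link_cmd are hypothetical; the paths, the library order, and the starting flags are the ones given in this document. The helper only builds the commands and does not run the compiler.

```python
# Paths, library order, and starting flags as given in this document.
BGLSYS = "/bgl/BlueLight/ppcfloor/bglsys"
MPI_LIBS = ("-lmpich.rts", "-lmsglayer.rts", "-lrts.rts", "-ldevices.rts")
START_FLAGS = ("-g", "-O", "-qarch=440", "-qmaxmem=64000")


def compile_cmd(source, compiler="blrts_xlc", flags=START_FLAGS):
    """Return the argument list that compiles `source` to an object file."""
    return [compiler, *flags, "-c", source, f"-I{BGLSYS}/include"]


def link_cmd(output, objects, compiler="blrts_xlc"):
    """Return the argument list that links `objects`, keeping the library order."""
    return [compiler, "-o", output, *objects, f"-L{BGLSYS}/lib", *MPI_LIBS]


if __name__ == "__main__":
    print(" ".join(compile_cmd("example.c")))
    print(" ".join(link_cmd("example", ["example.o"])))
```

Passing compiler="blrts_xlf90" would build the equivalent Fortran 90 commands, since the flags are shared across languages.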

Further information

To get further information on the IBM XL compilers for BlueGene/L, see these web pages:

Of particular interest on the above XL C/C++/Fortran websites is the following IBM BlueGene/L document:

  • Using the XL Compilers for BlueGene

In browsing it, please see the "Related information" chapter for information on man pages for the xlf, xlc, and xlC compilers, and further documentation references.

Note: We do not discuss GNU compilers here, other than to mention that the gcc compiler is available on frost as executable /usr/local/bin/gcc. It is mainly used to compile and build products (e.g. gmake) whose toolchain requires it. Otherwise, we recommend using the IBM XL compilers, for performance reasons.

Queues on frost

Cobalt is the batch system installed on frost. It handles partition sizes of 1024, 512, 256, 128, 64, and 32 nodes. The job queuing commands are located under /usr/bin on the login node; below we provide a brief overview of the Cobalt commands cqstat (job status), cqsub (job submission), and cqdel (job deletion). TeraGrid users may wish to log on to the User Portal from the TeraGrid Home Page to check the system load of the various computers (click the Resources tab).
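
For planning purposes, the partition sizes listed above can be used to estimate which partition a job would occupy. The sketch below picks the smallest listed size that holds a requested node count. The function name smallest_partition is hypothetical, and the idea that a job occupies the smallest partition that fits is an assumption, not a statement of how Cobalt schedules.

```python
# Partition sizes (in nodes) listed in this document.
PARTITION_SIZES = (32, 64, 128, 256, 512, 1024)


def smallest_partition(nodes):
    """Return the smallest listed partition size that can hold `nodes` nodes."""
    if nodes < 1:
        raise ValueError("node count must be positive")
    for size in PARTITION_SIZES:
        if nodes <= size:
            return size
    raise ValueError(f"{nodes} nodes exceeds the largest partition")


if __name__ == "__main__":
    for request in (28, 32, 900):
        print(request, "->", smallest_partition(request))
```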

Charges on frost

Charging differs, depending on whether usage is on an NCAR account or a TeraGrid account:

  • NCAR users: No formal charging has been established, but per-user accounting statistics are gathered and monitored, and the NCAR administrators contact users when unusual patterns are seen.
  • TeraGrid users can see their per-machine charges via their User Portal logon.

Job Submission on frost

TeraGrid users may check the User Portal's Batch Queue Prediction Form to see when their intended job may run on a given system. At this time (October 2007) the User Portal does not have job-submission capability.

Submitting a job with Cobalt

Use the cqsub command to submit a job to the queue.

cqsub executable

Required flags:

  • -n NP, where NP is the number of nodes
  • -t TIME, where TIME is how long your job will take to run, in hours:minutes:seconds format (the seconds field is ignored).

Example:

$ cqsub -n 32 -t 00:10:00 example.rts
submitting walltime=10.0 minutes
162

In this example STDOUT is stored in 162.output, and STDERR is stored in 162.error in the current directory.
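
The submission above can be sketched in Python as follows. The helper names walltime, cqsub_cmd, and output_files are hypothetical; only the -n and -t flags and the output file naming come from this document. The helpers build the argument list and file names without calling cqsub.

```python
def walltime(minutes):
    """Format whole minutes as HH:MM:00 (cqsub ignores the seconds field)."""
    hours, mins = divmod(int(minutes), 60)
    return f"{hours:02d}:{mins:02d}:00"


def cqsub_cmd(nodes, minutes, executable):
    """Return the cqsub argument list with the two required flags."""
    return ["cqsub", "-n", str(nodes), "-t", walltime(minutes), executable]


def output_files(job_id):
    """Default STDOUT and STDERR file names for the job ID printed by cqsub."""
    return f"{job_id}.output", f"{job_id}.error"


if __name__ == "__main__":
    print(" ".join(cqsub_cmd(32, 10, "example.rts")))
    print(output_files(162))
```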

Some optional flags (see man cqsub for more):

  • -O OUTPUT_PREFIX, where OUTPUT_PREFIX is the prefix for the output file names: STDOUT is written to OUTPUT_PREFIX.output and STDERR to OUTPUT_PREFIX.error. If OUTPUT_PREFIX is not specified, the job ID is used as the prefix, as in the 162.output and 162.error files of the example above.
  • -C CWD, where CWD is the working directory for the code to run in (not necessarily where the executable resides). The output files and any other files that are opened without specifying a path will be stored in the directory specified by CWD.
  • -m MODE, where MODE is co (coprocessor mode) or vn (virtual-node mode)
  • -c COUNT, where COUNT is the number of processors to use. By default this is equal to the number of nodes in coprocessor mode, and twice the number of nodes in virtual-node mode. This option is generally used in conjunction with -m vn to specify an odd number of processes in virtual-node mode.
  • Example: To specify 55 processes in virtual-node mode:
    $ cqsub -n 28 -c 55 -m vn -t 00:10:00 example.rts
  • -N email address, sends an email message at the start and stop of the job to the specified email address. Multiple email addresses, separated by colons, can be specified.
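
The relationship between -n, -c, and -m described above is simple arithmetic: one process per node in coprocessor mode, two in virtual-node mode. The sketch below works out the node count for a desired number of processes and adds -c only when the count differs from the default. The function names are hypothetical and the helper only builds the argument list.

```python
import math

# Processes per node in each mode, as described in this document.
PROCS_PER_NODE = {"co": 1, "vn": 2}


def default_procs(nodes, mode):
    """Default process count for `nodes` nodes in the given mode."""
    return nodes * PROCS_PER_NODE[mode]


def nodes_needed(procs, mode):
    """Fewest nodes that can hold `procs` processes in the given mode."""
    return math.ceil(procs / PROCS_PER_NODE[mode])


def cqsub_args(procs, mode, time, executable):
    """Build a cqsub argument list, adding -c only when it is needed."""
    nodes = nodes_needed(procs, mode)
    args = ["cqsub", "-n", str(nodes)]
    if procs != default_procs(nodes, mode):
        args += ["-c", str(procs)]
    return args + ["-m", mode, "-t", time, executable]


if __name__ == "__main__":
    print(" ".join(cqsub_args(55, "vn", "00:10:00", "example.rts")))
```

For 55 processes in virtual-node mode this reproduces the -n 28 -c 55 example above.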

Job status with Cobalt

Use the cqstat command to see what jobs are queued or running. WallTime is in hours:minutes:seconds.

$ cqstat
JobID  User      WallTime  Nodes  State    Location
=====================================================================
18453  lando     07:30:00  256    running  256_R000_J102_N0
18454  hsolo     06:30:00  256    running  256_R000_J203_N8
18455  hsolo     06:30:00  256    running  256_R001_J102_N0
18456  luke      06:30:00  256    running  256_R001_J203_N8
18464  yoda      00:30:00  1024   queued   N/A
18465  yoda      00:30:00  1024   queued   N/A
18466  luke      00:30:00  900    queued   N/A
18467  lando     00:30:00  1024   queued   N/A
18468  lando     00:30:00  1024   queued   N/A

The -f flag gives more info:

$ cqstat -f
JobID  JobName  User      WallTime  RunTime   Nodes  State    Location          Mode  Procs  Queue    StartTime
=========================================================================================================================
18453  -        lando     07:30:00  02:48:40  256    running  256_R000_J102_N0  vn    512    default  04/20/06 12:08:57
18454  -        hsolo     06:30:00  02:45:04  256    running  256_R000_J203_N8  vn    512    default  04/20/06 12:12:32
18455  -        hsolo     06:30:00  01:01:25  256    running  256_R001_J102_N0  vn    512    default  04/20/06 13:56:12
18456  -        luke      06:30:00  00:56:00  256    running  256_R001_J203_N8  vn    512    default  04/20/06 14:04:34
18464  -        yoda      00:30:00  N/A       1024   queued   N/A               vn    1089   default  N/A
18465  -        yoda      00:30:00  N/A       1024   queued   N/A               vn    1296   default  N/A
18466  -        luke      00:30:00  N/A       900    queued   N/A               co    900    default  N/A
18467  -        lando     00:30:00  N/A       1024   queued   N/A               vn    1936   default  N/A
18468  -        lando     00:30:00  N/A       1024   queued   N/A               vn    2025   default  N/A
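
If you want to post-process these listings, a minimal Python parser might look like the sketch below. It assumes the input is the raw cqstat output (header row, separator row, one row per job, no shell prompt line) laid out as in the samples above; the function name parse_cqstat is hypothetical.

```python
def parse_cqstat(text):
    """Parse cqstat output into a list of dicts keyed by the header names."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        if set(line) == {"="}:  # skip the ===== separator row
            continue
        # The last column (StartTime in -f output) may contain a space,
        # so split at most len(header) - 1 times.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


SAMPLE = """\
JobID  User      WallTime  Nodes  State    Location
=====================================================================
18453  lando     07:30:00  256    running  256_R000_J102_N0
18464  yoda      00:30:00  1024   queued   N/A
"""

if __name__ == "__main__":
    for job in parse_cqstat(SAMPLE):
        if job["State"] == "queued":
            print(job["JobID"], job["User"], job["Nodes"])
```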

Cancelling a job with Cobalt

Use the cqdel command to cancel a job that has been submitted to the queue.

$ cqdel 162
      Deleted Jobs
JobID   User
==============
  162  joeuser 

Note: A running job may take some time to be deleted, but you can check the job's .error file to see whether the deletion is in progress.

System availability commands

Use the partlist command to see what partitions are available and which are in use.

$ partlist
Name              Queue    State
==================================
NCAR_R00          default  busy*
NCAR_R000         default  busy*
256_R000_J102_N0  default  busy*
128_R000_J102_N0  default  busy*
64_R000_J102_N0   default  busy
32_R000_J102_N0   default  busy*
32_R000_J104_N1   default  busy*
64_R000_J106_N2   default  idle
32_R000_J106_N2   default  idle
32_R000_J108_N3   default  idle
...

This listing is what you would see when a job is running on 64_R000_J102_N0. NCAR_R00, NCAR_R000, 256_R000_J102_N0, and the other partitions that overlap with 64_R000_J102_N0 are therefore also busy (they appear as busy* in the listing), while 64_R000_J106_N2 is idle and available.
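
To pick out the idle partitions from a partlist listing like the one above, a short Python sketch could group partition names by state. It assumes the three-column layout shown in the sample; the function name partitions_by_state is hypothetical.

```python
def partitions_by_state(text):
    """Map each state string (e.g. 'idle', 'busy', 'busy*') to partition names."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three columns; this skips the header,
        # the ===== separator, and a trailing "..." line.
        if len(fields) != 3 or fields[0] == "Name":
            continue
        name, _queue, state = fields
        states.setdefault(state, []).append(name)
    return states


SAMPLE = """\
Name              Queue    State
==================================
NCAR_R00          default  busy*
64_R000_J102_N0   default  busy
64_R000_J106_N2   default  idle
32_R000_J106_N2   default  idle
...
"""

if __name__ == "__main__":
    print(partitions_by_state(SAMPLE).get("idle", []))
```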

You may also use the nodes helper script, which displays only the currently free partitions. nodes -v lists all of the partitions (from the partlist -l output) and indents the output to show the partition hierarchy.

Use the showres -s command to see what system reservations are in place.

When your job isn't running...

If your job is 'queued' and it seems like it should be running, please take the following actions, in the order shown:

  • Check the output from partlist or nodes. That will tell you which partitions are idle (available for jobs to run).
  • If there are reservations in showres output that start before your job's walltime would end, then your job will not run.
  • Check that the queues are running using cqstat -q. The queues may be stopped for system maintenance.
  • Your job may be limited by queue restrictions. Check the queue restrictions with cqstat -q.
  • If your job is listed in the hold state, then it has been held by an administrator. Send email to the appropriate address in the next item to find out why.
  • If still stymied, send a message describing the situation to the appropriate one of these addresses:
    • NCAR: frost-help@ucar.edu
    • TeraGrid: help@teragrid.org
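
The reservation check in the list above is a time comparison: the job is blocked if a reservation starts before the job's walltime would end. The Python sketch below expresses that rule, assuming you have already read the reservation start time from the showres output yourself and that the reservation covers the partition your job needs. The function name blocked_by_reservation is hypothetical.

```python
from datetime import datetime, timedelta


def blocked_by_reservation(start, walltime, reservation_start):
    """True if a job started at `start` would still run at `reservation_start`.

    `walltime` is an hours:minutes:seconds string as passed to cqsub -t;
    the seconds field is ignored, as cqsub does.
    """
    hours, minutes, _seconds = (int(part) for part in walltime.split(":"))
    job_end = start + timedelta(hours=hours, minutes=minutes)
    return reservation_start < job_end


if __name__ == "__main__":
    now = datetime(2007, 11, 20, 6, 0)
    friday_window = datetime(2007, 11, 20, 8, 0)
    print(blocked_by_reservation(now, "06:30:00", friday_window))
    print(blocked_by_reservation(now, "00:30:00", friday_window))
```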

Special windows

Reservations and JumboFridays:

  • Both NCAR and TeraGrid: JumboFridays are for half-rack or full-rack jobs, which run during the big run window (usually 8-10am on Fridays).
  • NCAR reservations: please email frost-help@ucar.edu with your request as far in advance as possible.
  • TeraGrid reservations: please use the TeraGrid Resource Advance Reservation Form.

Mapping to MPI Processes

Order of assignment

Example: The BGLMPI_MAPPING variable can control the order in which tasks are assigned to nodes.

  • cqsub -e BGLMPI_MAPPING=TXYZ -m vn -n 32 ...

The T coordinate selects the first or second processor in a node. Because the above example uses virtual-node mode, processes are assigned first to the two processors on a node, then along the x dimension, then the y dimension, then the z dimension. The resulting coordinates, written as <x,y,z,t>, are:

Process  Coordinates
0        <0,0,0,0>
1        <0,0,0,1>
2        <1,0,0,0>
3        <1,0,0,1>
4        <2,0,0,0>
5        <2,0,0,1>
6        <3,0,0,0>
7        <3,0,0,1>
8        <0,1,0,0>
9        <0,1,0,1>

The dimensions for partitions of various sizes are:

32     4,4,2,1
64     8,4,2,1
128    8,4,4,1
256    8,4,8,1
512    8,8,8,1
1024   8,8,16,1
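
The TXYZ ordering described above can be reproduced with a few integer divisions. The Python sketch below computes the <x,y,z,t> coordinates of an MPI rank from the x, y, z dimensions in the table; the function name txyz_coords is hypothetical, and the code is an illustration of the ordering rather than the system's own implementation.

```python
# x, y, z dimensions for each partition size, from the table above.
PARTITION_DIMS = {
    32: (4, 4, 2),
    64: (8, 4, 2),
    128: (8, 4, 4),
    256: (8, 4, 8),
    512: (8, 8, 8),
    1024: (8, 8, 16),
}


def txyz_coords(rank, nodes, mode="vn"):
    """Return (x, y, z, t) for an MPI rank under BGLMPI_MAPPING=TXYZ."""
    nx, ny, nz = PARTITION_DIMS[nodes]
    nt = 2 if mode == "vn" else 1  # processors used per node
    if not 0 <= rank < nx * ny * nz * nt:
        raise ValueError("rank does not fit in the partition")
    rest, t = divmod(rank, nt)  # t varies fastest
    rest, x = divmod(rest, nx)  # then x
    z, y = divmod(rest, ny)     # then y, and z varies slowest
    return (x, y, z, t)


if __name__ == "__main__":
    for rank in range(10):
        print(rank, txyz_coords(rank, 32))
```

Run for ranks 0 through 9 on a 32-node partition, this yields the same coordinates as the process table above.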

Mapfile

Example: You can also use a mapfile that defines the coordinates of the torus to which each process is assigned:

  • cqsub -e BGLMPI_MAPPING=/home/joeuser/sample.map -n 32 ...

The mapfile is a text file in which each line gives the x, y, z, t coordinates of one process (t is processor 0 or 1 of the node in virtual-node mode). For example:

0 0 0 0
0 0 2 0
0 2 0 0
0 2 2 0
...

So with this map MPI process 0 is placed on the node at 0,0,0,0, process 1 is at 0,0,2,0, etc. Note that the mapfile must define the coordinates for the full partition where your job is running.
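
Because the mapfile must cover the full partition, it is usually easier to generate than to write by hand. The Python sketch below emits one "x y z t" line per processor with x varying fastest, which is only one possible ordering chosen for illustration. The function names and the output file name sample.map are hypothetical.

```python
from itertools import product


def mapfile_lines(dims, procs_per_node=1):
    """Yield one 'x y z t' line per processor, with x varying fastest."""
    nx, ny, nz = dims
    # product() varies its last argument fastest, so list the axes slowest first.
    for t, z, y, x in product(range(procs_per_node), range(nz), range(ny), range(nx)):
        yield f"{x} {y} {z} {t}"


def write_mapfile(path, dims, procs_per_node=1):
    """Write the generated lines to `path`, one process per line."""
    with open(path, "w") as handle:
        for line in mapfile_lines(dims, procs_per_node):
            handle.write(line + "\n")


if __name__ == "__main__":
    # A 32-node partition is 4 x 4 x 2; use procs_per_node=2 for virtual-node mode.
    write_mapfile("sample.map", (4, 4, 2))
```

Reordering the loop axes, or replacing the generator with your own list of coordinates, changes the placement while keeping the file format the same.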

Check Placement

You can check where your processes are placed by examining the string returned by MPI_Get_processor_name. It will look something like this:

  • Processor <0,0,0,0> in a <4, 4, 2, 2> mesh
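
If your program prints these strings, they can be collected and parsed afterwards. The Python sketch below extracts the coordinates and mesh dimensions with a regular expression. The pattern is inferred from the single example above, so treat the exact string format as an assumption; the function name parse_processor_name is hypothetical.

```python
import re

# Pattern inferred from the one example string in this document.
_FOUR_INTS = r"<\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*>"
_NAME_RE = re.compile(rf"Processor {_FOUR_INTS} in a {_FOUR_INTS} mesh")


def parse_processor_name(name):
    """Return (coordinates, mesh) as two 4-tuples of ints."""
    match = _NAME_RE.search(name)
    if match is None:
        raise ValueError(f"unrecognized processor name: {name!r}")
    values = tuple(int(value) for value in match.groups())
    return values[:4], values[4:]


if __name__ == "__main__":
    print(parse_processor_name("Processor <0,0,0,0> in a <4, 4, 2, 2> mesh"))
```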

If you have questions about this document, please contact CISL Customer Support. You can also reach us by telephone 24 hours a day, seven days a week at 303-497-1278. Additional contact methods: consult1@ucar.edu and during business hours in NCAR Mesa Lab Suite 39.

© Copyright 2004-2007. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.

Address of this page: http://www.cisl.ucar.edu/docs/frost/cbr.jsp