Last updated: Sep 24th 2009
Overview
Bluefire is an IBM clustered Symmetric MultiProcessing (SMP) system based on the Power6 chip. Its hardware taxonomy and software stack are similar to those of blueice, NCAR's former Power5+ system, which bluefire replaced on July 1, 2008.
For the NCAR and University user community, its purpose is to provide high-performance supercomputing for numerically intensive models and applications. It is best suited to running codes that can be run efficiently in parallel.
The programming models are the familiar ones: OpenMP, Message Passing Interface (MPI), and hybrid jobs that use both OpenMP and MPI.
See Appendix A for a comparison of blueice and bluefire processors.
Bluefire software is identical to that of blueice, although versions may differ, and there may be some difference in product names due to the difference in switches.
The Community Computing users and Climate Simulation Lab users share the system. The number of nodes available to each group is flexible, to maximize the system's productivity - meaning, if one group is not using all of its available compute nodes, then members of the other group may use those nodes. LSF determines these splits based on runtime usage. Platform Computing Inc provides the LSF batch system; the manuals are available for browsing at the Platform Knowledge Centre.
Users active on blueice in the past 6 months already have a login. Other users with valid projects may request a bluefire login by sending e-mail to consult1@ucar.edu or contacting CISL Customer Support. Please include the following information with your login request:
To log in, type
If you have a different login on your home system than bluefire, use:
(Mac users may need to use the -Y option with ssh to enable X Window forwarding.) You will be prompted for your one-time password; use your Cryptocard to obtain it. Note: The roy gateway computer is not used for accessing bluefire.
Bluefire node names use the prefix "be." However, because your home
directory has been transferred over from blueice, your home directory
path still shows the "bl" prefix, for example: /blhome/
You may use scp (secure copy) or sftp (secure FTP) to transfer files between bluefire and remote platforms. The transfer may be initiated in either of two ways:
The former security model used for the supers at NCAR permitted you to install public keys from the super to the remote platform, and this allowed you to initiate file transfers from the super that did not require you to type in your passphrase. This method is still possible for transfers initiated from bluefire. See "How to install ssh keys on remote systems."
Batch job file transfer: This setup allows you to transfer files from batch jobs running on bluefire as well. However, batch file transfer jobs (including to the Mass Storage System) must be submitted to the share queue, because scp/sftp are not available on nodes in the other batch queues.
Once the keys are in place you may initiate a file transfer from a remote machine by executing:
Notes:
Bluefire queues have similar names to the previous blueice queues, although there are equivalent queues for the regular memory (64 GB) and large memory (128 GB) nodes. There are 69 regular memory nodes and 48 large memory nodes on the full system. The queue structure is described here:
| Queue Name | Queue Charging Factor | Run Limit | Memory Limit |
|---|---|---|---|
| capability (by special permission only) | 1 | 12 hours | 64GB per node |
| debug | 1 | 6 hours | 64GB per node |
| dedicated | 1 | 6 hours | 64GB per node |
| economy | 0.5 | 6 hours | 64GB per node |
| hold | 0.33 | 6 hours | 64GB per node |
| lrg_capability | 1 | 12 hours | 128GB per node |
| lrg_economy | 0.5 | 6 hours | 128GB per node |
| lrg_hold | 0.33 | 6 hours | 128GB per node |
| lrg_premium | 1.5 | 6 hours | 128GB per node |
| lrg_regular | 1 | 6 hours | 128GB per node |
| lrg_standby | 0.1 | 6 hours | 128GB per node |
| premium | 1.5 | 6 hours | 64GB per node |
| regular | 1 | 6 hours | 64GB per node |
| share (2 nodes available) | 1 | 12 hours | 256GB per node (shared) |
| special | 1 | 6 hours | 64GB per node |
| standby | 0.1 | 6 hours | 64GB per node |
Charging began immediately for bluefire usage as soon as it became
available. The following formula specifies how your computing account
is charged for running jobs on bluefire:
GAUs charged = wallclock hours used * number of nodes used * number of processors in that node * computer factor * queue charging factor
The "number of processors in that node" is 32 for all queues on bluefire except the debug and share queues.
The "computer factor" is a multiplier that equalizes the way GAUs are consumed on different computing platforms. Faster computers have higher computer factors. The computer charging factor for bluefire is 1.4.
The "queue charging factor" is a multiplier that reflects the priority given to jobs in a queue: higher-priority jobs are charged more.
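As a worked illustration of the charging formula, the sketch below computes the charge for a hypothetical job. The helper name `gau_charge` and the job size (2 wallclock hours on 4 nodes) are assumptions made up for this example; the 32 processors per node, the computer factor of 1.4, and the queue charging factors are the values given in this document.

```shell
#!/bin/sh
# Sketch only: gau_charge is a hypothetical helper, not a CISL-provided tool.
# Usage: gau_charge HOURS NODES PROCS_PER_NODE COMPUTER_FACTOR QUEUE_FACTOR
gau_charge() {
    awk -v h="$1" -v n="$2" -v p="$3" -v c="$4" -v q="$5" \
        'BEGIN { printf "%.2f\n", h * n * p * c * q }'
}

# Hypothetical job: 2 wallclock hours on 4 nodes in the regular queue (factor 1).
gau_charge 2 4 32 1.4 1      # prints 358.40

# The same job in the economy queue (factor 0.5) costs half as much.
gau_charge 2 4 32 1.4 0.5    # prints 179.20
```

Because every term is a simple multiplier, halving the queue charging factor halves the charge for an otherwise identical job.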
You can check your GAU charges via the CISL user portal at:
Log in using your userid and one-time password from your CryptoCard.
If you are a new portal user, you will need to set up a GAU tab. Go to the "Manage Tabs" tab, select "GAU" as a tab to display, and "Save". A tab labelled "GAU" should now be available where you can select reporting options of your resource usage by date and job number. See the online help document for details.
Note that the charges are only posted the day following a run.
Note: The charging formula described in the CISL Portal gives different names to these variables, and it does not make a distinction between dedicated-node charging and shared-node charging. The following table helps prevent confusion caused by the terminology used in the CISL Portal:
| Charging formula | CISL Portal terminology |
|---|---|
| GAUs charged | GAU calculation |
| wallclock hours used | wallclock hours |
| number of nodes used * number of processors in that node | CPUs reserved. Note: CPUs (processors) must be reserved in even multiples of the processors in a node unless your job runs in the share queue (see share queue formula above). |
| computer factor | system multiplier |
| queue charging factor | queue multiplier |
When you understand the different terminology used for the portal, you can see that both charging formulas are equivalent.
Jobs from NCAR divisions or CSL proposal groups that have exceeded either the 30-day or 90-day usage limits* will be placed in the hold queue and run at a priority below jobs in the economy queue. Affected jobs will be charged at 1/3 the rate they would have been charged if they had been run in a regular queue ("rg").
Jobs from NCAR divisions or CSL proposal groups that have exceeded both the 30-day and 90-day usage limits* will be rejected, and users will receive an email suggesting that they submit their jobs to a standby queue. Note that standby queue time limits are three hours, so users may need to change their job's time limit before resubmitting to a standby queue.
Continuing users will have the same shell and environment on bluefire as on blueice. If you are a new user, you will be given Korn shell as the default. If you need to change your shell, you may do so by logging in to bluefire and then rsh'ing to bems. Follow the prompts to change your shell. The change may take up to 60 minutes to propagate.
At present, quotas are the same as they were for blueice. They may be increased in the future.
Home directory: Each user is assigned 5 GB of disk space in their home directory, /blhome/logon_id. The files in this directory are backed up.
/tmp directory: Please do not write to /tmp. Instead, use the /ptmp directory for scratch purposes, as discussed immediately below. /tmp disk space is required by the OS, is very limited in size, and causes system problems when swamped.
/ptmp scratch directory: Each user is allocated up to 400 GB of scratch space in their /ptmp directory, /ptmp/logon_id. Users are encouraged to use this scratch space, but we emphasize that this filesystem is considered temporary and will be scrubbed when the overall /ptmp space is nearly full (typically 85% used). An automatic scrubber will delete the least recently accessed files until the filesystem is below 85% full. Because of the enormous size of ptmp, your ptmp files are not backed up. This means that when they are scrubbed, they are gone forever unless you have copied them to the NCAR Mass Storage System, your home directory, or some other archival storage. We encourage your vigilance in backing up critical files.
Checking home and ptmp usage: You may check your home and ptmp quotas and usage with the command /usr/local/bin/spquota. You may check the overall usage of /ptmp with the command /usr/local/bin/df -h /ptmp. This last command shows the percentage of overall ptmp usage, which helps you anticipate the system scrubbing of ptmp files.
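The overall-usage check can be scripted. The sketch below is one possible approach, not a CISL tool: the function name `scrub_check` is hypothetical, the sample df output is invented for illustration, and the parsing assumes the usage percentage appears as a field such as `87%` on the last line of the df output.

```shell
#!/bin/sh
# Sketch only: scrub_check is a hypothetical helper.
# Reads df-style output on stdin and compares the first "NN%" field on the
# last line against a threshold (default 85, the scrub level cited above).
scrub_check() {
    awk -v limit="${1:-85}" '
        { line = $0 }                      # remember the last line seen
        END {
            n = split(line, f, " ")
            for (i = 1; i <= n; i++)
                if (f[i] ~ /^[0-9]+%$/) {
                    pct = f[i] + 0         # "87%" converts to 87
                    if (pct >= limit) print "scrub likely: " pct "% used"
                    else              print "ok: " pct "% used"
                    exit
                }
        }'
}

# Demonstration with invented sample output (not real bluefire numbers):
printf '%s\n' 'Filesystem Size Used Avail Use% Mounted on' \
              '/dev/ptmp  100T  87T   13T  87% /ptmp' | scrub_check 85
```

On bluefire the function would be fed real data, for example `/usr/local/bin/df -h /ptmp | scrub_check 85`, assuming that command's output has the layout described.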
Divisional file systems: CISL provides file space for UCAR users that can be used to supplement user home directories and /ptmp space. Space on these divisional file systems is provided on the basis of requests made to designated divisional representatives who implement their own policies regarding space and quota allocations. These divisional file systems are not backed up by CISL, nor are they scrubbed. To acquire space on the file system provided for your division, contact the proper divisional representative (http://www.cisl.ucar.edu/docs/internal/divisionreps.html) listed under the password-protected link in this sentence.
/fis file system: CISL provides file space for some users and projects under the /fis filesystem. You may see which divisions and projects have space in /fis by executing the Unix "df" command to list disk free space. Please speak to your project manager for clarification on /fis usage relative to your projects. The /fis filesystem is regarded as high-availability, high-reliability GPFS disk space. These files are backed up.
Load Sharing Facility (LSF) is the basic scheduling system on bluefire.
To get the best performance from your jobs, we recommend what we refer to as the "Big 3": SMT, large pages, and processor binding.
Simultaneous Multi-Threading (SMT) is a feature that became available under AIX 5.3 for Power5- and Power6-based systems. To use SMT, no source code changes are required in your application's Fortran, C, or C++ code, but we recommend some simple modifications to your job scripts, described in the paragraphs immediately below. By making these changes, you may be able to boost performance by 20% or more on some applications.
Under SMT, the Power6 doubles the number of active threads on a processor by implementing a second, on-board "virtual" processor that is enabled by the CPU architecture. The basic concept of SMT is that no single process uses all processor execution units at the same time, so a second thread can utilize unused cycles.
Bluefire has 32 cpus/cores per node. Since the nodes have Simultaneous Multi-Threading (SMT) enabled, it will appear that there are 64 virtual cpus in each node. Typically 64 tasks/threads per node is most efficient; however, we recommend that you compare performance for your application for 32 and 64 tasks per node.
To take advantage of SMT on bluefire, double the value of the ptile parameter, i.e. ptile=64 instead of ptile=32. This establishes 64 virtual processors on the bluefire node, instead of just 32 physical processors.
An MPI-only non-SMT job that is submitted to run on 4 32-way nodes (that is, -n 128 and ptile=32) can be modified to utilize SMT on 2 32-way nodes by specifying -n 128 and ptile=64 or can continue to use 4 nodes and take advantage of SMT by specifying -n 256 and ptile=64, assuming the job scales up. The latter method might also be preferable if wallclock time is the primary consideration.
The relative benefit of each of these approaches can then be examined by comparing LSF's report of "Resource usage summary" that is included in the file specified by the -o bsub option.
A non-SMT job that runs 32 MPI tasks across 4 32-way bluefire nodes with each MPI task spawning 4 OpenMP threads would specify -n 32, ptile=8 and OMP_NUM_THREADS=4. The same job can be run with SMT by keeping -n 32 and OMP_NUM_THREADS=4 but switching to ptile=16; it would then use half the number of 32-way nodes. Alternatively, to keep the node count the same (4 nodes), specify -n 64, ptile=16, and OMP_NUM_THREADS=4.
Note for hybrid jobs: Under AIX 5.3, there is a known defect that causes performance problems in hybrid applications when the application reads stdin as redirected from a file, e.g., cam < namelist. The workaround is to set MP_STDINMODE=0 in the environment. This may be important for getting best performance under SMT.
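The arithmetic behind the MPI-only and hybrid configurations above can be checked with a small helper. This is a sketch only: the function name `job_layout` is hypothetical, and it simply assumes that the number of nodes is the task count divided by ptile (rounded up) and that each node runs ptile times OMP_NUM_THREADS threads.

```shell
#!/bin/sh
# Sketch only: job_layout is a hypothetical helper, not part of LSF.
# Usage: job_layout TASKS PTILE [OMP_NUM_THREADS]   (threads default to 1)
job_layout() {
    tasks=$1
    ptile=$2
    threads=${3:-1}
    nodes=$(( (tasks + ptile - 1) / ptile ))   # round up to whole nodes
    per_node=$(( ptile * threads ))            # threads running on each node
    echo "nodes=$nodes threads_per_node=$per_node"
}

# MPI-only cases from the text:
job_layout 128 32      # nodes=4 threads_per_node=32  (no SMT)
job_layout 128 64      # nodes=2 threads_per_node=64  (SMT, half the nodes)
job_layout 256 64      # nodes=4 threads_per_node=64  (SMT, same node count)

# Hybrid cases from the text (4 OpenMP threads per MPI task):
job_layout 32 8 4      # nodes=4 threads_per_node=32  (no SMT)
job_layout 32 16 4     # nodes=2 threads_per_node=64  (SMT, half the nodes)
job_layout 64 16 4     # nodes=4 threads_per_node=64  (SMT, same node count)
```

A layout uses SMT fully when threads_per_node reaches 64 and runs without SMT when it is 32.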
Examples of job scripts using SMT are on bluefire under the /usr/local/examples/lsf_batch directory.
A pure OpenMP job is usually submitted with -n 1 and ptile=1, with the environment variable OMP_NUM_THREADS set to the requested number of OpenMP threads (usually 32). Simply setting this environment variable to 64 will exploit SMT for pure OpenMP jobs that scale up to 64 threads. For more control, we recommend setting the XLSMPOPTS environment variable as follows:
# for ksh and bash
export XLSMPOPTS="startproc=0:stride=n:stack=128000000"
or
# for csh
setenv XLSMPOPTS "startproc=0:stride=n:stack=128000000"
To maximize performance, n should be the largest stride that still allows the requested number of threads to run on the 64 available virtual processors. For example, for OMP_NUM_THREADS=64 the stride must be 1; for OMP_NUM_THREADS=32 it should be 2; for OMP_NUM_THREADS=16 it should be 4; and so on.
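The stride rule above amounts to dividing 64 by the thread count. The following sketch (for ksh or bash) derives the stride and builds the XLSMPOPTS setting from it. The helper name `omp_stride` is hypothetical, and it assumes a thread count between 1 and 64; for counts that do not divide 64 evenly it rounds the stride down.

```shell
#!/bin/sh
# Sketch only: omp_stride is a hypothetical helper.
# Largest stride that still fits THREADS onto 64 virtual processors.
omp_stride() {
    echo $(( 64 / $1 ))
}

OMP_NUM_THREADS=16
export OMP_NUM_THREADS

XLSMPOPTS="startproc=0:stride=$(omp_stride "$OMP_NUM_THREADS"):stack=128000000"
export XLSMPOPTS

echo "$XLSMPOPTS"    # prints startproc=0:stride=4:stack=128000000
```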
To run an MPMD program such as CCSM using SMT, an MPI job with 80 tasks can fit on two bluefire nodes instead of four with just the simple changes below. (In this case, each of the 16 atm tasks has 4 threads, so a total of 128 processors is used.)
#BSUB -R "span[ptile=64]" #bluefire default without SMT is 32
#BSUB -n 80 # number of tasks
Old task geometry (uses 4 nodes):
export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,\
21,22,23,24,25,26,27,28,29,30,31) (32,33,34,35,36,37,38,39,\
40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61)\
(62,63,64,65,66,67,68,69)(70,71,72,73,74,75,76,77,78,79)}"
Note: A backslash (\) at the end of a line indicates that the line continues.
New task geometry (uses 2 nodes):
export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,\
21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,\
40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63) \
(64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79)}"
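Typing long task lists by hand is error-prone, so the geometry string can also be generated. The sketch below is one way to do that for ksh or bash: the function name `task_geometry` is hypothetical, it assumes task IDs are assigned consecutively from 0, and it writes the groups without spaces between them.

```shell
#!/bin/sh
# Sketch only: task_geometry is a hypothetical helper, not part of LSF.
# Usage: task_geometry COUNT [COUNT ...]
# Each COUNT is the number of consecutive task IDs placed in one group.
task_geometry() {
    awk -v counts="$*" 'BEGIN {
        n = split(counts, c, " ")
        id = 0
        out = "{"
        for (i = 1; i <= n; i++) {
            out = out "("
            for (j = 0; j < c[i] + 0; j++) {
                out = out (j ? "," : "") id    # comma before all but the first ID
                id++
            }
            out = out ")"
        }
        print out "}"
    }'
}

# Small demonstration: two groups holding 3 and 2 tasks.
task_geometry 3 2    # prints {(0,1,2)(3,4)}

# The two-node SMT layout above (64 tasks, then 16 tasks) would be:
LSB_PJL_TASK_GEOMETRY="$(task_geometry 64 16)"
export LSB_PJL_TASK_GEOMETRY
```

The four-node layout shown earlier corresponds to `task_geometry 32 30 8 10`.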
SMT should aid in getting better throughput of your jobs and better performance for your GAU charges. Below are suggestions for testing whether SMT usage will benefit your applications:
Instructions for using SMT with the Community Climate System Model (CCSM) run scripts are given in the document "Taking advantage of Simultaneous Multi-Threading on bluevista when running CCSM" (Note: You will need to adjust for bluefire's larger nodes).
The default page size is 4 KB. On POWER6 systems, AIX 5L Version 5.3 supports a new 64-KB page size when running the 64-bit kernel. 64-KB pages are intended to be general-purpose. They are easy to use, and it is expected that many applications will see performance benefits when using 64-KB pages rather than 4-KB pages. IBM has reported performance improvements on a variety of workloads ranging from 1% to 13% when compared to the default 4-KB pages.
A user can specify a different page size to use for each of the three regions of a process's address space (data, stack, and text). The ldedit command may be used to set these page size options in an existing executable:
ldedit -btextpsize=64K -bdatapsize=64K -bstackpsize=64K a.out
A user can also set a process's preferred page sizes via the LDR_CNTRL environment variable. The following example will cause a.out to use 64-KB pages for its data, text, and stack:
Korn shell:
export LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K
This will override any page size settings in an executable's XCOFF header.
Caveat: Using 64-KB pages rather than 4 KB pages for a multithreaded process's data may reduce the maximum number of threads a process can create due to alignment requirements for stack guard pages. If you encounter this limit, you may disable stack guard pages by setting the environment variable AIXTHREAD_GUARDPAGES to 0.
AIX 5.3 also supports large pages (16 MB) and "huge" pages (16 GB). However, these must be configured by the system administrator, and the system must then be rebooted. Users must be specifically authorized to use large pages. For further information on special requests for use of large pages, contact the CISL Consulting Office by any of the methods listed under CISL Customer Support.
Further details are discussed in the IBM Whitepaper, "Guide to Multiple Page Size Support on AIX 5L Version 5.3".
We highly recommend using processor binding for all parallel jobs. If you have been using the bindproc.x script in /contrib on blueice, this has been replaced with an IBM-provided launch script on bluefire. To use it, set (Korn or Bourne shell syntax):
You may provide a comma (,) separated list of cpu-ids.
For hybrid programs, use:
along with your OMP_NUM_THREADS environment variable setting.
Important: All parallel jobs should begin using one of the launch scripts mentioned above with their mpirun.lsf command.
Totalview is available on bluefire under /usr/local/bin. Note that the previous instructions using host lists and poe still work, but it is now easier to use and faster if you use the new usage instructions below.
Debugging parallel programs: All parallel jobs on the IBMs must be run in batch via LSF, including Totalview. To use Totalview, you now specify the "tv" elim with LSF. A host list is no longer required, and Totalview no longer must run in IP, so it runs faster. Once the tv elim is included in your LSF directives, you use mpirun.lsf as usual, rather than invoking the totalview command. An example script is given below for an MPI parallel program.
Before using Totalview, be sure that you have X Window forwarding turned on (for example, by logging on using ssh -Y).
Run this script by submitting it using bsub < script. Wait for the Totalview window to pop up. To begin debugging a parallel program, press "go" in the main window. You will see a dialog box saying that poe is a parallel program and asking if you want to stop the program. Press "yes." The source code will then be displayed so that you can set breakpoints.
#!/usr/bin/csh
#
# LSF batch script to debug an MPI code (cpi)
# under totalview
#
#BSUB -n 2                     # number of total tasks
#BSUB -o mpilsf.out.%J         # output filename (%J to add job id)
#BSUB -e mpilsf.err.%J         # error filename
#BSUB -J mpilsf.test           # job name
#BSUB -q debug                 # queue
#BSUB -W 0:15                  # wallclock time limit
#BSUB -P 12345678              # project number
#BSUB -a tv                    # use totalview elim
mpirun.lsf ./cpi
To debug for memory leaks and other memory problems, it is necessary to link in a TotalView library that replaces the malloc on the system. The following example shows how to link and run a program with memory leaks under the TotalView debugger on bluefire. Note that the debug queue is suitable for one-node debug jobs up to 64 processors. For larger jobs, you will need to use the other batch queues.
#!/usr/bin/csh
#debug.leak - compiles and runs memory debug example
setenv LIBPATH /usr/local/toolworks/tvheap_mr
cat << 'EOF1' > leak.f
      program testit
      implicit none
      integer i, ierror
      do i=1,10
         call loknlod
         print *, 'made it through loknlod ', i
      end do
      stop
      end
      subroutine loknlod
c simulates memory leak by failing to deallocate arrays
      real, allocatable:: foo(:,:)
      real, allocatable:: foo2(:,:)
      integer i,j,ierror
      print *, 'stepped into loknload'
      allocate(foo(100,50),stat=ierror)
      if(ierror /=0) then
         write(*,*)"Error trying to allocate foo"
         stop
      endif
      allocate(foo2(100,100),stat=ierror)
      if(ierror /=0) then
         write(*,*)"Error trying to allocate foo2"
         stop
      endif
      do j=1,50
         do i=1,100
            foo(i,j) = 6 + i
            foo2(i,j) = 100 + i
         end do
      end do
      return
      end subroutine loknlod
'EOF1'
xlf90 -o leak -g -q64 -qfixed leak.f -L/usr/local/toolworks/tvheap_mr \
    -L/usr/local/toolworks/totalview/lib \
    /usr/local/toolworks/totalview/lib/aix_malloctype64_5.o
#Note: Version 8.7 became the default on October 5, 2009.
cat << 'EOF2' > run.leak
#!/usr/bin/csh
#
# LSF batch script to do memory debugging
# under totalview
#
#BSUB -n 2                     # number of total tasks
#BSUB -o leak.lsf.out.%J       # output filename (%J to add job id)
#BSUB -e leak.lsf.err.%J       # error filename
#BSUB -J leak.lsf.test         # job name
#BSUB -q debug                 # queue
#BSUB -W 0:15                  # wallclock time limit
#BSUB -P 12345678              # account number
#BSUB -a tv                    # use Totalview elim
mpirun.lsf ./leak
'EOF2'
bsub < run.leak
#Under Totalview 8.7, when a Startup Parameters window pops
#up, press OK. Go to the Totalview main window and press "Go." (Source code
#may not be visible yet.) When informed the program is
#parallel and asked whether you wish to stop the job, select "Yes."
#Set breakpoint near end of program (click box in front of line 9).
#After running to the breakpoint (Go), select "Open MemoryScape"
#under the Debug menu. In the Memory Debugging Session window,
#choose "Leak Detection Source Report."
Some alternative methods of memory debugging involve using the IBM libhmd library or utilities available in GNU. These methods are briefly discussed at: http://www.cisl.ucar.edu/docs/pdf/LeakDetection.pdf.
There are two easy ways to analyze the memory footprint of your program.
Use the standard AIX real time tools
Type ps or top (with the appropriate arguments) from the command line, to have a snapshot of the current memory usage of all the programs running. You may have to search for your job.
Use the CSG "Job Memory Usage" tool
If you would like to know the total (peak) memory usage of just your job, without continuously monitoring the memory in real time, you can use the job_memusage.exe tool. It will print on stdout the memory usage of your program, when the latter terminates. It is located under /contrib/bin and can be used like:
/contrib/bin/job_memusage.exe your-program [your-arguments]
If you have arguments to pass, you can, and it also works with input redirection, such as "<".
It works interactively (i.e., on the command line) and for OpenMP, MPI, and hybrid programs.
Command line or OpenMP example:
/contrib/bin/job_memusage.exe ./hello_world.exe
MPI and hybrid (usually in an LSF script) example:
export MP_LABELIO=yes # if you use ksh
mpirun.lsf /contrib/bin/job_memusage.exe ./cam < namelist
or:
setenv MP_LABELIO yes # if you use csh
mpirun.lsf /contrib/bin/job_memusage.exe ./cam < namelist
When your job returns, there will be some output. For the command line there will be a single line with the total memory usage of your job. For MPI and hybrid jobs there will be a line for every node on which your program ran, which is why it is useful (though not strictly required) to set the MP_LABELIO environment variable: it identifies each node's output among the others.
The job_memusage.exe tool is compatible with the launch tool described above. Both tools can be used at the same time, as in this example:
export MP_LABELIO=yes # if you use ksh
export TARGET_CPU_RANGE="-1"
mpirun.lsf /usr/local/bin/launch /contrib/bin/job_memusage.exe ./wrf.exe
or:
setenv MP_LABELIO yes # if you use csh
setenv TARGET_CPU_RANGE "-1"
mpirun.lsf /usr/local/bin/launch /contrib/bin/job_memusage.exe ./wrf.exe
Here are four easy ways to time your program.
Use the Unix command "date" in a simple script
With a simple script such as this:
echo "start date:"
date
run-your-executable
echo "end date:"
date
The output of this script provides the start time and the end time of your program; the difference is the wall-clock time your program used.
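To have the script compute the difference itself, one option is to record the time in seconds before and after the run. This is a sketch for ksh or bash that assumes the local `date` command accepts the `%s` format (seconds since 1970), which POSIX does not require; `sleep 1` stands in for your executable.

```shell
#!/bin/sh
# Sketch only: assumes `date +%s` is supported by the local date command.
start=$(date +%s)
sleep 1                          # stand-in for: run-your-executable
end=$(date +%s)
elapsed=$(( end - start ))       # whole wall-clock seconds
echo "wall-clock seconds used: $elapsed"
```

The result has one-second resolution, which suits long batch runs but not short code sections; the timex and MPI timer methods are better there.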
Use the Unix command "timex" from the command line or a one-line script
The output of the command
timex my_program
yields three numbers identified as real, user, and sys:
"real" is the wall-clock time
"user" is the time used by the user program
"sys" is the time the system used to load and unload your program and to perform other services on its behalf.
Use built-in functions specific to C, C++ or Fortran
For C/C++, call the functions clock(), time(...), difftime(...), etc. to get different types of timing.
For Fortran, call the function date_and_time(...)
Use MPI functions specific to C, C++ or Fortran
If you are writing an MPI program, you should use the MPI functions rather than the built-in functions above.
For C/C++, call the functions MPI_Wtime() and MPI_Wtick()
For Fortran, call the functions MPI_WTIME() and MPI_WTICK()
Information on available tools is given here.
We are working to provide new examples; see directory /usr/local/examples on the system.
The "modules" utility is now available on Bluefire for modifying your environment to find alternate compilers and software under /contrib.
To get set up to use modules, you will need to add the following line to your .cshrc file after all other path-setting commands, if you are a C shell user.
source /contrib/Modules/3.2.6/init/csh
Or you need to add the following line to your .profile file, if you are using Korn shell.
. /contrib/Modules/3.2.6/init/ksh
After this setup is executed upon login, you can use the module command. Some sample dotfiles with modules settings can be found under /usr/local/skel.
To see which modules are in force, type
module list
To load a new module (ImageMagick for example), type:
module load ImageMagick-6.5.3-10
To show all available modulefiles, type:
module av
For help, type:
module help
Contact CISL Customer Support, call 303-497-1278, or send email to consult1@ucar.edu.
Appendix A: Comparison of bluefire and blueice processors
| Resource | Bluefire/Power6 | Blueice/Power5+ |
|---|---|---|
| Clock speed | 4.7 GHz | 1.9 GHz |
| Memory/processor | 2-4 GB | 2-4 GB |
| L1 cache | L1 cache is 128 KB (64 KB data + 64 KB instruction) per processor. | L1 cache is 96 KB (32 KB data + 64 KB instruction) per processor. |
| L2 cache | L2 cache is 4 MB per processor on-chip. | L2 cache is 2 MB per processor on-chip. |
| L3 cache | The off-chip L3 cache is 32 MB per two-processor chip, and is shared by the two processors on the chip. L3 cache memory is connected to the chip via an 80-GB-per-second bus. | 36 MB per processor pair, shared by all processors |
| Switch Latency | Infiniband 1.3 µs (peak) | HPS 5.0 µs (peak) |
| Switch Bandwidth | 20 GBps each direction | 1.7 GBps each direction |
| Multiple Functional Units | Main thing is faster clock; 2 floating point units; 3 fixed point units; two load/store units | 2 floating point units; 3 fixed point units; two load/store units |
| Simultaneous Multi-Threading (SMT) | Yes - SMT appears to the OS as multiple CPUs. Threaded applications may take advantage of SMT. To use on bluefire, use double the number of tasks you used on blueice | Same. |