Bluefire Quick Start Guide

Last updated: Sep 24th 2009

Overview
Recommended use
Hardware
Software stack
Job scheduling
How to get an account
Logging in
File transfer
Queues and charging
Shells and environment
Disk space
Load Sharing Facility (LSF)
Running jobs
Using Simultaneous Multi-Threading (SMT)
Multiple page size support
Processor binding is mandatory
Totalview debugger
Memory analyzer
Program timers
Performance analyzers
Examples
Modules utility
Getting help
Appendix A. Comparison of bluefire and blueice chips

Overview

Bluefire is an IBM clustered Symmetric MultiProcessing (SMP) system based on the Power6 chip. Its hardware taxonomy and software stack are similar to those of blueice, NCAR's former Power5+ system, which bluefire replaced on July 1, 2008.

Recommended use

For the NCAR and University user community, its purpose is to provide high-performance supercomputing for numerically intensive models and applications. It is best suited to running codes that can be run efficiently in parallel.

The programming models are the familiar ones: OpenMP, Message Passing Interface (MPI), and hybrid jobs that use both OpenMP and MPI.

Hardware (full system)

See Appendix A for a comparison of blueice and bluefire processors.

Software stack

Bluefire's software is the same as blueice's, although versions may differ, and some product names may differ because the two systems use different switches.

Job scheduling

The Community Computing users and Climate Simulation Lab users share the system. The number of nodes available to each group is flexible, to maximize the system's productivity: if one group is not using all of its available compute nodes, members of the other group may use those nodes. LSF determines these splits based on runtime usage. Platform Computing Inc. provides the LSF batch system; the manuals are available for browsing at the Platform Knowledge Centre.

How to get an account

Users active on blueice in the past 6 months already have a login. Other users with valid projects may request a bluefire login by sending e-mail to consult1@ucar.edu or contacting CISL Customer Support. Please include the following information with your login request:

Logging in

To log in, type

If you have a different login on your home system than bluefire, use:

or

(Mac users may need to use the -Y option with ssh to enable X Window forwarding.) You will be prompted for your one-time password; use your Cryptocard to obtain it. Note: The roy gateway computer is not used for accessing bluefire.

Bluefire node names use the prefix "be." However, because your home directory has been transferred over from blueice, your home directory path still shows the "bl" prefix, for example: /blhome/.

File transfer

You may use scp (secure copy) or sftp (secure FTP) to transfer files between bluefire and remote platforms. The transfer may be initiated in either of two ways:

  1. initiated from bluefire, if the target machine accepts incoming ssh sessions from bluefire.

    The former security model used for the supers at NCAR permitted you to install public keys from the super to the remote platform, and this allowed you to initiate file transfers from the super that did not require you to type in your passphrase. This method is still possible for transfers initiated from bluefire. See "How to install ssh keys on remote systems."

    Batch job file transfer: This setup allows you to transfer files from batch jobs running on bluefire as well. However, batch file transfer jobs (including to the Mass Storage System) must be submitted to the share queue, because scp/sftp are not available on nodes in the other batch queues.

  2. initiated from a remote machine. This method (Unix/Linux only) requires you to install the remote machine's public key on bluefire. Please follow these steps to install the keys:

    1. Go to the .ssh directory under your home directory on your workstation or the remote machine from which you plan to initiate a file transfer to bluefire. Use "cat" or "more" to display the contents of your id_rsa.pub file, for example:
        % more id_rsa.pub
    2. In another window, log in to bluefire using your CryptoCard.
    3. Execute the utility to store keys:

        % /usr/local/bin/bluefire_scp_setup
    4. When prompted, use your mouse to copy your public keys from your workstation window and paste them into the bluefire window. You may need to press <return>.

    Note: Please allow up to 60 minutes for the keys to be updated before attempting key-based file transfer.

    Once the keys are in place you may initiate a file transfer from a remote machine by executing:

    or

    Notes:

    Queues and charging

    Bluefire queues have similar names to the previous blueice queues, although there are equivalent queues for the regular memory (64 GB) and large memory (128 GB) nodes. There are 69 regular memory nodes and 48 large memory nodes on the full system. The queue structure is described here:

    Queue Name                                 Queue Charging Factor   Run Limit   Memory Limit
    capability (by special permission only)    1                       12 hours    64GB per node
    debug                                      1                       6 hours     64GB per node
    dedicated                                  1                       6 hours     64GB per node
    economy                                    0.5                     6 hours     64GB per node
    hold                                       0.33                    6 hours     64GB per node
    lrg_capability                             1                       12 hours    128GB per node
    lrg_economy                                0.5                     6 hours     128GB per node
    lrg_hold                                   0.33                    6 hours     128GB per node
    lrg_premium                                1.5                     6 hours     128GB per node
    lrg_regular                                1                       6 hours     128GB per node
    lrg_standby                                0.1                     6 hours     128GB per node
    premium                                    1.5                     6 hours     64GB per node
    regular                                    1                       6 hours     64GB per node
    share (2 nodes available)                  1                       12 hours    256GB per node (shared)
    special                                    1                       6 hours     64GB per node
    standby                                    0.1                     6 hours     64GB per node

    Charging for bluefire usage began as soon as the system became available. The following formula specifies how your computing account is charged for running jobs on bluefire:
    GAUs charged = wallclock hours used * number of nodes used * number of processors in that node * computer factor * queue charging factor

    The "number of processors in that node" is 32 for all queues on bluefire except for the debug and share queues.

    The "computer factor" is a multiplier that equalizes the way GAUs are consumed on different computing platforms. Faster computers have higher computer factors. The computer charging factor for bluefire is 1.4.

    The "queue charging factor" is a multiplier that reflects the priority given to jobs in a queue: higher-priority jobs are charged more.
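As a rough illustration of the charging formula above, the following sketch multiplies the five factors with awk. The function name gau_charge and the sample job (2 wallclock hours on 4 nodes) are assumptions made for this example; only the formula, the 32 processors per node, the computer factor of 1.4, and the queue charging factors come from this guide.

```shell
# Sketch: estimate the GAUs charged for one job from the formula above.
# gau_charge is a hypothetical helper name, not a CISL utility.
# Arguments: wallclock hours, nodes used, processors per node,
#            computer factor, queue charging factor.
gau_charge() {
    awk -v hours="$1" -v nodes="$2" -v procs="$3" \
        -v computer="$4" -v queue="$5" \
        'BEGIN { printf "%.2f\n", hours * nodes * procs * computer * queue }'
}

# A 2-hour job on 4 nodes (32 processors each), computer factor 1.4:
gau_charge 2 4 32 1.4 1      # regular queue (factor 1):   prints 358.40
gau_charge 2 4 32 1.4 0.5    # economy queue (factor 0.5): prints 179.20
```

The same job costs half as many GAUs in the economy queue, which is the trade-off the queue charging factor is meant to express.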

    Using the CISL Portal to check charges

    You can check your GAU charges via the CISL user portal at:

    Log in using your userid and one-time password from your CryptoCard.

    If you are a new portal user, you will need to set up a GAU tab. Go to the "Manage Tabs" tab, select "GAU" as a tab to display, and "Save". A tab labelled "GAU" should now be available where you can select reporting options of your resource usage by date and job number. See the online help document for details.

    Note that the charges are only posted the day following a run.

    Note: The charging formula described in the CISL Portal gives different names to these variables, and it does not make a distinction between dedicated-node charging and shared-node charging. The following table helps prevent confusion caused by the terminology used in the CISL Portal:

    Charging formula                                           CISL Portal terminology
    GAUs charged                                               GAU calculation
    wallclock hours used                                       wallclock hours
    number of nodes used * number of processors in that node   CPUs reserved
    computer factor                                            system multiplier
    queue charging factor                                      queue multiplier

    Note: CPUs (processors) must be reserved in even multiples of the processors in a node unless your job runs in the share queue (see share queue formula above).

    When you understand the different terminology used for the portal, you can see that both charging formulas are equivalent.

    Exceeding allocation threshold limits*

    Jobs from NCAR divisions or CSL proposal groups that have exceeded either the 30-day or 90-day usage limits* will be placed in the hold queue and run at a priority below jobs in the economy queue. Affected jobs will be charged at 1/3 the rate they would have been charged if they had been run in a regular queue ("rg").

    Jobs from NCAR divisions or CSL proposal groups that have exceeded both the 30-day and 90-day usage limits* will be rejected, and users will receive an email suggesting that they submit their jobs to a standby queue. Note that standby queue time limits are three hours, so users may need to change their job's time limit before resubmitting to a standby queue.

    Shells and environment

    Continuing users will have the same shell and environment on bluefire as on blueice. If you are a new user, you will be given Korn shell as the default. If you need to change your shell, you may do so by logging in to bluefire and then rsh'ing to bems. Follow the prompts to change your shell. The change may take up to 60 minutes to propagate.

    At present, quotas are the same as they were for blueice. They may be increased in the future.

    Disk space

    Home directory: Each user is assigned 5 GB of disk space in their home directory, /blhome/logon_id. The files in this directory are backed up.

    /tmp directory: Please do not write to /tmp. Instead, use the /ptmp directory for scratch purposes, as discussed immediately below. The OS itself needs /tmp, which is very limited in size, and filling it causes system problems.

    /ptmp scratch directory: Each user is allocated up to 400 GB of scratch space in their /ptmp directory, /ptmp/logon_id. Users are encouraged to use this scratch space, but we emphasize that this filesystem is considered temporary and will be scrubbed when the overall /ptmp space is nearly full (typically 85% used). An automatic scrubber will delete the least recently accessed files until the filesystem is below 85% full. Because of the enormous size of ptmp, your ptmp files are not backed up. This means that when they are scrubbed, they are gone forever unless you have copied them to the NCAR Mass Storage System, your home directory, or some other archival storage. We encourage your vigilance in backing up critical files.

    Checking home and ptmp usage: You may check your home and ptmp quotas and usage by running /usr/local/bin/spquota. You may check the overall usage of /ptmp by running /usr/local/bin/df -h /ptmp. The percentage of overall ptmp usage that this command reports will help you anticipate the system scrubbing of ptmp files.
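The check against the 85% scrubbing level can be scripted. The sketch below assumes df output in which the percent-used value is the fifth field (as in POSIX "df -P" format); the helper names pct_used and check_scratch, the SCRUB_THRESHOLD variable, and the sample df line are all made up for this example.

```shell
# Sketch: read df output on stdin and print the capacity (percent used)
# of the last filesystem listed, without the "%" sign.
# Assumes the percent-used value is the fifth field, as in "df -P".
pct_used() {
    awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 } END { print pct }'
}

SCRUB_THRESHOLD=85   # scrubbing typically starts near 85% full

# Print a warning when usage is at or above the threshold.
check_scratch() {
    pct=$(pct_used)
    if [ "$pct" -ge "$SCRUB_THRESHOLD" ]; then
        echo "WARNING: ${pct}% used; scrubbing is likely"
    else
        echo "OK: ${pct}% used"
    fi
}

# On bluefire this might be run as:  df -P /ptmp | check_scratch
# Here a canned sample (not real output) stands in for df.
printf '%s\n' \
  'Filesystem 1024-blocks Used Available Capacity Mounted on' \
  '/dev/ptmp 1000 870 130 87% /ptmp' | check_scratch
```

With the sample line above, the script prints a warning because 87% is above the threshold.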

    Divisional file systems: CISL provides file space for UCAR users that can be used to supplement user home directories and /ptmp space. Space on these divisional file systems is provided on the basis of requests made to designated divisional representatives who implement their own policies regarding space and quota allocations. These divisional file systems are not backed up by CISL, nor are they scrubbed. To acquire space on the file system provided for your division, contact your divisional representative, listed on the password-protected page http://www.cisl.ucar.edu/docs/internal/divisionreps.html.

    /fis file system: CISL provides file space for some users and projects under the /fis filesystem. You may see which divisions and projects have space in /fis by executing the Unix "df" command to list disk free space. Please speak to your project manager for clarification on /fis usage relative to your projects. The /fis filesystem is regarded as high-availability, high-reliability GPFS disk space. These files are backed up.

    Running jobs

    Load Sharing Facility (LSF) is the basic scheduling system on bluefire.

    To get the best performance from your jobs, we recommend what we refer to as the "Big 3": SMT, large pages, and processor binding.

    Using Simultaneous Multi-Threading (SMT)

    Simultaneous Multi-Threading (SMT) is a feature that became available under AIX 5.3 for Power5- and Power6-based systems. To use SMT, no source code changes are required in your application's Fortran, C, or C++ code, but we recommend some simple modifications to your job scripts, described in the paragraphs immediately below. By making these changes, you may be able to boost performance by 20% or more on some applications.

    Under SMT, the Power 6 doubles the number of active threads on a processor by implementing a second, on-board "virtual" processor that is enabled by the CPU architecture. The basic concept of SMT is that no single process uses all processor execution units at the same time, so a second thread can utilize unused cycles.

    Bluefire has 32 cpus/cores per node. Since the nodes have Simultaneous Multi-Threading (SMT) enabled, it will appear that there are 64 virtual cpus in each node. Typically 64 tasks/threads per node is most efficient; however, we recommend that you compare performance for your application for 32 and 64 tasks per node.

    Pure MPI jobs

    To take advantage of SMT on bluefire, double the value of the ptile parameter, i.e. ptile=64 instead of ptile=32. This establishes 64 virtual processors on the bluefire node, instead of just 32 physical processors.

    An MPI-only non-SMT job that is submitted to run on 4 32-way nodes (that is, -n 128 and ptile=32) can be modified to utilize SMT on 2 32-way nodes by specifying -n 128 and ptile=64 or can continue to use 4 nodes and take advantage of SMT by specifying -n 256 and ptile=64, assuming the job scales up. The latter method might also be preferable if wallclock time is the primary consideration.

    The relative benefit of each of these approaches can then be examined by comparing LSF's report of "Resource usage summary" that is included in the file specified by the -o bsub option.
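The node counts quoted for the pure MPI cases follow from dividing the task count (-n) by ptile and rounding up. A minimal sketch, in which nodes_needed is a hypothetical helper name:

```shell
# Sketch: number of nodes a job occupies for a given total task
# count (-n) and tasks per node (ptile).
# nodes_needed is a hypothetical helper, not an LSF command.
nodes_needed() {
    ntasks=$1
    ptile=$2
    # Round up so that a partly filled node still counts as a whole node.
    echo $(( (ntasks + ptile - 1) / ptile ))
}

nodes_needed 128 32   # non-SMT:                  prints 4
nodes_needed 128 64   # SMT, same task count:     prints 2
nodes_needed 256 64   # SMT, doubled task count:  prints 4
```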

    Hybrid jobs

    A non-SMT job that runs 32 MPI tasks across 4 32-way bluefire nodes, with each MPI task spawning 4 OpenMP threads, would specify -n 32, ptile=8, and OMP_NUM_THREADS=4. The same job can be run with SMT by keeping -n 32 and OMP_NUM_THREADS=4 but switching to ptile=16; it would then use half the number of 32-way nodes. Alternatively, to keep the node count the same (4 nodes), specify -n 64, ptile=16, and OMP_NUM_THREADS=4.

    Note for hybrid jobs: Under AIX 5.3, there is a known defect that causes performance problems in hybrid applications when the application reads stdin as redirected from a file, e.g., cam < namelist. The workaround is to set MP_STDINMODE=0 in the environment. This may be important for getting best performance under SMT.
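For the hybrid layouts described above, it can help to check how many nodes a setting occupies and how many threads land on each node (32 fills the physical processors, 64 fills the SMT virtual processors). The sketch below does that arithmetic; hybrid_layout is a hypothetical helper name.

```shell
# Sketch: for a hybrid MPI/OpenMP job, report the nodes used and the
# number of threads placed on each node.
# hybrid_layout is a hypothetical helper, not an LSF command.
# Arguments: total MPI tasks (-n), tasks per node (ptile), OMP_NUM_THREADS.
hybrid_layout() {
    ntasks=$1
    ptile=$2
    threads=$3
    nodes=$(( (ntasks + ptile - 1) / ptile ))   # round up to whole nodes
    per_node=$(( ptile * threads ))             # threads running on one node
    echo "nodes=$nodes threads_per_node=$per_node"
}

hybrid_layout 32 8 4    # non-SMT:            nodes=4 threads_per_node=32
hybrid_layout 32 16 4   # SMT, half the nodes: nodes=2 threads_per_node=64
hybrid_layout 64 16 4   # SMT, same nodes:     nodes=4 threads_per_node=64
```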

    Examples of job scripts using SMT are on bluefire under the /usr/local/examples/lsf_batch directory.

    Pure OpenMP jobs

    A pure OpenMP job is usually submitted with -n 1 and ptile=1, with the environment variable OMP_NUM_THREADS set to the requested number of OpenMP threads (usually 32). Simply setting this environment variable to 64 will exploit SMT for pure OpenMP jobs that scale up to 64 threads. For more control, we recommend setting the XLSMPOPTS environment variable as follows:

    # for ksh and bash
    export XLSMPOPTS="startproc=0:stride=n:stack=128000000"
    or
    # for csh
    setenv XLSMPOPTS "startproc=0:stride=n:stack=128000000"

    To maximize performance, n should be the largest stride that still allows the requested number of threads to run on the 64 available virtual processors. For example, for OMP_NUM_THREADS=64 the stride must be 1; for OMP_NUM_THREADS=32 it should be 2; for OMP_NUM_THREADS=16 it should be 4; and so on.
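The stride values listed above all equal 64 divided by the thread count. The following sketch derives the stride that way and builds the XLSMPOPTS string; it assumes a thread count that divides 64 evenly, and the helper names smt_stride and xlsmpopts_for are invented for this example.

```shell
# Sketch: derive the XLSMPOPTS stride from the OpenMP thread count,
# assuming 64 virtual processors per node and a thread count that
# divides 64 evenly. Both helper names are hypothetical.
smt_stride() {
    threads=$1
    logical_cpus=${2:-64}
    echo $(( logical_cpus / threads ))
}

# Build the XLSMPOPTS value shown above for a given thread count.
xlsmpopts_for() {
    echo "startproc=0:stride=$(smt_stride "$1"):stack=128000000"
}

OMP_NUM_THREADS=32
XLSMPOPTS=$(xlsmpopts_for "$OMP_NUM_THREADS")
export OMP_NUM_THREADS XLSMPOPTS
echo "$XLSMPOPTS"    # prints startproc=0:stride=2:stack=128000000
```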

    MPMD jobs

    To run an MPMD program such as CCSM using SMT, an MPI job with 80 tasks can fit on two bluefire nodes instead of four with just the simple changes below. (In this case, each of the 16 atm tasks has 4 threads, so a total of 128 processors is used.)

    1. Modify ptile setting (maximum number of tasks per node) in LSF:
            #BSUB -R "span[ptile=64]"    #bluefire default without SMT is 32
      
    2. The number of tasks your job requests remains the same:
            #BSUB -n 80    # number of tasks
      
    3. If your job uses task geometry, modify the LSB_PJL_TASK_GEOMETRY environment variable as if the node had 64 processors rather than 32, for example:

      Old task geometry (uses 4 nodes):
      export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,\
      21,22,23,24,25,26,27,28,29,30,31) (32,33,34,35,36,37,38,39,\
      40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61)\
      (62,63,64,65,66,67,68,69)(70,71,72,73,74,75,76,77,78,79)}"

      Note: A backslash (\) at the end of a line indicates that the line continues.

      New task geometry (uses 2 nodes):
      export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,\
      21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,\
      40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63) \
      (64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79)}"
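Typing long task lists by hand is error-prone. When each node simply takes the next block of consecutive task IDs, as in the new geometry above, the string can be generated. This is only a sketch: task_geometry is a hypothetical helper that takes the number of tasks to place on each node, and it does not cover geometries with non-consecutive IDs.

```shell
# Sketch: build an LSB_PJL_TASK_GEOMETRY value from per-node task counts,
# assigning consecutive task IDs starting at 0.
# task_geometry is a hypothetical helper, not part of LSF.
task_geometry() {
    next=0
    geom="{"
    for count in "$@"; do
        group=""
        i=0
        while [ "$i" -lt "$count" ]; do
            group="${group:+$group,}$next"   # append the ID, comma-separated
            next=$(( next + 1 ))
            i=$(( i + 1 ))
        done
        geom="$geom($group)"
    done
    echo "$geom}"
}

# 80 tasks on two SMT nodes: 64 tasks on the first node, 16 on the second.
LSB_PJL_TASK_GEOMETRY=$(task_geometry 64 16)
export LSB_PJL_TASK_GEOMETRY
echo "$LSB_PJL_TASK_GEOMETRY"
```

A small call such as task_geometry 2 3 prints {(0,1)(2,3,4)}, which makes the format easy to check before generating the full 80-task string.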

    SMT should aid in getting better throughput of your jobs and better performance for your GAU charges. Below are suggestions for testing whether SMT usage will benefit your applications:

    Instructions for using SMT with the Community Climate System Model (CCSM) run scripts are given in the document "Taking advantage of Simultaneous Multi-Threading on bluevista when running CCSM" (Note: You will need to adjust for bluefire's larger nodes).

    Multiple page size support

    64-KB pages

    The default page size is 4 KB. On POWER6 systems, AIX 5L Version 5.3 supports a new 64-KB page size when running the 64-bit kernel. 64-KB pages are intended to be general-purpose. They are easy to use, and it is expected that many applications will see performance benefits when using 64-KB pages rather than 4-KB pages. IBM has reported performance improvements on a variety of workloads ranging from 1% to 13% when compared to the default 4-KB pages.

    A user can specify a different page size to use for each of the three regions of a process's address space (data, stack, and text). The ldedit command may be used to set these page size options in an existing executable:

    ldedit -btextpsize=64K -bdatapsize=64K -bstackpsize=64K a.out
    

    A user can also set a process's preferred page sizes via the LDR_CNTRL environment variable. The following example will cause a.out to use 64-KB pages for its data, text, and stack:

    Korn shell:

    export LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K
    

    This will override any page size settings in an executable's XCOFF header.
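Because the LDR_CNTRL value is a single string with "@" separators, a typo in it is easy to miss. The sketch below assembles the value from three page sizes and rejects anything other than 4K or 64K. That restriction is a choice made for this example (larger page sizes need special authorization), and ldr_cntrl_value is a hypothetical helper name.

```shell
# Sketch: build an LDR_CNTRL value from data, text, and stack page sizes.
# ldr_cntrl_value is a hypothetical helper. It accepts only 4K and 64K,
# which is a restriction chosen for this example.
ldr_cntrl_value() {
    for size in "$1" "$2" "$3"; do
        case $size in
            4K|64K) ;;
            *) echo "unsupported page size: $size" >&2; return 1 ;;
        esac
    done
    echo "DATAPSIZE=$1@TEXTPSIZE=$2@STACKPSIZE=$3"
}

# 64-KB pages for data, text, and stack, as in the Korn shell example above.
LDR_CNTRL=$(ldr_cntrl_value 64K 64K 64K) && export LDR_CNTRL
echo "$LDR_CNTRL"    # prints DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K
```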

    Caveat: Using 64-KB pages rather than 4 KB pages for a multithreaded process's data may reduce the maximum number of threads a process can create due to alignment requirements for stack guard pages. If you encounter this limit, you may disable stack guard pages by setting the environment variable AIXTHREAD_GUARDPAGES to 0.

    Page sizes for very high performance environments

    AIX 5.3 also supports large pages (16 MB) and "huge" pages (16 GB). However, these must be configured by the system administrator and the system rebooted. Users must be specifically authorized to use large pages. For further information on special requests for use of large pages, contact the CISL Consulting Office by any of the methods in our CISL Customer Support.

    Further details are discussed in the IBM Whitepaper, "Guide to Multiple Page Size Support on AIX 5L Version 5.3".

    Processor binding is mandatory

    We highly recommend using processor binding for all parallel jobs. If you have been using the bindproc.x script in /contrib on blueice, this has been replaced with an IBM-provided launch script on bluefire. To use it, set (Korn or Bourne shell syntax):

    You may provide a comma (,) separated list of cpu-ids.

    For hybrid programs, use:

    along with your OMP_NUM_THREADS environment variable setting.

    Important: All parallel jobs should begin using one of the launch scripts mentioned above with their mpirun.lsf command.

    TotalView debugger

    Totalview is available on bluefire under /usr/local/bin. The previous instructions using host lists and poe still work, but the new usage instructions below are easier to follow and run faster.

    Debugging parallel programs: All parallel jobs on the IBMs must be run in batch via LSF, including Totalview. To use Totalview, you now specify the "tv" elim with LSF. A host list is no longer required, and Totalview no longer must run in IP, so it runs faster. Once the tv elim is included in your LSF directives, you use mpirun.lsf as usual, rather than invoking the totalview command. An example script is given below for an MPI parallel program.

    Before using Totalview, be sure that you have X Window forwarding turned on (for example, by logging on using ssh -Y).

    Run this script by submitting it using bsub < script. Wait for the Totalview window to pop up. To begin debugging a parallel program, press "go" in the main window. You will see a dialog box saying that poe is a parallel program and asking if you want to stop the program. Press "yes." The source code will then be displayed so that you can set breakpoints.

    #!/usr/bin/csh
    #
    # LSF batch script to debug an MPI code (cpi)
    # under totalview
    #
    #BSUB -n 2                            # number of total tasks
    #BSUB -o mpilsf.out.%J                # output filename (%J to add job id)
    #BSUB -e mpilsf.err.%J                # error filename
    #BSUB -J mpilsf.test                  # job name
    #BSUB -q debug                        # queue
    #BSUB -W 0:15                         # wallclock time limit
    #BSUB -P 12345678                     # project number
    #BSUB -a tv                           # use totalview elim
    
    mpirun.lsf ./cpi 
    

    Totalview Memory debugging example

    To debug for memory leaks and other memory problems, it is necessary to link in a TotalView library that replaces the malloc on the system. The following example shows how to link and run a program with memory leaks under the TotalView debugger on bluefire. Note that the debug queue is suitable for one-node debug jobs up to 64 processors. For larger jobs, you will need to use the other batch queues.

    #!/usr/bin/csh
    #debug.leak - compiles and runs memory debug example
    
    setenv LIBPATH /usr/local/toolworks/tvheap_mr
    
    #Note: Version 8.7 became the default on October 5, 2009.
    
    #Create the example source file before compiling it.
    cat << 'EOF1' > leak.f
          program testit
          implicit none
          integer i, ierror
    
          do i=1,10
             call loknlod
             print *, 'made it through loknlod ', i
          end do
          stop
          end
    
          subroutine loknlod
    c simulates memory leak by failing to deallocate arrays
          real, allocatable:: foo(:,:)
          real, allocatable:: foo2(:,:)
          integer i,j,ierror
    
          print *, 'stepped into loknlod'
    c foo is sized (100,50) so the loops below stay within its bounds
          allocate(foo(100,50),stat=ierror)
          if(ierror /=0) then
             write(*,*)"Error trying to allocate foo"
             stop
          endif
    
          allocate(foo2(100,100),stat=ierror)
          if(ierror /=0) then
             write(*,*)"Error trying to allocate foo2"
             stop
          endif
          
          do j=1,50
             do i=1,100
                foo(i,j) = 6 + i
                foo2(i,j) = 100 + i
             end do
          end do
          return
          end subroutine loknlod
    'EOF1'
    
    #Compile and link against the TotalView heap library.
    xlf90 -o leak -g -q64 -qfixed leak.f -L/usr/local/toolworks/tvheap_mr \
      -L/usr/local/toolworks/totalview/lib \
        /usr/local/toolworks/totalview/lib/aix_malloctype64_5.o
    
    cat << 'EOF2' > run.leak
    #!/usr/bin/csh
    #
    # LSF batch script to do memory debugging
    # under totalview
    #
    #BSUB -n 2                          # number of total tasks
    #BSUB -o leak.lsf.out.%J            # output filename (%J to add job id)
    #BSUB -e leak.lsf.err.%J            # error filename
    #BSUB -J leak.lsf.test              # job name
    #BSUB -q debug                      # queue
    #BSUB -W 0:15                       # wallclock time limit
    #BSUB -P 12345678                   # account number
    #BSUB -a tv                         # use Totalview elim
    
    mpirun.lsf ./leak
    'EOF2'
    
    bsub < run.leak
    
    #Under Totalview 8.7, when a Startup Parameters window pops
    #up, press OK. Go to the Totalview main window and press "Go." (Source code
    #may not be visible yet.) When informed the program is
    #parallel and asked whether you wish to stop the job, select "Yes." 
    #Set breakpoint near end of program (click box in front of line 9).
    #After running to the breakpoint (Go), select "Open MemoryScape" 
    #under the Debug menu. In the Memory Debugging Session window,
    #choose "Leak Detection Source Report."
    
    

    Other memory debugging methods

    Some alternative methods of memory debugging involve using the IBM libhmd library or utilities available in GNU. These methods are briefly discussed at: http://www.cisl.ucar.edu/docs/pdf/LeakDetection.pdf.

    Memory analyzer

    There are two easy ways to analyze the memory footprint of your program.

    Use the standard AIX real time tools

    Type ps or top (with the appropriate arguments) on the command line to get a snapshot of the current memory usage of all running programs. You may have to search for your job.

    Use the CSG "Job Memory Usage" tool

    If you would like to know the total (peak) memory usage of just your job, without continuously monitoring the memory in real time, you can use the job_memusage.exe tool. It prints the memory usage of your program on stdout when the program terminates. It is located under /contrib/bin and can be used as follows:

    /contrib/bin/job_memusage.exe your-program [your-arguments]
    
    You can pass arguments to your program, and input redirection (such as "<") also works.
    The tool works interactively (i.e. on the command line) and for OpenMP, MPI, and hybrid programs.
    Command line or OpenMP example:
    /contrib/bin/job_memusage.exe ./hello_world.exe 
    
    MPI and hybrid (usually in a LSF script) example:
    export MP_LABELIO=yes # if you use ksh
    mpirun.lsf /contrib/bin/job_memusage.exe ./cam < namelist
    
    or:
    setenv MP_LABELIO yes # if you use csh
    mpirun.lsf /contrib/bin/job_memusage.exe ./cam < namelist
    
    When your job returns, there will be some output. For a command-line run there is a single line with the total memory usage of your job. For MPI and hybrid jobs there is a line for every node on which your program ran, which is why it is useful (though not strictly required) to set the MP_LABELIO environment variable: it lets you tell the nodes apart.

    The job_memusage.exe tool is compatible with the launch tool described above. Both tools can be used at the same time, as in this example:

    export MP_LABELIO=yes # if you use ksh
    export TARGET_CPU_RANGE="-1"
    
    mpirun.lsf /usr/local/bin/launch /contrib/bin/job_memusage.exe ./wrf.exe 
    
    or:
    setenv MP_LABELIO yes # if you use csh
    setenv TARGET_CPU_RANGE "-1"
    mpirun.lsf /usr/local/bin/launch /contrib/bin/job_memusage.exe ./wrf.exe 
    

    Program timers

    Here are four easy ways to time your program.

    Use the Unix command "date" in a simple script

    With a simple script such as this:

       echo "start date:"
       date
       run-your-executable
       echo "end date:"
       date
    

    The output of this script provides the start time and the end time of your program; the difference is the wall-clock time your program used.
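To avoid subtracting the two dates by hand, a script can let the shell do it. The sketch below relies on the SECONDS variable, which ksh and bash maintain as the number of seconds since the shell started; other shells may not provide it. The sleep command is only a stand-in for your executable.

```shell
# Sketch: wall-clock timing using the SECONDS variable of ksh and bash.
# "sleep 1" stands in for run-your-executable.
start=$SECONDS
sleep 1
elapsed=$(( SECONDS - start ))
echo "elapsed wall-clock seconds: $elapsed"
```

The result has a resolution of one second, so this approach suits long runs rather than short tests.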

    Use the Unix command "timex" from the command line or a one-line script

    The output of the command

    timex my_program
    

    yields three numbers identified as real, user, and sys:
    "real" is the elapsed wall-clock time
    "user" is the CPU time spent executing your program's own code
    "sys" is the CPU time the system spent on behalf of your program (for example, in system calls).

    Use built-in functions specific to C, C++ or Fortran

    For C/C++, call the functions clock(), time(...), difftime(...), etc. to get different types of timing.

    For Fortran, call the subroutine date_and_time(...).

    Use MPI functions specific to C, C++ or Fortran

    If you are writing an MPI program, you should use the MPI functions rather than the built-in functions above.

    For C/C++, call the functions MPI_Wtime() and MPI_Wtick().

    For Fortran, call the functions MPI_WTIME() and MPI_WTICK().

    Performance analyzers

    Information on available tools is given here.

    Examples

    We are working to provide new examples; see directory /usr/local/examples on the system.

    Modules utility

    The "modules" utility is now available on Bluefire for modifying your environment to find alternate compilers and software under /contrib.

    To get set up to use modules, you will need to add the following line to your .cshrc file after all other path-setting commands, if you are a C shell user.

    source /contrib/Modules/3.2.6/init/csh

    Or you need to add the following line to your .profile file, if you are using Korn shell.

    . /contrib/Modules/3.2.6/init/ksh

    After this setup is executed upon login, you can use the module command. Some sample dotfiles with modules settings can be found under /usr/local/skel.

    To see which modules are loaded, type:

    module list
    

    To load a new module (ImageMagick for example), type:

    module load ImageMagick-6.5.3-10
    

    To show all available modulefiles, type:

    module av
    

    For help, type:

    module help
    

    Getting help

    Contact CISL Customer Support, call 303-497-1278, or send email to consult1@ucar.edu.

    Appendix A. Comparison of bluefire and blueice chips

    Resource: Clock cycle
      Bluefire/Power6: 4.7GHz
      Blueice/Power5+: 1.9GHz

    Resource: Memory/processor
      Bluefire/Power6: 2-4 GB
      Blueice/Power5+: 2-4 GB

    Resource: L1 cache
      Bluefire/Power6: L1 cache is 128 KB (64 KB data + 64 KB instruction) per processor.
      Blueice/Power5+: L1 cache is 96 KB (32 KB data + 64 KB instruction) per processor.

    Resource: L2 cache
      Bluefire/Power6: L2 cache is 4 MB per processor on-chip.
      Blueice/Power5+: L2 cache is 2 MB per processor on-chip.

    Resource: L3 cache
      Bluefire/Power6: The off-chip L3 cache is 32 MB per two-processor chip, and is shared by the two processors on the chip. L3 cache memory is connected to the chip via an 80-GB-per-second bus.
      Blueice/Power5+: 36 MB per processor pair, shared by all processors

    Resource: Switch Latency
      Bluefire/Power6: Infiniband 1.3 µs (peak)
      Blueice/Power5+: HPS 5.0 µs (peak)

    Resource: Switch Bandwidth
      Bluefire/Power6: 20 GBps each direction
      Blueice/Power5+: 1.7 GBps each direction

    Resource: Multiple Functional Units
      Bluefire/Power6: Main thing is faster clock; 2 floating point units; 3 fixed point units; two load/store units
      Blueice/Power5+: 2 floating point units; 3 fixed point units; two load/store units

    Resource: Simultaneous Multi-Threading (SMT)
      Bluefire/Power6: Yes - SMT appears to the OS as multiple CPUs. Threaded applications may take advantage of SMT. To use on bluefire, use double the number of tasks you used on blueice
      Blueice/Power5+: Same.