Author's name: |
Charles Archer |
Email Address: |
archerc@us.ibm.com |
Institution: |
IBM |
Other Authors' Names, Email Addresses, and Institutions: |
Brian Smith, smithbr@us.ibm.com - IBM
Mike Blocksome, blocksom@us.ibm.com - IBM
Joe Ratterman, jratt@us.ibm.com - IBM |
Title: |
|
Utilization of the Communications Coprocessor on Blue Gene/L |
Abstract: |
Coprocessor mode in the existing implementation of MPI for Blue Gene/L is beneficial as an offload engine to obtain maximal performance in communication-bound applications. Most notably, collective operations and bandwidth-bound receives are assisted by the coprocessor. However, the use of the coprocessor does not imply asynchronous communication that allows overlap of sends and receives with computation. As a result, applications carefully written to exploit overlap on other MPI platforms may not see any benefit from similar code on Blue Gene/L. Collective operations are, by nature, synchronous, and Blue Gene/L's implementation uses the coprocessor to assist when necessary for collective operations. To achieve fully asynchronous behavior, design changes are required for the lightweight kernel and the message passing facilities of BG/L. This presentation will review a set of design principles for addressing these issues, and will report on performance results of these implementations. |

Author's name: |
Cecelia DeLuca |
Email Address: |
cdeluca@ucar.edu |
Institution: |
NCAR |
Other Authors' Names, Email Addresses, and Institutions: |
V. Balaji, v.balaji@noaa.gov - NOAA GFDL
Arlindo da Silva, arlindo.dasilva@nasa.gov - NASA GMAO
Rocky Dunlap, rocky@cc.gatech.edu - GA Tech
Chris Hill, cnh@plume.mit.edu - MIT
Robert Ferraro, robert.ferraro@nasa.gov - NASA JPL
Erik Kluzek, erik@ucar.edu - NCAR
Peggy Li, peggy.li@nasa.gov - NASA JPL
Leo Mark, leomark@cc.gatech.edu - GA Tech
Roberto Mechoso, mechoso@atmos.ucla.edu - UCLA
Don Middleton, don@ucar.edu - NCAR
Serguei Nikonov, serguei.nikonov@noaa.gov - NOAA GFDL
Spencer Rugaber, spencer@cc.gatech.edu - GA Tech
Don Stark, stark@ucar.edu - NCAR
Max Suarez, max.j.suarez@nasa.gov - NASA GMAO
Gerhard Theurich, gtheurich@sgi.com - SGI
Silverio Vasquez, svasquez@ucar.edu - NCAR
Weiyu Yang, weiyu.yang@noaa.gov - NOAA NCEP |
Title: |
|
The Earth System Modeling Framework and Earth System Curator: Software Components as Building Blocks of Community |
Abstract: |
The Earth System Modeling Framework (ESMF) - http://www.esmf.ucar.edu - is an established multi-agency initiative to develop high performance common modeling infrastructure for climate and weather models. ESMF is the technical foundation for the NASA Modeling, Analysis, and Prediction (MAP) Climate Variability and Change program and the DoD Battlespace Environments Institute (BEI). It is being incorporated into the Community Climate System Model (CCSM), the Weather Research and Forecast (WRF) Model, NOAA NCEP and GFDL models, a variety of Army, Navy, and Air Force models, the GEOS-5 atmospheric general circulation model, the Space Weather Modeling Framework (SWMF), and many others. The new, NSF-funded Earth System Curator - http://www.cc.gatech.edu/projects/curator - is a prototype database and toolkit that will store information about model configurations, prepare models for execution, and run them locally or in a distributed fashion.
The key concept that underlies both ESMF and the Earth System Curator is that of software components. Components are software units that are “composable,” meaning they can be combined to form coupled applications. These components may be representations of physical domains, such as atmospheres or oceans; processes within particular domains, such as atmospheric radiation or chemistry; or computational functions, such as I/O. ESMF provides interfaces, an architecture, and tools for structuring components hierarchically to form complex, coupled modeling applications. ESMF components may be run sequentially, concurrently, or in a mixed mode on computers ranging from laptops to the world's largest supercomputers. The Earth System Curator will enable modelers to describe, archive, search, compose, and run component-based models. Together these projects encourage a new paradigm for modeling: one in which the community can draw from a federation of many interoperable components in order to create and deploy modeling applications. The goal is to enable a rich network of collaborations and a new generation of models that can simulate the Earth's environment and predict its behavior better than ever before. |

Author's name: |
John Dennis |
Email Address: |
dennis@ucar.edu |
Institution: |
National Center for Atmospheric Research |
Title: |
|
Scaling the Parallel Ocean Program (POP) to 30,000 processors on Blue Gene/L |
Abstract: |
We present the results of work to improve the scalability of the Parallel Ocean Program (POP) on Blue Gene/L. We find that it is possible to significantly increase the simulation rate of POP on large processor counts by applying two techniques to enhance scalability. Our first technique removes land points within the barotropic solver of POP. The elimination of all land points within the barotropic solver reduces both the amount of data that must be loaded from the memory hierarchy and the total message volume. Our second technique involves the use of an alternative partitioning algorithm based on space-filling curves that reduces load imbalance. The combined impact of both techniques doubled the simulation rate of the POP 0.1 degree benchmark from 3.9 to 7.9 simulated years per wall-clock day on 30k processors of the IBM Blue Gene/L Watson system. A rate of 7.9 years per day represents the highest simulation rate of the POP 0.1 degree benchmark currently achieved. |
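As a rough illustration of the two techniques in the abstract above, the sketch below orders the blocks of a small grid along a space-filling curve, drops the all-land blocks, and cuts the remaining sequence into nearly equal contiguous pieces. The choice of a Hilbert curve, the 8x8 block grid, the `example_land` mask, and all function names are assumptions made for illustration; the abstract does not say which curve or block weighting POP uses.

```python
def hilbert_index(n, x, y):
    """Position of block (x, y) along a Hilbert curve on an n-by-n grid.

    n must be a power of two. This is the standard bit-by-bit conversion.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition_ocean_blocks(n, is_land, nprocs):
    """Give each process a contiguous run of ocean blocks along the curve."""
    ocean = [(x, y) for x in range(n) for y in range(n) if not is_land(x, y)]
    ocean.sort(key=lambda b: hilbert_index(n, b[0], b[1]))
    base, extra = divmod(len(ocean), nprocs)
    parts, start = [], 0
    for p in range(nprocs):
        size = base + (1 if p < extra else 0)  # sizes differ by at most one
        parts.append(ocean[start:start + size])
        start += size
    return parts


# Hypothetical mask: a square "continent" occupying one corner of the grid.
def example_land(x, y):
    return x < 3 and y < 3


parts = partition_ocean_blocks(8, example_land, 5)
```

Because consecutive positions on the curve are neighbouring blocks, each process tends to receive a compact patch of ocean, and because land blocks are never assigned, every process holds the same number of active blocks to within one.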

Author's name: |
Richard Gerber |
Email Address: |
ragerber@lbl.gov |
Institution: |
NERSC/Lawrence Berkeley National Lab |
Other Authors' Names, Email Addresses, and Institutions: |
Farid Parpia, parpia@us.ibm.com - IBM
Stephen R. Behling, sbehling@us.ibm.com - IBM
|
Title: |
|
Experiences Configuring, Debugging, Validating and Running on NERSC's New 122-node POWER 5 p575 System |
Abstract: |
NERSC's 122-node IBM p575 POWER 5 system, Bassi, was installed in the fall of 2005 and went into production service in January 2006. I will describe how NERSC and IBM used a suite of benchmarks and application codes to debug, configure, and validate the system.
Because NERSC's 2,000 users come from many fields of science and use a wide variety of computational algorithms and codes, the choice of a default Bassi configuration was not obvious. There is a plethora of configuration options, many of which significantly affect the performance of any given code. I will discuss the default settings chosen by NERSC, the rationale for choosing them, and the problems we've encountered thus far.
I will also share some user experiences using Bassi and transitioning from NERSC's IBM SP POWER 3+ system. |

Author's name: |
Siddhartha Ghosh |
Email Address: |
sghosh@ucar.edu |
Institution: |
NCAR |
Other Authors' Names, Email Addresses, and Institutions: |
Irfan Elahi, irfan@ucar.edu - NCAR
Jim Edwards, jedwards@ucar.edu - IBM
Wei Huang, huangwei@ucar.edu - NCAR
Juliana Rew, juliana@ucar.edu - NCAR |
Title: |
|
AIX 5.3 Experiences at NCAR |
Abstract: |
The Scientific Computing Division (SCD) at the National Center for Atmospheric Research (NCAR) recently upgraded its Power 5 cluster, Bluevista, to AIX 5.3. A feature of this OS level is support for Simultaneous Multithreading (SMT), which allows two tasks or threads to access the resources of one physical processor. Another feature is dynamic adjustment of page sizes. The XLF compiler was also upgraded to 10.1.
Benchmarks were run on a variety of real applications before and after the upgrade, and numerical accuracy and performance were compared. Applications included the Community Climate System Model (CCSM), the Community Atmosphere Model (CAM), the Weather Research and Forecasting Model (WRF), the Parallel Ocean Program (POP), and the 3D HD/MHD/Hall-MHD turbulence model (HD3D). SMT was then enabled on the system. IBM provided a script to bind tasks to processors for pure MPI applications and a task binding library for hybrid OpenMP/MPI applications. NCAR's batch job scheduling software (Load Sharing Facility) was modified to accommodate a higher number of tasks to be scheduled per node. Performance results from SMT-enabled runs were compared. Finally, the performance of the model runs was checked by creating a larger Technical Large Page (TLP) pool and allowing applications to request large pages from that pool.
Relative benefit from SMT, XLF 10.1, and/or large pages will be discussed for our applications. The ability of the algorithm used in the task binding library to take advantage of SMT will be evaluated. |

Author's name: |
Siddhartha Ghosh |
Email Address: |
sghosh@ucar.edu |
Institution: |
NCAR |
Other Authors' Names, Email Addresses, and Institutions: |
Rich Loft, loft@ucar.edu - NCAR
Yu-heng Tseng, yhtseng@lbl.gov - LBNL
Chris Ding, chqding@lbl.gov - LBNL
Michael Wehner, mfwehner@lbl.gov - LBNL |
Title: |
|
Computational and I/O performance study of FV CAM in Bluegene/L and pwr5 systems |
Abstract: |
The Community Atmospheric Model (CAM) is a large-scale community climate simulation code developed at NCAR in collaboration with many other institutions and individuals in the US. It supports many different dynamical cores, of which the Finite Volume (FV) core is envisioned to be suitable for running concurrently on a large number of processors, primarily due to its support for two-dimensional domain decomposition and its lower communication overhead compared with the fully spectral Eulerian dynamical core. The physics component is parallelized using OpenMP threads, while the dynamical core uses the Message Passing Interface to communicate between domains. The latest release of CAM, version 3.1, has an I/O bottleneck because it reads and writes restart files and history variables on task 0 by gathering and scattering all the data. In earlier work, we developed an efficient parallel I/O algorithm that incorporates the Parallel NetCDF library. The present work will demonstrate its I/O and computational performance characteristics, particularly on Blue Gene/L, and compare those performance numbers with those obtained from a few 8-way Power-5 nodes interconnected through two host-bus adapters to a Federation switch. We will also address the benefits obtained from enabling SMT and how we learned to get the best out of it. |

Author's name: |
Chris Gottbrath |
Email Address: |
Chris.Gottbrath@etnus.com |
Institution: |
Etnus, LLC |
Title: |
|
Memory Debugging on AIX and Power Linux with TotalView |
Abstract: |
This talk will introduce the heap memory debugging capabilities of the Etnus TotalView debugger on IBM AIX and Power Linux. TotalView brings information about the status of the heap memory into the interactive debugging session. This gives users a completely new way to understand and ultimately solve their heap problems in the context of their running application, without waiting for their program to exit. TotalView can help locate and eliminate memory leaks, array bounds problems, references to dangling pointers, and errors such as calling free twice on the same allocation. |

Author's name: |
Alan Gray |
Email Address: |
alang@epcc.ed.ac.uk |
Institution: |
University of Edinburgh |
Title: |
|
Performance Benefits from Upgrading from Power4 to Power5 Technology |
Abstract: |
HPCx, the flagship UK academic supercomputing facility, recently underwent an upgrade from IBM Power4 to Power5 technology. The current system features 96 IBM eServer 575 compute nodes: a total of 1536 processors (which is a similar size to the previous system). Performance before and after the upgrade will be compared with the presentation of results from a range of benchmarks, both synthetic and involving real applications representing typical use of the system. New to Power5 technology is Simultaneous Multithreading (SMT), which enables 2 tasks to access the resources of one physical processor simultaneously. This aims to utilise the resources more fully by reducing the number of cycles that the functional units remain idle. The motivation for SMT, and expectations for applications (taking into account the results of the above comparison) will be discussed. Application benchmark results using SMT on HPCx will be presented, and it will be seen that performance improvements are available for real applications in certain situations. Included also for comparison will be results from EPCC's Blue Gene/L system. |

Author's name: |
Brian Gunney |
Email Address: |
gunneyb@llnl.gov |
Institution: |
Lawrence Livermore National Lab |
Other Authors' Names, Email Addresses, and Institutions: |
David Hysom, hysom@llnl.gov - LLNL |
Title: |
|
Parallelizing the Communication-Intensive Point-Clustering Algorithm |
Abstract: |
The clustering algorithm is widely used in structured adaptive mesh refinement (SAMR). It has been difficult to parallelize due to its use of many collective operations and virtually no floating-point operations. Its parallel performance is a critical roadblock to scaling SAMR applications on current computers at Lawrence Livermore National Lab. Several structural changes were made to improve its parallel performance. We made collective calls using groups of processors rather than globally. We used hand-coded peer-to-peer communications, which were faster than creating and using MPI communicators for each group. We found independent tasks, allowing the use of task-parallelism to switch out tasks waiting for communication, thus reducing the time waiting for messages. We allowed the tasks to be driven by the arrival of messages, removing the artificial ordering imposed by the SPMD approach. Finally, we distributed the task of coordinating the entire algorithm to relieve the bottleneck of the one coordinating process. On a 16K partition of BlueGene/L, we showed up to a 400-fold increase in the speed of this communication-intensive algorithm. |

Author's name: |
John Hague |
Email Address: |
ibj@ecmwf.int |
Institution: |
IBM |
Title: |
|
Performance of IFS on Power5+ |
Abstract: |
This talk will describe the performance of ECMWF's Integrated Forecast System on their 1.9GHz Power5+ System (2 clusters of over 2000 processors each), running the 10-day global forecast and the 4D-Var data assimilation. Particular reference will be made to features such as SMT, 64K pages, and RDMA. Results on other systems, such as the JS21 Blade and BlueGene, will be mentioned. |

Author's name: |
Bernd Mohr |
Email Address: |
b.mohr@fz-juelich.de |
Institution: |
Research Centre Juelich, Germany |
Other Authors' Names, Email Addresses, and Institutions: |
Felix Wolf, f.wolf@fz-juelich.de - RCJ, Germany
Brian Wylie, b.wylie@fz-juelich.de - RCJ, Germany
Markus Geimer, m.geimer@fz-juelich.de - RCJ, Germany |
Title: |
|
Analyzing the Performance of Parallel Applications on IBM Systems with KOJAK |
Abstract: |
The KOJAK performance measurement and analysis environment, developed in collaboration between Research Centre Juelich and the University of Tennessee, facilitates insight into execution inefficiencies of parallel applications executing on a range of widely-used shared and distributed memory computer systems. Automated instrumentation of user applications is complemented with automatic analysis of performance problems arising from inefficient usage of parallel programming interfaces (such as MPI, OpenMP, and SHMEM). Performance problems classified by type and quantified by severity can be thoroughly investigated using an interactive browser (CUBE) which presents an integrated, hierarchical view of performance behaviour, call paths and threads of execution. Additionally, execution traces can be exported for visualisation and further analysis with third-party tools such as VAMPIR and Paraver.
Recent KOJAK extensions improve the support for IBM parallel systems considerably:
* In conjunction with the Paraver group of Barcelona Supercomputing Center, KOJAK was ported to the Mare Nostrum system (and thereby to PowerPC-based clusters in general). This not only makes automatic performance analysis available for this system for the first time, but also enables analysis and visualization with VAMPIR through VTF3 and OTF converters provided by KOJAK.
* Also, a trace file converter has been implemented to translate KOJAK's EPILOG format to the one used by Paraver. All aspects are translated: the machine and application resource model, all event, communication, and state records, as well as hardware counter information. This makes it possible to use the flexible analysis and visualization features provided by Paraver on traces measured with KOJAK.
* Work has begun on a redesign and reimplementation of KOJAK to better support massively parallel systems and applications. Besides profile-guided intelligent instrumentation and selective tracing to reduce the per-processor trace size, we parallelized our automatic trace analysis component EXPERT. We report on early experiences using KOJAK to trace real-world applications executing on our BlueGene/L system. |

Author's name: |
Bernd Mohr |
Email Address: |
b.mohr@fz-juelich.de |
Institution: |
Research Centre Juelich, Germany |
Other Authors' Names, Email Addresses, and Institutions: |
Klaus Wolkersdorfer, k.wolkersdorfer@fz-juelich.de - RCJ, Germany
Jutta Docter, j.docter@fz-juelich.de - RCJ, Germany |
Title: |
|
Early Experiences with Juelich's 8-Rack BlueGene/L System |
Abstract: |
In January 2006, the BlueGene/L system of the John von Neumann Institute for Computing (NIC) of the Research Center Juelich was upgraded from a one-rack system to an eight-rack system in just 2 weeks. Currently, it is one of the most powerful systems in Europe (faster than Mare Nostrum and even faster than the Earth Simulator). In this talk, we report on our experiences installing, maintaining, running, and using a 16384 processor non-product system. The system is operated in conjunction with a 41-node Power4+ 32-way SMP cluster integrated through a shared GPFS filesystem. |

Author's name: |
Dmitry Pekurovsky |
Email Address: |
dmitry@sdsc.edu |
Institution: |
San Diego Supercomputer Center/UCSD |
Other Authors' Names, Email Addresses, and Institutions: |
P.K. Yeung, yeung@peach.ae.gatech.edu - Department of Aerospace Engineering, Georgia Institute of Technology
D. Donzis, donzis@peach.ae.gatech.edu - Department of Aerospace Engineering, Georgia Institute of Technology
S. Kumar, sameerk@us.ibm.com - IBM T.J.Watson Lab
W. Pfeiffer, pfeiffer@sdsc.edu - San Diego Supercomputer Center/UCSD
G. Chukkapalli, giri.chukkapalli@sun.com - San Diego Supercomputer Center |
Title: |
|
Scalability of a pseudospectral DNS turbulence code with 2D domain decomposition on Power4+/Federation and Blue Gene systems |
Abstract: |
The subject of this work is a DNS turbulence code that is used to address a number of research questions in turbulence and turbulent mixing. It uses a spectral representation in the spatial dimensions and second-order finite differences in time. The major part of the computation involves 3D Fourier transforms. The code uses ESSL for 1D fast Fourier transforms. The MPI all-to-all exchange routine is used to transpose the arrays and is responsible for most of the communication time. 2D decomposition allows the use of processor counts beyond the linear grid size, which in turn provides enough power to study grids of previously infeasible sizes.
Performance was investigated using up to 1024 IBM Power4+ processors connected by a Federation switch at the San Diego Supercomputer Center, as well as up to 32K processors of Blue Gene W at the IBM T.J. Watson lab. Measurements show good scalability (both strong and weak) throughout the studied processor count range. Observed performance is consistent with a simple communication model. |
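The remark in the abstract above that 2D decomposition permits processor counts beyond the linear grid size can be made concrete with a little arithmetic. The sketch below is only an illustration: the cubic grid of 2048 points per side and the 128 x 256 process grid are hypothetical values chosen so that the product matches the 32K processors mentioned in the abstract, and the function names are not taken from the code being described.

```python
def max_processes(n, decomposition):
    """Upper bound on MPI tasks for an n^3 grid.

    A 1D (slab) decomposition splits one axis, so at most n tasks can hold
    at least one plane each. A 2D (pencil) decomposition splits two axes,
    so up to n * n tasks can hold at least one line each.
    """
    if decomposition == "1d":
        return n
    if decomposition == "2d":
        return n * n
    raise ValueError("decomposition must be '1d' or '2d'")


def pencil_shape(n, prow, pcol):
    """Local array shape when an n^3 grid is split over a prow x pcol task grid.

    One axis stays whole on every task, so 1D FFTs along it need no
    communication; the other two axes are divided among the tasks.
    """
    if n % prow or n % pcol:
        raise ValueError("process grid must divide the grid size evenly")
    return (n, n // prow, n // pcol)


# Hypothetical example: 2048^3 grid on 128 x 256 = 32768 tasks.
n = 2048
slab_limit = max_processes(n, "1d")
pencil_limit = max_processes(n, "2d")
local = pencil_shape(n, 128, 256)
```

With these assumed numbers a slab decomposition would stop at 2048 tasks, while the pencil decomposition still leaves each of the 32768 tasks a 2048 x 16 x 8 block.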

Author's name: |
Wayne Pfeiffer |
Email Address: |
pfeiffer@sdsc.edu |
Institution: |
San Diego Supercomputer Center |
Title: |
|
Evaluation of Blue Gene Performance and Applicability |
Abstract: |
The San Diego Supercomputer Center has two large IBM supercomputers of differing architectures: a single-rack Blue Gene and a cluster of p655 and p690 nodes called DataStar. Both have more than 2,000 compute processors, support large GPFS configurations, and are heavily used.
In this paper, the performance of Blue Gene is compared to that of DataStar for both synthetic benchmarks and representative applications. In addition, guidance is provided as to which types of applications run well on Blue Gene and which ones do not.
Because Blue Gene processors are slower than those on DataStar, applications must scale to large processor counts to get absolute performance that is attractive. Thus Blue Gene has proved valuable to only a modest number of SDSC users, but some high-profile ones. Pluses for those users are that:
+ Their applications run relatively fast and scale well;
+ Turnaround is good with only a few users.
An important plus for the SDSC systems staff is that:
+ The hardware is reliable and easy to maintain.
Minuses are that:
- Some applications run relatively slowly or do not scale well enough to take advantage of the large number of processors;
- Some typical problems need to run in coprocessor mode (i.e., using only one processor per node) to fit in the available memory;
- Other typical problems will not fit at all. |

Author's name: |
Kurt Pinnow |
Email Address: |
kwp@us.ibm.com |
Institution: |
IBM |
Other Authors' Names, Email Addresses, and Institutions: |
Roy Musselman, Amanda Peters, Brent Swartz |
Title: |
|
Blue Gene Application Performance Optimization |
Abstract: |
Blue Gene has reached the apex of the 26th Top 500 list and won many prizes for the high level of performance it is able to deliver. These achievements have been accomplished by the Blue Gene team's persistent focus on three key performance essentials. The first of these is the design and capability of the Blue Gene system itself. The second is the effort taken to highly optimize certain instruction sequences to get them to perform well in a single-processor environment. The third is the design of efficient parallel algorithms that deliver the utmost in efficiency and enable the parallel implementation to deliver stellar performance in aggregate. This presentation will focus on these three aspects and show how they have been used together to deliver award-winning performance.
This presentation will focus first on Blue Gene's overall design and show how this design leads to scalable performance in driving I/O, in enabling MPI, and in floating point computation. The presentation will provide details on performance latencies and throughput for key interfaces. Next, general BG/L supercomputer application optimization tips and techniques will be covered. This includes the compiler options and directives required to obtain improved performance on BG/L. Also covered are the code and algorithm changes required to utilize the two PowerPC 440 floating point units on each BG/L processor, using Single-Instruction-Multiple-Data (SIMD) instructions. Finally, the presentation will treat the third aspect: good parallel design. Here the presentation will treat the well-known BLAST algorithm (used to rank genomic alignment sequences) by showing how the sequence alignment problem can be distributed across many nodes. As a result, BG/L was shown to perform at least 2 million BLAST searches per day against a database of 2.5 million protein sequences. |

Author's name: |
Kurt Pinnow |
Email Address: |
kwp@us.ibm.com |
Institution: |
IBM |
Other Authors' Names, Email Addresses, and Institutions: |
Roy Musselman, Dave Hermsmeier |
Title: |
|
Blue Gene External Performance Monitoring |
Abstract: |
This presentation will provide an in-depth view of Blue Gene's External Performance Monitor. The External Monitor is capable of pulling various types of performance-oriented data from running Blue Gene applications and making this data available to external systems for viewing and/or processing. This presentation will give a live demo of Blue Gene's External Performance Monitoring capabilities, highlighting its ability to pull data from applications running on Blue Gene's core.
In the demo, the External Performance Monitor will be started for an application running on Blue Gene, and various performance data will be collected for the running application and sent in real time to an external system. The presentation will focus on the types of performance data that can be collected and the options that can be taken on Blue Gene to filter the information as it flows from the system. The demo will show how the information can be put to a summary display, stored in an external database, and/or passed to a visualizer application designed to process the Blue Gene data stream.
Blue Gene's External monitoring capability will be discussed alongside the capability provided by other Blue Gene performance collection mechanisms and recommendations made on when one tool should be used in place of another. |

Author's name: |
David Skinner |
Email Address: |
dskinner@nersc.gov |
Institution: |
NERSC |
Other Authors' Names, Email Addresses, and Institutions: |
Richard Gerber, ragerber@nersc.gov - NERSC
Nick Wright, nwright@sdsc.edu - San Diego Supercomputer Center/UCSD |
Title: |
|
Integrated Performance Monitoring on POWER3,4&5 |
Abstract: |
We will present recent work on performance profiling DOE applications in a production environment. This presentation will address both performance data collection as well as various web and XML technologies that aggregate, digest, and format this data for use by both the users and managers of HPC centers.
The creation of a portable, scalable, and low-overhead profiling infrastructure for HPC applications is driven by the need both to optimize HPC applications on specific architectures and to characterize the performance of applications across architectures in order to make optimal matches of workload and HPC resources. IPM (Integrated Performance Monitoring) has been deployed on several IBM SPs, and we will present some high-level summaries of several years' worth of performance profiles. |

Author's name: |
Tom Spelce |
Email Address: |
spelce1@llnl.gov |
Institution: |
Lawrence Livermore National Laboratory |
Title: |
|
Early performance results for the LLNL Purple system |
Abstract: |
The Purple system at LLNL is a very large cluster based on IBM Power 5 nodes and a multistage Federation switch. The architecture of the system will be described as well as performance results from a wide variety of applications. In addition to providing compute and I/O performance results, current best practices and areas for improvement will also be discussed. |

Author's name: |
Kevin Stratford |
Email Address: |
kevin@epcc.ed.ac.uk |
Institution: |
Edinburgh Parallel Computing Centre
University of Edinburgh, UK |
Title: |
|
Lattice Boltzmann for Complex Fluids on Blue Gene |
Abstract: |
The lattice Boltzmann (LB) equation offers a number of attractive features for the study of complex fluids --- that is, multi-component and/or multiphase mixtures --- which have many technologically important uses. Such complex fluids include suspensions, emulsions, gels, and liquid crystals. Computer simulation can provide an important tool with which not only to investigate the properties of existing materials, but also to guide the design of new ones.
In this presentation, I will provide an overview of the LB method and then focus on its use to study the problem of colloidal suspensions. These are fluid systems which contain freely moving solid particles (colloids) typically a few nanometres to a few microns in diameter. Hydrodynamic forces between the particles mediated by the fluid can play an important role in these systems, so any numerical method used to study them should include them faithfully. LB is particularly suited to such complex, changing geometries and interactions, problems which can be particularly demanding in other methods.
For fluid on its own, the LB method is highly parallel: an entirely local pressure calculation means that only nearest neighbour communication between the lattice sites is required. However, the inclusion of solid particles leads to some complication. First, to be able to scale to large numbers of processors, particle data cannot be replicated: it must be distributed. Parallelisation for the particles can then borrow ideas familiar from molecular dynamics, such as cell lists and so on. Importantly, long-range hydrodynamic interactions between particles are mediated by the LB fluid, and for many problems direct particle-particle interactions can be restricted to those which are short range. Again, parallelisation can take place by communicating particle information between nearest neighbour domains only.
The performance of the method for a number of different benchmarks involving particle suspensions has been investigated. I will compare performance results from IBM p690+ and BlueGene systems, and also report on scaling on up to 8 racks of the BlueGene machine at Thomas J. Watson Research Center. Remarkably, for particle suspensions, the computational effort remains approximately constant as the number of particles in a given system is increased. LB thus allows one to study these more complex systems in some sense "for free". It also scales extremely well to the largest number of processors. |
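To make the locality argument in the abstract above concrete, here is a minimal sketch of a lattice Boltzmann update for fluid alone, with no particles. It assumes a one-dimensional, three-velocity (D1Q3) lattice with a single-relaxation-time (BGK) collision and periodic boundaries; these are textbook simplifications chosen for brevity, not the model used in the work described, and the function names and the relaxation time are illustrative.

```python
# D1Q3 lattice: discrete velocities and their weights (standard textbook values).
C = (0, 1, -1)
W = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)


def equilibrium(rho, u):
    """Equilibrium populations for density rho and velocity u."""
    return [w * rho * (1.0 + 3.0 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
            for c, w in zip(C, W)]


def collide(f_site, tau):
    """BGK collision: reads and writes only the populations at one site."""
    rho = sum(f_site)
    u = sum(c * fi for c, fi in zip(C, f_site)) / rho
    feq = equilibrium(rho, u)
    return [fi - (fi - fe) / tau for fi, fe in zip(f_site, feq)]


def stream(f):
    """Propagation: each population moves to the adjacent site it points at."""
    n = len(f)
    out = [[0.0] * len(C) for _ in range(n)]
    for x in range(n):
        for i, c in enumerate(C):
            out[(x + c) % n][i] = f[x][i]  # periodic wrap-around
    return out


def step(f, tau=0.8):
    """One time step: purely local collision, then nearest-neighbour streaming."""
    return stream([collide(site, tau) for site in f])


# A small density bump on a periodic line of 16 sites, initially at rest.
f = [equilibrium(1.0 + (0.1 if x == 8 else 0.0), 0.0) for x in range(16)]
mass_before = sum(sum(site) for site in f)
for _ in range(50):
    f = step(f)
mass_after = sum(sum(site) for site in f)
```

In a domain-decomposed version, only `stream` would need data from another process, and only from the sites on the boundary of the adjacent subdomain, which is the nearest-neighbour communication pattern the abstract refers to.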

Author's name: |
George Walter VandenBerghe |
Email Address: |
George.Vandenberghe@noaa.gov |
Institution: |
NOAA National Center for Environmental Prediction (NCEP) |
Title: |
|
Continuing Experience with two generations of IBM P6XX clusters at NOAA/NCEP |
Abstract: |
Numerical weather prediction is one of the grand challenge problems addressed by HPC. The U.S. NOAA National Center for Environmental Prediction (NCEP) has used large IBM computing clusters to integrate and support weather forecast models since the late 1990s. A previous presentation (VandenBerghe SCICOMP 6) discussed early favorable experience with these clusters. Since then, new capabilities such as LoadLeveler preemption have appeared, along with some new pathologies such as very rapid problem working set growth. Data management problems have also increased and become a significant fraction of total system administration issues. This talk will discuss further experiences with two additional generations of IBM cluster solutions since 2Q 2002 at a very large computing center with a well-defined set of problems. In summary, these clusters continue to support NCEP's computing needs efficiently, but there are some residual problems and also emerging trends that could become prominent problems if not addressed. |

Author's name: |
Mariana Vertenstein |
Email Address: |
mvertens@ucar.edu |
Institution: |
NCAR |
Other Authors' Names, Email Addresses, and Institutions: |
Jonathan Wolfe, jwolfe@ucar.edu - NCAR |
Title: |
|
Community Climate System Model (CCSM) Efficiency on IBM HPC platforms |
Abstract: |
An overview will be given of efficiency considerations specific to running the CCSM MPMD system on IBM HPC platforms. Comparisons will also be made with other large climate models. |

Author's name: |
MuQun Yang |
Email Address: |
ymuqun@ncsa.uiuc.edu |
Institution: |
National Center for Supercomputing Applications |
Other Authors' Names, Email Addresses, and Institutions: |
Albert Cheng, acheng@ncsa.uiuc.edu - NCSA
Quincey Koziol, koziol@ncsa.uiuc.edu - NCSA
Christian Chilan, chilan@ncsa.uiuc.edu - NCSA |
Title: |
|
Parallel IO Supports and Performance Study with HDF5: A Scientific Data Package |
Abstract: |
The amount and complexity of the data used in scientific applications demand a portable standard for flexible, efficient access across high performance computing platforms. HDF5 is a portable file format and library developed at the National Center for Supercomputing Applications (NCSA) for storing, retrieving, analyzing, visualizing and converting data. It provides parallel IO support through Message Passing Interface Input and Output (MPI-IO). Parallel netCDF (PnetCDF) is a parallel version of NetCDF developed by Argonne National Lab and Northwestern University. It also provides parallel IO support through MPI-IO. Collective IO is an option inside MPI-IO that supports efficient IO for non-contiguous subsetting of arrays. Recently the NCSA HDF group has used the FLASH IO benchmark, which simulates the IO pattern of astrophysical thermonuclear flashes, to compare the IO performance of parallel HDF5 in independent mode, parallel HDF5 in collective mode, and parallel NetCDF in collective mode. We will present performance comparison results to illustrate the effectiveness of using collective IO inside HDF5 on the UCAR IBM Power4 SP cluster and the LLNL IBM Power 5 cluster. These results also show the robustness of the MPI-IO software and GPFS implemented by IBM.
HDF5 allows array data to be stored on disk in chunks, which improves IO performance when subsetting large arrays and supports extensible dimensions and in-memory data compression; these features are widely used in parallel scientific applications such as WRF. However, the shape of the data selection inside HDF5 can be irregular with chunked data storage. In the upcoming 1.8 release of HDF5, besides relying on MPI-IO, we provide several implementation options and controls to assure performance. We also provide optional Application Programming Interfaces that let users participate in the decision-making process to gain better performance. We will present these efforts in the workshop. In the process of testing HDF5 collective IO features, we frequently find bugs in several MPI-IO packages. Most bugs have been fixed in newer versions of these MPI-IO packages. However, many HDF5 applications still rely on computing architectures that use an older version of an MPI-IO package. We will share our software maintenance experience with MPI-IO in the presentation. |
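One way to see why chunked storage interacts with array subsetting, as the abstract above notes, is to count which chunks a given selection overlaps. The sketch below is plain index arithmetic, not a call into the HDF5 library; the function name, the 100 x 100 chunk shape, and the example selection are all hypothetical.

```python
from itertools import product


def chunks_touched(start, count, chunk_shape):
    """Return the indices of the chunks that a contiguous selection overlaps.

    start:       per-dimension offset of the selection
    count:       per-dimension extent of the selection (each must be > 0)
    chunk_shape: per-dimension chunk size
    """
    ranges = []
    for s, c, k in zip(start, count, chunk_shape):
        first = s // k            # chunk holding the first selected element
        last = (s + c - 1) // k   # chunk holding the last selected element
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# Hypothetical 2D dataset stored in 100 x 100 chunks:
# select rows 150..249 and columns 0..249.
touched = chunks_touched(start=(150, 0), count=(100, 250), chunk_shape=(100, 100))
```

Here a selection of 100 x 250 elements spans six chunks, and none of them is covered completely in both dimensions, so the pieces read from each chunk have different shapes. That is one simple sense in which a regular selection becomes irregular at the storage level.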

|