SCD ASR header

Training, consulting, and documentation

Technical consulting services

Providing software engineering and math library support to scientists who use NCAR/SCD's high-performance scientific computing facilities is the core business and mission of SCD's Technical Consulting Group (TCG). The group is the single point of contact for SCD's community of researchers who have questions about their scientific computing efforts, need to resolve technical problems, or want advice on optimal software design and implementation techniques. TCG leverages its contact with researchers by channeling customer needs into SCD's planning process. When the assistance of other SCD staff is required to resolve a problem, TCG coordinates all SCD efforts and manages the follow-through with customers. Collaborations with other SCD groups, researchers, vendors, and other high-performance computing centers are central to maintaining the expertise required to support this mission.

In FY2000, TCG pursued its mission through three key elements: researcher support, training, and documentation. All three activities focused specifically on the code conversion efforts surrounding the new IBM SP. While the installation of the machine was a limited-term project lasting only a few months, the subsequent conversion of millions of lines of user source code is expected to take two to three years, with TCG playing the key role of helping our research community migrate its science to the new machines. Training, documentation, and hands-on assistance are critical to the success of this effort.

Training classes and seminars sponsored by the SCD Technical Consulting Group

Developing staff and customer expertise on these new computer architectures has been a major concern, and training has played a central role in addressing it. Over the last year, TCG has arranged eight classes on using the new IBM SP and Compaq ES40 clusters. The classes have covered introductions to the IBM SP and Compaq ES40 clusters, introductions to MPI and OpenMP programming, and memory-cache optimization for RISC microprocessor-based systems. While most of the classes were delivered at NCAR, they were open to all local and remote SCD customers. Scientists from local institutions attended, as did staff from more geographically distant sites. One class was delivered at COLA outside Washington, D.C. Additional remote classes are being investigated. However, it appears that many remote university researchers have local resources for parallel programming assistance through their computer science departments or other collaborators, and instead rely heavily on SCD web documentation and examples for information on how to use and write scripts for the resource scheduling software running on each machine.

Code migration for the IBM SP complex and Compaq ES40 cluster

The SMP machines

Over the last year, SCD has devoted much effort to symmetric multiprocessor (SMP) clusters. Three such machines have been installed: two IBM SP clusters (blackforest and babyblue) and one Compaq ES40 cluster (prospect). While the individual nodes of such machines are independent SMP computers, a layer of system software allows multiple nodes of the cluster to behave as a scheduled resource that can be applied to a single job. SCD is currently using three different implementations of this system software layer: one from IBM called LoadLeveler(tm), one created at NASA Ames called PBS (Portable Batch System), and one written by Compaq called RMS (Resource Management System).

Since these clusters are, fundamentally, groupings of independent SMP compute nodes tied together with resource scheduling software, performance and utilization rely heavily on the application programmer's ability to extract and express parallelism efficiently. The nodes are independent computers, so parallelism across nodes is primarily implemented with industry-standard message-passing libraries such as MPI. However, for codes with more modest computational requirements that can be satisfied by a single node, a more efficient shared-memory programming model is available by adding OpenMP directives to the original serial source code. Finally, for the most computationally intensive applications, hybrid codes have been developed that bind large numbers of SMP nodes together using the MPI message-passing libraries, but use the more efficient OpenMP shared-memory directives to make full use of all the processors within each SMP node.
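The two-level structure of a hybrid code can be illustrated with a short sketch of the decomposition arithmetic alone. The Python below makes no MPI or OpenMP calls; it only shows how a fixed amount of work might first be divided among nodes (the role played by MPI processes) and then among the processors of each node (the role played by OpenMP threads). The function names block_range and hybrid_decomposition, and the sizes used, are hypothetical and chosen only for illustration.

```python
def block_range(n_items, n_parts, part):
    """Return the half-open range [start, stop) of items owned by `part`.

    Items are split as evenly as possible: the first `n_items % n_parts`
    parts each receive one extra item.
    """
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def hybrid_decomposition(n_items, n_nodes, threads_per_node):
    """Map every (node, thread) pair to its contiguous slice of the work.

    The outer split stands in for the distribution across message-passing
    processes (one per SMP node); the inner split stands in for the
    shared-memory threads running on the processors of that node.
    """
    layout = {}
    for node in range(n_nodes):
        node_start, node_stop = block_range(n_items, n_nodes, node)
        node_size = node_stop - node_start
        for thread in range(threads_per_node):
            t_start, t_stop = block_range(node_size, threads_per_node, thread)
            layout[(node, thread)] = (node_start + t_start, node_start + t_stop)
    return layout


if __name__ == "__main__":
    # Hypothetical sizes: 10 units of work, 2 nodes, 2 threads per node.
    for owner, span in sorted(hybrid_decomposition(10, 2, 2).items()):
        print(owner, span)
```

In a real hybrid application only the outer ranges would require explicit message passing between nodes; the inner ranges would share the node's memory and need no data exchange.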

The five-point migration plan

TCG has helped researchers migrate their codes to these new machines by targeting five distinct focus areas. The first four are community codes, large project codes, strategic codes, and researcher initiatives. The fifth is to undertake any efforts that are both simple to complete and produce significant code migration results. These five focus areas were chosen to maximize the number of researchers who can take advantage of the new machines in the minimum amount of time. This approach also has the benefit of quickly building up a large number of satisfied customers who can enhance community interest in the new machines. In particular, large project codes and community codes such as CCM, CCSM, POP, PCM, Mozart, TIGCM, MHD3D, MM5, and WRF account for a significant number of the computational cycles delivered by SCD, and helping our scientific community make these codes available to other researchers has been TCG's priority. Other codes, such as SEAM (which is used by a relatively small number of people), are viewed as strategic keys to future model development and thus are also considered high-priority. No distinction is made between NCAR and university researchers when TCG allocates resources for assistance.

The five stages of an application port

The migration efforts over the last year have focused on rewriting the applications for these new machines. This effort is approached in five stages.

First, the control scripts for the application must be modified to function with the resource scheduling software in use on the machine in question. Each of the resource schedulers implements its own macro-language and a different set of capabilities, so these control scripts are specific to each combination of resource scheduler and machine type.

Second, application source code compilation and correctness issues are addressed. While the Fortran and C standards are specific about intended use and constructs, they leave many grey areas as "implementation dependent"; additionally, many programmers apply programming idioms that work with one vendor's compiler but are not sufficiently generic to be portable to compilers from other vendors. Some of these errors can be caught and fixed at compile and link time, but others show no symptoms until the application is run. Much of the effort involved in porting a code to a new machine is spent addressing issues at this stage.

Third, there may be data issues to be addressed. Specifically, some applications must use data that came from a machine with a different data format, or compare their results against such data. These issues are quite common. For example, the Compaq Alpha processors and Intel processors order the bytes of floating-point data differently from the IBM or SGI processors, even though all use the same IEEE standard format; furthermore, the Cray does not use the IEEE floating-point format.

Fourth, we address optimization issues. Because the SMP nodes on these machines are built using microprocessor technology, the performance of the application is typically limited by the performance of the processor-to-memory interface. Optimization of these codes is focused on redesigning the applications' algorithms to make the most efficient use of the memory hierarchy.

Fifth and last, the issues of parallelism can be addressed. In all three parallel programming paradigms (MPI, OpenMP, and hybrid MPI/OpenMP), the key issue in developing parallel performance is to coarsen the granularity of parallelism while maintaining a balance between the work performed by each processor. This can be designed into a parallel model from the beginning, in which case the code parallelizes easily across a wide variety of machines. With legacy serial codes, however, it must be approached in an iterative fashion to maintain the correctness of the application.
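The byte-ordering issue raised in the third stage can be made concrete with a minimal sketch. The Python below packs a few IEEE 32-bit values in big-endian order (the order used by the IBM and SGI processors), shows that a little-endian reader such as an Alpha or Intel processor would misinterpret them, and recovers them by reversing each 4-byte group. The helper names swap_float32_bytes and read_float32 and the sample values are hypothetical. The sketch applies only to IEEE data; data in the Cray floating-point format requires a genuine format conversion, not just a byte swap.

```python
import struct


def swap_float32_bytes(raw):
    """Reverse the byte order of every 4-byte value in `raw`."""
    if len(raw) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    return b"".join(raw[i:i + 4][::-1] for i in range(0, len(raw), 4))


def read_float32(raw, byte_order):
    """Decode IEEE 32-bit floats; `byte_order` is '>' (big) or '<' (little)."""
    count = len(raw) // 4
    return list(struct.unpack(f"{byte_order}{count}f", raw))


if __name__ == "__main__":
    # Sample values chosen to be exactly representable in 32 bits.
    values = [1.0, -2.5, 0.15625]
    big_endian = struct.pack(">3f", *values)

    # Read with the wrong byte order: the numbers are not the originals.
    print("misread:  ", read_float32(big_endian, "<"))

    # Swap each 4-byte group first, then read as little-endian.
    print("recovered:", read_float32(swap_float32_bytes(big_endian), "<"))
```

Because both orderings hold the same IEEE bit patterns, the swap is lossless; the values recovered after swapping are bit-for-bit identical to the originals.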

Production computing support

Supporting the production computing environment, in keeping with TCG's core mission, has continued to be critical to the success of the Scientific Computing Division. Historically, SCD's best asset in the eyes of the research community has been customer support. TCG maintains the highest standards in customer responsiveness, as well as diligence in system test and checkout, to provide a stable and productive work environment for our customers. TCG recognizes the need to expand and improve customer outreach, collaboration, and individualized service.

In addition to AIX operating system support on the IBM systems and Compaq's Tru64 UNIX, SCD provides support for platforms running Irix, UNICOS, and Solaris. TCG's core focus this year has been assisting researchers in their migration efforts to the new architectures with an eye toward identifying re-engineering opportunities to improve the portability and scalability of the codes. User interest in these systems has been overwhelming and continues to grow.

TCG is also responsible for testing the user environment in cooperation with HPS to ensure that operating system and programming environment software upgrades have minimal impact on productivity. To this end, TCG was involved in every operating system and compiler environment software installation. As problems were uncovered, TCG developed strategies and diagnostic code for testing and isolating each problem. TCG has been increasingly taking the lead in characterizing problems for the vendors and following the bug fixes through the pipeline.

Documentation provided by the Technical Consulting Group

Developing web content has been a complementary project to TCG's code migration consulting and the training classes. Over the last year, TCG worked closely with SCD's Digital Information Group (DIG) to develop a significant body of web-based documentation and examples, ranging from introductory to advanced material and including FAQs for users of our new machines. The content has been driven from several different directions, including instructional material, customer questions, problems encountered by TCG, and structured assistance to new customers. The web sites supplement and take advantage of vendor documentation where appropriate and, as the scientific divisions reach readiness, will include links to their material as well. As with the training and consulting efforts, the documentation effort will continue over the next year, adding to this growing base of material.
