Last update: 05/14/2008
Computer users typically know a lot about exploiting the system resources of their own computer to compile, link, and execute programs. But unless they have experience on multiprocessor cluster systems, they will need to learn new techniques before they can efficiently and economically run parallel programs on these systems.
To exploit the power of cluster computers, parallel programs must direct multiple processors (CPUs) to simultaneously solve different parts of a computation. To be efficient, a parallel program must be designed for a specific system architecture. Users must tailor their programs to run on systems that differ in the number of CPUs connected by shared memory, the number of memory cache levels, how those caches are distributed between CPUs, and the characteristics of the communication mechanism for message passing.
Cluster system architecture begins with groups of CPUs (typically 4 to 32) that are organized into "nodes" where the CPUs on the node communicate via shared memory. Nodes are interconnected with a communication fabric that is organized as a network. Cluster systems have hundreds of nodes. Parallel programs use groups of CPUs on one or more nodes.
You also need to understand how to use each computer's system software and services that help you run your code on that platform. Your ability to work productively on these complex computing platforms is greatly enhanced by system-specific services such as compilers that offer multiple levels of optimization for your code, batch job schedulers, system resource managers for parallel jobs, and optimized libraries.
This brief discussion has introduced three basic aspects of parallel computing:
- System hardware architecture
- User code tailored to specific computer hardware
- System software and services that enable users to modify their code for a specific platform
To become a proficient user of these complex systems, you need to understand all three aspects of parallel computing and how they relate to each other.
There are two reasons why it is important for you to become a proficient user of these systems and make fair use of their resources. First, you preserve the computer time that you received with your GAU allocation. Second, when your code runs efficiently, your job turnaround time (wall-clock time) is reduced, and your job has less exposure to execution delays on these heavily subscribed computers. Reducing your job's wall-clock time makes more of your time available for other activities. Better use of the system also benefits the entire user community.
For complex parallel programs, modern cluster computers require two basic things from programmers. First, programmers need to design their codes for the system architecture. Then at runtime, programmers must control the software agent that manages task scheduling and system resources. You control this task scheduling and system resource configuration by setting environment parameters at runtime.
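As a minimal sketch of reading a runtime setting from the environment, the helper below looks up a requested thread count. It assumes the setting is the standard OpenMP variable OMP_NUM_THREADS; this page does not name specific variables, and each system's batch scheduler and resource manager define their own. The function name and default value are hypothetical, and Python is used only for brevity.

```python
import os


def thread_count_from_env(env=None, default=1):
    """Return the thread count requested via OMP_NUM_THREADS, or `default`.

    `env` is any mapping of variable names to strings; it defaults to the
    real process environment. Missing, non-numeric, or non-positive values
    (including comma-separated nested settings) fall back to `default`.
    """
    env = os.environ if env is None else env
    raw = env.get("OMP_NUM_THREADS", "")
    try:
        count = int(raw)
    except ValueError:
        return default
    return count if count > 0 else default
```

In a job script you would typically export the variable before launching the program, and the program (or its runtime library) would read it at startup.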
A parallel job consists of multiple streams of program instructions executing simultaneously. Another name for an instruction stream is "thread." A process may have one or more threads. All threads associated with a process must run on the same node because they communicate using the main memory they share on that node (shared memory). Multiple processes can run on one node or multiple nodes. When multiple processes that are part of one application run on multiple nodes, they must communicate via a network (message passing). Since transferring data over the network is much slower than fetching it from memory, programmers need to design their codes to minimize data transfers across the network.
Parallel jobs do not always involve multiple processes. A single process with multiple threads of execution is also considered a parallel job. Processes running threads can be combined with other processes (which may also be using threads). This is called a "hybrid" job.
Interprocess communication is often the main performance problem for jobs running on cluster systems. Library calls are available to help users perform efficient communication:
- Between processes on a node and/or
- Between processes running on different nodes
Cluster systems organize memory into a hierarchy of levels of proximity to the CPUs. Typically there is a large slow main memory with one or more levels of very small but very fast cache memory between the CPU and the main memory.
Users must design their programs to use the data cached closest to each CPU as much as possible before time-consuming data movement away from the CPU is required. Programmers must devote considerable effort to managing the flow of data into and out of the hierarchical memories as their jobs are running.
You can maximize CPU performance and the speed at which your code runs by designing your programs to use the differently sized memory caches efficiently. You can also modify your code so the compiler and its optimization routines can improve it for a specific architecture.
The following sections describe the basic concepts that will help you understand these issues.
Parallel programming is required for high performance computing
The production systems operated by SCD are clusters of symmetric multiprocessor (SMP) computers (nodes). These systems can run single processes on a single CPU. However, high performance computing on a cluster typically requires many CPUs to work in parallel so compute-intensive programs can run to completion in a reasonable amount of wall-clock time.
If you are not familiar with parallel programming, we recommend that you study an excellent introductory book: Designing and Building Parallel Programs by Ian Foster. A link is provided at the end of this page, but please keep reading until you get there.
Parallel programming paradigms
Currently, parallel programming paradigms involve two issues:
- Efficient use of the CPUs on one node by a single process
- Communication between nodes to support interdependent parallel processes running on different nodes and exchanging mutually dependent data
A parallel program usually consists of a set of processes that share data with each other by communicating through shared memory or over a network interconnect fabric.
Parallel programs that direct multiple CPUs to communicate with each other via shared memory typically use the OpenMP interface. The independent operations running on multiple CPUs within a node are called threads.
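To illustrate threads that share one process's memory, here is a minimal sketch that divides a loop among four threads. It uses Python's standard thread pool rather than OpenMP, and the function names are hypothetical. Note that the standard Python interpreter does not run such arithmetic on several CPUs at once, so this shows the structure of the technique, not its speedup; a production code would typically use OpenMP directives in Fortran, C, or C++.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, parts):
    """Split range(n) into `parts` contiguous [start, stop) pieces."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def partial_sum_of_squares(data, start, stop):
    """One of the smaller loops: works on data[start:stop] only."""
    total = 0
    for i in range(start, stop):
        total += data[i] * data[i]
    return total


def threaded_sum_of_squares(data, num_threads=4):
    """Run the smaller loops on separate threads and combine the results.

    All threads read the same `data` list directly (shared memory);
    nothing is copied or sent between them.
    """
    bounds = chunk_bounds(len(data), num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(partial_sum_of_squares, data, a, b)
                   for a, b in bounds]
        return sum(f.result() for f in futures)
```

Each thread writes only to its own local total, so no locking is needed; the partial results are combined after all threads finish.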
Parallel programs that direct CPUs on different nodes to share data must use message passing over the network. These programs use the Message Passing Interface (MPI).
Finally, programs that use carefully coded hybrid processes can be capable of very high performance with very high efficiency. These hybrid programs use both OpenMP and MPI.
Two parallel programming paradigms: threads and message passing
As stated above, there are currently two ways to achieve parallelism in computing. One is to use multiple CPUs on a node to execute parts of a process. For example, you can divide a loop into four smaller loops and run them simultaneously on separate CPUs.
This is called threading; each CPU processes a thread.
The other paradigm is to divide a computation into multiple processes. Because the processes work on parts of the same problem, each of them typically depends on data held by the others. This interdependence requires processes to pass messages to each other over a communication medium.
When processes on different nodes exchange data with each other, it is called message passing.
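The message passing pattern can be sketched as follows. This is only an illustration under stated assumptions: Python threads stand in for processes, a queue stands in for the network, and the function names are hypothetical. Each worker receives its own private copy of its data and communicates only by sending a message. On a real cluster this role is played by MPI calls such as MPI_Send and MPI_Recv.

```python
import queue
import threading


def worker(rank, chunk, outbox):
    """Compute a local result and send it as a (rank, value) message."""
    outbox.put((rank, sum(chunk)))


def run_message_passing(data, num_workers=3):
    """Distribute `data`, then collect one message from each worker."""
    inbox = queue.Queue()  # stands in for the interconnect
    size = (len(data) + num_workers - 1) // num_workers
    workers = []
    for rank in range(num_workers):
        # Each worker gets a private copy of its part of the data.
        chunk = list(data[rank * size:(rank + 1) * size])
        t = threading.Thread(target=worker, args=(rank, chunk, inbox))
        t.start()
        workers.append(t)
    # The coordinator receives exactly one message per worker.
    results = dict(inbox.get() for _ in range(num_workers))
    for t in workers:
        t.join()
    return sum(results.values()), results
```

Only one small message per worker crosses the "network" here; the bulk of the data stays local, which is the design goal when communication is slow.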
Programmers must explicitly define parallel threads. Programmers must also direct the process intercommunication for message passing. They do both of these by determining where and how to insert the system-supplied library function calls in their code.
For threading or message passing to be effective, programmers must place these function calls in the code in logically appropriate locations. These locations must be determined not only by code function but also by considering CPU restrictions (such as CPU-cache coherence restrictions specific to the system architecture).
Parallel programs are called hybrid when they use both threads and message passing.
Both the threading and message passing paradigms divide up the data domain (this is called domain decomposition). Threads use the OpenMP interface within a node, and message passing uses the MPI interface for multiple nodes. The programmer decomposes the data domain into parts that individual CPUs can manage. Then each CPU that processes its domain will need to communicate its computational results to its neighboring domains.
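A small sketch of domain decomposition with neighbor communication is shown below. It assumes a one-dimensional domain and a simple three-point averaging step, neither of which comes from this page, and it runs serially so the data movement is easy to follow. Each subdomain obtains one boundary value (often called a halo or ghost value) from each neighbor before computing. All names are hypothetical.

```python
def smooth_global(u):
    """One 3-point averaging step on the whole domain; end points are fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def decompose(u, parts):
    """Split u into `parts` contiguous subdomains (requires parts <= len(u))."""
    base, extra = divmod(len(u), parts)
    pieces, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        pieces.append(u[start:stop])
        start = stop
    return pieces


def smooth_decomposed(u, parts):
    """Same step as smooth_global, computed subdomain by subdomain."""
    pieces = decompose(u, parts)
    result = []
    for p, piece in enumerate(pieces):
        # Halo exchange: fetch one boundary value from each neighbor.
        left = [pieces[p - 1][-1]] if p > 0 else []
        right = [pieces[p + 1][0]] if p < parts - 1 else []
        padded = left + piece + right
        offset = len(left)
        for j in range(len(piece)):
            k = j + offset
            if k == 0 or k == len(padded) - 1:
                result.append(padded[k])  # global end point: unchanged
            else:
                result.append((padded[k - 1] + padded[k] + padded[k + 1]) / 3.0)
    return result
```

The decomposed version reproduces the global result exactly, and only the boundary values need to move between subdomains, not the subdomain interiors.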
When you develop a new code for a cluster system, you can reduce your troubleshooting and debugging effort by starting with a single process that runs on a single CPU. Then time your runs as you adjust your code for increased efficiency. This will allow you to focus on basic performance issues. When it is running efficiently, you can divide loops into parts that can run simultaneously. When your multithreaded code runs properly, you can add processes and implement message passing. The intent of this approach is to deal with the inevitable difficulties step by step as the complexity of your code increases.
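Timing each version of your code as you go can be as simple as the sketch below. It is a generic wall-clock timer, not a tool provided on these systems; the helper name and the choice of keeping the best of several runs are assumptions made for illustration.

```python
import time


def time_call(func, *args, repeats=3):
    """Return (best wall-clock seconds, result) over `repeats` runs of func(*args).

    Taking the best of several runs reduces noise from other activity
    on a shared system.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result
```

Record the time for the single-CPU version first; it becomes the baseline against which you judge the threaded and message passing versions.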
Hierarchical memory management and caching
Commodity memory systems consist of main memory and one or more levels of memory caches. The caches are provided to minimize the time delay for data being moved from memory to the CPU. At each level closer to the CPU, access becomes faster (the time between the fetch instruction and the processing of the data decreases), but the capacity for holding the data decreases because each cache closer to the CPU is smaller than the previous cache.
Programmers must maximize the use of data in the level 1 cache (closest to the CPU), where it can be processed most efficiently. The second (and often third) cache levels and the main memory hold increasingly more data but with increasingly more delay before it can be processed.
Efficient management of data flowing through memory caches into and out of the CPU can produce a significant improvement in code performance. Therefore, programmers need to devote serious time and consideration to the logic of moving the datastream through the caches to the CPUs and back.
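One common way to keep data in the nearest cache is to process it in small blocks (often called loop tiling or cache blocking). The sketch below applies the idea to a matrix transpose. It is an illustration of the access pattern only: the function names and block size are hypothetical, and Python lists are not laid out like Fortran or C arrays, so the speed benefit would appear in a compiled version, not here.

```python
def transpose_naive(a):
    """Transpose row by row; writes to `out` stride across many rows."""
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for i in range(n):
        for j in range(m):
            out[j][i] = a[i][j]
    return out


def transpose_blocked(a, block=2):
    """Transpose one block-by-block tile at a time.

    Each tile touches only a small region of `a` and `out`, so in a
    compiled language that region can stay in cache while it is reused.
    """
    n, m = len(a), len(a[0])
    out = [[0] * n for _ in range(m)]
    for ii in range(0, n, block):
        for jj in range(0, m, block):
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, m)):
                    out[j][i] = a[i][j]
    return out
```

Both functions produce the same result; only the order in which memory is visited differs. In practice the block size is chosen so that one tile of input and one tile of output fit together in the cache level you are targeting.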
All of the SCD supercomputer systems support the Fortran, C, and C++ computer languages. Since there are no industry-standard names for compiler commands or for their options, the precise format of the command used to compile programs in each of these languages will differ from computer to computer. Consult the software documentation* for the specific computer you will use to find the exact format of the command you need to compile your programs on the system where you intend to run.
*Note: Compiler documentation for the supercomputers currently in use at NCAR is included in these documents:
Your account on each computer is provided with default environment files (dotfiles) and shell scripts. These are often shown in the documentation for that system. These contain the correct paths for using that system's compiler.
The following sections discuss some of the advantages of the languages commonly used on SCD computers. However, you may not have a choice because the code you are using has already been written.
Fortran, multiple versions
Development of Fortran began in 1954 at IBM under a team of engineers led by John Backus, and the first compiler was released in 1957. Fortran has been the most commonly used scientific computing language since its release. The vast majority of scientific codes at NCAR are written in descendants of the original Fortran language. The name "Fortran" is derived from "Formula Translation," a reference to the language's intended purpose of efficiently translating algebraic formulas to run on computers.
All of the supercomputer systems at NCAR support the Fortran 77 and Fortran 90 languages. Compared to earlier versions of the language, the Fortran 90 standard adds advanced data structuring facilities and dynamic memory management, as well as features designed to facilitate modular programming. We encourage you to use these capabilities in new code to make your programs more portable and more easily maintainable.
C
The C language was developed in the early 1970s by Dennis Ritchie at Bell Telephone Laboratories. Each of the NCAR supercomputers runs a variant of the Unix operating system, which is written in C.
Since some operating system interface functions may only be available to C language routines on some of the systems, it is occasionally useful for scientific programs to call routines written in C to access various system-level functions. Some scientific codes developed on Unix-based systems may also be written entirely in C, although it typically offers slightly lower performance than Fortran for numerically intensive codes.
C offers a wider variety of standard data types and more advanced data structuring facilities than older (pre-Fortran 90) versions of the Fortran language. As a result, older Fortran programs may incorporate subprograms written in the C language to perform manipulation of complex data objects. The data structuring facilities in Fortran 90 have largely eliminated the need to do this. All of the supercomputers at NCAR support the ANSI 1989 C language standard.
C++
C++ is an "object-oriented" language developed from C in the early 1980s by Bjarne Stroustrup at Bell Telephone Laboratories. A discussion of the differences between the "object-oriented" programming philosophy embodied in C++ and the "procedure-oriented" programming philosophy exemplified by Fortran and C is beyond the scope of this introduction.
In general, the programming style encouraged by C++ makes efficient implementation of complex numeric codes somewhat difficult. It does, however, encourage the development of extremely modular and easily maintainable programs, and it can simplify the implementation of complex user interface code and some graphic manipulation codes. For this reason, the language is increasingly found in commercial applications, particularly when the cost and effort involved in maintaining those applications is a more important consideration than attaining the highest possible performance.
Designing and Building Parallel Programs by Ian Foster is an excellent web-based introduction to parallel programming.
If you have questions about this document, please contact us via any of the methods (phone, email, ticket, or in person) described here: CISL Customer Support.
© Copyright 2003-2008. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.
Address of this page: http://www.scd.ucar.edu/docs/access/prog.html