IBM SP-cluster systems fundamentals
Last update: 9/21/2009
This "critical concepts" section of the IBM SP-cluster systems fundamentals document provides the basic information new users need before they can start writing or modifying programs to run on IBM SP-cluster systems. It then helps you get started running interactive or batch jobs.
When you read "critical concepts" for the first time, do not follow any links. Instead, read "critical concepts" once to familiarize yourself with these concepts, then proceed to the next section, "How to compile, build, and run jobs," and study the examples in detail. As you study the examples, you will see lines of code that are explained by the concepts in this section. After you finish studying all of the examples, then return to this "critical concepts" section and follow the links to the details you need.
We recommend this approach because it is easy to become confused by the details of computing with this architecture. Your ability to begin productive work will be greatly enhanced by first familiarizing yourself with the critical concepts, then studying all of the examples, then studying the details relevant to your programming needs.
Then you can begin preparing your own programs to run on IBM SP-cluster systems. This is the most effective way for you to begin computing on IBM SP-cluster systems.
IBM SP-cluster systems are a collection of processors organized into nodes. Each node contains multiple processors. A processor (called a "CPU" in this document) is the logic circuitry that responds to instructions for controlling the computer. A node is a collection of CPUs that share access to memory (memory space); in general, a node is an entity that accesses a network or an addressable point on a network. IBM SP-cluster systems contain an internal network with hundreds of addressable nodes. This internal network is also called "the switch."
IBM SP-cluster systems can run user programs as a serial process, as parallel processes, or as a combination of both.
A process is an instance of a program running in a computer. The system kernel schedules execution of all processes and carries out the work they request (for example, storing information in memory, performing operations on data using a CPU, storing data on disk, and communicating with other systems).
A serial process is a program that executes instructions on a single CPU.
A thread is a piece of a process. It runs as a separate entity under the control of that single process, is tracked by that process, and returns its computational result to that process. Threads help your job run faster because several independent pieces of the same process run at once. Threads share the same memory space, so you must make sure that threads in the same process do not interfere with each other.
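The following minimal sketch illustrates the point about shared memory. It is written in Python purely for illustration; on IBM SP-cluster systems threads would normally be created with OpenMP directives or POSIX threads, and Python threads do not necessarily run on several CPUs at once. The function name `sum_with_threads` and its parameters are hypothetical. Each thread computes a private partial sum, then takes a lock before updating the one total that all threads share, so that the updates do not interfere with each other.

```python
import threading


def sum_with_threads(values, num_threads=4):
    """Sum `values` using several threads that update one shared total."""
    total = 0                  # shared by every thread in this process
    lock = threading.Lock()    # guards updates to the shared total

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)   # private to this thread, no lock needed
        with lock:             # only one thread updates the total at a time
            total += partial

    # Give each thread an interleaved slice of the input.
    threads = [
        threading.Thread(target=worker, args=(values[i::num_threads],))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()               # wait for every thread to finish
    return total


if __name__ == "__main__":
    print(sum_with_threads(list(range(1, 101))))
```

Without the lock, two threads could read and write the shared total at the same moment and lose an update; this is the kind of interference described above.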
Threads are run on multiple CPUs within a node using OpenMP directives or POSIX threads. OpenMP directives are used by most NCAR programmers who use threads. Note that "OpenMP" is sometimes called "OMP." More information about the OpenMP standard appears in this tutorial. More information about the POSIX standard appears in the tutorial and the Programming POSIX Threads website.
Note: Threads are a form of parallelism, and people may use the word "parallel" when referring to processes that use threads. This can cause confusion. CISL documentation always refers to threads as "threads" to avoid confusion with "parallel processes" as defined in the next paragraph.
Parallel processes are multiple coordinated independent programs that execute simultaneously on multiple CPUs to achieve a common goal. Parallel processes are controlled by the Parallel Operating Environment (POE) program provided on IBM SP-cluster systems. Parallel programming has three aspects:
- Using parallel threads on a node.
- Using message passing between processes. Message passing is a form of interprocess communication in which processes send discrete messages to one another to exchange data.
- Using both threads and message passing. Parallel programs that do this are called "hybrid."
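Message passing can be sketched as follows. This is an illustration only, assuming a plain Python interpreter rather than POE or any message-passing library used on IBM SP-cluster systems; the names `CHILD_SOURCE` and `send_and_receive` are hypothetical. Two separate processes share no memory: the first sends a discrete message (a line of numbers) to the second, which sends back one message containing their sum.

```python
import subprocess
import sys

# Source of the receiving process: read one message, reply with one message.
CHILD_SOURCE = """
import sys
numbers = [int(x) for x in sys.stdin.readline().split()]
print(sum(numbers))
"""


def send_and_receive(numbers):
    """Send `numbers` to a separate process and return the sum it sends back."""
    message = " ".join(str(n) for n in numbers) + "\n"
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],  # start the second process
        input=message,                         # message sent to that process
        capture_output=True,                   # collect the reply message
        text=True,
        check=True,                            # raise if the process fails
    )
    return int(result.stdout)


if __name__ == "__main__":
    print(send_and_receive([1, 2, 3, 4, 5]))
```

Unlike threads, the two processes here cannot interfere with each other's memory; all data they exchange travels in the messages.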
IBM SP-cluster systems are clusters of Symmetric Multi Processor (SMP) systems, a computer architecture that collects multiple CPUs into nodes. Multiple simultaneous processes can be run within a node, on multiple nodes, or both. The CPUs on a node share that node's memory and I/O bus (data path). Each node runs its own copy of the AIX operating system. Any idle CPU can be assigned to any task, and additional CPUs and nodes can be utilized by a job to improve performance and handle increased loads.
There are four basic programming strategies for computing on IBM SP-cluster systems. These four strategies allow you to match your program's requirements to the capabilities of IBM SP-cluster systems' computing architecture. All four strategies have the same goal: to obtain accurate results for computational problems in the minimum amount of wallclock time.
The four programming strategies, in order of increasing demands on the system and the programmer, are:
- One serial process
- One process that spawns multiple threads
- Multiple parallel processes that are single-threaded:
  - One code acting on multiple data structures (Single Program Multiple Data -- SPMD)
  - Multiple codes acting on multiple data structures (Multiple Programs Multiple Data -- MPMD)
- Hybrid: Multiple parallel processes (SPMD or MPMD) that use multiple threads
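The SPMD strategy in the list above can be sketched as follows. This is a hedged illustration in Python, not the way jobs are launched on IBM SP-cluster systems, where POE starts and coordinates the parallel processes. The names `WORKER_SOURCE`, `spmd_sum`, `rank`, and `size` are hypothetical. Several copies of one worker program are started; each copy receives a different rank and uses it to choose its own share of the data.

```python
import subprocess
import sys

# The single program that every process runs.
WORKER_SOURCE = """
import sys
rank, size, n = (int(arg) for arg in sys.argv[1:4])
# Every copy runs this same code; the rank selects which data it handles.
print(sum(range(rank + 1, n + 1, size)))
"""


def spmd_sum(n, size=4):
    """Sum the integers 1..n using `size` copies of one worker program."""
    # Start all copies first so that they run at the same time.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE, str(rank), str(size), str(n)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for rank in range(size)
    ]
    partials = []
    for proc in procs:
        out, _ = proc.communicate()   # wait for this copy and read its result
        if proc.returncode != 0:
            raise RuntimeError("worker process failed")
        partials.append(int(out))
    return sum(partials)              # combine the partial results


if __name__ == "__main__":
    print(spmd_sum(100))
```

An MPMD job would differ only in that the processes run different programs rather than copies of the same one.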
If you have questions about this document, please contact CISL Customer Support. You can also reach us by telephone 24 hours a day, seven days a week at 303-497-1200. Additional contact methods: email consult1@ucar.edu, or visit us during business hours in NCAR Mesa Lab Suite 42.
© Copyright 2002-2009. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.
Address of this page: http://www.cisl.ucar.edu/docs/bluefire/program.html