by Juli Rew
If you write parallel programs for the IBM SP, a number of issues can
arise from the use of multiple CPUs and multiple nodes. One of these
is sharing: your job may need multiple nodes that communicate across a
network. Whether you share these resources with other users, and which
ones you share, may affect the performance of your job.
If you are using OpenMP, you may have been advised that it is best not
to share a node. If you're programming with MPI, you've probably been
advised that you want to "share the network." What does sharing mean
in these contexts? Are these guidelines always true? How do you
specify them in your jobs on the SP? Does sharing one resource
conflict with not sharing another?
Sharing nodes
On blackforest, each node (currently Winterhawk II nodes) has four
processors, so an OpenMP job will be most efficient when it is
sufficiently parallel to run on all four processors (i.e., by setting
the OMP_NUM_THREADS environment variable to 4). However, if your job
requires only two threads, the default is to have you share the node
with another job that requires only one or two threads. This may
hurt the performance of your code if you need all the memory on the
node or make heavy use of the memory subsystem. So even if you have
fewer than four threads, you may still want the whole node to
yourself. You will need to decide which type of node usage works best
for your job.
Note that the syntax differs between LoadLeveler (batch) jobs and
interactive jobs. Batch jobs use the LoadLeveler keyword
node_usage. Interactive jobs should use the MP_CPU_USE
environment variable to indicate whether or not you wish to share the
node (see Table 1).
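As a minimal sketch of the two forms, the script below writes a batch job command file for a two-thread OpenMP run that keeps its node unshared, then sets the interactive equivalent. The file name omp_job.cmd and the program name my_openmp_program are hypothetical, and the job_type, node, and tasks_per_node values are assumptions about a typical single-node job rather than site requirements.

```shell
#!/bin/sh
# Sketch only: the job file name and program name are hypothetical.

# Batch: write a LoadLeveler job command file that requests an unshared node.
cat > omp_job.cmd <<'EOF'
#!/bin/sh
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 1
#@ node_usage = not_shared
#@ output = omp_job.out
#@ error = omp_job.err
#@ queue

# Two OpenMP threads, with no other job placed on the node.
export OMP_NUM_THREADS=2
./my_openmp_program
EOF
# Submit it with: llsubmit omp_job.cmd

# Interactive: make the same request through the environment.
export OMP_NUM_THREADS=2
export MP_CPU_USE=unique   # unique = keep the node to yourself; multiple = share it
```

Setting node_usage = shared (or MP_CPU_USE=multiple) instead returns to the default behavior described above.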
Sharing the network
In large MPI jobs, tasks may need to communicate with each other
across nodes. If messages cross a node boundary, they go via a
communications switch (denoted by the abbreviation csss). If
combined with node sharing, sharing the switch means that both the
switch and the node can be shared between your tasks and other users'
tasks.
If combined with node_usage = not_shared, only your program's
tasks have access to the node's CPUs, but other programs' tasks can
share the switch. In most cases, sharing is desirable, and is the
default [1]. If you don't share the network, but have allowed
node_usage to be shared, the scheduler drains the work on the nodes
because it thinks other users running on the nodes need to share the
switch. This drain blocks the queue and effectively takes the nodes
assigned to that queue out of the system.
The default communication is IP over Ethernet, so it is usually
beneficial to specify us (user space protocol), which is
optimized for the SP switch.
Again, the syntax differs between LoadLeveler and interactive jobs.
Batch jobs use the LoadLeveler keyword network.MPI. Interactive jobs
should use the MP_ADAPTER_USE environment variable to indicate whether
or not you wish to share the network, as well as specifying
MP_EUIDEVICE=csss and MP_EUILIB=us (see Table 1).
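The following sketch shows both forms for an MPI job that shares the switch and uses the user space protocol. The file name mpi_job.cmd, the program name my_mpi_program, and the node and task counts are hypothetical. The interactive part uses the MP_-prefixed variable names and the css0 device value as listed in Table 1; the text above mentions csss, so check which adapter name applies on your system.

```shell
#!/bin/sh
# Sketch only: the job file name, program name, and node counts are hypothetical.

# Batch: share the switch, use the user space protocol, keep the nodes unshared.
cat > mpi_job.cmd <<'EOF'
#!/bin/sh
#@ job_type = parallel
#@ node = 2
#@ tasks_per_node = 4
#@ node_usage = not_shared
#@ network.MPI = csss,shared,us
#@ output = mpi_job.out
#@ error = mpi_job.err
#@ queue

poe ./my_mpi_program
EOF
# Submit it with: llsubmit mpi_job.cmd

# Interactive: set these three together before starting the program.
export MP_EUIDEVICE=css0      # switch adapter, value as given in Table 1
export MP_ADAPTER_USE=shared  # shared or dedicated
export MP_EUILIB=us           # user space protocol instead of IP
```

Specifying shared explicitly in network.MPI avoids depending on the default value.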
Sharing the memory
Wait! There's yet a third form of sharing that you can specify on the
SP. If you are running an MPI job on one node, you can set the
environment variable MP_SHARED_MEMORY=yes to keep tasks from
communicating unnecessarily through the switch, since their messages
never leave the node. You can use this variable effectively even when
some of the tasks communicate off-node, because it reduces port
congestion and allows better performance of the inter-node
communication.
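A minimal sketch of turning this on for a four-task run on a single node follows. The task count and the program name my_mpi_program are hypothetical, and the launch command is left as a comment because it depends on your session.

```shell
#!/bin/sh
# Sketch only: the task count and program name are hypothetical.

# Let tasks on the same node exchange messages through shared memory.
export MP_SHARED_MEMORY=yes

# Four tasks, one per processor on a four-processor node.
export MP_PROCS=4

# Then launch as usual, for example: poe ./my_mpi_program
echo "MP_SHARED_MEMORY=$MP_SHARED_MEMORY MP_PROCS=$MP_PROCS"
```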
Table 1 lists the LoadLeveler keywords and environment variables for
sharing nodes, networks, and memory. Note that LoadLeveler keywords
override environment variables in batch jobs.
Table 1. Sharing nodes and network

| LoadLeveler keyword | Environment variable |
|---|---|
| #@node_usage = shared or not_shared (default is shared) | MP_CPU_USE = unique or multiple (interactive only) |
| #@network.MPI = csss,shared,us | Set these three together: MP_EUIDEVICE=css0; MP_ADAPTER_USE = shared or dedicated (interactive only); MP_EUILIB=us |
| (none) | MP_SHARED_MEMORY=yes |
Using Totalview: Share the node and switch
The Totalview debugger is a useful tool for debugging parallel
programs on the SP. Because you run Totalview interactively, you must
set the MP_ADAPTER_USE environment variable to shared and MP_CPU_USE
to multiple.
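A sketch of the environment for such a debugging session is shown below. The two settings come from the text above; the launch line in the comment and the program name my_mpi_program are assumptions about a typical invocation, so check your local Totalview documentation for the exact form.

```shell
#!/bin/sh
# Sketch only: settings for an interactive Totalview session.

export MP_ADAPTER_USE=shared   # share the switch
export MP_CPU_USE=multiple     # share the node

# Assumed typical launch (hypothetical program name):
#   totalview poe -a ./my_mpi_program
echo "MP_ADAPTER_USE=$MP_ADAPTER_USE MP_CPU_USE=$MP_CPU_USE"
```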
[1] Although shared is the default value for network.MPI, you
should specify it explicitly, because under some combinations of
options, it may be set to not_shared.