ABSTRACT
A new generation of software libraries and algorithms
is needed for the effective and reliable use of (wide-area) dynamic,
distributed, and parallel environments. Some of the software and algorithm
challenges, such as management of communication and memory hierarchies
through a combination of compile-time and run-time techniques, have already
been encountered, but the increased scale of computation, depth of memory
hierarchies, range of latencies, and run-time environment variability
will make these problems much harder.
Along these lines, we will discuss work on the
development of parameterizable and annotatable software libraries in the linear
algebra area that will permit performance tuning for a broad range of
architectures, including grid computing environments. Self Adapting Numerical
Software (SANS) is a software effort that will automatically generate highly
optimized numerical kernels for our high-performance computers.
In addition, we will describe an implementation of
MPI that extends the message-passing model to allow for recovery in the
presence of a faulty process. Our implementation allows a user to catch the
fault and then carry out a recovery.
We will also touch on the issues related to using
diskless checkpointing to allow for effective recovery of an application in the
presence of a process fault.