Future operating system direction for Cray customers

ABSTRACT: This document describes the work that will be done to bring UNICOS functionality and features into the Cellular IRIX operating system that will be used on the future SN-1 computer system. The direction, plans, and features necessary to migrate Cray customers using the CRAY T3E system are discussed, as is the overall schedule of the first Cellular IRIX deliverable.

Don Mason

Introduction

The original, primary goal of the Scalable Node (SN) project was to provide a way to combine the PVP and the MPP architectures into one product line. This work was necessary to reduce both the cost of Cray systems to customers and the internal cost of supporting two different product lines.

The acquisition of Cray Research by Silicon Graphics, Inc., did not change the basic goals of the SN project, although it did add another SN product. Called the Origin2000, this product was referred to as the SN-0 system internally, and was originally slated by SGI as the next step in their high-performance computing product line. The next step in this architecture will be the SN-1 system, which will be the follow-on architecture for the CRAY T3E system. The SN-2 system will support the PVP functionality found on the CRAY C90 and T90 systems.

When the SN product directions were combined, it was logical not only to make the Origin2000 the first of the new SN product line, but also to provide a Cray-branded version of the machine to traditional Cray customers. The Origin2000 machine was too far along in its development to allow for any Cray-specific feature work, but Cray will provide the infrastructure needed to deliver and support the Cray-branded machine.

Future SN-1 and SN-2 systems will include Cray-developed features that provide the functionality needed by Cray's traditional high-performance computing customer base. The plan is to add this functionality so that the SN-1 system is the CRAY T3E migration path, while the SN-2 system provides the functionality and features that will allow Cray customers with CRAY T90 and J90 systems to migrate.

The rest of this paper describes the Cellular IRIX operating system that will be used on the first Cray SN platforms. A preliminary outline of the schedule for this system is also included.


Migration plans

The SN-1 system will use the Cellular IRIX operating system. Cellular IRIX is the distributed version of the SGI IRIX operating system, and provides distribution, single-system image, and fault tolerance. It will be enhanced to provide CRAY T3E functionality for Cray customers.

For CRAY T3E customers, the following three major migration areas are being addressed:

  • Supporting the migration of customer applications as seamlessly as possible

  • Allowing the retention of customers' data

  • Providing the operational environment needed by the Cray customer base

To allow customers to migrate their current applications to the SN-1 as seamlessly as possible, work is being done so that codes need only be recompiled. This means that the Cellular IRIX operating system must support a message passing programming model that includes some UNICOS-specific interfaces, and the ability to launch, schedule, and manage a distributed application.

A second major requirement is that customers' data be retained. This requirement will be addressed with filesystem work and device support. Finally, CRAY T3E operational requirements such as accounting and checkpoint/restart are identified as needed features.

Teams of people from Eagan and Mountain View are defining the features needed to support this work, along with the detailed feature plans and implementation schedules.

The rest of this document focuses on the operating system work that is needed; programming environment work is not addressed here.


Cellular IRIX overview

Cellular IRIX is an evolution of the IRIX operating system. This evolution implies an effort similar to the one used to migrate monolithic UNICOS to the distributed UNICOS operating system. That is, most of the basic user interfaces remain intact, source code is reused as much as possible, and the underlying architecture is modified to support a distributed environment.

The base IRIX system is a monolithic SMP UNIX system based on System V. It includes many SGI extensions for fine-grained multithreading, graphics, parallel programming, high-performance I/O, and real-time. Like UNICOS, IRIX is X/Open XPG4 branded.

The Cellular IRIX system adds infrastructure to the monolithic IRIX kernel so that multiple copies of IRIX can cooperate to provide a single-system image. The machine on which Cellular IRIX runs is divided up so that each kernel runs on a subset of the machine's physical processors. Each of these partitions of the system is called a cell. Dividing the system into cells provides the following benefits:

  • It limits the multithreading contention for each kernel, since it only manages a subset of the total number of processors on the system.

  • Memory locality is increased, since each kernel is loaded into memory in close proximity with the processors on which it runs.

  • A cell is a natural unit of fault tolerance, since the running of that cell is logically separated from other cells by a known interface and hardware firewalls.

The division of the system into cells also introduces an extra level of complexity, since the activities of the cells must be coordinated to provide a single-system image. Cellular IRIX provides the infrastructure to accomplish this.

The key distribution infrastructure elements in Cellular IRIX are as follows:

  • The virtualization of kernel subsystems. This is the abstracting of the interface of kernel subsystems so that the functionality of those subsystems can be distributed. The virtual interface of a subsystem becomes, in effect, a "client" interface for accessing that subsystem.

  • A distribution layer is interposed between the client interface to a subsystem and the implementation of that subsystem. This layer provides support for locating remote services and transporting requests to those services.

  • The distribution layer also provides the additional opportunity to cache information about the subsystem in the clients of that subsystem. Cellular IRIX provides a token scheme for managing the coherency of data cached by clients. In the best case, this allows a request to be handled without a remote invocation, since the request can operate purely on locally-cached data.

The division of the architecture into disjoint functional and distribution layers allows Cellular IRIX to provide a common set of distribution services that can be used for the distribution of all subsystems. Additionally, the complexity of distribution is hidden from both clients and subsystems.
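
As an illustration of the token scheme described above, the following minimal sketch shows a client-side distribution layer that serves reads from a local cache while it holds a token, and a server that revokes tokens before a write. The class and method names (SubsystemServer, DistributionLayer, revoke) are hypothetical and are not taken from Cellular IRIX; the actual token protocol is not described in this paper.

```python
class SubsystemServer:
    """Owns the authoritative data for one subsystem object (hypothetical)."""

    def __init__(self, value):
        self.value = value
        self.token_holders = set()   # clients currently allowed to cache
        self.remote_calls = 0        # counts requests that reached the server

    def read(self, client):
        self.remote_calls += 1
        self.token_holders.add(client)   # grant a read token with the data
        return self.value

    def write(self, value):
        self.remote_calls += 1
        # Revoke outstanding tokens so no client keeps serving stale data.
        for client in self.token_holders:
            client.revoke()
        self.token_holders.clear()
        self.value = value


class DistributionLayer:
    """Client-side stub: serves from its cache while it holds a token."""

    def __init__(self, server):
        self.server = server
        self.has_token = False
        self.cached = None

    def read(self):
        if not self.has_token:           # no token: remote invocation needed
            self.cached = self.server.read(self)
            self.has_token = True
        return self.cached               # token held: purely local

    def write(self, value):
        self.server.write(value)

    def revoke(self):
        self.has_token = False
        self.cached = None
```

In this sketch, a second read by the same client does not increment remote_calls, which corresponds to the best case in which a request operates purely on locally cached data.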

One added benefit of the architecture of Cellular IRIX is that it resolves down to an SMP system in the simplest case. In fact, the distribution layer is not even built into the kernel for SMP systems. This is possible since neither the client nor subsystem depends on the distribution layer to implement functionality. Additionally, for multicell systems, the distribution layer is only inserted when an object is accessed by a client from a remote cell.

As a simple picture of the distribution of a function, the following figure shows how the kill() system call for a given process is distributed in a two-cell system. To the upper layers of the system, the kill() path appears the same. However, a distribution layer is interposed between the physical manager of the object (in this case, the cell on which the target process resides) and the reference to that object. If the reference to the object occurs on the same cell that manages that object, then the distribution layer can pass along the reference with minimal overhead.

[Figure to come]
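
The two-cell kill() path can also be sketched in code. The following is a toy model only: the Cell class, the owner_of lookup table, and the connect() helper are assumptions made for illustration and do not reflect the actual kernel interfaces.

```python
class Cell:
    """One kernel instance that manages a subset of processes (hypothetical)."""

    def __init__(self, cell_id):
        self.cell_id = cell_id
        self.processes = {}       # pid -> list of signals delivered so far
        self.peers = {}           # cell_id -> Cell, filled in by connect()
        self.remote_requests = 0  # requests shipped to another cell

    def spawn(self, pid):
        self.processes[pid] = []

    def deliver(self, pid, sig):
        """Physical layer: post the signal to a process this cell manages."""
        self.processes[pid].append(sig)

    def kill(self, pid, sig, owner_of):
        """Virtual interface: the same call whether the target is local or remote."""
        owner = owner_of[pid]
        if owner == self.cell_id:
            self.deliver(pid, sig)             # local reference: direct call
        else:
            self.remote_requests += 1          # remote reference: ship to owner
            self.peers[owner].deliver(pid, sig)


def connect(*cells):
    """Make every cell aware of every other cell."""
    for c in cells:
        c.peers = {p.cell_id: p for p in cells if p is not c}
```

A caller on cell 0 uses the same kill() call for a process on cell 0 and for a process on cell 1; only the second case goes through the remote path.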


Operating system criteria

The key operating system criteria for SN-1 are as follows:

  • Support traditional Cray features and capabilities for a smooth transition of CRAY T3E customers, and continue to support IRIX features for a smooth transition of Origin2000 customers.

    Many features that Cray has implemented exist in IRIX, such as kernel multithreading, high-performance filesystems, async I/O, Multi-Level Security (MLS), accounting, political scheduling, checkpoint/restart, and POSIX and X/Open compliance.

    Most of these will need additional work to provide the functionality needed to meet the expectations of Cray customers. Additional work will also be necessary to support Data Migration Facility (DMF), User Data Base (UDB), tapes, and parallel programming.

  • Provide scalability to thousands of processors.

    The scalability of Cellular IRIX is provided by:

    • The ability to run multiple instances of the kernel cooperating to provide a single-system image. This reduces contention in kernel multithreading and provides memory locality.

    • Caching of data in clients, which limits the need for remote operation invocation.

    • Fine-grained kernel multithreading within a cell provides traditional SMP scaling.

    Together, these facilities give Cellular IRIX a well-structured and flexible methodology for constructing a scalable high-end operating system.

  • Provide resiliency.

    Cellular IRIX is designed with resiliency as a primary goal and much thought has gone into resiliency at many levels. As previously noted, a cell is a unit of resiliency as well as distribution. Cells will have disjoint memory spaces, which will ultimately stop a failure in one cell from corrupting other cells.

  • Provide support for a single-system image.

    The Cellular IRIX system is designed to provide a single-system image. From the point of view of user processes, multiple cells will operate as a single monolithic IRIX system.

  • Provide binary compatibility with the IRIX operating system.

    Because Cellular IRIX is primarily the IRIX source code, binary compatibility will be maintained.

  • Maximize OS development leverage between Eagan and Mountain View.

    Cellular IRIX will be the single operating system for the entire SGI MIPS-based product line and thus will provide this leverage.


CRAY T3E transition

A primary goal for the high-end SN-1 systems is that they provide a migration path for existing CRAY T3E customers.

A key goal in this transition is to provide CRAY T3E API compatibility for applications. CRAY T3E applications should be able to be recompiled and executed with both compatible functionality and a performance profile appropriate for a CRAY T3E follow-on system.

Many parts of the UNICOS API will be provided, but some of the machine-dependent interfaces may not be ported if they are not critical to the porting of CRAY T3E applications.


Scalability development

The following subsections address the development efforts needed to support scalability.


Programming environment

The key programming models for CRAY T3E systems are message passing (MPI, PVM) and data passing (SHMEM). Their importance for SN-1 is emphasized by the fact that support for message and data passing is critical to the successful transition of existing CRAY T3E customers. Customers cannot be required to make extensive modifications to their codes.

The initial release of Cellular IRIX will not allow processes to span cells. The CRAY T3E programming models, which use multiple processes, will thus be a primary, if not the preferred, method for writing applications that need to run on more than one cell.


I/O and filesystems

The primary filesystem for SN-1 will be the SGI XFS filesystem. This filesystem provides support for high-performance I/O through contiguous placement of data blocks, localization of data with associated metadata, disk striping, and user-directed preallocation. It also provides support for quick filesystem recovery through metadata journaling.

Since existing CRAY T3E customers only have the NC1 filesystem, a means must be provided to preserve their data for use under SN-1/Cellular IRIX. A utility to access archived NC1 data is the most likely approach that will be provided.

In addition to addressing the transition of on-disk data, the functionality that Cray customers have come to expect from NC1 filesystems needs to be provided by XFS. Some of the behaviors currently identified include the following:

  • cbits/cbytes
  • Primary and secondary partitions


Joint Cellular IRIX core development

Currently, Cray personnel are working with the core Cellular IRIX development team in Mountain View. The goals of this joint effort are to bring the knowledge and experience from the UNICOS/mk project to bear on the SN effort and to look for ways to implement high levels of scalability.

This work includes the refinement of the distribution architecture as well as the distribution of some basic subsystems. Other areas for development by Cray personnel will be identified in the future.


Cray-specific features and capabilities

The following subsections describe some of the Cray-specific feature work that will be done for Cellular IRIX.


Checkpoint/restart

The checkpoint/restart functionality planned for IRIX/Cellular IRIX is very similar in scope to the functionality being implemented for CRAY T3E. The IRIX implementation currently being used by customers is known as CPR. Areas of concern for supercomputing support include process sizes, the transparency of checkpoint/restart, and cross-release checkpoint/restart.

The scale of applications that can be recovered needs to be addressed. In particular, the mechanism for validation of file contents relies on file modification time to detect changes of contents. This is not foolproof, however, so a UNICOS file signature mechanism will need to be added to the IRIX implementation. In IRIX, only the file name is stored. File data can be copied if the user specifies that it should be, or if the file is unlinked.
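
The weakness of a modification-time check can be shown with a short sketch. The UNICOS file signature mechanism is not described in this paper, so the example below substitutes a SHA-256 content hash as an illustrative stand-in; the function names are hypothetical.

```python
import hashlib
import os


def snapshot(path):
    """Record what a checkpoint might store about an open file."""
    st = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {"mtime_ns": st.st_mtime_ns, "signature": digest}


def unchanged_by_mtime(path, snap):
    """Validation by modification time only."""
    return os.stat(path).st_mtime_ns == snap["mtime_ns"]


def unchanged_by_signature(path, snap):
    """Validation by content signature (SHA-256 is an assumed stand-in)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == snap["signature"]
```

If a file is rewritten after the checkpoint and its modification time is then reset (for example, with os.utime), the modification-time check still reports the file as unchanged, while the signature check detects the difference.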

SN-2 must also have a checkpoint/restart facility that meets the needs of Cray PVP customers. UNICOS has a mature checkpoint/restart facility (more than eight years of field experience). Considerable sophistication has been added to checkpoint/restart to make it more useful for Cray customers.

In some key respects, the plans for IRIX/Cellular IRIX resemble the early versions of checkpoint/restart in UNICOS. Actual experience with Cray customers will be used to ensure that an attractive transition is provided for PVP customers to SN-2.

Some areas of concern include the following:

  • The ability to handle processes that are not in interruptible sleeps in the kernel.

  • The transparency of checkpoint and restart to the target process, i.e., no setup is necessary to be checkpointed and no reconstruction is necessary on restart.

  • New releases of system software should not invalidate existing checkpoint images.

From a functionality perspective, the Cellular IRIX checkpoint/restart mechanism will transition relatively easily from the IRIX version because it depends on the /proc filesystem. Attention will be given to ensuring that its performance also transitions well to the very large high-end systems.


MLS security

The IRIX MLS implementation has the following in common with the UNICOS implementation:

  • B1 evaluated

  • Mandatory access control (MAC)

  • Discretionary access control (DAC) and access control lists (ACLs)

  • Security auditing

  • Identification/Authentication mechanisms

  • Fine-grained privilege support

  • Secure networking with the IP security option

Some key differences between the systems include the following:

  • UNICOS provides a more extensible mechanism for assigning privileges to administrators.

  • API incompatibilities: the IRIX API will be the primary API. Cray extensions will be added, as necessary.

  • MAC labels are significantly different in format and scope (this leads to some of the API differences).

  • IRIX provides an integrity policy.

  • IRIX does not have a MAC revocation mechanism.

While there are some fundamental differences between the IRIX and UNICOS approaches, they are both targeted at providing the same capabilities. Both Eagan and Mountain View can benefit from pooling their experience in these areas. Moreover, this is not an area that is expected to impact users significantly, especially application developers. The greatest area of impact will be in the area of administration.


Accounting

Both IRIX and UNICOS have extensions to traditional UNIX accounting. However, the UNICOS emphasis on large installation support has led to a more extensive set of requirements. In order to support the accounting capabilities that Cray customers expect, the following areas will be addressed:

  • UNICOS accounts for a significantly larger number of activities than IRIX does, so many of these will need to be added to IRIX.

  • IRIX project accounting provides largely the same capabilities as UNICOS account IDs, but there will be a need to expand the scope of account IDs outside of the domain of accounting.

  • Tools will be added to IRIX to increase the usability of accounting information to end users.

  • Subsystem, in particular network, accounting capabilities will be added to IRIX.

Overall, the IRIX accounting system provides a good basis for the capabilities that Cray customers expect.


Commands

The basic command set for both UNICOS and IRIX is based on the System V commands. The plan is to use the IRIX command set as the base and add Cray functionality as needed. The IRIX command set is being used as a base for the following reasons:

  • The IRIX commands are based on a more up-to-date version of the System V commands than UNICOS.

  • The IRIX commands should be able to run on both 32- and 64-bit platforms, whereas the UNICOS commands may have 64-bit assumptions; thus the IRIX commands are the path to maximal leverage.

Some issues that will need to be addressed include the following:

  • Compatibility of command-line interfaces (since both Cray and SGI have made extensions).

  • There will be some unique Cray commands, and each must be evaluated to decide whether it needs to be carried forward.

  • Cray has made performance enhancements to commands for vectorization and I/O. These will be evaluated individually to decide if they should be carried forward.

Since the IRIX command set is based on the same standard as the Cray command set (X/Open XPG4 Base 95 Profile), the IRIX commands provide existing Cray customers with the vast majority of the commands they expect. More specific information on the command changes will be available in the future.


Limits

The limits functionality on Cray systems has developed in accordance with the needs of large systems with many competing constituencies. The requirements for limits are as follows:

  • Support for thousands of users and processes

  • Dynamic tuning for batch throughput or interactive response

  • The ability to ensure that critical users or projects have needed system resources

The following key activities have been identified to preserve the traditional Cray capabilities:

  • Port the Cray user database (UDB) library and associated commands or equivalent

  • Add Cray limit extensions to the getrlimit/setrlimit interfaces in IRIX

  • Design and implement a distribution architecture for limits in Cellular IRIX

This effort will be similar to the one done to port Cray limits functionality into UNICOS/mk, since the native interfaces supported in IRIX are virtually the same as those that came with the version of Chorus on which UNICOS/mk is based.
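
For reference, the standard getrlimit/setrlimit interface that the Cray extensions would build on can be exercised as in the following sketch, which uses Python's resource module as a wrapper around the POSIX calls. The helper name lower_soft_limit is hypothetical, and the Cray limit extensions themselves are not specified in this paper and are not shown.

```python
import resource


def lower_soft_limit(which, new_soft):
    """Set a soft resource limit, leaving the hard limit untouched.

    Returns the (soft, hard) pair reported by the kernel afterward.
    """
    soft, hard = resource.getrlimit(which)
    if hard != resource.RLIM_INFINITY and new_soft > hard:
        raise ValueError("soft limit may not exceed the hard limit")
    resource.setrlimit(which, (new_soft, hard))
    return resource.getrlimit(which)
```

For example, lower_soft_limit(resource.RLIMIT_CORE, 0) disables core dumps for the calling process and its future children without changing the hard limit.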


Resource management

The current resource management capabilities of IRIX will be enhanced to support the needed Cray functionality. This work will include the following:

  • Allocation of the hardware barrier tree on both multicell and multihosted systems for CRAY T3E style applications

  • A special scheduling mode for CRAY T3E style applications that will guarantee that all members of an application execute in a synchronized fashion

  • Porting the multilayered user/fair scheduling environment from UNICOS/mk to IRIX for both multicell and multihosted systems

  • Porting the high-level load balancing from UNICOS/mk to IRIX for multicell systems


UNICOS API

The common API for the SN project is being documented in more detail by a separate SGI/Cray Research group. The developers defining this API have been analyzing the IRIX and UNICOS APIs to determine what detailed work needs to be done to provide the functionality Cray customers expect. The resulting document should be available soon.


Asynchronous I/O and listio

IRIX currently supports asynchronous I/O and listio based on the POSIX 1003.1b:1993 interfaces. These interfaces are implemented in a user-level library using threads, whereas UNICOS and UNICOS/mk support native asynchronous operation in the operating system. Native support for this reduces the latency in the activation of an I/O operation, which can significantly impact the execution of highly-tuned applications.

The addition of asynchronous I/O infrastructure into Cellular IRIX will be a large-scale effort that will encompass the system call layer, the XFS filesystem, volume manager, and device drivers. Current plans for listio are to leave it at the POSIX level of functionality.
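
The user-level, thread-based approach can be illustrated with the following sketch, in which each request in a list is handed to a worker thread. This is an analogy for the library approach described above, not the IRIX implementation; the function name listio_read and the (offset, length) request format are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def listio_read(fd, requests, max_workers=4):
    """Issue a list of (offset, length) reads and wait for all of them.

    Each request runs os.pread on a worker thread, so the reads can
    proceed concurrently. Results are returned in request order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(os.pread, fd, length, offset)
                   for offset, length in requests]
        return [f.result() for f in futures]
```

Every request in this sketch pays for a hand-off to a worker thread before the read is issued. That hand-off is the kind of activation latency that native asynchronous support in the operating system is intended to reduce.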


Tapes and DMF

The tape and DMF work currently being implemented for IRIX is expected to carry forward to Cellular IRIX. The details of this work are not included in this document.


Preliminary milestones

As stated previously in this document, in its final form, Cellular IRIX will offer single-system image, scalability to a large number of processors, and fault containment. The realization of this final system will take several years to complete, so in order to address the needs of the various SGI/Cray market segments in the meantime, intermediate deliverables will be produced.

In addition to providing shorter-term deliverables, these intermediate steps also represent the stages toward the full Cellular IRIX system. The following subsections briefly describe these deliverables.


Intermediate deliverable

The first deliverable of the Cellular IRIX project is preliminarily scheduled for late 1998. The supercomputer configuration of this release will support the SN-1 platform and address the needs of the high-end performance market.

One simplifying factor in this system will be the existence of a "golden cell." The system will tolerate failures on every cell except for the golden cell. The existence of a golden cell permits the "basic distribution" of components, where the bulk of the system runs on the golden cell, and requests are "function shipped" from all other cells to the golden cell. This simplifies the distribution and recovery work.

For example, the code for networks, ttys, pipes, and System V message queues could run on the golden cell, while the rest of the cells only run simple "clients." Recovery is simpler, as most subsystems need only to be able to recover from losing a client, but not a server.
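
The basic distribution model can be sketched as follows, using message queues as the example subsystem. The GoldenCell and ClientCell classes and their methods are hypothetical; they illustrate only the idea that all subsystem state lives on the golden cell and that other cells ship requests to it.

```python
class GoldenCell:
    """Runs the server side of subsystems that use basic distribution."""

    def __init__(self):
        self.queues = {}      # message-queue id -> list of pending messages
        self.clients = set()  # cells that have shipped requests here

    def handle(self, client_id, op, qid, payload=None):
        self.clients.add(client_id)
        queue = self.queues.setdefault(qid, [])
        if op == "send":
            queue.append(payload)
            return None
        if op == "recv":
            return queue.pop(0) if queue else None
        raise ValueError(f"unknown operation: {op}")

    def client_failed(self, client_id):
        """Recovery only has to forget the lost client; queue state survives."""
        self.clients.discard(client_id)


class ClientCell:
    """A non-golden cell: holds no queue state and only ships requests."""

    def __init__(self, cell_id, golden):
        self.cell_id = cell_id
        self.golden = golden

    def msgsnd(self, qid, payload):
        return self.golden.handle(self.cell_id, "send", qid, payload)

    def msgrcv(self, qid):
        return self.golden.handle(self.cell_id, "recv", qid)
```

Because a client cell holds no subsystem state in this model, a message sent by one cell remains available to the other cells even if the sending cell later fails.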

Despite the fact that most subsystems will use basic distribution, components that are more important for the supercomputer workloads will be distributed in a more elaborate fashion. A key component to be fully distributed in this timeframe is the file/disk I/O path. The objective is to have disk I/O complete on the requesting cell as much as possible, especially for large unbuffered I/O requests. Another component that will need more elaborate distribution is virtual memory, since its performance is key to the performance of MPP applications.


Future deliverables

In future Cellular IRIX releases, more components will be fully distributed and the golden cell will be eliminated. At that time, a single system will address the needs of all the various market segments of SGI/Cray Research.


Conclusion

The overall direction for the SN hardware and operating system architecture is being carefully defined by joint development teams from Eagan and Mountain View. This work will ensure that the functionality and features that Cray customers currently value and use on UNICOS and UNICOS/mk systems will be available for their high-performance computing needs in the future.
