NCAR Mass Storage Services Strategic Plan

 

July 2006

 

Scientific Computing Division

High-End Services Section

Mass Storage Services Group

 

 

Introduction

 

NCAR's current Mass Storage System (MSS) was designed and implemented by the MSS Group in the mid-1980s.  Earlier versions of the MSS included the half-inch tape library (MSS-I) and the Ampex Terabit Memory System (MSS-II).  As SCD acquired faster supercomputers, the amount of MSS data grew rapidly.  The MSS is now in its fourth generation and contains more than 1.75 Petabytes of unique data.  At the current growth rate, the total amount of MSS data doubles approximately every two years.

 

The IEEE Mass Storage Reference Model, the result of collaboration between DOE, DOD, NSF (including NCAR), and storage vendors, is the underlying basic design of the NCAR MSS.  NCAR was one of the first sites to implement the "third party transfer" mechanism that was a key design feature of the IEEE Reference Model.  By the end of the 1980s, MSS-III was a high-performance, high-capacity data archive that implemented a multi-level storage hierarchy for a modest (in today's terms) series of Cray supercomputers. The 1990s saw an explosive increase in raw supercomputer capacity as well as the introduction of new computer vendors to the NCAR SCD environment. As a result, the MSS efforts became production-centric, primarily servicing supercomputing needs. Enhancements were made to the MSS to keep pace with the ever-increasing computing capacity, which pushed the limits of vendor-provided storage solutions. An effort was made to utilize Commercial Off The Shelf (COTS) solutions whenever possible. COTS solutions were available in the storage hardware area, but no suitable software solutions were available. A code conversion effort began to redeploy the MSS-III system in an open systems environment in which system scalability issues would be addressed and COTS software integrated wherever it made sense to do so. This effort continues today and is referred to as MSS-IV.

 

Today, the NCAR MSS continues to be one of the most capacious archives, storing in excess of 1.75 Petabytes (PB) of unique data, transferring over 5 Terabytes (TB) of data per day in response to user requests, and growing at a net rate of 40 TB of unique data per month. The typical NCAR MSS user is manipulating this data within the confines of the SCD Computer Room, which provides the highest possible data access performance, reliability, and availability.  However, the computing environment is shifting from the traditional glass house paradigm, where the end user must come to a centralized site via interfaces such as ssh for supercomputing and archival storage, to a much more distributed computing and storage environment.  A prime example of the latter is the rapidly expanding TeraGrid.  The NCAR MSS must adapt to the changing environment and at the same time maintain the high-performance characteristics required in a glass house environment. The emphasis for the 2000s will be Grid-enabled services including but not limited to physical distribution of services, seamless data access regardless of the client platform and client location, and building tools that can be used in a larger SCD computing environment framework.

 

Vision

 

The vision for the new millennium:

 

To provide seamless data services extending beyond the bounds

 of the NCAR SCD computer room for all data managed by SCD,

 and to manage that data as efficiently as possible.

 

 

Definitions

 

Archive:

An archive is a repository for the storage of inactive data that must be retained for long periods of time. An archive performs a core set of functions with many variations, e.g., supporting different device types. An archive is in a state of constant flux and evolution because its lifetime exceeds that of the technologies it relies on.

 

 

Mass Storage Services:

Mass Storage Services is a data storage system to manage SCD user data. Mass Storage Services includes user interfaces (web, command line, program callable), file services for sharing data, and archive services for long-term data storage, all of which are supported in a heterogeneous environment.

 

 

Objectives of the Plan

 

Objectives of this plan are:

 

•      To continue providing the best possible Mass Storage Services while balancing performance and cost.

 

•      To provide an enhanced Grid presence through the modernization of Mass Storage Services, interfaces, and tools.

 

 

 

Recent Accomplishments

 

The lack of vendor support for a heterogeneous, shared filesystem with a built-in Hierarchical Storage Manager (HSM) has made it necessary for the MSS Group to re-think its strategy for the next few years.  The plan is to continue to pursue the shared filesystem, with a global name space, and to evaluate new products as they come out, while at the same time working on other enhancements and improvements to the MSS archive using technology that is available now.

 

One of the biggest success stories for the MSS Group over the past few years was the completion of the first phases of the Storage Manager Project.  The initial goal of this project was to reduce the backend tape traffic internal to the MSS, and at the same time improve user response time for reading and writing MSS files.  This was addressed by the diskfarm replacement/expansion project.  This project did two things:

 

•       It allowed us to replace the aging IBM 3390 disk farm with a modern Fibre Channel attached RAID system.  The new RAID has much improved data rates and reliability, and has allowed us to get away from another dependency on MVS.

 

•       It has allowed us to greatly expand the size of the diskfarm at a reasonable cost.  We currently have 44 Terabytes of RAID deployed in production, and we have seen an overall reduction in tape traffic in the neighborhood of 70%.  Users have seen a big improvement in MSS access time.

 

This project was implemented with no changes to the MSS user interface. Users continue to use the msread/mswrite/msrcp commands exactly as they did before, and all binaries with embedded calls to msread or mswrite continue to work correctly, with no changes.

 

For more information on the new MSS disk cache, along with other recent accomplishments of the MSS Group, please refer to the MSS accomplishments section of the CISL Annual Report for 2005, which can be found at http://www.cisl.ucar.edu/nar/2005/ci/mss.a.jsp.

 

Web Access

 

Providing a Web presence for the NCAR Mass Storage Services continues to be a high priority project within the MSS Group.  The explosion of e-commerce enterprises combined with the ubiquitous Web browser has created new opportunities for providing Mass Storage Services data and metadata on the Internet.  Many subprojects are required to build the foundation. These include the deployment of MSS File Services, additional Web-based tools to access and manage MSS data, and enhanced archive capacity, performance, reliability, and availability. These are independent projects that together will provide the tools for an SCD Computing Environment framework described in other SCD Project Roadmaps.

 

 

Web Data Access

 

In the absence of a global, shared filesystem front-ending the MSS, Web accessibility to MSS data requires going through an intermediate layer.  Web servers still need to have their own locally attached disk (which can be shared with other similar servers).  Files stored on local disk can be directly accessed by Web clients.  Whenever a file is called for that is not on local disk, msread (or msrcp) is used to fetch the file from the MSS archive. Subsequent accesses do not require an MSS access until the file ages off the local disk.  Web servers of this type can be internal SCD servers, or servers maintained by other divisions. Servers connected to one of the disk cache pools described above are able to share files with other servers connected to the same pool, thereby eliminating the need to have multiple copies of the same file on several different machines.  This saves disk space, reduces accesses to the MSS archive, and improves the response time seen by Web clients.
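
The read-through behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MSS Group's implementation: the function names, the cache directory layout, and the age-off policy (based on file modification time) are all hypothetical, and fetch_from_archive is a caller-supplied stand-in for whatever invokes msread or msrcp, whose command syntax is not shown in this plan.

```python
import time
from pathlib import Path
from typing import Callable


def serve_file(
    mss_path: str,
    cache_dir: Path,
    fetch_from_archive: Callable[[str, Path], None],
) -> Path:
    """Return a local path for mss_path, fetching from the archive on a miss."""
    # Hypothetical layout: mirror the MSS path underneath the cache directory.
    local_path = cache_dir / mss_path.lstrip("/")
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        # Cache miss: this is the only point where the archive is accessed.
        fetch_from_archive(mss_path, local_path)
    return local_path


def age_off(cache_dir: Path, max_age_seconds: float) -> list[Path]:
    """Delete cached files older than max_age_seconds; return what was removed."""
    cutoff = time.time() - max_age_seconds
    removed = []
    # sorted() materializes the listing before any file is deleted.
    for path in sorted(cache_dir.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A second request for the same file is served from local disk without touching the archive; only after age_off removes the file does the next request trigger another fetch.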

 

In addition to making MSS files directly visible to Web clients via the mechanism described above, it is also possible for Web clients to request MSS files via FTP.  The MSS Group now has a production FTP server for this purpose, with support for secure FTP for access outside the UCAR security perimeter.

 

Web Metadata Access

 

The other aspect of an expanded MSS Web presence centers around the metadata associated with MSS files.  In the past, access to MSS metadata was limited to a suite of DCS (Distributed Computing Services) commands which could be used to generate lists of MSS files by filename or directory name, list and/or modify the metadata associated with a specific file or set of files, or perform other operations such as removing files.  Many types of queries, however, were difficult or impossible to answer using the DCS commands.  For this reason, the MSS Group designed and implemented a Web-based MSS data management system that allows users to obtain summaries of their MSS holdings and associated GAU (General Accounting Unit) charges by scientist number, project number, or MSS root directory name.  In addition, users who subscribe to the file listing service can view information on a file-by-file basis via the same Web site.

 

The TeraGrid

 

NCAR is just now beginning to enter into the cooperative computing and data-sharing environment known as the "TeraGrid".  It is expected that portions of the NCAR MSS will be made accessible to the TeraGrid either as directly attached storage or via a server with an attached local disk cache back-ended by the MSS.  This project is just now getting started at NCAR, and many details have yet to be worked out.

 

The NSF 625 Solicitation

 

The National Science Foundation has issued a new solicitation known as NSF 05-625, the full title of which is "High Performance Computing System Acquisition: Towards a Petascale Computing Environment for Science and Engineering".  NCAR has already submitted a proposal for the first portion of this project (February 2006).  Our proposal included a 12 Petabyte Archival Storage subsystem to be co-located with the compute engines.  If we were to be awarded this contract, it would have major implications for the MSS Group.  In addition to redirecting some of our existing resources and personnel, a new FTE would need to be hired into the MSS Group to work full time on the support and integration of the new archive.

 

Even if we aren't awarded this contract, there are plans for three additional increments to the original solicitation.  NCAR will be responding to these as well.  Whether we will be selected for any of these is unknown at the present time.

 

The ICESS Procurement

 

Whereas the TeraGrid connectivity and the NSF 625 solicitation are relatively uncertain as to how they will impact the MSS Group, the ICESS procurement (currently in progress) is definitely going to be completed by late calendar year 2006 or early 2007.  We know fairly accurately how this is going to impact the MSS, since the new machine will be roughly six times the computing capability of our existing "Bluesky" machine.  This will be discussed further in the section on MSS Growth Projections (below).

 

Offsite Disaster Recovery

 

For the past few years, the MSS Group has been looking into ways of saving critical MSS data in a secure, offsite location so that this data would not be lost forever in the event of a catastrophic event at the Mesa Lab.  Storing data on sets of tapes that would be moved offsite was ruled out for a number of reasons.  What we needed was a way to electronically copy data directly to/from a remote site, with no manual intervention.  The problem was that there was never a suitable site that we could use for this purpose.

 

A new initiative that is currently underway has a great deal of promise in this area.  It specifies a cooperative agreement with another site (the San Diego Supercomputer Center) in which data from one site could be backed up at the other site (up to 300 Terabytes over a 5-year period).  This project is just now getting under way, but we are hopeful that it will be successful.  We are looking at ways of implementing this system in such a way as to have minimal impact on the MSS budget.

 

MSS Growth Management

 

Managing MSS growth has become a key issue that is being addressed in a number of ways by the MSS Group.  Many factors contribute to MSS growth, but the two biggest ones are data coming into NCAR from external sources, and increases in the total computing capacity in the SCD computer room.  During periods of relative stability, the rate of growth stays fairly constant, but big jumps in the growth rate can occur when large amounts of new external data are brought in, or when the "flops on the floor" are increased substantially due to new computer acquisitions.

 

One major change that has helped to keep MSS growth in check was the introduction of user-selectable alternatives for how files are stored, known as "Class of Service" (COS).  Using COS, the user can select varying levels of reliability, accessibility, and cost (GAU charges) for each MSS file.  The first COS (introduced in early FY-02) allows users to identify those MSS files where reliability is not as important as the default in which two tape copies are made. In this case, the MSS will make only one copy of the MSS file, thereby reducing the storage cost.  Another COS option that was implemented was "usage=backup", which identifies short-lived system backup files that are not likely to be read back often, if at all.  This has helped immensely in segregating backup files away from other "normal" MSS files so that they can be managed much more efficiently. Future COS options will allow the MSS user to identify inactive MSS files that will not be kept on the fastest possible access devices and true archive MSS files which may be kept in a cold storage environment.  Use of these latter COS options will benefit the user through reduced storage costs and will benefit the MSS by offloading the more expensive fast-access devices.  A cold storage environment could also be maintained outside the SCD computer room, reducing the floor space requirement of the MSS.  If a higher level of reliability is required, the MSS user could elect to have the MSS create more than the default two copies of a file, or have one or more copies of the file kept offsite for disaster recovery.  Once an MSS file is created, its COS can be changed dynamically by the user as the user's needs change.

 

Incorporating New Technology

 

Capacity and performance enhancements are an ongoing process, and are critical to the continued success of the NCAR MSS.  We know there is a direct relationship between the amount of computing capacity serviced by the MSS, the amount of MSS data transferred, and the MSS data growth. Technology will continue to provide faster storage devices and denser media (at a lower and lower cost per Terabyte) that will be incorporated into the MSS. By the year 2008 we expect a minimum 4X increase in data transfer rate and a minimum 5X increase in media density for tape storage devices.  RAID disk subsystems are also expected to increase in speed and capacity in the coming years.  The MSS Group will continue to track vendor solutions and participate in vendor product beta tests in order to assess which new products are a good fit for our storage infrastructure.

 

Migration to New Media

 

Once an archive is established, the issue of keeping the data alive must be addressed; in other words, how can one be assured that the data can be retrieved? Data Ooze is the process of migrating data from older media to new media. This process should be transparent to the end user. Many issues must be considered when estimating the useful lifetime of a particular type of media. Not only is media shelf life important, but drive lifetime and software support are important as well. The primary driving lifetime factor is vendor planned obsolescence. All lifetime factors must be identified and estimated to compute the total useful lifetime. Then the amount of time required to ooze the data off the media must be subtracted to arrive at a usable lifetime figure. A one-half usable lifetime is the worst case because you spend the first half writing data to the media and the second half oozing it to different media. Some vendors understand this problem but are doing little to solve it. They talk of the need to reuse media across multiple enhancements of a storage device family while maintaining backward read capabilities. However, this is a short-term solution in the sense that as technology advances, media will become obsolete well before its useful lifetime ends. Achieving aggressive media density increases will require new media formulations that are not backward compatible. Couple that with a Petabyte-sized archive, and the result is a data ooze process that cannot complete within a given generation of a storage device. In fact, the data ooze may outlast the entire set of generation upgrades for a given device, so the end result is the loss of any benefit from generation upgrades and loss of media reuse.  It may be desirable not to start a data ooze on a given technology until approaching end-of-life (EOL) for that technology, but you would need to allow enough time for the ooze to complete before EOL.

 

The bottom line is that data ooze is a huge problem that must be addressed and resolved well in advance. Solutions will be expensive, both in new device and media purchases and in supporting and managing the ooze. With Petabyte-sized archives, the data ooze process may well be the most resource-consuming process of the entire system, and will become a perpetual operation.
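
The usable-lifetime estimate described in this section (total useful lifetime minus the time needed to ooze the data off the media) can be illustrated with a short Python sketch. The function name and the example figures are hypothetical, and taking the shortest individual factor as the total useful lifetime is an assumption made here for illustration; the plan does not state how the factors are combined.

```python
def usable_lifetime_years(
    lifetime_factors_years: dict[str, float],
    archive_size_pb: float,
    ooze_rate_pb_per_year: float,
) -> float:
    """Estimate how long media can be written before the ooze must begin."""
    # Assumption: the total useful lifetime is bounded by the shortest factor
    # (media shelf life, drive lifetime, software support, and so on).
    total_useful = min(lifetime_factors_years.values())
    # Time needed to migrate the whole archive onto new media.
    ooze_time = archive_size_pb / ooze_rate_pb_per_year
    # A zero or negative result means the ooze cannot finish in time.
    return total_useful - ooze_time


# Hypothetical figures, chosen only to show the arithmetic.
factors = {"media_shelf_life": 30.0, "drive_lifetime": 8.0, "software_support": 6.0}
print(usable_lifetime_years(factors, archive_size_pb=4.0, ooze_rate_pb_per_year=2.0))
```

With these made-up inputs, a six-year limiting factor and a two-year ooze leave four usable years. As the archive grows while the ooze rate stays fixed, the result shrinks toward zero, which is the situation the text warns about.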

 

MSS Growth Projections and Long Term Planning

 

Many factors go into the estimates for how much total net MSS growth we should plan for.  Traditionally, we have used the "rule of thumb" that the total amount of data in the MSS (including second copies of files) doubles about every two years, taking all factors into account.  Data "ooze" (described above) doesn't increase the total data on the MSS, but burns up new media, so this needs to be included in future MSS budgets.
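
The doubling rule of thumb can be written as a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the 1.75 PB starting value is the unique-data figure quoted in the Introduction rather than the total (including second copies) to which the rule is applied, so the output is not meant to reproduce the planning figures in this document.

```python
import math


def projected_size_pb(
    start_pb: float, years_elapsed: float, doubling_years: float = 2.0
) -> float:
    """Size after years_elapsed, assuming steady doubling every doubling_years."""
    return start_pb * 2.0 ** (years_elapsed / doubling_years)


def years_to_reach(
    start_pb: float, target_pb: float, doubling_years: float = 2.0
) -> float:
    """Years of steady doubling needed to grow from start_pb to target_pb."""
    return doubling_years * math.log2(target_pb / start_pb)


# Illustrative inputs only (unique data, not total data).
print(projected_size_pb(1.75, 2.0))  # one doubling period
print(years_to_reach(1.75, 7.0))     # two doubling periods
```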

 

Even with the ICESS procurement, which is capable of increasing the total data flowing into the MSS from the supercomputers by a factor of three starting in early calendar 2007, calculations have shown that the doubling in two years rule can still be used.  The bottom line still works out to be ten Petabytes of total MSS data by the end of FY-09.  The following shows a year-by-year summary of what equipment (tapes, drives, automated cartridge libraries, and MSS servers and Fibre Channel infrastructure) we are planning to purchase each fiscal year to accommodate the projected growth:

 

Fiscal Year 2006

 

FY-06 is very unusual in the sense that we anticipate no media purchases for the entire fiscal year.  This is made possible by the technology upgrade to 9940-B tape drives, which write 200 Gigabytes onto the same media which currently hold only 60 Gigabytes of data in 9940-A format.  A data "ooze" was started in the fall of 2005 that has been freeing up media (to be rewritten in 200 Gigabyte format) fast enough to stay ahead of all new data entering the MSS.  This will continue until the January/February 2007 time frame.  At that time, all five Powderhorn silos will be full of 9940-B formatted tapes, and there will be enough empty tapes to accommodate all new data flowing into the MSS for another six to eight months (fall of 2007).
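
The reason the ooze frees media can be seen from the two capacities quoted above. The following sketch assumes, for simplicity, that every 9940-A cartridge is completely full and that the rewritten data packs perfectly onto 9940-B format; the function name and the 1,000-cartridge batch size are hypothetical.

```python
def tapes_freed(old_tapes: int, old_gb: int = 60, new_gb: int = 200) -> int:
    """Cartridges left empty after rewriting old_tapes full tapes in the new format."""
    data_gb = old_tapes * old_gb
    # Integer ceiling division: cartridges needed to hold the data in the new format.
    new_tapes = -(-data_gb // new_gb)
    return old_tapes - new_tapes


batch = 1_000  # hypothetical batch of full 60 GB cartridges
freed = tapes_freed(batch)
print(freed)               # cartridges freed by oozing the batch
print(freed * 200 / 1000)  # capacity of the freed cartridges, in Terabytes
```

Under these assumptions, roughly 70% of the cartridges in each batch become available for new data once they are rewritten at the higher density.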

 

Ten additional 9940-B silo-attached tape drives were already purchased early in FY-06, and we do not anticipate having to buy any additional tape drives this fiscal year.

 

The Distributed Computing Services (DCS) software is undergoing a major rewrite in FY-06.  The new version (known as DCS version 4) is currently in beta test, and is scheduled to go into production in July 2006.  The new version eliminates the dependency on DCE (Distributed Computing Environment), which has been losing vendor support.  It has numerous other improvements and enhancements.

 

Also in FY-06, we plan to deploy the prototype for the MSS-IV Metadata Server, which will ultimately replace the current Master File Directory subsystem known as MFDTASK.

 

Fiscal Year 2007

 

Things begin to change dramatically around the middle of FY-07.  The new ICESS machine will be in the process of ramping up to full production, which will mean that we will start using up existing 9940-B media at a much higher rate.  Our five existing Powderhorn silos will be fully populated with 9940-B media, with only enough empty tapes to last us until about the end of FY-07 (Oct 1, 2007).  This means that we will need to start installing additional automated libraries and media in mid FY-07.  The current plan is to expand into an area in the recently vacated second basement (2-B), most likely with the purchase of a Sun/STK SL8500 10,000-slot tape library, populated with 20 T10000 tape drives along with 1,400 500-Gigabyte cartridges (enough to last roughly 3 months).  In addition, we will need to expand our Fibre Channel data network to accommodate the new drives, and we will need to install additional Storage Manager servers.  Starting in FY-07, we will need to begin expanding the MSS internal disk cache to keep pace with the larger and larger volume of incoming data.  This will need to continue in FY-08 and FY-09.

 

Also, in FY-07 we will complete the production version of the MSS-IV Metadata Server and decommission MFDTASK.

 

Fiscal Year 2008

 

We will continue writing to the 500-Gigabyte cartridges for most of FY-08, and will need to purchase an additional 4,200 cartridges.  Sun/STK is now saying that the 1 Terabyte tape drives (which use the same T10000 media as the 500 Gigabyte drives) will be available in early calendar year 2008.  The MSS Group plans to take advantage of the new drives as soon as it is feasible, to start doubling the amount of data on the cartridges.  If we are able to do this, we could start writing to the Terabyte drives in the fall of 2008.  We would need to purchase 20 of the new Terabyte drives and install them in the SL8500.  We would not need to expand the SL8500, since it has enough capacity to carry us all the way through FY-08 and into FY-09.  Toward the end of FY-08 we will need to purchase 700 T10000 cartridges (a 3-month supply) to be written on the 1 Terabyte drives.  In addition, we are going to need to start a data "ooze" from 9940-B media onto the new Terabyte tapes at the same time.  We will need an additional 500 tapes for this purpose, assuming we are able to ooze at a rate of 2 Petabytes/year.
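
The cartridge counts quoted for FY-07 and FY-08 are mutually consistent, as the following sketch shows. It takes the FY-07 figure (1,400 500-Gigabyte cartridges lasting roughly 3 months) as the implied data rate and uses decimal units (1 PB = 1,000 TB = 1,000,000 GB). Reading the 4,200 cartridges as nine months of writing and the 500 ooze tapes as three months of oozing is an inference made here, not a statement from the plan, and the function and variable names are hypothetical.

```python
GB_PER_TB = 1_000
TB_PER_PB = 1_000


def cartridges_needed(data_gb: int, cartridge_gb: int) -> int:
    """Whole cartridges required to hold data_gb (integer ceiling division)."""
    return -(-data_gb // cartridge_gb)


# Implied rate: 1,400 cartridges x 500 GB written in roughly three months.
quarter_gb = 1_400 * 500

# Nine months of new data on 500 GB cartridges (inferred reading of "most of FY-08").
fy08_500gb = cartridges_needed(3 * quarter_gb, 500)

# Three months of new data on 1 TB cartridges.
fy08_1tb = cartridges_needed(quarter_gb, 1_000)

# Three months of ooze at 2 PB/year onto 1 TB cartridges.
ooze_quarter_gb = 2 * TB_PER_PB * GB_PER_TB // 4
ooze_1tb = cartridges_needed(ooze_quarter_gb, 1_000)

print(fy08_500gb, fy08_1tb, ooze_1tb)
```

Under these assumptions the three results match the 4,200, 700, and 500 cartridges given in the text.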

 

Fiscal Year 2009

 

At the beginning of FY-09, we hope to be writing all new data onto 1 Terabyte T10000 cartridges.  We will need additional SL8500 library space about halfway through FY-09, so we will need to have money in the budget for a second library, but not necessarily the maximum size (10,000 slots).  If the new data center is ready, we could install an SL8500 there (along with all necessary Storage Manager servers and equipment), or we could simply continue to expand in the 2-B area.  Either way, the total cost will be about the same.  We will also need an additional 10-15 tape drives, as more and more read-backs start coming from the Terabyte tapes.

 

The total amount of media that we will need to purchase in FY-09 will be about 2,500 tapes for new data, and 2,000 tapes for the 9940-B data ooze.  

 

Summary

 

The technology picture becomes very cloudy when looking 2-3 years out. In this day and age, numerous vendors promise solutions before they can actually deliver. Even worse, high-tech companies are born and die or are acquired by other companies on a moment's notice. Where does this leave the consumer? Not in an enviable position. Selecting vendors and products for long-term relationships and expecting a high level of support, enhancements, and commitment is a gamble at best. We are positioning Mass Storage Services for quick reactions to changing technology by judicious selection of Commercial-Off-The-Shelf (COTS) solutions and applying industry standards whenever possible. However, COTS and standards solutions are slow to market when you're on the bleeding edge of technology. Therefore, custom software and hardware solutions are required to integrate the pieces.

 

The current Mass Storage Services are production-centric. We need to actively work with other peer sites and groups to design and develop state-of-the-art interfaces and functionality while continuing to maintain the high-quality production services our users expect.