NCAR
Mass Storage Services Strategic Plan
July 2006
Scientific Computing Division
High-End Services Section
Mass Storage Services Group
Introduction
NCAR's current Mass Storage
System (MSS) was designed and implemented by the MSS Group in the mid-1980s. Earlier versions of the MSS included
the half-inch tape library (MSS-I) and the Ampex Terabit Memory System
(MSS-II). As SCD acquired faster
supercomputers, the amount of MSS data grew rapidly. The MSS is now in its fourth generation and contains more
than 1.75 Petabytes of unique data.
At the current growth rate, the total amount of MSS data doubles
approximately every two years.
The IEEE Mass Storage Reference
Model, the result of collaboration between DOE, DOD, NSF (including NCAR), and
storage vendors, is the underlying basic design of the NCAR MSS. NCAR was one of the first sites to
implement the "third party transfer" mechanism that was a key design feature of
the IEEE Reference Model. By the
end of the 1980s, MSS-III was a high-performance, high-capacity data archive
that implemented a multi-level storage hierarchy for a modest (in today's
terms) series of Cray supercomputers. The 1990s saw an explosive increase in
raw supercomputer capacity as well as the introduction of new computer vendors
to the NCAR SCD environment. As a result, the MSS efforts became production-centric,
primarily servicing supercomputing needs. Enhancements were made to the
MSS to keep pace with the ever-increasing computing capacity, pushing the limits
of vendor-provided storage solutions. An effort was made to utilize Commercial
Off The Shelf (COTS) solutions whenever possible. COTS solutions were available
in the storage hardware area, but no suitable software solutions were
available. A code conversion effort began to redeploy the MSS-III system in an
open systems environment in which system scalability issues would be addressed
and COTS software integrated wherever it made sense to do so. This effort
continues today and is referred to as MSS-IV.
Today, the NCAR MSS continues
to be one of the most capacious archives, storing in excess of 1.75 Petabytes
(PB) of unique data, transferring over 5 Terabytes (TB) of data per day in
response to user requests, and growing at a net rate of 40 TB of unique data
per month. The typical NCAR MSS user is manipulating this data within the
confines of the SCD Computer Room that provides the highest possible data
access performance, reliability, and availability. However, the computing environment is shifting from the
traditional glass house paradigm, where the end user must come to a centralized
site via interfaces such as ssh for supercomputing and archival storage, to a
much more distributed computing and storage environment. A prime example of the latter is the
rapidly expanding TeraGrid. The
NCAR MSS must adapt to the changing environment and at the same time maintain
the high-performance characteristics required in a glass house environment. The
emphasis for the 2000s will be Grid-enabled services including, but not limited
to, physical distribution of services, seamless data access regardless of the
client platform and client location, and building tools that can be used in a
larger SCD computing environment framework.
Vision
The vision for the new
millennium:
To provide seamless data
services extending beyond the bounds
of the NCAR SCD computer room for all
data managed by SCD,
and to manage that data as efficiently
as possible.
Definitions
Archive:
An archive is a repository for
the storage of inactive data that must be retained for long periods of time. An
archive performs a core set of functions with many variations, e.g., supporting
different device types. An archive is in a state of constant flux and evolution
because its lifetime exceeds that of the technologies it relies on.
Mass Storage Services:
Mass Storage Services is a data
storage system to manage SCD user data. Mass Storage Services includes user
interfaces (web, command line, program callable), file services for sharing
data, and archive services for long-term data storage, all of which are
supported in a heterogeneous environment.
Objectives of the Plan
Objectives of this plan are:
• To continue providing the best possible Mass Storage Services while balancing performance and cost.
• To provide an enhanced Grid presence through the modernization of Mass Storage Services, interfaces, and tools.
Recent Accomplishments
The lack of vendor support for
a heterogeneous, shared filesystem, which would also include a built-in
Hierarchical Storage Manager (HSM), has made it necessary for the MSS Group to
re-think its strategy for the next few years. The plan is to continue to
pursue the shared filesystem, with a global name space, and to evaluate new
products that come out, but to work on other enhancements and improvements to
the MSS archive at the same time, using technology that is available now.
One of the biggest success
stories for the MSS Group over the past few years was the completion of the
first phases of the Storage Manager Project. The initial goal of this project was to reduce the backend
tape traffic internal to the MSS, and at the same time improve user response
time for reading and writing MSS files.
This was addressed by the diskfarm replacement/expansion project.
This project did two things:
• It allowed us to replace the aging IBM 3390 disk farm with a modern Fibre Channel attached RAID system. The new RAID has much improved data rates and reliability, and has allowed us to get away from another dependency on MVS.
• It has allowed us to greatly expand the size of the diskfarm at a reasonable cost.
We currently have 44 Terabytes of RAID deployed in production, and we
have seen an overall reduction in tape traffic in the neighborhood of 70%. Users have seen a big improvement in
MSS access time.
This project was implemented
with no changes to the MSS user interface. Users continue to use the
msread/mswrite/msrcp commands exactly as they did before, and all binaries with
embedded calls to msread or mswrite continue to work correctly, with no
changes.
For more information on the new
MSS disk cache, along with other recent accomplishments of the MSS Group,
please refer to the MSS accomplishments section of the CISL Annual Report for 2005,
which can be found at http://www.cisl.ucar.edu/nar/2005/ci/mss.a.jsp.
Web Access
Providing a Web presence for
the NCAR Mass Storage Services continues to be a high priority project within
the MSS Group. The explosion of e-commerce enterprises combined with the
ubiquitous Web browser has created new opportunities for providing Mass Storage
Services data and metadata on the Internet. Many subprojects are required
to build the foundation. These include the deployment of MSS File Services,
additional Web-based tools to access and manage MSS data, and enhanced archive
capacity, performance, reliability, and availability. These are independent
projects that together will provide the tools for an SCD Computing Environment
framework described in other SCD Project Roadmaps.
Web Data Access
In the absence of a global,
shared filesystem front-ending the MSS, Web access to MSS data
requires going through an intermediate layer.
Web servers still need to have their own locally attached disk (which
can be shared with other similar servers). Files stored on local disk can
be directly accessed by Web clients. Whenever a file is called for that
is not on local disk, msread (or msrcp) is used to fetch the file from the MSS
archive. Subsequent accesses do not require an MSS access, until the file ages
off of the local disk. Web servers of this type can be internal SCD
servers, or servers maintained by other divisions. Servers connected to one of
the disk cache pools described above are able to share files with other servers
connected to the same pool, thereby eliminating the need to have multiple
copies of the same file on several different machines. This saves disk
space, in addition to reducing accesses to the MSS archive, and also improves
the response time seen by Web clients.
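The fetch-on-miss behavior described above can be sketched as a small read-through cache. This is an illustrative sketch only, not the MSS Group's implementation: the LocalDiskCache class, its fetch_from_archive callback (standing in for an msread or msrcp invocation, whose exact arguments are not given here), the made-up file paths, and the size-based aging policy are all assumptions.

```python
from collections import OrderedDict
from typing import Callable


class LocalDiskCache:
    """Hypothetical read-through cache in front of an archive.

    fetch_from_archive is an assumed callback standing in for an
    msread/msrcp call; it takes an MSS path and returns the file bytes.
    """

    def __init__(self, fetch_from_archive: Callable[[str], bytes], capacity_bytes: int):
        self._fetch = fetch_from_archive
        self._capacity = capacity_bytes
        self._files: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0
        self.archive_reads = 0  # how many requests had to go to the archive

    def get(self, mss_path: str) -> bytes:
        if mss_path in self._files:
            # Cache hit: serve from local disk and mark as recently used.
            self._files.move_to_end(mss_path)
            return self._files[mss_path]
        # Cache miss: fetch from the archive, then keep a local copy.
        data = self._fetch(mss_path)
        self.archive_reads += 1
        self._files[mss_path] = data
        self._used += len(data)
        # Age the least recently used files off local disk when over capacity.
        while self._used > self._capacity and len(self._files) > 1:
            _, evicted = self._files.popitem(last=False)
            self._used -= len(evicted)
        return data


if __name__ == "__main__":
    # Made-up archive contents for demonstration only.
    archive = {"/PROJ/a": b"x" * 60, "/PROJ/b": b"y" * 60}
    cache = LocalDiskCache(archive.__getitem__, capacity_bytes=100)
    cache.get("/PROJ/a")  # first access goes to the archive
    cache.get("/PROJ/a")  # second access is served from local disk
    print(cache.archive_reads)  # 1
```

Several servers sharing one cache pool would correspond to several Web servers calling the same cache object, which is why a shared pool reduces both duplicate copies and archive accesses.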
In addition to making MSS files
directly visible to Web clients via the mechanism described above, it is also
possible for Web clients to request MSS files via FTP. The MSS Group
now has a production FTP server for this purpose, with support for secure FTP
for access outside the UCAR security perimeter.
Web Metadata Access
The other aspect of an expanded
MSS Web presence centers on the metadata associated with MSS files.
In the past, access to MSS metadata was limited to a suite of DCS (Distributed
Computing Services) commands which could be used to generate lists of MSS files
by filename or directory name, list and/or modify the metadata associated with
a specific file or set of files, or perform other operations such as removing
files. Many types of queries, however, were difficult or
impossible to perform using the DCS commands. For this reason, the MSS
Group designed and implemented a Web-based MSS data management system that
allows users to obtain summaries of their MSS holdings and associated GAU
(General Accounting Unit) charges by scientist number, project number, or MSS
root directory name. In addition,
users who subscribe to the file listing service can view information on a
file-by-file basis via the same Web site.
The TeraGrid
NCAR is just now beginning to
enter into the cooperative computing and data-sharing environment known as the
"TeraGrid". It is expected that
portions of the NCAR MSS will be made accessible to the TeraGrid either as
directly attached storage or via a server with an attached local disk cache
back-ended by the MSS. This
project is just now getting started at NCAR, and many details have yet to be
worked out.
The NSF 625 Solicitation
The National Science Foundation
has issued a new solicitation known as NSF 05-625, the full title of which is
"High Performance Computing System Acquisition: Towards a Petascale Computing
Environment for Science and Engineering".
NCAR has already submitted a proposal for the first portion of this
project (February 2006). Our
proposal included a 12 Petabyte Archival Storage subsystem to be co-located
with the compute engines. If we
were to be awarded this contract, it would have major implications for the MSS
Group. In addition to redirecting
some of our existing resources and personnel, a new FTE would need to be hired
into the MSS Group to work full time on the support and integration of the new
archive.
Even if we aren't awarded this
contract, there are plans for three additional increments to the original
solicitation. NCAR will be
responding to these as well, but
whether we will be selected for any of them is unknown at the
present time.
The ICESS Procurement
Whereas the TeraGrid
connectivity and the NSF 625 solicitation are relatively uncertain as to how
they will impact the MSS Group, the ICESS procurement (currently in progress)
is definitely going to be completed by late calendar year 2006 or early
2007. We know fairly accurately
how this is going to impact the MSS, since the new machine will have roughly six
times the computing capability of our existing "Bluesky" machine. This will be discussed further in the section
on MSS Growth Projections (below).
Offsite Disaster Recovery
For the past few years, the MSS
Group has been looking into ways of saving critical MSS data in a secure,
offsite location so that this data would not be lost forever in the event of a
catastrophe at the Mesa Lab. Storing
data on sets of tapes that would be moved offsite was ruled out for a number of
reasons. What we needed was a way
to electronically copy data directly to/from a remote site, with no manual
intervention. The problem was that
there was never a suitable site that we could use for this purpose.
There is a new initiative that
is currently underway that has a great deal of promise in this area. It specifies a cooperative agreement
with another site (the San Diego Supercomputer Center) in which data from one
site could be backed up at the other site (up to 300 Terabytes over a 5 year
period). This project is just now
getting under way, but we are hopeful that it will be successful. We are looking at ways of implementing
this system in such a way as to have minimal impact on the MSS budget.
MSS Growth Management
Managing MSS growth has become a key issue that is being
addressed in a number of ways by the MSS Group. Many factors contribute to MSS growth, but the two biggest
ones are data coming into NCAR from external sources, and increases in the
total computing capacity in the SCD computer room. During periods of relative stability, the rate of growth
stays fairly constant, but big jumps in the growth rate can occur when large
amounts of new external data are brought in, or when the "flops on the floor"
are increased substantially due to new computer acquisitions.
One major change that has
helped to keep MSS growth in check was the introduction of user-selectable
alternatives for how files are stored, known as "Class of Service"
(COS). Using COS, the user can
select varying levels of reliability, accessibility, and cost (GAU charges) for
each MSS file. The first COS
(introduced in early FY-02) allows users to identify those MSS files that
do not need the default level of reliability, in which two tape copies are
made. In this case, the MSS will make only one copy of the MSS file, thereby
reducing the storage cost. Another
COS option that was implemented was "usage=backup", which identifies
short-lived system backup files that are not likely to be read back often, if
at all. This has helped immensely
in segregating backup files away from other "normal" MSS files so that they can
be managed much more efficiently. Future COS options will allow the MSS user to
identify inactive MSS files that will not be kept on the fastest possible
access devices and true archive MSS files that may be kept in a cold storage
environment. Usage of the latter
COS options will benefit the user with reduced storage costs as well as the MSS by
offloading the more expensive fast-access devices. This means a cold storage environment could be maintained
outside the SCD computer room, reducing the floor space requirement of the
MSS. If a higher level of
reliability is required, the MSS user could elect to have the MSS create more
than the default 2 copies of a file, or have one or more copies of the file
kept offsite for disaster recovery.
Once an MSS file is created, its COS can be changed dynamically by the
user as the user's needs change.
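As an illustration of how per-file Class of Service choices translate into tape consumption, the sketch below models a COS as a small record. The ClassOfService type, its field names, and the helper function are hypothetical; only the default of two tape copies, the single-copy option, and the usage=backup designation come from the text above, and actual GAU charge formulas are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassOfService:
    """Hypothetical model of a per-file COS; field names are assumptions."""
    copies: int = 2          # default described in the text: two tape copies
    usage: str = "normal"    # "backup" marks short-lived system backup files


def tape_bytes_consumed(file_size_bytes: int, cos: ClassOfService) -> int:
    """Tape space used by one file: every copy occupies the full file size."""
    if cos.copies < 1:
        raise ValueError("a stored file needs at least one copy")
    return file_size_bytes * cos.copies


default_cos = ClassOfService()
economy_cos = ClassOfService(copies=1)
one_gb = 10**9
print(tape_bytes_consumed(one_gb, default_cos))  # 2000000000
print(tape_bytes_consumed(one_gb, economy_cos))  # 1000000000
```

Choosing the single-copy option halves the tape consumed by a file relative to the default, which is the mechanism by which COS helps keep growth in check.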
Incorporating New Technology
Capacity and performance
enhancements are an ongoing process, and are critical to the continued success
of the NCAR MSS. We know there is
a direct relationship between the amount of computing capacity serviced by the
MSS, the amount of MSS data transferred, and the MSS data growth. Technology
will continue to provide faster storage devices and denser media (at a lower
and lower cost per Terabyte) that will be incorporated into the MSS. By the
year 2008 we expect a minimum 4X increase in data transfer rate and a minimum
5X increase in media density for tape storage devices. RAID disk subsystems are also expected
to increase in speed and capacity in the coming years. The MSS Group will continue to track
vendor solutions and participate in vendor product beta tests in order to
assess which new products are a good fit for our storage infrastructure.
Migration to New Media
Once an archive is established,
the issue of keeping the data alive must be addressed. In other words, how
can one be assured that the data can be retrieved? Data ooze is the process of migrating data from older media to
new media. This process should be transparent to the end user. Many issues must
be considered when estimating the useful lifetime of a particular type of media. Not
only is media shelf life important, but drive lifetime and software support are
important as well. The primary driving lifetime factor is vendor planned
obsolescence. All lifetime factors must be identified and estimated to compute
the total useful lifetime. Then the amount of time required to ooze the data
off the media must be subtracted to arrive at a usable lifetime figure. A
usable lifetime of one-half the total is the worst case, because the first half is spent
writing data to the media and the second half oozing it to different media.
Some vendors understand this problem but are doing little to solve it. They
talk of the need to reuse media across multiple enhancements of a storage
device family while maintaining backward read capabilities. However, this is a
short-term solution in the sense that as technology advances media will quickly
become obsolete well before its useful lifetime ends. Achieving aggressive
media density increases will require new media formulations that are not
backward compatible. Couple that with a Petabyte-sized archive, and the result
is a data ooze process that cannot complete within a given generation of a
storage device. In fact, the data ooze may outlast the entire set of generation
upgrades for a given device, so the end result is the loss of any benefit from
generation upgrades and loss of media reuse. It may be desirable not to start a data ooze on a given technology until
approaching end-of-life for that technology, but enough
time must be allowed for the ooze to complete before EOL.
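The usable-lifetime reasoning above can be written out as simple arithmetic. The following is a minimal sketch under stated assumptions: the function names and the sample figures (a 6-year total lifetime, a 4 Petabyte archive) are illustrative and not taken from this plan; the 2 Petabytes/year ooze rate is the planning figure used in the Fiscal Year 2008 discussion later in this document.

```python
def ooze_years(archive_petabytes: float, ooze_rate_pb_per_year: float) -> float:
    """Time needed to migrate the whole archive to new media."""
    if ooze_rate_pb_per_year <= 0:
        raise ValueError("ooze rate must be positive")
    return archive_petabytes / ooze_rate_pb_per_year


def usable_lifetime_years(total_lifetime_years: float, ooze_time_years: float) -> float:
    """Usable lifetime = total useful lifetime minus the time spent oozing.

    A result of zero or less means the ooze cannot finish before the
    technology reaches end-of-life.
    """
    return total_lifetime_years - ooze_time_years


# Illustrative numbers only (assumptions, not figures from the plan).
migration = ooze_years(archive_petabytes=4.0, ooze_rate_pb_per_year=2.0)
print(migration)                              # 2.0
print(usable_lifetime_years(6.0, migration))  # 4.0
# Worst case described in the text: half the lifetime writing, half oozing.
print(usable_lifetime_years(6.0, 3.0))        # 3.0
```

Because the ooze time grows with archive size while device generations do not get longer, the usable lifetime shrinks as the archive grows unless the ooze rate grows with it.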
The bottom line is that data ooze is
a huge problem that must be addressed and resolved well in advance. Solutions
will be expensive, both in new device and media purchases and in supporting and
managing the ooze. With Petabyte-sized archives, the data ooze process may well
be the most resource-consuming process of the entire system, and will become a
perpetual operation.
MSS Growth Projections and Long Term Planning
Many factors go into the
estimates for how much total net MSS growth we should plan for. Traditionally, we have used the "rule
of thumb" that the total amount of data in the MSS (including second copies of
files) doubles about every two years, taking all factors into account. Data "ooze" (described above) doesn't
increase the total data on the MSS, but consumes new media, so this needs to be
included in future MSS budgets.
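The doubling rule of thumb corresponds to exponential growth: the size after t years is the starting size multiplied by 2^(t/2). A minimal sketch follows; the function names and the starting value of 1.0 unit are assumptions chosen for illustration, since the plan does not state the current total including second copies.

```python
def projected_size(initial_size: float, years_elapsed: float, doubling_years: float = 2.0) -> float:
    """Size after years_elapsed, assuming it doubles every doubling_years."""
    return initial_size * 2.0 ** (years_elapsed / doubling_years)


def monthly_growth_rate(doubling_years: float = 2.0) -> float:
    """Equivalent compound growth per month for a given doubling time."""
    return 2.0 ** (1.0 / (12.0 * doubling_years)) - 1.0


# Illustrative: an archive holding 1.0 unit of data today.
for years in (0, 2, 4, 6):
    print(years, projected_size(1.0, years))
# 0 1.0
# 2 2.0
# 4 4.0
# 6 8.0

# Doubling every two years is roughly 2.9% compound growth per month.
print(round(monthly_growth_rate() * 100, 1))  # 2.9
```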
Even with the ICESS
procurement, which is capable of increasing the total data flowing into the MSS
from the supercomputers by a factor of three starting in early calendar 2007,
calculations have shown that the doubling-every-two-years rule can still be
used. The bottom line still works
out to be ten Petabytes of total MSS data by the end of FY-09. The following shows a year-by-year
summary of what equipment (tapes, drives, automated cartridge libraries, and
MSS servers and Fibre Channel infrastructure) we are planning to purchase each
fiscal year to accommodate the projected growth:
Fiscal Year 2006
FY-06 is very unusual in the
sense that we anticipate no media purchases for the entire fiscal year. This is made possible by the technology
upgrade to 9940-B tape drives, which write 200 Gigabytes onto the same media
which currently hold only 60 Gigabytes of data in 9940-A format. A data "ooze" was started in the fall
of 2005 that has been freeing up media (to be rewritten in 200 Gigabyte format)
fast enough to stay ahead of all new data entering the MSS. This will continue until the
January/February 2007 time frame.
At that time, all five Powderhorn silos will be full of 9940-B formatted
tapes, and there will be enough empty tapes to accommodate all new data flowing
into the MSS for another six to eight months (fall of 2007).
Ten additional 9940-B
silo-attached tape drives were already purchased early in FY-06, and we do not
anticipate having to buy any additional tape drives this fiscal year.
The Distributed Computing
Services (DCS) software is undergoing a major rewrite in FY-06. The new version (known as DCS version
4) is currently in beta test, and is scheduled to go into production in July
2006. The new version eliminates
the dependency on DCE (Distributed Computing Environment), which has been losing
vendor support. It has numerous
other improvements and enhancements.
Also in FY-06, we plan to
deploy the prototype for the MSS-IV Metadata Server, which will ultimately
replace the current Master File Directory subsystem known as MFDTASK.
Fiscal Year 2007
Things begin to change
dramatically around the middle of FY-07.
The new ICESS machine will be in the process of ramping up to full
production, which will mean that we will start using up existing 9940-B media
at a much higher rate. Our five
existing Powderhorn silos will be fully populated with 9940-B media, with only
enough empty tapes to last us until about the end of FY-07 (Oct 1, 2007). This means that we will need to start
installing additional automated libraries and media in mid FY-07. The current plan is to expand into an
area in the recently vacated second basement (2-B), most likely with the
purchase of a Sun/STK SL8500 10,000-slot tape library, populated with 20 T10000
tape drives along with 1,400 500-Gigabyte cartridges (enough to last roughly 3
months). In addition, we will need
to expand our Fibre Channel data network to accommodate the new drives, and we
will need to install additional Storage Manager servers. Starting in FY-07, we will need to
begin expanding the MSS internal disk cache to keep pace with the larger and
larger volume of incoming data.
This will need to continue in FY-08 and FY-09.
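The cartridge counts above imply a data rate that can be checked with simple arithmetic. The sketch below is illustrative: the function names are assumptions, decimal units (1 Terabyte = 1,000 Gigabytes) are assumed, and cartridges are assumed to be filled to their nominal capacity.

```python
import math


def media_capacity_tb(cartridges: int, gb_per_cartridge: float) -> float:
    """Total nominal capacity in Terabytes (assuming 1 TB = 1,000 GB)."""
    return cartridges * gb_per_cartridge / 1000.0


def cartridges_needed(data_tb: float, gb_per_cartridge: float) -> int:
    """Whole cartridges required to hold data_tb Terabytes."""
    return math.ceil(data_tb * 1000.0 / gb_per_cartridge)


# 1,400 cartridges of 500 Gigabytes each, expected to last roughly 3 months.
capacity = media_capacity_tb(1400, 500)
print(capacity)                     # 700.0
print(capacity / 3)                 # implied rate in TB per month, about 233
print(cartridges_needed(700, 500))  # 1400
```

The same helpers can be reused for other media generations by changing gb_per_cartridge, for example 1,000 for a 1 Terabyte cartridge.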
Also, in FY-07 we will complete
the production version of the MSS-IV Metadata Server and decommission MFDTASK.
Fiscal Year 2008
We will continue writing to the
500-Gigabyte cartridges for most of FY-08, and will need to purchase an
additional 4,200 cartridges.
Sun/STK is now saying that the 1 Terabyte tape drives (which use the
same T10000 media as the 500 Gigabyte drives) will be available in early
calendar year 2008. The MSS Group
plans to take advantage of the new drives as soon as it is feasible, to start
doubling the amount of data on the cartridges. If we are able to do this, we could start writing to the
Terabyte drives in the fall of 2008.
We would need to purchase 20 of the new Terabyte drives and install them
in the SL8500. We would not need
to expand the SL8500, since it has enough capacity to carry us all the way
through FY-08 and into FY-09.
Toward the end of FY-08 we will need to purchase 700 T10000 cartridges (a
3-month supply) to be written on the 1 Terabyte drives. In addition, we are going to need to
start a data "ooze" from 9940-B media onto the new Terabyte tapes at the same
time. We will need an additional
500 tapes for this purpose, assuming we are able to ooze at a rate of 2
Petabytes/year.
Fiscal Year 2009
At the beginning of FY-09, we
hope to be writing all new data onto 1 Terabyte T10000 cartridges. We will need additional SL8500 library
space about halfway through FY-09, so we will need to have money in the budget
for a second library, but not necessarily the maximum size (10,000 slots). If the new data center is ready, we
could install an SL8500 there (along with all necessary Storage Manager servers
and equipment), or we could simply continue to expand in the 2-B area. Either way, the total cost will be
about the same. We will also need
an additional 10-15 tape drives, as more and more read-backs start coming from
the Terabyte tapes.
The total amount of media that
we will need to purchase in FY-09 will be about 2,500 tapes for new data, and
2,000 tapes for the 9940-B data ooze.
Summary
The technology picture becomes
very cloudy when looking 2-3 years out. In this day and age, numerous vendors
promise solutions before they can actually deliver. Even worse, high tech
companies are born and die or are acquired by other companies on a moment's
notice. Where does this leave the consumer? Not in an enviable position.
Selecting vendors and products for long-term relationships and expecting a high
level of support, enhancements, and commitment is a gamble at best. We are
positioning Mass Storage Services for quick reactions to changing technology by
judicious selection of Commercial-Off-The-Shelf (COTS) solutions and applying
industry standards whenever possible. However, COTS and standards solutions are
slow to market when you're on the
bleeding edge of technology. Therefore, custom software and hardware solutions
are required to integrate the pieces.
The current Mass Storage Services are production-centric. We
need to actively work with other peer sites and groups to design and develop
state-of-the-art interfaces and functionality while continuing to maintain the
high-quality production services our users expect.