
Research data

Data Support activities and plans, FY1999-2001

The Data Support Section (DSS) maintains a large, organized archive of computer-accessible research data that is made available to scientists around the world. The archive represents an irreplaceable store of observed data and analyses and is used for major national and international atmospheric and oceanic research projects. The DSS group started working in 1965 and has been working on large projects and building the data archives ever since.

There are now about 500 distinct datasets in the archive, ranging in size from less than 1 MB to over 1 TB. The total volume of data in the DSS archive was 2.4 TB in August 1990 and almost 12 TB in September 1999. We have been adding a lot of reanalysis data and other analyses. The change of data storage with time has been as follows:

Data stored for Data Support, and total Mass Store

             Data Support Section    Total NCAR Mass Store   Volume
Date         Bit Files     Volume    Bit Files      Volume   DSS/MSS
13 Aug 1990     61,335   2.437 TB          ---   14.430 TB   16.9%
4 Aug 1991      65,518   2.689 TB      715,000   19.400 TB   13.9%
3 Aug 1992      80,538   3.085 TB    1,060,000   27.270 TB   11.3%
Aug 1993       103,314   4.072 TB    1,351,271   36.280 TB   11.2%
15 Sep 1994    119,703   4.751 TB    1,849,466   47.423 TB   10.0%
14 Feb 1995    123,877   5.085 TB    1,966,990   52.456 TB   9.7%
24 Jan 1996    137,680   5.950 TB    2,486,471   67.590 TB   8.8%
28 Aug 1996    143,340   6.770 TB    2,888,639   78.964 TB   8.6%
28 Feb 1997    151,509   7.513 TB    3,289,224   91.399 TB   8.2%
17 Oct 1997    159,945   8.482 TB    4,046,678  110.359 TB   7.69%
2 Sep 1998     167,073  10.032 TB    5,038,611  147.439 TB   6.80%
7 Sep 1999     185,608  11.942 TB    6,737,448  206.885 TB   5.77%

Note: In September 1999, the total NCAR mass store archives are increasing at a rate of about 5 TB per month. This is consistent with the growth (see above) from 147.4 TB to 206.9 TB (or 59.5 TB) in the 12.2 months from September 1998 to September 1999.
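The growth rate quoted in the note can be checked with a few lines of arithmetic. The sketch below uses only the two Mass Store volumes from the table (2 Sep 1998 and 7 Sep 1999). The function name and the average month length of 365.25/12 days are choices made for this illustration and are not part of the report.

```python
from datetime import date

# Assumed average month length in days (not stated in the report).
DAYS_PER_MONTH = 365.25 / 12


def monthly_growth_tb(start_date, start_tb, end_date, end_tb):
    """Return (months elapsed, TB added, TB added per month)."""
    months = (end_date - start_date).days / DAYS_PER_MONTH
    added = end_tb - start_tb
    return months, added, added / months


# Total NCAR Mass Store volumes taken from the table above.
months, added, rate = monthly_growth_tb(
    date(1998, 9, 2), 147.439, date(1999, 9, 7), 206.885
)
print(f"{months:.1f} months, {added:.1f} TB added, {rate:.2f} TB/month")
```

With the unrounded table values this gives a little under 5 TB per month, in line with the rate stated in the note.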

The DSS staff provides assistance and expertise in using the archive and helps researchers locate data appropriate to their needs. Users may obtain copies of data by network access, on various tape media, or they may use data directly from the NCAR MSS. DSS staff also assist scientists by providing data access programs (to read and unpack data), other software for data manipulation, and dataset documentation. At a later point we will present more information about the use of the DSS archives.

Overview of Data Support efforts - Reanalysis

Our Data Support group has been very busy during 1991-99 helping several large reanalysis projects. This has pushed us to the limit, but it has been very helpful for world research. We will remain very busy from about September 1999 to January 2000, getting the last sets of observations to ECMWF that they need for their new ERA-40 reanalysis (for 1957 - 2001). After that we will still have more observations to prepare and problems to fix, but the pace can be reduced.

NCAR has all the output from the NCEP/NCAR 50-year project (1948 - 98). We will keep doing a lot of user services. We must get the data from the NCEP-2 project (1979 - 98), and then we must get ready for the large amounts of data from the European ERA-40 project.

A lot of work remains to prepare documents for all of the observations used in reanalysis. This has to get a higher priority. And we also have to pull together a lot of other documents so that they are more easily accessible.

The volumes of data that people want to use are increasing. We present information about technology changes. We will plan to use new technology to help deliver increasing amounts of data to users. Some universities may want to keep a considerable amount of data in a low-cost local mass store. We will develop bulk data delivery methods, to help give them access to large datasets and to help put data into the local mass stores.

Our archives of mesoscale model data have been progressing well. They need somewhat more attention to include related products (monthly means and verification grids). Because of the reanalysis pressure, we have had time to do only about 40% of what people need in order to have better access to the recent runs from climate models. As reanalysis starts taking less time, we hope to get back to this project. We have also had to put off some work needed on the server (for Internet, etc.). This is also in our future plans.

The core of our data services depends on doing a lot of updates to many datasets, and on bringing in new datasets that people need.

To help guide future data planning, we have included a section about the considerations that we use to plan for a long-term archive that will last for 50 years and give good service for users. The agencies are having a fair amount of trouble in achieving good data services and in managing costs. We have been involved in the NASA planning process.

We have been very busy on the huge reanalysis projects as shown in Figure 1. This effort will continue but at a somewhat lower level. We hope that this will give us some time to get back to other projects. However, we lost one staff member in February 1999.

Reanalysis -- A crunch on our time in the 1990s

Reanalysis has been a very large project for our Data Support Section in the 1990s. We have prepared many datasets and removed a lot of old bugs. The world's observations are now in much better shape because of the project. We also brought in many new datasets that we did not have before. It has been a very intense project. The analysis outputs are very helpful to world research, but they also take time.

How did we cope with this new amount of work? We had 7 technical staff in 1990. From September 1991 to February 1999 we had 10 technical staff, of whom 0.85 FTE worked on the high atmosphere data (70 - 1000 km), which left about 9.1 technical staff for everything else. Since February 1999, we have had 9 technical staff members. Here's how we coped:

Figure 2 shows the significance of the reanalysis data relative to all DSS data used on NCAR computers. For 1996 - 1998, nearly half of all data (in Gbytes) from the DSS archives used at NCAR was from the NCEP/NCAR Reanalysis.

In spite of this, our main archive updates and services have held up rather well. But we have catch-up work to do.

Additional major accomplishments during FY1999

We delivered observations to ECMWF for reanalysis of years 1957 - 98. We delivered 95% of the necessary data. We are also doing checks on new data (for old years) and updates not available before. Several of the new sets have also been sent.

We completed a big update of COADS (world surface marine data) for 1980 - 97. This included more data for 1980 - 95 and added years 1996 - 97.

The mesoscale model data for GCIP was almost fully populated during FY1999. Several older years of data arrived at NCAR, and the new data are almost up to date.

User interaction: We delivered 2,540 GB on tapes, 2,300 GB on CD-ROMs, and users accessed about 8,000 GB on computers at NCAR. We answered many user questions by Internet and phone.

Update the datasets: We got most of the necessary updates done and added about 20 new datasets.

What types of data are in DSS archives?

There are now over 500 distinct datasets in the archive, ranging in size from less than 1 MB to over 1 TB. The total volume of data in the DSS archive was 2.4 TB in August 1990 and 10 TB in October 1998. We have been adding a lot of reanalysis data and other analyses.

A broad summary of our data holdings (valid October 1998)

Dataset group (count of main datasets at end of each line)
Output data from NCEP/NCAR reanalysis, 50 years (4x/day) (Volume about 3.5 TB for NCEP plus ECMWF) 21
Separate CD-ROMs from reanalysis (Volume on 27 unique CDs is 18 GB) 27
Observations for reanalysis (in 1997, surface and upper air data) (About 180 GB, not level 1 satellite data) 38
Mesoscale model data (North America) 16
Datasets of surface observations and related data ~60
Datasets about the earth's surface ~20
COADS world ship and buoy data 6
Other datasets (not COADS) for ocean work ~70
Various analysis grids ~20
Main operational analyses from NCEP (was NMC) ~15
Main operational analyses from ECMWF ~7
Climate model data for assessment studies 25
Climate trends datasets ~7
Climatology and circulation statistics ~17
Cloud data ~15
Stratospheric datasets, mostly gridded ~10
Datasets for the very high atmosphere (70-1000 km) 57
Main radiance data from satellites ~25
Data for the FGGE year (1979) 8

Comment: The above list has about 464 datasets. Our actual list of datasets has over 500 items, but sometimes several logically different files of data are held within one dataset folder.
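The figure of 464 in the comment can be reproduced by summing the counts in the summary list. The snippet below is only a quick check. The counts are copied in table order, and the entries marked "~" in the table are treated as exact for the purpose of the sum.

```python
# Dataset counts copied from the holdings summary, in table order.
# Entries marked "~" in the table are approximate.
counts = [21, 27, 38, 16, 60, 20, 6, 70, 20, 15, 7, 25, 7, 17, 15, 10, 57, 25, 8]

total = sum(counts)
print(f"{len(counts)} groups, {total} datasets")
```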

Data services from DSS

The Data Support (DSS) archives are used heavily. During 1997 and 1998, there was a total of about 12 TB of data used each year from our archives. We show figures that summarize data use during the past several years.

In September 1999, our total DSS archive was 11.9 TB. Some popular datasets, such as reanalysis, are read from the archives multiple times during a given year.

The DSS archives are used heavily by many people. Data are often most heavily used during the first 3 to 5 years of their life. This means that there need to be new projects to add new data, while still saving most of the old data. The total rate of data delivered was about 12 TB per year during 1997 and 1998.

  1. Online users of DSS data at NCAR

    Many users read DSS data into their programs that are run on NCAR computers. We see (Figure 1) that this use has increased from 1,100 GB of DSS data read in 1990 to 12,505 GB in 1999. Figure 2 shows that 47% to 49% of the data used online during 1996 - 98 were data from the big NCEP/NCAR reanalysis project. In 1999, data from this project accounted for 61% of the data read.

    Table 1 gives additional statistics about the online use of DSS data on NCAR computers. This does not include the use by our staff in the DSS section. We are pleased by the amount of data being used.

    The use of DSS data on NCAR computers.

    Year   Unique users   Number of reads   GB       Usage by universities
    1995   391            81,476            4,666    78%
    1996   399            88,994            5,621    74%
    1997   414            95,044            8,419    69%
    1998   383            83,172            7,872    60%
    1999   429            101,216           12,505   49%

  2. Summary of volume of DSS data used

    Data Support (DSS) sends data on tapes (several types of tapes) and CD-ROMs. By September 1999, we had sent 4,883 CD-ROMs based on the NCEP/NCAR project. These held about 660 MB each (a total of 3,223 GB). So far, these have gone to people in a number of countries. The total use of data from DSS has grown from 5,600 GB in 1995 to 11,800 GB in 1998 (11.8 TB). See the table and figure below.

    Total use of data from DSS (GB)

    Year   Send tapes   Send CD-ROMs   Online   Web   Send for projects   Total (TB)
    1995   385          23             4,666    30    500                 5.60
    1996   1,471        7              5,621    30    100                 7.29
    1997   2,069        781            8,419    30    500                 11.80
    1998   2,121        1,280          7,872    30    400                 11.70
    1999   2,418        1,607          12,505   30    80                  16.64
    Note: Portions of the table above have been revised from what was published in previous Annual Scientific Reports. This table has been corrected to report data only from full calendar years.
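The Total column in the table above is the sum of the five delivery routes, converted from GB to TB. The sketch below recomputes it from the printed figures and also checks the CD-ROM volume quoted in the text. It assumes 1 TB = 1000 GB, which is what the printed totals suggest; the names `usage` and `row_total_tb` are illustrative only. With the figures as printed, the 1996 components sum to 7,229 GB while the printed total is 7.29 TB. The sketch only flags this and does not try to say which figure is correct.

```python
# Per year: (tapes, CD-ROMs, online, web, projects) in GB, then the printed total in TB.
usage = {
    1995: ((385, 23, 4666, 30, 500), 5.60),
    1996: ((1471, 7, 5621, 30, 100), 7.29),
    1997: ((2069, 781, 8419, 30, 500), 11.80),
    1998: ((2121, 1280, 7872, 30, 400), 11.70),
    1999: ((2418, 1607, 12505, 30, 80), 16.64),
}


def row_total_tb(parts_gb):
    """Sum the delivery routes and convert to TB (assuming 1 TB = 1000 GB)."""
    return sum(parts_gb) / 1000


for year, (parts, printed_tb) in usage.items():
    computed = row_total_tb(parts)
    flag = "" if abs(computed - printed_tb) < 0.01 else "  <-- differs from printed total"
    print(f"{year}: computed {computed:.2f} TB, printed {printed_tb:.2f} TB{flag}")

# CD-ROM volume quoted in the text: 4,883 discs at about 660 MB each.
cd_volume_gb = 4883 * 660 / 1000
print(f"CD-ROM volume: {cd_volume_gb:,.0f} GB")
```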

A quick view of main projects for FY2000-2001

We will outline the projects that we will work on during the next two years. In later sections of the report, we will discuss most of these items in more detail.

  1. Update the archives and obtain new datasets

    It is necessary to have updated datasets and a selection of new datasets in order to provide good data services. This requires that we have information about what data people need and that we know about new data that becomes available.

  2. Keep the main data flows coming from NCEP

    Keep the main data flows from NCEP coming in to NCAR (both observations and analyses). These data are key to giving many data services for the whole community. During about 1997 - 99 NCEP has been going through many changes in their data-handling software and hardware. In 1999 they lost the person who had been helping send the "Advanced analysis archives" to NCAR. As of October 1999, these advanced data have not been updated since March 1999. Also, the NCEP budgets have not been very good. All of this has made extra work for both NCEP and NCAR. Fortunately, there is a good spirit to solve the problems.
    Note: On Sep 27, 1999, NCEP had a fire in their computer room that destroyed a motor-generator necessary to operate their Cray C-90 (still the production computer). This won't help!

  3. Keep providing data services

    Several staff are always involved in the tasks of interacting with users, helping people obtain data access programs, and sending data. These tasks are closely coupled with the tasks of updating datasets and obtaining other new data that users need.

  4. Obtain the output from the NCEP-2 Reanalysis

    This reanalysis of 1979 - 98 (20 years) started production in May 1998, and it is about 80% done in September 1999. We have obtained a little of the data. Significant amounts will start coming about January 2000. Assuming that the volume per year will be about 70% of the data from the 50-year reanalysis, the total volume is about 1,116 GB, and NCAR would obtain 780 GB.

    Arrival date Total volume For NCAR
    Jan - Dec 2000 756 GB 530 GB
    Jan - Jun 2001 360 GB 250 GB

  5. Obtain output from the ECMWF ERA-40 Reanalysis

    This reanalysis will be for 1957 - 2001, about 45 years. The total output volume will be about 25 TB. Assuming that NCAR obtains about 60% of this, or 15 TB, it may arrive as follows:

    Arrival date Volume
    Sep - Dec 2000 500 GB
    Jan - Dec 2001 5 TB
    Jan - Dec 2002 5 TB
    Jan - Dec 2003 4.5 TB

  6. Maintain and expand the mesoscale model archive

    About 400 GB of data will arrive per year.

  7. Obtain data from the 20-year NCEP Mesoscale Reanalysis

    This will be an analysis of North America at a resolution of about 30 km, with data every 3 hours. There will be both analyses and forecasts. We do not have firm volume estimates; our guess is that the volume will be about 4 TB.

    Arrival date Volume
    Aug - Dec 2000 0.5 TB
    Jan - Dec 2001 2.5 TB
    Jan - Jun 2003 1.0 TB

  8. Make a CD-ROM with data from 4 or 5 climate models

    We already have archives of year-month data and selected daily data from a few main climate model runs. Some of the data products (e.g., decadal means) that have heavy use should be on a CD-ROM and online.

  9. Do data services for CEDAR (data from 70 - 1000 km)

    About 0.85 FTE in DSS is needed for these data services. Also, the project has other staff in HAO.

  10. Provide more observations for reanalysis

    By looking at data counts during 1948 - 98, and by looking at forecast scores during the 50-year reanalysis, we have identified some periods (and regions) where we should provide more observations if we can obtain them. We will keep working through the list of over 20 data tasks to improve the observations for reanalysis, and take advantage of new data opportunities that arise. But we can't spend as much time on this as we did during 1995 - 98.

  11. Status of projects to prepare and handle reanalysis data

    A considerable amount of time is still needed to prepare more datasets for reanalysis and to prepare the datasets for a proper archival future along with documents. This would take more than six months work even without other pressures. We need to complete documents for both the input and output datasets of observations and take steps to be sure that both data and information will be secure against loss. The present state of this work is about as follows:

    Task   Percent done (in July 1999)
    Prepare reanalysis datasets 85%
    Protect data inputs and outputs against possible loss 40%
    Prepare documents about reanalysis data 65%
    Preserve the documents against possible loss 20%

    Also, we keep adding some data tasks as new opportunities arise. I am trying to figure out some quicker things to do to preserve most of what has already been accomplished. An update: as of October 1999, we have sent ECMWF about 95% of the observations that are both desirable and possible to include in the ERA-40 project (which will analyze 1957 - 2001).

  12. Data Rescue

    We plan to keep doing our share of the work necessary to save old, valuable data before they are lost. Now we are working to copy a lot of old satellite data tapes (7-track tapes written about 1970).

  13. Digitize more of the world's old data

    People are interested in doing century-scale research on climate, climate trends, and regional interactions. To do this, it would certainly help to have more of the world's old weather observations put into digital form for use on computers. Most of this work should be done by other bigger data centers. But we can encourage the programs. We are trying to establish a program to digitize some old ship log data (research ships) in Russia.

  14. Many datasets are growing in size; cope with these problems

    Many datasets will grow in volume. Users will want more volume. They will not have more money to spend. We will develop "bulk data delivery" and other ways to help solve the problems of easy access to growing amounts of data.

  15. How would we send one TB of data to a user?

    One TB of data is a lot of data. It used to be very difficult and expensive to send this much data to another place. To send one TB to a user in 1972, we would need to ship 25,000 full tapes. Newer technology is now making this task easier. In 1999, it only requires 17 to 30 tapes to hold one TB, and this number of tapes will decrease. But it still takes a lot of time to copy one TB even at today's higher data rates. To decrease the cost of sending one TB, we have to control storage costs and minimize the people and hardware costs to make the copy.

    The tape options listed for the period 1960 - 1980 were for standard 10.5-inch-diameter reels of half-inch-wide tape, 2400 feet long. The read speed in early years varied with the type of drive. We give the speed for good-quality drives. The time to copy a TB is given for a process that is 80% efficient (perhaps this is too optimistic).

    Computer data tapes during 1960 - 2003

    Date      Technology                     Data/tape   Tapes/TB   Burst copy speed   Copy time for one TB (hours)
    1960      7-track, 1/2 in. tape          12 MB       83,300     70 KB/s            4960
    1972      9-track, 1600 BPI, 150 IPS     40 MB       25,000     240 KB/s           1447
    1980      9-track, 6250, 0.5", 200 IPS   125 MB      8,000      1.25 MB/s          278
    1986      IBM 3480 cartr., 0.5"          200 MB      5,000      3.0 MB/s           116
    4/1995    DLT 4000                       20 GB       50         1.5 MB/s           232
    1/1997    DLT 7000                       35 GB       29         5.0 MB/s           70
    12/1999   Exb, Mammoth-2                 60 GB       16.7       12.0 MB/s          29
    2000      Three types                    100 GB      10.0       10.0 MB/s          35
    2003      Estimate                       e250 GB     4.0        10.0 MB/s          35
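The Tapes/TB and copy-time columns follow from the tape capacity, the burst speed, and the 80% efficiency mentioned in the text. The sketch below shows that arithmetic for a few rows of the table. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes, 1 KB = 10^3 bytes), which is an inference from the round numbers in the table rather than something the report states. The function names are illustrative.

```python
TB_BYTES = 1e12     # decimal terabyte (assumed from the table's round numbers)
EFFICIENCY = 0.80   # fraction of burst speed sustained, as stated in the text


def tapes_per_tb(tape_capacity_bytes):
    """Number of full tapes needed to hold one TB."""
    return TB_BYTES / tape_capacity_bytes


def copy_hours_per_tb(burst_bytes_per_s, efficiency=EFFICIENCY):
    """Hours needed to copy one TB at the given burst speed and efficiency."""
    seconds = TB_BYTES / (burst_bytes_per_s * efficiency)
    return seconds / 3600


# (label, capacity in bytes, burst speed in bytes/s), taken from the table.
drives = [
    ("1960 7-track", 12e6, 70e3),
    ("1972 9-track, 1600 BPI", 40e6, 240e3),
    ("1980 9-track, 6250", 125e6, 1.25e6),
    ("1997 DLT 7000", 35e9, 5.0e6),
    ("1999 Mammoth-2", 60e9, 12.0e6),
]

for label, capacity, speed in drives:
    print(f"{label}: {tapes_per_tb(capacity):,.1f} tapes/TB, "
          f"{copy_hours_per_tb(speed):,.0f} hours/TB")
```

Under these assumptions the results agree with the table to within about an hour; small differences, such as for the DLT 7000 row, look like rounding.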

Prepare to handle the outputs from reanalysis

The output data from long global reanalysis projects has been very useful for science. But this is a lot of data. There was about 3.2 TB from the NCEP/NCAR project, and NCAR may obtain about 20 TB from the ECMWF ERA-40 project.

1) Summary: DSS plans for reanalysis output

Reanalysis Years Span Completed Remarks
NCEP/NCAR 51 1948 - 98 07/1998 We have output
NCEP 2 20 1979 - 98 Not yet Start getting output
ERA-15 15 1979 - 93 09/1996 We have output
ERA-40 45 1957 - 01 Not started Get ready
NCEP 3 55 1948 - 03 Not started

NCAR received about 54 GB per year from the long NCEP/NCAR reanalysis. This gives about 2.75 TB for 51 years, and a total of 3.17 TB counting older versions of reruns.

We will probably receive about 35 GB per year from NCEP 2, for a total of 700 GB.

2) Activities for user services on NCEP/NCAR 51 years and ERA-15 years of data.

These services for users are working well. We do not plan changes, but user contacts and sending data do take a fair amount of time. We probably need a better guide to describe the model characteristics.

During 1997 - 99, about half of the data that users accessed from us was reanalysis data.

3) CD-ROMs from NCEP/NCAR Reanalysis

Sales of Reanalysis annual CD-ROMs from NCAR (Cumulative)

Date   Unique CD   Orders   CD-ROMs sold
Apr 21, 1997 8 14 81
Nov 24, 1997 15 136 1,041
Feb 28, 1998 18 (1979-96) 185 1,563
Jan 4, 1999 28 (1970-97) 310 3,043
May 10, 1999 31 (1967-97) 352 3,784
Aug 3, 1999 38 (1961-98) 401 4,578
Oct 1, 1999 41 (1958-98) 424 4,978
Dec 31, 1999 41 (1958-98) 451 5,444

4) Data from the NCEP 2 Reanalysis (1979 - 98)

NCEP 2 has completed 1979-92. It is being run by Kanamitsu at NCEP on a DOE computer, a Cray J90 (1 Gflops) at Livermore. At the reanalysis conference (August 1999), there were several papers that made comparisons of the output.

5) Prepare to handle output from the ECMWF ERA-40 reanalyses.

ERA-40 (for 1957 - 2001) will start production about February to April 2000. They may start sending output data within 6 to 12 months from the start. Production will last for about 2 to 2.5 years.

6) Summary of the plans

The reanalysis output has a huge variety of data types. It has analyses of temperature, wind, and pressure. It has radiation, precipitation, snow amount, and soil temperature. We have to be able to help people with questions about the data.

Taking care of data and documents for reanalysis

We officially started preparing datasets for reanalysis in early 1991, based on a lot of other archive development work going on since 1965. It has been an intense project; for about 3 years we had to use 6 to 6.5 FTE in our group to stay ahead of NCEP needs. We also kept trying to get more observations than we promised in order to achieve still better analyses. It has worked out well, but there is a lot of clean-up work to do, and still more data to prepare.

Plan:

