Research datasets for modeling climate, weather, and the oceans

The Data Support Section (DSS) maintains a large, organized archive of computer-accessible research data that is made available to scientists around the world. The archive represents an irreplaceable store of observed data and analyses and is used for major national and international atmospheric and oceanic research projects. The DSS group began work in 1965 and has been building the data archives and contributing to large projects ever since.

There are now about 530 distinct datasets in the archive, ranging in size from less than 1 MB to over 1 TB. The total volume of data in the DSS archive was 2.4 TB in August 1990 and 13.9 TB in August 2000. We have been adding large amounts of reanalysis data and other analyses. The stored volume has changed over time as follows:

Data Stored for Data Support, and Total Mass Store
(DSS = Data Support Section; MSS = Total NCAR Mass Store)

Date        | DSS bit files | DSS volume | MSS bit files | MSS volume | Volume DSS/MSS
13 Aug 1990 | 61,335        | 2.437 TB   | ---           | 14.430 TB  | 16.9%
4 Aug 1991  | 65,518        | 2.689 TB   | 715,000       | 19.400 TB  | 13.9%
3 Aug 1992  | 80,538        | 3.085 TB   | 1,060,000     | 27.270 TB  | 11.3%
Aug 1993    | 103,314       | 4.072 TB   | 1,351,271     | 36.280 TB  | 11.2%
15 Sep 1994 | 119,703       | 4.751 TB   | 1,849,466     | 47.423 TB  | 10.0%
14 Feb 1995 | 123,877       | 5.085 TB   | 1,966,990     | 52.456 TB  | 9.7%
24 Jan 1996 | 137,680       | 5.950 TB   | 2,486,471     | 67.590 TB  | 8.8%
28 Aug 1996 | 143,340       | 6.770 TB   | 2,888,639     | 78.964 TB  | 8.6%
28 Feb 1997 | 151,509       | 7.513 TB   | 3,289,224     | 91.399 TB  | 8.2%
17 Oct 1997 | 159,945       | 8.482 TB   | 4,046,678     | 110.359 TB | 7.69%
2 Sep 1998  | 167,073       | 10.032 TB  | 5,038,611     | 147.439 TB | 6.80%
7 Sep 1999  | 185,608       | 11.942 TB  | 6,737,448     | 206.885 TB | 5.77%
25 Aug 2000 | 192,404       | 13.875 TB  | 8,187,688     | 267.796 TB | 5.18%

Note: In September 1999, the total NCAR MSS archives were increasing at a rate of about 5 TB per month. This is consistent with the 59.5 TB growth (see above) from 147.4 TB to 206.9 TB in the 12.2 months from September 1998 to September 1999. Also, in the 11.5 months from September 1999 to August 2000, the total mass store grew by 61 TB to 267.8 TB.
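
As a rough check on the growth rates quoted in the note, the short Python sketch below recomputes them from the last three rows of the storage table. The dates and volumes come from the table; the average month length and the helper name `monthly_growth` are assumptions made only for this illustration.

```python
from datetime import date

# (snapshot date, total NCAR MSS volume in TB), copied from the storage table
snapshots = [
    (date(1998, 9, 2), 147.439),
    (date(1999, 9, 7), 206.885),
    (date(2000, 8, 25), 267.796),
]

# Average month length in days (an assumption for this sketch)
DAYS_PER_MONTH = 365.25 / 12


def monthly_growth(start, end):
    """Return (months elapsed, TB added, TB per month) between two snapshots."""
    (d0, v0), (d1, v1) = start, end
    months = (d1 - d0).days / DAYS_PER_MONTH
    added = v1 - v0
    return months, added, added / months


# Compare each snapshot with the one that follows it
growth = [monthly_growth(a, b) for a, b in zip(snapshots, snapshots[1:])]
for months, added, rate in growth:
    print(f"{months:.1f} months, +{added:.1f} TB, {rate:.2f} TB/month")
```

Both intervals work out to roughly 5 TB per month, in line with the rate stated in the note.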

DSS staff provide assistance and expertise in using the archive and help researchers locate data appropriate to their needs. Users may obtain copies of data by network access or on various tape media, or they may read data directly from the NCAR MSS into their computer programs. DSS staff also assist scientists by providing data access programs (to read and unpack data), other software for data manipulation, and dataset documentation. Later in this report, we present more information about the use of the DSS archives.

Main DSS accomplishments

These are the most important accomplishments completed by DSS in FY2000.
  1. Enhance the observations for reanalysis
    • Three new aircraft datasets were done by February 2000.
    • Prepare more raobs and pibals, several new datasets (70% done).
    • This has been a large project during the year. In 6 more months we may finish Version 3 of all of the reanalysis data.
    • More information about these accomplishments appears below as Highlight: Seven sets of world observations upgraded

  2. Make a big gain in world station location information
    • Version 3 of library was done March 2000.
    • Insert into all the raob and pibal upper air data (90% done).
    • Send to NCEP and ECMWF.
    • 80% done by June 2000; 90% done by September 30, 2000.

  3. Send out a lot of data to many users
    • Work with about 800 main users.
    • Give access to 16.6 TB during CY99.
    • Send 7,502 CD-ROMs during past 3.5 years.

  4. Consulting and data planning
    Many users want some consulting help with the data along with help on which dataset might be most useful for their needs.

  5. Make a big enhancement to DSS web server
    This was mostly done by August 2000.

  6. Add much data to mesoscale model archives
    Keep the GCIP grant activities going. We are the model data center. There are now four or five years and over 500 GB of data.

  7. Get more data products from reanalysis observations
    Did more data planning for this. NCEP calculated monthly raobs. Delivered monthly raobs to University of Washington in May 2000.

  8. Progress in the COADS (surface ocean) dataset
    • Receive and prepare 10 new data sources for 1800s - 1949 reprocessing. Reprocessing is underway as of October 2000.
    • New online information and data order forms, 99 outside requests filled in 2000 (through 11 October).

  9. Start the project to gather and scan documents
    Production started March 2000, and about 3,000 pages were done by September 2000.

  10. Make more progress in developing data theory
    We consider how new data additions, user services, data protection, the ability to have 50-year archives, and cost control need to work together for a good archive of scientific data.

Scientists use a lot of data from DSS

This section summarizes how scientists use the data from the DSS-maintained archive. A large amount of data is being used.

Number of requests for data from DSS

People ask DSS to send data on tape (or, in some cases, by ftp over the Internet). The number of these requests has gone down somewhat, but the volume of data has gone up (see table below). People want more data and more years of data, and the resolution of some data has increased, so the volume is higher. Our charge per gigabyte of data has gone down a lot, so people can still afford to obtain larger amounts of data.

DSS is also distributing large numbers of CD-ROMs. From March 1997 to September 2000 (3.5 years), about 7,500 CDs were sent out. Another large amount of data is read directly from the MSS by users' computer programs running on the main computers at NCAR. For this online use, the table counts the unique users during each year; a given online user may use the DSS archive many times during the year. The total number of requests each year has increased from about 665 in 1991 to 908 during 2000:

Main data requests from users

Year | Requests: data on tape, etc. | Requests for CD-ROMs (CDs sent) | Unique online users | Total requests
1990 | --                           | --                              | 256                 | --
1991 | 370 (11 mo)                  | est 10                          | 285                 | ~665
1992 | 417                          | est 10                          | 349                 | ~776
1993 | 435                          | est 10                          | 394                 | ~839
1994 | 408                          | est 10                          | 418                 | ~836
1995 | 376                          | ? (11)                          | 391                 | ~775
1996 | 347                          | ? (35)                          | 399                 | ~770
1997 | 329                          | 146 (1150)                      | 414                 | 889
1998 | 331                          | 164 (1893)                      | 383                 | 878
1999 | 308                          | 141 (2401)                      | 429                 | 878
2000 | 362                          | 124 (2524)                      | 422                 | 908

Note: This table gives the number of data requests for data from DSS. The column for CD-ROMs shows the number of requests and the total number of CD-ROMs sent. One CD-ROM order is often for 10 CDs or more. For data use on NCAR computers, we count the number of unique users each year, not each time that data is used.
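
The "Total requests" column appears to be the sum of the tape requests, the CD-ROM requests, and the unique online users. The Python sketch below checks that reading against the rows where all three components are listed; the variable and function names are illustrative only, and 1995 and 1996 are left out because the table gives "?" for the number of CD-ROM requests in those years.

```python
# year: (tape/ftp requests, CD-ROM requests, unique online users, total as printed)
# CD-ROM request counts for 1991-94 are the table's own estimates ("est 10").
rows = {
    1991: (370, 10, 285, 665),
    1992: (417, 10, 349, 776),
    1993: (435, 10, 394, 839),
    1994: (408, 10, 418, 836),
    1997: (329, 146, 414, 889),
    1998: (331, 164, 383, 878),
    1999: (308, 141, 429, 878),
    2000: (362, 124, 422, 908),
}


def total_requests(tape, cd_requests, online_users):
    """Sum the three request categories, as the table appears to do."""
    return tape + cd_requests + online_users


# Collect any year whose printed total differs from the recomputed sum
mismatches = {
    year: (printed, total_requests(tape, cds, users))
    for year, (tape, cds, users, printed) in rows.items()
    if total_requests(tape, cds, users) != printed
}
print(mismatches or "all listed totals match")
```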

Information contacts and consulting

People want to know whether we have certain data, and they may need consulting help with the science aspects of the data or about technical storage, computing, and error recovery issues. We estimate that we have about 4,000 of these contacts per year. A given "event" with multiple contacts is only counted once. In addition to this, there are many more people who obtain information files or small amounts of data from our web interface.

Volume of data being used

The total volume of DSS scientific data that was used has increased from 5.6 TB in 1995 to 14.42 TB in 2000, as shown in the chart below. The figure gives the quantity of data sent by tapes, CD-ROMs, and by direct mass store access at NCAR. The numbers on the bar graphs give the quantities for each data component and the total, in gigabytes.

[Figure: Total use of research data archives at NCAR, in GB]

Note: The chart shows the total volume of data sent by the SCD Data Support Section. Data sent on tape or CD-ROM are probably used two or three times on average, so this quantity is actually a greater part of total data use than is indicated by this chart. The "Total used on NCAR Computers" includes web data.

NCEP/NCAR reanalysis project

About half of the recent data use is output from the NCEP/NCAR 50-year reanalysis project. This was a huge project by NCEP and by the DSS group at NCAR. It was recognized with a special award by the American Meteorological Society in January 2000. The work on this project started in February 1991; analysis production started at NCEP in June 1994; 23 years were completed in September 1996; and 40 years in October 1997. All 50 years (1948 - 97) were done in July 1998, and since then the analysis has been updated each month to provide a consistent up-to-date analysis to use for seasonal forecasts and for many other purposes.

The reanalysis output (at 6-hour intervals for 53 years) has been very useful for research.

The reanalysis output data have been very popular because they are much better and more extensive than any earlier analyses that were available. They also cover many years (now 53), and the long period is very important for many research tasks. In 2000 it was estimated that about 5,000 research papers have been based on this reanalysis output.

How DSS maintains the data archives

It is necessary to keep updating the data archives and to keep adding a selection of new datasets.

Updating the archives

Many users want datasets of observations and analyses that are up to date. They also want them to extend as far back in time as possible. Many datasets continue in time. We have about 530 datasets (October 2000) and need to keep updating a number of them.

Adding new datasets and helping to create more new datasets

We often participate in projects such as reanalysis, mesoscale GCIP, the US-Russia data exchange, etc. that will help to collect data, organize data and create valuable datasets for research. By helping with large projects such as reanalysis, we create datasets of observations that are much better gathered, checked, and available. But we also help to create the reanalysis output, which is of huge benefit to science.

Also, we try to know enough about scientific activities in the community so that we can pick up valuable datasets when they emerge.

Using new technology

DSS has been monitoring the availability of new technology for many years. The power of computers and the capacity of hard disks have improved remarkably over time. The improved capability of tapes, CD-ROMs, and networks has improved our ability to distribute data. Perhaps the main improvement is cost. While the capability has gone up, the cost has typically either gone down or stayed the same for these types of hardware.

When we send data to users, we know that they do not want to buy a new tape drive each time they obtain another dataset on tape. We also know that most users cannot afford to spend a lot of money for new tape drives. So we usually offer people a choice of three or four types of tapes to obtain data. We monitor the tape drive market, and we listen to the requests and suggestions from users. Periodically, we offer a new type of tape technology for users and delete an old one.

Using different technologies for different purposes

Sometimes there is a tendency to propose the latest, fastest, most expensive technology for all data tasks. This is a mistake because most users cannot afford the expensive new technology. For example, during 1989 - 96, we helped users in over 60 countries around the world obtain climate model data to do climate assessment studies (how will crops, or rivers, or forests change when the climate changes?). Until about 1994, we had to give the summarized model data to people on floppy disks, because many users did not have CD-ROM readers. By about 1994 - 96, almost everyone in the world had a CD-ROM reader on their PC.

The CD-ROMs have been very useful to deliver data. They are popular in research and education. An environment can be created on a CD-ROM that links the data and extra software in convenient ways.

There are also very good roles for information delivery and data delivery via networks.

Monitor tape storage technology to store large batches of data

Various users save large amounts of data. To be able to share information with them, we monitor the cost of equipment that holds larger amounts of data and mounts the tapes automatically. These systems are called tape libraries or data silos. The first table below covers a mid-sized silo that holds 588 tapes; the second covers a large one (5,000 tapes). Approximate hardware costs are given in the text with each table. As the data capacity of each tape increases, one silo with the same number of tapes can hold more and more data.

The new StorageTek tape library holds 588 DLT tape cartridges. The silo and the robot cost about $55,000 total in 1999. The media cost about $70 per tape. The overall hardware, blank tape, and maintenance cost is about $35,000 per year. As the technology changes, this silo will hold the following amounts of data (not compressed):

A lower-cost tape library from StorageTek

Year | Tape drive | One tape | For 588 tapes
1998 | DLT 7000   | 35 GB    | 22.6 TB
2001 | DLT        | 100 GB   | 65 TB
2004 | [Estimate] | e250 GB  | 162 TB

The hardware and maintenance cost for a large tape silo (with 5,000 tapes) is about $350,000 per year. The data capacity of one silo was about 0.87 TB in 1986, and it will hold about 1,650 TB in year 2004, a remarkable gain. These capacities assume no compression; compression done by the tape drives adds a small further gain. The estimate for 2004 was made in March 2000.

A big tape silo with 5,000 tapes

Year | Each tape | Total silo capacity
1986 | 175 MB    | 0.87 TB
1995 | ~900 MB   | 5 TB
1999 | 20 GB     | 110 TB
2004 | e300 GB   | e1,650 TB
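
To put the capacity gain in perspective, the Python sketch below converts the silo capacities listed in the two tables into an implied compound annual growth rate. The capacities are the tables' own figures (the 2004 values are the estimates made in March 2000); the function name and the choice of a compound-growth measure are assumptions made for this illustration.

```python
def annual_growth_rate(cap_start_tb, cap_end_tb, years):
    """Compound annual growth rate implied by two capacities `years` apart."""
    return (cap_end_tb / cap_start_tb) ** (1 / years) - 1


# Large silo (5,000 tapes): 0.87 TB in 1986 to an estimated 1,650 TB in 2004
large_silo = annual_growth_rate(0.87, 1650, 2004 - 1986)

# Mid-sized silo (588 tapes): 22.6 TB in 1998 to an estimated 162 TB in 2004
mid_silo = annual_growth_rate(22.6, 162, 2004 - 1998)

print(f"large silo: {large_silo:.0%} per year")
print(f"mid-sized silo: {mid_silo:.0%} per year")
```

Under these figures, the capacity of a silo with a fixed number of tapes grows by very roughly 40 to 50 percent per year.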

Highlight: Seven sets of world observations upgraded

World weather history for 53 years
Worldwide weather observations are critical to many types of research in meteorology, oceanography, and related disciplines. These observations were essential to be able to prepare the 52 years of global reanalysis done under the NCEP/NCAR reanalysis project. Now, the new analyses are available each 6 hours for 52 years (1948 - 99), and they are updated each month.

Seven large composite datasets are listed in the table below. Each of these combines data from many sources. Version 3 of all of these observations will be ready by about March 2001. The work on most of the component datasets is almost done. However, it will be longer before some of the merged datasets are fully available. Also, we want to include "model tags" with some of the observations, because they give information about how well the models fit the observations, and they can tell us about certain biases in observing stations.

The table below shows the range of years for each observational dataset, as well as when the main observations were started. For example, the world's first geosynchronous weather satellite was launched by NASA over the Americas in December 1966. These satellites take pictures that permit people to see the drift of clouds. And atmospheric winds can be calculated from the drift of the clouds (these are called "Sat Cloud Winds" in the table). There are only a few winds in the digital record during 1967 - 72. This is a pity, and perhaps more can be calculated from archived photographs.

The table shows when the recent work on each type of data started at NCAR, usually about 1991 when our work for the reanalysis project began. This work has been very intense for a number of years. We were able to gather and process much more than a minimum set of data for reanalysis. NCAR ran many diagnostic checks on the data. These often discovered systemic problems. Then we had to run other computer diagnostics to figure out how to solve the problems.

Our earlier NCAR work on gathering these datasets actually started from 1967 to 1981 for different types of data. Therefore we had many sets of data already in the archives that would provide raw material for the reanalysis work. During 1991 - 2000, we have been able to gather additional observations from other sources and process them. And we did many more data checks to remove more problems from our existing archives.

Credits: Our work to prepare observations for reanalysis has been a huge project for us. We also owe a lot of credit to other organizations, individuals and countries that have helped us to obtain data. We can't list them all here, but the USAF (at Asheville) deserves much credit for key-entering many millions of weather observations for early years such as data for the 1940s to about 1971. Also, the National Climatic Data Center (NCDC) at Asheville helped by sending us some of the early datasets. The raob data gathered by MIT for 1958 - 63 was useful. Data from countries such as Argentina, Brazil, Australia, Russia, England, France, China, and others were helpful. The main partners in the COADS project are two NOAA labs (CDC-Boulder and NCDC). Also Canada, PMEL (Seattle), and the data buoy center have helped with buoy data. Many countries participate in the flow of world observations from commercial shipping.

A major thanks should go to all of the weather observers of the world. Without this work by thousands of individuals and their willingness to share data, we would not have observations for 50 years, and we would not have the important output for research from reanalysis projects.

Seven main sets of world observations

   | Dataset                 | Data years | Number of years | Work started | Recent work | Comments
a. | Rawinsondes - upper air | 1946-on    | 55              | 1967         | 1991        | Some earlier data
b. | Pibals - upper air      | 1942-on    | 59              | 1973         | 1991        | Some earlier data
c. | Aircraft                | 1949-on    | 52              | 1973         | 1992        |
d. | Sat cloud winds         | 1967-on    | 34              | 1973         |             | Cover better 1973-on
e. | Satl soundings          | 1969-on    | 31              | 1973         | 1991        | Better 1973-on
f. | Sfc 3-hr synop          | 1948-on    | 53              | 1976         | 1992        | Density incr 1967-on
g. | COADS ocean surface     | 1854-on    | 145             | 1981         | 1988        | Some earlier data

Will active data gathering and preparation remain necessary?

We have already discussed the work that we have to do to maintain the data archives, keep them up-to-date, and add new datasets. To prepare datasets in time for the production timelines of reanalysis, we had to put a very high priority on those necessary datasets. A lot of that work is done, but certainly not all. Now we need to give more priority to some different datasets that need attention.

The fundamental needs for a good data center are:

The world's main reanalysis projects

The work on the NCEP/NCAR reanalysis project started in February 1991. NCAR started doing a lot more work on the observations, and NCEP was developing an improved model. The actual production of analyses started in June 1994 (see table below). A friend in DOE called this the "mother of all reanalysis projects." It was ambitious because it covered so many years.

ECMWF produced a reanalysis for 15 years during 1994 - 96 (resolution T106). In June 2000, they started production of a much longer reanalysis (T159). NASA did about 15 years of reanalysis.

In May 1998, NCEP started production of another reanalysis (NCEP-2) for at least 1979 - on using an improved model. We hope that the surface ocean results will be good enough to drive ocean models. The period 1979 through 1999 was done by September 2000. The data have mostly been transferred to NCAR. Production was run by NCEP using a DOE computer (a Cray J90-16). A similar computer had been used at NCEP, and it can deliver about 1 GFLOPS of real power. During summer 2000, a proposal for another long global reanalysis in the USA was prepared.

The table below lists the main global reanalysis projects. As of 2000, Japan is also making plans to produce a reanalysis.

The world's main reanalysis projects

   | Name         | Work started | Production started | Production ended | Years completed
a. | NCEP/NCAR    | 1991         | 06/94              | 07/98            | 52 (48-99)
b. | ECMWF ERA-15 | 1993         | 06/94              | 09/96            | 15 (79-93)
c. | NASA         | ~1993        | --                 | --               | ~15
d. | NCEP-2       | --           | 05/98              | 09/00            | 21 (79-99)
e. | ECMWF ERA-40 | ~1998        | 06/00              | ~2002            | 44? (58-01)?
f. | NCEP Long    | ~2002?       | ~03-04             | ?                | 58? (48-05)?

Problem: ECMWF got Version 1 of observations from NCAR via NCEP. NCAR is completing Version 3 of the observations. Plans are almost in place to send the latest version to ECMWF.
Note: Chart valid November 2000.

Brief information about several projects

We will give a brief description of several projects. More information can be found in other documents.
  1. Mesoscale model data for North America
    We are the "model data center" for the GCIP North American experiment. There is a grant from NOAA to help do this work. We obtain mostly three-hourly analysis and forecast data from three models (Eta at NCEP, Maps at NOAA FSL, and the GEM model from Canada). There are now about four years of data and over 500 GB.

    NCAR also has earlier mesoscale model data from NWS, dating back to 1971.

  2. Ocean surface data (COADS) and other ocean data
    We have over 70 datasets for ocean research. Also, a number of the atmospheric datasets directly benefit ocean work.

    NCAR has been working on projects to prepare, improve, and update the big COADS ocean surface dataset. The main data are reports from ships and buoys. It is a joint project with other labs. These data form the basis for the world's knowledge of sea surface temperature trends for 150 years. They are also used in the reanalysis projects, and in much other research. There are now over 140 million unique records of data.

  3. Data for the very high atmosphere, 70 - 1000 km
    A project was started at NCAR in 1984 to obtain data from NSF incoherent scatter radars and prepare an archive. Now there are many other types of data, about 57 datasets total. It is called the CEDAR database. This is a joint project with HAO. DSS contributes about 0.85 FTE to this work.

  4. Amount of DSS time spent on three discipline areas
    During 1997 - 2000 (and earlier) we have been spending about 3 FTE of our time on mesoscale, ocean, and CEDAR work. We have a total of 9 science and technical staff members.

  5. Better ways to handle documents
    Many documents are being assembled into similar subject areas, scanned, and put online. The project started in June 1999. Production started March 2000. By November 2000, we had scanned 4,600 pages.

Research and development efforts in DSS

We have described how we keep monitoring the technology that will help us to help users obtain data and reduce costs. We also have to keep inventing methods that will preserve or enhance ease-of-use for users.

DSS staff use their scientific and technical skills to prepare more information about the datasets (metadata), and to invent ways to present it to users. During December 1999 - July 2000, DSS undertook a major project to design and implement improved web pages.

The development of the datasets usually involves a hefty component of R&D work. We have to run appropriate computer diagnostics to discover any systemic errors (data with wrong dates, observations at wrong locations, observations assigned to the wrong atmospheric levels, etc.). Then we have to go through a test-and-discovery process to figure out how to solve the problems.
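
The kinds of diagnostics described above can be illustrated with a minimal Python sketch. It is not the DSS software: the `Observation` record, the field names, the valid ranges, and the 50% threshold are all hypothetical choices made for this illustration. The idea is only that each report gets simple per-record checks, and stations where a large share of reports fail are flagged as candidates for a systemic problem.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Observation:
    """Hypothetical, simplified upper-air report; not a DSS record format."""
    station_id: str
    time: datetime
    lat: float           # degrees north
    lon: float           # degrees east
    pressure_hpa: float  # pressure level of the report


def check_observation(obs, first_year=1948, last_year=2000):
    """Return a list of problem descriptions; an empty list means no flags."""
    problems = []
    if not first_year <= obs.time.year <= last_year:
        problems.append("date outside archive period")
    if not -90.0 <= obs.lat <= 90.0 or not -180.0 <= obs.lon <= 360.0:
        problems.append("impossible location")
    if not 1.0 <= obs.pressure_hpa <= 1100.0:
        problems.append("implausible pressure level")
    return problems


def systemic_suspects(observations, min_fraction=0.5):
    """Station IDs where at least `min_fraction` of reports are flagged."""
    total, flagged = Counter(), Counter()
    for obs in observations:
        total[obs.station_id] += 1
        if check_observation(obs):
            flagged[obs.station_id] += 1
    return sorted(s for s in total if flagged[s] / total[s] >= min_fraction)
```

A station returned by `systemic_suspects` would then be examined by hand, which corresponds to the test-and-discovery step described above.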

A vision of how some data services will evolve

Much of our past success comes from having gathered many important datasets of observations and analyses. Then we have taken the steps necessary to package these so that other people can use them without having to worry about all of the details that we must address. This means that we make data checks. It means putting correct location and time on each observation so that it can be properly used.

If the data are scientifically useful, their use will grow as long as the data are 1) easy to access, and 2) low enough in cost to be affordable.

A lot of the main work of gathering 50 years of observations for reanalysis is done. The coverage of the data should be improved where possible for the 1948 - 55 period. We now have enough of the data that more countries will become willing to contribute still more. There is also a science interest in having more observations for the 1930s and 1940s, especially more surface land pressure data and more of the upper-air data that may still exist. People also need more monthly and daily water data for long periods of time.

Packages of observations for Africa and Latin America

The output from reanalysis projects is being used heavily by the scientific community in the southern hemisphere. However, it still is not easily available to enough scientists. The improvements in low-cost technology will allow us to deliver the data to a lot more people if we do it right. Selections from the seven 50-year datasets of world observations will also help research in these countries. It also helps scientists if they can view satellite picture data along with the other weather observations. The volume of satellite data is often too large to be practical for most users, but we know of ways to retain most of the information and make it practical to use. We could prepare data from polar orbiters for at least 1974 - on (2x/day pictures), and three-hourly geosynchronous from at least July 1983 - on, with considerable amounts of data for the Americas for 1967 - on.

Therefore, we imagine users that have ready access to six-hourly atmospheric analyses, and monthly means from 1948 - on. And they would have the reduced satellite data plus the ability to view three-hourly or daily motion sequences of clouds. And they could also view changes in the annual cycle of clouds over many years. In addition, they could have access to the observations used in reanalysis.

It is important to be able to relate the changes in atmospheric circulation and clouds to the actual amount of precipitation that falls to the ground. Therefore, we imagine that there will be WMO and bilateral projects to make daily and monthly surface precipitation data available from more sites, and to prepare these datasets for continental regions that are larger than typical individual countries. The precipitation estimates based on satellite views would also be made available. Monthly river discharge should also be available for a selection of rivers in the world. The development of this river dataset is making progress.

A lot of the work has already been done to make this new vision of data access possible and practical. However, much work is still needed to deliver these results.

Some of these new developments in data access could be produced rather soon. Most will take more time. Some of the problems are large. It is usually hard to gather precipitation data from many countries. Often there is very little precipitation data for the most recent 10 or 15 years.

A brief 35-year history of DSS

The work of the Data Support Section started in 1965. NCAR realized that people doing research needed help obtaining data and that the data needed to be made easier to access. A summarized history is available (about 10 pages).

More information about DSS projects and plans

This Annual Scientific Report provides only overview information about many of our projects. Other documents with more information will be on the DSS server or are already there.

