PySpark for "big" atmospheric data analysis
Banihirwe, A., Paul, K., & Del Vento, D. (2018). PySpark for "big" atmospheric data analysis.
Title | PySpark for "big" atmospheric data analysis |
---|---|
Genre | Conference Material |
Author(s) | Anderson Banihirwe, Kevin Paul, Davide Del Vento |
Abstract | Using NCAR's high performance computing systems, scientists perform many kinds of atmospheric data analysis with a variety of tools and workflows. Some of these, such as climate data analysis, are time intensive from both a human and a computer point of view. Often these analyses are "embarrassingly parallel," yet many traditional approaches are either not parallel or excessively complex for this kind of work. This research project therefore explores an alternative approach to parallelizing them. We used PySpark, the Python interface to Apache Spark, a modern framework for fast distributed computing on Big Data. We have successfully installed, configured, and utilized PySpark on NCAR's HPC platforms, such as Yellowstone and Cheyenne. For this purpose, we designed and developed a Python package (spark-xarray) to bridge the I/O gap between Spark and scientific data stored in netCDF format. We applied PySpark to several atmospheric data analysis use cases, including bias correction and per-county computation of atmospheric statistics (such as rainfall and temperature). In this presentation, we will show the results of using PySpark on these cases, comparing it to more traditional approaches in terms of both performance and programming flexibility. We will present comparisons of numerical details, such as timing and scalability, along with code examples. |
Publication Date | Jan 8, 2018 |
OpenSky Citable URL | https://n2t.org/ark:/85065/d77m0bjt |
CISL Affiliations | TDD, IOWA, USS, CSG |
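The abstract describes per-county computation of atmospheric statistics as an "embarrassingly parallel" workload that PySpark expresses with operations like `rdd.reduceByKey`. As a minimal sketch of that computational pattern, here is the same group-and-aggregate step written with only the Python standard library; the county names and rainfall values are purely illustrative, and the actual spark-xarray netCDF pipeline is not shown.

```python
from collections import defaultdict

# Hypothetical (county, rainfall_mm) pairs, as might be extracted from
# netCDF grid cells joined against county boundaries (illustrative data).
records = [
    ("Boulder", 12.0), ("Boulder", 8.0),
    ("Larimer", 5.0), ("Larimer", 7.0), ("Larimer", 6.0),
]

def mean_by_key(pairs):
    """Group (key, value) pairs and average per key -- the same shape of
    computation PySpark distributes via map/reduceByKey over partitions."""
    sums = defaultdict(lambda: [0.0, 0])
    for key, value in pairs:
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

print(mean_by_key(records))  # {'Boulder': 10.0, 'Larimer': 6.0}
```

In PySpark the equivalent would parallelize `records` into an RDD and aggregate per key across worker nodes, which is what makes the per-county statistics scale to large netCDF archives.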