Consulting Services Group  
   
The National Center for Atmospheric Research (NCAR)



 
Consulting Services Projects 2009

CSG Projects

20th Oct 2009

Project Short Description
33 CSG ran ATP and verified that the system is in good health after IBM worked on it.
32 A CSG member presented a poster on lossless data compression at the AGU Fall Meeting in San Francisco.
63 Mathew Rothstein reported a problem compiling CAM with PathScale on lightning: the compiler crashes with a segmentation fault. After some work, I was able to reproduce the problem when compiling the whole source tree with his (fairly large) configure/makefile. I'm trying to create a small source tree with a small makefile to submit as a bug report for the compiler.
71 After the problems encountered with mpiP (see http://mpip.sourceforge.net/; it requires a compatible GNU binutils installation, or libelf or libdwarf, for its source lookup and demangling features; we do not have any of those, so the tool is crippled; Jeph installed binutils, but the tool still does not provide source code demangling), I am investigating Vampir, seen at SC08.
80 I went through the Bluefire job examples and studied related software such as WRF and netCDF. I also moved the old documentation files into internal folders for security reasons.
72 Timing, customizability and reporting will be enhancements. After investigating different options (namely the TextTest Python framework, a custom Java solution, a custom Java solution based on the TestNG framework, and of course the current script-based one), I chose the Java solution based on the TestNG framework. Input format defined. Output format drafted. Local monitor and remote runner implemented (some improvements needed). Evaluate possible non-Java alternatives.
44 Support compilers such as PathScale and Intel on Linux platforms
61 As I mentioned at one of our weekly meetings, I encountered large variability of performance in some WRF runs. "Large variability of performance" means the fastest run is ~4 times faster than the slowest! And this is exactly the same code, with the very same input and parameters! Here "performance" means only computational and MPI time, with no I/O involved. I decided to benchmark a few low-resolution 32-node cases. I identified interesting cases (i.e. wrote code to find them among the large number of WRF runs I did for the poster and the bluefire WRF benchmarking). I recreated the same environment and submitted the jobs with the "-m" option to have the same topology as in the previous runs; the results were surprising: the slowest became the fastest and vice versa. I then tried a different strategy: running the same job several times within the same LSF job (i.e. same nodes and possibly same topology). The results are not clear, but the variability seems related to switch congestion. I wrote custom code reproducing only the basic WRF communication pattern and ran it. I also ran some tests (WRF and custom) during the shutdown, when no other jobs were running and thus switch congestion was ruled out. I improved the code, making it work at large node counts, and I ran a 126-node and a 128-node job during the downtime. I'm writing data mining and visualization code (due to the large amount of data collected, the data analysis is going to be tougher than I thought).
35 A toy problem has been created to study the I/O problem.
78 Learn and test the ATP benchmark suite; test the WRF parallel benchmark.
77 Create more tests on bluefire and add modules clauses for bluefire.
29 TIMEGCM scales quite badly at just 64 tasks.
39 Early this year Nancy Collins and Jeff Anderson described their computational workflow under DART. It was clear that, without being able to schedule multiple parallel jobs from a single submission, they would be severely limited in their computational ability.
41 After the upgrade of bluevista to level 5300-07-01-0748, some users reported a slowdown of about 10-12%. A CSG member verified and confirmed this claim using FV CAM.
81 I have obtained and activated all necessary NCAR and CISL accounts, and also managed to get access to the supercomputers. In addition, I read the related documentation and went through useful pages.
60 Reviewed the talk, updated a few slides, did a quick dry run (alone) and taught the students.
56 Dedicated consultant. Worked with SSG to test the special queue with 8 dedicated nodes and 4 spare non-dedicated nodes. Worked with Keith Lindsay to understand their requirements. Helped with setting things up.
73 We are participating in the WAG CMS working group bi-weekly meetings. There I've written the quickstart guide and the MPIIO documentation, and I drafted the leakparser doc as well. The CISL Drupal instance is available for pre-production work; I'm transferring some content to it from our local instance and helping the sysadmins fix some problems and identify more details in the use cases. I participate in WEG-organized Drupal discussions, training and meetings. Ironing out the latest details, like access policies. Working with Jeff Alipit and Alejandro Chaux on the CISL Drupal instance.
66 NCO 3.9.5 has a code bug, and its developers strongly recommend upgrading to 3.9.6.
76
62 TAU is a modular program and performance analysis tool framework. It can be compiled against (and interact with) many related tools/libraries. The most important ones are PAPI and PDT, for hardware performance counter access and automatic instrumentation respectively. Both have been compiled and installed in contrib. TAU provides a suite of static and dynamic tools that provide graphical user interaction and interoperation to form an integrated analysis environment for parallel Fortran, C++, C, Java, and Python applications. In particular, it has a performance profiling facility and a trace analysis section. Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. Compiled the new tau-2.18.2p4 with PAPI and PDT support and many optional modules. Ran the TAU test suite (modified it to fix a couple of bugs; the changes will be sent upstream). Compiled scalasca-1.2. Unfortunately, the Scalasca visualization tool requires Qt4, which is not available on any CISL machine (I installed it on my laptop). Both tools have been tested with a real-life use case (WRF) for profiling. Short documentation for our users has been written. The older installations, tau-2.18.1p1 and scalasca 1.1, are still in place; they will be removed.
67 Following Mike's suggestion, libmhd has been tested on CAM. A parser/GUI for the output file has been developed to provide a user-friendly tool. I also tested mtrace, the GCC equivalent of libmhd. The Fortran leaks that mtrace failed to find were leaks only when compiling with xlf (with no "forced" standard compliance).
68 Helping with data-transfer.
30 Helping the CHAP project for user juliam
28 Starting next month and until July, the hurricane forecast will run twice a day (at 9:30 AM and 9:30 PM) on 8 be nodes.
75 Build proper module files for packages under /contrib and add modules clauses into dotfiles for bluefire.
40 A CSG member started participating in the Green Density project
69 I wrote an MPI IO tutorial, available here: http://www.cisl.ucar.edu/docs/pdf/MPIIO.pdf It will be available as hypertext (HTML) once the NCAR-wide Drupal CMS is ready
59 I wrote a feature request for Platform about the "bquery" command (possible extension to bhist), Platform ticket # 1-91494252. I also submitted a (completely unrelated!) LSF bug, Platform ticket # 1-91359666
79 I learned how to create a cron job and make a reservation on Bluefire. I also installed netCDF, WRF and WPS on local machines and Bluefire. In addition, I tried to run some benchmarks.
42 Mentoring a SOARS student in numerically solving the linear advection equation on the surface of a sphere and, time permitting, the shallow water equations on the same geometry. The strategy is to use the Runge-Kutta discontinuous Galerkin approach, as done by Nair et al. on the cubed sphere; we will extend it to the squared sphere map.
36 CSG members are investigating the unusually long startup time for 0.1° POP, an ASD project from Frank Bryan.
47 Fielding bug reports and usage questions for CISL legacy math libraries
38 We want Platform Computing to script the integration layer that lets us use LoadLeveler as a backend for LSF.
34 Analysis of sysmon data demonstrated that there is a leak in one of the nrcm jobs.
37 User asphilli was flagged as an inefficient user by Tom Engel's monitoring programs.
43 CSG members are working to make FFTPACK friendlier to F90 compilers.
26 A CSG member modified the Fortran 90 course module delivered earlier to SOARS students to give a short tutorial introduction to F90.
55 Per Jim Edwards' request, I compiled and installed the latest 4.0.1 (MPI/IO enabled) version of netCDF4 in /contrib. Unfortunately, 6 of 47 tests failed, so I contacted their support. They said that HDF5-1.8.3 is required. After several compile iterations (and talking with their support), HDF5-1.8.3 was successfully compiled. I installed it in /contrib, but a few tests are still failing. I'm working with HDF user support on that (and they have no clues so far). In the meantime, I recompiled netCDF-4.0.1 (MPI/IO enabled), and this time only a few tests failed. I installed it in /contrib (replacing the old one) and sent the report to their user support; it looks like the failures are not very relevant.
27 Management wants to be notified by email of: 1. a request for nodes, 2. the start of a hurricane run, 3. the end of a run, 4. the publication of a forecast
57 Dedicated consultant for the workshop; worked with SSG to set up a dedicated shared queue and instructed them on how to use it. They decided to run CCSM in the regular/premium queue. To my knowledge the workshop went fine: http://websrv.cs.umt.edu/isis/index.php/Summer_Modeling_School
31 A CSG member worked with SSG, IBM and Platform to isolate and suggest fixes for a scheduler problem after the upgrade on firefly
48 Andy Mai reported (TT#48125) that his (tcsh) history file grew until his quota was filled (and then, of course, several evil things happened, such as all his jobs failing). It looks like tcsh has a bug which may cause output to be redirected from the screen to the (history!) file in the case of a bad network crash on the client side. I wasn't able to reproduce the issue, but I suggested a workaround which should prevent the same problem from recurring.
65 Identified medium and large datasets. Solved issues with the large dataset. Launched ~500 jobs on bluefire in different conditions (e.g. node count). About 100 of them failed because the wallclock time limit was hit (decided it is not worth resubmitting them, for now). Data analysis, plots and poster for LCI done (and poster shown :-) ). Complete benchmark for bluefire. Short documentation about it has been written.
50 See TT#48124
52 See EV 47414
54 Worked with Cindy Bruyere on the 24-hour wallclock time limit. Still working on the preattach settings to improve MPI IO performance.
70 The MPI implementation of the bzip2 compression algorithm has been compiled on lightning, bluevista and bluefire. The same algorithm has a pthread implementation which might benefit gale/gust/breeze users. Working with PAM to have it installed. Unfortunately, once in a while it seems to fail on the supers (subsequent calls with exactly the same arguments usually succeed). I contacted the author and discussed the possible reason with him. On our system we use very old versions of some libraries needed by mpibzip2, so I recompiled the latest release available.
49 Pre-testing and post-testing of the xlc and xlf compilers (see TT 48254 for details). Tested the integrity of the whole machine (see emails for details)
53 Trying to install it on bluefire, see http://ctuning.org/
51 See TT#47041
21 Next steps in providing training by SCD
7 Tar on the supers
18 Support system information scripts on all platforms
16 Liaison between CISL and CGD CCSM Software Engineering Group
20 Computer training for SOARS, RESESS, and SIParCS college students
10 Postprocessing tools
11 CISL Resource Accounting update
12 Prepare documentation for users.
15 Assistance with questions sent to Consulting Office
19 Test and support Totalview usage on all supers
2 Porting assistance for bluefire
17 Implement ExtraView trouble ticket system
1 Real time and other special computing projects
14 Facilitate transition to LSF.
22 Creation of new CSG web site and collaboration tool
0 Ranger port of CCSM4
23 Miscellaneous duties
13 Maintenance and documentation of software products
6 Run benchmark tests, assist with local software
3 Software library installation policy
9 Provide user documentation for LSF batch scheduling system.
8 Test benchmark suites for regression and ATPs.
4 CTSS Testing and HelpDesk
5 Professional development
45 Fortran90 Training
25 A CSG representative helped CAM tutorial participants get going by providing one-on-one support in connecting to bluefire and the data analysis machines from their laptops (Windows, Mac and Linux) with various kinds of ssh clients. Ten nodes were reserved for the duration of the tutorial hands-on session.
24 Working with Bill to get all the syntax and settings right. Installed uberftp on bluefire, made it part of the module configuration and did some preliminary testing. Everything seems to be working as expected; a little more testing is needed before drafting a user doc.
74 Prepared and implemented the shutdown test for bluefire, and began to learn Drupal
46 Frost User Documentation

The Changing Face of Consulting Services

One of the goals of the CISL Consulting Services Group over the past year has been to design and coordinate the implementation of a greatly enhanced new customer support system. The key to making this project succeed is to expand the consulting team to include other groups within CISL who can provide additional expertise.

In today's challenging computational environment, CISL realizes that excellent customer support is essential to making progress in computational science and scientific research. That support begins the moment a scientist decides to use CISL resources, and the collaboration may continue over months or even years until scientific results are published. It is our goal to provide customers with in-depth support using the full capability of the divisional staff.

Users who want help setting up accounts and beginning to compute will benefit from the assistance of our Enterprise Services Section staff. Our Customer Support Services staff serve as a first point of contact to help users become familiar with the facilities. Our Outreach Group provides detailed, award-winning technical documentation on all aspects of supercomputer and mass storage system usage, while CISL Consulting Services and other CISL programming staff provide assistance with porting, math libraries, parallelization, and debugging of complex atmospheric simulation models. Our Data Support Services staff stand ready to provide assistance with obtaining scientific data, and our Data Analysis Services and Visualization groups provide expertise in high-capacity postprocessing.

Telephone customer support is now available around the clock, and consulting support is available by email at consult1@ucar.edu, by walk-in at room 42 of the Mesa Laboratory, or by appointment. We look forward to working with you to achieve your scientific goals!

Evolution of User Services

CISL's strategic plan for user services is to provide a balanced set of services to enable researchers to easily and effectively utilize community resources.