
On 26 August the problem was isolated to hardware. Cray engineers took shavano down on 27 August to run tests to isolate and repair the problem. The machine was repaired and returned to production on Wednesday, 28 August, at 22:30.
Shavano was shut down again on 3 September, after tests revealed the repair was not entirely effective. After thorough testing, shavano was returned to production on 4 September. The system will also be down for testing from 08:00 to 09:00 on Monday, 9 September.
Because this problem may have affected the integrity of computational results, a number of running jobs (from 27 August and 3 September) were deleted from shavano's Network Queuing System (NQS) queues before the machine was returned to general production. For a list of deleted jobs, or with questions or problems, please contact the SCD Consulting Office (consult1@ucar.edu or 303-497-1278).
Specifically, the problem involved bit 2**51 (which falls within the floating-point exponent) being intermittently picked as a job was swapped in from disk. The problem may also have appeared in bits 2**35, 2**19, and 2**3 (which fall within the floating-point mantissa); however, SCD has been able to reproduce only one occurrence, in bit 2**19, and none in bits 2**35 or 2**3. Hardware repairs have concentrated on the path between the I/O Subsystem and disk.
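To see why bit 2**51 lands in the exponent while the other three bits land in the mantissa, the following minimal Python sketch classifies bit positions in a 64-bit word. It assumes the Cray-style floating-point layout (sign in bit 63, a 15-bit exponent in bits 62-48, a 48-bit mantissa in bits 47-0) and a base-2 exponent; the notice itself does not spell out this layout, and the function and constant names are illustrative only.

```python
# Assumed Cray-style 64-bit floating-point word layout (an assumption,
# not stated in the notice):
#   bit 63       sign
#   bits 62-48   exponent (15 bits)
#   bits 47-0    mantissa (48 bits)
SIGN_BIT = 63
EXPONENT_LOW, EXPONENT_HIGH = 48, 62


def classify_bit(position):
    """Return the field ('sign', 'exponent', 'mantissa') holding bit 2**position."""
    if not 0 <= position <= 63:
        raise ValueError("bit position must be in the range 0..63")
    if position == SIGN_BIT:
        return "sign"
    if EXPONENT_LOW <= position <= EXPONENT_HIGH:
        return "exponent"
    return "mantissa"


def exponent_scale_factor(position):
    """Factor by which a value's magnitude grows if the given exponent bit
    changes from 0 to 1, assuming a base-2 exponent."""
    if classify_bit(position) != "exponent":
        raise ValueError("bit is not in the exponent field")
    # The bit's weight within the exponent field is 2**(position - 48).
    return 2 ** (2 ** (position - EXPONENT_LOW))


# Bit positions named in the notice, highest first.
affected_bits = [51, 35, 19, 3]
fields = {bit: classify_bit(bit) for bit in affected_bits}

# Distance between consecutive affected bit positions.
spacing = [hi - lo for hi, lo in zip(affected_bits, affected_bits[1:])]

if __name__ == "__main__":
    for bit in affected_bits:
        print(f"bit 2**{bit}: {fields[bit]}")
    print("spacing between affected bits:", spacing)
    print("scale factor for exponent bit 2**51:", exponent_scale_factor(51))
```

Under these assumptions, an error in bit 2**51 changes a result's magnitude by a large factor, whereas an error in one of the mantissa bits perturbs only the less significant digits, which makes it harder to spot by inspection.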
Maintained by: consult1@ncar.ucar.edu