This Month's GOTCHA


Intermittent Hardware Problem Identified on Shavano


An intermittent hardware problem on the Y-MP8/864 (shavano) became noticeable to a particular group of users approximately a month ago. It was reported to SCD two weeks later by a user who believes the problem may first have been observed as long as six months ago.

On 26 August the problem was isolated to hardware. Cray engineers took shavano down on 27 August to run tests to isolate and repair the problem. The machine was repaired and returned to production on Wednesday, 28 August, at 22:30.

Shavano was again shut down on 3 September, after tests revealed the repair was not entirely effective. After thorough testing, shavano was returned to production on 4 September. The system will also be down for system testing from 08:00 to 09:00 on Monday, 9 September.

Because this problem may have affected the integrity of computational results, a number of jobs that were running on 27 August and 3 September were deleted from shavano's Network Queuing System (NQS) queues before the machine was returned to general production. For a list of the deleted jobs, or if you have questions or problems, please contact the SCD Consulting Office (consult1@ucar.edu or 303-497-1278).

Specifically, the problem involved bit 2**51 (which falls within the floating-point exponent) being intermittently picked as a job was swapped in from disk. The problem may also have appeared in bits 2**35, 2**19, and 2**3 (which fall within the floating-point mantissa); however, SCD has been able to reproduce only one occurrence in bit 2**19 and none in bits 2**35 or 2**3. Hardware repairs have concentrated on the path between the I/O Subsystem and disk.
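
To give a feel for the size of these errors, the following minimal Python sketch decodes a 64-bit word and shows what happens when one of the bits named above is picked. It is an illustration only, not a diagnostic used by SCD. It assumes the usual Cray floating-point layout (sign in bit 2**63, a 15-bit exponent with a bias of 40000 octal in bits 2**62 through 2**48, and a 48-bit fractional mantissa in bits 2**47 through 2**0), and it assumes that "picked" means a bit that should be 0 was read as 1. The names cray_decode, pick_bit, and ONE are hypothetical.

```python
# Sketch only: the word layout and bias below are assumptions about the
# Cray floating-point format, not values taken from this notice.
BIAS = 0o40000        # assumed exponent bias
MANTISSA_BITS = 48    # assumed mantissa width


def cray_decode(word):
    """Decode a 64-bit Cray-format word (held in a Python int) to a float."""
    sign = -1.0 if (word >> 63) & 1 else 1.0
    exponent = (word >> MANTISSA_BITS) & 0x7FFF
    mantissa = word & ((1 << MANTISSA_BITS) - 1)
    # The mantissa is a fraction in [0, 1); a normalized value has bit 2**47 set.
    return sign * (mantissa / 2.0**MANTISSA_BITS) * 2.0 ** (exponent - BIAS)


def pick_bit(word, bit):
    """Return the word with the given bit forced to 1 (a 'picked' bit)."""
    return word | (1 << bit)


# The value 1.0 under the assumed layout: mantissa 0.5, true exponent +1.
ONE = ((BIAS + 1) << MANTISSA_BITS) | (1 << 47)

print(cray_decode(ONE))                      # 1.0
print(cray_decode(pick_bit(ONE, 51)))        # 256.0, i.e. 2**8 times too large
print(cray_decode(pick_bit(ONE, 19)) - 1.0)  # 2**-28, a very small relative error
```

Under these assumptions, a picked bit 2**51 raises the exponent by 8 and so multiplies the value by 2**8 = 256, while a picked bit 2**19 changes a value near 1.0 by only about four parts in a billion. If the bit was already 1 in the correct word, picking it changes nothing.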

Examine Your Results

We recommend that users examine results produced by shavano jobs over the last few months. The problem, while very intermittent, had the potential to affect any job that was either checkpointed or swapped to disk because of heavy workload on the machine. It is most noticeable in floating-point values whose exponents are incorrect, which leaves the value off by a factor of roughly 10**2. Lower-order bits in the mantissa could also be affected, although such errors would be more difficult to detect. The problem could also show up as random program crashes caused by program instructions being corrupted in these same bits.
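
One possible way to examine results, if a suspect job can be rerun, is to compare the old and new output value by value. The sketch below is only an illustration under several assumptions: that the job is deterministic, so a clean rerun should reproduce the original values exactly; that both sets of results have already been read into Python lists of floats; and that an exponent error appears as a ratio of 2**8 (or 1/2**8) between the two values. The function name suspect_indices and its parameters are hypothetical.

```python
import math


def suspect_indices(original, rerun, factor=2.0**8, tol=1e-9):
    """Flag positions where two result lists disagree.

    Returns (index, kind) pairs, where kind is "exponent" when the two
    values differ by the assumed factor and "other" for any other mismatch
    (for example, a small mantissa error).
    """
    suspects = []
    for i, (a, b) in enumerate(zip(original, rerun)):
        if a == b:
            continue
        ratio = abs(a / b) if b != 0 else math.inf
        if (math.isclose(ratio, factor, rel_tol=tol)
                or math.isclose(ratio, 1.0 / factor, rel_tol=tol)):
            suspects.append((i, "exponent"))
        else:
            suspects.append((i, "other"))
    return suspects


# Made-up values for illustration only; these are not real shavano results.
original = [1.0, 512.0, 3.0, 1.0 + 2.0**-28]
rerun = [1.0, 2.0, 3.0, 1.0]
print(suspect_indices(original, rerun))
# [(1, 'exponent'), (3, 'other')]
```

A comparison like this cannot tell which of the two runs is the correct one; it only points to values that deserve a closer look.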

Maintained by: consult1@ncar.ucar.edu
Comments & suggestions welcomed.