An intermittent hardware problem on the Y-MP8/864 (shavano) became noticeable to a particular group of users approximately a month ago. It was reported to SCD two weeks ago by a user who thinks the problem may have been first observed as long as six months ago. Last Monday night (26 August) the problem was isolated to hardware. Cray engineers took shavano down Tuesday morning (27 August) at 09:00 to run tests to isolate and repair the problem. The machine was repaired and returned to production on Wednesday, 28 August, at 22:30.
Because of the possibility that this problem affected the integrity of computational results, a number of running jobs were deleted from shavano's Network Queuing System (NQS) queues prior to returning the machine to general production. For a list of deleted jobs, questions, or problems, please contact the SCD Consulting Office (consult1@ucar.edu or 303-497-1278).
Specifically, the problem related to bit 2**51 (which falls within the floating-point exponent) being intermittently picked, that is, spuriously set, as a job was swapped in from disk. The problem may also have manifested in bits 2**35, 2**19, and 2**3 (which fall within the floating-point mantissa); however, SCD has been able to reproduce only one occurrence in bit 2**19 and no occurrences in bits 2**35 or 2**3.
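As an illustration of where these bits fall, the following Python sketch classifies each bit position and builds a mask of the four positions. The word layout used here (sign in bit 63, a 15-bit exponent in bits 48-62, a 48-bit mantissa in bits 0-47) is an assumption based on the usual description of the Cray floating-point format; it is not stated in this notice. The names field_of and suspect_mask are hypothetical.

```python
# Sketch only. ASSUMED word layout (not given in the notice):
#   bit 63      sign
#   bits 62-48  15-bit exponent
#   bits 47-0   48-bit mantissa
SIGN_BIT = 63
EXP_LOW, EXP_HIGH = 48, 62

# Bit positions named in the notice (2**51, 2**35, 2**19, 2**3).
SUSPECT_BITS = (51, 35, 19, 3)


def field_of(bit):
    """Return the field of the assumed layout that a bit position falls in."""
    if not 0 <= bit <= 63:
        raise ValueError("bit position must be in 0..63")
    if bit == SIGN_BIT:
        return "sign"
    if EXP_LOW <= bit <= EXP_HIGH:
        return "exponent"
    return "mantissa"


def suspect_mask(bits=SUSPECT_BITS):
    """Build a 64-bit mask with a 1 in each suspect bit position."""
    mask = 0
    for b in bits:
        mask |= 1 << b
    return mask


if __name__ == "__main__":
    for b in SUSPECT_BITS:
        print(f"bit 2**{b}: {field_of(b)}")
    print(f"mask = {suspect_mask():#018x}")
```

Under this assumed layout, bit 2**51 is the only one of the four that lands in the exponent, which agrees with the description above. The four positions are also exactly 16 bits apart.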
Users should examine results produced by shavano jobs over the last few months. The problem, while very intermittent, had the potential to affect any job that was either checkpointed or swapped to disk because of heavy workload on the machine. It is most noticeably manifested in floating-point values whose exponents are incorrect, leading to an error of a factor of roughly 10**2. Lower-order bits in the mantissa could also be affected, though such errors would be more difficult to detect. The problem could also be observed as random program crashes resulting from program instructions being corrupted in these same bits.
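For users comparing a suspect result against a trusted rerun, the following Python sketch shows the size of the error a picked bit can produce and a simple check for whether two 64-bit words differ only in the bit positions named above. It rests on assumptions that are not stated in this notice: the usual description of the Cray word layout (sign in bit 63, 15-bit exponent in bits 48-62 with bias 40000 octal, 48-bit mantissa treated as a fraction), and "picked" modeled as forcing a bit to 1. All function names are hypothetical.

```python
import math

# Sketch only. ASSUMED format details (not given in the notice).
EXP_SHIFT = 48          # exponent occupies bits 48-62
EXP_MASK = 0x7FFF       # 15 exponent bits
BIAS = 0o40000          # assumed exponent bias (16384)
MANT_BITS = 48          # mantissa is a 48-bit fraction

# Bit positions named in the notice.
SUSPECT_MASK = (1 << 51) | (1 << 35) | (1 << 19) | (1 << 3)


def cray_to_float(word):
    """Decode a 64-bit word under the assumed layout.

    Only words whose value fits in a Python float are handled.
    """
    sign = -1.0 if (word >> 63) & 1 else 1.0
    exponent = (word >> EXP_SHIFT) & EXP_MASK
    mantissa = word & ((1 << MANT_BITS) - 1)
    # value = sign * (mantissa / 2**48) * 2**(exponent - BIAS)
    return sign * math.ldexp(mantissa, exponent - BIAS - MANT_BITS)


def pick_bit(word, bit):
    """Model a picked bit as the bit being forced to 1 (an assumption)."""
    return word | (1 << bit)


def differs_only_in_suspect_bits(good, suspect):
    """True if the two words differ, and only in the suspect positions."""
    diff = good ^ suspect
    return diff != 0 and (diff & ~SUSPECT_MASK) == 0


# 1.0 under the assumed layout: fraction 0.5, exponent BIAS + 1.
ONE = ((BIAS + 1) << EXP_SHIFT) | (1 << (MANT_BITS - 1))

if __name__ == "__main__":
    print(cray_to_float(ONE))                # 1.0
    print(cray_to_float(pick_bit(ONE, 51)))  # 256.0, a factor of 2**8
    print(cray_to_float(pick_bit(ONE, 19)) - 1.0)
```

Under these assumptions, bit 2**51 carries a weight of 8 in the exponent, so picking it scales a value by 2**8 = 256, consistent with the factor of roughly 10**2 quoted above. A picked bit 2**19 changes 1.0 by only 2**-28, which shows why mantissa errors are much harder to spot. Because a picked bit that was already 1 changes nothing, an unchanged word does not prove that a job was unaffected.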