Special Interest Group (SIG) report

Operations

Dan Drobnis

As sometimes happens at Cray User Group meetings, a theme developed during the interactions among attendees at the Operations Special Interest Group sessions. Discussions with Silicon Graphics/Cray Research representatives at the "Cray service panel" proved fruitful in showing how issues are being resolved.

The theme for this year's Cray service panel turned out to be T3E hardware reliability issues. Some aspects include:

  • Power supplies have gone through several revisions. The most recent revision, Phase V, is thought to correct all past problems and is being retrofitted at existing sites.

  • EV 56 Alpha processors are experiencing higher-than-expected failure rates. DEC has tightened the testing screens, and new processors are expected to show greater reliability.

  • DMA timeouts are occurring. A longer timeout counter appears necessary, which requires a different gate array chip that has more counter chips available. GigaRing PIMs are being upgraded as this chip becomes available.

  • FDDI resets are being caused by the DMA timeouts. The resets appear especially on MPNs with multiple communication and SCSI channels, partly because of a limit of 32 I/O windows open concurrently. More careful balancing of loads on MPNs helps until the DMA timeouts are fixed. It is also important to stay current with new I/O node software, which is being released every two weeks.

  • A bad fixed-point multiply instruction has been identified on the Alpha chip. A Software Library change eliminates use of this instruction.

SGI/Cray Research Customer Service states that stabilization of T3E sites is currently its number-one priority.

Lest anyone believe that only T3E issues exist, T90 2.7-volt power supplies have been failing at a higher-than-expected rate. A particular diode appears to be the culprit, and its vendor has been removed from the approved list. A slightly lower-rated diode with a better reliability history is being substituted. Current production and replacement supplies will have the more reliable diode; use of the lower-rated part may require an extra power supply to assure continued N+1 capability.
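The N+1 point can be illustrated with a little arithmetic. The sketch below assumes that the lower-rated diode translates into a lower usable output per supply, which the panel did not quantify. The load and rating figures are purely hypothetical and are not T90 specifications, and the function name is invented for this illustration.

```python
import math


def supplies_for_n_plus_1(load_amps: float, rating_amps: float) -> int:
    """Return the supply count for N+1 redundancy.

    N is the smallest number of supplies that can carry the load;
    one more is added as the spare.
    """
    if load_amps <= 0 or rating_amps <= 0:
        raise ValueError("load and rating must be positive")
    n = math.ceil(load_amps / rating_amps)
    return n + 1


# Hypothetical figures, chosen only to show the effect of derating.
load = 900.0              # total current drawn on the 2.7-volt rail
original_rating = 300.0   # assumed output per supply with the original diode
derated_rating = 270.0    # assumed output per supply with the lower-rated diode

print(supplies_for_n_plus_1(load, original_rating))  # 3 carry the load, plus 1 spare
print(supplies_for_n_plus_1(load, derated_rating))   # 4 carry the load, plus 1 spare
```

With these assumed numbers, a modest derating pushes the load just past what three supplies can carry, so a fourth working supply is needed before the spare is counted. A site with more headroom per supply would see no change.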

The J90 Fortran 90 compiler may be inadvertently damaged when Library 1.1 is installed using the original install procedure. A new procedure eliminates this possibility.
