CUG.log Contents Back issues CUG SGI Contact us

TECH REPORTS

SGI service report

Vendors and users discuss strategies and field plans for SGI customer support . . .

Tom Boyle,
SGI liaison


by Tom Boyle

About 15 representatives from SGI's Customer and Professional Services (CPS) organization attended the CUG meeting in Stuttgart. We came from field offices throughout Europe and the rest of the world, as well as from Mountain View, Eagan, and Chippewa Falls. This was the first CUG for a couple of our executives who were on hand to give presentations on the strategies and field plans for SGI's customer support efforts.

Ken Coleman, SGI senior vice president of CPS, talked in the general session on Thursday about our emphasis on customer satisfaction and the activities we have underway to build a stronger infrastructure of tools and processes for our "world class" support engineers around the world.

Bob Brooks, SGI vice president of Worldwide Customer Support, gave a talk on Wednesday in the Operations Special Interest Group (SIG) session. Bob updated customers on SGI's focus on increased technical training for our support engineers in the field and the importance of having a consistent call center and technical support structure in each of our major geographic organizations.

Ken and Bob were pleased to have the opportunity to speak directly with customers and learn more about the interests and support requirements of SGI's supercomputer sites.


Customer feedback

CPS attendees started the week by zeroing in on the Operations SIG open meeting on Monday afternoon. This session always gives us a great chance to get feedback from customers and then to determine which issues and topics we should focus on during sessions later in the week.

Fran Pellegrino from the Pittsburgh Supercomputing Center presented the results of the hardware reliability survey at this session. SGI took all the comments and questions from the survey as input to the SGI Q&A panel session held later in the week.

We believe this survey and its comments section give all CUG members a good opportunity to submit feedback that gets directly back to SGI for both discussion at CUG and future consideration, so we'd like to see all member sites participate in the survey.

The "around the room" feature of the Operations SIG open meeting is a highlight of CUG that enables customers to learn about each other's hot issues and, again, to give input directly to SGI for attention and response at the conference. The survey and the around-the-room discussion are the two main events that set the stage for the Q&A panel, which was held on Wednesday.


Issues at the Q&A

Charlie Clark, SGI's Cray hardware product support manager from Chippewa Falls, chaired the Q&A panel, which was made up of SGI representatives from service, hardware and software engineering, and software marketing. Since many CUG members are unable to attend this session to hear SGI's responses to the questions first-hand, perhaps it would be helpful to highlight here a few of the service-related issues that came up.


Spare parts availability

Several sites raised issues with spare parts availability for both the T90 and the T3E. Spare parts have not been close enough to the site when needed, and the result in some cases is extended downtime while waiting for a spare part to arrive. Extended downtime was an issue for the T3E in particular and was not strictly related to spare parts availability. SGI's response to this issue involves some general discussion of sparing and maintenance strategies that are common to the T90 and T3E, plus some additional information that is unique to each product.

In general, SGI attempts to keep spare parts for all products available as needed through a combination of on-site spares and parts depots around the world. There is a level of risk management involved, and we make every effort to keep our spare parts inventory at the right level to provide the best response along with the lowest possible maintenance prices.

There are times when it becomes difficult to deliver the needed spare in a timely manner, and this is when we expect the serviceability features of our products to come into play to help prevent long down times associated with waiting for a spare part. At the same time, we constantly monitor metrics for on-time parts delivery, and we will make increases in our spare parts inventory when trends indicate frequent parts shortages in the field.

Both the T90 and T3E systems have features that allow the customer to disable failing hardware components, degrade the system hardware resources and continue to operate without the faulty hardware, and then defer repairs to a time that is more convenient or when a spare part can be delivered to the site.

For the T90, SGI has significantly increased our investment in our inventory of spare modules over the past year. This has been closely monitored by our Logistics Group in Chippewa Falls, as well as by individual country service managers who see the need to increase their level of spare modules based on number of customers and product reliability.

With this increase in spare module levels, we have been able to improve the deployment of T90 spares to parts banks and depots worldwide that are closer to customer sites, and we are now in a much improved position to keep each module type in stock in all of our depots. Process changes in Chippewa Falls have also improved the turnaround time of modules coming from the repair process back into Logistics, which in turn directly increases our pool of available spare modules at any given time.


Resiliency features

Also for the T90, several resiliency features have been implemented in UNICOS over the past year. These features, in conjunction with the basic ability of the hardware to run with CPU or memory resources configured down, enable the system to ride through several types of hardware failure conditions without interruption.

CPUs can be removed from or added back into the running system, giving the customer more flexibility to determine when to schedule time to make needed repairs. If a particular spare module type is unavailable, this could be an important factor in choosing to degrade either memory size or the number of CPUs and to keep a level of service available to your users.

While degrading memory size does require restarting UNICOS to reconfigure, the interruption can be kept brief and maintenance still deferred to a later time. The set of resiliency features developed over the past year for the T90 has been released in UNICOS 10.0, and most are now available for C90 and J90 systems as well.

For the T3E, SGI has implemented a unique combination of built-in hardware redundancy along with sparing and maintenance strategies that strive to achieve the following:

  • Minimal downtime associated with hardware failures

  • Maximum hardware resource availability (eliminate the need to run in a degraded hardware resource condition)

  • Minimal level of required spares in order to keep maintenance prices as low as possible

Most customers who experienced long downtimes were affected by the immaturity of some of the system features required to take advantage of the built-in hardware redundancy. SGI's software engineering division in Eagan has worked very hard to resolve the problems in this area of UNICOS/mk.

SGI field service engineers and customers are now able to reboot and reconfigure a T3E system to map out failing processor elements (PEs) and map in redundant PEs. Much of this process is now automated in the boot process. Using this capability, multiple failing PEs can be mapped out of the system over time, while still keeping system hardware resources available to the customer at the full configuration. Repairs can then be scheduled at a time that is convenient to the customer and when required spare parts can be arranged.
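For readers who want a concrete picture of the map-out/map-in idea, here is a minimal sketch in Python. It is purely illustrative and assumes a simple table that assigns each logical PE to a physical PE, with the redundant PEs held in a spare pool. The class name, method names, and PE counts are hypothetical and are not taken from UNICOS/mk or any SGI tool.

```python
# Hypothetical sketch of logical-to-physical PE remapping with a spare pool.
# None of these names come from UNICOS/mk; they are assumptions for illustration.

class PEMap:
    def __init__(self, num_logical, num_spares):
        # Logical PE i starts out on physical PE i.
        self.logical_to_physical = list(range(num_logical))
        # Redundant physical PEs are numbered after the active ones.
        self.spares = list(range(num_logical, num_logical + num_spares))
        # Failed physical PEs waiting for a scheduled repair.
        self.mapped_out = []

    def map_out(self, physical_pe):
        """Swap a failing physical PE for a redundant one (applied at reboot)."""
        if physical_pe not in self.logical_to_physical:
            raise ValueError(f"physical PE {physical_pe} is not in use")
        if not self.spares:
            raise RuntimeError("no redundant PEs left; repair is required")
        logical = self.logical_to_physical.index(physical_pe)
        self.logical_to_physical[logical] = self.spares.pop(0)
        self.mapped_out.append(physical_pe)

    def available(self):
        """Number of logical PEs the customer can still use."""
        return len(self.logical_to_physical)


pe_map = PEMap(num_logical=8, num_spares=2)
pe_map.map_out(3)   # first failure: a redundant PE takes over
pe_map.map_out(5)   # second failure: the last redundant PE takes over
print(pe_map.available(), pe_map.mapped_out)
```

In this sketch the customer still sees all eight logical PEs after two failures, and the list of mapped-out PEs is what a later scheduled repair would work through. Once the spare pool is empty, a further failure can no longer be hidden and a repair is needed.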

Good spare-parts management and effective utilization of serviceability and resiliency features by SGI are critical to our customers' success. We believe that we have taken the necessary steps to address the issues with the T90 and T3E products. Our service planning group is working with the engineering teams throughout SGI to plan for success from the outset in these areas of product and service performance on all of our future products.


Problem notification / status communication

A few other service-related issues raised at the Q&A panel session had to do with problem notification mechanisms and problem status communication. These issues are actively being addressed by some of the infrastructure work that Ken Coleman described during his talk, and we plan to give these topics increased emphasis at future CUG meetings.


New products and services

As was the case with each of the SGI divisions represented at CUG, CPS representatives brought with them a combination of presentations and information that reflect our collective push toward new and future SGI/Cray products and services:

  • A tutorial on UNICOS to IRIX differences in DMF, given by Paul Ernst

  • A talk on monitoring tools for both UNICOS and IRIX platforms, given by Randy Lambertus

  • Demos of the CRInform and SupportFolio on-line customer support systems, given by Mike Sand

  • A birds-of-a-feather session on SGI's efforts to drive toward common customer support information mechanisms for all Cray/SGI products, led by Dave Walls

  • Customer training strategies to meet the combined needs of customers running both UNICOS- and IRIX-based systems, presented by Bill Mannel


Scheduling conflicts

Unfortunately, due to scheduling difficulties and inadequate "advertising," not all of the above information reached as many customers as we would have liked. The birds-of-a-feather (BOF) session on customer information mechanisms was scheduled opposite the popular and very well attended SV1 BOF. Also, the CRInform/SupportFolio demos were informal, and not all customers were aware they could easily get a demo.

At future CUG meetings, we will try to do a better job of bringing these topics into the mainstream to address the many questions customers have about SGI's plans to bring together all SGI and Cray support information mechanisms. By the time of the next general CUG meeting in 1999, we expect to have a lot of progress to report in this area.

Thanks to all the customers who helped bring service issues into focus at the Stuttgart CUG. We appreciate the feedback and look forward to seeing everyone at the next CUG.
