SCDzine
H I N T S

Batch output too large?

What if it won't fit in your home directory? . . .


Contents

Search

Article index

Back issues

Subscribe

Contact us

SCD

by Tom Parker


Situation: You interactively submit a Cray job from your Cray home directory, and it generates an output file (stdout) so large that it won't fit in your home directory. (Remember that space in your home directory is normally limited to 20 megabytes on ouray and chipeta and only 10 megabytes on paiute.)
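To get a feel for how easily this can happen, here is a minimal Python sketch that checks whether an output file of a given size would fit under the home-directory limits quoted above. The function name fits_in_home and the choice of 1,000,000 bytes per megabyte are assumptions made for illustration; the actual quota accounting on the Crays may differ.

```python
# Home-directory limits quoted in the article, in megabytes.
HOME_LIMIT_MB = {"ouray": 20, "chipeta": 20, "paiute": 10}

# Assumption: 1 megabyte = 1,000,000 bytes. The article does not say
# whether the quota is counted in decimal or binary megabytes.
BYTES_PER_MB = 1_000_000


def fits_in_home(host, used_bytes, output_bytes):
    """Return True if output_bytes more data stays within the host's limit."""
    limit_bytes = HOME_LIMIT_MB[host] * BYTES_PER_MB
    return used_bytes + output_bytes <= limit_bytes


if __name__ == "__main__":
    # 19421027 bytes is the size of the stdout file in the author's example.
    # The 5,000,000 bytes already in use is a made-up figure.
    print(fits_in_home("chipeta", 5_000_000, 19_421_027))
```

Under these assumptions, a 19421027-byte output file fits on chipeta only if the home directory is almost empty, and it can never fit on paiute.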


Good news, bad news

The good news is that the Network Queuing System (NQS) batch system will detect this condition, save your output so it is not lost, and then send you an e-mail about it.

The bad news is that the e-mail message is gobbledygook and just plain wrong!

The good news is that SCD is also watching out for you, and it will move the large output to even safer places and then send you an intelligible e-mail message.

The bad news is that SCD's actions are based on a "cron" job that runs only once an hour, so you may wait up to an hour after you get the bad message from NQS before you get the good message from SCD.

The good news is that SCD has filed a problem report to see if Cray can improve their message, or even better, let SCD suppress it.

The bad news is that it might take a while before this improvement is made.

So, in the meantime, in situations where NQS batch job output is too large to be returned to your home directory, you will get two e-mails -- one bad and one good. The bad message should be ignored, and you will eventually (up to an hour later) get the good message.

Below are examples of both the bad and good messages that user tparker (the author) received recently:


Bad message!

From chipeta@ncar.UCAR.EDU  Wed Apr  8 11:06:24 1998
To: tparker@ncar.UCAR.EDU
Subject: NQS request:  2542.chipeta ended.

Message concerning NQS request:  2542.chipeta ended.
Request name:   myjob
Request owner:  tparker
Mail sent at:   11:06:24 MDT
Request exited normally.

_Exit() value was: 0.

 Stdout file staging event status:
 Destination: -o chipeta:/home/chipeta0/tparker/myjob.o2542
   Output file could not be returned to primary or backup
.
   destination.

   Transaction failure reason at primary destination:
   User file/inode quota limit exceeded at local host.
   Output saved in private/root/failed directory within 
   NQS spool.


Comments

The bad message above is bad because:

  • The full pathname isn't given (just private/root/failed).
  • The location of the NQS spool (/usr/spool/nqe/spool) isn't given.
  • It doesn't give the name of the file in the /usr/spool/nqe/spool/private/root/failed directory.
  • The directory has permissions set so that you still can't get the file, even with the proper path.
  • The line consisting of just "." isn't too useful.


Good message!


From chipeta@ncar.UCAR.EDU  Wed Apr  8 11:11:09 1998
To: andersnb@ncar.UCAR.EDU, barbb@ncar.UCAR.EDU,
consult1@ncar.UCAR.EDU, engel@ncar.UCAR.EDU,
fuentes@ncar.UCAR.EDU, mac@ncar.UCAR.EDU,
tparker@ncar.UCAR.EDU
Subject: chipeta output tparker

Output from your chipeta job which failed to return has been 
placed in /tmp/tparker38537/2542.chipeta.o  (19421027 bytes)
This commonly results from the job producing more output than 
can be stored in your home directory.  Check your job for the 
presence of I/O statements which may have been inserted for 
debugging purposes.In the event that the scrubber removes 
your file from /tmp before you are able to look at it, you can 
retreive it from Mass Store file:
/NTWK/FAILED/TPARKER/2542.chipeta.o
The first line of text in file 2542.chipeta.o is:
  98.04.08 11:05:42 IDENTIFIER 2542.chipeta NAME a JOBID 515

Comments

The good message is good because:

  • It tells you where the file is on disk (/tmp/tparker38537/2542.chipeta.o).
  • It also tells you where SCD made a copy of the file on the Mass Storage System (/NTWK/FAILED/TPARKER/2542.chipeta.o).
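If you want to pick those two locations out of the good message automatically, something like the following Python sketch might do it. It assumes the message always uses the wording shown in the example above ("placed in ... (N bytes)" and "Mass Store file:"); SCD could change that wording at any time. The function name parse_good_message and the dictionary keys are hypothetical.

```python
import re

# Relevant lines copied from the example good message above.
SAMPLE = """\
Output from your chipeta job which failed to return has been
placed in /tmp/tparker38537/2542.chipeta.o  (19421027 bytes)
your file from /tmp before you are able to look at it, you can
retreive it from Mass Store file:
/NTWK/FAILED/TPARKER/2542.chipeta.o
"""


def parse_good_message(body):
    """Extract the disk path, size, and Mass Store path, or return None."""
    # Assumption: the disk path is followed by its size in parentheses.
    disk = re.search(r"placed in\s+(\S+)\s+\((\d+) bytes\)", body)
    # Assumption: the Mass Store path follows the "Mass Store file:" label.
    mss = re.search(r"Mass Store file:\s*(\S+)", body)
    if disk is None or mss is None:
        return None
    return {
        "disk_path": disk.group(1),
        "size_bytes": int(disk.group(2)),
        "mss_path": mss.group(1),
    }


if __name__ == "__main__":
    print(parse_good_message(SAMPLE))
```

The function returns None for any message that does not match both patterns, so the bad message from NQS is simply skipped.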


Caveat

This entire situation can only occur with interactive submissions from your home directory (and perhaps with FTP NQS jobs). It would not normally occur with submissions via Internet Remote Job Entry (IRJE), MASnet/Internet Gateway Server (MIGS), or xxjob.


See also . . .

For a variety of information to help you run your jobs better, see SCD ConsultWeb, a website maintained by the SCD consultants.
