Understanding Platform LSF Job Exit Information
Contents
- Why did my job exit?
- How LSF translates events into exit codes
- Application and system exit values
- LSF job termination reason logging
- Job termination by LSF exit information
- LSF RMS integration exit values
Why did my job exit?
LSF collects job information and reports the final status of a job. By convention, a job that finishes normally reports a status of 0. Any non-zero status means that the job has exited abnormally.
Most of the time, an abnormal job exit is caused by the job itself or by the system it ran on, not by an LSF error. This document explains some of the information LSF provides about abnormal job termination.
How LSF translates events into exit codes
The following table summarizes LSF exit behavior for some common error conditions.
Host failure
If an LSF server host fails, jobs running on that host are lost. No other jobs are affected. Jobs can be submitted so that they are automatically rerun from the beginning or restarted from a checkpoint on another host if they are lost because of a host failure.
If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become available.
Exited jobs
A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:
- The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.
- The job is not able to be dispatched before it reaches its termination deadline, and thus is aborted by LSF.
- The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.
- The job exits with a non-zero exit status.
You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Administering Platform LSF for more information.
Application and system exit values
LSF monitors a job while it runs and reports the exit code returned by the job itself. On UNIX platforms, LSF collects this exit code through the wait3() system call. The exit code is a result of the system exit values. Use bhist or bjobs to see the exit code for your job.
Application exit values
The most common cause of abnormal LSF job termination is the application itself exiting with a non-zero value. If your application exits explicitly with a value less than 128, bjobs and bhist display the actual exit code of the application; for example, Exited with exit code 3. You would have to refer to the application code for the meaning of exit code 3.
It is possible for a job to explicitly exit with an exit code greater than 128, which can be confused with the corresponding UNIX signal. Make sure that applications you write do not use exit codes greater than 128.
System signal exit values
When you send a signal that terminates the job, LSF reports either the signal or the signal_value+128. If the return status is greater than 128, and the job was terminated with a signal, then return_status-128=signal. For example, return status 133 means that the job was terminated with signal 5 (SIGTRAP on most systems, 133-128=5). A job with exit status 130 was terminated with signal 2 (SIGINT on most systems, 130-128 = 2).
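The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of LSF: the function name describe_exit_status is hypothetical, and the signal name is looked up on the host that runs the script, which may number signals differently from the execution host.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Apply the return_status - 128 = signal rule to a reported exit status."""
    if status == 0:
        return "finished normally"
    if status > 128:
        signum = status - 128
        try:
            # Name lookup uses the local host's signal table (an assumption:
            # the execution host may map this number to a different signal).
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"possibly terminated by signal {signum} ({name})"
    return f"application exit code {status}"


if __name__ == "__main__":
    for status in (0, 3, 130, 133, 139):
        print(status, "->", describe_exit_status(status))
```

The word "possibly" is deliberate: an application can also exit explicitly with a value greater than 128, so the exit status alone does not prove that a signal was involved.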
Some operating systems define exit codes as 0-255. As a result, negative exit values or values greater than 255 may wrap around within that range. The most common example is a program that exits with -1, which LSF reports as "exit code 255".
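As a rough illustration of the wrap-around, the following Python sketch reduces a requested exit value to the 0-255 range. It assumes the common case where the system keeps only the low 8 bits of the exit value; the helper name wrap_exit_code is hypothetical.

```python
def wrap_exit_code(value: int) -> int:
    """Return the exit code seen by the parent if only the low 8 bits survive."""
    # Python's modulo always returns a non-negative result for a positive
    # divisor, so negative values wrap into the 0-255 range as well.
    return value % 256


if __name__ == "__main__":
    for requested in (-1, 3, 256, 300):
        print(requested, "->", wrap_exit_code(requested))
```

Under that assumption, a program that exits with -1 is reported with exit code 255, and a program that exits with 256 is indistinguishable from one that exits with 0.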
How or why the job may have been signaled, or exited with a certain exit code, can be application and/or system specific. The application or system logs might be able to give a better description of the problem.
tip:
Termination signals are operating system dependent, so signal 5 may not be SIGTRAP and 11 may not be SIGSEGV on all UNIX and Linux systems. You need to pay attention to the execution host type in order to correctly translate the exit value if the job has been signaled.
bhist and bjobs output
In most cases, bjobs and bhist show the application exit value (128 + signal). In some cases, bjobs and bhist show the actual signal value.
If LSF sends catchable signals to the job, it displays the exit value. For example, if you run bkill jobID to kill the job, LSF passes SIGINT, which causes the job to exit with exit code 130 (SIGINT is 2 on most systems, 128+2 = 130).
If LSF sends uncatchable signals to the job, then the entire process group for the job exits with the corresponding signal. For example, if you run bkill -s SEGV jobID to kill the job, bjobs and bhist show:
Exited by signal 7
Example
The following example shows a job that exited with exit code 139, which means that the job was terminated with signal 11 (SIGSEGV on most systems, 139-128=11). This means that the application had a core dump.
bjobs -l 2012
Job <2012>, User , Project , Status , Queue , Command
Fri Dec 27 22:47:28: Submitted from host , CWD <$HOME>;
Fri Dec 27 22:47:37: Started on , Execution Home , Execution CWD ;
Fri Dec 27 22:48:02: Exited with exit code 139. The CPU time used is 0.2 seconds.
SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched   -     -    -    -   -   -   -   -    -    -    -
loadStop    -     -    -    -   -   -   -   -    -    -    -
           cpuspeed  bandwidth
loadSched     -         -
loadStop      -         -
LSF job termination reason logging
When LSF takes action on a job, it may send multiple signals. In the case of job termination, LSF sends SIGINT, SIGTERM, and SIGKILL in succession until the job has terminated. As a result, the job may exit with any of the corresponding exit values at the system level. Other actions may send "warning" signals, such as SIGUSR2, to applications. For specific signal sequences, refer to the LSF documentation for that feature.
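To make the termination sequence concrete, this short Python sketch lists the exit values a job might report for each signal in it, using the 128 + signal convention described earlier. The names KILL_SEQUENCE and possible_exit_values are hypothetical, and the numbers come from the signal table of the host that runs the script.

```python
import signal

# The termination sequence named in the text, in the order LSF sends it.
KILL_SEQUENCE = (signal.SIGINT, signal.SIGTERM, signal.SIGKILL)


def possible_exit_values(signals=KILL_SEQUENCE):
    """Map each signal name to the 128 + signal exit value it would produce."""
    return {sig.name: 128 + int(sig) for sig in signals}


if __name__ == "__main__":
    for name, value in possible_exit_values().items():
        print(f"{name}: exit value {value}")
```

A job that handles SIGINT and exits cleanly may report a different value altogether, so treat these only as the values to expect when the signal itself ends the job.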
Run bhist to see the actions that LSF takes on a job:
bhist -l 1798
Job <1798>, User <user1>, Command <sleep 10000>
Tue Feb 25 16:35:31: Submitted from host <hostA>, to Queue <normal>, CWD <$HOME/lsf_7.0/conf/lsbatch/lsf_7.0/configdir>;
Tue Feb 25 16:35:51: Dispatched to <hostA>;
Tue Feb 25 16:35:51: Starting (Pid 12955);
Tue Feb 25 16:35:53: Running with execution home </home/user1>, Execution CWD </home/user1/Testing/lsf_7.0/conf/lsbatch/lsf_7.0/configdir>, Execution Pid <12955>;
Tue Feb 25 16:38:20: Signal <KILL> requested by user or administrator <user1>;
Tue Feb 25 16:38:22: Exited with exit code 130. The CPU time used is 0.1 seconds;
Summary of time in seconds spent in various states by Tue Feb 25 16:38:22
PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
20    0      151  0      0      0      171
Here we see that LSF itself sent the signal to terminate the job, and the job exits 130 (130-128 = 2 = SIGINT).
When a job finishes, LSF reports the last job termination action it took against the job and logs it into lsb.acct.
If a running job exits because of node failure, LSF sets the correct exit information in lsb.acct, lsb.events, and the job output file.
View logged job exit information (bacct -l)
- Use bacct -l to view job exit information logged to lsb.acct:
bacct -l 7265
Accounting information about jobs that are:
- submitted by all users.
- accounted on all projects.
- completed normally or exited
- executed on all hosts.
- submitted to all queues.
- accounted on all service classes.
------------------------------------------------------------------------------
Job <7265>, User <lsfadmin>, Project <default>, Status <EXIT>, Queue <normal>, Command <srun sleep 100000>
Thu Sep 16 15:22:09: Submitted from host <hostA>, CWD <$HOME>;
Thu Sep 16 15:22:20: Dispatched to 4 Hosts/Processors <4*hostA>;
Thu Sep 16 15:23:21: Completed <exit>; TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Accounting information about this job:
Share group charged </lsfadmin>
CPU_T  WAIT  TURNAROUND  STATUS  HOG_FACTOR  MEM  SWAP
0.04   11    72          exit    0.0006      0K   0K
------------------------------------------------------------------------------
SUMMARY: ( time unit: second )
Total number of done jobs: 0 Total number of exited jobs: 1
Total CPU time consumed: 0.0 Average CPU time consumed: 0.0
Maximum CPU time of a job: 0.0 Minimum CPU time of a job: 0.0
Total wait time in queues: 11.0
Average wait time in queue: 11.0
Maximum wait time in queue: 11.0 Minimum wait time in queue: 11.0
Average turnaround time: 72 (seconds/job)
Maximum turnaround time: 72 Minimum turnaround time: 72
Average hog factor of a job: 0.00 ( cpu time / turnaround time )
Maximum hog factor of a job: 0.00 Minimum hog factor of a job: 0.00
Termination reasons displayed by bacct
When LSF detects that a job is terminated, bacct -l displays one of the following termination reasons:
tip:
The integer values logged to the JOB_FINISH event in lsb.acct and the termination reason keywords are mapped in lsbatch.h.
Restrictions
- If a queue-level JOB_CONTROL is configured, LSF cannot determine the result of the action. The termination reason only reflects what the termination reason could be in LSF.
- LSF is not guaranteed to catch external signals sent directly to the job.
- In MultiCluster, a brequeue request sent from the submission cluster is translated to TERM_OWNER or TERM_ADMIN in the remote execution cluster. The termination reason in the email notification sent from the execution cluster, as well as that in lsb.acct, is set to TERM_OWNER or TERM_ADMIN.
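When post-processing accounting output in a script, the termination reason can be picked out of bacct -l text with a small parser. The Python sketch below is only an illustration: it assumes the "TERM_KEYWORD: description" layout seen in the TERM_RUNLIMIT line of the sample bacct -l output in this section, and the function name find_termination_reason is hypothetical.

```python
import re

# Assumed layout: an upper-case TERM_ keyword, a colon, then free text
# running to the end of the line.
TERM_REASON = re.compile(r"\b(TERM_[A-Z_]+):\s*(.*)")


def find_termination_reason(bacct_output: str):
    """Return (keyword, description) for the first TERM_ reason, or None."""
    match = TERM_REASON.search(bacct_output)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


if __name__ == "__main__":
    sample = (
        "Thu Sep 16 15:23:21: Completed <exit>; TERM_RUNLIMIT: job killed "
        "after reaching LSF run time limit.\n"
        "Accounting information about this job:\n"
    )
    print(find_termination_reason(sample))
```

Because of the restrictions listed above, a keyword found this way reflects what LSF recorded, which is not always what actually ended the job.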
Example output of bacct and bhist
Job termination by LSF exit information
LSF also provides additional information in the POST_EXEC of the job. Use this information to detect conditions where LSF has terminated the job and take the appropriate action.
The job exit information in the POST_EXEC is defined in two parts:
- LSB_JOBEXIT_STAT: the raw wait3() output (converted using the wait macros in /usr/include/sys/wait.h)
- LSB_JOBEXIT_INFO: defined only if the job exit was due to a defined LSF reason.
Queue-level POST_EXEC commands should be written by the cluster administrator to perform whatever task is necessary for specific exit situations.
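A POST_EXEC command could decode these two values along the following lines. This Python sketch is a minimal illustration, not an LSF-supplied script: it assumes that LSB_JOBEXIT_STAT is available as an environment variable holding a decimal wait status, it treats LSB_JOBEXIT_INFO as opaque text, and the function name decode_jobexit_stat is hypothetical. The os.WIF* helpers are Python's equivalents of the wait macros.

```python
import os


def decode_jobexit_stat(raw: str) -> str:
    """Interpret a raw wait status with the standard POSIX wait macros."""
    status = int(raw)  # assumption: the variable holds a decimal integer
    if os.WIFSIGNALED(status):
        return f"terminated by signal {os.WTERMSIG(status)}"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return f"unrecognized wait status {status}"


if __name__ == "__main__":
    raw = os.environ.get("LSB_JOBEXIT_STAT")
    info = os.environ.get("LSB_JOBEXIT_INFO")
    if raw is None:
        print("LSB_JOBEXIT_STAT is not set")
    else:
        print("Job", decode_jobexit_stat(raw))
    if info:
        # Present only when the exit was due to a defined LSF reason.
        print("LSF termination reason:", info)
```

Note that the raw wait status is not the same number that bjobs prints: on most systems a normal exit with code 1 appears as raw status 256, which is why the macros are needed.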
tip:
System-level enforced limits, such as CPU and memory limits (listed above), cannot be shown in LSB_JOBEXIT_INFO because the operating system, not LSF, performs the action. Set appropriate parameters in the queue or at job submission to allow LSF to enforce the limits, which makes this information available to LSF.
Common LSB_JOBEXIT_STAT and LSB_JOBEXIT_INFO values
The following table lists common scenarios that LSB_JOBEXIT_INFO does and does not cover.
LSF RMS integration exit values
For the RMS integrations with LSF (HP AlphaServer SC and Linux QsNet), LSF jobs running through RMS return the rms_run() return code as the job exit code. RMS documents certain exit codes and the corresponding job exit reasons.
See the rms_run() man page for more information.
Upon successful completion, rms_run() returns the global OR of the exit status values of the processes in the parallel program. If one of the processes is killed, rms_run() returns a status value of 128 plus the signal number. It can also return the following codes:
|
Platform Computing Inc.
www.platform.com |