Platform Computing Corp.

Understanding Platform LSF Job Exit Information

Why did my job exit?

LSF collects job information and reports the final status of a job. By convention, a job that finishes normally reports an exit status of 0. Any non-zero status means that the job has exited abnormally.

Most of the time, an abnormal job exit is caused by the job itself or by the system it ran on, not by an LSF error. This document explains some of the information LSF provides about abnormal job termination.

How LSF translates events into exit codes

The following table summarizes LSF exit behavior for some common error conditions.

| Error condition | LSF exit code | Operating system | System exit code equivalent | Meaning |
|---|---|---|---|---|
| Command not found | 127 | all | 1 or 127 | Command shell returns 1 if command not found. If the command cannot be found inside a job script, LSF returns exit code 127. |
| Directory not available for output | 0 | all | 1 | LSF sends the output back to the user through email if the directory is not available for output (bsub -o). |
| LSF internal error | -127, 127 | all | N/A | RES returns -127 or 127 for all internal problems. |
| Out of memory | N/A | all | N/A | Exit code depends on the error handling of the application itself. |
| LSF job states | 0 | all | N/A | Exit code 0 is returned for all job states. |
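
The "Command not found" row can be reproduced outside LSF. The following is a minimal sketch, assuming a POSIX `sh` is available on the host; the helper `shell_exit_code` and the command name `no_such_command_for_demo` are hypothetical and chosen only for illustration.

```python
import subprocess

def shell_exit_code(command):
    """Run a command line through sh and return the shell's exit code."""
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode

# Hypothetical command name, assumed not to exist anywhere on PATH.
missing = shell_exit_code("no_such_command_for_demo")
print("exit code for a missing command:", missing)
```

A POSIX shell reports 127 when it cannot find a command, which matches the value LSF reports for a missing command inside a job script.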

Host failure

If an LSF server host fails, jobs running on that host are lost. No other jobs are affected. Jobs can be submitted so that they are automatically rerun from the beginning or restarted from a checkpoint on another host if they are lost because of a host failure.

If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become available.

Exited jobs

A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:

The job exits with a non-zero exit status.

You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Administering Platform LSF for more information.

Application and system exit values

LSF monitors a job while it runs and reports the exit code returned by the job itself. On UNIX platforms, LSF collects this exit code through the wait3() system call. The exit code is a result of the system exit values. Use bhist or bjobs to see the exit code for your job.
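
To make the wait3() step concrete, here is a small sketch of how a parent process can collect a child's exit code on a UNIX host. It is not LSF's implementation; the function name `run_child_and_collect` is hypothetical, and the example assumes a platform where Python's `os.fork` and `os.wait3` are available.

```python
import os

def run_child_and_collect(exit_value):
    """Fork a child that exits with exit_value; return (raw_status, exit_code)."""
    pid = os.fork()
    if pid == 0:
        # Child process: exit immediately without running cleanup handlers.
        os._exit(exit_value)
    # Parent: wait3() reports any finished child, so loop until ours shows up.
    while True:
        done_pid, raw_status, _rusage = os.wait3(0)
        if done_pid == pid:
            break
    if os.WIFEXITED(raw_status):
        return raw_status, os.WEXITSTATUS(raw_status)
    return raw_status, None  # the child was terminated by a signal instead

raw, code = run_child_and_collect(3)
print("raw wait status:", raw, "exit code:", code)
```

The raw status returned by wait3() is not the exit code itself; the exit code has to be extracted from it, which is why tools report a decoded value.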

Application exit values

The most common cause of abnormal LSF job termination is a non-zero application exit value. If your application exits explicitly with a value less than 128, bjobs and bhist display the actual exit code of the application; for example, Exited with exit code 3. You would have to refer to the application code for the meaning of exit code 3.

It is possible for a job to explicitly exit with an exit code greater than 128, which can be confused with the corresponding UNIX signal. Make sure that applications you write do not use exit codes greater than 128.

System signal exit values

When you send a signal that terminates the job, LSF reports either the signal or the signal_value+128. If the return status is greater than 128, and the job was terminated with a signal, then return_status-128=signal. For example, return status 133 means that the job was terminated with signal 5 (SIGTRAP on most systems, 133-128=5). A job with exit status 130 was terminated with signal 2 (SIGINT on most systems, 130-128 = 2).
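
The arithmetic above can be sketched in a few lines. The helper names `signal_from_exit_code` and `signal_name` are hypothetical; the name lookup uses the signal table of the host where the script runs, which may differ from the execution host of the job.

```python
import signal

def signal_from_exit_code(exit_code):
    """Return the signal number implied by an exit code above 128, else None."""
    if 128 < exit_code <= 255:
        return exit_code - 128
    return None

def signal_name(signum):
    """Look up the signal name on the local host; None if the number is unknown."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return None

for code in (133, 130, 3):
    signum = signal_from_exit_code(code)
    name = signal_name(signum) if signum is not None else None
    print(code, signum, name)
```

A result of None for the signal number means the value is below the 128 threshold and should be read as an ordinary application exit code.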

Some operating systems define exit codes as 0-255. As a result, negative exit values or values greater than 255 wrap around within that range. The most common example is a program that exits with -1, which LSF reports as "exit code 255".
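
The wrap-around can be illustrated by keeping only the low 8 bits of the value. This is a sketch of the arithmetic under the assumption of a 0-255 exit code range; the function name `reported_exit_code` is hypothetical.

```python
def reported_exit_code(value):
    """Reduce an arbitrary integer exit value to the 0-255 range."""
    # Only the low 8 bits of the value survive.
    return value & 0xFF

for value in (-1, 255, 256, 300):
    print(value, "->", reported_exit_code(value))
```

Note that a value of 256 wraps to 0, so an application that exits with 256 to signal failure would appear to have succeeded.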

How or why the job may have been signaled, or exited with a certain exit code, can be application and/or system specific. The application or system logs might be able to give a better description of the problem.

tip:  
Termination signals are operating system dependent, so signal 5 may not be SIGTRAP and 11 may not be SIGSEGV on all UNIX and Linux systems. You need to pay attention to the execution host type in order to correctly translate the exit value if the job has been signaled.

bhist and bjobs output

In most cases, bjobs and bhist show the application exit value (128 + signal). In some cases, bjobs and bhist show the actual signal value.

If LSF sends catchable signals to the job, it displays the exit value. For example, if you run bkill jobID to kill the job, LSF passes SIGINT, which causes the job to exit with exit code 130 (SIGINT is 2 on most systems, 128+2 = 130).

If LSF sends uncatchable signals to the job, then the entire process group for the job exits with the corresponding signal. For example, if you run bkill -s SEGV jobID to kill the job, bjobs and bhist show

Exited by signal 7 

Example

The following example shows a job that exited with exit code 139, which means that the job was terminated with signal 11 (SIGSEGV on most systems, 139-128=11). This usually means that the application crashed with a segmentation fault and may have produced a core dump.

bjobs -l 2012

Job <2012>, User , Project , Status , Queue , Command 
Fri Dec 27 22:47:28: Submitted from host , CWD <$HOME>;
Fri Dec 27 22:47:37: Started on , Execution Home , Execution CWD ;
Fri Dec 27 22:48:02: Exited with exit code 139. The CPU time used is 0.2 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      - 
 		                cpuspeed    bandwidth 
 loadSched          -            - 
 loadStop           -            - 

LSF job termination reason logging

When LSF takes action on a job, it may send multiple signals. In the case of job termination, LSF sends SIGINT, SIGTERM, and SIGKILL in succession until the job has terminated. As a result, the job may exit with any of the corresponding exit values at the system level. Other actions may send "warning" signals such as SIGUSR2 to applications. For specific signal sequences, refer to the LSF documentation for that feature.
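
Because SIGINT and SIGTERM can be caught, an application can use them to clean up before SIGKILL arrives. The following is a minimal sketch of that idea, not an LSF-provided API: the names `exit_on_signal` and `install_handlers` are hypothetical, and exiting with 128 plus the signal number is a convention assumed here so that the reported exit code still identifies the signal.

```python
import signal
import sys

def exit_on_signal(signum, frame):
    """Handler: clean up, then exit with 128 + signal number."""
    # Place application-specific cleanup here (flush files, remove temp data).
    sys.exit(128 + signum)

def install_handlers():
    """Trap the two catchable termination signals; SIGKILL cannot be caught."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, exit_on_signal)

# A job script would call install_handlers() once at startup,
# before beginning its real work.
```

With handlers like these, a job interrupted by SIGINT exits with 130 and a job interrupted by SIGTERM exits with 143 on systems where those signals are numbered 2 and 15.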

Run bhist to see the actions that LSF takes on a job:

bhist -l 1798

Job <1798>, User <user1>, Command <sleep 10000>
Tue Feb 25 16:35:31: Submitted from host <hostA>, to Queue <normal>, CWD <$H
                     OME/lsf_7.0/conf/lsbatch/lsf_7.0/configdir>;
Tue Feb 25 16:35:51: Dispatched to <hostA>;
Tue Feb 25 16:35:51: Starting (Pid 12955);
Tue Feb 25 16:35:53: Running with execution home </home/user1>, Execution CWD <
                     /home/user1/Testing/lsf_7.0/conf/lsbatch/lsf_7.0/configdir>,
                     Execution Pid <12955>;
Tue Feb 25 16:38:20: Signal <KILL> requested by user or administrator <user1>;
Tue Feb 25 16:38:22: Exited with exit code 130. The CPU time used is 0.1 seconds;

Summary of time in seconds spent in various states by  Tue Feb 25 16:38:22
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  20       0        151      0        0        0        171 

Here we see that LSF itself sent the signal to terminate the job, and the job exits 130 (130-128 = 2 = SIGINT).

When a job finishes, LSF reports the last job termination action it took against the job and logs it into lsb.acct.

If a running job exits because of node failure, LSF sets the correct exit information in lsb.acct, lsb.events, and the job output file.

View logged job exit information (bacct -l)

  Use bacct -l to view job exit information logged to lsb.acct:

    bacct -l 7265 
     
    Accounting information about jobs that are:  
      - submitted by all users. 
      - accounted on all projects. 
      - completed normally or exited 
      - executed on all hosts. 
      - submitted to all queues. 
      - accounted on all service classes. 
    ------------------------------------------------------------------------------ 
     
    Job <7265>, User <lsfadmin>, Project <default>, Status <EXIT>, Queue <normal>,  
                         Command <srun sleep 100000> 
    Thu Sep 16 15:22:09: Submitted from host <hostA>, CWD <$HOME>; 
    Thu Sep 16 15:22:20: Dispatched to 4 Hosts/Processors <4*hostA>; 
    Thu Sep 16 15:23:21: Completed <exit>; TERM_RUNLIMIT: job killed after reaching 
                         LSF run time limit. 
     
    Accounting information about this job: 
         Share group charged </lsfadmin> 
         CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP 
          0.04       11             72     exit         0.0006     0K      0K 
    ------------------------------------------------------------------------------ 
     
    SUMMARY:      ( time unit: second )  
     Total number of done jobs:       0      Total number of exited jobs:     1 
     Total CPU time consumed:       0.0      Average CPU time consumed:     0.0 
     Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0 
     Total wait time in queues:    11.0 
     Average wait time in queue:   11.0 
     Maximum wait time in queue:   11.0      Minimum wait time in queue:   11.0 
     Average turnaround time:        72 (seconds/job) 
     Maximum turnaround time:        72      Minimum turnaround time:        72 
     Average hog factor of a job:  0.00 ( cpu time / turnaround time ) 
     Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00 
    

Termination reasons displayed by bacct

When LSF detects that a job is terminated, bacct -l displays one of the following termination reasons:

| Keyword displayed by bacct | Termination reason | Integer value logged to JOB_FINISH in lsb.acct |
|---|---|---|
| TERM_ADMIN | Job killed by root or LSF administrator | 15 |
| TERM_BUCKET_KILL | Job killed with bkill -b | 23 |
| TERM_CHKPNT | Job killed after checkpointing | 13 |
| TERM_CPULIMIT | Job killed after reaching LSF CPU usage limit | 12 |
| TERM_CWD_NOTEXIST | Current working directory is not accessible or does not exist on the execution host | 25 |
| TERM_DEADLINE | Job killed after deadline expires | 6 |
| TERM_EXTERNAL_SIGNAL | Job killed by a signal external to LSF | 17 |
| TERM_FORCE_ADMIN | Job killed by root or LSF administrator without time for cleanup | 9 |
| TERM_FORCE_OWNER | Job killed by owner without time for cleanup | 8 |
| TERM_LOAD | Job killed after load exceeds threshold | 3 |
| TERM_MEMLIMIT | Job killed after reaching LSF memory usage limit | 16 |
| TERM_OTHER | Member of a chunk job in WAIT state killed and requeued after being switched to another queue | 4 |
| TERM_OWNER | Job killed by owner | 14 |
| TERM_PREEMPT | Job killed after preemption | 1 |
| TERM_PROCESSLIMIT | Job killed after reaching LSF process limit | 7 |
| TERM_REQUEUE_ADMIN | Job killed and requeued by root or LSF administrator | 11 |
| TERM_REQUEUE_OWNER | Job killed and requeued by owner | 10 |
| TERM_RMS | Job exited from an RMS system error | 18 |
| TERM_RUNLIMIT | Job killed after reaching LSF run time limit | 5 |
| TERM_SLURM | Job terminated abnormally in SLURM (node failure) | 22 |
| TERM_SWAP | Job killed after reaching LSF swap usage limit | 20 |
| TERM_THREADLIMIT | Job killed after reaching LSF thread limit | 21 |
| TERM_UNKNOWN | LSF cannot determine a termination reason; 0 is logged but TERM_UNKNOWN is not displayed | 0 |
| TERM_WINDOW | Job killed after queue run window closed | 2 |
| TERM_ZOMBIE | Job exited while LSF is not available | 19 |
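
For scripts that post-process accounting data, the table above can be kept as a lookup from the logged integer to the keyword. This is only a sketch built from the values listed in this document; the names `TERM_REASONS` and `term_keyword` are hypothetical, and extracting the integer from a JOB_FINISH record is left out because the record layout is not described here.

```python
# Integer values and keywords copied from the termination reason table.
TERM_REASONS = {
    0: "TERM_UNKNOWN",
    1: "TERM_PREEMPT",
    2: "TERM_WINDOW",
    3: "TERM_LOAD",
    4: "TERM_OTHER",
    5: "TERM_RUNLIMIT",
    6: "TERM_DEADLINE",
    7: "TERM_PROCESSLIMIT",
    8: "TERM_FORCE_OWNER",
    9: "TERM_FORCE_ADMIN",
    10: "TERM_REQUEUE_OWNER",
    11: "TERM_REQUEUE_ADMIN",
    12: "TERM_CPULIMIT",
    13: "TERM_CHKPNT",
    14: "TERM_OWNER",
    15: "TERM_ADMIN",
    16: "TERM_MEMLIMIT",
    17: "TERM_EXTERNAL_SIGNAL",
    18: "TERM_RMS",
    19: "TERM_ZOMBIE",
    20: "TERM_SWAP",
    21: "TERM_THREADLIMIT",
    22: "TERM_SLURM",
    23: "TERM_BUCKET_KILL",
    25: "TERM_CWD_NOTEXIST",
}

def term_keyword(value):
    """Return the keyword for a logged integer, or None if it is not listed."""
    return TERM_REASONS.get(value)

print(term_keyword(5))
```

Values that do not appear in the table, such as 24, return None rather than a guessed keyword.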

tip:  
The integer values logged to the JOB_FINISH event in lsb.acct and termination reason keywords are mapped in lsbatch.h.
Restrictions

Example output of bacct and bhist

| Example termination cause | Termination reason in bacct -l | Example bhist output |
|---|---|---|
| bkill -s KILL, bkill job_ID | Completed <exit>; TERM_OWNER or TERM_ADMIN | Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>; Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds; |
| bkill -r | Completed <exit>; TERM_FORCE_ADMIN or TERM_FORCE_OWNER when sbatchd is not reachable. Otherwise, TERM_OWNER or TERM_ADMIN | Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>; Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds; |
| TERMINATE_WHEN | Completed <exit>; TERM_LOAD, TERM_WINDOW, or TERM_PREEMPT | Thu Mar 13 17:33:16: Signal <KILL> requested by user or administrator <user2>; Thu Mar 13 17:33:18: Exited by signal 2. The CPU time used is 0.1 seconds; |
| Memory limit reached | Completed <exit>; TERM_MEMLIMIT | Thu Mar 13 19:31:13: Exited by signal 2. The CPU time used is 0.1 seconds; |
| Run limit reached | Completed <exit>; TERM_RUNLIMIT | Thu Mar 13 20:18:32: Exited by signal 2. The CPU time used is 0.1 seconds. |
| CPU limit | Completed <exit>; TERM_CPULIMIT | Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds; |
| Swap limit | Completed <exit>; TERM_SWAP | Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds; |
| Regular job exits when host crashes | Rusage 0, Completed <exit>; TERM_ZOMBIE | Thu Jun 12 15:49:02: Unknown; unable to reach the execution host; Thu Jun 12 16:10:32: Running; Thu Jun 12 16:10:38: Exited with exit code 143. The CPU time used is 0.0 seconds; |
| brequeue -r | For each requeue, Completed <exit>; TERM_REQUEUE_ADMIN or TERM_REQUEUE_OWNER | Thu Mar 13 17:46:39: Signal <REQUEUE_PEND> requested by user or administrator <user2>; Thu Mar 13 17:46:56: Exited by signal 2. The CPU time used is 0.1 seconds; |
| bchkpnt -k | On the first run: Completed <exit>; TERM_CHKPNT | Wed Apr 16 16:00:48: Checkpoint succeeded (actpid 931249); Wed Apr 16 16:01:03: Exited with exit code 137. The CPU time used is 0.0 seconds; |
| Kill -9 <RES> and job | Completed <exit>; TERM_EXTERNAL_SIGNAL | Thu Mar 13 17:30:43: Exited by signal 15. The CPU time used is 0.1 seconds; |
| Others | Completed <exit>; | Thu Mar 13 17:30:43: Exited with 3; The CPU time used is 0.1 seconds; |

Job termination by LSF exit information

LSF also provides additional information in the POST_EXEC of the job. Use this information to detect conditions where LSF has terminated the job and take the appropriate action.

The job exit information in the POST_EXEC is defined in two parts: LSB_JOBEXIT_STAT and LSB_JOBEXIT_INFO. Common values for both are listed later in this document.

Queue-level POST_EXEC commands should be written by the cluster administrator to perform whatever task is necessary for specific exit situations.
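
As an illustration of what such a command might do, the sketch below reads the two values and produces a one-line summary. It assumes that LSB_JOBEXIT_STAT and LSB_JOBEXIT_INFO are available to the POST_EXEC command as environment variables, and that LSB_JOBEXIT_INFO has the three-token form shown in the table of common values in this document (for example, SIGNAL -14 SIG_TERM_USER). The function names `parse_jobexit_info` and `describe_exit` are hypothetical.

```python
import os

def parse_jobexit_info(info):
    """Split a value such as 'SIGNAL -14 SIG_TERM_USER' into (number, name)."""
    if not info:
        return None
    parts = info.split()
    if len(parts) != 3 or parts[0] != "SIGNAL":
        return None
    try:
        return int(parts[1]), parts[2]
    except ValueError:
        return None

def describe_exit(env):
    """Summarize the exit information found in a mapping of variables."""
    stat = env.get("LSB_JOBEXIT_STAT", "unknown")
    parsed = parse_jobexit_info(env.get("LSB_JOBEXIT_INFO"))
    if parsed is None:
        # No reason recorded by LSF: the job exited on its own or was
        # terminated by something outside LSF.
        return "no LSF termination reason; raw status " + str(stat)
    number, name = parsed
    return "LSF terminated the job: " + name + " (" + str(number) + ")"

# In a real POST_EXEC command the mapping would be os.environ.
print(describe_exit(os.environ))
```

An administrator could branch on the returned name, for example to notify the user only when the reason is a limit enforced by LSF.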

tip:  
Limits enforced at the system level, such as CPU and memory limits (listed above), cannot be shown in LSB_JOBEXIT_INFO because the operating system performs the action, not LSF. Set appropriate parameters in the queue or at job submission to allow LSF to enforce the limits, which makes this information available to LSF.

Common LSB_JOBEXIT_STAT and LSB_JOBEXIT_INFO values

The following table lists common scenarios that are covered and not covered by LSB_JOBEXIT_INFO:

| Example termination cause | LSB_JOBEXIT_STAT | LSB_JOBEXIT_INFO | Example bhist output |
|---|---|---|---|
| Job killed with SIGINT: bkill -s INT 520 | 33280 | SIGNAL 2 INT | Fri Feb 14 16:48:00: Exited with exit code 130. The CPU time used is 0.2 seconds; |
| Job killed with SIGTERM: bkill -s TERM 521 | 36608 | SIGNAL 15 TERM | Fri Feb 14 16:49:50: Exited with exit code 143. The CPU time used is 0.2 seconds; |
| Job killed with SIGKILL: bkill -s KILL 522 | 33280 | SIGNAL -14 SIG_TERM_USER | Fri Feb 14 16:51:03: Exited with exit code 130. The CPU time used is 0.2 seconds; |
| Automatic migration when MIG is defined at queue level | 33280 | SIGNAL -1 SIG_CHKPNT | Fri Feb 14 17:32:17: Job has been requeued; Fri Feb 14 17:32:17: Pending: Migrating job is waiting for rescheduling; |
| bsub -I "hostname;exit 130" | 33280 | Undefined | Fri Feb 14 14:41:51: Exited with exit code 130. The CPU time used is 0.2 seconds; |
| Killing the job with the bkill command: bkill 210 | 33280 | SIGNAL -14 SIG_TERM_USER | Fri Feb 14 14:45:51: Exited with exit code 130. The CPU time used is 0.2 seconds; |
| Job being requeued: brequeue -r (Job <211> is being requeued) | 33280 | SIGNAL -23 SIG_KILL_REQUEUE | Fri Feb 14 14:48:15: Signal <REQUEUE_PEND> requested by user or administrator <iayaz>; Fri Feb 14 14:48:18: Exited with exit code 130. The CPU time used is 0.2 second |
| Job being migrated: bmig -m togni (Job <213> is being migrated) | 33280 | SIGNAL -1 SIG_CHKPNT | Fri Feb 14 15:04:42: Migration requested by user or administrator <iayaz>; Specified Hosts <togni>; Fri Feb 14 15:04:44: Job is being requeued; Fri Feb 14 15:05:01: Job has been requeued; Fri Feb 14 15:05:01: Pending: Migrating job is waiting for rescheduling; |
| Job killed due to REQUEUE_EXIT_VALUE: bsub "sleep 100;exit 34" | 8704 | Undefined | Fri Feb 14 15:10:21: Pending: Requeued job is waiting for rescheduling;(exit code 34)>; |
| Job killed by LSF when CPULIMIT is enforced by LSF | 158 | SIGNAL -24 SIG_TERM_CPULIMIT | Wed Feb 19 14:18:13: Exited by signal 30. The CPU time used is 89.4 seconds. |
| Job killed because queue-level CPULIMIT is reached | 40448 | Undefined | Fri Feb 14 15:30:01: Exited with exit code 158. The CPU time used is 61.2 seconds; |
| Job killed because queue-level RUNLIMIT is reached | 37120 | Undefined | Fri Feb 14 15:37:44: Exited with exit code 145. The CPU time used is 0.2 seconds; |
| Job killed due to checkpointing: bchkpnt -k 838 (Job <838> is being checkpointed) | 9 | SIGNAL -1 SIG_CHKPNT | Fri Feb 14 17:59:12: Checkpoint succeeded (actpid 25298); Fri Feb 14 17:59:12: Exited by signal 9. The CPU time used is 0.1 seconds; |
| Job killed when it reaches the MEMLIMIT: bsub -M 5 "/home/iayaz/script/memwrite -m 10 -r 2" | 2 | SIGNAL -25 SIG_TERM_MEMLIMIT | Fri Feb 21 10:50:50: Exited by signal 2. The CPU time used is 0.1 seconds; |
| Job killed when termination time approaches: bsub -t 21:11:10 sleep 500;date | 37120 | Undefined | Exited with exit code 145. The CPU time used is 0.2 seconds; |
| Job killed when TERMINATE_WHEN = LOAD | 33280 | SIGNAL -15 SIG_TERM_LOAD | Exited with exit code 130. The CPU time used is 7.2 seconds. |
| Job killed when TERMINATE_WHEN = PREEMPT | 33280 | SIGNAL -16 SIG_TERM_PREEMPT | Exited with exit code 130. The CPU time used is 0.3 seconds; |
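
The LSB_JOBEXIT_STAT values above line up with the bhist output if they are read as raw wait status words. The sketch below assumes the conventional UNIX encoding, in which the low 7 bits hold a terminating signal and the next byte up holds the exit code; that encoding is an assumption about the execution host, and the function name `decode_jobexit_stat` is hypothetical.

```python
def decode_jobexit_stat(stat):
    """Decode a raw wait status into ('exit', code) or ('signal', number)."""
    signum = stat & 0x7F  # low 7 bits: terminating signal, 0 if none
    if signum:
        # Bit 0x80, when set, is the core dump flag and is ignored here.
        return ("signal", signum)
    return ("exit", (stat >> 8) & 0xFF)

# Values taken from the LSB_JOBEXIT_STAT column of the table.
for stat in (33280, 36608, 8704, 40448, 37120, 158, 9, 2):
    print(stat, decode_jobexit_stat(stat))
```

Under this reading, 33280 decodes to exit code 130 and 158 decodes to signal 30, which agrees with the "Exited with exit code 130" and "Exited by signal 30" lines in the corresponding rows.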

LSF RMS integration exit values

For the RMS integrations with LSF (HP AlphaServer SC and Linux QsNet), LSF jobs running through RMS return the rms_run() return code as the job exit code. RMS documents certain exit codes and the corresponding job exit reasons.

See the rms_run() man page for more information.

Upon successful completion, rms_run() returns the global OR of the exit status values of the processes in the parallel program. If one of the processes is killed, rms_run() returns a status value of 128 plus the signal number. It can also return the following codes:

| Return Code | RMS Meaning |
|---|---|
| 0 | A process exited with the code 127 (GLOBAL EXIT), which indicates success, causing all of the processes to exit. |
| 123 | A process exited with the code 123 (GLOBAL ERROR) causing all of the processes to exit. |
| 124 | The node the job executing on has been removed from the system. |
| 125 | One or more processes were still running when the exit timeout expired. |
| 126 | The resource is inadequate for the request. |


Platform Computing Inc.
www.platform.com