Configuring Job Controls
After a job is started, it can be killed, suspended, or resumed by the system, an LSF user, or LSF administrator. LSF job control actions cause the status of a job to change. This chapter describes how to configure job control actions to override or augment the default job control actions.
- Default Job Control Actions
- Configuring Job Control Actions
- Customizing Cross-Platform Signal Conversion
Default Job Control Actions
LSF supports the following default actions for job controls:
- SUSPEND
- RESUME
- TERMINATE
On successful completion of a job control action, the LSF job control commands cause the status of the job to change.
The environment variable LS_EXEC_T is set to the value JOB_CONTROLS for a job when a job control action is initiated.
See Killing Jobs for more information about job controls and the LSF commands that perform them.
SUSPEND action
Change a running job from RUN state to a suspended state.
The default action is to send the following signals to the job:
- SIGTSTP for parallel or interactive jobs. SIGTSTP is caught by the master process and passed to all the slave processes running on other hosts.
- SIGSTOP for sequential jobs. SIGSTOP cannot be caught by user programs. The SIGSTOP signal can be configured with the LSB_SIGSTOP parameter in lsf.conf.
LSF invokes the SUSPEND action when:
- The user or LSF administrator issues a bstop or bkill command to the job
- Load conditions on the execution host satisfy any of:
- The suspend conditions of the queue, as specified by the STOP_COND parameter in lsb.queues
- The scheduling thresholds of the queue or the execution host
- The run window of the queue closes
- The job is preempted by a higher priority job
RESUME action
Change a suspended job from SSUSP, USUSP, or PSUSP state to the RUN state. The default action is to send the signal SIGCONT.
LSF invokes the RESUME action when:
- The user or LSF administrator issues a bresume command to the job
- Load conditions on the execution host satisfy all of:
- The resume conditions of the queue, as specified by the RESUME_COND parameter in lsb.queues
- The scheduling thresholds of the queue and the execution host
- A closed run window of the queue opens again
- A preempted job finishes
TERMINATE action
Terminate a job. This usually causes the job to change to EXIT status. The default action is to send SIGINT first, then SIGTERM 10 seconds after SIGINT, then SIGKILL 10 seconds after SIGTERM. The delay between signals allows user programs to catch the signals and clean up before the job terminates.
To override the 10 second interval, use the parameter JOB_TERMINATE_INTERVAL in the lsb.params file. See the Platform LSF Configuration Reference for information about the lsb.params file.
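As a rough illustration, a longer interval might be configured as in the fragment below. This is a sketch only: the value 30 is an arbitrary example, and the fragment assumes the usual Begin Parameters/End Parameters layout of lsb.params. Check the Platform LSF Configuration Reference for the exact semantics before relying on it.

```
# lsb.params (sketch; 30 is an arbitrary example value, in seconds)
Begin Parameters
JOB_TERMINATE_INTERVAL = 30
End Parameters
```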
LSF invokes the TERMINATE action when:
- The user or LSF administrator issues a bkill or brequeue command to the job
- The TERMINATE_WHEN parameter in the queue definition (lsb.queues) causes a SUSPEND action to be redirected to TERMINATE
- The job reaches its CPULIMIT, MEMLIMIT, RUNLIMIT or PROCESSLIMIT
If the execution of an action is in progress, no further actions are initiated unless it is the TERMINATE action. A TERMINATE action is issued for all job states except PEND.
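The default TERMINATE sequence (SIGINT, then SIGTERM, then SIGKILL) gives a job a chance to clean up. The following is a minimal sketch of a job script that uses that window, assuming the job is a /bin/sh script. The names SCRATCH_DIR and cleanup are hypothetical and are not part of LSF.

```shell
#!/bin/sh
# Sketch of a job script that cleans up when the default TERMINATE action
# sends SIGINT or SIGTERM. SCRATCH_DIR and cleanup are hypothetical names.
SCRATCH_DIR="${TMPDIR:-/tmp}/job_scratch.$$"

cleanup() {
    # Remove temporary files before SIGKILL arrives.
    rm -rf "$SCRATCH_DIR"
}

# 130 and 143 follow the common 128+signal-number exit convention.
trap 'cleanup; exit 130' INT
trap 'cleanup; exit 143' TERM

mkdir -p "$SCRATCH_DIR"
# ... the real work of the job would run here; call cleanup when it
# finishes normally ...
```

A shell runs a trap only after the current foreground command returns, so a long-running step may need to be started in the background and waited on for the handler to run promptly.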
Windows job control actions
On Windows, actions equivalent to the UNIX signals have been implemented to do the default job control actions. Job control messages replace the SIGINT and SIGTERM signals, but only customized applications will be able to process them. Termination is implemented by the TerminateProcess() system call.
See Platform LSF Programmer's Guide for more information about LSF signal handling on Windows.
Configuring Job Control Actions
Several situations may require overriding or augmenting the default actions for job control. For example:
- Notifying users when their jobs are suspended, resumed, or terminated
- An application holds resources (for example, licenses) that are not freed by suspending the job. The administrator can set up an action to be performed that causes the license to be released before the job is suspended and re-acquired when the job is resumed.
- The administrator wants the job checkpointed before it is suspended or terminated.
- A distributed parallel application must receive a catchable signal when the job is suspended, resumed or terminated to propagate the signal to remote processes.
To override the default actions for the SUSPEND, RESUME, and TERMINATE job controls, specify the JOB_CONTROLS parameter in the queue definition in lsb.queues.
See the Platform LSF Configuration Reference for information about the lsb.queues file.
JOB_CONTROLS parameter (lsb.queues)
The JOB_CONTROLS parameter has the following format:
Begin Queue
...
JOB_CONTROLS = SUSPEND[signal | CHKPNT | command] \
               RESUME[signal | command] \
               TERMINATE[signal | CHKPNT | command]
...
End Queue
When LSF needs to suspend, resume, or terminate a job, it invokes one of the following actions as specified by SUSPEND, RESUME, and TERMINATE.
signal - A UNIX signal name (for example, SIGTSTP or SIGTERM). The specified signal is sent to the job.
The same set of signals is not supported on all UNIX systems. To display a list of the symbolic names of the signals (without the SIG prefix) supported on your system, use the kill -l command.
CHKPNT - Checkpoint the job. Only valid for SUSPEND and TERMINATE actions.
- If the SUSPEND action is CHKPNT, the job is checkpointed and then stopped by sending the SIGSTOP signal to the job automatically.
- If the TERMINATE action is CHKPNT, then the job is checkpointed and killed automatically.
command - A /bin/sh command line.
- Do not quote the command line inside an action definition.
- Do not specify a signal followed by an action that triggers the same signal (for example, do not specify JOB_CONTROLS=TERMINATE[bkill] or JOB_CONTROLS=TERMINATE[brequeue]). This will cause a deadlock between the signal and the action.
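To make the three kinds of action concrete, the fragment below sketches a queue that mixes a command with signals. It is illustrative only: the script path /usr/local/lsf/scripts/suspend_action.sh is hypothetical, and the choice of SIGCONT and SIGTERM is just an example of the signal form.

```
# lsb.queues (sketch; the script path is hypothetical)
Begin Queue
...
JOB_CONTROLS = SUSPEND[/usr/local/lsf/scripts/suspend_action.sh] \
               RESUME[SIGCONT] \
               TERMINATE[SIGTERM]
...
End Queue
```

Note that the command is written without surrounding quotes and does not call bstop, bkill, or brequeue, in line with the restrictions above.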
Using a command as a job control action
- The command line for the action is run with /bin/sh -c so you can use shell features in the command.
- The command is run as the user of the job.
- All environment variables set for the job are also set for the command action. The following additional environment variables are set:
- LSB_JOBPGIDS - a list of current process group IDs of the job
- LSB_JOBPIDS - a list of current process IDs of the job
- For the SUSPEND action command, the environment variables LSB_SUSP_REASONS and LSB_SUSP_SUBREASONS are also set. Use them together in your custom job control to determine the exact load threshold that caused a job to be suspended.
- LSB_SUSP_REASONS - an integer representing a bitmap of suspending reasons as defined in lsbatch.h. The suspending reason can allow the command to take different actions based on the reason for suspending the job.
- LSB_SUSP_SUBREASONS - an integer representing the load index that caused the job to be suspended. When the suspending reason SUSP_LOAD_REASON (suspended by load) is set in LSB_SUSP_REASONS, LSB_SUSP_SUBREASONS is set to one of the load index values defined in lsf.h.
- The standard input, output, and error of the command are redirected to the NULL device, so you cannot tell directly whether the command runs correctly. The default null device on UNIX is /dev/null.
- You should make sure the command line is correct. If you want to see the output from the command line for testing purposes, redirect the output to a file inside the command line.
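The points above can be sketched as a small SUSPEND action script. This is a minimal sketch, not a definitive implementation. The names LOG_FILE, log_suspend, has_reason, and stop_job are hypothetical; only the LSB_* variables come from LSF. It assumes LSB_JOBPIDS is a space-separated list and that a custom SUSPEND command has to stop the job itself because it replaces the default action. The bit mask passed to has_reason must be taken from lsbatch.h; no value is assumed here.

```shell
#!/bin/sh
# Sketch of a SUSPEND action command. LOG_FILE, log_suspend, has_reason and
# stop_job are hypothetical names; the LSB_* variables are set by LSF.
LOG_FILE="${LOG_FILE:-/tmp/suspend_action.log}"

log_suspend() {
    # stdout and stderr go to the null device, so log to a file explicitly.
    echo "job=${LSB_JOBID:-unknown} reasons=${LSB_SUSP_REASONS:-0} subreasons=${LSB_SUSP_SUBREASONS:-0}" >> "$LOG_FILE"
}

has_reason() {
    # $1 = bit mask of one suspending reason, as defined in lsbatch.h.
    [ $(( ${LSB_SUSP_REASONS:-0} & $1 )) -ne 0 ]
}

stop_job() {
    # Assumption: the custom command replaces the default action,
    # so the job processes are stopped here.
    for pid in ${LSB_JOBPIDS:-}; do
        kill -STOP "$pid"
    done
}

log_suspend
stop_job
```

A command could branch on has_reason with the mask for SUSP_LOAD_REASON and then inspect LSB_SUSP_SUBREASONS to find the load index involved.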
TERMINATE job actions
Use caution when configuring TERMINATE job actions that do more than just kill a job. For example, resource usage limits that terminate jobs change the job state to SSUSP while LSF waits for the job to end. If the job is not killed by the TERMINATE action, it remains suspended indefinitely.
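One way to reduce this risk is to make sure the extra work can never prevent the kill. The following is a hedged sketch of such a TERMINATE action script. The names notify_owner, terminate_job, and NOTIFY_LOG are hypothetical, and LSB_JOBPIDS is assumed to be a space-separated list.

```shell
#!/bin/sh
# Sketch of a TERMINATE action that does extra work but always ends by
# killing the job. notify_owner, terminate_job and NOTIFY_LOG are
# hypothetical names; LSB_JOBID and LSB_JOBPIDS are set by LSF.
notify_owner() {
    # Best-effort notification; replace with a real mechanism such as mail.
    echo "job ${LSB_JOBID:-unknown} is being terminated" >> "${NOTIFY_LOG:-/dev/null}"
}

terminate_job() {
    # Never let a failure in the extra step prevent the kill.
    notify_owner || true
    for pid in ${LSB_JOBPIDS:-}; do
        kill -KILL "$pid" 2>/dev/null || true
    done
}

terminate_job
```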
TERMINATE_WHEN parameter (lsb.queues)
In certain situations you may want to terminate the job instead of calling the default SUSPEND action. For example, you may want to kill jobs if the run window of the queue is closed. Use the TERMINATE_WHEN parameter to configure the queue to invoke the TERMINATE action instead of SUSPEND.
See the Platform LSF Configuration Reference for information about the lsb.queues file and the TERMINATE_WHEN parameter.
Syntax
TERMINATE_WHEN = [LOAD] [PREEMPT] [WINDOW]
The following defines a night queue that kills jobs when the run window closes:
Begin Queue
QUEUE_NAME     = night
RUN_WINDOW     = 20:00-08:00
TERMINATE_WHEN = WINDOW
JOB_CONTROLS   = TERMINATE[ kill -KILL $LSB_JOBPIDS; echo "job $LSB_JOBID killed by queue run window" | mail $USER ]
End Queue
LSB_SIGSTOP parameter (lsf.conf)
Use LSB_SIGSTOP to configure the SIGSTOP signal sent by the default SUSPEND action.
If LSB_SIGSTOP is set to anything other than SIGSTOP, the SIGTSTP signal that is normally sent by the SUSPEND action is not sent. For example, if LSB_SIGSTOP=SIGKILL, the three default signals sent by the TERMINATE action (SIGINT, SIGTERM, and SIGKILL) are sent 10 seconds apart.
See the Platform LSF Configuration Reference for information about the lsf.conf file.
Avoiding signal and action deadlock
Do not configure a job control to contain the signal or command that is the same as the action associated with that job control. This will cause a deadlock between the signal and the action.
For example, the bkill command uses the TERMINATE action, so a deadlock results when the TERMINATE action itself contains the bkill command.
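Any of the following job control specifications will cause a deadlock:
- JOB_CONTROLS=TERMINATE[bkill]
- JOB_CONTROLS=TERMINATE[brequeue]
- JOB_CONTROLS=RESUME[bresume]
- JOB_CONTROLS=SUSPEND[bstop]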
Customizing Cross-Platform Signal Conversion
LSF supports signal conversion between UNIX and Windows for remote interactive execution through RES.
On Windows, the CTRL+C and CTRL+BREAK key combinations are treated as signals for console applications (these signals are also called console control actions).
LSF supports these two Windows console signals for remote interactive execution. LSF regenerates these signals for user tasks on the execution host.
Default signal conversion
In a mixed Windows/UNIX environment, LSF has the following default conversion between the Windows console signals and the UNIX signals:
For example, if you issue the lsrun or bsub -I commands from a Windows console but the task is running on a UNIX host, pressing CTRL+C sends a UNIX SIGINT signal to your task on the UNIX host. The opposite is also true.
Custom signal conversion
For lsrun (but not bsub -I), LSF allows you to define your own signal conversion using the following environment variables:
- LSF_NT2UNIX_CLTRC=SIGXXXX
- LSF_NT2UNIX_CLTRB=SIGYYYY
Here, SIGXXXX and SIGYYYY are UNIX signal names such as SIGQUIT or SIGINT. The conversions are then CTRL+C=SIGXXXX and CTRL+BREAK=SIGYYYY.
If both LSF_NT2UNIX_CLTRC and LSF_NT2UNIX_CLTRB are set to the same value (LSF_NT2UNIX_CLTRC=SIGXXXX and LSF_NT2UNIX_CLTRB=SIGXXXX), CTRL+C will be generated on the Windows execution host.
For bsub -I, there is no conversion other than the default conversion.
Platform Computing Inc.