Programming with LSBLIB


This chapter shows how to use LSBLIB to access the services provided by LSF Batch and other LSF products. Since LSF Batch is built on top of LSF Base, LSBLIB relies on services provided by LSLIB. However, you only need to link your program with LSBLIB to use LSBLIB functions because the LSBLIB header file (lsbatch.h) already includes the LSLIB header file (lsf.h). All other LSF products (such as Platform Parallel and Platform Make) rely on services provided by LSBLIB.

LSF Batch and Platform JobScheduler services are provided by mbatchd. Services for processing event and job log files do not involve any daemons. LSBLIB is shared by both LSF Batch and Platform JobScheduler. The functions described for LSF Batch in this chapter also apply to other LSF products, unless explicitly indicated otherwise.


Initializing LSF Batch Applications

lsb_init() function

Before accessing any of the LSF Batch services, an application must initialize LSBLIB. An application does this by calling lsb_init().

lsb_init() has the following parameter:

char *appName

On success, lsb_init() returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.

The parameter appName is the name of the application. Use appName to log detailed messages about the transactions inside LSLIB for debugging purposes. If LSB_CMD_LOG_MASK is defined as LOG_DEBUG1, the messages are logged.

Messages are logged in LSF_LOGDIR/appName. If appName is NULL, the log file is LSF_LOGDIR/bcmd.

Example

Here is an example of code showing the usage of this function:

/* Include <lsf/lsbatch.h> when using this function */

if (lsb_init(argv[0]) < 0) {
        lsb_perror("simbsub: lsb_init() failed");
        exit(-1);
}

lsb_perror()

The function lsb_perror(char *usrMsg) prints an LSF Batch error message on stderr. The user message usrMsg is printed first, followed by a colon (:) and the batch error message corresponding to lsberrno.


Getting Information about LSF Batch Queues

LSF Batch queues hold the jobs in LSF Batch, subject to scheduling policies and limits on resource usage.

lsb_queueinfo()

lsb_queueinfo() gets information about the queues in LSF Batch.

The example program in this section uses lsb_queueinfo() to get the queue information:

struct queueInfoEnt *lsb_queueinfo(queues,numQueues,
                   hostname,username,options)

lsb_queueinfo() has the following parameters:

char  **queues;           Array containing names of queues of interest
int   *numQueues;         Number of queues
char  *hostname;          Select queues that use this host
char  *username;          Select queues enabled for this user
int   options;            Reserved for future use; supply 0

To get information on all queues, set *numQueues to 0. If *numQueues is 1 and queues is NULL, information on the default system queue is returned.

If hostname is not NULL, all queues that use host hostname as a batch server host are returned. If username is not NULL, all queues that allow user username to submit jobs are returned.

On success, lsb_queueinfo() returns an array containing a queueInfoEnt structure (see below) for each queue of interest and sets *numQueues to the size of the array. On failure, lsb_queueinfo() returns NULL and sets lsberrno to indicate the error.

The queueInfoEnt structure is defined in lsbatch.h as

struct queueInfoEnt {
    char  *queue;             Name of the queue
    char  *description;       Description of the queue
    int   priority;           Priority of the queue
    short nice;               Nice value at which jobs in the queue run
    char  *userList;          Users allowed to submit jobs to the queue
    char  *hostList;          Hosts that can run jobs in the queue
    int   nIdx;               Size of the loadSched and loadStop arrays
    float *loadSched;         Load thresholds that control scheduling of jobs
                                  from the queue
    float *loadStop;          Load thresholds that control suspension of
                                  jobs from the queue
    int   userJobLimit;       Number of unfinished jobs a user can dispatch
                                  from the queue
    int   procJobLimit;       Number of unfinished jobs the queue can
                                  dispatch to a processor
    char  *windows;           Queue run window
    int   rLimits[LSF_RLIM_NLIMITS];  Per-process resource limits for
                                           jobs
    char  *hostSpec;          Obsolete. Use defaultHostSpec instead
    int   qAttrib;            Attributes of the queue
    int   qStatus;            Status of the queue
    int   maxJobs;            Job slot limit of the queue.
    int   numJobs;            Total number of job slots required by all jobs
    int   numPEND;            Number of job slots needed by pending jobs
    int   numRUN;             Number of job slots used by running jobs
    int   numSSUSP;           Number of job slots used by system
                                  suspended jobs
    int   numUSUSP;           Number of job slots used by user
                                  suspended jobs
    int   mig;                Queue migration threshold in minutes
    int   schedDelay;         Schedule delay for new jobs
    int   acceptIntvl;        Minimum interval between two jobs dispatched
                                  to the same host
    char  *windowsD;          Queue dispatch window
    char  *nqsQueues;         Blank-separated list of NQS queue specifiers
    char  *userShares;        Blank-separated list of user shares
    char  *defaultHostSpec;   Value of DEFAULT_HOST_SPEC for the
                                  queue in lsb.queues
    int   procLimit;          Maximum number of job slots a job can take
    char  *admins;            Queue level administrators
    char  *preCmd;            Queue level pre-exec command 
    char  *postCmd;           Queue's post-exec command 
    char  *requeueEValues;    Queue's requeue exit status 
    int   hostJobLimit;       Per host job slot limit 
    char  *resReq;            Queue level resource requirement 
    int   numRESERVE;         Reserved job slots for pending jobs 
    int   slotHoldTime;       Time period for reserving job slots
    char  *sndJobsTo;         Remote queues to forward jobs to 
    char  *rcvJobsFrom;       Remote queues which can forward to me 
    char  *resumeCond;        Conditions to resume jobs 
    char  *stopCond;          Conditions to suspend jobs 
    char  *jobStarter;        Queue level job starter 
    char  *suspendActCmd;     Action commands for SUSPEND
    char  *resumeActCmd;      Action commands for RESUME 
    char  *terminateActCmd;   Action commands for TERMINATE 
    int   sigMap[LSB_SIG_NUM];  Configurable signal mapping 
    char  *preemption;        Preemption policy
    int    maxRschedTime;     Time period for remote cluster to schedule job
    struct shareAcctInfoEnt *shareAccts;  Array of shareAcctInfoEnt
    char   *chkpntDir;        chkpnt directory
    int    chkpntPeriod;      chkpnt period
    int    imptJobBklg;       Number of important jobs kept in the queue
    int    defLimits[LSF_RLIM_NLIMITS];  LSF resource limits (soft)
    int    chunkJobSize;      Maximum number of jobs in one chunk
};

The variable nIdx is the number of load threshold values for job scheduling. This is the total number of load indices returned by LIM. The parameters sndJobsTo, rcvJobsFrom, and maxRschedTime are used with LSF MultiCluster. The variable chunkJobSize must be larger than 1.

For a complete description of the fields in the queueInfoEnt structure, see the lsb_queueinfo() man page.

Include lsbatch.h in every application that uses LSBLIB functions. lsf.h does not have to be explicitly included in your program because lsbatch.h includes lsf.h.

Like the data structures returned by LSLIB functions, the data structures returned by an LSBLIB function are dynamically allocated inside LSBLIB and are automatically freed next time the same function is called. Do not attempt to free the space allocated by LSBLIB. To keep this information across calls, make your own copy of the data structure.
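
The advice above about keeping your own copy can be sketched as follows. This is only an illustration under stated assumptions: struct queueSnapshot, snapshot_queue() and free_snapshot() are hypothetical names that are not part of LSBLIB, and the structure holds just a few of the queueInfoEnt fields so that the sketch compiles without the LSF headers. In a real program the arguments would come from an entry returned by lsb_queueinfo().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for a few queueInfoEnt fields. */
struct queueSnapshot {
    char *queue;        /* caller-owned copy of the queue name */
    char *description;  /* caller-owned copy of the description */
    int   priority;
    int   maxJobs;
};

/* Duplicate a string; a NULL input yields a NULL copy. */
static char *dup_string(const char *s)
{
    char *copy;

    if (s == NULL)
        return NULL;
    copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Fill *dst with copies that stay valid after the library reuses its
 * own storage. Returns 0 on success, -1 if memory could not be allocated. */
int snapshot_queue(struct queueSnapshot *dst, const char *queue,
                   const char *description, int priority, int maxJobs)
{
    assert(dst != NULL);
    dst->queue = dup_string(queue);
    dst->description = dup_string(description);
    dst->priority = priority;
    dst->maxJobs = maxJobs;

    if ((queue != NULL && dst->queue == NULL) ||
        (description != NULL && dst->description == NULL)) {
        free(dst->queue);
        free(dst->description);
        dst->queue = dst->description = NULL;
        return -1;
    }
    return 0;
}

/* Release the copies made by snapshot_queue(). */
void free_snapshot(struct queueSnapshot *snap)
{
    free(snap->queue);
    free(snap->description);
    snap->queue = snap->description = NULL;
}
```

Only the copies are freed by the application; the memory returned by the library call itself is left alone, as described above.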

Example

The program below takes a queue name as the first argument and displays information about the named queue.

/******************************************************
* LSBLIB -- Examples
*
* simbqueues
* Display information about a specific queue in the 
* cluster.
* (Queue name is given on the command line argument)
* It is similar to the command "bqueues QUEUE_NAME".
******************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>
int main (int argc, char *argv[])
{
    struct queueInfoEnt *qInfo;
    char *queues;
        /* take the command line argument as the queue name */
    int numQueues = 1;
        /* only 1 queue name in the array queue */
    char *host = NULL;/* all queues are of interest */
    char *user = NULL;/* all queues are of interest */
    int options = 0;

    /* check if input is in the right format: "./simbqueues
    QUEUENAME" */
    if (argc != 2) {
        printf("Usage: %s queue_name\n", argv[0]);
        exit(-1);
    }
    queues = argv[1];

/* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("simbqueues: lsb_init() failed");
        exit(-1);
    }

    /* get queue information about the specified queue */
    qInfo = lsb_queueinfo(&queues, &numQueues, host, user,
    options);
    if (qInfo == NULL) {
        lsb_perror("simbqueues: lsb_queueinfo() failed");
        exit(-1);
    }

    /* display the queue information (name, descriptions,
    priority, nice value, max num of jobs, num of PEND, RUN,
    SUSP and TOTAL jobs) */

    printf("Information about %s queue:\n", queues);
    printf("Description: %s\n", qInfo[0].description);
    printf("Priority: %d     Nice: %d     \n", 
           qInfo[0].priority, qInfo[0].nice);
    printf("Maximum number of job slots:");
    if (qInfo->maxJobs < INFINIT_INT)
        printf("%5d\n", qInfo[0].maxJobs);
    else
        printf("%5s\n", "unlimited");
    printf("Job slot statistics: PEND(%d) RUN(%d) SUSP(%d) "
           "TOTAL(%d).\n", qInfo[0].numPEND, qInfo[0].numRUN,
           qInfo[0].numSSUSP + qInfo[0].numUSUSP,
           qInfo[0].numJobs);

    exit(0);
} /* main */

In the above program, INFINIT_INT is defined in lsf.h and is used to indicate that there is no limit set for maxJobs. This applies to all Platform LSF API function calls. Platform LSF will supply INFINIT_INT automatically whenever the value for the variable is either invalid (not available) or infinity. This value should be checked for all variables that are optional. For example, if you display the loadSched/loadStop values, an INFINIT_INT indicates that the threshold is not configured and is ignored.
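
The check described above can be wrapped in a small helper so that every optional integer limit is printed the same way. This is a sketch, not part of LSBLIB: format_limit() is a hypothetical name, and the fallback definition of INFINIT_INT is an assumed placeholder that exists only so the sketch compiles without lsf.h. A real program should rely on the definition in lsf.h.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed placeholder; real programs get INFINIT_INT from lsf.h. */
#ifndef INFINIT_INT
#define INFINIT_INT 0x7fffffff
#endif

/* Write a printable form of an optional integer limit into buf.
 * Values below INFINIT_INT are printed as numbers; INFINIT_INT itself
 * is printed as "unlimited". Returns buf for convenience. */
char *format_limit(int value, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    if (value < INFINIT_INT)
        snprintf(buf, size, "%d", value);
    else
        snprintf(buf, size, "%s", "unlimited");
    return buf;
}
```

With such a helper, the maxJobs test in the example program reduces to a single printf() call on the formatted string.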

Similarly, lsb_perror() prints error messages regarding function call failure. You can check lsberrno if you want to take different actions for different errors.

Example output

The above program will produce output similar to the following:

Information about normal queue:
Description: For normal low priority jobs
Priority: 25            Nice: 20
Maximum number of job slots : 40
Job slot statistics: PEND( 5) RUN(12) SUSP(1) TOTAL(18)


Getting Information about LSF Batch Hosts

LSF Batch execution hosts execute jobs in the LSF Batch system.

lsb_hostinfo()

LSBLIB provides lsb_hostinfo() to get information about the server hosts in LSF Batch. This includes configured static and dynamic information. Examples of host information include: host name, status, job limits and statistics, dispatch windows, and scheduling parameters.

The example program in this section uses lsb_hostinfo():

struct hostInfoEnt *lsb_hostinfo(hosts, numHosts)

lsb_hostinfo() gets information about LSF Batch server hosts. On success, it returns an array of hostInfoEnt structures which hold the host information and sets *numHosts to the size of the array. On failure, lsb_hostinfo() returns NULL and sets lsberrno to indicate the error.

lsb_hostinfo() has the following parameters:

char  **hosts;                Array of names of hosts of interest
int   *numHosts;               Number of names in hosts

To get information on all hosts, set *numHosts to 0. When lsb_hostinfo() returns successfully, *numHosts is set to the actual number of hostInfoEnt structures.

If *numHosts is 1 and hosts is NULL, lsb_hostinfo() returns information on the local host.

hostInfoEnt structure

The hostInfoEnt structure is defined in lsbatch.h as

struct hostInfoEnt {
    char  *host;             Name of the host
    int   hStatus;           Status of host. (see below)
    int   busySched;         Reason host will not schedule jobs
    int   busyStop;          Reason host has suspended jobs
    float cpuFactor;         Host CPU factor, as returned by LIM
    int   nIdx;              Size of the loadSched and loadStop arrays,
                                 as returned from LIM
    float *load;             Load LSF Batch used for scheduling batch jobs
    float *loadSched;        Load thresholds that control scheduling of job
                                 on host
    float *loadStop;         Load thresholds that control suspension of jobs
                                 on host
    char  *windows;          Host dispatch window
    int   userJobLimit;      Maximum number of jobs a user can run on
                                 host
    int   maxJobs;           Maximum number of jobs that host can
                                 process concurrently
    int   numJobs;           Number of jobs running or suspended on
                                 host
    int   numRUN;            Number of jobs running on host
    int   numSSUSP;          Number of jobs suspended by sbatchd on
                                 host
    int   numUSUSP;          Number of jobs suspended by a user on
                                 host
    int   mig;               Migration threshold for jobs on host
    int   attr;              Host attributes
#define H_ATTR_CHKPNTABLE  0x1
#define H_ATTR_CHKPNT_COPY 0x2
    float *realLoad;         Load mbatchd obtained from LIM
    int   numRESERVE;        Num of slots reserved for pending jobs
    int   chkSig;            Variable is obsolete
};

The host information returned by ls_gethostinfo() differs from the host information returned by lsb_hostinfo(). ls_gethostinfo() returns general information about the hosts, whereas lsb_hostinfo() returns LSF Batch specific information about hosts.

For a complete description of the fields in the hostInfoEnt structure, see the lsb_hostinfo(3) man page.

Example

The following example takes a host name as an argument and displays information about the named host. It is a simplified version of the LSF Batch bhosts command.

/******************************************************
* LSBLIB -- Examples
*
* simbhosts
* Display information about the batch server host with 
* the given name in the cluster.
******************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsbatch.h>

int main (int argc, char *argv[])
{
    struct hostInfoEnt *hInfo;
        /* array holding all host info entries */
    char *hostname = argv[1]; /* given host name */
    int numHosts = 1; /* number of hosts of interest */

    /* check if input is in the right format: "./simbhosts
    HOSTNAME" */
    if (argc!=2) {
        printf("Usage: %s hostname\n", argv[0]);
        exit(-1);
    }

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("simbhosts: lsb_init() failed");
        exit(-1);
    }

    hInfo = lsb_hostinfo(&hostname, &numHosts);   
        /* get host info */
    if (hInfo == NULL) {
        lsb_perror("simbhosts: lsb_hostinfo() failed");
        exit (-1);
    }

    /* display the host information (name,status, job limit,
    num of RUN/SSUP/USUSP jobs)*/
    printf("HOST_NAME          STATUS    JL/U  NJOBS  RUN  "
           "SSUSP USUSP\n");
    printf ("%-18.18s", hInfo->host);

    if (hInfo->hStatus & HOST_STAT_UNLICENSED)
        printf(" %-9s", "unlicensed");
    else if (hInfo->hStatus & HOST_STAT_UNAVAIL)
        printf(" %-9s",  "unavail");
    else if (hInfo->hStatus & HOST_STAT_UNREACH)
        printf(" %-9s", "unreach");
    else if (hInfo->hStatus & ( HOST_STAT_BUSY | HOST_STAT_WIND | 
                              HOST_STAT_DISABLED |
                              HOST_STAT_LOCKED |
                              HOST_STAT_FULL |
                              HOST_STAT_NO_LIM))
        printf(" %-9s", "closed");
    else
        printf(" %-9s", "ok");

    if (hInfo->userJobLimit < INFINIT_INT)
        printf("%4d", hInfo->userJobLimit);
    else
        printf("%4s", "-");

    printf("%7d  %4d  %4d  %4d\n", hInfo->numJobs,
           hInfo->numRUN, hInfo->numSSUSP, hInfo->numUSUSP);

    exit(0);
} /* main */

The example output from the above program follows:

Example output

a.out hostB
HOST_NAME    STATUS    JL/U  NJOBS  RUN  SSUSP USUSP
hostB           ok        -     2     1     1     0

hStatus is the status of the host. It is the bitwise inclusive OR of some of the following constants defined in lsbatch.h:

Host status

Host Status Name
    Host Status Description

HOST_STAT_BUSY
    The host load is greater than a scheduling threshold. In this status, no new batch job is scheduled to run on this host.
HOST_STAT_WIND
    The host dispatch window is closed. In this status, no new batch job is accepted.
HOST_STAT_DISABLED
    The host has been disabled by the Platform LSF administrator and will not accept jobs. In this status, no new batch job will be scheduled to run on this host.
HOST_STAT_LOCKED
    The host is locked by an exclusive job. In this status, no new batch job is scheduled to run on this host.
HOST_STAT_FULL
    The host has reached its job limit. In this status, no new batch job is scheduled to run on this host.
HOST_STAT_UNREACH
    The sbatchd on this host is unreachable.
HOST_STAT_UNAVAIL
    The LIM and sbatchd on this host are unreachable.
HOST_STAT_UNLICENSED
    The host does not have an LSF license.
HOST_STAT_NO_LIM
    The host is running an sbatchd but not a LIM.

If none of the above holds, hStatus is set to HOST_STAT_OK to indicate that the host is ready to accept and run jobs.
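
Because hStatus is a bitwise OR, more than one condition can hold at once, while the simbhosts example reports only the first match. The sketch below lists every bit that is set. It rests on several assumptions: the EX_STAT_* values are placeholders chosen so that the code compiles without lsbatch.h (a real program should test the HOST_STAT_* constants instead), the short labels are invented for illustration, ex_list_host_status() is a hypothetical helper, and a value of 0 is treated as the OK status.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder bit values for illustration only; use the HOST_STAT_*
 * constants from lsbatch.h in a real program. */
enum {
    EX_STAT_BUSY       = 0x01,
    EX_STAT_WIND       = 0x02,
    EX_STAT_DISABLED   = 0x04,
    EX_STAT_LOCKED     = 0x08,
    EX_STAT_FULL       = 0x10,
    EX_STAT_UNREACH    = 0x20,
    EX_STAT_UNAVAIL    = 0x40,
    EX_STAT_UNLICENSED = 0x80,
    EX_STAT_NO_LIM     = 0x100
};

struct exStatusName {
    int bit;
    const char *name;   /* invented short label */
};

static const struct exStatusName ex_names[] = {
    { EX_STAT_BUSY, "busy" },         { EX_STAT_WIND, "wind" },
    { EX_STAT_DISABLED, "disabled" }, { EX_STAT_LOCKED, "locked" },
    { EX_STAT_FULL, "full" },         { EX_STAT_UNREACH, "unreach" },
    { EX_STAT_UNAVAIL, "unavail" },   { EX_STAT_UNLICENSED, "unlicensed" },
    { EX_STAT_NO_LIM, "no_lim" }
};

/* Write a space-separated list of the set status bits into buf,
 * or "ok" when no bit is set. Stops early rather than overflow buf. */
char *ex_list_host_status(int hStatus, char *buf, size_t size)
{
    size_t i;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    if (hStatus == 0) {
        strncpy(buf, "ok", size - 1);
        buf[size - 1] = '\0';
        return buf;
    }

    for (i = 0; i < sizeof ex_names / sizeof ex_names[0]; i++) {
        if ((hStatus & ex_names[i].bit) == 0)
            continue;
        /* room needed: current text, separator, label, terminator */
        if (strlen(buf) + strlen(ex_names[i].name) + 2 > size)
            break;
        if (buf[0] != '\0')
            strcat(buf, " ");
        strcat(buf, ex_names[i].name);
    }
    return buf;
}
```

A host that is both over a load threshold and at its job limit would be reported with both labels, which the first-match chain in simbhosts would collapse into a single "closed".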

The constant INFINIT_INT defined in lsf.h is used to indicate that there is no limit set for userJobLimit.


Job Submission and Modification

Job submission and modification are the most common operations in LSF Batch. A user can submit a job to the system and then modify it if the job has not been started.

lsb_submit()

LSBLIB provides lsb_submit() for job submission and lsb_modify() for job modification.

LS_LONG_INT lsb_submit(jobSubReq, jobSubReply)
LS_LONG_INT lsb_modify(jobSubReq, jobSubReply, jobId)

On success, these calls return the job ID. On failure, they return -1 and set lsberrno to indicate the error. lsb_submit() is similar to lsb_modify(), except that lsb_modify() modifies the parameters of an already submitted job.

Both of these functions use the same parameters:

struct submit      *jobSubReq;      Job specifications
struct submitReply *jobSubReply;    Results of job submission
LS_LONG_INT   jobId;                ID of the job to modify (lsb_modify()
                                     only)

submit structure

The submit structure is defined in lsbatch.h as:

struct submit {
    int    options;           Indicates which optional fields are present
    int    options2;          Indicates which additional fields are present
    char   *jobName;          Job name (optional)
    char   *queue;            Submit the job to this queue (optional)
    int    numAskedHosts;     Size of askedHosts (optional)
    char   **askedHosts;      Array of names of candidate hosts (optional)
    char   *resReq;           Resource requirements of the job (optional)
    int    rLimits[LSF_RLIM_NLIMITS];
                              Limits on system resource use by all of the
                                  job's processes
    char   *hostSpec;         Host model used for scaling rlimits (optional)
    int    numProcessors;     Initial number of processors needed by the job
    char   *dependCond;       Job dependency condition (optional)
    char   *timeEvent;        Time event string for scheduled repetitive jobs
                                  (optional)
    time_t beginTime;         Dispatch the job on or after beginTime
    time_t termTime;          Job termination deadline
    int    sigValue;          This variable is obsolete
    char   *inFile;           Path name of the job's standard input file
                                  (optional)
    char   *outFile;          Path name of the job's standard output file
                                 (optional)
    char   *errFile;         Path name of the job's standard error output file
                                 (optional)
    char   *command;         Command line of the job
    char   *newCommand;      New command for bmod (optional)
    time_t chkpntPeriod;     Job is checkpointable with this period (optional)
    char   *chkpntDir;       Directory for this job's chk directory (optional)
    int    nxf;              Size of xf (optional)
    struct xFile *xf;        Array of file transfer specifications (optional)
    char   *preExecCmd;      Job's pre-execution command (optional)
    char   *mailUser;        User e-mail address to which the job's output
                                 is mailed (optional)
    int    delOptions;       Bits to be removed from options 
                                 (lsb_modify() only)
    char   *projectName;     Name of the job's project (optional)
    int    maxNumProcessors;  Requested maximum num of job slots for the
                                  job
    char   *loginShell;      Login shell to be used to re-initialize
                                 environment
    char   *exceptList;      Lists the exception handlers
    int    userPriority;     Job priority (optional)
};

For a complete description of the fields in the submit structure, see the lsb_submit(3) man page.

submitReply structure

The submitReply structure is defined in lsbatch.h as

struct submitReply {
    char   *queue;            Queue name the job was submitted to
    LS_LONG_INT badJobId;     dependCond contains badJobId but there is
                                  no such job
    char   *badJobName;       dependCond contains badJobName but 
                                  there is no such job
    int    badReqIndx;        Index of a host or resource limit that caused
                                  an error
};

The last three variables in the submitReply structure are used only when lsb_submit() or lsb_modify() fails.

For a complete description of the fields in the submitReply structure, see the lsb_submit(3) man page.

To submit a new job, fill out this data structure and then call lsb_submit(). The delOptions variable is ignored by LSF Batch for lsb_submit().

Example

The example job submission program below takes the job command line as an argument and submits the job to LSF Batch. For simplicity, it is assumed that the job command does not have arguments.

/******************************************************
* LSBLIB -- Examples
*
* simple bsub
* This program submits a batch job to LSF 
* It is the equivalent of using the "bsub" command without 
* any options.
******************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lsf/lsbatch.h>
#include "combine_arg.h"

int main(int argc, char **argv) 
{
    struct submit req;           /* job specifications */
    struct submitReply reply;    /* results of job submission */
    LS_LONG_INT jobId;           /* job ID of submitted job */
    int  i;

    memset(&req, 0, sizeof(req)); /* initializes req */

    /* initialize LSBLIB  and  get  the  configuration
    environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("simbsub: lsb_init() failed");
        exit(-1);
    }

    /* check if input is in the right format: "./simbsub
    COMMAND ARGUMENTS" */
    if (argc < 2) {
        fprintf(stderr, "Usage: simbsub command\n");
        exit(-1);
    }

    /* options and options2 are bitwise inclusive OR of some of
    the SUB_* flags */

    req.options = 0;
    req.options2 = 0;

    for (i = 0; i < LSF_RLIM_NLIMITS; i++)    /* resource
                                              limits are
                                              initialized to
                                              default */
        req.rLimits[i] = DEFAULT_RLIMIT;

    req.beginTime = 0;
        /* specific date and time to dispatch the job */
    req.termTime  = 0;
        /* specifies job termination deadline */

    req.numProcessors = 1;
        /* initial number of processors needed by a (parallel) job */
    req.maxNumProcessors = 1;
        /* max num of processors required to run the (parallel) job */

    req.command = combine_arg(argc,argv);
        /* command line of job */

    printf("----------------------------------------------\n");
    jobId = lsb_submit(&req, &reply);
        /* submit the job with specifications */

    if (jobId < 0)
        /* if job submission fails, lsb_submit returns -1 */
    switch (lsberrno) {
        /* and sets lsberrno to indicate the error */
        case LSBE_QUEUE_USE:
        case LSBE_QUEUE_CLOSED:
            lsb_perror(reply.queue);
            exit(-1);
        default:
            lsb_perror(NULL);
            exit(-1);
    }
    exit(0);
} /* main */
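
The program above includes combine_arg.h and calls combine_arg(), neither of which is shown in this chapter. The following is one possible stand-in, offered as an assumption about what such a helper does rather than as the original implementation: it joins argv[1] through argv[argc-1] with single spaces into one newly allocated command string, which the caller frees.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the combine_arg() helper used above.
 * Joins argv[1..argc-1] with single spaces. Returns a malloc'ed
 * string that the caller must free, or NULL if allocation fails. */
char *combine_arg(int argc, char **argv)
{
    size_t len = 1;   /* room for the terminating NUL */
    char *cmd;
    int i;

    assert(argv != NULL);

    /* each argument needs its own length plus one separator */
    for (i = 1; i < argc; i++)
        len += strlen(argv[i]) + 1;

    cmd = malloc(len);
    if (cmd == NULL)
        return NULL;

    cmd[0] = '\0';
    for (i = 1; i < argc; i++) {
        if (i > 1)
            strcat(cmd, " ");
        strcat(cmd, argv[i]);
    }
    return cmd;
}
```

This sketch does no quoting, so an argument that itself contains spaces would be split again when the command line is interpreted on the execution host.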

Example output

The above program will produce output similar to the following:

Job <5602> is submitted to default queue <default>.

Sample program explanations

Options and options2

    req.options = 0;
    req.options2 = 0;

The options and options2 fields of the submit structure are the bitwise inclusive OR of some of the SUB_* flags defined in lsbatch.h. These flags serve two purposes.

Some flags indicate which of the optional fields of the submit structure are present. Those that are not present have default values.

Other flags indicate submission options. For a description of these flags, see lsb_submit(3).

Since options indicate which of the optional fields are meaningful, the programmer does not need to initialize the fields that are not chosen by options. All parameters that are not optional must be initialized properly.
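
The pairing of an optional field with its flag can be sketched as below. Everything here is a stand-in so that the code compiles without lsbatch.h: struct exSubmit is a hypothetical, trimmed-down version of the submit structure, the EX_SUB_* values are placeholders for the corresponding SUB_* flags, and the helper functions are not part of LSBLIB.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag values; real programs use SUB_* from lsbatch.h. */
enum {
    EX_SUB_JOB_NAME = 0x01,
    EX_SUB_QUEUE    = 0x02,
    EX_SUB_OUT_FILE = 0x10
};

/* Hypothetical, trimmed-down stand-in for struct submit. */
struct exSubmit {
    int   options;   /* which optional fields below are meaningful */
    char *jobName;
    char *queue;
    char *outFile;
};

/* Set an optional field and mark it as present in one step, so the
 * field and its flag cannot drift apart. */
void ex_set_queue(struct exSubmit *req, char *queue)
{
    assert(req != NULL && queue != NULL);
    req->queue = queue;
    req->options |= EX_SUB_QUEUE;
}

void ex_set_out_file(struct exSubmit *req, char *outFile)
{
    assert(req != NULL && outFile != NULL);
    req->outFile = outFile;
    req->options |= EX_SUB_OUT_FILE;
}

/* Nonzero if the given flag is set in the request. */
int ex_has_option(const struct exSubmit *req, int flag)
{
    return (req->options & flag) != 0;
}
```

Setting the field and the flag together mirrors what bsub does for an option such as -q: the queue name is stored and the matching bit tells LSF Batch to read it.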

numProcessors and maxNumProcessors

    req.numProcessors = 1;       
/* initial number of processors needed by a (parallel) job */
    req.maxNumProcessors = 1;
/* max number of processors required to run the (parallel) job */

numProcessors and maxNumProcessors are initialized to ensure only one processor is requested. They are defined in order to synchronize the job specification in lsb_submit() to the default used by bsub.

If the resReq field of the submit structure is NULL, then LSBLIB will try to obtain resource requirements for the command from the remote task list (see Getting Task Resource Requirements). If the task does not appear in the remote task list, then NULL is passed to LSF Batch. mbatchd uses the default resource requirements with the DFT_FROMTYPE option bit set when making an LSLIB call for host selection from LIM. See Handling Default Resource Requirements for more information about default resource requirements.

rLimits[LSF_RLIM_NLIMITS] and hostSpec

    for (i = 0; i < LSF_RLIM_NLIMITS; i++)
        /* resource limits are initialized to default */
        req.rLimits[i] = DEFAULT_RLIMIT;

The default resource limit (DEFAULT_RLIMIT), defined in lsf.h, means that no resource limit is set.

The constants used to index the rLimits array of the submit structure are defined in lsf.h. The resource limits currently supported by LSF Batch are listed below.

Resource limits supported by LSF Batch

Resource Limit                                    Index in rLimits Array
CPU time limit (in seconds)                       LSF_RLIMIT_CPU
File size limit (in kilobytes)                    LSF_RLIMIT_FSIZE
Data size limit (in kilobytes)                    LSF_RLIMIT_DATA
Stack size limit                                  LSF_RLIMIT_STACK
Core file size limit (in kilobytes)               LSF_RLIMIT_CORE
Resident memory size limit (in kilobytes)         LSF_RLIMIT_RSS
Number of open files limit                        LSF_RLIMIT_NOFILE
Number of open files limit (for HP-UX)            LSF_RLIMIT_OPEN_MAX
Virtual memory limit (same as max swap memory)    LSF_RLIMIT_SWAP
Wall-clock time run limit                         LSF_RLIMIT_RUN
Maximum num of processes a job can fork           LSF_RLIMIT_PROCESS

The hostSpec field of the submit structure specifies the host model to use for scaling rLimits[LSF_RLIMIT_CPU] and rLimits[LSF_RLIMIT_RUN] (see lsb_queueinfo(3)). If hostSpec is NULL, the local host's model is assumed.

beginTime and termTime

    req.beginTime = 0;
        /* specific date and time to dispatch the job */
    req.termTime  = 0;
        /* specifies job termination deadline */

If the beginTime field of the submit structure is 0, start the job as soon as possible.

A USR2 signal is sent if the job is running at termTime. If the job does not terminate within 10 minutes after being sent this signal, it is killed. If the termTime field of the submit structure is 0, the job is allowed to run until it reaches a resource limit.
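
Both fields are absolute time_t values, so a relative request such as "start in one hour" has to be converted first. The sketch below shows that arithmetic together with the 10-minute grace period described above. ex_time_after(), ex_latest_kill_time() and EX_TERM_GRACE_SECONDS are hypothetical names, and the code assumes that time_t counts seconds, as it does on POSIX systems.

```c
#include <assert.h>
#include <time.h>

/* Grace period after the USR2 signal, taken from the text above. */
#define EX_TERM_GRACE_SECONDS (10 * 60)

/* Absolute time that lies `minutes` after `now`; suitable as a value
 * for beginTime or termTime. Assumes time_t is in seconds. */
time_t ex_time_after(time_t now, int minutes)
{
    assert(minutes >= 0);
    return now + (time_t)minutes * 60;
}

/* Latest time at which a job with the given termTime can still be
 * running, or 0 when termTime is 0 and no deadline applies. */
time_t ex_latest_kill_time(time_t termTime)
{
    if (termTime == 0)
        return 0;
    return termTime + EX_TERM_GRACE_SECONDS;
}
```

For example, passing the result of time(NULL) and 60 to ex_time_after() yields a beginTime one hour from now.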

lsberrno

The example below checks the value of lsberrno when lsb_submit() fails:

    if (jobId < 0)
        /* if job submission fails, lsb_submit returns -1 */
    switch (lsberrno) {
        /* and sets lsberrno to indicate the error */
        case LSBE_QUEUE_USE:
        case LSBE_QUEUE_CLOSED:
            lsb_perror(reply.queue);
            exit(-1);
        default:
            lsb_perror(NULL);
            exit(-1);
    }

Different actions are taken depending on the type of the error. All possible error numbers are defined in lsbatch.h. For example, error number LSBE_QUEUE_USE indicates that the user is not authorized to use the queue. The error number LSBE_QUEUE_CLOSED indicates that the queue is closed.

Since a queue name was not specified for the job, the job is submitted to the default queue. The queue field of the submitReply structure contains the name of the queue to which the job was submitted.

The output from the job is mailed to the user because the program did not specify a file name for the outFile parameter in the submit structure.

The program assumes that uniform user names and user ID spaces exist among all the hosts in the cluster. That is, a job submitted by a given user will run under the same user's account on the execution host. For situations where non-uniform user names and user ID spaces exist, account mapping must be used to determine the account used to run a job.

If you are familiar with the bsub command, it may help to know how the fields in the submit structure relate to the bsub command options. This is provided in the following table.

submit fields and bsub options

bsub Option                       submit Field                            options
-J job_name_spec                  jobName                                 SUB_JOB_NAME
-q queue_name                     queue                                   SUB_QUEUE
-m host_name[+[pref_level]]       askedHosts                              SUB_HOST
-n min_proc[,max_proc]            numProcessors, maxNumProcessors
-R res_req                        resReq                                  SUB_RES_REQ
-c cpu_limit[/host_spec]          rLimits[LSF_RLIMIT_CPU] / hostSpec **   SUB_HOST_SPEC (if host_spec is specified)
-W run_limit[/host_spec]          rLimits[LSF_RLIMIT_RUN] / hostSpec **   SUB_HOST_SPEC (if host_spec is specified)
-F file_limit                     rLimits[LSF_RLIMIT_FSIZE] **
-M mem_limit                      rLimits[LSF_RLIMIT_RSS] **
-D data_limit                     rLimits[LSF_RLIMIT_DATA] **
-S stack_limit                    rLimits[LSF_RLIMIT_STACK] **
-C core_limit                     rLimits[LSF_RLIMIT_CORE] **
-k "chkpnt_dir [chkpnt_period]"   chkpntDir, chkpntPeriod                 SUB_CHKPNT_DIR, SUB_CHKPNT_PERIOD
                                                                              (if chkpntPeriod is specified)
-w depend_cond                    dependCond                              SUB_DEPEND_COND
-b begin_time                     beginTime
-t term_time                      termTime
-i in_file                        inFile                                  SUB_IN_FILE
-o out_file                       outFile                                 SUB_OUT_FILE
-e err_file                       errFile                                 SUB_ERR_FILE
-u mail_user                      mailUser                                SUB_MAIL_USER
-f "lfile op [rfile]"             xf
-E "pre_exec_cmd [arg]"           preExecCmd                              SUB_PRE_EXEC
-L login_shell                    loginShell                              SUB_LOGIN_SHELL
-P project_name                   projectName                             SUB_PROJECT_NAME
-G user_group                     userGroup                               SUB_USER_GROUP
-H                                                                        SUB2_HOLD*
-x                                                                        SUB_EXCLUSIVE
-r                                                                        SUB_RERUNNABLE
-N                                                                        SUB_NOTIFY_END
-B                                                                        SUB_NOTIFY_BEGIN
-I                                                                        SUB_INTERACTIVE
-Ip                                                                       SUB_PTY
-Is                                                                       SUB_PTY_SHELL
-K                                                                        SUB2_BSUB_BLOCK*
-X "except_cond::action"          exceptList                              SUB_EXCEPT
-T time_event                     timeEvent                               SUB_TIME_EVENT

* indicates a bitwise OR mask for options2.

** indicates -1 means undefined

Even if not all of the options are used, all optional string fields must be initialized to the empty string. For a complete description of the fields in the submit structure, see the lsb_submit(3) man page.
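The initialization rule and the footnotes to the table can be illustrated with a minimal sketch. It does not use lsbatch.h: struct submit_sketch is a hypothetical, cut-down stand-in for the real submit structure, and SKETCH_NLIMITS and SKETCH_SUB_QUEUE are placeholder values chosen for this illustration, not the values defined by LSBLIB. Only the pattern is the point: empty strings for unused string fields, -1 for undefined limits, and an option bit set together with its field.

```c
#include <string.h>

/* Hypothetical stand-ins so the sketch compiles without lsbatch.h.
 * The real structure, limit count, and bit values come from lsbatch.h. */
#define SKETCH_NLIMITS    11     /* assumed number of rlimits slots */
#define SKETCH_SUB_QUEUE  0x02   /* placeholder for the SUB_QUEUE bit */

struct submit_sketch {
    int   options;                  /* bitwise OR of option bits */
    char *jobName;
    char *queue;
    char *resReq;
    char *inFile;
    char *outFile;
    char *errFile;
    int   rlimits[SKETCH_NLIMITS];  /* -1 means undefined */
};

/* Put the request into the "no options used" state. */
static void submit_sketch_init(struct submit_sketch *req)
{
    int i;

    memset(req, 0, sizeof(*req));
    /* unused optional string fields are empty strings, not NULL */
    req->jobName = "";
    req->queue   = "";
    req->resReq  = "";
    req->inFile  = "";
    req->outFile = "";
    req->errFile = "";
    for (i = 0; i < SKETCH_NLIMITS; i++)
        req->rlimits[i] = -1;
}

/* Equivalent of "bsub -q queue_name": set the field and its option bit. */
static void submit_sketch_set_queue(struct submit_sketch *req, char *queue)
{
    req->queue = queue;
    req->options |= SKETCH_SUB_QUEUE;
}
```

In a real program the same two steps would be applied to struct submit before calling lsb_submit(), using the field and option names from the table.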

To modify an already submitted job, fill out a new submit structure to override existing parameters, and set delOptions to clear option bits that were previously specified for the job. Modifying a submitted job is like resubmitting the job, so a similar program can be used to modify an existing job with minor changes. One additional parameter that must be specified for job modification is the job ID.

All applications that call lsb_submit() and lsb_modify() are subject to authentication constraints described in Authentication.



Getting Information about Batch Jobs

LSBLIB provides functions to get status information about batch jobs. Since there could be many thousands of jobs in the LSF Batch system, getting all of this information in one message could use a lot of memory. Instead, LSBLIB allows the application to open a stream connection and then read the job records one by one. This ensures that the memory needed is always the size of one job record.

LSF Batch Job ID

The LSF Version 4.1 API supports 64-bit batch job IDs. An LSF Batch job ID is stored in a 64-bit integer and consists of two parts, a base ID and an array index.

The base ID is stored in the lower 32 bits. The array index is stored in the top 32 bits. The top 32 bits are only used when the underlying job is an array job.

For the LSF Version 3.x API, the job ID is stored in a 32-bit integer. The base ID is stored in the lower 20 bits and the array index in the top 12 bits.

LSBLIB provides the following C macros (defined in lsbatch.h) for manipulating job IDs:

LSB_JOBID(base_ID, array_index)   Yield an LSF Batch job ID
LSB_ARRAY_IDX(job_ID)             Yield array index part of the job ID
LSB_ARRAY_JOBID(job_ID)           Yield the base ID part of the job ID
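
The two bit layouts described above can be sketched with plain shift and mask operations. The functions below are illustrative stand-ins written for this explanation, not the definitions of the lsbatch.h macros; the names jobid64_* and jobid32_* are hypothetical. In a real program, use LSB_JOBID, LSB_ARRAY_IDX, and LSB_ARRAY_JOBID instead.

```c
#include <stdint.h>

/* 64-bit layout: base ID in the lower 32 bits, array index in the top 32. */
static int64_t jobid64_make(uint32_t base_id, uint32_t array_index)
{
    return (int64_t)(((uint64_t)array_index << 32) | (uint64_t)base_id);
}

static uint32_t jobid64_base(int64_t job_id)
{
    return (uint32_t)((uint64_t)job_id & 0xFFFFFFFFu);
}

static uint32_t jobid64_index(int64_t job_id)
{
    return (uint32_t)((uint64_t)job_id >> 32);
}

/* 3.x layout: base ID in the lower 20 bits, array index in the top 12.
 * An unsigned type is used here only to keep the shifts well defined. */
static uint32_t jobid32_make(uint32_t base_id, uint32_t array_index)
{
    return ((array_index & 0xFFFu) << 20) | (base_id & 0xFFFFFu);
}

static uint32_t jobid32_base(uint32_t job_id)
{
    return job_id & 0xFFFFFu;
}

static uint32_t jobid32_index(uint32_t job_id)
{
    return job_id >> 20;
}
```

For a non-array job the array index is 0, so the 64-bit job ID is numerically equal to the base ID.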

The function calls used to get job information are lsb_openjobinfo(), lsb_readjobinfo(), and lsb_closejobinfo(). These functions are used to open a job information connection with mbatchd, read job records, and then close the job information connection.

lsb_openjobinfo()

lsb_openjobinfo() takes the following arguments:

LS_LONG_INT  jobId;            Select job with the given job Id
char  *jobName;                Select job(s) with the given job name
char  *user;                   Select job(s) submitted by the named user
                                   or user group
char  *queue;                  Select job(s) submitted to the named queue
char  *host;                   Select job(s) that are dispatched to the
                                   named host
int   options;                 Selection flags constructed from the bits
                                   defined in lsbatch.h

options parameter

The options parameter contains additional job selection flags defined in lsbatch.h. These are:

option parameter flags

Flag Name          Flag Description
ALL_JOB            Select jobs matching any status, including unfinished jobs and
                   recently finished jobs. LSF Batch remembers finished jobs within
                   the CLEAN_PERIOD, as defined in the lsb.params file.
CUR_JOB            Return jobs that have not finished yet.
DONE_JOB           Return jobs that have finished recently.
PEND_JOB           Return jobs that are in the pending status.
SUSP_JOB           Return jobs that are in the suspended status.
LAST_JOB           Return jobs that are submitted most recently.
JGRP_ARRAY_INFO    Return job array information.

If options is 0, then the default is CUR_JOB.

lsb_openjobinfo() returns the total number of matching job records in the connection. On failure, it returns -1 and sets lsberrno to indicate the error.

lsb_readjobinfo()

lsb_readjobinfo() takes one argument:

int   *more;                 If not NULL, contains the remaining number of
                             jobs unread

Either this parameter or the return value from the lsb_openjobinfo() can be used to keep track of the number of job records that can be returned from the connection. This parameter is updated each time lsb_readjobinfo() is called.

jobInfoEnt structure

The jobInfoEnt structure returned by lsb_readjobinfo() is defined in lsbatch.h as:

struct jobInfoEnt {
    LS_LONG_INT  jobId;             job ID
    char         *user;             submission user
    int          status;            job status
    /* possible values for the status field */
#define JOB_STAT_PEND      0x01    job is pending
#define JOB_STAT_PSUSP     0x02    job is held
#define JOB_STAT_RUN       0x04    job is running
#define JOB_STAT_SSUSP     0x08    job is suspended by LSF Batch system
#define JOB_STAT_USUSP     0x10    job is suspended by user
#define JOB_STAT_EXIT      0x20    job exited
#define JOB_STAT_DONE      0x40    job is completed successfully
#define JOB_STAT_PDONE     0x80    post job process done successfully
#define JOB_STAT_PERROR    0x100   post job process error
#define JOB_STAT_WAIT      0x200   chunk job waiting its execution turn
#define JOB_STAT_UNKWN     0x1000  unknown status
    int    *reasonTb;         pending or suspending reasons
    int    numReasons;        length of reasonTb vector
    int    reasons;           reserved for future use
    int    subreasons;        reserved for future use
    int    jobPid;            process Id of the job
    time_t submitTime;        time when the job is submitted 
    time_t reserveTime;       time when job slots are reserved 
    time_t startTime;         time when job is actually started
    time_t predictedStartTime;  job's predicted start time
    time_t endTime;           time when the job finishes
    time_t lastEvent;         last time event
    time_t nextEvent;         next time event
    int    duration;          duration time (minutes)
    float  cpuTime;           CPU time consumed by the job
    int    umask;             file mode creation mask for the job
    char   *cwd;              current working directory where job is
                                  submitted
    char   *subHomeDir;       submitting user's home directory
    char   *fromHost;         host from which the job is submitted
    char   **exHosts;         host(s) on which the job executes
    int    numExHosts;        number of execution hosts
    float  cpuFactor;         CPU factor of the first execution host
    int    nIdx;              number of load indices in the loadSched and 
                              loadStop vector
    float  *loadSched;        stop scheduling new jobs if this threshold is
                              exceeded
    float  *loadStop;         stop jobs if this threshold is exceeded
    struct submit submit;     job submission parameters
    int    exitStatus;        exit status
    int    execUid;           user ID under which the job is running
    char   *execHome;         home directory of the user denoted by
                                  execUid
    char   *execCwd;          current working directory where job is
                                  running
    char   *execUsername;     user name corresponds to execUid
    time_t jRusageUpdateTime; last time job's resource usage is updated
    struct jRusage runRusage; last updated job's resource usage
    int   jType;              job type
    /* Possible values for the jType field */
#define    JGRP_NODE_JOB        1  this structure stores a normal batch job
#define    JGRP_NODE_GROUP      2  this structure stores a job group
#define    JGRP_NODE_ARRAY      3  this structure stores a job array
    char  *parentGroup;       for job group use
    char  *jName;              if jType is JGRP_NODE_GROUP, then it is
                              job group name. Otherwise, it is the job's
                              name
    int   counter[NUM_JGRP_COUNTERS];
    /* index into the counter array, only used for job array */
#define     JGRP_COUNT_NJOBS   0   total jobs in the array
#define     JGRP_COUNT_PEND    1   number of pending jobs in the array
#define     JGRP_COUNT_NPSUSP  2   number of held jobs in the array
#define     JGRP_COUNT_NRUN    3   number of running jobs in the array
#define     JGRP_COUNT_NSSUSP  4   number of jobs suspended by the
                                   system in the array
#define     JGRP_COUNT_NUSUSP  5   number of jobs suspended by the
                                   user in the array
#define     JGRP_COUNT_NEXIT   6   number of exited jobs in the array
#define     JGRP_COUNT_NDONE   7   number of successfully completed jobs
    u_short port;              service port of the job
    int    jobPriority;        job dynamic priority
    int    numExternalMsg;     number of external message(s) in the job
    struct jobExternalMsgReply **externalMsg;
};

jobInfoEnt can store a job array as well as a non-array batch job, depending on the value of jType field, which can be either JGRP_NODE_JOB or JGRP_NODE_ARRAY.
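
As a small illustration of how the status field can be interpreted, the sketch below maps a status value to a short label. The JOB_STAT_* values are copied from the listing above so that the block compiles without lsbatch.h; the function name job_status_label and the label strings are choices made for this sketch. It tests individual bits rather than comparing whole values, on the assumption that more than one bit may be set at a time (for example, a finished job that also carries a post-processing bit).

```c
/* Values copied from the jobInfoEnt listing; a real program gets them
 * from lsbatch.h instead of defining them itself. */
#define JOB_STAT_PEND   0x01
#define JOB_STAT_PSUSP  0x02
#define JOB_STAT_RUN    0x04
#define JOB_STAT_SSUSP  0x08
#define JOB_STAT_USUSP  0x10
#define JOB_STAT_EXIT   0x20
#define JOB_STAT_DONE   0x40
#define JOB_STAT_WAIT   0x200
#define JOB_STAT_UNKWN  0x1000

/* Return a short label for a job status value (labels are illustrative). */
static const char *job_status_label(int status)
{
    if (status & JOB_STAT_RUN)   return "RUN";
    if (status & JOB_STAT_PEND)  return "PEND";
    if (status & JOB_STAT_PSUSP) return "PSUSP";
    if (status & JOB_STAT_SSUSP) return "SSUSP";
    if (status & JOB_STAT_USUSP) return "USUSP";
    if (status & JOB_STAT_EXIT)  return "EXIT";
    if (status & JOB_STAT_DONE)  return "DONE";
    if (status & JOB_STAT_WAIT)  return "WAIT";
    if (status & JOB_STAT_UNKWN) return "UNKWN";
    return "NONE";               /* no known status bit is set */
}
```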

lsb_closejobinfo()

Call lsb_closejobinfo() after receiving all job records in the connection.

Example

Below is an example of a simplified bjobs command. This program displays all pending jobs belonging to all users.

/******************************************************
* LSBLIB -- Examples
*
* simple bjobs
* Submit command as an lsbatch job with no options set
* and retrieve the job info
* It is similar to the "bjobs" command with no options.
******************************************************/


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>
#include "submit_cmd.h"

int main(int argc, char **argv)
{
    /* variables for simulating submission */
    struct submit req;          /* job specifications */
    struct submitReply reply;   /* results of job submission */
    int  jobId;                 /* job ID of submitted job */

    /* variables for simulating bjobs command */
    int  options = PEND_JOB;    /* the status of the jobs
                                whose info is returned */
    char *user = "all";         /* match jobs for all users */
    struct jobInfoEnt *job;     /* detailed job info */
    int more;                   /* number of remaining jobs
                                unread */

    memset(&req, 0, sizeof(req));   /* initialize req */

    /* initialize LSBLIB and get the configuration
    environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("simbjobs: lsb_init() failed");
        exit(-1);
    }
    /* check if input is in the right format: 
     * "./simbjobs COMMAND ARGUMENTS" */
    if (argc < 2) {
        fprintf(stderr, "Usage: simbjobs command\n");
        exit(-1);
    }
    jobId = submit_cmd(&req, &reply, argc, argv); 
        /* submit a job */

    if (jobId < 0) {
        /* if job submission fails, lsb_submit() returns -1
           and sets lsberrno to indicate the error */
        switch (lsberrno) {
        case LSBE_QUEUE_USE:
        case LSBE_QUEUE_CLOSED:
            lsb_perror(reply.queue);
            exit(-1);
        default:
            lsb_perror(NULL);
            exit(-1);
        }
    }

    /* open the job information connection for all pending
       jobs; exit on failure */
    if (lsb_openjobinfo(0, NULL, user, NULL, NULL, options) < 0) {
        lsb_perror("lsb_openjobinfo");
        exit(-1);
    }
    /* display all pending jobs */
    printf("All pending jobs submitted by all users:\n");
    for (;;) {
        job = lsb_readjobinfo(&more);   /* get the job details */
        if (job == NULL) {
            lsb_perror("lsb_readjobinfo");
            exit(-1);
        }

        /* submission time of job */
        printf("%s", ctime(&job->submitTime));

        /* job ID */
        printf("Job <%s> ", lsb_jobid2str(job->jobId));

        /* user that submitted the job */
        printf("of user <%s>, ", job->user);

        /* name of submission host */
        printf("submitted from host <%s>\n", job->fromHost);

        /* stop when there are no remaining jobs to display */
        if (!more)
            break;
    }

    /* when finished to display the job info, close the
    connection to the mbatchd */
    lsb_closejobinfo();

    exit(0);
}

Example output

The above program will produce output similar to the following:

All pending jobs submitted by all users:
Mon Mar 1 10:34:04 EST 1996
Job <123> of user <john>, submitted from host <orange>
Mon Mar 1 11:12:11 EST 1996
Job <126> of user <john>, submitted from host <orange>
Mon Mar 1 14:11:34 EST 1996
Job <163> of user <ken>, submitted from host <apple>
Mon Mar 1 15:00:56 EST 1996
Job <199> of user <tim>, submitted from host <pear>

Use lsb_pendreason() to print out the reasons why the job is still pending. See lsb_pendreason(3) for details.



Job Manipulation

After a job has been submitted, users can manipulate it in different ways. A job can be suspended, resumed, killed, or sent an arbitrary signal.

All applications that manipulate jobs are subject to authentication provisions described in Authentication.

Sending a signal to a job

Users can send signals to submitted jobs. If the job has not been started, you can send KILL, TERM, INT, and STOP signals. These signals cause the job to be cancelled (KILL, TERM, INT) or suspended (STOP). If the job has already started, then any signal can be sent to the job.
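
The rule for jobs that have not started yet can be written down as a small lookup. This is only a restatement of the paragraph above in code, assuming the POSIX signal names from <signal.h>; the enum and the function name effect_on_pending_job are hypothetical and are not part of LSBLIB.

```c
#include <signal.h>

/* What a signal does to a job that has not been started yet. */
enum pending_effect {
    PENDING_CANCEL,       /* KILL, TERM, INT: the job is cancelled */
    PENDING_SUSPEND,      /* STOP: the job is suspended */
    PENDING_NOT_ALLOWED   /* other signals need a started job */
};

static enum pending_effect effect_on_pending_job(int sigValue)
{
    switch (sigValue) {
    case SIGKILL:
    case SIGTERM:
    case SIGINT:
        return PENDING_CANCEL;
    case SIGSTOP:
        return PENDING_SUSPEND;
    default:
        return PENDING_NOT_ALLOWED;
    }
}
```

An application could use a check like this to warn the user before calling lsb_signaljob() on a pending job with a signal that only applies to started jobs.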

lsb_signaljob()

lsb_signaljob() sends a signal to a job:

int lsb_signaljob(jobId, sigValue)
LS_LONG_INT  jobId;          Select job with the given job Id
int sigValue;                Signal sent to the job

The jobId and sigValue parameters are self-explanatory.

Example

The following example takes a job ID as the argument and sends a SIGSTOP signal to the job.

/******************************************************
* LSBLIB -- Examples
*
* simple bstop
* The program takes a job ID as the argument and sends a
* SIGSTOP signal to the job
******************************************************/

#include <stdio.h>
#include <lsf/lsbatch.h>
#include <stdlib.h>
#include <signal.h>

int main(int argc, char **argv)
{
    /* check if input is in the right format: "simbstop JOBID" */
    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* send the SIGSTOP signal and check if lsb_signaljob()
    runs successfully */
    if (lsb_signaljob(atoi(argv[1]), SIGSTOP) <0) {
        lsb_perror("lsb_signaljob");
        exit(-1);
    }

    printf("Job %s is signaled\n", argv[1]);
    exit(0);
}

On success, the function returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.
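
The example above converts its argument with atoi(), which returns an int. Because job IDs are stored in a 64-bit integer, a stricter conversion may be preferable. The helper below is a sketch under two assumptions: that the job ID is given as a plain positive decimal number, and that long long is wide enough to hold it. The name parse_job_id is hypothetical and not part of LSBLIB, and array element syntax is not handled.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a positive decimal job ID.
 * Returns 0 and stores the value in *jobId on success, -1 on bad input. */
static int parse_job_id(const char *text, long long *jobId)
{
    char *end;
    long long value;

    errno = 0;
    value = strtoll(text, &end, 10);
    if (errno != 0          /* out of range */
        || end == text      /* no digits at all */
        || *end != '\0'     /* trailing characters */
        || value <= 0)      /* job IDs are assumed to be positive */
        return -1;

    *jobId = value;
    return 0;
}
```

Unlike atoi(), this rejects input such as "12ab" instead of silently returning 12, so a mistyped job ID is not passed on to lsb_signaljob().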

Switching a job to a different queue

A job can be switched to a different queue after submission. This can be done even after the job has already started.

lsb_switchjob()

Use lsb_switchjob() to switch a job from one queue to another:

int lsb_switchjob(jobId, queue)
LS_LONG_INT jobId;           Select job with the given job Id
char *queue;                 Name of the queue to switch the job to

Example

Below is an example program that switches a specified job to a new queue.

/******************************************************
* LSBLIB -- Examples
* 
* simple bswitch
* The program switches a specified job to a new queue.
******************************************************/
#include <stdio.h>
#include <lsf/lsbatch.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* check if the input is in the right format: "./simbswitch
    JOBID QUEUENAME" */
    if (argc != 3) {
        printf("Usage: %s jobId new_queue\n", argv[0]);
        exit(-1);
    }

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) <0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* switch the job to the new queue and check for success */
    if (lsb_switchjob(atoi(argv[1]), argv[2]) < 0) {
        lsb_perror("lsb_switchjob");
        exit(-1);
    }

    printf("Job %s is switched to new queue <%s>\n", argv[1], 
           argv[2]);

    exit(0);
}

On success, lsb_switchjob() returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.

Forcing a job to run

After a job is submitted to the LSF Batch system, it remains pending until LSF Batch runs it (for details on the factors that govern when and where a job starts to run, see Administering Platform LSF).

lsb_runjob()

A job can be forced to run on a specified list of hosts immediately using the following LSBLIB function:

int lsb_runjob (struct runJobRequest *runReq)

runJobRequest Structure

lsb_runjob() takes the runJobRequest structure, which is defined in lsbatch.h:

struct runJobRequest {
    LS_LONG_INT  jobId;             Job ID of the job to start
    int          numHosts;          Number of hosts to run the job on
    char         **hostname;        Host names where jobs run
#define RUNJOB_OPT_NORMAL     0x01
#define RUNJOB_OPT_NOSTOP     0x02
#define RUNJOB_OPT_PENDONLY   0x04     Pending jobs only, no finished jobs
#define RUNJOB_OPT_FROM_BEGIN 0x08     Checkpoint jobs only, from beginning
#define RUNJOB_OPT_FREE       0x10     brun to use free CPUs only
    int          options;           Run job request options
    int          *slots;            Number of slots per host
};

To force a job to run, the job must have been submitted and must be in either the PEND or FINISHED state. Only the LSF administrator or the owner of the job can start the job. lsb_runjob() restarts a job that is in the FINISHED state.

A job can be run without any scheduling constraints such as job slot limits. If the job is started with the options field being 0 or RUNJOB_OPT_NORMAL, then the job is still subject to the load conditions under which LSF Batch stops a running job.

To override this for a started job, use RUNJOB_OPT_NOSTOP; the job will then not be stopped due to the above-mentioned load conditions. However, all of LSBLIB's job manipulation APIs can still be applied to the job.
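
The effect of the options field described above comes down to a single bit test. The sketch below copies the two RUNJOB_OPT_* values from the structure listing so that it compiles without lsbatch.h; the function name stoppable_by_load is hypothetical.

```c
/* Values copied from the runJobRequest listing; a real program gets
 * them from lsbatch.h. */
#define RUNJOB_OPT_NORMAL  0x01
#define RUNJOB_OPT_NOSTOP  0x02

/* Return 1 if a forced job can still be stopped because of load
 * conditions, 0 if RUNJOB_OPT_NOSTOP protects it. */
static int stoppable_by_load(int options)
{
    return (options & RUNJOB_OPT_NOSTOP) == 0;
}
```

Both options = 0 and options = RUNJOB_OPT_NORMAL leave the job stoppable; only RUNJOB_OPT_NOSTOP changes that.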

Example

The following is an example program that runs a specified job on a host that has no batch job running.

/******************************************************
* LSBLIB -- Examples
*
* simple brun
* The program takes a job ID as the argument and runs that
* job on a vacant host
******************************************************/

#include <stdio.h>
#include <lsf/lsbatch.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    struct hostInfoEnt  *hInfo;  /* host information */
    int numHosts = 0;            /* number of hosts */
    int i;
    struct runJobRequest runJobReq;
        /* specification for the job to be run */

    /* check if the input is in the right format: "./simbrun
    JOBID" */
    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* get host information */
    hInfo = lsb_hostinfo(NULL, &numHosts);
    if (hInfo == NULL) {
        lsb_perror("lsb_hostinfo");
        exit(-1);
    }

    /* find a vacant host */
    for (i = 0; i < numHosts; i++) {
        if (hInfo[i].hStatus & (HOST_STAT_BUSY |
                                HOST_STAT_WIND |
                                HOST_STAT_DISABLED |
                                HOST_STAT_LOCKED |
                                HOST_STAT_FULL |
                                HOST_STAT_NO_LIM |
                                HOST_STAT_UNLICENSED |
                                HOST_STAT_UNAVAIL |
                                HOST_STAT_UNREACH))
            continue;

        /* found a vacant host */
        if (hInfo[i].numJobs == 0)
            break;
    }

    /* return error message when there is no vacant host found */
    if (i == numHosts) {
        fprintf(stderr, "Cannot find vacant host to run job < %s >\n",
                argv[1]);
        exit(-1);
    }

    /* define the specifications for the job to be run (the job
    can be stopped due to load conditions) */
    runJobReq.jobId = atoi(argv[1]);
    runJobReq.options = 0;
    runJobReq.numHosts = 1;
    runJobReq.hostname = (char **)malloc(sizeof(char*));
    runJobReq.hostname[0] = hInfo[i].host;

    /* run the job and check for the success */
    if (lsb_runjob(&runJobReq) < 0) {
        lsb_perror("lsb_runjob");
        exit(-1);
    }
    exit (0);
}

On success, lsb_runjob() returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.



Processing LSF Batch Log Files

LSF Batch saves a lot of valuable information about the system and jobs. Such information is logged by mbatchd in the files lsb.events and lsb.acct under the directory $LSB_SHAREDIR/your_cluster/logdir, where LSB_SHAREDIR is defined in the lsf.conf file and your_cluster is the name of your Platform LSF cluster.

mbatchd logs such information for several purposes.

Caution


The lsb.events file contains critical user job information. Never use your program to modify lsb.events. Writing into this file may cause the loss of user jobs.

lsb_geteventrec()

LSBLIB provides a function to read information from these files into a well-defined data structure:

struct eventRec *lsb_geteventrec(log_fp, lineNum)
FILE  *log_fp;              File handle for either an event log
                            file or job log file
int   *lineNum;             Line number of the next event
                              record

The parameter log_fp is returned by a successful fopen() call. On a successful return, the content of lineNum is modified to indicate the line number of the next event record in the log file. This value can then be used to report the line number when an error occurs while reading the log file. This value should be initialized to 0 before lsb_geteventrec() is called for the first time.

eventRec Structure

lsb_geteventrec() returns the following data structure:

struct eventRec {
    char version[MAX_VERSION_LEN];    Version number of the mbatchd
    int type;                         Type of the event
    time_t eventTime;                    Event time stamp
    union eventLog eventLog;          Event data
};

The event type is used to determine the structure of the data in eventLog. LSBLIB remembers the storage allocated for the previously returned data structure and automatically frees it before returning the next event record.
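
Because eventLog is a union, only the member that corresponds to the event type holds valid data. The sketch below shows the resulting pattern, check the type first and only then read the matching member, on a deliberately tiny stand-in. All sketch_* names, the type codes, and the cut-down structures are hypothetical and exist only so that the block compiles without lsbatch.h.

```c
/* Hypothetical type codes standing in for EVENT_JOB_NEW and so on. */
enum sketch_event_type {
    SKETCH_JOB_NEW   = 1,
    SKETCH_JOB_START = 2,
    SKETCH_OTHER     = 99
};

/* Cut-down stand-ins for two of the per-event structures. */
struct sketch_new_log   { int jobId; char jobName[64]; };
struct sketch_start_log { int jobId; int numExHosts; };

struct sketch_event {
    int type;                               /* selects the valid member */
    union {
        struct sketch_new_log   jobNewLog;
        struct sketch_start_log jobStartLog;
    } eventLog;
};

/* Return the job ID carried by the event, or -1 for event types
 * that this sketch does not know how to read. */
static int sketch_event_job_id(const struct sketch_event *ev)
{
    switch (ev->type) {
    case SKETCH_JOB_NEW:
        return ev->eventLog.jobNewLog.jobId;
    case SKETCH_JOB_START:
        return ev->eventLog.jobStartLog.jobId;
    default:
        return -1;      /* never read a member the type does not select */
    }
}
```

The same discipline applies to struct eventRec: reading a member such as eventLog.jobNewLog for a record whose type is not EVENT_JOB_NEW interprets unrelated data.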

lsb_geteventrec() returns NULL and sets lsberrno to LSBE_EOF when there are no more records in the event file.

Events are logged by mbatchd for different purposes. There are job-related events and system-related events. Applications can choose to process certain events and ignore other events. For example, the bhist command processes job-related events only. The currently available event types are listed below.

Event Types

Event Type                    Description
EVENT_JOB_NEW                 Submit new job
EVENT_JOB_START               mbatchd is trying to start a job
EVENT_JOB_STATUS              Job status change event
EVENT_JOB_SWITCH              Job switched to another queue
EVENT_JOB_MOVE                Move a pending job's position within a queue
EVENT_QUEUE_CTRL              Queue status changed by Platform LSF administrator
                              (bqc operation)
EVENT_HOST_CTRL               Host status changed by Platform LSF administrator
                              (bhc operation)
EVENT_MBD_START               New mbatchd start event
EVENT_MBD_DIE                 Log parameters before mbatchd dies
EVENT_MBD_UNFULFILL           mbatchd has an action to be fulfilled
EVENT_JOB_FINISH              Job has finished (logged in lsb.acct only)
EVENT_LOAD_INDEX              Complete list of load index names
EVENT_MIG                     Job has migrated
EVENT_PRE_EXEC_START          The pre-execution command started
EVENT_JOB_ROUTE               The job has been routed to NQS
EVENT_JOB_MODIFY              The job's parameters have been modified
EVENT_JOB_SIGNAL              Signal/delete a job
EVENT_CAL_NEW                 Add new calendar to the system *
EVENT_CAL_MODIFY              Calendar modified *
EVENT_CAL_DELETE              Calendar deleted *
EVENT_JOB_FORCE               Forcing a job to start on specified hosts
                              (brun operation)
EVENT_JOB_FORWARD             Job forwarded to another cluster
EVENT_JOB_ACCEPT              Job from a remote cluster dispatched
EVENT_STATUS_ACK              Job status successfully sent to submission cluster
EVENT_JOB_EXECUTE             Job started successfully on the execution host
EVENT_JOB_MSG                 Send a message to a job
EVENT_JOB_MSG_ACK             The message has been delivered
EVENT_JOB_REQUEUE             Job is requeued
EVENT_JOB_OCCUPY_REQ          Submission mbatchd logs this after sending an occupy
                              request to execution mbatchd
EVENT_JOB_VACATED             Submission mbatchd logs this event after all execution
                              mbatchds have vacated the occupied hosts for the job
EVENT_JOB_SIGACT              A signal action on a job has been initiated or finished
EVENT_JOB_START_ACCEPT        Job accepted by sbatchd
EVENT_SBD_JOB_STATUS          sbatchd's new job status
EVENT_CAL_UNDELETE            Undeleted a calendar in the system
EVENT_JOB_CLEAN               Job is cleaned out of the core
EVENT_JOB_EXCEPTION           Job exception was detected
EVENT_JGRP_ADD                Adding a new job group
EVENT_JGRP_MOD                Modifying a job group
EVENT_JGRP_CTRL               Controlling a job group
EVENT_LOG_SWITCH              Switching the event file lsb.events
EVENT_JOB_MODIFY2             Job modification request
EVENT_JGRP_STATUS             Log job group status
EVENT_JOB_ATTR_SET            Job attributes have been set
EVENT_JOB_EXT_MSG             Send an external message to a job
EVENT_JOB_ATTA_DATA           Update data status of a message for a job
EVENT_JOB_CHUNK               Insert one job to a chunk
EVENT_SBD_UNREPORTED_STATUS   Save unreported sbatchd status

* Available only if the Platform JobScheduler component is enabled.

Note

The lsb.acct file uses only EVENT_JOB_FINISH. The lsb.events file uses all other event types. For detailed formats of these log files, see lsb.events(5) and lsb.acct(5).

eventLog Union

Each event type corresponds to a different data structure in the union:

union  eventLog {
    struct jobNewLog jobNewLog;                            EVENT_JOB_NEW
    struct jobStartLog jobStartLog;                        EVENT_JOB_START
    struct jobStatusLog jobStatusLog;                      EVENT_JOB_STATUS
    struct jobSwitchLog jobSwitchLog;                      EVENT_JOB_SWITCH
    struct jobMoveLog jobMoveLog;                          EVENT_JOB_MOVE
    struct queueCtrlLog queueCtrlLog;                      EVENT_QUEUE_CTRL
    struct hostCtrlLog hostCtrlLog;                        EVENT_HOST_CTRL
    struct mbdStartLog mbdStartLog;                        EVENT_MBD_START
    struct mbdDieLog mbdDieLog;                            EVENT_MBD_DIE
    struct unfulfillLog unfulfillLog;                      EVENT_MBD_UNFULFILL
    struct jobFinishLog jobFinishLog;                      EVENT_JOB_FINISH
    struct loadIndexLog loadIndexLog;                      EVENT_LOAD_INDEX
    struct migLog migLog;                                  EVENT_MIG
    struct calendarLog calendarLog;                        Shared by all calendar events
    struct jobForceRequestLog jobForceRequestLog;          EVENT_JOB_FORCE
    struct jobForwardLog jobForwardLog;                    EVENT_JOB_FORWARD
    struct jobAcceptLog jobAcceptLog;                      EVENT_JOB_ACCEPT
    struct statusAckLog statusAckLog;                      EVENT_STATUS_ACK
    struct signalLog signalLog;                            EVENT_JOB_SIGNAL
    struct jobExecuteLog jobExecuteLog;                    EVENT_JOB_EXECUTE
    struct jobRequeueLog jobRequeueLog;                    EVENT_JOB_REQUEUE
    struct sigactLog sigactLog;                            EVENT_JOB_SIGACT
    struct jobStartAcceptLog jobStartAcceptLog;            EVENT_JOB_START_ACCEPT
    struct jobMsgLog jobMsgLog;                            EVENT_JOB_MSG
    struct jobMsgAckLog jobMsgAckLog;                      EVENT_JOB_MSG_ACK
    struct chkpntLog chkpntLog;                            EVENT_CHKPNT
    struct jobOccupyReqLog jobOccupyReqLog;                EVENT_JOB_OCCUPY_REQ
    struct jobVacatedLog jobVacatedLog;                    EVENT_JOB_VACATED
    struct jobCleanLog jobCleanLog;                        EVENT_JOB_CLEAN
    struct jobExceptionLog jobExceptionLog;                EVENT_JOB_EXCEPTION
    struct jgrpNewLog jgrpNewLog;                          EVENT_JGRP_ADD
    struct jgrpCtrlLog jgrpCtrlLog;                        EVENT_JGRP_CTRL
    struct logSwitchLog logSwitchLog;                      EVENT_LOG_SWITCH
    struct jobModLog jobModLog;                            EVENT_JOB_MODIFY
    struct jgrpStatusLog jgrpStatusLog;                    EVENT_JGRP_STATUS
    struct jobAttrSetLog jobAttrSetLog;                    EVENT_JOB_ATTR_SET
    struct jobExternalMsgLog jobExternalMsgLog;            EVENT_JOB_EXT_MSG
    struct jobChunkLog jobChunkLog;                        EVENT_JOB_CHUNK
    struct sbdUnreportedStatusLog sbdUnreportedStatusLog;  EVENT_SBD_UNREPORTED_STATUS
};

The detailed data structures in the above union are defined in lsbatch.h and described in lsb_geteventrec(3).

Example

Below is an example program that takes a job name as an argument and displays a chronological history of all jobs matching the job name. This program assumes that the lsb.events file is in /local/lsf/mnt/work/cluster1/logdir.

/******************************************************
* LSBLIB -- Examples
*
* get event record
* The program takes a job name as the argument and returns
* the information of the job with this given name
******************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>

#define MAX_MATCHED 1024    /* most job IDs this example remembers */

static int matched[MAX_MATCHED];   /* IDs of jobs with the given name */
static int numMatched = 0;

/* return 1 if an EVENT_JOB_NEW record with the given job name
   has already been seen for this job ID, 0 otherwise */
static int isMatched(int jobId)
{
    int i;

    for (i = 0; i < numMatched; i++)
        if (matched[i] == jobId)
            return 1;
    return 0;
}

int main(int argc, char **argv)
{
    char *eventFile =
        "/local/lsf/mnt/work/cluster1/logdir/lsb.events";
        /* location of lsb.events */
    FILE *fp;                        /* file handle for lsb.events */
    struct eventRec *record;         /* record returned by
                                        lsb_geteventrec() */
    int  lineNum = 0;                /* line number of next event */
    char *jobName;                   /* specified job name */
    int  i;
    struct jobNewLog *newJob;        /* new job event record */
    struct jobStartLog *startJob;    /* start job event record */
    struct jobStatusLog *statusJob;  /* job status change event record */

    /* check if the input is in the right format:
    "./geteventrec JOBNAME" */
    if (argc != 2) {
        printf("Usage: %s job_name\n", argv[0]);
        exit(-1);
    }
    jobName = argv[1];

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* open the file for read */
    fp = fopen(eventFile, "r");
    if (fp == NULL) {
        perror(eventFile);
        exit(-1);
    }

    /* get events and print out the information of the event
    records that belong to jobs with the given job name */
    for (;;) {
        record = lsb_geteventrec(fp, &lineNum);
        if (record == NULL) {
            if (lsberrno == LSBE_EOF)
                break;
            lsb_perror("lsb_geteventrec");
            exit(-1);
        }

        /* the event type determines which member of eventLog is
        valid, so check the type before reading any member */
        switch (record->type) {
        case EVENT_JOB_NEW:
            newJob = &(record->eventLog.jobNewLog);
            /* only this event type carries the job name */
            if (strcmp(newJob->jobName, jobName) != 0)
                break;
            if (numMatched < MAX_MATCHED)
                matched[numMatched++] = newJob->jobId;
            printf("%sJob <%d> submitted by <%s> from <%s> to <%s> queue\n",
                   ctime(&record->eventTime), newJob->jobId,
                   newJob->userName, newJob->fromHost, newJob->queue);
            break;
        case EVENT_JOB_START:
            startJob = &(record->eventLog.jobStartLog);
            if (!isMatched(startJob->jobId))
                break;
            printf("%sJob <%d> started on ",
                   ctime(&record->eventTime), startJob->jobId);
            for (i = 0; i < startJob->numExHosts; i++)
                printf("<%s> ", startJob->execHosts[i]);
            printf("\n");
            break;
        case EVENT_JOB_STATUS:
            statusJob = &(record->eventLog.jobStatusLog);
            if (!isMatched(statusJob->jobId))
                break;
            printf("%sJob <%d> status changed to: ",
                   ctime(&record->eventTime), statusJob->jobId);
            switch (statusJob->jStatus) {
            case JOB_STAT_PEND:
                printf("pending\n");
                break;
            case JOB_STAT_RUN:
                printf("running\n");
                break;
            case JOB_STAT_SSUSP:
            case JOB_STAT_USUSP:
            case JOB_STAT_PSUSP:
                printf("suspended\n");
                break;
            case JOB_STAT_UNKWN:
                printf("unknown (sbatchd unreachable)\n");
                break;
            case JOB_STAT_EXIT:
                printf("exited\n");
                break;
            case JOB_STAT_DONE:
                printf("done\n");
                break;
            default:
                printf("\nError: unknown job status %d\n",
                       statusJob->jStatus);
                break;
            }
            break;
        default:
            /* only display a few selected event types */
            break;
        }
    }

    fclose(fp);
    exit(0);
}

Note

In the above program, events that are of no interest are skipped. The job status codes are defined in lsbatch.h. The lsb.acct file, which stores job accounting information, can be processed in the same way. Because lsb.acct currently contains only one event type (EVENT_JOB_FINISH), processing it is simpler than in the above example.
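
The status handling in the above program can also be factored into a small lookup helper, which keeps the event loop short. The following is a minimal sketch, not part of LSBLIB: job_status_text() is a hypothetical name, and the numeric status values are stand-ins that are assumed to match lsbatch.h. In a real program, include <lsf/lsbatch.h> and rely on its definitions instead.

```c
#include <stddef.h>

/* Stand-in values for illustration only (assumed, not authoritative).
   When <lsf/lsbatch.h> is included first, its definitions are used. */
#ifndef JOB_STAT_PEND
#define JOB_STAT_PEND   0x01
#define JOB_STAT_PSUSP  0x02
#define JOB_STAT_RUN    0x04
#define JOB_STAT_SSUSP  0x08
#define JOB_STAT_USUSP  0x10
#define JOB_STAT_EXIT   0x20
#define JOB_STAT_DONE   0x40
#define JOB_STAT_UNKWN  0x10000
#endif

/* Hypothetical helper: return a short description of a job status
   code, or NULL if the code is not one of those handled here. */
static const char *job_status_text(int jStatus)
{
    switch (jStatus) {
    case JOB_STAT_PEND:
        return "pending";
    case JOB_STAT_RUN:
        return "running";
    case JOB_STAT_SSUSP:
    case JOB_STAT_USUSP:
    case JOB_STAT_PSUSP:
        return "suspended";
    case JOB_STAT_UNKWN:
        return "unknown (sbatchd unreachable)";
    case JOB_STAT_EXIT:
        return "exited";
    case JOB_STAT_DONE:
        return "done";
    default:
        return NULL;
    }
}
```

A caller would pass statusJob->jStatus to the helper and report an error when it returns NULL, which corresponds to the default branch of the status switch in the program.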

      Date Modified: May 12, 2008
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2008 Platform Computing Corporation. All rights reserved.