NCAR
Last update: 03/01/2005

Save-restart for long-running jobs

We recommend adding save-restart capability to any application that runs on CISL systems for longer than 30 minutes. Save-restart capability helps prevent the loss of output and computing cycles caused by a system crash, an unexpected job timeout, running past queue time limits, internal errors, or any other unexpected run-time interruption.

A good save-restart procedure periodically writes a complete set of program state variables to nonvolatile, offline storage such as NCAR's Mass Storage System (MSS). If the job terminates unexpectedly, it can then be restarted from the last point at which its internal state was saved.

Any code that uses a save-restart mechanism should be thoroughly tested to demonstrate that the application will produce the same results for long runtimes whether started from initial conditions or an intermediate results file.
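One way to run such a test is sketched below in free-form Fortran 90. It is only an illustration under stated assumptions: the program name check_restart, the argument lists of init and march, the grid size, the leapfrog update, and the scratch file check.sav are all hypothetical and are not taken from the example later on this page. The save file is kept on local disk, and no MSS calls are made. The program integrates once from initial conditions, then integrates again with a save and restore in the middle, and compares the two final states.

```fortran
       program check_restart
       implicit none
       integer, parameter :: n = 101, nsteps = 200, nsave = 100
       real u_old(n), u_cur(n), ref(n), c, dt, dx
       integer it

       c  = 1.0
       dx = 0.01
       dt = 0.005
!
! Reference run: all nsteps iterations from initial conditions
!
       call init(u_old, u_cur, n, dx)
       do it = 1, nsteps
         call march(u_old, u_cur, n, c, dt, dx)
       enddo
       ref = u_cur
!
! Restarted run: stop at nsave, save, wipe, restore, then finish
!
       call init(u_old, u_cur, n, dx)
       do it = 1, nsave
         call march(u_old, u_cur, n, c, dt, dx)
       enddo
       open (unit=40, file='check.sav', form='unformatted')
       rewind(40)
       write(40) u_old, u_cur
       close(40)
       u_old = 0.0                 ! Wipe the in-memory state
       u_cur = 0.0
       open (unit=40, file='check.sav', form='unformatted')
       rewind(40)
       read(40) u_old, u_cur
       close(40, status='delete')  ! Remove the scratch file
       do it = nsave+1, nsteps
         call march(u_old, u_cur, n, c, dt, dx)
       enddo

       if (maxval(abs(u_cur - ref)) .eq. 0.0) then
         print *, 'PASS: restarted run matches reference run'
       else
         print *, 'FAIL: max difference ', maxval(abs(u_cur - ref))
       endif
       end

! Hypothetical initial condition: one half sine wave, at rest
       subroutine init(u_old, u_cur, n, dx)
       implicit none
       integer n, i
       real u_old(n), u_cur(n), dx
       do i = 1, n
         u_cur(i) = sin(3.1415927 * real(i-1) * dx)
       enddo
       u_old = u_cur
       end

! Hypothetical leapfrog step for the wave equation, fixed ends
       subroutine march(u_old, u_cur, n, c, dt, dx)
       implicit none
       integer n, i
       real u_old(n), u_cur(n), u_new(n), c, dt, dx, r2
       r2 = (c * dt / dx)**2
       u_new(1) = 0.0
       u_new(n) = 0.0
       do i = 2, n-1
         u_new(i) = u_cur(i+1) - 2.0*u_cur(i) + u_cur(i-1)
         u_new(i) = 2.0*u_cur(i) - u_old(i) + r2*u_new(i)
       enddo
       u_old = u_cur
       u_cur = u_new
       end
```

Because an unformatted write stores the values bit for bit, the comparison here is for exact equality. If a save file is written with formatted output instead, digits can be lost, and a tolerance may be needed in the comparison.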

Save-restart code example:

This flow chart shows the required logic:

                          +--------------+       +-----+   +-----------+   +---+
+-----+   +--------+ Yes  |Read history  |       | Run |   |Periodic   |   |End|
|Start|-->|Do a    |----->|file from MSS |------>|     |-->|save to MSS|-->|   |
|     |   |restart?|      +--------------+  ^  ^ +-----+   +-----------+   +---+
+-----+   +--------+                        |  |                  |
              |       No   +--------------+ |  |                  |
              +----------->|Do normal     |-+  +------------------+
                           |initialization|
                           +--------------+

An example of an application that might require save-restart capability is a simple time-dependent "marching program" that solves the wave equation.

       program solve
       real  c, dt, dx, u_old(1001), u_cur(1001), u_new(1001)

       call init
       do iter = 1,5000
         call march(u_old, u_cur, u_new, dt, dx, c)
         u_old  = u_cur
         u_cur  = u_new
       enddo
       stop
       end

For any iteration, the internal state of this program is described by the arrays u_old and u_cur and the parameters dx, dt and c. Saving these state variables would ensure that the program could restart from that iteration and continue without losing the cumulative result of any past computations.

To implement the save-restart capability, the program might be rewritten this way:

       program solve

       logical restart
       integer nits
       real    c, dt, dx, u_old(1001), u_cur(1001), u_new(1001)
       common  c, dt, dx, u_old, u_cur, it_cur
!
! Read restart flag and integer number of iterations on standard input
!
       read(*,*) restart, nits
       if (restart) then
          call rest_mss            ! Restart from last run
          open (unit=37, file='./cookie', status='old')
          read(37,*) nits          ! Read file named cookie for number
                                   !   of current iteration
          nits=nits-1              ! Decrement the iteration count
          rewind(37)
          write(37,*) nits         ! Update cookie, the counting file
          close(37)
       else
          call init                ! Start from very beginning
       endif
!
! Integrate, and save program state every 25 iterations
!
       do iter = it_cur+1, it_cur+nits
         call march(u_new)
         u_old  = u_cur
         u_cur  = u_new
         if (mod(iter,25) .eq. 0) then
           it_cur = iter
           call save_mss           ! Save common block
         endif
       enddo

       stop
       end
!--------------------------------------------------------------------

       subroutine rest_mss
       real    c, dt, dx, u_old(1001), u_cur(1001)
       common  c, dt, dx, u_old, u_cur, it_cur
!
! Read MSS save file
!
       call msread(ier,'save.old','/JOEUSER/wave/save', ' ','replace')

       if (ier.ne.0) then          ! Check before opening the file
          print *,'Error from msread.'
          stop
       endif

       open (unit=38, file='save.old', status='old', form='unformatted')
       read(38)  it_cur, c, dt, dx, u_old, u_cur
       close(38)
       call unlink('save.old')     ! Free disk space
       return
       end

!--------------------------------------------------------------------
       subroutine save_mss
       real    c, dt, dx, u_old(1001), u_cur(1001)
       common  c, dt, dx, u_old, u_cur, it_cur
!
! Save all variables and data needed to restart program, to MSS.
!
       open (unit=39, file='save.new', status='unknown',
     &       form='unformatted')
       write(39) it_cur, c, dt, dx, u_old, u_cur
       close(39)                   ! must close before MSWRITE
       call mswrite(ier,'save.new','/JOEUSER/wave/save',
     &              ' ',10,'NOWAIT')
                                   ! Retain on MSS for 10 days.
                                   ! async write
       if(ier.ne.0) then
          print *,'Error from mswrite.'
          stop
       endif
       return
       end
#
# Here is the section of the LSF run script (in csh) that must
#   accompany logic in the application itself
#
# It depends on an integer value stored in the file 'resubmit' that
#   exists in the application's $PWD
#
          .
          .
          .
       set RESUBMIT = 'FALSE'
       if ( -e ./resubmit ) then
          @ N = `cat ./resubmit`
          echo "file resubmit exists and requests $N more job submissions"
          if ( $N > 0 ) then
            set RESUBMIT = 'TRUE'
            @ N--
            echo $N >! ./resubmit
          endif
       endif

       if ( $RESUBMIT == 'TRUE' ) then
          echo "Note: resubmitting job"

          bsub < this_script_name
       else
          echo "Note: not resubmitting job"
       endif

       exit

In this example, only one save file with one record is written to the MSS, and since it is overwritten many times, only the last timestep of the wave equation solution is available on restart. You may use a more sophisticated strategy, where several save records are maintained in one file, or where more than one save file is maintained. You may also specify a longer retention period for MSS files.
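As a minimal sketch of the "more than one save file" strategy, the free-form Fortran 90 program below alternates between two local save files, so that an interruption during one write still leaves the previous save intact. Everything in it is an assumption made for illustration: the names two_saves, save_alt, and rest_alt, the file names save.A and save.B, and the five-element stand-in state. It works on local disk only; copying each file to the MSS would still be done with mswrite as in the example above.

```fortran
       program two_saves
       implicit none
       integer it, it_cur
       real u(5)

       u = 0.0
       do it = 1, 75
         u = u + 1.0               ! Stand-in for the real time step
         if (mod(it,25) .eq. 0) call save_alt(it, 25, u, 5)
       enddo

       u = 0.0                     ! Pretend the job was interrupted
       call rest_alt(it_cur, u, 5)
       print *, 'restored iteration ', it_cur, ' u(1) = ', u(1)
       end

! Write the state to save.A or save.B, chosen from the iteration
! count so that successive saves alternate, even across restarts
       subroutine save_alt(it_cur, every, u, n)
       implicit none
       integer it_cur, every, n
       real u(n)
       character(len=6) fname
       if (mod(it_cur/every, 2) .eq. 0) then
         fname = 'save.A'
       else
         fname = 'save.B'
       endif
       open (unit=41, file=fname, status='replace', form='unformatted')
       write(41) it_cur, u
       close(41)
       end

! Restore from whichever readable save file holds the later
! iteration; it_cur is returned as -1 if neither can be read
       subroutine rest_alt(it_cur, u, n)
       implicit none
       integer it_cur, n, k, it_k, ios
       real u(n), u_k(n)
       character(len=6) fname(2)
       fname(1) = 'save.A'
       fname(2) = 'save.B'
       it_cur = -1
       do k = 1, 2
         open (unit=41, file=fname(k), status='old', &
               form='unformatted', iostat=ios)
         if (ios .ne. 0) cycle     ! File is missing
         read(41, iostat=ios) it_k, u_k
         close(41)
         if (ios .eq. 0 .and. it_k .gt. it_cur) then
           it_cur = it_k
           u = u_k
         endif
       enddo
       end
```

In this sketch the saves at iterations 25 and 75 go to save.B and the save at iteration 50 goes to save.A, so the restore step picks the state from iteration 75. A save file that is incomplete makes the read return a nonzero iostat, and the routine then falls back to the other file.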


If you have questions about this document, please contact CISL Customer Support. You can also reach us by telephone 24 hours a day, seven days a week at 303-497-1278. Additional contact methods: consult1@ucar.edu and during business hours in NCAR Mesa Lab Suite 39.

© Copyright 2005. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.

Address of this page: http://www.cisl.ucar.edu/docs/lightning/save.jsp