Last update: 03/01/2005
Save-restart for long-running jobs

Adding save-restart capability to any application that runs on CISL systems for longer than 30 minutes is recommended. It helps prevent the loss of computer output and cycles due to a system crash, unexpected job timeout, running past queue time limits, internal errors, or any other kind of unexpected run-time interruption.

A good save-restart procedure periodically writes a complete set of program state variables to nonvolatile, offline storage such as NCAR's Mass Storage System (MSS). If the job terminates unexpectedly, it can then be restarted from the last point at which the internal state data was saved.

Any code that uses a save-restart mechanism should be thoroughly tested to demonstrate that the application produces the same results for long runtimes whether started from initial conditions or from an intermediate results file.

Save-restart code example

This flow chart shows the required logic:
                            +--------------+
+-----+   +--------+  Yes   |Read history  |          +-----+   +-----------+   +---+
|Start|-->|Do a    |------->|file from MSS |--------->| Run |-->|Periodic   |-->|End|
|     |   |restart?|        +--------------+    ^   ^ |     |   |save to MSS|   |   |
+-----+   +--------+                            |   | +-----+   +-----------+   +---+
              | No          +--------------+    |   |                 |
              +------------>|Do normal     |----+   +-----------------+
                            |initialization|
                            +--------------+
An example of an application that might require save-restart capability is a simple time-dependent "marching program" that solves the wave equation.
program solve
   real c, dt, dx, u_old(1001), u_cur(1001), u_new(1001)
   call init
   do iter = 1,5000
      call march(u_old, u_cur, u_new, dt, dx, c)
      u_old = u_cur
      u_cur = u_new
   enddo
   stop
end
For any iteration, the internal state of this program is described by the arrays u_old and u_cur and the parameters dx, dt and c. Saving these state variables would ensure that the program could restart from that iteration and continue without losing the cumulative result of any past computations. To implement the save-restart capability, the program might be rewritten this way:
program solve
   logical restart
   integer nits
   real c, dt, dx, u_old(1001), u_cur(1001), u_new(1001)
   common c, dt, dx, u_old, u_cur, it_cur
!
! Read restart flag and integer number of iterations on standard input
!
   read(*,*) restart, nits
   if (restart) then
      call rest_mss               ! Restart from last run
      open (unit=40, file='./cookie', status='old')
      read(40,*) nits             ! Read file named cookie for number
                                  ! of current iteration
      nits = nits-1               ! Decrement the iteration count
      rewind(40)
      write(40,*) nits            ! Update cookie, the counting file
      close(40)
   else
      call init                   ! Start from very beginning
      it_cur = 0                  ! No iterations completed yet
   endif
!
! Integrate, and save program state every 25 iterations
!
   do iter = it_cur+1, it_cur+nits
      call march(u_new)
      u_old = u_cur
      u_cur = u_new
      if (mod(iter,25) .eq. 0) then
         it_cur = iter
         call save_mss            ! Save common block
      endif
   enddo
   stop
end
!--------------------------------------------------------------------
subroutine rest_mss
   real c, dt, dx, u_old(1001), u_cur(1001)
   common c, dt, dx, u_old, u_cur, it_cur
!
! Read MSS save file
!
   call msread(ier,'save.old','/JOEUSER/wave/save', ' ','replace')
   if (ier.ne.0) then             ! Check the transfer before opening
      print *,'Error from msread.'
      stop
   endif
   open (unit=38, file='save.old', status='old', form='unformatted')
   read(38) it_cur, c, dt, dx, u_old, u_cur   ! Whole arrays, as written
   close(38)
   call unlink('save.old')        ! Free disk space
   return
end
!--------------------------------------------------------------------
subroutine save_mss
   real c, dt, dx, u_old(1001), u_cur(1001)
   common c, dt, dx, u_old, u_cur, it_cur
!
! Save all variables and data needed to restart program, to MSS.
!
   open (unit=39, file='save.new', status='unknown', form='unformatted')
   write(39) it_cur, c, dt, dx, u_old, u_cur
   close(39)                      ! must close before MSWRITE
   call mswrite(ier,'save.new','/JOEUSER/wave/save', &
                ' ',10,'NOWAIT')
                                  ! Retain on MSS for 10 days.
                                  ! async write
   if (ier.ne.0) then
      print *,'Error from mswrite.'
      stop
   endif
   return
end
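
The introduction recommends testing that a restarted run reproduces an uninterrupted one. The following is a minimal sketch of such a check, not part of the original example. It assumes a local scratch file (save.tmp) in place of the MSS, a shortened run of 200 iterations, and hypothetical init and march routines (a Gaussian pulse and a leapfrog update with fixed ends), since the original does not show them.

```fortran
! Hypothetical self-check, not part of the original example: it uses a local
! scratch file (save.tmp) in place of the MSS and an assumed leapfrog march.
program restart_check
  implicit none
  integer, parameter :: n = 1001, nits = 200, nsave = 100
  real :: c, dt, dx
  real :: u_old(n), u_cur(n), u_new(n)   ! uninterrupted run
  real :: v_old(n), v_cur(n), v_new(n)   ! saved-and-restarted run
  integer :: iter, it_cur

  c = 1.0
  dx = 1.0 / real(n - 1)
  dt = 0.5 * dx / c

  ! Reference: all iterations in one go.
  call init(u_old, u_cur, dx)
  do iter = 1, nits
     call march(u_old, u_cur, u_new, dt, dx, c)
     u_old = u_cur
     u_cur = u_new
  end do

  ! Split run, part 1: stop after nsave iterations and save the state.
  call init(v_old, v_cur, dx)
  do iter = 1, nsave
     call march(v_old, v_cur, v_new, dt, dx, c)
     v_old = v_cur
     v_cur = v_new
  end do
  open (unit=39, file='save.tmp', status='replace', form='unformatted')
  write (39) nsave, c, dt, dx, v_old, v_cur
  close (39)

  ! Split run, part 2: wipe the state, reload it, and finish the run.
  v_old = 0.0
  v_cur = 0.0
  open (unit=38, file='save.tmp', status='old', form='unformatted')
  read (38) it_cur, c, dt, dx, v_old, v_cur
  close (38, status='delete')
  do iter = it_cur + 1, nits
     call march(v_old, v_cur, v_new, dt, dx, c)
     v_old = v_cur
     v_cur = v_new
  end do

  ! The two final states are compared element by element.
  if (all(u_old == v_old) .and. all(u_cur == v_cur)) then
     print *, 'PASS: restarted run matches uninterrupted run'
  else
     print *, 'FAIL: max difference =', maxval(abs(u_cur - v_cur))
     stop 1
  end if

contains

  ! Assumed initial condition: Gaussian pulse at rest.
  subroutine init(w_old, w_cur, h)
    real, intent(out) :: w_old(:), w_cur(:)
    real, intent(in) :: h
    integer :: i
    do i = 1, size(w_cur)
       w_cur(i) = exp(-100.0 * (real(i - 1) * h - 0.5)**2)
    end do
    w_old = w_cur
  end subroutine init

  ! Assumed time step: explicit leapfrog scheme with fixed (zero) ends.
  subroutine march(w_old, w_cur, w_new, tstep, h, speed)
    real, intent(in) :: w_old(:), w_cur(:), tstep, h, speed
    real, intent(out) :: w_new(:)
    integer :: m
    real :: r2
    m = size(w_cur)
    r2 = (speed * tstep / h)**2
    w_new(1) = 0.0
    w_new(m) = 0.0
    w_new(2:m-1) = 2.0 * w_cur(2:m-1) - w_old(2:m-1) &
                 + r2 * (w_cur(3:m) - 2.0 * w_cur(2:m-1) + w_cur(1:m-2))
  end subroutine march

end program restart_check
```

Because unformatted I/O stores the bit patterns of the state variables, the comparison here is for exact equality. If a real application saved its state in formatted (text) form, or left some state variable out of the save record, a check of this kind is where the difference would be expected to show up.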
#
# Here is the section of the LSF run script (in csh) that must
# accompany logic in the application itself
#
# It depends on an integer value stored in the file 'resubmit' that
# exists in the application's $PWD
#
.
.
.
set RESUBMIT = 'FALSE'
if ( -e ./resubmit ) then
    @ N = `cat ./resubmit`
    echo "file resubmit exists and requests $N more job submissions"
    if ( $N > 0 ) then
        set RESUBMIT = 'TRUE'
        @ N--
        echo $N >! ./resubmit
    endif
endif
if ( $RESUBMIT == 'TRUE' ) then
    echo "Note: resubmitting job"
    bsub < this_script_name
else
    echo "Note: not resubmitting job"
endif
exit
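
The save_mss routine in the Fortran listing overwrites a single save file, so an interruption during the write could leave no usable restart data. One possible refinement, sketched here under stated assumptions and not taken from the original, alternates between two local save files and, on restart, picks the newest one that reads back completely. The file names save.0 and save.1, the routine load_newest, and the stand-in time step are all hypothetical, and the transfer to the MSS is omitted.

```fortran
! Hypothetical sketch, not part of the original example: alternate between
! two local save files (save.0 and save.1); the MSS transfer is omitted.
program rotate_saves
  implicit none
  integer, parameter :: n = 1001
  real :: u_old(n), u_cur(n)
  integer :: iter, slot, it_found
  character(len=8) :: fname

  u_old = 0.0
  u_cur = 1.0
  slot = 0
  do iter = 1, 100
     u_old = u_cur                 ! stand-in for the real time step
     u_cur = u_cur + 1.0
     if (mod(iter, 25) == 0) then
        slot = 1 - slot            ! switch to the other file
        write (fname, '(a,i1)') 'save.', slot
        open (unit=39, file=trim(fname), status='replace', &
              form='unformatted')
        write (39) iter, u_old, u_cur
        close (39)
     end if
  end do

  call load_newest(it_found, u_old, u_cur)
  print *, 'newest readable save is from iteration', it_found

contains

  ! Read both save files and keep the state with the highest iteration
  ! number among those that could be read without error.
  subroutine load_newest(it_best, w_old, w_cur)
    integer, intent(out) :: it_best
    real, intent(inout) :: w_old(:), w_cur(:)
    real :: t_old(size(w_old)), t_cur(size(w_cur))
    integer :: s, it, ios
    character(len=8) :: sname
    it_best = -1                   ! -1 means no usable save was found
    do s = 0, 1
       write (sname, '(a,i1)') 'save.', s
       open (unit=38, file=trim(sname), status='old', &
             form='unformatted', iostat=ios)
       if (ios /= 0) cycle         ! file is missing
       read (38, iostat=ios) it, t_old, t_cur
       close (38)
       if (ios /= 0) cycle         ! file is incomplete or unreadable
       if (it > it_best) then
          it_best = it
          w_old = t_old
          w_cur = t_cur
       end if
    end do
  end subroutine load_newest

end program rotate_saves
```

With the save interval of 25 and the 100 iterations used here, save.0 should end up holding iteration 100 and save.1 iteration 75, so the loader should report iteration 100. The cost of this design is twice the disk space for save data; the benefit is that one complete save remains if the other is damaged.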
In the solve example above, only one save file with one record is written to the MSS. Because that file is overwritten at every save, only the most recently saved timestep of the wave equation solution is available on restart. You may use a more sophisticated strategy, where several save records are maintained in one file, or where more than one save file is maintained. You may also specify a longer retention period for MSS files.

If you have questions about this document, please contact CISL Customer Support. You can also reach us by telephone 24 hours a day, seven days a week at 303-497-1278. Additional contact methods: consult1@ucar.edu and during business hours in NCAR Mesa Lab Suite 39.

© Copyright 2005. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.
Address of this page: http://www.cisl.ucar.edu/docs/lightning/save.jsp