SpECTRE
v2024.08.03
SpECTRE executables can write checkpoints that save their instantaneous state to disc; the execution can be restarted later from a saved checkpoint. This feature is useful for expensive simulations that would run longer than the wallclock limits on a supercomputer system.
Executables can checkpoint when:
1. The default_phase_order member variable in the Metavariables includes a WriteCheckpoint phase.
2. The WriteCheckpoint phase is run by a PhaseControl specified in the Metavariables and the input file. The two supported ways of running the checkpoint phase are:
   - CheckpointAndExitAfterWallclock. This is the recommended phase control for checkpointing, because it writes only one checkpoint before cleanly terminating the code. This reduces the disc space taken up by checkpoint files and stops using up the allocation's CPU-hours on work that would be redone anyway after the run is restarted. The executable will return exit code 2 when it terminates from CheckpointAndExitAfterWallclock, meaning it is incomplete and should continue from the checkpoint. See Parallel::ExitCode for a definition of all exit codes.
   - VisitAndReturn(WriteCheckpoint). This is useful for writing more frequent checkpoint files, which could help when debugging a run by restarting it from just before the failure.

To restart an executable from a checkpoint file, run a command like this:
where the 0123 should be the number of the checkpoint to restart from. You can also use the command-line interface (CLI) for restarting:
There are a number of caveats in the current implementation of checkpointing and restarting:
- When using CheckpointAndExitAfterWallclock to trigger checkpoints, note that the elapsed wallclock time is checked only when the PhaseControl is run, i.e., at global synchronization points defined in the input file. This means that to write a checkpoint in the 30 minutes before the end of a job's queue time, the triggers in the input file must trigger global synchronizations at least once every 30 minutes (and preferably 2-3 times within that window, to leave a margin for the time to write files to disc, etc.). It is currently up to the user to find the balance between too-frequent synchronizations (which slow the code) and too-infrequent synchronizations (which won't allow checkpoints to be written).
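
The arithmetic behind this caveat can be sketched in a few lines of Python. This is only an illustration for planning a job, not part of SpECTRE: the function and constant names are hypothetical, and the only values taken from the text are the 30-minute window, the 2-3 synchronizations inside it, and exit code 2 for a run that stopped after CheckpointAndExitAfterWallclock.

```python
# Exit code returned after CheckpointAndExitAfterWallclock, as stated in the
# text above. The constant name is hypothetical, not a SpECTRE identifier.
EXIT_CODE_CONTINUE_FROM_CHECKPOINT = 2


def max_sync_interval_minutes(window_minutes, checks_in_window):
    """Longest allowed gap between global synchronizations, in minutes.

    window_minutes: length of the window before the end of the job in
        which the checkpoint should be written.
    checks_in_window: number of synchronizations that should fall inside
        that window (1 is the bare minimum; 2-3 leaves a margin).
    """
    if window_minutes <= 0 or checks_in_window < 1:
        raise ValueError("need window_minutes > 0 and checks_in_window >= 1")
    return window_minutes / checks_in_window


def should_restart(returncode):
    """True if the run ended after a checkpoint and should be continued."""
    return returncode == EXIT_CODE_CONTINUE_FROM_CHECKPOINT


if __name__ == "__main__":
    # Reproduce the 30-minute example with 1, 2, and 3 checks in the window.
    for checks in (1, 2, 3):
        interval = max_sync_interval_minutes(30, checks)
        print(f"{checks} check(s) in a 30 min window: sync every {interval:g} min")
```

With a 30-minute window, asking for three synchronizations inside it means the input-file triggers should fire a global synchronization roughly every 10 minutes. How this interval maps onto simulation steps or slabs depends on the run's cost per step, which has to be measured for each setup.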