SpECTRE v2024.12.16
SpECTRE executables can write checkpoints that save their instantaneous state to disc; the execution can be restarted later from a saved checkpoint. This feature is useful for expensive simulations that would run longer than the wallclock limits on a supercomputer system.
Executables can checkpoint when:

- the default_phase_order member variable in the Metavariables includes a WriteCheckpoint phase, and
- the WriteCheckpoint phase is run by a PhaseControl specified in the Metavariables and the input file.

The two supported ways of running the checkpoint phase are:

- CheckpointAndExitAfterWallclock. This is the recommended phase control for checkpointing, because it writes only one checkpoint before cleanly terminating the code. This reduces the disc space taken up by checkpoint files and avoids spending the allocation's CPU-hours on work that would be redone anyway after the run is restarted. The executable returns exit code 2 when it terminates from CheckpointAndExitAfterWallclock, meaning the run is incomplete and should be continued from the checkpoint. See Parallel::ExitCode for a definition of all exit codes.
- VisitAndReturn(WriteCheckpoint). This is useful for writing more frequent checkpoint files, which can help when debugging a run by restarting it from just before the failure.

To restart an executable from a checkpoint file, run a command like this:
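```sh
# Placeholder executable and input-file names; the +restart argument and the
# checkpoint directory follow the naming used later in this section.
./MyExecutable --input-file path/to/Input.yaml +restart Checkpoints/Checkpoint_0123
```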
where the 0123 should be the number of the checkpoint to restart from. You can also use the command-line interface (CLI) to restart from a checkpoint.
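Because CheckpointAndExitAfterWallclock terminates with exit code 2 when the run should be continued, a submission script can check that code to decide whether to resubmit. The following is only a sketch; the executable, input file, and resubmission command are placeholders:

```sh
# Sketch only: the names below are placeholders, not part of SpECTRE.
./MyExecutable --input-file path/to/Input.yaml
if [ $? -eq 2 ]; then
  # Exit code 2: the run wrote a checkpoint and should be continued, e.g. by
  # submitting a new job that restarts with +restart as shown above.
  sbatch RestartJob.sh
fi
```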
There are a number of caveats in the current implementation of checkpointing and restarting:
- When using CheckpointAndExitAfterWallclock to trigger checkpoints, note that the elapsed wallclock time is checked only when the PhaseControl is run, i.e., at the global synchronization points defined in the input file. This means that to write a checkpoint in the 30 minutes before the end of a job's queue time, the triggers in the input file must trigger global synchronizations at least once every 30 minutes (and probably 2-3 times in that window, to leave a margin for the time it takes to write files to disc, etc.). It is currently up to the user to find the balance between too-frequent synchronizations (which slow the code) and too-infrequent synchronizations (which won't allow checkpoints to be written).
- Certain simulation parameters can be modified when restarting from a checkpoint file. This is done by parsing a new input file containing just the options to modify; all other options preserve their values from the original run.

Note, however, that not all tags are permitted to be modified: in the current implementation, only tags from the const_global_cache_tags that also have a member variable static constexpr bool is_overlayable = true; can be modified. The reason for this "opt-in" design is that, in general, most tags interact with past or current simulation data in a way that would invalidate the simulation state if the tag were modified on restart (for example, changing the domain invalidates all spatial data, and changing the timestepper invalidates the history). Only tags that do not interact with the state should be permitted to be updated. For example, activation thresholds for various algorithms or the frequency of data observation are safe parameters to modify.
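As a rough illustration, an overlayable tag might look like the following sketch. The tag name, option, and helper members are hypothetical and assume the usual SpECTRE option-tag conventions (the header paths shown may differ); the only piece specific to this feature is the is_overlayable member, and the tag must also appear in the Metavariables' const_global_cache_tags.

```cpp
#include "DataStructures/DataBox/Tag.hpp"
#include "Options/Options.hpp"
#include "Utilities/TMPL.hpp"

// Hypothetical option: how often to observe data. Names are placeholders.
namespace OptionTags {
struct ObservationInterval {
  using type = double;
  static constexpr Options::String help{
      "Time interval between observations."};
};
}  // namespace OptionTags

namespace Tags {
struct ObservationInterval : db::SimpleTag {
  using type = double;
  using option_tags = tmpl::list<OptionTags::ObservationInterval>;
  static constexpr bool pass_metavariables = false;
  static double create_from_options(const double interval) { return interval; }

  // Opt in to being updated from an overlay input file when restarting
  // from a checkpoint.
  static constexpr bool is_overlayable = true;
};
}  // namespace Tags
```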
The executable will update the global cache with new input file values during the phase UpdateOptionsAtRestartFromCheckpoint. The CheckpointAndExitAfterWallclock phase control automatically directs code flow to this phase after a restart.
In this option-updating phase, the code tries to read an "overlay" input file whose name is computed from the original input file and the number of the checkpoint used to restart. For example, if the original input file is path/to/Input.yaml and the code is restarted with +restart Checkpoints/Checkpoint_0123, then the overlay input file to read is named path/to/Input.overlay_0123.yaml. If this file does not exist, the executable simply continues with the previous parameter values.
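The workflow might look like the following sketch; the executable name and the option shown in the overlay file are placeholders, while the file-naming convention and the +restart argument are as described above.

```sh
# Write an overlay file next to the original input file, named after the
# checkpoint being restarted from. Only the options listed here are changed,
# and each must correspond to a tag with is_overlayable = true.
cat > path/to/Input.overlay_0123.yaml <<EOF
SomeOverlayableOption: NewValue  # placeholder option name
EOF

# Restart from the checkpoint; the overlay file is read automatically during
# the UpdateOptionsAtRestartFromCheckpoint phase.
./MyExecutable --input-file path/to/Input.yaml +restart Checkpoints/Checkpoint_0123
```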