\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond
# %Setting up checkpoints and restarts {#tutorial_checkpoint_restart}

\tableofcontents

SpECTRE executables can write checkpoints that save their instantaneous state
to disc; the execution can be restarted later from a saved checkpoint. This
feature is useful for expensive simulations that would run longer than the
wallclock limits on a supercomputer system.

Executables can checkpoint when:
1. The `default_phase_order` member variable in the `Metavariables` includes a
   `WriteCheckpoint` phase.
2. The `WriteCheckpoint` phase is run by a `PhaseControl` specified in the
   `Metavariables` and the input file (see the input-file sketch after this
   list). The two supported ways of running the checkpoint phase are:
   - with `CheckpointAndExitAfterWallclock`. This is the recommended phase
     control for checkpointing, because it writes only one checkpoint before
     cleanly terminating the code. This reduces the disc space taken up by
     checkpoint files and avoids spending the allocation's CPU-hours on work
     that would be redone anyway after the run is restarted. The executable
     will return exit code 2 when it terminates from
     `CheckpointAndExitAfterWallclock`, meaning the run is incomplete and
     should continue from the checkpoint. See `Parallel::ExitCode` for
     definitions of all exit codes.
   - using `VisitAndReturn(WriteCheckpoint)`. This is useful for writing more
     frequent checkpoint files, which could help when debugging a run by
     restarting it from just before the failure.
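For concreteness, here is a rough sketch of what the corresponding
`PhaseChangeAndTriggers` block of an input file might look like. The trigger
choice and all numeric values are placeholders, and the exact option names and
layout can differ between executables and versions, so consult your
executable's option documentation rather than copying this verbatim:
```
# Illustrative sketch only: the trigger and all values below are
# placeholders, not recommendations.
PhaseChangeAndTriggers:
  - Trigger:
      Slabs:
        EvenlySpaced:
          # Synchronize globally often enough that the wallclock check
          # below runs several times before the queue time runs out.
          Interval: 100
          Offset: 0
    PhaseChanges:
      # Write one checkpoint and exit cleanly (exit code 2) once the
      # elapsed wallclock time exceeds this many hours:
      - CheckpointAndExitAfterWallclock:
          WallclockHours: 23.5
      # Alternative for frequent checkpoints, e.g. while debugging:
      # - VisitAndReturn(WriteCheckpoint)
```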
To restart an executable from a checkpoint file, run a command like this:
```
./MySpectreExecutable +restart Checkpoints/Checkpoint_0123
```
where `0123` should be the number of the checkpoint to restart from. You can
also use the \ref tutorial_cli "command-line interface (CLI)" for restarting:
```
./spectre run INPUT_FILE --from-last-checkpoint Checkpoints/
```

There are a number of caveats in the current implementation of checkpointing
and restarting:

1. The same binary must be used when writing the checkpoint and when
   restarting from it. If a different binary is used to restart the code,
   there are no guarantees that the code will restart or that the continued
   execution will be correct.
2. The code must be restarted on the same hardware configuration used when
   writing the checkpoint --- this means the same number of nodes with the
   same number of processors per node.
3. Currently, there is no support for modifying any parameters during a
   restart. The restart only extends a simulation's runtime beyond wallclock
   limits.
4. When using `CheckpointAndExitAfterWallclock` to trigger checkpoints, note
   that the elapsed wallclock time is checked only when the `PhaseControl` is
   run, i.e., at the global synchronization points defined in the input file.
   This means that to write a checkpoint in the 30 minutes before the end of
   a job's queue time, the triggers in the input file must request global
   synchronizations at least once every 30 minutes (and probably two or three
   times in that window, to leave a margin for the time to write files to
   disc, etc.). It is currently up to the user to find the balance between
   too-frequent synchronizations (which slow the code) and too-infrequent
   synchronizations (which won't allow checkpoints to be written).
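Since `CheckpointAndExitAfterWallclock` ends the job before the queue's
wallclock limit, continuing the simulation usually means submitting a new
batch job. The following sketch shows one way a submission script could use
exit code 2 to resubmit itself. The file names `Input.yaml` and `Submit.sh`
and the `sbatch` call are placeholders for whatever batch system and scripts
you actually use, it assumes the CLI propagates the executable's exit code,
and the very first submission would omit `--from-last-checkpoint`:
```
# Sketch only: Input.yaml, Submit.sh, and sbatch are placeholders.
./spectre run Input.yaml --from-last-checkpoint Checkpoints/
if [ $? -eq 2 ]; then
  # Exit code 2 means the run checkpointed and exited before finishing,
  # so queue a new job that will continue from the latest checkpoint.
  sbatch Submit.sh
fi
```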