\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond
# %Setting up checkpoints and restarts {#tutorial_checkpoint_restart}

\tableofcontents

SpECTRE executables can write checkpoints that save their instantaneous state
to disc; the execution can be restarted later from a saved checkpoint. This
feature is useful for expensive simulations that would run longer than the
wallclock limits on a supercomputer system.

Executables can checkpoint when:
1. The `default_phase_order` member variable in the `Metavariables` includes a
   `WriteCheckpoint` phase.
2. The `WriteCheckpoint` phase is run by a `PhaseControl` specified in the
   `Metavariables` and the input file (see the input-file sketch after this
   list). The two supported ways of running the checkpoint phase are:
   - with `CheckpointAndExitAfterWallclock`. This is the recommended phase
     control for checkpointing, because it writes only one checkpoint before
     cleanly terminating the code. This reduces the disc space taken up by
     checkpoint files and avoids spending the allocation's CPU-hours on work
     that would be redone anyway after the run is restarted. The executable
     will return exit code 2 when it terminates from
     `CheckpointAndExitAfterWallclock`, meaning the run is incomplete and
     should continue from the checkpoint. See `Parallel::ExitCode` for
     definitions of all exit codes.
   - using `VisitAndReturn(WriteCheckpoint)`. This is useful for writing more
     frequent checkpoint files, which could help when debugging a run by
     restarting it from just before the failure.
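For concreteness, here is a rough sketch of what the corresponding
`PhaseChangeAndTriggers` block of an input file might look like. The trigger
choice and all numeric values are placeholders, and the exact option names and
layout can differ between executables and versions, so consult your
executable's option documentation rather than copying this verbatim:
```
# Illustrative sketch only: the trigger and all values below are
# placeholders, not recommendations.
PhaseChangeAndTriggers:
  - Trigger:
      Slabs:
        EvenlySpaced:
          # Synchronize globally often enough that the wallclock check
          # below runs several times before the queue time runs out.
          Interval: 100
          Offset: 0
    PhaseChanges:
      # Write one checkpoint and exit cleanly (exit code 2) once the
      # elapsed wallclock time exceeds this many hours:
      - CheckpointAndExitAfterWallclock:
          WallclockHours: 23.5
      # Alternative for frequent checkpoints, e.g. while debugging:
      # - VisitAndReturn(WriteCheckpoint)
```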
To restart an executable from a checkpoint file, run a command like this:
```
./MySpectreExecutable +restart Checkpoints/Checkpoint_0123
```
where `0123` should be the number of the checkpoint to restart from. You can
also use the \ref tutorial_cli "command-line interface (CLI)" for restarting:
```
./spectre run INPUT_FILE --from-last-checkpoint Checkpoints/
```

There are a number of caveats in the current implementation of checkpointing
and restarting:

1. The same binary must be used when writing the checkpoint and when
   restarting from it. If a different binary is used to restart the code,
   there are no guarantees that the code will restart or that the continued
   execution will be correct.
2. The code must be restarted on the same hardware configuration used when
   writing the checkpoint --- this means the same number of nodes with the
   same number of processors per node.
3. Currently, there is no support for modifying any parameters during a
   restart. The restart only extends a simulation's runtime beyond wallclock
   limits.
4. When using `CheckpointAndExitAfterWallclock` to trigger checkpoints, note
   that the elapsed wallclock time is checked only when the `PhaseControl` is
   run, i.e., at the global synchronization points defined in the input file.
   This means that to write a checkpoint in the 30 minutes before the end of
   a job's queue time, the triggers in the input file must request global
   synchronizations at least once every 30 minutes (and probably two or three
   times in that window, to leave a margin for the time to write files to
   disc, etc.). It is currently up to the user to find the balance between
   too-frequent synchronizations (which slow the code) and too-infrequent
   synchronizations (which won't allow checkpoints to be written).
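Since `CheckpointAndExitAfterWallclock` ends the job before the queue's
wallclock limit, continuing the simulation usually means submitting a new
batch job. The following sketch shows one way a submission script could use
exit code 2 to resubmit itself. The file names `Input.yaml` and `Submit.sh`
and the `sbatch` call are placeholders for whatever batch system and scripts
you actually use, it assumes the CLI propagates the executable's exit code,
and the very first submission would omit `--from-last-checkpoint`:
```
# Sketch only: Input.yaml, Submit.sh, and sbatch are placeholders.
./spectre run Input.yaml --from-last-checkpoint Checkpoints/
if [ $? -eq 2 ]; then
  # Exit code 2 means the run checkpointed and exited before finishing,
  # so queue a new job that will continue from the latest checkpoint.
  sbatch Submit.sh
fi
```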