Phase control object that runs the WriteCheckpoint and Exit phases after a specified amount of wallclock time has elapsed.
|
|
| CheckpointAndExitAfterWallclock (const std::optional< double > wallclock_hours, const Options::Context &context={}) |
|
| CheckpointAndExitAfterWallclock (CkMigrateMessage *msg) |
|
template<typename... DecisionTags> |
| void | initialize_phase_data_impl (const gsl::not_null< tuples::TaggedTuple< DecisionTags... > * > phase_change_decision_data) const |
|
template<typename ParallelComponent, typename ArrayIndex, typename Metavariables> |
| void | contribute_phase_data_impl (Parallel::GlobalCache< Metavariables > &cache, const ArrayIndex &array_index) const |
|
template<typename... DecisionTags, typename Metavariables> |
| std::optional< std::pair< Parallel::Phase, ArbitrationStrategy > > | arbitrate_phase_change_impl (const gsl::not_null< tuples::TaggedTuple< DecisionTags... > * > phase_change_decision_data, const Parallel::Phase current_phase, const Parallel::GlobalCache< Metavariables > &) const |
|
void | pup (PUP::er &p) override |
|
| PhaseChange (CkMigrateMessage *msg) |
|
| WRAPPED_PUPable_abstract (PhaseChange) |
|
template<typename ParallelComponent, typename DbTags, typename Metavariables, typename ArrayIndex> |
| void | contribute_phase_data (const gsl::not_null< db::DataBox< DbTags > * > box, Parallel::GlobalCache< Metavariables > &cache, const ArrayIndex &array_index) const |
| | Send data from all participating_components to the Main chare for determining the next phase.
|
|
template<typename... DecisionTags, typename Metavariables> |
| std::optional< std::pair< Parallel::Phase, PhaseControl::ArbitrationStrategy > > | arbitrate_phase_change (const gsl::not_null< tuples::TaggedTuple< DecisionTags... > * > phase_change_decision_data, const Parallel::Phase current_phase, const Parallel::GlobalCache< Metavariables > &cache) const |
| | Determine a phase request and PhaseControl::ArbitrationStrategy based on aggregated phase_change_decision_data on the Main Chare.
|
|
template<typename Metavariables, typename... Tags> |
| void | initialize_phase_data (const gsl::not_null< tuples::TaggedTuple< Tags... > * > phase_change_decision_data) const |
| | Initialize the phase_change_decision_data on the main chare to starting values.
|
When the executable exits through this phase control, it does so with the exit code Parallel::ExitCode::ContinueFromCheckpoint.
This phase control is useful for running SpECTRE executables performing lengthy computations that may exceed a supercomputer's wallclock limits. Writing a single checkpoint at the end of the job's allocated time allows the computation to be continued, while minimizing the disc space taken up by checkpoint files.
Note that this phase control is not a trigger on wallclock time. Rather, it checks the elapsed wallclock time when called, likely from a global sync point triggered by some other mechanism, e.g., at some slab boundary. Therefore, the WriteCheckpoint and Exit phases will run the first time this phase control is called after the specified wallclock time has been reached.
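The check-when-called behavior described above can be illustrated with a minimal sketch. This is not the SpECTRE implementation: the `Request` enum and the `check_at_sync_point` function are hypothetical names introduced here, and the use of `>=` for the comparison is an assumption. The point is only that the decision depends on the elapsed time sampled at the moment of the call, so nothing happens between sync points.

```cpp
#include <optional>

// Hypothetical stand-in for the decision made each time the phase control is
// called at a global sync point. Not the SpECTRE implementation.
enum class Request { None, WriteCheckpointAndExit };

// wallclock_hours: the requested threshold in hours; std::nullopt is assumed
//   here to mean that no checkpoint-and-exit is ever requested.
// elapsed_hours: wallclock time since the executable started, sampled when
//   this function is called.
inline Request check_at_sync_point(
    const std::optional<double>& wallclock_hours, const double elapsed_hours) {
  if (!wallclock_hours.has_value()) {
    return Request::None;
  }
  // Assumption: the request is made once the threshold has been reached.
  return elapsed_hours >= *wallclock_hours ? Request::WriteCheckpointAndExit
                                           : Request::None;
}
```

With a threshold of 11.5 hours, a call at 11.4 hours returns `Request::None`, while the next call at, say, 11.6 hours returns `Request::WriteCheckpointAndExit`. If no call happens after 11.5 hours, no checkpoint is ever requested.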
- Warning
- The global sync points must be triggered often enough to ensure that at least one sync point (i.e., one call to this phase control) falls in the window between the requested checkpoint-and-exit time and the time at which the batch system will kill the executable. As a concrete example: when running on a 12-hour queue with a checkpoint-and-exit requested after 11.5 hours, there is a 0.5-hour window in which a global sync must occur, the checkpoint files must be written to disc, and the executable must clean up. In this case, triggering a global sync every 2-10 minutes might be desirable. Matching the global sync frequency to the time window for checkpoint and exit is the responsibility of the user!
- Warning
- If the phase-change logic is modified on a checkpoint-restart, this PhaseChange must remain in the list after modification so that the end of the restart logic will run. The WallclockHours option can be set to None to disable further restarts.
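The timing budget from the first warning can be checked with simple arithmetic. The sketch below uses hypothetical helper names (`worst_case_exit_hours`, `fits_in_queue`) and assumes the worst case is that the threshold is reached just after a sync point, so the next call comes a full sync interval later. The time needed to write the checkpoint and clean up is not given in this documentation and must be estimated by the user.

```cpp
// All quantities are in hours. Hypothetical helpers for illustration only.

// Latest time at which the executable could finish exiting: the requested
// threshold, plus up to one full sync interval before the phase control is
// next called, plus the user's estimate for writing the checkpoint and
// cleaning up.
inline double worst_case_exit_hours(const double requested_hours,
                                    const double sync_interval_hours,
                                    const double checkpoint_and_cleanup_hours) {
  return requested_hours + sync_interval_hours + checkpoint_and_cleanup_hours;
}

// True if the worst case still finishes before the batch system's limit.
inline bool fits_in_queue(const double queue_limit_hours,
                          const double requested_hours,
                          const double sync_interval_hours,
                          const double checkpoint_and_cleanup_hours) {
  return worst_case_exit_hours(requested_hours, sync_interval_hours,
                               checkpoint_and_cleanup_hours) <=
         queue_limit_hours;
}
```

For the 12-hour queue with a request at 11.5 hours, a 10-minute sync interval and an assumed 15 minutes for checkpoint and cleanup give a worst case of about 11.92 hours, which fits. A 30-minute sync interval with the same assumed checkpoint time gives 12.25 hours, which does not.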