LCOV - aabde07399ba7837e5db64eedfd0a21f31f96922 - /__w/spectre/spectre/docs/DevGuide/Parallelization.md

SpECTRE Documentation Coverage Report

Current view:	top level - __w/spectre/spectre/docs/DevGuide - Parallelization.md		Hit	Total	Coverage
Commit:	aabde07399ba7837e5db64eedfd0a21f31f96922	Lines:	0	1	0.0 %
Date:	2024-04-26 02:38:13
Legend:	Lines: hit not hit

          Line data    Source code

       1           0 : \cond NEVER
       2             : Distributed under the MIT License.
       3             : See LICENSE.txt for details.
       4             : \endcond
       5             : # Parallelization, Charm++, and Core Concepts {#dev_guide_parallelization_foundations}
       6             : 
       7             : \tableofcontents
       8             : 
       9             : # Introduction {#dev_guide_parallelization_introduction}
      10             : 
      11             : SpECTRE builds a layer on top of Charm++ that performs various safety checks and
      12             : initialization for the user that can otherwise lead to difficult-to-debug
      13             : undefined behavior. The central concept is what is called a %Parallel
      14             : Component. A %Parallel Component is a struct with several type aliases that
      15             : is used by SpECTRE to set up the Charm++ chares and allowed communication
      16             : patterns. %Parallel Components are input arguments to the compiler, which then
      17             : writes the parallelization infrastructure that you requested for the executable.
      18             : There is no restriction on the number of %Parallel Components, though
      19             : practically it is best to have around 10 at most.
      20             : 
      21             : Here is an overview of what is described in detail in the sections below:
      22             : 
      23             : - \ref dev_guide_parallelization_metavariables_class "Metavariables": Provides
      24             :   high-level configuration to the compiler, e.g. the physical system to be
      25             :   simulated.
      26             : - \ref dev_guide_parallelization_phases_of_execution "Phases": Defines distinct
      27             :   simulation phases separated by a global synchronization point,
      28             :   e.g. `Initialization`, `Evolve` and `Exit`.
      29             : - \ref dev_guide_parallelization_core_algorithm "Algorithm": In each phase,
      30             :   repeatedly iterates over a list of actions until the current phase ends.
      31             : - \ref dev_guide_parallelization_parallel_components "Parallel component":
      32             :   Maintains and executes its algorithm.
      33             : - \ref dev_guide_parallelization_actions "Action": Performs a computational
      34             :   task, e.g. evaluating the right hand side of the time evolution equations. May
      35             :   require data to be received from another action potentially being executed on
      36             :   a different core or node.
      37             : 
      38             : # The Metavariables Class {#dev_guide_parallelization_metavariables_class}
      39             : 
      40             : SpECTRE takes a different approach to input options passed to an executable than
      41             : is common. SpECTRE not only reads an input file at runtime but also has many
      42             : choices made at compile time. The compile time options are specified by what is
      43             : referred to as the metavariables. What exactly the metavariables struct
      44             : specifies depends on the executable, but all metavariables structs must
      45             : specify the following:
      46             : 
      47             : - `help`: a `static constexpr Options::String` that will be printed as part of
      48             :   the help message. It should describe the executable and basic usage of it, as
      49             :   well as any non-standard options that must be specified in the metavariables
      50             :   and their current values. An example of a help string for one of the testing
      51             :   executables is:
      52             :   \snippet Test_AlgorithmCore.cpp help_string_example
      53             : - `component_list`: a `tmpl::list` of the parallel components (described below)
      54             :   that are to be created. Most evolution executables will have the
      55             :   `DgElementArray` parallel component listed. An example of a `component_list`
      56             :   for one of the test executables is:
      57             :   \snippet Test_AlgorithmCore.cpp component_list_example
      58             : - `using const_global_cache_tags`: a `tmpl::list` of tags that are
      59             :   used to place const items in the GlobalCache.  The alias may be
      60             :   omitted if the list is empty.
      61             : - `using mutable_global_cache_tags`: a `tmpl::list` of tags that are
      62             :   used to place mutable items in the GlobalCache.  The alias may be
      63             :   omitted if the list is empty.
      64             : - `default_phase_order`: an array of Parallel::Phase that must contain at least
      65             :   `Initialization` as the first element and `Exit`as the last element. Phases
      66             :   are described in the next section.
      67             : 
      68             : There are also several optional members:
      69             : 
      70             : - `input_file`: a `static constexpr Options::String` that is the default name of
      71             :   the input file that is to be read. This can be overridden at runtime by
      72             :   passing the `--input-file` argument to the executable.
      73             : - `ignore_unrecognized_command_line_options`: a `static constexpr bool` that
      74             :   defaults to `false`. If set to `true` then unrecognized command line options
      75             :   are ignored. Ignoring unrecognized options is generally only necessary for
      76             :   tests where arguments for the testing framework,
      77             :   [Catch](https://github.com/catchorg/Catch2/), are passed to the executable.
      78             : 
      79             : # Phases of an Execution {#dev_guide_parallelization_phases_of_execution}
      80             : 
      81             : Global synchronization points, where all cores wait for each other, are
      82             : undesirable for scalability reasons. However, they are sometimes inevitable for
      83             : algorithmic reasons. That is, in order to actually get a correct solution you
      84             : need to have a global synchronization. SpECTRE executables can have multiple
      85             : phases, where after each phase a global synchronization occurs. By global
      86             : synchronization we mean that no parallel components are executing or have more
      87             : tasks to execute: everything is waiting to be told what tasks to perform next.
      88             : 
      89             : Every executable must have at least two phases, `Initialization` and
      90             : `Exit`. The next phase is decided by the member function
      91             : Parallel::Main<Metavariables>::execute_next_phase.
      92             : Usually the next phase is determined
      93             : from the `default_phase_order` provided by the metavariables.  If more
      94             : complex decision making is desired, various components can send data
      95             : to Parallel::Main via the PhaseControl infrastructure.  This allows the next
      96             : Phase to be determined dynamically. Here is an example of a
      97             : `default_phase_order` member variable:
      98             : \snippet LinearSolverAlgorithmTestHelpers.hpp default_phase_order_array
      99             : 
     100             : In contrast, an evolution executable might have phases
     101             : `Initialization`, `SetInitialData`, `Evolve`, and `Exit`.
     102             : The first phase that is entered is always
     103             : `Initialization`. During the `Initialization` phase the
     104             : `Parallel::GlobalCache` is created, all (node)group components are created,
     105             : and empty array and singleton components are created.  Next, the function
     106             : `allocate_remaining_components_and_execute_initialization_phase` is called
     107             : which allocates singleton components and the elements of each array component,
     108             : and then starts the `Initialization` phase on all parallel components. Once all
     109             : parallel components' `Initialization` phase is complete, the next
     110             : phase is determined and the `execute_next_phase` function is called on
     111             : all the parallel components.
     112             : 
     113             : At the end of an execution the `Exit` phase has the executable wait to make sure
     114             : no \ref dev_guide_parallelization_parallel_components "parallel components" are
     115             : performing or need to perform any more tasks, and then exits. An example where
     116             : this approach is important is if we are done evolving a system but still need to
     117             : write data to disk. We do not want to exit the simulation until all data has
     118             : been written to disk, even though we've reached the final time of the evolution.
     119             : 
     120             : If we reach the `Exit` phase, but some \ref
     121             : dev_guide_parallelization_parallel_components "parallel components" have not
     122             : terminated properly, this means a deadlock has occurred. A deadlock usually
     123             : implies some error in the order messages have been sent/received. For example,
     124             : if core 0 was paused and waiting to receive a message from core 1, but core 1
     125             : was also paused and waiting to receive a message from core 0, this would be
     126             : considered a deadlock. We detect deadlocks during the `Exit` phase. All
     127             : executables have the option to specify a function with the following signature
     128             : 
     129             : \snippet Test_DetectHangArray.cpp deadlock_analysis_function
     130             : 
     131             : If this function is specified in the metavariables and a deadlock occurs, this
     132             : function and all the simple actions in it will run. The information printed
     133             : during this function call is executable dependent, but it should print enough
     134             : information for you to determine why the deadlock occurred. If this function
     135             : isn't specified and a deadlock occurs, a message about how to add this function
     136             : to your metavariables is printed, but nothing else. After this, the executable
     137             : aborts.
     138             : 
     139             : # The Algorithm {#dev_guide_parallelization_core_algorithm}
     140             : 
     141             : Since most numerical algorithms repeat steps until some criterion such as the
     142             : final time or convergence is met, SpECTRE's parallel components are designed to
     143             : do such iterations for the user. An Algorithm executes an ordered list of
     144             : actions until one of the actions cannot be evaluated, typically because it is
     145             : waiting on data from elsewhere. When an algorithm can no longer evaluate actions
     146             : it passively waits by handing control back to Charm++. Once an algorithm
     147             : receives data, typically done by having another parallel component call the
     148             : `receive_data` function, the algorithm will try again to execute the next
     149             : action. If the algorithm is still waiting on more data then the algorithm will
     150             : again return control to Charm++ and passively wait for more data. This is
     151             : repeated until all required data is available. The actions that are iterated
     152             : over by the algorithm are called iterable actions and are described below.
     153             : Since the action list is phase dependent we refer to them generally as
     154             : phase-dependent action lists (PDALs, pronounced "pedals").
     155             : 
     156             : # Parallel Components {#dev_guide_parallelization_parallel_components}
     157             : 
     158             : Building off the introduction, a %Parallel Component is essentially a wrapper
     159             : around Charm++ chares that makes it easy for a user to add parallel objects into
     160             : their program. Charm++ chares can be confusing to work with which is why we wrap
     161             : them. Each parallel component runs its own \ref
     162             : dev_guide_parallelization_core_algorithm "Algorithm". Data can be sent from one
     163             : parallel component to another and the receiving parallel components' Algorithm
     164             : will be able to take that data and continue the program.
     165             : 
     166             : ## 1. Types of Parallel Components {#dev_guide_parallelization_component_types}
     167             : 
     168             : There are four types of %Parallel Components in SpECTRE:
     169             : 
     170             : 1. `Parallel::Algorithms::Singleton`s have one object in the entire execution of
     171             :    the program. They are implemented as single element Charm++ chare arrays.
     172             :    Charm++ does offer a distributed object called a singleton, however, we
     173             :    explicitly don't use this for various reasons (see
     174             :    Parallel::Algorithms::Singleton). Henceforth and throughout SpECTRE, a
     175             :    `singleton` will refer to Parallel::Algorithms::Singleton and *not* a Charm++
     176             :    singleton.
     177             : 2. `Parallel::Algorithms::Array`s hold zero or more elements, each of which is
     178             :    an object distributed to some core. An array can grow and shrink in size
     179             :    dynamically if need be and can also be bound to another array. A bound array
     180             :    has the same number of elements as the array it is bound to, and elements
     181             :    with the same ID are on the same core. See Charm++'s chare arrays for
     182             :    details.
     183             : 3. `Parallel::Algorithms::Group`s are arrays with one element per core which are
     184             :    not able to be moved around between cores. These are typically useful for
     185             :    gathering data from elements of a Parallel::Algorithms::Array on their core,
     186             :    and then processing or reducing the data further. See
     187             :    [Charm++'s](http://charm.cs.illinois.edu/help) group chares for details.
     188             : 4. `Parallel::Algorithms::Nodegroup`s are similar to groups except that there is
     189             :    one element per node. See [parallel component
     190             :    placement](#dev_guide_parallelization_component_placement) for the definition
     191             :    of a cores and nodes. We ensure that all entry method calls done through the
     192             :    Algorithm's `simple_action` and `receive_data` functions are threadsafe.
     193             :    User-controlled threading is possible by calling the entry method member
     194             :    function `threaded_action`, which is like `simple_action` except it passes a
     195             :    node lock to the `Action`'s apply function. Note that unlike
     196             :    `simple_action`s, multiple `threaded_action`s can be executing simultaneously
     197             :    on the same chare, but on different cores of the node.
     198             : 
     199             : \note Technically there is a fifth type of parallel component called a MainChare
     200             : which is placed and executed on the global zeroth core. However, there can only
     201             : be one of these in a executable, it is entirely different from a Singleton, and
     202             : it is not specifiable by the user so we omit it from the count here.
     203             : 
     204             : ## 2. Requirements {#dev_guide_parallelization_component_requirements}
     205             : 
     206             : Each %Parallel Component struct must have the following type aliases:
     207             : 1. `using chare_type` is set to one of the four \ref
     208             :    dev_guide_parallelization_component_types "types of Parallel Components".
     209             : 2. `using metavariables` is set to the Metavariables struct that stores the
     210             :    global metavariables. It is often easiest to have the %Parallel
     211             :    Component struct have a template parameter `Metavariables` that is the
     212             :    global metavariables struct. Examples of this technique are given below.
     213             : 3. `using phase_dependent_action_list` is set to a `tmpl::list` of
     214             :    `Parallel::PhaseActions<Phase, tmpl::list<Actions...>>`
     215             :    where each `PhaseAction` represents a PDAL that will be executed on
     216             :    the parallel component during the specified phase. The %Actions are
     217             :    executed in the order that they are given in the `tmpl::list`s of
     218             :    the PDALs, but the phases need not be run in linear order. However,
     219             :    `db::DataBox` types are constructed assuming the phases are
     220             :    performed from first in the `phase_dependent_action_list` to the
     221             :    last. Simple actions (described below) can be executed in any
     222             :    phase. If there are no iterable actions in a phase then a
     223             :    `PhaseAction` need not be specified for that phase. However, at
     224             :    least one `PhaseAction`, even if it is empty, must be specified.
     225             : 4. `using simple_tags_from_options` which is a `tmpl::list` of all the tags
     226             :    that will be inserted into the initial `db::DataBox` of each component.
     227             :    These tags are db::SimpleTag%s that have have a `using option_tags`
     228             :    type alias and a static function `create_from_options` (see the
     229             :    example below).  This list can usually be constructed from the
     230             :    initialization actions of the component (i.e. the list of actions
     231             :    in the `PhaseAction` list for the `Initialization` phase) using the
     232             :    helper function `Parallel::get_simple_tags_from_options` (see the
     233             :    examples of components below).  Each initialization action may
     234             :    specify a type alias `using simple_tags_from_options` which are a
     235             :    `tmpl::list` of tags that will be fetched from the db::DataBox by the
     236             :    action.
     237             : 5. `using const_global_cache_tags` is set to a `tmpl::list` of tags
     238             :    that are required by the `allocate_array` function of an array
     239             :    component, or simple actions called on the parallel component.
     240             :    These tags correspond to const items that are stored in the
     241             :    Parallel::GlobalCache (of which there is one copy per Charm++
     242             :    node).  The alias can be omitted if the list is empty.
     243             : 6. `using mutable_global_cache_tags` is set to a `tmpl::list` of tags
     244             :    that correspond to mutable items that are stored in the
     245             :    Parallel::GlobalCache (of which there is one copy per Charm++
     246             :    core).  The alias can be omitted if the list is empty.
     247             : 7. `array_allocation_tags` is set to a `tmpl::list` of tags that will be
     248             :    constructed from options and will only be used in the `allocate_array`
     249             :    function of an array component. This type alias is only required for array
     250             :    components.
     251             : 
     252             : \parblock
     253             : \note Array parallel components must also specify the type alias `using
     254             : array_index`, which is set to the type that indexes the %Parallel Component
     255             : Array. Charm++ allows arrays to be 1 through 6 dimensional or be indexed by a
     256             : custom type. The Charm++ provided indexes are wrapped as
     257             : `Parallel::ArrayIndex1D` through `Parallel::ArrayIndex6D`. When writing custom
     258             : array indices, the [Charm++ manual](http://charm.cs.illinois.edu/help) tells you
     259             : to write your own `CkArrayIndex`, but we have written a general implementation
     260             : that provides this functionality (see `Parallel::ArrayIndex`); all that you need
     261             : to provide is a plain-old-data
     262             : ([POD](http://en.cppreference.com/w/cpp/concept/PODType)) struct of the size of
     263             : at most 3 integers.
     264             : \endparblock
     265             : 
     266             : \parblock
     267             : \note Singletons use an `array_index` of type `int`, but users need not specify
     268             : this. It is already specified in the implementation of a singleton.
     269             : \endparblock
     270             : 
     271             : %Parallel array components have a static `allocate_array` function
     272             : that is used to construct the elements of the array. The
     273             : signature of the `allocate_array` functions must be:
     274             : \code
     275             : static void allocate_array(
     276             :     Parallel::CProxy_GlobalCache<metavariables>& global_cache,
     277             :     const tuples::tagged_tuple_from_typelist<simple_tags_from_options>&
     278             :     initialization_items,
     279             :     const tuples::tagged_tuple_from_typelist<array_allocation_tags>&
     280             :     array_allocation_items,
     281             :     const std::unordered_set<size_t>& procs_to_ignore);
     282             : \endcode
     283             : The `allocate_array` function is called by the Main parallel component
     284             : when the execution starts and will typically insert elements into
     285             : array parallel components. If the `allocate_array` function depends
     286             : upon input options that are not in the GlobalCache, those tags should be
     287             : added to the `array_allocation_tags` type alias. A TaggedTuple is constructed
     288             : from this type alias and its input options and is only available in the
     289             : `allocate_array` function. All other tags that will be constructed from options
     290             : and used during the %Initialization phase should be placed in the
     291             : `simple_tags_from_options` type alias.
     292             : This type alias is a `tmpl::list` of tags which
     293             : are db::SimpleTag%s that have have a `using option_tags` type alias
     294             : and a static function `create_from_options`. They only need to be explicitly
     295             : added to the list if no initialization action has added them to its
     296             : `simple_tags_from_options` type alias.  If you want to ignore specific
     297             : processors when placing array elements, you can pass in a
     298             : `std::unordered_set<size_t>` to `allocate_array` that contains all the
     299             : processors that shouldn't have array elements on them.
     300             : 
     301             : The `allocate_array` functions of different
     302             : array components are called in random order and so it is not safe to
     303             : have them depend on each other.
     304             : 
     305             : Each parallel component must also decide what to do in the different phases of
     306             : the execution. This is controlled by an `execute_next_phase` function with
     307             : signature:
     308             : \code
     309             : static void execute_next_phase(
     310             :     const Parallel::Phase next_phase,
     311             :     const Parallel::CProxy_GlobalCache<metavariables>& global_cache);
     312             : \endcode
     313             : Parallel::Main<Metavariables>::execute_next_phase`
     314             : determines the next phase, after
     315             : which the `execute_next_phase` function of each component gets called. The
     316             : `execute_next_phase` function determines what the parallel component should do
     317             : during the next phase. Typically the `execute_next_phase` function should just
     318             : call `start_phase(phase)` on the parallel component.
     319             : 
     320             : ## 3. Examples {#dev_guide_parallelization_component_examples}
     321             : 
     322             : An example of a singleton parallel component is:
     323             : \snippet Test_AlgorithmParallel.cpp singleton_parallel_component
     324             : 
     325             : An example of an array parallel component is:
     326             : \snippet Test_AlgorithmParallel.cpp array_parallel_component
     327             : 
     328             : There are some parallel components that are common to many executables.
     329             : 
     330             : - Parallel::GlobalCache (a Parallel::Algorithms::Nodegroup)
     331             : - DgElementArray (a Parallel::Algorithms::Array)
     332             : - observers::Observer (a Parallel::Algorithms::Group)
     333             : - observers::ObserverWriter (a Parallel::Algorithms::Nodegroup)
     334             : 
     335             : The MutableGlobalCache deserves special mention, which is why is has its own
     336             : section with instructions on how to use it. See [Mutable items in the
     337             : GlobalCache](#dev_guide_parallelization_mutable_global_cache).
     338             : 
     339             : ## 4. Placement {#dev_guide_parallelization_component_placement}
     340             : 
     341             : The user has some control over where parallel components get placed on the
     342             : resources it is running on. Here is a figure that illustrates how one may place
     343             : parallel components.
     344             : 
     345             : \image html charm_node_structure.png "Parallel component placement."
     346             : 
     347             : In this example we are running on three (3) nodes that have four (4) cores each.
     348             : For all our executables, we reserve one core of each node purely for
     349             : communication purposes. Nothing else is run on this core. Because of this, what
     350             : Charm++ calls a node, doesn't correspond to a full node on a supercomputer. A
     351             : charm-node simply corresponds to a collection of cores on a physical node. In
     352             : our case, a charm-node is represented by the remaining cores on a node not used
     353             : for communication (i.e. the first charm-node corresponds to cores 1-3 on the
     354             : first physical node). Also the definition of a charm-core doesn't necessarily
     355             : have to correspond to an actual core (it could correspond to a hyperthreaded
     356             : virtual core), however, for our purposes, it does.
     357             : 
     358             : SpECTRE offers wrappers around Charm++ functions that will tell you the total
     359             : number of charm-nodes/cores in an executable and what charm-node/core a parallel
     360             : component is on. (In the following examples, the type `T` is an `int` or a
     361             : `size_t`)
     362             : 
     363             : - `Parallel::my_node<T>()` returns the charm-node that the parallel component is
     364             :   on. In the figure, `Sing. 4` would return `2`.
     365             : - `Parallel::my_proc<T>()` returns the charm-core that the parallel component is
     366             :   on. In the figure, `Sing. 4` would return `6` (*not* `9`).
     367             : - `Parallel::number_of_nodes<T>()` returns the total number of charm-nodes in an
     368             :   executable. The above figure would have `3` charm-nodes.
     369             : - `Parallel::number_of_procs<T>()` returns the total number of charm-cores in an
     370             :   executable. The above figure would have `9` charm-cores (*not* `12`).
     371             : 
     372             : \note For Charm++ SMP (shared memory parallelism) builds, a node corresponds to
     373             : a collection of cores on a physical node, and a core corresponds to a processor
     374             : on that physical node. However, for non-SMP builds, nodes and cores are
     375             : equivalent. All of our builds are done with Charm++ SMP so nodes and cores have
     376             : their usual definitions.
     377             : 
     378             : The placement of Groups and Nodegroups are determined by Charm++. This is
     379             : because a Group is on every charm-core and a Nodegroup is on every charm-node.
     380             : Even though Nodegroups are one per charm-node, the user can't choose which core
     381             : is used on the charm-node. They run on the next available charm-core on the
     382             : charm-node.
     383             : 
     384             : The Elements of an Array, however, can be placed on specific charm-cores. They
     385             : are inserted into the Array by using the Charm++ `insert` member function of the
     386             : CProxy for the Array. The `insert` function is documented in the Charm++ manual.
     387             : In the Array example in the
     388             : [Examples](#dev_guide_parallelization_component_examples) section, `array_proxy`
     389             : is a `CProxy` and so all the documentation for Charm++ array proxies applies.
     390             : SpECTRE always creates empty arrays with the constructor and requires users to
     391             : insert however many elements they want and on which charm-cores they want them
     392             : to be placed. Note that load balancing calls may result in array elements being
     393             : moved.
     394             : 
     395             : In a similar fashion, Singletons can also be placed on specific charm-cores.
     396             : This can be specified in the input file.
     397             : 
     398             : From an input file, there are two ways to specify where Array/Singleton parallel
     399             : components can be placed.
     400             : 
     401             : ```yaml
     402             : ResourceInfo:
     403             :   AvoidGlobalProc0: true
     404             :   Singletons:
     405             :     AhA:
     406             :       Proc: 12
     407             :       Exclusive: true
     408             :     AhB:
     409             :       Proc: Auto
     410             :       Exclusive: false
     411             : ```
     412             : 
     413             : First is the `AvoidGlobalProc0` option. This option will tell the program to not
     414             : put *any* Array Elements or Singletons on the global zeroth charm-core. This
     415             : core is sometimes used to write data to disk which is typically much slower
     416             : than the program execution. The second is the `Singletons:` option. You can
     417             : set the value to `Auto`, and then each singleton will have their proc be chosen
     418             : automatically and they won't be exclusive.  Otherwise, you must specify options
     419             : for each singleton as in the example above.  `AhA` is the `pretty_type::name()`
     420             : of a Singleton in the program and the user has a choice of which proc to place
     421             : the Singleton on (`Auto` will let the program decide) and whether to exclude
     422             : Array Elements or other Singletons from being put on this core. This is useful
     423             : in case the Singleton does some expensive computation that shouldn't be slowed
     424             : down by having lots of Array Elements on the same core. In the figure above,
     425             : `AvoidGlobalProc0` is true, and `Sing. 2` requested to be exclusively on core
     426             : `2`.
     427             : 
     428             : # Actions {#dev_guide_parallelization_actions}
     429             : 
     430             : %Actions are structs with a static `apply` method and come in five
     431             : variants: simple actions, iterable actions, reduction actions,
     432             : threaded actions, and local synchronous actions.
     433             : 
     434             : The signature of `apply` methods differs for the different types of
     435             : actions, but all types have the same general form.  Actions receive a
     436             : `db::DataBox`, the Parallel::GlobalCache, and their element's index
     437             : and parallel component, as well as arguments specific to the action
     438             : type.
     439             : 
     440             : The `db::DataBox` should be thought of as the member data of the parallel
     441             : component while the actions are the member functions. The combination of a
     442             : `db::DataBox` and actions allows building up classes with arbitrary member data
     443             : and methods using template parameters and invocation of actions. This approach
     444             : allows us to eliminate the need for users to work with Charm++'s interface
     445             : files, which can be error prone and difficult to use.
     446             : 
     447             : The Parallel::GlobalCache is passed to each action so that the
     448             : action has access to global data and is able to invoke actions on
     449             : other parallel components. The `ParallelComponent` template parameter
     450             : is the tag of the parallel component that invoked the action. A proxy
     451             : to the calling parallel component can then be retrieved from the
     452             : Parallel::GlobalCache. The remote entry method invocations are
     453             : slightly different for different types of actions, so they will be
     454             : discussed below. However, one thing that is disallowed for all actions
     455             : is calling an action locally from within an action on the same
     456             : parallel component.  Specifically,
     457             : 
     458             : \snippet Test_AlgorithmNestedApply1.cpp bad_recursive_call
     459             : 
     460             : Here `Parallel::local()` is a wrapper around `ckLocal()` which is a Charm++
     461             : provided method that returns a pointer to the local (currently executing)
     462             : parallel component. See the
     463             : [Charm++ manual](http://charm.cs.illinois.edu/help) for more
     464             : information.  However, you are able to queue a new action to be
     465             : executed later on the same parallel component by getting your own
     466             : parallel component from the Parallel::GlobalCache
     467             : (`Parallel::get_parallel_component<ParallelComponent>(cache)`).  The
     468             : difference between the two calls is that by calling an action through
     469             : the parallel component you will first finish the series of actions you
     470             : are in, then when they are complete Charm++ will call the next queued
     471             : action.
     472             : 
     473             : Array, group, and nodegroup parallel components can have actions invoked in two
     474             : ways. First is a broadcast where the action is called on all elements of the
     475             : array:
     476             : 
     477             : \snippet Test_AlgorithmParallel.cpp broadcast_to_group
     478             : 
     479             : The second case is invoking an action on a specific array element by using the
     480             : array element's index. The below example shows how a broadcast would be done
     481             : manually by looping over all elements in the array:
     482             : 
     483             : \snippet Test_AlgorithmParallel.cpp call_on_indexed_array
     484             : 
     485             : Note that in general you will not know what all the elements in the array are
     486             : and so a broadcast is the correct method of sending data to or invoking an
     487             : action on all elements of an array parallel component.
     488             : 
     489             : The `array_index` argument passed to all `apply` methods is the index into the
     490             : parallel component array. If the parallel component is not an array the value
     491             : and type of `array_index` is implementation defined and cannot be relied on.
     492             : 
     493             : ## 1. Simple Actions {#dev_guide_parallelization_simple_actions}
     494             : 
     495             : Simple actions can be thought of as member functions of remote objects
     496             : (chares/parallel components).  They are the direct analog of entry
     497             : methods in Charm++ except that the member data is stored in the
     498             : `db::DataBox` that is passed in as the first argument. A simple action
     499             : must return void but can use `db::mutate` to change values of items in
     500             : the `db::DataBox` if the `db::DataBox` is taken as a non-const
     501             : reference. In some cases you will need specific items to be in the
     502             : `db::DataBox` otherwise the action won't compile. To restrict which
     503             : `db::DataBox`es can be passed you should use `Requires` in the
     504             : action's `apply` function template parameter list. For example,
     505             : \snippet Test_AlgorithmCore.cpp requires_action
     506             : checks that `CountActionsCalled` is available in the box.
     507             : 
     508             : Simple actions can be called using a `CProxy` (see the [Charm++
     509             : manual](http://charm.cs.illinois.edu/help)), which is retrieved from
     510             : the Parallel::GlobalCache using the parallel component struct and the
     511             : `Parallel::get_parallel_component()` function.  For example, the
     512             : action above could be called as
     513             : \snippet Test_AlgorithmCore.cpp simple_action_call
     514             : Any arguments after the proxy are passed as additional arguments to
     515             : the action's `apply` function.
     516             : 
     517             : ## 2. Iterable Actions {#dev_guide_parallelization_iterable_actions}
     518             : 
     519             : Iterable actions make up the algorithms described by the PDALs.  These
     520             : actions are executed one after the other until one of them cannot be
     521             : evaluated.  Their `apply` methods signature is
     522             : \snippet Test_AlgorithmCore.cpp apply_iterative
     523             : The `ActionList` type is the `tmpl::list` of iterable actions in the
     524             : current phase.  That is, it is equal to the `action_list` type alias
     525             : in the current PDAL.  The `inboxes` is a collection of the tags
     526             : specified as `tmpl::list`s in the iterable actions' member type
     527             : aliases `inbox_tags`.  This collection represents data received from
     528             : other chares using the `receive_data` function.
     529             : 
     530             : Iterable actions can request that the algorithm be paused or halted
     531             : for the current phase, and control which action in the current PDAL
     532             : will be executed next. This is all done via the return value from the
     533             : `apply` function.  The `apply` function for iterable actions must
     534             : return a Parallel::iterable_action_return_t which is a
     535             : `std::tuple<Parallel::AlgorithmExecution, std::optional<size_t>>`. The
     536             : first element of the tuple controls how algorithm execution continues.
     537             : See the documentation of `Parallel::AlgorithmExecution` for the
     538             : meanings of different values of that enum.  The second element of the
     539             : tuple is usually set to `std::nullopt` in order to continue iterating
     540             : through the algorithm, but can be used to jump to a different action
     541             : in the current PDAL. Most iterable actions will simply return
     542             : 
     543             : \snippet Test_AlgorithmParallel.cpp iterable_action_return_continue_next_action
     544             : 
     545             : An action that pauses the algorithm will return
     546             : 
     547             : \snippet Test_AlgorithmParallel.cpp return_with_termination
     548             : 
     549             : After an algorithm has been paused it can be restarted by passing
     550             : `false` to the `set_terminate` method or by calling `receive_data(...,
     551             : true)`. Since the order in which messages are received is undefined in
     552             : most cases the `receive_data(..., true)` call should be used to
     553             : restart the algorithm.
     554             : 
     555             : The return value `Parallel::AlgorithmExecution::Retry` deserves
     556             : special mention.  It is intended for use by actions that use data
     557             : supplied by other parallel objects to indicate that they have not
     558             : received all of the data they require.  The algorithm will stop until
     559             : an operation that could supply data has occurred and then the action
     560             : will be retried.  An example of waiting because of missing data is
     561             : 
     562             : \snippet Test_AlgorithmCore.cpp retry_example
     563             : 
     564             : In order to jump to a specific action, the metafunction
     565             : `tmpl::index_of<list, element>` can be used to get an
     566             : `tmpl::integral_constant` with the value of the index of the element
     567             : `element` in the typelist `list`. For example,
     568             : 
     569             : \snippet Test_AlgorithmCore.cpp out_of_order_action
     570             : 
     571             : The metafunction call `tmpl::index_of<ActionList,
     572             : iterate_increment_int0>::%value` returns a `size_t` whose value is
     573             : that of the action `iterate_increment_int0` in the PDAL.  The indexing
     574             : of actions in the PDAL starts at `0`.
     575             : 
     576             : Iterable actions are invoked as part of the algorithm and so the only way
     577             : to request they be invoked is by having the algorithm run on the parallel
     578             : component. The algorithm can be explicitly evaluated in a new phase by calling
     579             : `start_phase(Phase::TheCurrentPhase)`:
     580             : 
     581             : \snippet Test_AlgorithmCore.cpp start_phase
     582             : 
     583             : Alternatively, to evaluate the algorithm without changing phases the
     584             : `perform_algorithm()` method can be used.
     585             : 
     586             : By passing `true` to `perform_algorithm` the algorithm will be restarted if it
     587             : was paused.
     588             : 
     589             : The algorithm is also evaluated by calling the `receive_data` function, either
     590             : on an entire array or singleton (this does a broadcast), or an on individual
     591             : element of the array. Here is an example of a broadcast call:
     592             : 
     593             : \snippet Test_AlgorithmParallel.cpp broadcast_to_group
     594             : 
     595             : and of calling individual elements:
     596             : 
     597             : \snippet Test_AlgorithmParallel.cpp call_on_indexed_array
     598             : 
     599             : The `receive_data` function always takes a `ReceiveTag`, which is set
     600             : in the actions' `inbox_tags` type aliases.  The `inbox_tags` must have
     601             : two member type aliases, a `temporal_id` which is used to identify
     602             : when the data was sent, and a `type` which is the type of the data to
     603             : be stored in the `inboxes`.  The types are typically a
     604             : `std::unordered_map<temporal_id, DATA>`.  In the discussed scenario of
     605             : waiting for neighboring elements to send their data the `DATA` type
     606             : would be a `std::unordered_map<TheElementId, DataSent>`.  Inbox tags
     607             : must also specify a `static void insert_into_inbox()` function.  For
     608             : example,
     609             : 
     610             : \snippet Test_AlgorithmParallel.cpp int_receive_tag
     611             : 
     612             : For common types of `DATA`, such as a `map`, a data structure with an `insert`
     613             : function, a data structure with a `push_back` function, or copy/move assignment
     614             : that is used to insert the received data, inserters are available in
     615             : `Parallel::InboxInserters`. For example, there is
     616             : `Parallel::InboxInserters::Map` for `map` data structures. The inbox tag can
     617             : inherit publicly off the inserters to gain the required insertion capabilities:
     618             : 
     619             : \snippet Test_AlgorithmCore.cpp int receive tag insert
     620             : 
     621             : Any inbox tag that uses Charm++ messages must also specify a `message_type` type
     622             : alias which is the object that will be sent. An example is:
     623             : 
     624             : \snippet Test_AlgorithmMessages.cpp charm message inbox tag
     625             : 
     626             : The `inbox_tags` type alias for the action is:
     627             : 
     628             : \snippet Test_AlgorithmParallel.cpp int_receive_tag_list
     629             : 
     630             : An inbox tag can also optionally specify a static function called `output_inbox`
     631             : that returns a `std::string`. This function can be used for printing the
     632             : contents of the inbox in a nice way as the types can sometimes get complicated.
     633             : You can also use the `Parallel::output_inbox` function to output a specific
     634             : inbox from all the inboxes. See an above example for the signature of the
     635             : `output_inbox` function.
     636             : 
     637             : \warning
     638             : It is the responsibility of the iterable action to remove data from the inboxes
     639             : that will no longer be needed.
     640             : 
     641             : Normally when remote functions are invoked they go through the Charm++ runtime
     642             : system, which adds some overhead. The `receive_data` function tries to elide
     643             : the call to the Charm++ RTS for calls into array components. Charm++ refers to
     644             : these types of remote calls as "inline entry methods". With the Charm++ method
     645             : of eliding the RTS, the code becomes susceptible to stack overflows because
     646             : of infinite recursion. The `receive_data` function is limited to at most 64 RTS
     647             : elided calls, though in practice reaching this limit is rare. When the limit is
     648             : reached the remote method invocation is done through the RTS instead of being
     649             : elided.
     650             : 
     651             : ## 3. Reduction Actions {#dev_guide_parallelization_reduction_actions}
     652             : 
     653             : Reduction actions are the targets of reducing data over an array. For
     654             : example, you may want to know the sum of a `int` from every element in
     655             : the array. You can do this as follows:
     656             : 
     657             : \snippet Test_AlgorithmReduction.cpp contribute_to_reduction_example
     658             : 
     659             : This reduces over the parallel component
     660             : `ArrayParallelComponent<Metavariables>`, reduces to the parallel component
     661             : `SingletonParallelComponent<Metavariables>`, and calls the action
     662             : `ProcessReducedSumOfInts` after the reduction has been performed. The reduction
     663             : action is:
     664             : 
     665             : \snippet Test_AlgorithmReduction.cpp reduce_sum_int_action
     666             : 
     667             : As you can see, the last argument to the `apply` function is of type `int`, and
     668             : is the reduced value.
     669             : 
     670             : You can also broadcast the result back to an array, even yourself. For example,
     671             : 
     672             : \snippet Test_AlgorithmReduction.cpp contribute_to_broadcast_reduction
     673             : 
     674             : It is often necessary to reduce custom data types, such as `std::vector` or
     675             : `std::unordered_map`. Charm++ supports such custom reductions, and so does our
     676             : layer on top of Charm++.
     677             : Custom reductions require one additional step to calling
     678             : `contribute_to_reduction`, which is writing a reduction function to reduce the
     679             : custom data. We provide a generic type that can be used in custom reductions,
     680             : `Parallel::ReductionData`, which takes a series of `Parallel::ReductionDatum` as
     681             : template parameters and `ReductionDatum::value_type`s as the arguments to the
     682             : constructor. Each `ReductionDatum` takes up to four template parameters (two
     683             : are required). The first is the type of data to reduce, and the second is a
     684             : binary invokable that is called at each step of the reduction to combine two
     685             : messages. The last two template parameters are used after the reduction has
     686             : completed. The third parameter is an n-ary invokable that is called once the
     687             : reduction is complete, whose first argument is the result of the reduction. The
     688             : additional arguments can be any `ReductionDatum::value_type` in the
     689             : `ReductionData` that are before the current one. The fourth template parameter
     690             : of `ReductionDatum` is used to specify which data should be passed. It is a
     691             : `std::index_sequence` indexing into the `ReductionData`.
     692             : 
     693             : \warning
     694             : All elements of the array must call the same reductions in the same order. It is
     695             : *defined* behavior to do multiple reductions at once as long as all contribute
     696             : calls on all array elements occurred in the same order. It is **undefined**
     697             : behavior if the contribute calls are made in different orders on different array
     698             : elements.
     699             : 
     700             : ## 4. Threaded Actions {#dev_guide_parallelization_threaded_actions}
     701             : 
     702             : Threaded actions are similar to simple actions, with the difference
     703             : being that multiple threaded actions may be running on the same chare
     704             : at the same time (potentially in parallel with one simple or reduction
     705             : action).  The `apply` function for a threaded actions has the same
     706             : signature as that for a simple action, except that it also receives a
     707             : `NodeLock` intended to control access to the chare's `db::DataBox`.
     708             : All access to the `db::DataBox`, including read-only access, must
     709             : occur while the action owns this lock.  (Simple and reduction actions
     710             : implicitly hold the lock for their entire execution.)
     711             : 
     712             : \snippet Test_AlgorithmNodelock.cpp threaded_action_example
     713             : 
     714             : Threaded actions can only be run on nodegroup chares.
     715             : 
     716             : ## 5. Local Synchronous Actions {#dev_guide_parallelization_local_synchronous_actions}
     717             : 
     718             : There is limited ability to retrieve data held by another parallel component via
     719             : a direct synchronous call. Unlike the above actions, the invocation of a
     720             : synchronous action is precisely a call to a member function of another parallel
     721             : component; therefore, these invocations will run to completion, and return their
     722             : result before the calling code proceeds in execution.
     723             : 
     724             : Aside from being synchronous and being able to return data, local
     725             : synchronous actions behave the same as threaded actions, except that
     726             : they will only run on the chare of a nodegroup that is on the local
     727             : node.
     728             : 
     729             : Local synchronous actions' `apply` functions follow a signature motivated by
     730             : threaded actions, but take fewer arguments.  This may be a bug.
     731             : 
     732             : Local synchronous actions must specify their return type in a
     733             : `return_type` type alias. This is to help simplify the logic with the
     734             : variant `db::DataBox` held by the parallel component.
     735             : 
     736             : An example of a definition of a local synchronous action:
     737             : 
     738             : \snippet Test_AlgorithmLocalSyncAction.cpp synchronous_action_example
     739             : 
     740             : And the corresponding invocation:
     741             : 
     742             : \snippet Test_AlgorithmLocalSyncAction.cpp synchronous_action_invocation_example
     743             : 
     744             : \warning Say an action is being run on a component on a node where a mutable
     745             : item in the GlobalCache is up-to-date. This action then calls another action on
     746             : a different component on a different node. It is **NOT** guaranteed that this
     747             : mutable item in the GlobalCache is up-to-date on this new node. It is up to the
     748             : user to ensure the mutable item is up-to-date on whatever node they run an
     749             : action on, even if it was up-to-date on the node that sent the message. The
     750             : \ref dev_guide_parallelization_mutable_global_cache
     751             : "Mutable items in the GlobalCache" section gives details about mutable
     752             : GlobalCache items.
     753             : 
     754             : # Mutable items in the GlobalCache {#dev_guide_parallelization_mutable_global_cache}
     755             : 
     756             : Most items in the GlobalCache are constant, and are specified
     757             : by type aliases called `const_global_cache_tags` as
     758             : described above. However, the GlobalCache can also store mutable
     759             : items. Because of asynchronous execution, **EXTREME** care must be taken when
     760             : mutating items in the GlobalCache, as described below.
     761             : 
     762             : A mutable item can be of any type, as long as that type is something
     763             : that can be checked for whether it is "up-to-date".  Here "up-to-date"
     764             : means that the item can be safely used (even read-only) without
     765             : needing to be mutated first. For example, a mutable item might be a
     766             : function of time that knows the range of times for which it is valid;
     767             : the mutable item is then deemed up-to-date if it will be called for a
     768             : time within its range of validity, and it is deemed not up-to-date if
     769             : it will be called for a time outside its range of validity.  Thus the
     770             : up-to-date status of a mutable item is determined by both the state of
     771             : the item itself and by the code that wishes to use that item.
     772             : 
     773             : ## 1. Specification of mutable GlobalCache items
     774             : 
     775             : Mutable GlobalCache items are specified by a
     776             : type alias `mutable_global_cache_tags`, which is treated the same way
     777             : as `const_global_cache_tags` for const items.
     778             : 
     779             : ## 2. Use of mutable GlobalCache items
     780             : 
     781             : ### 1. Checking if the item is up-to-date
     782             : 
     783             : Because execution is asynchronous, any code that uses a mutable item
     784             : in the GlobalCache must first check whether that item is up-to-date.
     785             : The information about whether an item is up-to-date is assumed to be
     786             : stored in the item itself.  For example, a mutable object stored in
     787             : the GlobalCache might have type `std::map<temporal_id,T>` (for some
     788             : type `T`), and then any code that uses the stored object can check
     789             : whether an entry exists for a particular `temporal_id`.  To avoid
     790             : race conditions, it is
     791             : important that up-to-date checks are based on something that is
     792             : independent of the order of mutation (like a `temporal_id`, and not
     793             : like checking the size of a vector).
     794             : 
     795             : To check an item, use the function
     796             : `Parallel::mutable_cache_item_is_ready`, which returns a bool
     797             : indicating whether the item is up-to-date.  If the item is up-to-date,
     798             : then it can be used.  `Parallel::mutable_cache_item_is_ready` takes a
     799             : lambda as an argument.  This lambda is passed a single argument: a
     800             : const reference to the item being retrieved.  The lambda should
     801             : determine whether the item is up-to-date. If so, it should return a
     802             : default_constructed `std::unique_ptr<Parallel::Callback>`; if not, it should
     803             : return a `std::unique_ptr<Parallel::Callback>` to a callback function that will
     804             : be called on the next `Parallel::mutate` of that item. The callback
     805             : will typically check again if the item is up-to-date and if so will
     806             : execute some code that gets the item via `Parallel::get`.
     807             : 
     808             : For the case of iterable actions,
     809             : `Parallel::mutable_cache_item_is_ready` typically uses the callback
     810             : `Parallel::PerformAlgorithmCallback`.  In the example below, the
     811             : vector is considered up-to-date if it is non-empty. If the vector is
     812             : not up-to-date, then when it becomes up-to-date the callback function
     813             : will be invoked; in this case the callback function re-runs
     814             : `perform_algorithm()`, which will call the same action again.
     815             : 
     816             : \snippet Test_AlgorithmGlobalCache.cpp check_mutable_cache_item_is_ready
     817             : 
     818             : Note that `Parallel::mutable_cache_item_is_ready` is called on the local
     819             : node and does no parallel communication.
     820             : 
     821             : ### 2. Retrieving the item
     822             : 
     823             : The item is retrieved using `Parallel::get` just like for constant items.
     824             : For example, to retrieve the item `Tags::VectorOfDoubles`:
     825             : \snippet Test_AlgorithmGlobalCache.cpp retrieve_mutable_cache_item
     826             : 
     827             : Note that `Parallel::get` is called on the local node and does no
     828             : parallel communication.
     829             : 
     830             : Whereas we support getting *non-mutable* items in the GlobalCache from
     831             : a DataBox via `db::get`, we intentionally do not support
     832             : `db::get` of *mutable* items in the GlobalCache from a DataBox.
     833             : The reason is that mutable
     834             : items should be retrieved only after a `Parallel::mutable_cache_item_is_ready`
     835             : check, and being able to retrieve a mutable item from a DataBox makes it
     836             : difficult to enforce that check, especially when automatically-executing
     837             : compute items are considered.
     838             : 
     839             : ## 3. Modifying a mutable GlobalCache item
     840             : 
     841             : To modify a mutable item, pass `Parallel::mutate` two template
     842             : parameters: the tag to mutate, and a struct with an `apply` function
     843             : that does the mutating. `Parallel::mutate` takes two arguments:
     844             : a reference to the local GlobalCache, and a tuple that is passed into the
     845             : mutator function.  For the following example,
     846             : 
     847             : \snippet Test_AlgorithmGlobalCache.cpp mutate_global_cache_item
     848             : 
     849             : the mutator function is defined as below:
     850             : \snippet Test_AlgorithmGlobalCache.cpp mutate_global_cache_item_mutator
     851             : 
     852             : `Parallel::mutate` broadcasts to every node, where it calls the
     853             : mutator function and then calls all the callbacks that have been set
     854             : on that node by `Parallel::mutable_cache_item_is_ready`.  The
     855             : `Parallel::mutate` operation is guaranteed to be thread-safe without
     856             : any further action by the developer so long as the item being mutated can be
     857             : mutated in a threadsafe way. See the `Parallel::GlobalCache` docs for more
     858             : details.
     859             : 
     860             : # Charm++ Node and Processor Level Initialization Functions {#dev_guide_parallelization_charm_node_processor_level_initialization}
     861             : 
     862             : Charm++ allows running functions once per core and once per node before the
     863             : construction of any parallel components. This is commonly used for setting up
     864             : error handling and enabling floating point exceptions. Other functions could
     865             : also be run. Which functions are run on each node and core is set by calling
     866             : `Parallel::charmxx::register_init_node_and_proc` in `CkRegisterMainModule()`
     867             : with function pointers to the functions to be called. For example:
     868             : 
     869             : \snippet Test_AlgorithmPhaseControl.cpp charm_init_funcs_example

Generated by: LCOV version 1.14