\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond
# Parallelization, Charm++, and Core Concepts {#dev_guide_parallelization_foundations}

\tableofcontents

# Introduction {#dev_guide_parallelization_introduction}

SpECTRE builds a layer on top of Charm++ that performs various safety checks and
initialization for the user that can otherwise lead to difficult-to-debug
undefined behavior. The central concept is what is called a %Parallel
Component. A %Parallel Component is a struct with several type aliases that
is used by SpECTRE to set up the Charm++ chares and allowed communication
patterns. %Parallel Components are input arguments to the compiler, which then
writes the parallelization infrastructure that you requested for the executable.
There is no restriction on the number of %Parallel Components, though
practically it is best to have around 10 at most.

Here is an overview of what is described in detail in the sections below:

- \ref dev_guide_parallelization_metavariables_class "Metavariables": Provides
  high-level configuration to the compiler, e.g. the physical system to be
  simulated.
- \ref dev_guide_parallelization_phases_of_execution "Phases": Defines distinct
  simulation phases separated by a global synchronization point,
  e.g. `Initialization`, `Evolve` and `Exit`.
- \ref dev_guide_parallelization_core_algorithm "Algorithm": In each phase,
  repeatedly iterates over a list of actions until the current phase ends.
- \ref dev_guide_parallelization_parallel_components "Parallel component":
  Maintains and executes its algorithm.
- \ref dev_guide_parallelization_actions "Action": Performs a computational
  task, e.g. evaluating the right hand side of the time evolution equations. May
  require data to be received from another action potentially being executed on
  a different core or node.

# The Metavariables Class {#dev_guide_parallelization_metavariables_class}

SpECTRE takes a different approach to input options passed to an executable than
is common. SpECTRE not only reads an input file at runtime but also has many
choices made at compile time. The compile time options are specified by what is
referred to as the metavariables. What exactly the metavariables struct
specifies depends on the executable, but all metavariables structs must
specify the following (a sketch combining these pieces is given at the end of
this section):
46 :
47 : - `help`: a `static constexpr Options::String` that will be printed as part of
48 : the help message. It should describe the executable and basic usage of it, as
49 : well as any non-standard options that must be specified in the metavariables
50 : and their current values. An example of a help string for one of the testing
51 : executables is:
52 : \snippet Test_AlgorithmCore.cpp help_string_example
53 : - `component_list`: a `tmpl::list` of the parallel components (described below)
54 : that are to be created. Most evolution executables will have the
55 : `DgElementArray` parallel component listed. An example of a `component_list`
56 : for one of the test executables is:
57 : \snippet Test_AlgorithmCore.cpp component_list_example
58 : - `using const_global_cache_tags`: a `tmpl::list` of tags that are
59 : used to place const items in the GlobalCache. The alias may be
60 : omitted if the list is empty.
61 : - `using mutable_global_cache_tags`: a `tmpl::list` of tags that are
62 : used to place mutable items in the GlobalCache. The alias may be
63 : omitted if the list is empty.
- `default_phase_order`: an array of Parallel::Phase that must contain at least
  `Initialization` as the first element and `Exit` as the last element. Phases
  are described in the next section.

There are also several optional members:

- `input_file`: a `static constexpr Options::String` that is the default name of
  the input file that is to be read. This can be overridden at runtime by
  passing the `--input-file` argument to the executable.
- `ignore_unrecognized_command_line_options`: a `static constexpr bool` that
  defaults to `false`. If set to `true` then unrecognized command line options
  are ignored. Ignoring unrecognized options is generally only necessary for
  tests where arguments for the testing framework,
  [Catch](https://github.com/catchorg/Catch2/), are passed to the executable.
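
Putting these pieces together, a metavariables struct might be sketched as
follows. This is only a sketch: `MyComponent` and the help text are
hypothetical placeholders, not classes that exist in SpECTRE.

\code
struct Metavariables {
  static constexpr Options::String help{
      "A toy executable that runs MyComponent through its phases."};

  // The parallel components the compiler should generate infrastructure for.
  using component_list = tmpl::list<MyComponent<Metavariables>>;

  // Both global cache tag lists are empty here, so these aliases could
  // equally well be omitted.
  using const_global_cache_tags = tmpl::list<>;
  using mutable_global_cache_tags = tmpl::list<>;

  // Must start with Initialization and end with Exit.
  static constexpr std::array<Parallel::Phase, 3> default_phase_order{
      {Parallel::Phase::Initialization, Parallel::Phase::Evolve,
       Parallel::Phase::Exit}};
};
\endcode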

# Phases of an Execution {#dev_guide_parallelization_phases_of_execution}

Global synchronization points, where all cores wait for each other, are
undesirable for scalability reasons. However, they are sometimes inevitable for
algorithmic reasons. That is, in order to actually get a correct solution you
need to have a global synchronization. SpECTRE executables can have multiple
phases, where after each phase a global synchronization occurs. By global
synchronization we mean that no parallel components are executing or have more
tasks to execute: everything is waiting to be told what tasks to perform next.

Every executable must have at least two phases, `Initialization` and
`Exit`. The next phase is decided by the member function
Parallel::Main<Metavariables>::execute_next_phase.
Usually the next phase is determined
from the `default_phase_order` provided by the metavariables. If more
complex decision making is desired, various components can send data
to Parallel::Main via the PhaseControl infrastructure. This allows the next
phase to be determined dynamically. Here is an example of a
`default_phase_order` member variable:
\snippet LinearSolverAlgorithmTestHelpers.hpp default_phase_order_array

In contrast, an evolution executable might have phases
`Initialization`, `SetInitialData`, `Evolve`, and `Exit`.
The first phase that is entered is always
`Initialization`. During the `Initialization` phase the
`Parallel::GlobalCache` is created, all (node)group components are created,
and empty array and singleton components are created. Next, the function
`allocate_remaining_components_and_execute_initialization_phase` is called
which allocates singleton components and the elements of each array component,
and then starts the `Initialization` phase on all parallel components. Once all
parallel components' `Initialization` phase is complete, the next
phase is determined and the `execute_next_phase` function is called on
all the parallel components.

At the end of an execution the `Exit` phase has the executable wait to make sure
no \ref dev_guide_parallelization_parallel_components "parallel components" are
performing or need to perform any more tasks, and then exits. An example where
this approach is important is if we are done evolving a system but still need to
write data to disk. We do not want to exit the simulation until all data has
been written to disk, even though we've reached the final time of the evolution.

If we reach the `Exit` phase, but some \ref
dev_guide_parallelization_parallel_components "parallel components" have not
terminated properly, this means a deadlock has occurred. A deadlock usually
implies some error in the order messages have been sent/received. For example,
if core 0 was paused and waiting to receive a message from core 1, but core 1
was also paused and waiting to receive a message from core 0, this would be
considered a deadlock. We detect deadlocks during the `Exit` phase. All
executables have the option to specify a function with the following signature

\snippet Test_DetectHangArray.cpp deadlock_analysis_function

If this function is specified in the metavariables and a deadlock occurs, this
function and all the simple actions in it will run. The information printed
during this function call is executable dependent, but it should print enough
information for you to determine why the deadlock occurred. If this function
isn't specified and a deadlock occurs, a message about how to add this function
to your metavariables is printed, but nothing else. After this, the executable
aborts.

# The Algorithm {#dev_guide_parallelization_core_algorithm}

Since most numerical algorithms repeat steps until some criterion such as the
final time or convergence is met, SpECTRE's parallel components are designed to
do such iterations for the user. An Algorithm executes an ordered list of
actions until one of the actions cannot be evaluated, typically because it is
waiting on data from elsewhere. When an algorithm can no longer evaluate actions
it passively waits by handing control back to Charm++. Once an algorithm
receives data, typically done by having another parallel component call the
`receive_data` function, the algorithm will try again to execute the next
action. If the algorithm is still waiting on more data then the algorithm will
again return control to Charm++ and passively wait for more data. This is
repeated until all required data is available. The actions that are iterated
over by the algorithm are called iterable actions and are described below.
Since the action list is phase dependent we refer to them generally as
phase-dependent action lists (PDALs, pronounced "pedals").

# Parallel Components {#dev_guide_parallelization_parallel_components}

Building off the introduction, a %Parallel Component is essentially a wrapper
around Charm++ chares that makes it easy for a user to add parallel objects into
their program. Charm++ chares can be confusing to work with, which is why we
wrap them. Each parallel component runs its own \ref
dev_guide_parallelization_core_algorithm "Algorithm". Data can be sent from one
parallel component to another and the receiving parallel component's Algorithm
will be able to take that data and continue the program.

## 1. Types of Parallel Components {#dev_guide_parallelization_component_types}

There are four types of %Parallel Components in SpECTRE:

1. `Parallel::Algorithms::Singleton`s have one object in the entire execution of
   the program. They are implemented as single element Charm++ chare arrays.
   Charm++ does offer a distributed object called a singleton, however, we
   explicitly don't use this for various reasons (see
   Parallel::Algorithms::Singleton). Henceforth and throughout SpECTRE, a
   `singleton` will refer to Parallel::Algorithms::Singleton and *not* a Charm++
   singleton.
2. `Parallel::Algorithms::Array`s hold zero or more elements, each of which is
   an object distributed to some core. An array can grow and shrink in size
   dynamically if need be and can also be bound to another array. A bound array
   has the same number of elements as the array it is bound to, and elements
   with the same ID are on the same core. See Charm++'s chare arrays for
   details.
3. `Parallel::Algorithms::Group`s are arrays with one element per core which are
   not able to be moved around between cores. These are typically useful for
   gathering data from elements of a Parallel::Algorithms::Array on their core,
   and then processing or reducing the data further. See
   [Charm++'s](http://charm.cs.illinois.edu/help) group chares for details.
4. `Parallel::Algorithms::Nodegroup`s are similar to groups except that there is
   one element per node. See [parallel component
   placement](#dev_guide_parallelization_component_placement) for the
   definitions of cores and nodes. We ensure that all entry method calls done
   through the Algorithm's `simple_action` and `receive_data` functions are
   threadsafe. User-controlled threading is possible by calling the entry
   method member function `threaded_action`, which is like `simple_action`
   except it passes a node lock to the `Action`'s apply function. Note that
   unlike `simple_action`s, multiple `threaded_action`s can be executing
   simultaneously on the same chare, but on different cores of the node.

\note Technically there is a fifth type of parallel component called a MainChare
which is placed and executed on the global zeroth core. However, there can only
be one of these in an executable, it is entirely different from a Singleton, and
it is not specifiable by the user, so we omit it from the count here.

## 2. Requirements {#dev_guide_parallelization_component_requirements}

Each %Parallel Component struct must have the following type aliases:
1. `using chare_type` is set to one of the four \ref
   dev_guide_parallelization_component_types "types of Parallel Components".
2. `using metavariables` is set to the Metavariables struct that stores the
   global metavariables. It is often easiest to have the %Parallel
   Component struct have a template parameter `Metavariables` that is the
   global metavariables struct. Examples of this technique are given below.
3. `using phase_dependent_action_list` is set to a `tmpl::list` of
   `Parallel::PhaseActions<Phase, tmpl::list<Actions...>>`
   where each `PhaseAction` represents a PDAL that will be executed on
   the parallel component during the specified phase. The %Actions are
   executed in the order that they are given in the `tmpl::list`s of
   the PDALs, but the phases need not be run in linear order. However,
   `db::DataBox` types are constructed assuming the phases are
   performed from first in the `phase_dependent_action_list` to the
   last. Simple actions (described below) can be executed in any
   phase. If there are no iterable actions in a phase then a
   `PhaseAction` need not be specified for that phase. However, at
   least one `PhaseAction`, even if it is empty, must be specified.
4. `using simple_tags_from_options` which is a `tmpl::list` of all the tags
   that will be inserted into the initial `db::DataBox` of each component.
   These tags are db::SimpleTag%s that have a `using option_tags`
   type alias and a static function `create_from_options` (see the
   example below). This list can usually be constructed from the
   initialization actions of the component (i.e. the list of actions
   in the `PhaseAction` list for the `Initialization` phase) using the
   helper function `Parallel::get_simple_tags_from_options` (see the
   examples of components below). Each initialization action may
   specify a type alias `using simple_tags_from_options` which is a
   `tmpl::list` of tags that will be fetched from the db::DataBox by the
   action.
5. `using const_global_cache_tags` is set to a `tmpl::list` of tags
   that are required by the `allocate_array` function of an array
   component, or simple actions called on the parallel component.
   These tags correspond to const items that are stored in the
   Parallel::GlobalCache (of which there is one copy per Charm++
   node). The alias can be omitted if the list is empty.
6. `using mutable_global_cache_tags` is set to a `tmpl::list` of tags
   that correspond to mutable items that are stored in the
   Parallel::GlobalCache (of which there is one copy per Charm++
   core). The alias can be omitted if the list is empty.
7. `array_allocation_tags` is set to a `tmpl::list` of tags that will be
   constructed from options and will only be used in the `allocate_array`
   function of an array component. This type alias is only required for array
   components.

Additionally, each %Parallel Component struct must define a `static constexpr
bool checkpoint_data`, indicating whether the DataBox and inbox contents should
be preserved in a checkpoint. Setting this to false is intended for
organizational components that only store registration data that can be
regenerated during the Restart phase. A sketch assembling these pieces is
given after the notes below.

\parblock
\note Array parallel components must also specify the type alias `using
array_index`, which is set to the type that indexes the %Parallel Component
Array. Charm++ allows arrays to be 1 through 6 dimensional or be indexed by a
custom type. The Charm++ provided indexes are wrapped as
`Parallel::ArrayIndex1D` through `Parallel::ArrayIndex6D`. When writing custom
array indices, the [Charm++ manual](http://charm.cs.illinois.edu/help) tells you
to write your own `CkArrayIndex`, but we have written a general implementation
that provides this functionality (see `Parallel::ArrayIndex`); all that you need
to provide is a plain-old-data
([POD](http://en.cppreference.com/w/cpp/concept/PODType)) struct whose size is
at most that of 3 integers.
\endparblock

\parblock
\note Singletons use an `array_index` of type `int`, but users need not specify
this. It is already specified in the implementation of a singleton.
\endparblock
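
Assembling the required pieces, an array component might be sketched as
follows. This is only a sketch: the action names are hypothetical
placeholders, and it assumes the helper metafunction
`Parallel::get_initialization_actions_list` for collecting the initialization
actions. Real components (see the examples below) fill in the
`allocate_array` and `execute_next_phase` functions, whose signatures are
given below.

\code
template <class Metavariables>
struct MyArrayComponent {
  using chare_type = Parallel::Algorithms::Array;
  using metavariables = Metavariables;
  using array_index = int;

  using phase_dependent_action_list = tmpl::list<
      Parallel::PhaseActions<Parallel::Phase::Initialization,
                             tmpl::list<InitializeElement>>,
      Parallel::PhaseActions<Parallel::Phase::Evolve,
                             tmpl::list<DoSomething>>>;

  // Collect the simple_tags_from_options of the initialization actions.
  using simple_tags_from_options = Parallel::get_simple_tags_from_options<
      Parallel::get_initialization_actions_list<phase_dependent_action_list>>;

  using array_allocation_tags = tmpl::list<>;

  static constexpr bool checkpoint_data = true;

  static void allocate_array(/* signature given below */);
  static void execute_next_phase(/* signature given below */);
};
\endcode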

%Parallel array components have a static `allocate_array` function
that is used to construct the elements of the array. The
signature of the `allocate_array` functions must be:
\code
static void allocate_array(
    Parallel::CProxy_GlobalCache<metavariables>& global_cache,
    const tuples::tagged_tuple_from_typelist<simple_tags_from_options>&
        initialization_items,
    const tuples::tagged_tuple_from_typelist<array_allocation_tags>&
        array_allocation_items,
    const std::unordered_set<size_t>& procs_to_ignore);
\endcode
The `allocate_array` function is called by the Main parallel component
when the execution starts and will typically insert elements into
array parallel components. If the `allocate_array` function depends
upon input options that are not in the GlobalCache, those tags should be
added to the `array_allocation_tags` type alias. A TaggedTuple is constructed
from this type alias and its input options and is only available in the
`allocate_array` function. All other tags that will be constructed from options
and used during the %Initialization phase should be placed in the
`simple_tags_from_options` type alias.
This type alias is a `tmpl::list` of tags which
are db::SimpleTag%s that have a `using option_tags` type alias
and a static function `create_from_options`. They only need to be explicitly
added to the list if no initialization action has added them to its
`simple_tags_from_options` type alias. If you want to ignore specific
processors when placing array elements, you can pass in a
`std::unordered_set<size_t>` to `allocate_array` that contains all the
processors that shouldn't have array elements on them.

The `allocate_array` functions of different
array components are called in random order and so it is not safe to
have them depend on each other.
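
As an illustration, here is a rough sketch of an `allocate_array` body that
places elements round-robin over cores while skipping ignored processors. It
continues the hypothetical `MyArrayComponent` from above and assumes the
`Parallel::local_branch` wrapper for accessing the local GlobalCache branch;
the fixed element count is made up for the example, and real executables
compute their element distribution from options.

\code
static void allocate_array(
    Parallel::CProxy_GlobalCache<metavariables>& global_cache,
    const tuples::tagged_tuple_from_typelist<simple_tags_from_options>&
        initialization_items,
    const tuples::tagged_tuple_from_typelist<array_allocation_tags>&
        /*array_allocation_items*/,
    const std::unordered_set<size_t>& procs_to_ignore) {
  auto& local_cache = *Parallel::local_branch(global_cache);
  auto& array_proxy =
      Parallel::get_parallel_component<MyArrayComponent>(local_cache);
  const size_t number_of_procs = Parallel::number_of_procs<size_t>();
  const int number_of_elements = 10;  // hypothetical fixed size
  size_t which_proc = 0;
  for (int index = 0; index < number_of_elements; ++index) {
    // Skip over any processors the user asked us to avoid.
    while (procs_to_ignore.count(which_proc) != 0) {
      which_proc = (which_proc + 1) % number_of_procs;
    }
    array_proxy(index).insert(global_cache, initialization_items, which_proc);
    which_proc = (which_proc + 1) % number_of_procs;
  }
  array_proxy.doneInserting();
}
\endcode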

Each parallel component must also decide what to do in the different phases of
the execution. This is controlled by an `execute_next_phase` function with
signature:
\code
static void execute_next_phase(
    const Parallel::Phase next_phase,
    const Parallel::CProxy_GlobalCache<metavariables>& global_cache);
\endcode
Parallel::Main<Metavariables>::execute_next_phase
determines the next phase, after
which the `execute_next_phase` function of each component gets called. The
`execute_next_phase` function determines what the parallel component should do
during the next phase. Typically the `execute_next_phase` function should just
call `start_phase(phase)` on the parallel component.
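
For the hypothetical component sketched above, that typical implementation
might look like the following (again assuming the `Parallel::local_branch`
wrapper):

\code
static void execute_next_phase(
    const Parallel::Phase next_phase,
    const Parallel::CProxy_GlobalCache<metavariables>& global_cache) {
  auto& local_cache = *Parallel::local_branch(global_cache);
  // Simply run this component's algorithm in the new phase.
  Parallel::get_parallel_component<MyArrayComponent>(local_cache)
      .start_phase(next_phase);
}
\endcode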

## 3. Examples {#dev_guide_parallelization_component_examples}

An example of a singleton parallel component is:
\snippet Test_AlgorithmParallel.cpp singleton_parallel_component

An example of an array parallel component is:
\snippet Test_AlgorithmParallel.cpp array_parallel_component

There are some parallel components that are common to many executables.

- Parallel::GlobalCache (a Parallel::Algorithms::Nodegroup)
- DgElementArray (a Parallel::Algorithms::Array)
- observers::Observer (a Parallel::Algorithms::Group)
- observers::ObserverWriter (a Parallel::Algorithms::Nodegroup)

Mutable items in the GlobalCache deserve special mention, which is why they
have their own section with instructions on how to use them. See [Mutable items
in the GlobalCache](#dev_guide_parallelization_mutable_global_cache).

## 4. Placement {#dev_guide_parallelization_component_placement}

The user has some control over where parallel components get placed on the
resources the executable is running on. Here is a figure that illustrates how
one may place parallel components.

\image html charm_node_structure.png "Parallel component placement."

In this example we are running on three (3) nodes that have four (4) cores each.
For all our executables, we reserve one core of each node purely for
communication purposes. Nothing else is run on this core. Because of this, what
Charm++ calls a node doesn't correspond to a full node on a supercomputer. A
charm-node simply corresponds to a collection of cores on a physical node. In
our case, a charm-node is represented by the remaining cores on a node not used
for communication (i.e. the first charm-node corresponds to cores 1-3 on the
first physical node). Also, the definition of a charm-core doesn't necessarily
have to correspond to an actual core (it could correspond to a hyperthreaded
virtual core); however, for our purposes, it does.

SpECTRE offers wrappers around Charm++ functions that will tell you the total
number of charm-nodes/cores in an executable and what charm-node/core a parallel
component is on, as the sketch following this list illustrates. (In the
following examples, the type `T` is an `int` or a `size_t`.)

- `Parallel::my_node<T>()` returns the charm-node that the parallel component is
  on. In the figure, `Sing. 4` would return `2`.
- `Parallel::my_proc<T>()` returns the charm-core that the parallel component is
  on. In the figure, `Sing. 4` would return `6` (*not* `9`).
- `Parallel::number_of_nodes<T>()` returns the total number of charm-nodes in an
  executable. The above figure would have `3` charm-nodes.
- `Parallel::number_of_procs<T>()` returns the total number of charm-cores in an
  executable. The above figure would have `9` charm-cores (*not* `12`).
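
For example, the values these wrappers would report for `Sing. 4` in the
figure above are:

\code
// Queried from code running on the charm-core labeled 6 in the figure
// (where `Sing. 4` lives):
const size_t my_core = Parallel::my_proc<size_t>();          // == 6
const size_t my_node = Parallel::my_node<size_t>();          // == 2
const size_t n_cores = Parallel::number_of_procs<size_t>();  // == 9
const size_t n_nodes = Parallel::number_of_nodes<size_t>();  // == 3
\endcode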

\note For Charm++ SMP (shared memory parallelism) builds, a node corresponds to
a collection of cores on a physical node, and a core corresponds to a processor
on that physical node. However, for non-SMP builds, nodes and cores are
equivalent. All of our builds are done with Charm++ SMP so nodes and cores have
their usual definitions.
383 :
384 : The placement of Groups and Nodegroups are determined by Charm++. This is
385 : because a Group is on every charm-core and a Nodegroup is on every charm-node.
386 : Even though Nodegroups are one per charm-node, the user can't choose which core
387 : is used on the charm-node. They run on the next available charm-core on the
388 : charm-node.

The Elements of an Array, however, can be placed on specific charm-cores. They
are inserted into the Array by using the Charm++ `insert` member function of the
CProxy for the Array. The `insert` function is documented in the Charm++ manual.
In the Array example in the
[Examples](#dev_guide_parallelization_component_examples) section, `array_proxy`
is a `CProxy` and so all the documentation for Charm++ array proxies applies.
SpECTRE always creates empty arrays with the constructor and requires users to
insert however many elements they want and on which charm-cores they want them
to be placed. Note that load balancing calls may result in array elements being
moved.

In a similar fashion, Singletons can also be placed on specific charm-cores.
This can be specified in the input file.

From an input file, there are two ways to specify where Array/Singleton parallel
components can be placed.

```yaml
ResourceInfo:
  AvoidGlobalProc0: true
  Singletons:
    AhA:
      Proc: 12
      Exclusive: true
    AhB:
      Proc: Auto
      Exclusive: false
```

First is the `AvoidGlobalProc0` option. This option will tell the program to not
put *any* Array Elements or Singletons on the global zeroth charm-core. This
core is sometimes used to write data to disk which is typically much slower
than the program execution. The second is the `Singletons:` option. You can
set the value to `Auto`, and then each singleton will have its proc chosen
automatically and won't be exclusive. Otherwise, you must specify options
for each singleton as in the example above. `AhA` is the `pretty_type::name()`
of a Singleton in the program and the user has a choice of which proc to place
the Singleton on (`Auto` will let the program decide) and whether to exclude
Array Elements or other Singletons from being put on this core. This is useful
in case the Singleton does some expensive computation that shouldn't be slowed
down by having lots of Array Elements on the same core. In the figure above,
`AvoidGlobalProc0` is true, and `Sing. 2` requested to be exclusively on core
`2`.

# Actions {#dev_guide_parallelization_actions}

%Actions are structs with a static `apply` method and come in five
variants: simple actions, iterable actions, reduction actions,
threaded actions, and local synchronous actions.

The signature of `apply` methods differs for the different types of
actions, but all types have the same general form. Actions receive a
`db::DataBox`, the Parallel::GlobalCache, and their element's index
and parallel component, as well as arguments specific to the action
type.

The `db::DataBox` should be thought of as the member data of the parallel
component while the actions are the member functions. The combination of a
`db::DataBox` and actions allows building up classes with arbitrary member data
and methods using template parameters and invocation of actions. This approach
allows us to eliminate the need for users to work with Charm++'s interface
files, which can be error prone and difficult to use.

The Parallel::GlobalCache is passed to each action so that the
action has access to global data and is able to invoke actions on
other parallel components. The `ParallelComponent` template parameter
is the tag of the parallel component that invoked the action. A proxy
to the calling parallel component can then be retrieved from the
Parallel::GlobalCache. The remote entry method invocations are
slightly different for different types of actions, so they will be
discussed below. However, one thing that is disallowed for all actions
is calling an action locally from within an action on the same
parallel component. Specifically,

\snippet Test_AlgorithmNestedApply1.cpp bad_recursive_call

Here `Parallel::local()` is a wrapper around `ckLocal()` which is a Charm++
provided method that returns a pointer to the local (currently executing)
parallel component. See the
[Charm++ manual](http://charm.cs.illinois.edu/help) for more
information. However, you are able to queue a new action to be
executed later on the same parallel component by getting your own
parallel component from the Parallel::GlobalCache
(`Parallel::get_parallel_component<ParallelComponent>(cache)`). The
difference between the two calls is that by calling an action through
the parallel component you will first finish the series of actions you
are in, then when they are complete Charm++ will call the next queued
action.

Array, group, and nodegroup parallel components can have actions invoked in two
ways. First is a broadcast where the action is called on all elements of the
array:

\snippet Test_AlgorithmParallel.cpp broadcast_to_group

The second case is invoking an action on a specific array element by using the
array element's index. The below example shows how a broadcast would be done
manually by looping over all elements in the array:

\snippet Test_AlgorithmParallel.cpp call_on_indexed_array

Note that in general you will not know what all the elements in the array are
and so a broadcast is the correct method of sending data to or invoking an
action on all elements of an array parallel component.

The `array_index` argument passed to all `apply` methods is the index into the
parallel component array. If the parallel component is not an array, the value
and type of `array_index` are implementation defined and cannot be relied on.

## 1. Simple Actions {#dev_guide_parallelization_simple_actions}

Simple actions can be thought of as member functions of remote objects
(chares/parallel components). They are the direct analog of entry
methods in Charm++ except that the member data is stored in the
`db::DataBox` that is passed in as the first argument. A simple action
must return void but can use `db::mutate` to change values of items in
the `db::DataBox` if the `db::DataBox` is taken as a non-const
reference. In some cases you will need specific items to be in the
`db::DataBox` otherwise the action won't compile. To restrict which
`db::DataBox`es can be passed you should use `requires` in the
action's `apply` function template parameter list. For example,
\snippet Test_AlgorithmCore.cpp requires_action
checks that `CountActionsCalled` is available in the box.

Simple actions can be called using a `CProxy` (see the [Charm++
manual](http://charm.cs.illinois.edu/help)), which is retrieved from
the Parallel::GlobalCache using the parallel component struct and the
`Parallel::get_parallel_component()` function. For example, the
action above could be called as
\snippet Test_AlgorithmCore.cpp simple_action_call
Any arguments after the proxy are passed as additional arguments to
the action's `apply` function.

## 2. Iterable Actions {#dev_guide_parallelization_iterable_actions}

Iterable actions make up the algorithms described by the PDALs. These
actions are executed one after the other until one of them cannot be
evaluated. Their `apply` method's signature is
\snippet Test_AlgorithmCore.cpp apply_iterative
The `ActionList` type is the `tmpl::list` of iterable actions in the
current phase. That is, it is equal to the `action_list` type alias
in the current PDAL. The `inboxes` argument is a collection of the tags
specified as `tmpl::list`s in the iterable actions' member type
aliases `inbox_tags`. This collection represents data received from
other chares using the `receive_data` function.

Iterable actions can request that the algorithm be paused or halted
for the current phase, and control which action in the current PDAL
will be executed next. This is all done via the return value from the
`apply` function. The `apply` function for iterable actions must
return a Parallel::iterable_action_return_t which is a
`std::tuple<Parallel::AlgorithmExecution, std::optional<size_t>>`. The
first element of the tuple controls how algorithm execution continues.
See the documentation of `Parallel::AlgorithmExecution` for the
meanings of different values of that enum. The second element of the
tuple is usually set to `std::nullopt` in order to continue iterating
through the algorithm, but can be used to jump to a different action
in the current PDAL. Most iterable actions will simply return

\snippet Test_AlgorithmParallel.cpp iterable_action_return_continue_next_action

An action that pauses the algorithm will return

\snippet Test_AlgorithmParallel.cpp return_with_termination

After an algorithm has been paused it can be restarted by passing
`false` to the `set_terminate` method or by calling `receive_data(...,
true)`. Since the order in which messages are received is undefined, in
most cases the `receive_data(..., true)` call should be used to
restart the algorithm.

The return value `Parallel::AlgorithmExecution::Retry` deserves
special mention. It is intended for use by actions that use data
supplied by other parallel objects to indicate that they have not
received all of the data they require. The algorithm will stop until
an operation that could supply data has occurred and then the action
will be retried. An example of waiting because of missing data is

\snippet Test_AlgorithmCore.cpp retry_example

In order to jump to a specific action, the metafunction
`tmpl::index_of<list, element>` can be used to get a
`tmpl::integral_constant` with the value of the index of the element
`element` in the typelist `list`. For example,

\snippet Test_AlgorithmCore.cpp out_of_order_action

The metafunction call `tmpl::index_of<ActionList,
iterate_increment_int0>::%value` returns a `size_t` whose value is
that of the action `iterate_increment_int0` in the PDAL. The indexing
of actions in the PDAL starts at `0`.

Iterable actions are invoked as part of the algorithm and so the only way
to request they be invoked is by having the algorithm run on the parallel
component. The algorithm can be explicitly evaluated in a new phase by calling
`start_phase(Phase::TheCurrentPhase)`:

\snippet Test_AlgorithmCore.cpp start_phase

Alternatively, to evaluate the algorithm without changing phases the
`perform_algorithm()` method can be used.

By passing `true` to `perform_algorithm` the algorithm will be restarted if it
was paused.

The algorithm is also evaluated by calling the `receive_data` function, either
on an entire array or singleton (this does a broadcast), or on an individual
element of the array. Here is an example of a broadcast call:

\snippet Test_AlgorithmParallel.cpp broadcast_to_group

and of calling individual elements:

\snippet Test_AlgorithmParallel.cpp call_on_indexed_array

The `receive_data` function always takes a `ReceiveTag`, which is set
in the actions' `inbox_tags` type aliases. Each inbox tag must have
two member type aliases, a `temporal_id` which is used to identify
when the data was sent, and a `type` which is the type of the data to
be stored in the `inboxes`. The types are typically a
`std::unordered_map<temporal_id, DATA>`. In the discussed scenario of
waiting for neighboring elements to send their data the `DATA` type
would be a `std::unordered_map<TheElementId, DataSent>`. Inbox tags
must also specify a `static void insert_into_inbox()` function. For
example,

\snippet Test_AlgorithmParallel.cpp int_receive_tag

For common types of `DATA`, such as a `map`, a data structure with an `insert`
function, a data structure with a `push_back` function, or copy/move assignment
that is used to insert the received data, inserters are available in
`Parallel::InboxInserters`. For example, there is
`Parallel::InboxInserters::Map` for `map` data structures. The inbox tag can
inherit publicly off the inserters to gain the required insertion capabilities:

\snippet Test_AlgorithmCore.cpp int receive tag insert

Any inbox tag that uses Charm++ messages must also specify a `message_type` type
alias which is the object that will be sent. An example is:

\snippet Test_AlgorithmMessages.cpp charm message inbox tag

The `inbox_tags` type alias for the action is:

\snippet Test_AlgorithmParallel.cpp int_receive_tag_list

An inbox tag can also optionally specify a static function called `output_inbox`
that returns a `std::string`. This function can be used for printing the
contents of the inbox in a nice way as the types can sometimes get complicated.
You can also use the `Parallel::output_inbox` function to output a specific
inbox from all the inboxes. See an above example for the signature of the
`output_inbox` function.

\warning
It is the responsibility of the iterable action to remove data from the inboxes
that will no longer be needed.

Normally when remote functions are invoked they go through the Charm++ runtime
system, which adds some overhead. The `receive_data` function tries to elide
the call to the Charm++ RTS for calls into array components. Charm++ refers to
these types of remote calls as "inline entry methods". With the Charm++ method
of eliding the RTS, the code becomes susceptible to stack overflows because
of infinite recursion. The `receive_data` function is limited to at most 64 RTS
elided calls, though in practice reaching this limit is rare. When the limit is
reached the remote method invocation is done through the RTS instead of being
elided.

## 3. Reduction Actions {#dev_guide_parallelization_reduction_actions}

Reduction actions are the targets of reducing data over an array. For
example, you may want to know the sum of an `int` from every element in
the array. You can do this as follows:

\snippet Test_AlgorithmReduction.cpp contribute_to_reduction_example

This reduces over the parallel component
`ArrayParallelComponent<Metavariables>`, reduces to the parallel component
`SingletonParallelComponent<Metavariables>`, and calls the action
`ProcessReducedSumOfInts` after the reduction has been performed. The reduction
action is:

\snippet Test_AlgorithmReduction.cpp reduce_sum_int_action

As you can see, the last argument to the `apply` function is of type `int`, and
is the reduced value.

You can also broadcast the result back to an array, even yourself. For example,

\snippet Test_AlgorithmReduction.cpp contribute_to_broadcast_reduction

It is often necessary to reduce custom data types, such as `std::vector` or
`std::unordered_map`. Charm++ supports such custom reductions, and so does our
layer on top of Charm++.
Custom reductions require one step in addition to calling
`contribute_to_reduction`: writing a reduction function to reduce the
custom data. We provide a generic type that can be used in custom reductions,
`Parallel::ReductionData`, which takes a series of `Parallel::ReductionDatum` as
template parameters and `ReductionDatum::value_type`s as the arguments to the
constructor. Each `ReductionDatum` takes up to four template parameters (two
are required). The first is the type of data to reduce, and the second is a
binary invokable that is called at each step of the reduction to combine two
messages. The last two template parameters are used after the reduction has
completed. The third parameter is an n-ary invokable that is called once the
reduction is complete, whose first argument is the result of the reduction. The
additional arguments can be any `ReductionDatum::value_type` in the
`ReductionData` that are before the current one. The fourth template parameter
of `ReductionDatum` is used to specify which data should be passed. It is a
`std::index_sequence` indexing into the `ReductionData`.
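
For instance, to reduce both the number of contributing elements and the mean
of a `double` over the array, one might write the following sketch. It assumes
SpECTRE's `funcl::Plus` and `funcl::Divides` function objects; the finalizer
on the second datum divides the reduced sum by the reduced count (datum `0`)
once the reduction completes.

\code
using ReductionData = Parallel::ReductionData<
    // Datum 0: the number of contributing elements, combined by summing.
    Parallel::ReductionDatum<size_t, funcl::Plus<>>,
    // Datum 1: the sum of the values; once complete, divide by datum 0
    // (selected by the std::index_sequence) to obtain the mean.
    Parallel::ReductionDatum<double, funcl::Plus<>, funcl::Divides<>,
                             std::index_sequence<0>>>;
\endcode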

\warning
All elements of the array must call the same reductions in the same order. It is
*defined* behavior to do multiple reductions at once as long as all contribute
calls on all array elements occurred in the same order. It is **undefined**
behavior if the contribute calls are made in different orders on different array
elements.

## 4. Threaded Actions {#dev_guide_parallelization_threaded_actions}

Threaded actions are similar to simple actions, with the difference
being that multiple threaded actions may be running on the same chare
at the same time (potentially in parallel with one simple or reduction
action). The `apply` function for a threaded action has the same
signature as that for a simple action, except that it also receives a
`NodeLock` intended to control access to the chare's `db::DataBox`.
All access to the `db::DataBox`, including read-only access, must
occur while the action owns this lock. (Simple and reduction actions
implicitly hold the lock for their entire execution.)

\snippet Test_AlgorithmNodelock.cpp threaded_action_example

Threaded actions can only be run on nodegroup chares.

## 5. Local Synchronous Actions {#dev_guide_parallelization_local_synchronous_actions}

There is limited ability to retrieve data held by another parallel component via
a direct synchronous call. Unlike the above actions, the invocation of a
synchronous action is precisely a call to a member function of another parallel
component; therefore, these invocations will run to completion, and return their
result before the calling code proceeds in execution.

Aside from being synchronous and being able to return data, local
synchronous actions behave the same as threaded actions, except that
they will only run on the chare of a nodegroup that is on the local
node.

Local synchronous actions' `apply` functions follow a signature motivated by
threaded actions, but take fewer arguments. This may be a bug.

Local synchronous actions must specify their return type in a
`return_type` type alias. This is to help simplify the logic with the
variant `db::DataBox` held by the parallel component.

An example of a definition of a local synchronous action:

\snippet Test_AlgorithmLocalSyncAction.cpp synchronous_action_example

And the corresponding invocation:

\snippet Test_AlgorithmLocalSyncAction.cpp synchronous_action_invocation_example

\warning Say an action is being run on a component on a node where a mutable
item in the GlobalCache is up-to-date. This action then calls another action on
a different component on a different node. It is **NOT** guaranteed that this
mutable item in the GlobalCache is up-to-date on this new node. It is up to the
user to ensure the mutable item is up-to-date on whatever node they run an
action on, even if it was up-to-date on the node that sent the message. The
\ref dev_guide_parallelization_mutable_global_cache
"Mutable items in the GlobalCache" section gives details about mutable
GlobalCache items.

# Mutable items in the GlobalCache {#dev_guide_parallelization_mutable_global_cache}

Most items in the GlobalCache are constant, and are specified
by type aliases called `const_global_cache_tags` as
described above. However, the GlobalCache can also store mutable
items. Because of asynchronous execution, **EXTREME** care must be taken when
mutating items in the GlobalCache, as described below.

A mutable item can be of any type, as long as that type is something
that can be checked for whether it is "up-to-date". Here "up-to-date"
means that the item can be safely used (even read-only) without
needing to be mutated first. For example, a mutable item might be a
function of time that knows the range of times for which it is valid;
the mutable item is then deemed up-to-date if it will be called for a
time within its range of validity, and it is deemed not up-to-date if
it will be called for a time outside its range of validity. Thus the
up-to-date status of a mutable item is determined by both the state of
the item itself and by the code that wishes to use that item.

## 1. Specification of mutable GlobalCache items

Mutable GlobalCache items are specified by a
type alias `mutable_global_cache_tags`, which is treated the same way
as `const_global_cache_tags` for const items.
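
For example, a tag for the `Tags::VectorOfDoubles` item used in the snippets
below might be sketched as:

\code
namespace Tags {
// A simple tag whose item will be stored mutably in the GlobalCache.
struct VectorOfDoubles : db::SimpleTag {
  using type = std::vector<double>;
};
}  // namespace Tags

// In the metavariables (or a parallel component):
using mutable_global_cache_tags = tmpl::list<Tags::VectorOfDoubles>;
\endcode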

## 2. Use of mutable GlobalCache items

### 1. Checking if the item is up-to-date

Because execution is asynchronous, any code that uses a mutable item
in the GlobalCache must first check whether that item is up-to-date.
The information about whether an item is up-to-date is assumed to be
stored in the item itself. For example, a mutable object stored in
the GlobalCache might have type `std::map<temporal_id,T>` (for some
type `T`), and then any code that uses the stored object can check
whether an entry exists for a particular `temporal_id`. To avoid
race conditions, it is
important that up-to-date checks are based on something that is
independent of the order of mutation (like a `temporal_id`, and not
like checking the size of a vector).

To check an item, use the function
`Parallel::mutable_cache_item_is_ready`, which returns a bool
indicating whether the item is up-to-date. If the item is up-to-date,
then it can be used. `Parallel::mutable_cache_item_is_ready` takes a
lambda as an argument. This lambda is passed a single argument: a
const reference to the item being retrieved. The lambda should
determine whether the item is up-to-date. If so, it should return a
default-constructed `std::unique_ptr<Parallel::Callback>`; if not, it should
return a `std::unique_ptr<Parallel::Callback>` to a callback function that will
be called on the next `Parallel::mutate` of that item. The callback
will typically check again if the item is up-to-date and if so will
execute some code that gets the item via `Parallel::get`.

For the case of iterable actions,
`Parallel::mutable_cache_item_is_ready` typically uses the callback
`Parallel::PerformAlgorithmCallback`. In the example below, the
vector is considered up-to-date if it is non-empty. If the vector is
not up-to-date, then when it becomes up-to-date the callback function
will be invoked; in this case the callback function re-runs
`perform_algorithm()`, which will call the same action again.

\snippet Test_AlgorithmGlobalCache.cpp check_mutable_cache_item_is_ready

Note that `Parallel::mutable_cache_item_is_ready` is called on the local
node and does no parallel communication.

### 2. Retrieving the item

The item is retrieved using `Parallel::get` just like for constant items.
For example, to retrieve the item `Tags::VectorOfDoubles`:
\snippet Test_AlgorithmGlobalCache.cpp retrieve_mutable_cache_item

Note that `Parallel::get` is called on the local node and does no
parallel communication.

Whereas we support getting *non-mutable* items in the GlobalCache from
a DataBox via `db::get`, we intentionally do not support
`db::get` of *mutable* items in the GlobalCache from a DataBox.
The reason is that mutable
items should be retrieved only after a `Parallel::mutable_cache_item_is_ready`
check, and being able to retrieve a mutable item from a DataBox makes it
difficult to enforce that check, especially when automatically-executing
compute items are considered.

## 3. Modifying a mutable GlobalCache item

To modify a mutable item, pass `Parallel::mutate` two template
parameters: the tag to mutate, and a struct with an `apply` function
that does the mutating. `Parallel::mutate` takes two arguments:
a reference to the local GlobalCache, and a tuple that is passed into the
mutator function. For the following example,

\snippet Test_AlgorithmGlobalCache.cpp mutate_global_cache_item

the mutator function is defined as below:
\snippet Test_AlgorithmGlobalCache.cpp mutate_global_cache_item_mutator

`Parallel::mutate` broadcasts to every node, where it calls the
mutator function and then calls all the callbacks that have been set
on that node by `Parallel::mutable_cache_item_is_ready`. The
`Parallel::mutate` operation is guaranteed to be thread-safe without
any further action by the developer so long as the item being mutated can be
mutated in a threadsafe way. See the `Parallel::GlobalCache` docs for more
details.

# Charm++ Node and Processor Level Initialization Functions {#dev_guide_parallelization_charm_node_processor_level_initialization}

Charm++ allows running functions once per core and once per node before the
construction of any parallel components. This is commonly used for setting up
error handling and enabling floating point exceptions. Other functions could
also be run. Which functions are run on each node and core is set by calling
`Parallel::charmxx::register_init_node_and_proc` in `CkRegisterMainModule()`
with function pointers to the functions to be called. For example:

\snippet Test_AlgorithmPhaseControlNodegroup.cpp charm_init_funcs_example