Line data Source code
1 0 : \cond NEVER
2 : Distributed under the MIT License.
3 : See LICENSE.txt for details.
4 : \endcond
5 : # Parallelization, Charm++, and Core Concepts {#dev_guide_parallelization_foundations}
6 :
7 : \tableofcontents
8 :
9 : # Introduction {#dev_guide_parallelization_introduction}
10 :
11 : SpECTRE builds a layer on top of Charm++ that performs various safety checks and
12 : initialization for the user that can otherwise lead to difficult-to-debug
13 : undefined behavior. The central concept is what is called a %Parallel
14 : Component. A %Parallel Component is a struct with several type aliases that
15 : is used by SpECTRE to set up the Charm++ chares and allowed communication
16 : patterns. %Parallel Components are input arguments to the compiler, which then
17 : writes the parallelization infrastructure that you requested for the executable.
18 : There is no restriction on the number of %Parallel Components, though
19 : practically it is best to have around 10 at most.
20 :
21 : Here is an overview of what is described in detail in the sections below:
22 :
23 : - \ref dev_guide_parallelization_metavariables_class "Metavariables": Provides
24 : high-level configuration to the compiler, e.g. the physical system to be
25 : simulated.
26 : - \ref dev_guide_parallelization_phases_of_execution "Phases": Defines distinct
27 : simulation phases separated by a global synchronization point,
28 : e.g. `Initialization`, `Evolve` and `Exit`.
29 : - \ref dev_guide_parallelization_core_algorithm "Algorithm": In each phase,
30 : repeatedly iterates over a list of actions until the current phase ends.
31 : - \ref dev_guide_parallelization_parallel_components "Parallel component":
32 : Maintains and executes its algorithm.
33 : - \ref dev_guide_parallelization_actions "Action": Performs a computational
34 : task, e.g. evaluating the right hand side of the time evolution equations. May
35 : require data to be received from another action potentially being executed on
36 : a different core or node.
37 :
38 : # The Metavariables Class {#dev_guide_parallelization_metavariables_class}
39 :
40 : SpECTRE takes a different approach to input options passed to an executable than
41 : is common. SpECTRE not only reads an input file at runtime but also has many
42 : choices made at compile time. The compile time options are specified by what is
43 : referred to as the metavariables. What exactly the metavariables struct
44 : specifies depends on the executable, but all metavariables structs must
45 : specify the following:
46 :
47 : - `help`: a `static constexpr Options::String` that will be printed as part of
48 : the help message. It should describe the executable and basic usage of it, as
49 : well as any non-standard options that must be specified in the metavariables
50 : and their current values. An example of a help string for one of the testing
51 : executables is:
52 : \snippet Test_AlgorithmCore.cpp help_string_example
53 : - `component_list`: a `tmpl::list` of the parallel components (described below)
54 : that are to be created. Most evolution executables will have the
55 : `DgElementArray` parallel component listed. An example of a `component_list`
56 : for one of the test executables is:
57 : \snippet Test_AlgorithmCore.cpp component_list_example
58 : - `using const_global_cache_tags`: a `tmpl::list` of tags that are
59 : used to place const items in the GlobalCache. The alias may be
60 : omitted if the list is empty.
61 : - `using mutable_global_cache_tags`: a `tmpl::list` of tags that are
62 : used to place mutable items in the GlobalCache. The alias may be
63 : omitted if the list is empty.
64 : - `default_phase_order`: an array of Parallel::Phase that must contain at least
65 : `Initialization` as the first element and `Exit`as the last element. Phases
66 : are described in the next section.
67 :
68 : There are also several optional members:
69 :
70 : - `input_file`: a `static constexpr Options::String` that is the default name of
71 : the input file that is to be read. This can be overridden at runtime by
72 : passing the `--input-file` argument to the executable.
73 : - `ignore_unrecognized_command_line_options`: a `static constexpr bool` that
74 : defaults to `false`. If set to `true` then unrecognized command line options
75 : are ignored. Ignoring unrecognized options is generally only necessary for
76 : tests where arguments for the testing framework,
77 : [Catch](https://github.com/catchorg/Catch2/), are passed to the executable.
78 :
79 : # Phases of an Execution {#dev_guide_parallelization_phases_of_execution}
80 :
81 : Global synchronization points, where all cores wait for each other, are
82 : undesirable for scalability reasons. However, they are sometimes inevitable for
83 : algorithmic reasons. That is, in order to actually get a correct solution you
84 : need to have a global synchronization. SpECTRE executables can have multiple
85 : phases, where after each phase a global synchronization occurs. By global
86 : synchronization we mean that no parallel components are executing or have more
87 : tasks to execute: everything is waiting to be told what tasks to perform next.
88 :
89 : Every executable must have at least two phases, `Initialization` and
90 : `Exit`. The next phase is decided by the member function
91 : Parallel::Main<Metavariables>::execute_next_phase.
92 : Usually the next phase is determined
93 : from the `default_phase_order` provided by the metavariables. If more
94 : complex decision making is desired, various components can send data
95 : to Parallel::Main via the PhaseControl infrastructure. This allows the next
96 : Phase to be determined dynamically. Here is an example of a
97 : `default_phase_order` member variable:
98 : \snippet LinearSolverAlgorithmTestHelpers.hpp default_phase_order_array
99 :
100 : In contrast, an evolution executable might have phases
101 : `Initialization`, `SetInitialData`, `Evolve`, and `Exit`.
102 : The first phase that is entered is always
103 : `Initialization`. During the `Initialization` phase the
104 : `Parallel::GlobalCache` is created, all (node)group components are created,
105 : and empty array and singleton components are created. Next, the function
106 : `allocate_remaining_components_and_execute_initialization_phase` is called
107 : which allocates singleton components and the elements of each array component,
108 : and then starts the `Initialization` phase on all parallel components. Once all
109 : parallel components' `Initialization` phase is complete, the next
110 : phase is determined and the `execute_next_phase` function is called on
111 : all the parallel components.
112 :
113 : At the end of an execution the `Exit` phase has the executable wait to make sure
114 : no \ref dev_guide_parallelization_parallel_components "parallel components" are
115 : performing or need to perform any more tasks, and then exits. An example where
116 : this approach is important is if we are done evolving a system but still need to
117 : write data to disk. We do not want to exit the simulation until all data has
118 : been written to disk, even though we've reached the final time of the evolution.
119 :
120 : If we reach the `Exit` phase, but some \ref
121 : dev_guide_parallelization_parallel_components "parallel components" have not
122 : terminated properly, this means a deadlock has occurred. A deadlock usually
123 : implies some error in the order messages have been sent/received. For example,
124 : if core 0 was paused and waiting to receive a message from core 1, but core 1
125 : was also paused and waiting to receive a message from core 0, this would be
126 : considered a deadlock. We detect deadlocks during the `Exit` phase. All
127 : executables have the option to specify a function with the following signature
128 :
129 : \snippet Test_DetectHangArray.cpp deadlock_analysis_function
130 :
131 : If this function is specified in the metavariables and a deadlock occurs, this
132 : function and all the simple actions in it will run. The information printed
133 : during this function call is executable dependent, but it should print enough
134 : information for you to determine why the deadlock occurred. If this function
135 : isn't specified and a deadlock occurs, a message about how to add this function
136 : to your metavariables is printed, but nothing else. After this, the executable
137 : aborts.
138 :
139 : # The Algorithm {#dev_guide_parallelization_core_algorithm}
140 :
141 : Since most numerical algorithms repeat steps until some criterion such as the
142 : final time or convergence is met, SpECTRE's parallel components are designed to
143 : do such iterations for the user. An Algorithm executes an ordered list of
144 : actions until one of the actions cannot be evaluated, typically because it is
145 : waiting on data from elsewhere. When an algorithm can no longer evaluate actions
146 : it passively waits by handing control back to Charm++. Once an algorithm
147 : receives data, typically done by having another parallel component call the
148 : `receive_data` function, the algorithm will try again to execute the next
149 : action. If the algorithm is still waiting on more data then the algorithm will
150 : again return control to Charm++ and passively wait for more data. This is
151 : repeated until all required data is available. The actions that are iterated
152 : over by the algorithm are called iterable actions and are described below.
153 : Since the action list is phase dependent we refer to them generally as
154 : phase-dependent action lists (PDALs, pronounced "pedals").
155 :
156 : # Parallel Components {#dev_guide_parallelization_parallel_components}
157 :
158 : Building off the introduction, a %Parallel Component is essentially a wrapper
159 : around Charm++ chares that makes it easy for a user to add parallel objects into
160 : their program. Charm++ chares can be confusing to work with which is why we wrap
161 : them. Each parallel component runs its own \ref
162 : dev_guide_parallelization_core_algorithm "Algorithm". Data can be sent from one
163 : parallel component to another and the receiving parallel components' Algorithm
164 : will be able to take that data and continue the program.
165 :
166 : ## 1. Types of Parallel Components {#dev_guide_parallelization_component_types}
167 :
168 : There are four types of %Parallel Components in SpECTRE:
169 :
170 : 1. `Parallel::Algorithms::Singleton`s have one object in the entire execution of
171 : the program. They are implemented as single element Charm++ chare arrays.
172 : Charm++ does offer a distributed object called a singleton, however, we
173 : explicitly don't use this for various reasons (see
174 : Parallel::Algorithms::Singleton). Henceforth and throughout SpECTRE, a
175 : `singleton` will refer to Parallel::Algorithms::Singleton and *not* a Charm++
176 : singleton.
177 : 2. `Parallel::Algorithms::Array`s hold zero or more elements, each of which is
178 : an object distributed to some core. An array can grow and shrink in size
179 : dynamically if need be and can also be bound to another array. A bound array
180 : has the same number of elements as the array it is bound to, and elements
181 : with the same ID are on the same core. See Charm++'s chare arrays for
182 : details.
183 : 3. `Parallel::Algorithms::Group`s are arrays with one element per core which are
184 : not able to be moved around between cores. These are typically useful for
185 : gathering data from elements of a Parallel::Algorithms::Array on their core,
186 : and then processing or reducing the data further. See
187 : [Charm++'s](http://charm.cs.illinois.edu/help) group chares for details.
188 : 4. `Parallel::Algorithms::Nodegroup`s are similar to groups except that there is
189 : one element per node. See [parallel component
190 : placement](#dev_guide_parallelization_component_placement) for the definition
191 : of a cores and nodes. We ensure that all entry method calls done through the
192 : Algorithm's `simple_action` and `receive_data` functions are threadsafe.
193 : User-controlled threading is possible by calling the entry method member
194 : function `threaded_action`, which is like `simple_action` except it passes a
195 : node lock to the `Action`'s apply function. Note that unlike
196 : `simple_action`s, multiple `threaded_action`s can be executing simultaneously
197 : on the same chare, but on different cores of the node.
198 :
199 : \note Technically there is a fifth type of parallel component called a MainChare
200 : which is placed and executed on the global zeroth core. However, there can only
201 : be one of these in a executable, it is entirely different from a Singleton, and
202 : it is not specifiable by the user so we omit it from the count here.
203 :
204 : ## 2. Requirements {#dev_guide_parallelization_component_requirements}
205 :
206 : Each %Parallel Component struct must have the following type aliases:
207 : 1. `using chare_type` is set to one of the four \ref
208 : dev_guide_parallelization_component_types "types of Parallel Components".
209 : 2. `using metavariables` is set to the Metavariables struct that stores the
210 : global metavariables. It is often easiest to have the %Parallel
211 : Component struct have a template parameter `Metavariables` that is the
212 : global metavariables struct. Examples of this technique are given below.
213 : 3. `using phase_dependent_action_list` is set to a `tmpl::list` of
214 : `Parallel::PhaseActions<Phase, tmpl::list<Actions...>>`
215 : where each `PhaseAction` represents a PDAL that will be executed on
216 : the parallel component during the specified phase. The %Actions are
217 : executed in the order that they are given in the `tmpl::list`s of
218 : the PDALs, but the phases need not be run in linear order. However,
219 : `db::DataBox` types are constructed assuming the phases are
220 : performed from first in the `phase_dependent_action_list` to the
221 : last. Simple actions (described below) can be executed in any
222 : phase. If there are no iterable actions in a phase then a
223 : `PhaseAction` need not be specified for that phase. However, at
224 : least one `PhaseAction`, even if it is empty, must be specified.
225 : 4. `using simple_tags_from_options` which is a `tmpl::list` of all the tags
226 : that will be inserted into the initial `db::DataBox` of each component.
227 : These tags are db::SimpleTag%s that have have a `using option_tags`
228 : type alias and a static function `create_from_options` (see the
229 : example below). This list can usually be constructed from the
230 : initialization actions of the component (i.e. the list of actions
231 : in the `PhaseAction` list for the `Initialization` phase) using the
232 : helper function `Parallel::get_simple_tags_from_options` (see the
233 : examples of components below). Each initialization action may
234 : specify a type alias `using simple_tags_from_options` which are a
235 : `tmpl::list` of tags that will be fetched from the db::DataBox by the
236 : action.
237 : 5. `using const_global_cache_tags` is set to a `tmpl::list` of tags
238 : that are required by the `allocate_array` function of an array
239 : component, or simple actions called on the parallel component.
240 : These tags correspond to const items that are stored in the
241 : Parallel::GlobalCache (of which there is one copy per Charm++
242 : node). The alias can be omitted if the list is empty.
243 : 6. `using mutable_global_cache_tags` is set to a `tmpl::list` of tags
244 : that correspond to mutable items that are stored in the
245 : Parallel::GlobalCache (of which there is one copy per Charm++
246 : core). The alias can be omitted if the list is empty.
247 : 7. `array_allocation_tags` is set to a `tmpl::list` of tags that will be
248 : constructed from options and will only be used in the `allocate_array`
249 : function of an array component. This type alias is only required for array
250 : components.
251 :
252 : \parblock
253 : \note Array parallel components must also specify the type alias `using
254 : array_index`, which is set to the type that indexes the %Parallel Component
255 : Array. Charm++ allows arrays to be 1 through 6 dimensional or be indexed by a
256 : custom type. The Charm++ provided indexes are wrapped as
257 : `Parallel::ArrayIndex1D` through `Parallel::ArrayIndex6D`. When writing custom
258 : array indices, the [Charm++ manual](http://charm.cs.illinois.edu/help) tells you
259 : to write your own `CkArrayIndex`, but we have written a general implementation
260 : that provides this functionality (see `Parallel::ArrayIndex`); all that you need
261 : to provide is a plain-old-data
262 : ([POD](http://en.cppreference.com/w/cpp/concept/PODType)) struct of the size of
263 : at most 3 integers.
264 : \endparblock
265 :
266 : \parblock
267 : \note Singletons use an `array_index` of type `int`, but users need not specify
268 : this. It is already specified in the implementation of a singleton.
269 : \endparblock
270 :
271 : %Parallel array components have a static `allocate_array` function
272 : that is used to construct the elements of the array. The
273 : signature of the `allocate_array` functions must be:
274 : \code
275 : static void allocate_array(
276 : Parallel::CProxy_GlobalCache<metavariables>& global_cache,
277 : const tuples::tagged_tuple_from_typelist<simple_tags_from_options>&
278 : initialization_items,
279 : const tuples::tagged_tuple_from_typelist<array_allocation_tags>&
280 : array_allocation_items,
281 : const std::unordered_set<size_t>& procs_to_ignore);
282 : \endcode
283 : The `allocate_array` function is called by the Main parallel component
284 : when the execution starts and will typically insert elements into
285 : array parallel components. If the `allocate_array` function depends
286 : upon input options that are not in the GlobalCache, those tags should be
287 : added to the `array_allocation_tags` type alias. A TaggedTuple is constructed
288 : from this type alias and its input options and is only available in the
289 : `allocate_array` function. All other tags that will be constructed from options
290 : and used during the %Initialization phase should be placed in the
291 : `simple_tags_from_options` type alias.
292 : This type alias is a `tmpl::list` of tags which
293 : are db::SimpleTag%s that have have a `using option_tags` type alias
294 : and a static function `create_from_options`. They only need to be explicitly
295 : added to the list if no initialization action has added them to its
296 : `simple_tags_from_options` type alias. If you want to ignore specific
297 : processors when placing array elements, you can pass in a
298 : `std::unordered_set<size_t>` to `allocate_array` that contains all the
299 : processors that shouldn't have array elements on them.
300 :
301 : The `allocate_array` functions of different
302 : array components are called in random order and so it is not safe to
303 : have them depend on each other.
304 :
305 : Each parallel component must also decide what to do in the different phases of
306 : the execution. This is controlled by an `execute_next_phase` function with
307 : signature:
308 : \code
309 : static void execute_next_phase(
310 : const Parallel::Phase next_phase,
311 : const Parallel::CProxy_GlobalCache<metavariables>& global_cache);
312 : \endcode
313 : Parallel::Main<Metavariables>::execute_next_phase`
314 : determines the next phase, after
315 : which the `execute_next_phase` function of each component gets called. The
316 : `execute_next_phase` function determines what the parallel component should do
317 : during the next phase. Typically the `execute_next_phase` function should just
318 : call `start_phase(phase)` on the parallel component.
319 :
320 : ## 3. Examples {#dev_guide_parallelization_component_examples}
321 :
322 : An example of a singleton parallel component is:
323 : \snippet Test_AlgorithmParallel.cpp singleton_parallel_component
324 :
325 : An example of an array parallel component is:
326 : \snippet Test_AlgorithmParallel.cpp array_parallel_component
327 :
328 : There are some parallel components that are common to many executables.
329 :
330 : - Parallel::GlobalCache (a Parallel::Algorithms::Nodegroup)
331 : - DgElementArray (a Parallel::Algorithms::Array)
332 : - observers::Observer (a Parallel::Algorithms::Group)
333 : - observers::ObserverWriter (a Parallel::Algorithms::Nodegroup)
334 :
335 : The MutableGlobalCache deserves special mention, which is why is has its own
336 : section with instructions on how to use it. See [Mutable items in the
337 : GlobalCache](#dev_guide_parallelization_mutable_global_cache).
338 :
339 : ## 4. Placement {#dev_guide_parallelization_component_placement}
340 :
341 : The user has some control over where parallel components get placed on the
342 : resources it is running on. Here is a figure that illustrates how one may place
343 : parallel components.
344 :
345 : \image html charm_node_structure.png "Parallel component placement."
346 :
347 : In this example we are running on three (3) nodes that have four (4) cores each.
348 : For all our executables, we reserve one core of each node purely for
349 : communication purposes. Nothing else is run on this core. Because of this, what
350 : Charm++ calls a node, doesn't correspond to a full node on a supercomputer. A
351 : charm-node simply corresponds to a collection of cores on a physical node. In
352 : our case, a charm-node is represented by the remaining cores on a node not used
353 : for communication (i.e. the first charm-node corresponds to cores 1-3 on the
354 : first physical node). Also the definition of a charm-core doesn't necessarily
355 : have to correspond to an actual core (it could correspond to a hyperthreaded
356 : virtual core), however, for our purposes, it does.
357 :
358 : SpECTRE offers wrappers around Charm++ functions that will tell you the total
359 : number of charm-nodes/cores in an executable and what charm-node/core a parallel
360 : component is on. (In the following examples, the type `T` is an `int` or a
361 : `size_t`)
362 :
363 : - `Parallel::my_node<T>()` returns the charm-node that the parallel component is
364 : on. In the figure, `Sing. 4` would return `2`.
365 : - `Parallel::my_proc<T>()` returns the charm-core that the parallel component is
366 : on. In the figure, `Sing. 4` would return `6` (*not* `9`).
367 : - `Parallel::number_of_nodes<T>()` returns the total number of charm-nodes in an
368 : executable. The above figure would have `3` charm-nodes.
369 : - `Parallel::number_of_procs<T>()` returns the total number of charm-cores in an
370 : executable. The above figure would have `9` charm-cores (*not* `12`).
371 :
372 : \note For Charm++ SMP (shared memory parallelism) builds, a node corresponds to
373 : a collection of cores on a physical node, and a core corresponds to a processor
374 : on that physical node. However, for non-SMP builds, nodes and cores are
375 : equivalent. All of our builds are done with Charm++ SMP so nodes and cores have
376 : their usual definitions.
377 :
378 : The placement of Groups and Nodegroups are determined by Charm++. This is
379 : because a Group is on every charm-core and a Nodegroup is on every charm-node.
380 : Even though Nodegroups are one per charm-node, the user can't choose which core
381 : is used on the charm-node. They run on the next available charm-core on the
382 : charm-node.
383 :
384 : The Elements of an Array, however, can be placed on specific charm-cores. They
385 : are inserted into the Array by using the Charm++ `insert` member function of the
386 : CProxy for the Array. The `insert` function is documented in the Charm++ manual.
387 : In the Array example in the
388 : [Examples](#dev_guide_parallelization_component_examples) section, `array_proxy`
389 : is a `CProxy` and so all the documentation for Charm++ array proxies applies.
390 : SpECTRE always creates empty arrays with the constructor and requires users to
391 : insert however many elements they want and on which charm-cores they want them
392 : to be placed. Note that load balancing calls may result in array elements being
393 : moved.
394 :
395 : In a similar fashion, Singletons can also be placed on specific charm-cores.
396 : This can be specified in the input file.
397 :
398 : From an input file, there are two ways to specify where Array/Singleton parallel
399 : components can be placed.
400 :
401 : ```yaml
402 : ResourceInfo:
403 : AvoidGlobalProc0: true
404 : Singletons:
405 : AhA:
406 : Proc: 12
407 : Exclusive: true
408 : AhB:
409 : Proc: Auto
410 : Exclusive: false
411 : ```
412 :
413 : First is the `AvoidGlobalProc0` option. This option will tell the program to not
414 : put *any* Array Elements or Singletons on the global zeroth charm-core. This
415 : core is sometimes used to write data to disk which is typically much slower
416 : than the program execution. The second is the `Singletons:` option. You can
417 : set the value to `Auto`, and then each singleton will have their proc be chosen
418 : automatically and they won't be exclusive. Otherwise, you must specify options
419 : for each singleton as in the example above. `AhA` is the `pretty_type::name()`
420 : of a Singleton in the program and the user has a choice of which proc to place
421 : the Singleton on (`Auto` will let the program decide) and whether to exclude
422 : Array Elements or other Singletons from being put on this core. This is useful
423 : in case the Singleton does some expensive computation that shouldn't be slowed
424 : down by having lots of Array Elements on the same core. In the figure above,
425 : `AvoidGlobalProc0` is true, and `Sing. 2` requested to be exclusively on core
426 : `2`.
427 :
428 : # Actions {#dev_guide_parallelization_actions}
429 :
430 : %Actions are structs with a static `apply` method and come in five
431 : variants: simple actions, iterable actions, reduction actions,
432 : threaded actions, and local synchronous actions.
433 :
434 : The signature of `apply` methods differs for the different types of
435 : actions, but all types have the same general form. Actions receive a
436 : `db::DataBox`, the Parallel::GlobalCache, and their element's index
437 : and parallel component, as well as arguments specific to the action
438 : type.
439 :
440 : The `db::DataBox` should be thought of as the member data of the parallel
441 : component while the actions are the member functions. The combination of a
442 : `db::DataBox` and actions allows building up classes with arbitrary member data
443 : and methods using template parameters and invocation of actions. This approach
444 : allows us to eliminate the need for users to work with Charm++'s interface
445 : files, which can be error prone and difficult to use.
446 :
447 : The Parallel::GlobalCache is passed to each action so that the
448 : action has access to global data and is able to invoke actions on
449 : other parallel components. The `ParallelComponent` template parameter
450 : is the tag of the parallel component that invoked the action. A proxy
451 : to the calling parallel component can then be retrieved from the
452 : Parallel::GlobalCache. The remote entry method invocations are
453 : slightly different for different types of actions, so they will be
454 : discussed below. However, one thing that is disallowed for all actions
455 : is calling an action locally from within an action on the same
456 : parallel component. Specifically,
457 :
458 : \snippet Test_AlgorithmNestedApply1.cpp bad_recursive_call
459 :
460 : Here `Parallel::local()` is a wrapper around `ckLocal()` which is a Charm++
461 : provided method that returns a pointer to the local (currently executing)
462 : parallel component. See the
463 : [Charm++ manual](http://charm.cs.illinois.edu/help) for more
464 : information. However, you are able to queue a new action to be
465 : executed later on the same parallel component by getting your own
466 : parallel component from the Parallel::GlobalCache
467 : (`Parallel::get_parallel_component<ParallelComponent>(cache)`). The
468 : difference between the two calls is that by calling an action through
469 : the parallel component you will first finish the series of actions you
470 : are in, then when they are complete Charm++ will call the next queued
471 : action.
472 :
473 : Array, group, and nodegroup parallel components can have actions invoked in two
474 : ways. First is a broadcast where the action is called on all elements of the
475 : array:
476 :
477 : \snippet Test_AlgorithmParallel.cpp broadcast_to_group
478 :
479 : The second case is invoking an action on a specific array element by using the
480 : array element's index. The below example shows how a broadcast would be done
481 : manually by looping over all elements in the array:
482 :
483 : \snippet Test_AlgorithmParallel.cpp call_on_indexed_array
484 :
485 : Note that in general you will not know what all the elements in the array are
486 : and so a broadcast is the correct method of sending data to or invoking an
487 : action on all elements of an array parallel component.
488 :
489 : The `array_index` argument passed to all `apply` methods is the index into the
490 : parallel component array. If the parallel component is not an array the value
491 : and type of `array_index` is implementation defined and cannot be relied on.
492 :
493 : ## 1. Simple Actions {#dev_guide_parallelization_simple_actions}
494 :
495 : Simple actions can be thought of as member functions of remote objects
496 : (chares/parallel components). They are the direct analog of entry
497 : methods in Charm++ except that the member data is stored in the
498 : `db::DataBox` that is passed in as the first argument. A simple action
499 : must return void but can use `db::mutate` to change values of items in
500 : the `db::DataBox` if the `db::DataBox` is taken as a non-const
501 : reference. In some cases you will need specific items to be in the
502 : `db::DataBox` otherwise the action won't compile. To restrict which
503 : `db::DataBox`es can be passed you should use `Requires` in the
504 : action's `apply` function template parameter list. For example,
505 : \snippet Test_AlgorithmCore.cpp requires_action
506 : checks that `CountActionsCalled` is available in the box.
507 :
508 : Simple actions can be called using a `CProxy` (see the [Charm++
509 : manual](http://charm.cs.illinois.edu/help)), which is retrieved from
510 : the Parallel::GlobalCache using the parallel component struct and the
511 : `Parallel::get_parallel_component()` function. For example, the
512 : action above could be called as
513 : \snippet Test_AlgorithmCore.cpp simple_action_call
514 : Any arguments after the proxy are passed as additional arguments to
515 : the action's `apply` function.
516 :
517 : ## 2. Iterable Actions {#dev_guide_parallelization_iterable_actions}
518 :
519 : Iterable actions make up the algorithms described by the PDALs. These
520 : actions are executed one after the other until one of them cannot be
521 : evaluated. Their `apply` methods signature is
522 : \snippet Test_AlgorithmCore.cpp apply_iterative
523 : The `ActionList` type is the `tmpl::list` of iterable actions in the
524 : current phase. That is, it is equal to the `action_list` type alias
525 : in the current PDAL. The `inboxes` is a collection of the tags
526 : specified as `tmpl::list`s in the iterable actions' member type
527 : aliases `inbox_tags`. This collection represents data received from
528 : other chares using the `receive_data` function.
529 :
530 : Iterable actions can request that the algorithm be paused or halted
531 : for the current phase, and control which action in the current PDAL
532 : will be executed next. This is all done via the return value from the
533 : `apply` function. The `apply` function for iterable actions must
534 : return a Parallel::iterable_action_return_t which is a
535 : `std::tuple<Parallel::AlgorithmExecution, std::optional<size_t>>`. The
536 : first element of the tuple controls how algorithm execution continues.
537 : See the documentation of `Parallel::AlgorithmExecution` for the
538 : meanings of different values of that enum. The second element of the
539 : tuple is usually set to `std::nullopt` in order to continue iterating
540 : through the algorithm, but can be used to jump to a different action
541 : in the current PDAL. Most iterable actions will simply return
542 :
543 : \snippet Test_AlgorithmParallel.cpp iterable_action_return_continue_next_action
544 :
545 : An action that pauses the algorithm will return
546 :
547 : \snippet Test_AlgorithmParallel.cpp return_with_termination
548 :
549 : After an algorithm has been paused it can be restarted by passing
550 : `false` to the `set_terminate` method or by calling `receive_data(...,
551 : true)`. Since the order in which messages are received is undefined in
552 : most cases the `receive_data(..., true)` call should be used to
553 : restart the algorithm.
554 :
555 : The return value `Parallel::AlgorithmExecution::Retry` deserves
556 : special mention. It is intended for use by actions that use data
557 : supplied by other parallel objects to indicate that they have not
558 : received all of the data they require. The algorithm will stop until
559 : an operation that could supply data has occurred and then the action
560 : will be retried. An example of waiting because of missing data is
561 :
562 : \snippet Test_AlgorithmCore.cpp retry_example
563 :
564 : In order to jump to a specific action, the metafunction
565 : `tmpl::index_of<list, element>` can be used to get an
566 : `tmpl::integral_constant` with the value of the index of the element
567 : `element` in the typelist `list`. For example,
568 :
569 : \snippet Test_AlgorithmCore.cpp out_of_order_action
570 :
571 : The metafunction call `tmpl::index_of<ActionList,
572 : iterate_increment_int0>::%value` returns a `size_t` whose value is
573 : that of the action `iterate_increment_int0` in the PDAL. The indexing
574 : of actions in the PDAL starts at `0`.
575 :
576 : Iterable actions are invoked as part of the algorithm and so the only way
577 : to request they be invoked is by having the algorithm run on the parallel
578 : component. The algorithm can be explicitly evaluated in a new phase by calling
579 : `start_phase(Phase::TheCurrentPhase)`:
580 :
581 : \snippet Test_AlgorithmCore.cpp start_phase
582 :
583 : Alternatively, to evaluate the algorithm without changing phases the
584 : `perform_algorithm()` method can be used.
585 :
586 : By passing `true` to `perform_algorithm` the algorithm will be restarted if it
587 : was paused.
588 :
589 : The algorithm is also evaluated by calling the `receive_data` function, either
590 : on an entire array or singleton (this does a broadcast), or an on individual
591 : element of the array. Here is an example of a broadcast call:
592 :
593 : \snippet Test_AlgorithmParallel.cpp broadcast_to_group
594 :
595 : and of calling individual elements:
596 :
597 : \snippet Test_AlgorithmParallel.cpp call_on_indexed_array
598 :
599 : The `receive_data` function always takes a `ReceiveTag`, which is set
600 : in the actions' `inbox_tags` type aliases. The `inbox_tags` must have
601 : two member type aliases, a `temporal_id` which is used to identify
602 : when the data was sent, and a `type` which is the type of the data to
603 : be stored in the `inboxes`. The types are typically a
604 : `std::unordered_map<temporal_id, DATA>`. In the discussed scenario of
605 : waiting for neighboring elements to send their data the `DATA` type
606 : would be a `std::unordered_map<TheElementId, DataSent>`. Inbox tags
607 : must also specify a `static void insert_into_inbox()` function. For
608 : example,
609 :
610 : \snippet Test_AlgorithmParallel.cpp int_receive_tag
611 :
612 : For common types of `DATA`, such as a `map`, a data structure with an `insert`
613 : function, a data structure with a `push_back` function, or copy/move assignment
614 : that is used to insert the received data, inserters are available in
615 : `Parallel::InboxInserters`. For example, there is
616 : `Parallel::InboxInserters::Map` for `map` data structures. The inbox tag can
617 : inherit publicly off the inserters to gain the required insertion capabilities:
618 :
619 : \snippet Test_AlgorithmCore.cpp int receive tag insert
620 :
621 : Any inbox tag that uses Charm++ messages must also specify a `message_type` type
622 : alias which is the object that will be sent. An example is:
623 :
624 : \snippet Test_AlgorithmMessages.cpp charm message inbox tag
625 :
626 : The `inbox_tags` type alias for the action is:
627 :
628 : \snippet Test_AlgorithmParallel.cpp int_receive_tag_list
629 :
630 : An inbox tag can also optionally specify a static function called `output_inbox`
631 : that returns a `std::string`. This function can be used for printing the
632 : contents of the inbox in a nice way as the types can sometimes get complicated.
633 : You can also use the `Parallel::output_inbox` function to output a specific
634 : inbox from all the inboxes. See an above example for the signature of the
635 : `output_inbox` function.
636 :
637 : \warning
638 : It is the responsibility of the iterable action to remove data from the inboxes
639 : that will no longer be needed.
640 :
641 : Normally when remote functions are invoked they go through the Charm++ runtime
642 : system, which adds some overhead. The `receive_data` function tries to elide
643 : the call to the Charm++ RTS for calls into array components. Charm++ refers to
644 : these types of remote calls as "inline entry methods". With the Charm++ method
645 : of eliding the RTS, the code becomes susceptible to stack overflows because
646 : of infinite recursion. The `receive_data` function is limited to at most 64 RTS
647 : elided calls, though in practice reaching this limit is rare. When the limit is
648 : reached the remote method invocation is done through the RTS instead of being
649 : elided.
650 :
651 : ## 3. Reduction Actions {#dev_guide_parallelization_reduction_actions}
652 :
653 : Reduction actions are the targets of reducing data over an array. For
654 : example, you may want to know the sum of a `int` from every element in
655 : the array. You can do this as follows:
656 :
657 : \snippet Test_AlgorithmReduction.cpp contribute_to_reduction_example
658 :
659 : This reduces over the parallel component
660 : `ArrayParallelComponent<Metavariables>`, reduces to the parallel component
661 : `SingletonParallelComponent<Metavariables>`, and calls the action
662 : `ProcessReducedSumOfInts` after the reduction has been performed. The reduction
663 : action is:
664 :
665 : \snippet Test_AlgorithmReduction.cpp reduce_sum_int_action
666 :
667 : As you can see, the last argument to the `apply` function is of type `int`, and
668 : is the reduced value.
669 :
670 : You can also broadcast the result back to an array, even yourself. For example,
671 :
672 : \snippet Test_AlgorithmReduction.cpp contribute_to_broadcast_reduction
673 :
674 : It is often necessary to reduce custom data types, such as `std::vector` or
675 : `std::unordered_map`. Charm++ supports such custom reductions, and so does our
676 : layer on top of Charm++.
677 : Custom reductions require one additional step to calling
678 : `contribute_to_reduction`, which is writing a reduction function to reduce the
679 : custom data. We provide a generic type that can be used in custom reductions,
680 : `Parallel::ReductionData`, which takes a series of `Parallel::ReductionDatum` as
681 : template parameters and `ReductionDatum::value_type`s as the arguments to the
682 : constructor. Each `ReductionDatum` takes up to four template parameters (two
683 : are required). The first is the type of data to reduce, and the second is a
684 : binary invokable that is called at each step of the reduction to combine two
685 : messages. The last two template parameters are used after the reduction has
686 : completed. The third parameter is an n-ary invokable that is called once the
687 : reduction is complete, whose first argument is the result of the reduction. The
688 : additional arguments can be any `ReductionDatum::value_type` in the
689 : `ReductionData` that are before the current one. The fourth template parameter
690 : of `ReductionDatum` is used to specify which data should be passed. It is a
691 : `std::index_sequence` indexing into the `ReductionData`.
692 :
693 : \warning
694 : All elements of the array must call the same reductions in the same order. It is
695 : *defined* behavior to do multiple reductions at once as long as all contribute
696 : calls on all array elements occurred in the same order. It is **undefined**
697 : behavior if the contribute calls are made in different orders on different array
698 : elements.
699 :
700 : ## 4. Threaded Actions {#dev_guide_parallelization_threaded_actions}
701 :
702 : Threaded actions are similar to simple actions, with the difference
703 : being that multiple threaded actions may be running on the same chare
704 : at the same time (potentially in parallel with one simple or reduction
705 : action). The `apply` function for a threaded actions has the same
706 : signature as that for a simple action, except that it also receives a
707 : `NodeLock` intended to control access to the chare's `db::DataBox`.
708 : All access to the `db::DataBox`, including read-only access, must
709 : occur while the action owns this lock. (Simple and reduction actions
710 : implicitly hold the lock for their entire execution.)
711 :
712 : \snippet Test_AlgorithmNodelock.cpp threaded_action_example
713 :
714 : Threaded actions can only be run on nodegroup chares.
715 :
716 : ## 5. Local Synchronous Actions {#dev_guide_parallelization_local_synchronous_actions}
717 :
718 : There is limited ability to retrieve data held by another parallel component via
719 : a direct synchronous call. Unlike the above actions, the invocation of a
720 : synchronous action is precisely a call to a member function of another parallel
721 : component; therefore, these invocations will run to completion, and return their
722 : result before the calling code proceeds in execution.
723 :
724 : Aside from being synchronous and being able to return data, local
725 : synchronous actions behave the same as threaded actions, except that
726 : they will only run on the chare of a nodegroup that is on the local
727 : node.
728 :
729 : Local synchronous actions' `apply` functions follow a signature motivated by
730 : threaded actions, but take fewer arguments. This may be a bug.
731 :
732 : Local synchronous actions must specify their return type in a
733 : `return_type` type alias. This is to help simplify the logic with the
734 : variant `db::DataBox` held by the parallel component.
735 :
736 : An example of a definition of a local synchronous action:
737 :
738 : \snippet Test_AlgorithmLocalSyncAction.cpp synchronous_action_example
739 :
740 : And the corresponding invocation:
741 :
742 : \snippet Test_AlgorithmLocalSyncAction.cpp synchronous_action_invocation_example
743 :
744 : \warning Say an action is being run on a component on a node where a mutable
745 : item in the GlobalCache is up-to-date. This action then calls another action on
746 : a different component on a different node. It is **NOT** guaranteed that this
747 : mutable item in the GlobalCache is up-to-date on this new node. It is up to the
748 : user to ensure the mutable item is up-to-date on whatever node they run an
749 : action on, even if it was up-to-date on the node that sent the message. The
750 : \ref dev_guide_parallelization_mutable_global_cache
751 : "Mutable items in the GlobalCache" section gives details about mutable
752 : GlobalCache items.
753 :
754 : # Mutable items in the GlobalCache {#dev_guide_parallelization_mutable_global_cache}
755 :
756 : Most items in the GlobalCache are constant, and are specified
757 : by type aliases called `const_global_cache_tags` as
758 : described above. However, the GlobalCache can also store mutable
759 : items. Because of asynchronous execution, **EXTREME** care must be taken when
760 : mutating items in the GlobalCache, as described below.
761 :
762 : A mutable item can be of any type, as long as that type is something
763 : that can be checked for whether it is "up-to-date". Here "up-to-date"
764 : means that the item can be safely used (even read-only) without
765 : needing to be mutated first. For example, a mutable item might be a
766 : function of time that knows the range of times for which it is valid;
767 : the mutable item is then deemed up-to-date if it will be called for a
768 : time within its range of validity, and it is deemed not up-to-date if
769 : it will be called for a time outside its range of validity. Thus the
770 : up-to-date status of a mutable item is determined by both the state of
771 : the item itself and by the code that wishes to use that item.
772 :
773 : ## 1. Specification of mutable GlobalCache items
774 :
775 : Mutable GlobalCache items are specified by a
776 : type alias `mutable_global_cache_tags`, which is treated the same way
777 : as `const_global_cache_tags` for const items.
778 :
779 : ## 2. Use of mutable GlobalCache items
780 :
781 : ### 1. Checking if the item is up-to-date
782 :
783 : Because execution is asynchronous, any code that uses a mutable item
784 : in the GlobalCache must first check whether that item is up-to-date.
785 : The information about whether an item is up-to-date is assumed to be
786 : stored in the item itself. For example, a mutable object stored in
787 : the GlobalCache might have type `std::map<temporal_id,T>` (for some
788 : type `T`), and then any code that uses the stored object can check
789 : whether an entry exists for a particular `temporal_id`. To avoid
790 : race conditions, it is
791 : important that up-to-date checks are based on something that is
792 : independent of the order of mutation (like a `temporal_id`, and not
793 : like checking the size of a vector).
794 :
795 : To check an item, use the function
796 : `Parallel::mutable_cache_item_is_ready`, which returns a bool
797 : indicating whether the item is up-to-date. If the item is up-to-date,
798 : then it can be used. `Parallel::mutable_cache_item_is_ready` takes a
799 : lambda as an argument. This lambda is passed a single argument: a
800 : const reference to the item being retrieved. The lambda should
801 : determine whether the item is up-to-date. If so, it should return a
802 : default_constructed `std::unique_ptr<Parallel::Callback>`; if not, it should
803 : return a `std::unique_ptr<Parallel::Callback>` to a callback function that will
804 : be called on the next `Parallel::mutate` of that item. The callback
805 : will typically check again if the item is up-to-date and if so will
806 : execute some code that gets the item via `Parallel::get`.
807 :
808 : For the case of iterable actions,
809 : `Parallel::mutable_cache_item_is_ready` typically uses the callback
810 : `Parallel::PerformAlgorithmCallback`. In the example below, the
811 : vector is considered up-to-date if it is non-empty. If the vector is
812 : not up-to-date, then when it becomes up-to-date the callback function
813 : will be invoked; in this case the callback function re-runs
814 : `perform_algorithm()`, which will call the same action again.
815 :
816 : \snippet Test_AlgorithmGlobalCache.cpp check_mutable_cache_item_is_ready
817 :
818 : Note that `Parallel::mutable_cache_item_is_ready` is called on the local
819 : node and does no parallel communication.
820 :
821 : ### 2. Retrieving the item
822 :
823 : The item is retrieved using `Parallel::get` just like for constant items.
824 : For example, to retrieve the item `Tags::VectorOfDoubles`:
825 : \snippet Test_AlgorithmGlobalCache.cpp retrieve_mutable_cache_item
826 :
827 : Note that `Parallel::get` is called on the local node and does no
828 : parallel communication.
829 :
830 : Whereas we support getting *non-mutable* items in the GlobalCache from
831 : a DataBox via `db::get`, we intentionally do not support
832 : `db::get` of *mutable* items in the GlobalCache from a DataBox.
833 : The reason is that mutable
834 : items should be retrieved only after a `Parallel::mutable_cache_item_is_ready`
835 : check, and being able to retrieve a mutable item from a DataBox makes it
836 : difficult to enforce that check, especially when automatically-executing
837 : compute items are considered.
838 :
839 : ## 3. Modifying a mutable GlobalCache item
840 :
841 : To modify a mutable item, pass `Parallel::mutate` two template
842 : parameters: the tag to mutate, and a struct with an `apply` function
843 : that does the mutating. `Parallel::mutate` takes two arguments:
844 : a reference to the local GlobalCache, and a tuple that is passed into the
845 : mutator function. For the following example,
846 :
847 : \snippet Test_AlgorithmGlobalCache.cpp mutate_global_cache_item
848 :
849 : the mutator function is defined as below:
850 : \snippet Test_AlgorithmGlobalCache.cpp mutate_global_cache_item_mutator
851 :
852 : `Parallel::mutate` broadcasts to every node, where it calls the
853 : mutator function and then calls all the callbacks that have been set
854 : on that node by `Parallel::mutable_cache_item_is_ready`. The
855 : `Parallel::mutate` operation is guaranteed to be thread-safe without
856 : any further action by the developer so long as the item being mutated can be
857 : mutated in a threadsafe way. See the `Parallel::GlobalCache` docs for more
858 : details.
859 :
860 : # Charm++ Node and Processor Level Initialization Functions {#dev_guide_parallelization_charm_node_processor_level_initialization}
861 :
862 : Charm++ allows running functions once per core and once per node before the
863 : construction of any parallel components. This is commonly used for setting up
864 : error handling and enabling floating point exceptions. Other functions could
865 : also be run. Which functions are run on each node and core is set by calling
866 : `Parallel::charmxx::register_init_node_and_proc` in `CkRegisterMainModule()`
867 : with function pointers to the functions to be called. For example:
868 :
869 : \snippet Test_AlgorithmPhaseControl.cpp charm_init_funcs_example
|