\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond
# Observers Infrastructure {#observers_infrastructure_dev_guide}

The observers infrastructure works with two parallel components: a group and a
nodegroup. We have two types of observations: `Reduction` and `Volume` (see the
enum `observers::TypeOfObservation`). `Reduction` data is anything that is
written once per time/integral identifier per simulation. Some examples of
reduction data are integrals or L2 norms over the entire domain, integrals or
L2 norms over part of the domain, and integrals over lower-dimensional surfaces
such as apparent horizons or slices through the domain. Volume data is anything
that has physical extent, such as any of the evolved variables (or derived
quantities thereof) across all or part of the domain, or quantities on
lower-dimensional surfaces in the domain (e.g. the rest mass density in the
xy-plane). Reduction and volume data both use the group and nodegroup for
actually getting the data to disk, but each does so in a slightly different
manner.

### Reduction Data

Reduction data requires combining information from many or all cores of a
supercomputer to obtain a single value. Reductions are tagged by some temporal
value, which for hyperbolic systems is the time and for elliptic systems some
combination of linear and nonlinear iteration counts. The reduction data is
stored in an object of type `Parallel::ReductionData`, which takes as template
parameters a series of `Parallel::ReductionDatum`. A `Parallel::ReductionDatum`
takes as template parameters the type of the data and the operators that define
how data from the different cores are combined into a single value. See the
paragraphs below for more detail, and the documentation of
`Parallel::ReductionDatum` for examples.

At the start of a simulation, every component and event that wants to perform a
reduction for observation, or will be part of a reduction observation, must
register with the `observers::Observer` component. The `observers::Observer` is
a group, which means there is one per core. The registration is used so that
the `Observer` knows when all data for a specific reduction (both in time and
by name/ID) has been contributed. Reduction data is combined on each core as it
is contributed, using the binary operator from `Parallel::ReductionDatum`'s
second template parameter. Once all the data is collected on the core, it is
copied to the local `observers::ObserverWriter` nodegroup, which keeps track of
how many of the cores on the node will contribute to a specific observation and
again combines the data as it is contributed. Once all the node's data is
collected on the nodegroup, the data is sent to node `0`, which combines the
reduction data as it arrives, again using the binary operator from
`Parallel::ReductionDatum`'s second template parameter. Using node `0` for
collecting the final reduction data is an arbitrary choice, but we are always
guaranteed to have a node `0`.

Once all the reductions are received on node `0`, the `ObserverWriter` invokes
the `InvokeFinal` (third) template parameter of each `Parallel::ReductionDatum`
(this is the n-ary operator) in order to finalize the data before writing. This
is used, for example, for dividing by the total number of grid points in an L1
or L2 norm.
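To make the template structure concrete, here is a sketch of a reduction type
for an L2 norm of an error. The field layout and type alias name are
illustrative (and the include paths may differ across versions); the `funcl`
function objects and the trailing `std::index_sequence`, which selects the
other fields passed to `InvokeFinal`, follow the patterns documented on
`Parallel::ReductionDatum`.

```cpp
#include <cstddef>
#include <utility>

#include "Parallel/Reduction.hpp"    // Parallel::ReductionData(tum)
#include "Utilities/Functional.hpp"  // funcl::* function objects

// Reduces to sqrt(sum(squared_errors) / total_number_of_points):
using L2ErrorReduction = Parallel::ReductionData<
    // The observation time is the same on every core, so assert equality
    // instead of combining.
    Parallel::ReductionDatum<double, funcl::AssertEqual<>>,
    // Total number of grid points, summed over cores.
    Parallel::ReductionDatum<size_t, funcl::Plus<>>,
    // Sum of squared errors: combined with the binary Plus as data arrives;
    // once everything is on node 0, the n-ary `InvokeFinal` divides by
    // field 1 (the grid-point count) and takes the square root.
    Parallel::ReductionDatum<double, funcl::Plus<>,
                             funcl::Sqrt<funcl::Divides<>>,
                             std::index_sequence<1>>>;
```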
The reduction data is then written to an HDF5 file whose name is set in the
input file using the option `observers::Tags::ReductionFileName`. Specifically,
the data is written into an `h5::Dat` subfile since, along with the data, the
subfile name must be passed through the reductions.

The actions used for registering reductions are
`observers::Actions::RegisterEventsWithObservers`,
`observers::Actions::RegisterSingletonWithObserverWriter`, and
`observers::Actions::RegisterWithObservers`. There is a separate `Registration`
phase at the beginning of all simulations during which everything must register
with the observers. The action `observers::Actions::ContributeReductionData` is
used to send data to the `observers::Observer` component when a reduction is
performed across an array or a subset of an array. If a singleton parallel
component needs to write data directly to disk, it should use the
`observers::ThreadedActions::WriteReductionData` action called on the zeroth
element of the `observers::ObserverWriter` component.

### Volume Data

Volume data is loosely defined as anything that has some extent. For example,
in a 3d simulation, data on 2d surfaces is still considered volume data for the
purposes of observing data. The spectral coefficients can also be written as
volume data, though some care must be taken in that case to correctly identify
which mode is associated with which terms in the basis function expansion.
Whatever component will contribute volume data to be written must register with
the `observers::Observer` component (there is currently no tested support for
registering directly with the `observers::ObserverWriter`). This registration
is the same as in the reduction data case.

Once the observers are registered, data is contributed to the
`observers::Observer` component using the
`observers::Actions::ContributeVolumeData` action. The data is packed into a
`std::vector<TensorComponent>`, where each `TensorComponent` holds the data
from just one tensor component or from a reduction over a tensor. The
`extents`, `Spectral::Basis`, and `Spectral::Quadrature` are currently also
passed to the `ContributeVolumeData` action, as sketched below. Once all the
elements on a single core have contributed their volume data to the
`observers::Observer` group, the `observers::Observer` group moves its data to
the `observers::ObserverWriter` component to be written. We write one file per
node, appending the node ID to the HDF5 file name to distinguish between files
written by different nodes. The HDF5 file name is specified in the input file
using the `observers::Tags::VolumeFileName` option. The data is written into a
subfile of the HDF5 file using the `h5::VolumeFile` class.
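Here is a sketch of the packing step, assuming one `DataVector` per tensor
component. The helper function, the component names, and the include path are
hypothetical; only `TensorComponent` and
`observers::Actions::ContributeVolumeData` are the real pieces named above.

```cpp
#include <string>
#include <vector>

#include "DataStructures/DataVector.hpp"
#include "DataStructures/TensorData.hpp"  // TensorComponent; path may differ

// Hypothetical helper: pack per-component DataVectors for one element into
// the std::vector<TensorComponent> expected by ContributeVolumeData.
std::vector<TensorComponent> pack_volume_data(
    const DataVector& rest_mass_density, const DataVector& velocity_x) {
  std::vector<TensorComponent> components{};
  // Each tensor component is written as a separately named dataset.
  components.emplace_back("RestMassDensity", rest_mass_density);
  components.emplace_back("Velocity_x", velocity_x);
  // `components`, together with the element's extents, Spectral::Basis,
  // and Spectral::Quadrature, is then sent to the observers::Observer
  // group via observers::Actions::ContributeVolumeData (call elided).
  return components;
}
```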
### Threading and NodeLocks

Since the `observers::ObserverWriter` class is a nodegroup, its entry methods
can be invoked simultaneously on different cores of the node. However, this can
lead to race conditions if care isn't taken. The biggest caution is that the
`DataBox` cannot be mutated on one core and simultaneously accessed on another.
This is because, in order to guarantee a reasonable state for data in the
`DataBox`, it must be impossible to perform a `db::get` on a `DataBox` from
inside of, or concurrently with, a `db::mutate`. What this means in practice is
that all entry methods on a nodegroup must put their `DataBox` accesses inside
of a `node_lock.lock()` and `node_lock.unlock()` block. To achieve better
parallel performance and threading, the amount of work done while the entire
node is locked should be minimized. To this end, we have additional locks: one
for the HDF5 files, since we do not require a threadsafe HDF5
(`observers::Tags::H5FileLock`), as well as locks for the objects mutated when
contributing reduction data (`observers::Tags::ReductionDataLock`) and for the
objects mutated when contributing volume data
(`observers::Tags::VolumeDataLock`). A standalone sketch of this locking
discipline is given at the end of this guide.

### Future changes

- It would be preferable to make the `Observer` and `ObserverWriter` parallel
  components more general and have them act as the core (node)group. Since any
  simple actions can be run on them, it should be possible to use them for
  most, if not all, cases where we need a (node)group.
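For reference, the locking discipline from the threading section above,
sketched with plain `std::mutex` in place of the Charm++ node locks. All names
and the bookkeeping here are illustrative stand-ins, not the actual
`ObserverWriter` interface.

```cpp
#include <mutex>
#include <unordered_map>

// Stand-ins for the per-node state guarded by the locks described above.
struct NodeState {
  std::mutex node_lock;            // guards DataBox-style access
  std::mutex h5_file_lock;         // cf. observers::Tags::H5FileLock
  std::mutex reduction_data_lock;  // cf. observers::Tags::ReductionDataLock
  std::unordered_map<int, double> reduction_data;  // data being accumulated
  // Number of outstanding contributions per observation, assumed to have
  // been set during the Registration phase.
  std::unordered_map<int, int> contributions_remaining;
};

void contribute_reduction(NodeState& state, const int observation_id,
                          const double value) {
  // Combine the contribution under the fine-grained lock so other entry
  // methods can keep running on the node.
  bool ready = false;
  {
    const std::lock_guard guard{state.reduction_data_lock};
    state.reduction_data[observation_id] += value;
    ready = (--state.contributions_remaining[observation_id] == 0);
  }
  if (not ready) {
    return;
  }
  // Lock the whole node only for the brief DataBox-style access, keeping
  // the critical section as short as possible.
  {
    const std::lock_guard guard{state.node_lock};
    // ... mutate per-node bookkeeping ...
  }
  // HDF5 is not assumed to be threadsafe, so the write takes the file lock.
  const std::lock_guard guard{state.h5_file_lock};
  // ... write state.reduction_data[observation_id] to disk ...
}
```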