\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond
# Observers Infrastructure {#observers_infrastructure_dev_guide}

\tableofcontents

The observers infrastructure works with two parallel components: a group and a
nodegroup. We have two types of observations: `Reduction` and `Volume` (see the
enum `observers::TypeOfObservation`). `Reduction` data is anything that is
written once per time/iteration identifier per simulation. Some examples of
reduction data are integrals or L2 norms over the entire domain, integrals or
L2 norms over part of the domain, and integrals over lower-dimensional surfaces
such as apparent horizons or slices through the domain. Volume data is anything
that has physical extent, such as any of the evolved variables (or derived
quantities thereof) across all or part of the domain, or quantities on
lower-dimensional surfaces in the domain (e.g. the rest mass density in the
xy-plane). Reduction and volume data both use the group and the nodegroup to
actually get the data to disk, but do so in slightly different manners.

### Reduction Data

Reduction data requires combining information from many or all cores of a
supercomputer to obtain a single value. Reductions are tagged by some temporal
value, which for hyperbolic systems is the time and for elliptic systems some
combination of linear and nonlinear iteration counts. The reduction data is
stored in an object of type `Parallel::ReductionData`, which takes as template
parameters a series of `Parallel::ReductionDatum`. A `Parallel::ReductionDatum`
takes as template parameters the type of the data and the operators that define
how data from the different cores are combined into a single value. See the
paragraphs below for more detail, and the documentation of
`Parallel::ReductionDatum` for examples.
At the start of a simulation, every component and event that wants to perform
a reduction for observation, or will be part of a reduction observation, must
register with the `observers::Observer` component. The `observers::Observer` is
a group, which means there is one per core. The registration lets the
`Observer` know once all data for a specific reduction (identified both by its
temporal value and by its name/ID) has been contributed. Reduction data is
combined on each core as it is contributed, using the binary operator from
`Parallel::ReductionDatum`'s second template parameter. Once all the data is
collected on the core, it is copied to the local `observers::ObserverWriter`
nodegroup, which keeps track of how many of the cores on the node will
contribute to a specific observation and again combines the data as it is
contributed. Once all the node's data is collected on the nodegroup, the data
is sent to node `0`, which combines the reduction data as it arrives using the
same binary operator. Using node `0` to collect the final reduction data is an
arbitrary choice, but we are always guaranteed to have a node `0`.

Once all the reductions are received on node `0`, the `ObserverWriter` invokes
the `InvokeFinal` (third) template parameter, an n-ary operator, on each
`Parallel::ReductionDatum` in order to finalize the data before writing. This
is used, for example, to divide by the total number of grid points in an L1 or
L2 norm. The reduction data is then written to an HDF5 file whose name is set
in the input file using the option `observers::Tags::ReductionFileName`.
Specifically, the data is written into an `h5::Dat` subfile since, along with
the data, the subfile name must be passed through the reductions.
The actions used for registering reductions are
`observers::Actions::RegisterEventsWithObservers` and
`observers::Actions::RegisterWithObservers`. There is a separate `Registration`
phase at the beginning of all simulations during which everything must register
with the observers. The action `observers::Actions::ContributeReductionData` is
used to send data to the `observers::Observer` component when a reduction is
done across an array or a subset of an array. If a singleton parallel component
or a specific chare needs to write data directly to disk, it should use the
`observers::ThreadedActions::WriteReductionDataRow` action called on the zeroth
element of the `observers::ObserverWriter` component.

### Volume Data

Volume data is loosely defined as anything that has some spatial extent. For
example, in a 3d simulation, data on 2d surfaces is still considered volume
data for the purposes of observing data. The spectral coefficients can also be
written as volume data, though some care must be taken in that case to
correctly identify which mode is associated with which terms in the basis
function expansion. Whatever component will contribute volume data to be
written must register with the `observers::Observer` component (there currently
isn't tested support for registering directly with the
`observers::ObserverWriter`). This registration is the same as in the reduction
data case.

Once the observers are registered, data is contributed to the
`observers::Observer` component using the
`observers::Actions::ContributeVolumeData` action. The data is packed into an
`ElementVolumeData` object that carries `TensorComponent`s on a grid.
Information on the grid, such as its extents, basis, and quadrature, is stored
alongside the `TensorComponent`s.
Once all the elements on a single core have contributed their volume data to
the `observers::Observer` group, the `observers::Observer` group moves its data
to the `observers::ObserverWriter` component to be written. We write one file
per node, appending the node ID to the HDF5 file name to distinguish between
files written by different nodes. The HDF5 file name is specified in the input
file using the `observers::Tags::VolumeFileName` option. The data is written
into a subfile of the HDF5 file using the `h5::VolumeFile` class.

If a singleton parallel component or a specific chare needs to write volume
data directly to disk, such as surface data from an apparent horizon, it
should use the `observers::ThreadedActions::WriteVolumeData` action called on
the zeroth element of the `observers::ObserverWriter` component. For surface
data (such as output from horizon finds), the data should be written to the
file specified by the `observers::Tags::SurfaceFileName` option.

### Threading and NodeLocks

Since the `observers::ObserverWriter` class is a nodegroup, its entry methods
can be invoked simultaneously on different cores of the node. However, this
can lead to race conditions if care isn't taken. The most important caveat is
that the `DataBox` cannot be mutated on one core while being simultaneously
accessed on another. To guarantee a consistent state for the data in the
`DataBox`, it must be impossible to perform a `db::get` on a `DataBox` from
inside, or concurrently with, a `db::mutate`. In practice, this means that all
entry methods on a nodegroup must put their `DataBox` accesses inside a
`node_lock.lock()` and `node_lock.unlock()` block. To achieve better parallel
performance and threading, the amount of work done while the entire node is
locked should be minimized.
To this end, we have additional locks: one for the HDF5 files, because we do
not require a threadsafe HDF5 (`observers::Tags::H5FileLock`); one for the
objects mutated when contributing reduction data
(`observers::Tags::ReductionDataLock`); and one for the objects mutated when
contributing volume data (`observers::Tags::VolumeDataLock`).

### Future changes
- It would be preferable to make the `Observer` and `ObserverWriter` parallel
  components more general and have them act as the core (node)group. Since any
  simple action can be run on them, it should be possible to use them for
  most, if not all, cases where we need a (node)group.