\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond

# Notes on SpECTRE load-balancing using Charm++'s built-in load balancers {#load_balancing_notes}

\tableofcontents

The goal of load-balancing (LB) is to ensure that HPC resources are well-used
while performing large inhomogeneous simulations. In 2020-2021, Jordan Moxon
(JM) and Francois Hebert (FH) performed a number of tests using Charm++'s
built-in load balancers with SpECTRE.

These notes highlight the key points and give (at the bottom) some general
recommendations.

### Overview of how LBs work with SpECTRE

In late 2020 FH tested Charm's LBs on simple, homogeneous SpECTRE test cases.
These tests revealed the following broad behavior patterns:
- Without using any LBs (no LB command-line args, or `+balancer NullLB`),
  SpECTRE's performance depends sensitively on how the DG elements are
  distributed over the HPC system. This indicates that communication costs are
  very important in SpECTRE runs. This statement remains true for more expensive
  evolution systems like Generalized Harmonic.
- Charm's LBs that are not communication-aware (e.g., `GreedyLB`, `RefineLB`,
  ...) all result in low parallel efficiency (20-30%). This is consistent with
  the understanding that communication costs are large. A good initial
  distribution of elements over processors will be degraded by these LBs,
  leading to a more complicated communication graph and a loss of performance.
- Some of Charm's communication-aware LBs perform well: they approach the
  efficiency of a "manually tuned" initial distribution of elements onto
  processors. This suggests these LBs do a good job of partitioning the
  communications graph. In FH's simple tests, the best results were from
  `RecBipartLB`, which came within 10-20% of a manual initial distribution.
  However, it is a slow algorithm and is best used infrequently or only a few
  times near the start of the simulation.

Note that at the 2020 Charm++ Workshop, the Charm team recommended that we use
`MetisLB`, or that we combine `MetisLB` with `RefineLB` (syntax:
`+balancer MetisLB +balancer RefineLB`, which applies the first balancer on the
first invocation and the second balancer on all subsequent invocations, to
'polish' the results of the first LB). In practice, this failed for two reasons:
- `MetisLB` tends to fail with floating-point exceptions (FPEs).
- When falling back to the pairing of `RecBipartLB` followed by `RefineLB`, the
  run starts with good performance. However, within a few applications of
  `RefineLB`, the performance is heavily degraded (down to 20-30% efficiency).
  It appears that we should stick to the comm-aware LB strategies.

### Contaminated LB measurements on first invocation

There is reason to suspect that the Charm load balancing may incorrectly balance
the load when applied near the start of some simulations. This is because the
'one-time' setup of the system may involve nontrivial computation, and (e.g.
for numeric initial data) communication patterns between components that differ
significantly from the patterns during evolution. The first balancer
invocation, based partially on the measurements taken during this
unrepresentative initialization phase, can then give rise to a poorly chosen
balance and hurt performance.
This is the suspected cause of the poor performance that has been noticed
in cases of homogeneous load and numeric initial data in Generalized Harmonic
tests performed by Geoffrey Lovelace. JM confirmed the problem.
It appears that the balance is not similarly degraded in cases that do
not involve numeric initial data: some basic 2-node re-tests with Generalized
Harmonic by JM seem to produce a useful balance (improved performance) when not
using numeric initial data.

This issue has not been investigated to a completely satisfactory conclusion,
though the above explanation seems most plausible.

In cases for which it appears that the LB data is problematically impacted by
the setup of the evolution system, we can try two main strategies to mitigate
the problem:
- Apply the load balancer at least two times near the start of the simulation,
  with sufficient gaps to collect useful balancing information. The LB database
  in Charm is cleared every time a balance is applied, so the later balances
  during the evolution should be uncontaminated. This strategy has not yet
  been carefully tested. To do this, use an input file similar to
  ```
  PhaseChangeAndTriggers:
    - Trigger:
        Slabs:
          Specified:
            Values: [5, 10, 15]
      PhaseChanges:
        - VisitAndReturn(LoadBalancing)
  ```
- Use `LBTurnInstrumentOff` and `LBTurnInstrumentOn` to specifically exclude
  setup procedures from the LB instrumentation. First attempts indicate that
  this process might be challenging to accomplish correctly, and may require
  correspondence with the Charm developers to clarify at what points in the
  code execution those commands may be used, and precisely how they affect
  the load-balancing database. A first attempt by JM was to turn instrumentation
  off during array element construction, then turn instrumentation on for each
  element during the start of the `Evolve` phase, but that attempt led to a
  hang of the system, so the utility must have more constraints than were
  initially apparent.
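
The reasoning behind the first strategy can be illustrated with a toy Python
sketch. This is not Charm++ code: `greedy_assign` is a hypothetical stand-in
balancer that only sees accumulated per-element timings and ignores
communication, and the setup cost of 100 units on a single element is an
invented number chosen to make the effect visible. The sketch compares a
balance computed from measurements that include setup against one computed from
a database that holds only evolution steps.

```python
import heapq


def greedy_assign(measured, n_procs):
    """Toy balancer: place elements, most expensive first, on the
    currently least-loaded processor (hypothetical, not a Charm++ LB)."""
    heap = [(0.0, proc) for proc in range(n_procs)]  # (measured load, proc)
    heapq.heapify(heap)
    assignment = {}
    for elem in sorted(measured, key=measured.get, reverse=True):
        load, proc = heapq.heappop(heap)
        assignment[elem] = proc
        heapq.heappush(heap, (load + measured[elem], proc))
    return assignment


def max_step_load(assignment, step_cost, n_procs):
    """Cost of one evolution step on the busiest processor."""
    loads = [0.0] * n_procs
    for elem, proc in assignment.items():
        loads[proc] += step_cost[elem]
    return max(loads)


n_elements, n_procs, steps_measured = 8, 2, 5
# Homogeneous evolution: every element costs the same per step.
step_cost = {e: 1.0 for e in range(n_elements)}
# Assumed one-time setup cost concentrated on element 0.
setup_cost = {e: (100.0 if e == 0 else 0.0) for e in range(n_elements)}

# First invocation: the database contains setup plus a few steps.
contaminated = {e: setup_cost[e] + steps_measured * step_cost[e]
                for e in step_cost}
# Later invocation: the database was cleared, so only steps are measured.
clean = {e: steps_measured * step_cost[e] for e in step_cost}

first = greedy_assign(contaminated, n_procs)
second = greedy_assign(clean, n_procs)
print("busiest processor, first balance: ",
      max_step_load(first, step_cost, n_procs))
print("busiest processor, second balance:",
      max_step_load(second, step_cost, n_procs))
```

In this toy model the first balance isolates the element that was expensive
only during setup, leaving the other processor overloaded during evolution,
while the second balance recovers an even split.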

### Scotch load balancer

JM tested `ScotchLB`, and found better performance than with `RecBipartLB`. The
margin varied a great deal with the number of nodes used, but at multiple
points tested, the runtime was less than 65% of the `RecBipartLB` runtime.
The tests were performed with homogeneous load, but starting from the
round-robin element distribution. The indication is therefore that `ScotchLB`
is very effective at minimizing communication costs.

However, in practical applications, JM found that `ScotchLB` often generates
FPEs during the graph partition step and causes the simulation to crash.
[Charm++ issue #3401](https://github.com/UIUC-PPL/charm/issues/3401)
tracks the progress on determining the cause of the problem and fixing it in
Charm. The source of the problem has largely been identified, but the fix is
still pending.

`ScotchLB` will likely replace `RecBipartLB` as the most frequently recommended
centralized communication-based balancer for SpECTRE once the FPE bugs have
been fixed.

### General recommendations

#### Homogeneous loads

For homogeneous loads, it is likely best to omit load-balancing and just use
the z-curve distribution (default) to give a good initial distribution and use
that for the entire evolution. This means calling the SpECTRE executable with
no LB-related command-line args, or with `+balancer NullLB`.

You may find modest gains from using a communication-based load balancer, but
likely only from the 'extra' parallel components of the system that cause the
load to be not completely homogeneous (e.g., components like the interpolator
or horizon finder).
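
For readers unfamiliar with the idea behind a z-curve (Morton order) initial
distribution, the following Python sketch shows the general technique on a 2D
grid of elements. It is only an illustration: the function names are
hypothetical, the grid is a simplifying assumption, and SpECTRE's actual
implementation may differ in its details.

```python
def morton_index_2d(ix, iy, bits=16):
    """Interleave the bits of (ix, iy) to get a position along a z-curve."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code


def z_curve_distribution(nx, ny, n_procs):
    """Sort elements along the z-curve and hand out contiguous chunks,
    so that each processor tends to own spatially neighboring elements."""
    elements = sorted(
        ((ix, iy) for ix in range(nx) for iy in range(ny)),
        key=lambda elem: morton_index_2d(*elem),
    )
    n = len(elements)
    return {elem: (rank * n_procs) // n for rank, elem in enumerate(elements)}


# Example: a 4x4 grid of elements on 4 processors.
distribution = z_curve_distribution(4, 4, 4)
for iy in range(3, -1, -1):
    print(" ".join(str(distribution[(ix, iy)]) for ix in range(4)))
```

Because neighboring elements tend to land on the same processor, fewer element
boundaries cross processors, which is consistent with the observation that
communication costs dominate in these runs.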
If you need a very long evolution or intend to submit a large number of
evolutions, it may be worth experimenting to see whether 1-3 applications of
`RecBipartLB` (or `ScotchLB` once its bugs are fixed, see above) improve
performance for the system, for instance by using the input file:
```
PhaseChangeAndTriggers:
  - Trigger:
      Slabs:
        Specified:
          Values: [5, 10, 15]
    PhaseChanges:
      - VisitAndReturn(LoadBalancing)
```
and command-line args `+balancer RecBipartLB` (or `ScotchLB` when its
bugs are fixed). This may be particularly relevant for cases with numeric
initial data or other complicated setup procedures.

#### Inhomogeneous loads

Based on our experiments, we anticipate that using a load-balancer may
significantly improve runtimes with inhomogeneous loads. Our testing of this
case is far more sparse, but for SpECTRE executables, it probably remains
true that managing communication costs will be an important goal for the
balancer. It is likely worth attempting the evolution with a
periodically-applied centralized communication-aware balancer, e.g.:
```
PhaseChangeAndTriggers:
  - Trigger:
      Slabs:
        EvenlySpaced:
          Interval: 1000
          Offset: 5
    PhaseChanges:
      - VisitAndReturn(LoadBalancing)
```
paired with command-line args `+balancer RecBipartLB` (or `ScotchLB` when its
bugs are fixed).

Important considerations when choosing the interval with which to balance are:
- you will want to ensure that the balancer is applied frequently enough to
  prioritize expensive parts of the simulation before any relevant features
  'move' to other elements.
  For example, if a shock is moving across the
  simulation domain causing certain elements to be more expensive to compute,
  you want to balance often enough that the LB 'keeps up' with the movement of
  the shock.
- you will want to avoid balancing so frequently that the synchronization
  and balancer calculation themselves become a significant portion of runtime.

We have not yet taken much detailed data on using the load-balancers for
inhomogeneous loads, so more detailed tests determining their efficacy would be
valuable.
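
The trade-off in choosing the balancing interval can be explored with a toy
cost model before spending compute time on real tests. The Python sketch below
is purely illustrative: it assumes that the per-step cost grows linearly with
the number of steps since the last balance (as a feature drifts away from where
the balancer placed the work) and that each balancer invocation has a fixed
cost. The function name and all parameter values (`drift`, `lb_cost`, and so
on) are invented for illustration, not measured from SpECTRE.

```python
def total_runtime(n_steps, interval, step_cost=1.0, drift=0.001, lb_cost=50.0):
    """Toy model: imbalance grows linearly since the last balance, and
    every balance costs a fixed amount (synchronization + LB algorithm)."""
    total = 0.0
    steps_since_balance = 0
    for step in range(n_steps):
        if step > 0 and step % interval == 0:
            total += lb_cost         # pay for the load-balancing phase
            steps_since_balance = 0  # assume the balance is fully restored
        total += step_cost * (1.0 + drift * steps_since_balance)
        steps_since_balance += 1
    return total


candidates = [10, 100, 300, 1000, 10000]
runtimes = {interval: total_runtime(10000, interval) for interval in candidates}
best_interval = min(runtimes, key=runtimes.get)

for interval in candidates:
    print(f"interval {interval:>5}: modeled runtime {runtimes[interval]:.1f}")
print("best of the candidates:", best_interval)
```

With these assumed parameters, both very short and very long intervals are
penalized, and an intermediate interval gives the lowest modeled runtime. In
practice the drift rate and balancer cost would have to be estimated from
actual runs.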