SpECTRE v2026.04.01
The goal of load-balancing (LB) is to ensure that HPC resources are well-used while performing large inhomogeneous simulations. In 2020-2021, Jordan Moxon and Francois Hebert performed a number of tests using Charm++'s built-in load balancers with SpECTRE.
These notes highlight the key points and give (at the bottom) some general recommendations.
In late 2020, FH tested Charm's LBs on simple, homogeneous SpECTRE test cases. These tests revealed the following broad behavior patterns:
Note that at the 2020 Charm++ Workshop, the Charm team recommended that we use MetisLB, or that we combine MetisLB with RefineLB (syntax: +balancer MetisLB +balancer RefineLB, which applies the first balancer on the first invocation and the second balancer on all subsequent invocations, to 'polish' the results of the first LB). In practice, this failed for two reasons:
There is reason to suspect that Charm's load balancing may balance the load incorrectly when applied near the start of some simulations. The 'one-time' setup of the system may involve nontrivial computation and (e.g. for numeric initial data) communication patterns between components that differ significantly from the patterns during evolution. The first balancer invocation, based partly on measurements taken during this unrepresentative initialization phase, can then produce a poorly chosen balance and degrade performance. This is the suspected cause of the poor performance noticed in Generalized Harmonic tests with homogeneous load and numeric initial data performed by Geoffrey Lovelace. JM confirmed the problem. The balance does not appear to be similarly degraded in cases without numeric initial data: some basic 2-node re-tests with Generalized Harmonic by JM seem to produce a useful balance (improved performance) when not using numeric initial data.
This issue has not been investigated to a completely satisfactory conclusion, though the above explanation seems most plausible.
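The suspected mechanism can be illustrated with a toy model. The sketch below is not Charm++'s algorithm: it uses a simple greedy heuristic, and the costs are made-up numbers chosen only to show how a one-time setup cost folded into the measured load can lead to an assignment that is poorly balanced for the evolution itself.

```python
def greedy_assign(loads, n_procs):
    """Place items, heaviest first, on whichever processor is currently least loaded."""
    totals = [0.0] * n_procs
    owner = [None] * len(loads)
    for idx in sorted(range(len(loads)), key=lambda i: -loads[i]):
        proc = totals.index(min(totals))
        owner[idx] = proc
        totals[proc] += loads[idx]
    return owner


def max_proc_load(loads, owner, n_procs):
    """Return the load on the busiest processor for a given assignment."""
    totals = [0.0] * n_procs
    for load, proc in zip(loads, owner):
        totals[proc] += load
    return max(totals)


# Hypothetical numbers: 8 elements with identical per-step evolution cost.
evolution_cost = [1.0] * 8
# Hypothetical one-time setup cost, e.g. one element doing extra work
# while reading numeric initial data.
setup_cost = [6.0] + [0.0] * 7

# What the balancer "sees" if it is invoked right after setup.
measured = [s + e for s, e in zip(setup_cost, evolution_cost)]

owner_contaminated = greedy_assign(measured, n_procs=2)
owner_clean = greedy_assign(evolution_cost, n_procs=2)

# Judge both assignments by the cost that matters for the rest of the run.
contaminated_max = max_proc_load(evolution_cost, owner_contaminated, 2)
clean_max = max_proc_load(evolution_cost, owner_clean, 2)
print(contaminated_max, clean_max)
```

In this toy case the assignment built from setup-contaminated measurements leaves one processor with seven of the eight elements during evolution, while the assignment built from evolution-only measurements splits them evenly.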
In cases where the LB data appears to be adversely affected by the setup of the evolution system, we can try two main strategies to mitigate the problem:
JM tested ScotchLB and found better performance than with RecBipartLB. The margin varied considerably with the number of nodes used, but at several of the node counts tested, the ScotchLB runtime was less than 65% of the RecBipartLB runtime. The tests were performed with homogeneous load, but starting from the round-robin element distribution. This indicates that ScotchLB is very effective at minimizing communication costs.
However, in practical applications, JM found that ScotchLB often generates FPEs during the graph partition step, causing the simulation to crash. Charm++ issue #3401 tracks the effort to determine the cause of the problem and fix it in Charm. The source of the problem has largely been identified, but the fix is still pending.
ScotchLB will likely replace RecBipartLB as the most frequently recommended centralized, communication-based balancer for SpECTRE once the FPE bugs have been fixed.
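To see why a communication-aware balancer can help even when the load is homogeneous, it is useful to count how many nearest-neighbor pairs end up on different processors. The following sketch does this for a small 2D grid; the grid size, processor count, and 3x3 block layout are illustrative assumptions, and the block layout merely stands in for the kind of result a graph partitioner aims for (it is not output from ScotchLB or RecBipartLB).

```python
def cut_edges(owner, nx, ny):
    """Count neighboring element pairs on an nx-by-ny grid that live on different processors."""
    cuts = 0
    for x in range(nx):
        for y in range(ny):
            here = owner[x * ny + y]
            if x + 1 < nx and owner[(x + 1) * ny + y] != here:
                cuts += 1
            if y + 1 < ny and owner[x * ny + (y + 1)] != here:
                cuts += 1
    return cuts


nx = ny = 6      # hypothetical 6x6 grid of elements
n_procs = 4      # hypothetical processor count

# Round-robin: element i goes to processor i mod n_procs.
round_robin = [i % n_procs for i in range(nx * ny)]

# Compact layout: each processor owns one contiguous 3x3 block.
blocks = [(x // 3) * 2 + (y // 3) for x in range(nx) for y in range(ny)]

# Both layouts give every processor the same number of elements...
assert all(round_robin.count(p) == 9 for p in range(n_procs))
assert all(blocks.count(p) == 9 for p in range(n_procs))

# ...but they differ greatly in how many neighbor links cross processors.
round_robin_cuts = cut_edges(round_robin, nx, ny)
block_cuts = cut_edges(blocks, nx, ny)
print(round_robin_cuts, block_cuts)
```

Here the round-robin layout separates every one of the 60 neighbor pairs, while the block layout separates only 12, even though the computational load per processor is identical.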
For homogeneous loads, it is likely best to omit load-balancing and just use the z-curve distribution (default) to give a good initial distribution and use that for the entire evolution. This means calling the SpECTRE executable with no LB-related command-line args, or with +balancer NullLB.
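As a rough illustration of why a z-curve distribution tends to give a good initial placement, the sketch below orders the elements of a small 2D grid along a Z-order (Morton) curve and assigns contiguous runs of that ordering to processors. This is not SpECTRE's implementation; the function names, the 2D setting, and the grid size are assumptions made for the example.

```python
def morton_index(x, y, bits):
    """Interleave the bits of x and y to get the position along a Z-order curve."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)
        z |= ((y >> b) & 1) << (2 * b + 1)
    return z


def z_curve_owner(nx, ny, n_procs, bits):
    """Assign contiguous runs of the Z-ordered elements to processors."""
    coords = [(x, y) for x in range(nx) for y in range(ny)]
    ordered = sorted(coords, key=lambda c: morton_index(c[0], c[1], bits))
    chunk = -(-len(ordered) // n_procs)  # ceiling division
    return {c: rank // chunk for rank, c in enumerate(ordered)}


# Hypothetical 4x4 grid of elements on 4 processors.
owner = z_curve_owner(4, 4, 4, bits=2)
for y in range(4):
    print(" ".join(str(owner[(x, y)]) for x in range(4)))
```

In this example each processor receives a compact 2x2 block of elements, so most neighbors of an element share its processor and little communication crosses processor boundaries.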
You may find modest gains from using a communication-based load balancer, but likely only from the 'extra' parallel components of the system that cause the load to be not completely homogeneous (e.g., components like the interpolator or horizon finder). If you need a very long evolution or intend to submit a large number of evolutions, it may be worth experimenting to see whether 1-3 applications of RecBipartLB (or ScotchLB once its bugs are fixed, see above) improve performance for the system, for instance by using the input file:
and command-line args +balancer RecBipartLB (or ScotchLB when its bugs are fixed). This may be particularly relevant for cases with numeric initial data or other complicated set-up procedures.
Based on our experiments, we anticipate that using a load balancer may significantly improve runtimes with inhomogeneous loads. Our testing of this case is much sparser, but for SpECTRE executables it probably remains true that managing communication costs will be an important goal for the balancer. It is likely worth attempting the evolution with a periodically applied, centralized, communication-aware balancer, e.g.:
paired with command-line args +balancer RecBipartLB (or ScotchLB when its bugs are fixed).
Important considerations when choosing the interval at which to balance are:
We have not yet taken much detailed data on using the load-balancers for inhomogeneous loads, so more detailed tests determining their efficacy would be valuable.