\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond

# Build Profiling and Optimization {#build_profiling_and_optimization}

\tableofcontents

# Why is our build so expensive?

SpECTRE makes heavy use of compile-time logic, which is responsible for a lot of
the nice type-checking, performance, and flexibility the code has to offer.
For instance, our central data structure, the \ref DataBoxGroup "DataBox", uses
a type list of several "tags" to determine its contents, as well as to
automatically propagate dependencies in compute items.

This use of compile-time logic, however, has the trade-off of making our builds
take longer and use more memory than a similar implementation in runtime logic
would.
There is good reason to believe that some of these costs are paid back at
runtime, because many of our compile-time switches permit the final runtime code
to be more efficient or avoid unnecessary computation.
There is certainly room for optimization, though, either in finding better
compile-time implementations of the algorithms, eliminating expensive template
instantiations, or moving inefficient parts to runtime.
This guide gives a quick outline of some of the methods that can be used to
profile the build, and notes possible pitfalls.

# Understanding template expenses

The cost of compile-time template logic is a bit non-intuitive if you are used
to thinking only of runtime performance.
The main reason is that the typical unit of 'cost' in a compile-time operation
is the number of instantiated types and functions.
Re-using a type that has been instantiated elsewhere (in the same translation
unit) typically has a very low compile-time cost, whereas instantiating a type
with new template parameters will incur its own cost, plus that of any new types
it requires in e.g. type aliases or member functions.

Consider a Fibonacci calculation at compile-time:
```cpp
template <size_t N>
struct Fibonacci {
  static constexpr size_t value =
      Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};

template <>
struct Fibonacci<1> {
  static constexpr size_t value = 1;
};

template <>
struct Fibonacci<0> {
  static constexpr size_t value = 1;
};
```
If we were to write the same logic at runtime, the algorithm would be hopelessly
inefficient; each recursive call to Fibonacci would make two further calls to
the same function, resulting in an exponential-time algorithm!
However, the C++ language will only instantiate unique types, so only N types
will be created, making the compile-time computation linear in N.

Compile-time lists and list operations frequently appear in SpECTRE, and should
be thought of differently from runtime list operations.

In compile-time lists, we have no access to true constant-time lookup, speedy
algorithms that rely on sorted structures, or more sophisticated data structures
(balanced trees, hash tables, etc.).
The limitations of compile-time list techniques can cause list operations to be
more costly than we could achieve with runtime data structures.

Consider a basic version of the compile-time list accessor:
```cpp
template <typename List, size_t Index>
struct list_at;

template <typename ListItem0, typename... ListItems>
struct list_at<tmpl::list<ListItem0, ListItems...>, 0> {
  using type = ListItem0;
};

template <typename ListItem0, typename... ListItems, size_t Index>
struct list_at<tmpl::list<ListItem0, ListItems...>, Index> {
  using type = typename list_at<tmpl::list<ListItems...>, Index - 1>::type;
};
```
Now, to access the Nth item in the list, we need to instantiate \f$O(N)\f$
types.
The above implementation is significantly more costly than we would find in
practice in template metaprogramming libraries -- in particular, our chosen
TMPL backend, [Brigand](https://github.com/edouarda/brigand), manages the task
in \f$O(\log(N))\f$ (at least in type instantiation count).

Most of the list operations in SpECTRE cannot take advantage of any particular
ordering or hashing of the list, so they must resort to naive implementations:
searching a list (`tmpl::list_contains` or `tmpl::index_of`) is
\f$O(N)\f$, `tmpl::remove_duplicates` is \f$O(N^2)\f$, and
`tmpl::list_difference` is similarly \f$O(N^2)\f$.
Complicated type logic therefore scales badly with long lists, and improvements
can sometimes be made by reducing a list's size or avoiding the more costly list
operations when a list is known to be long.

# Profiling the build

In the current version of SpECTRE, the most expensive builds are the final
translation units associated with the executables (particularly the most
complicated executables, like Generalized Harmonic and Valencia), which should
be unsurprising from the above discussion: it is in these compilation steps that
we instantiate the Parallel components, and in turn, all of the \ref
DataBoxGroup "DataBox" types that will be used during the evolution.

Similarly, a number of tests have now shown that in the current version of
SpECTRE (as of early 2021), by far the most expensive part of the build is
\ref DataBoxGroup "DataBox" operations and instantiations, and the best build
performance gains are available by either reducing the number of
\ref DataBoxGroup "DataBox"es that are instantiated, reducing the number of tags
(particularly compute tags) stored in the \ref DataBoxGroup "DataBox", or
optimizing the implementations of the \ref DataBoxGroup "DataBox" and its
utilities.
So, generally speaking, profiling should start by focusing on the
\ref DataBoxGroup "DataBox", and move to other utilities if it becomes clear
that there are other parts of the code that are contributing significantly to
the compilation time or memory usage.

## Specialized tests and feature exploration

This is simultaneously the most reliable and most labor-intensive strategy for
understanding build costs.
The procedure is to identify a feature you'd like to profile, create a
specialized test for that feature, and add that test to
`tests/Unit/RunSingleTest/CMakeLists.txt`.
Then, you can easily include or exclude uses of functions or classes that you
want to profile, and compare the relative total cost of building
`RunSingleTest`.

There are a number of tools for profiling the cost of an individual process,
but for compilation, detailed tools like `perf` or HPCToolkit are unlikely
to give useful information about what parts of our code are slow to build.
Instead, it's best to just carefully measure the full build of the target
in question, and rapidly iterate to include or exclude potentially expensive
parts to understand the build costs.

There are a lot of tools that can give you global resource usage
information, including the `/proc/$PID/status` file from kernel information,
`top`, or tools from the `sysstat` package.
The `time` utility is particularly user-friendly, available on Ubuntu (and
therefore in the SpECTRE build container) via `apt-get install time`.
It can then be invoked as `/usr/bin/time -v your_command` (note that plain
`time` will route to a different alias in the default environment on Ubuntu in
the container, so the full path `/usr/bin/time` is required).
After completion, it will print a readable report of the time and memory usage.

One important feature to be aware of when profiling the build by this method is
the implicit memoization of many compile-time features.
For instance, if features `A` and `B` both instantiate a class `C` that is
expensive to build, you'll see a difference in the build cost when _either_ `A`
or `B` is included, but the cost won't be additive -- the second feature will
just 'reuse' the instantiation from the first.
To optimize this type of situation, either `C` must be improved to be less
costly to instantiate, or its use must be eliminated from _both_ `A` and `B` --
removing `C` from only one of the classes that use it won't help the build much
at all.

## Templight++

Detailed profiling of a C++ build is a surprisingly hard task, and there are few
useful tools for getting a good idea of what parts of a compilation are
expensive. One tool that can perform some profiling of the build is
[templight++](https://github.com/mikael-s-persson/templight).
_It is important to note that this tool often produces build profiles that are
misleading or incomplete!_ It is included in this guide under the philosophy
that a flawed tool can be better than no tool at all in some circumstances,
but the templight++ profiles should be taken primarily as a loose guide for
features to investigate with more careful follow-up, such as the
above suggestion of specialized tests and feature exploration.

The `templight++` build and usage instructions work nicely with the current
SpECTRE build system, and the cmake trick suggested by the `templight++`
documentation
```bash
export CC="/path/to/llvm/build/bin/templight -Xtemplight -profiler\
 -Xtemplight -memory"
export CXX="/path/to/llvm/build/bin/templight++ -Xtemplight -profiler\
 -Xtemplight -memory"
```
works well in SpECTRE. Build profiling with `templight++` is incredibly slow,
and seems to produce increasingly misleading data for larger builds, so it is
recommended to avoid using the tool for our most expensive evolution
executables. Experience indicates that you will likely wait for hours and be
disappointed by deeply flawed results.

After building a target, you will find alongside each `.o` file in the
`build/src` tree an additional file that ends with `.trace.pbf`.
These are the `templight++` trace files, and (like many performance tool
outputs) they require post-processing to recover human-readable data.
The companion package
[templight-tools](https://github.com/mikael-s-persson/templight-tools)
can be built to obtain the `templight-convert` utility, which converts
the templight traces to more manageable formats.
It is recommended to install and use
[KCacheGrind](https://kcachegrind.github.io/html/Home.html)
(which does, unfortunately, require some KDE libraries, but doesn't require you
to use the full KDE window system) to visualize the output -- the larger graphs
produced by templight are inefficient to render in the graphviz format.

## Clang AST syntax generation

There is a nice but poorly documented feature of the `clang++` compiler that
can produce a rough approximation of the collection of C++ template
instantiations produced by a particular executable.
Adding `-Xclang -ast-print -fsyntax-only` to the `CXX_FLAGS` will cause this
information to be printed to stdout, which should probably be redirected to a
file because the output will be enormous.
Importantly, to the best of our knowledge, this tool has not yet been used to
successfully profile any SpECTRE build, but with sufficient post-processing the
C++-syntax version of the AST might be useful to determine the number and nature
of instantiations produced by a particular piece of code, and might offer some
proxy for build performance.

For instance, if we put the above `Fibonacci` struct in a source file with:
```cpp
#include <iostream>

int main(int argc, char** argv) {
  std::cout << Fibonacci<6>::value << "\n";
}
```
and we compile it with
```bash
clang++-10 -Xclang -ast-print -fsyntax-only -o fib ./fib.cpp >> fib_out
```
we obtain in `fib_out` (after thousands of lines of STL-generated code):
```cpp
template <size_t N> struct Fibonacci {
  static constexpr size_t value =
      Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};
template <> struct Fibonacci<6> {
  static constexpr size_t value =
      Fibonacci<6UL - 1>::value + Fibonacci<6UL - 2>::value;
};
template <> struct Fibonacci<5> {
  static constexpr size_t value =
      Fibonacci<5UL - 1>::value + Fibonacci<5UL - 2>::value;
};
template <> struct Fibonacci<4> {
  static constexpr size_t value =
      Fibonacci<4UL - 1>::value + Fibonacci<4UL - 2>::value;
};
template <> struct Fibonacci<3> {
  static constexpr size_t value =
      Fibonacci<3UL - 1>::value + Fibonacci<3UL - 2>::value;
};
template <> struct Fibonacci<2> {
  static constexpr size_t value =
      Fibonacci<2UL - 1>::value + Fibonacci<2UL - 2>::value;
};
template <> struct Fibonacci<1> { static constexpr size_t value = 1; };
template <> struct Fibonacci<0> { static constexpr size_t value = 1; };
int main(int argc, char **argv) { std::cout << Fibonacci<6>::value << "\n"; }
```
This output is actually pretty illuminating about what the compiler decided to
do in this simple case.
Unfortunately, the AST produced by `clang++` in more complicated cases is
extremely large, so realistic cases are likely too large to be usefully
human-readable.
It may be possible, though, to script post-processing tools to sift through the
collections of template instantiations for particular classes to understand
specific cases of template logic.