\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond

# Build Profiling and Optimization {#build_profiling_and_optimization}

\tableofcontents

# Why is our build so expensive?

SpECTRE makes heavy use of compile-time logic, which is responsible for a lot of
the nice type-checking, performance, and flexibility the code has to offer.
For instance, our central data structure, the \ref DataBoxGroup "DataBox", uses
a type list of several "tags" to determine its contents, as well as to
automatically propagate dependencies in compute items.

This use of compile-time logic, however, has the trade-off of making our builds
take longer and use more memory than a similar implementation in runtime logic
would.
There is good reason to believe that some of these costs are paid back at
runtime, because many of our compile-time switches permit the final runtime code
to be more efficient or avoid unnecessary computation.
There is certainly room for optimization, though, either in finding better
compile-time implementations of the algorithms, eliminating expensive template
instantiations, or moving inefficient parts to runtime.
This guide gives a quick outline of some of the methods that can be used to
profile the build, along with some possible pitfalls.

# Understanding template expenses

The cost of compile-time template logic is a bit non-intuitive if you are used
to thinking only of runtime performance.
The main reason is that the typical unit of 'cost' in a compile-time operation
is the number of instantiated types and functions.
Re-using a type that has been instantiated elsewhere (in the same translation
unit) typically has a very low compile-time cost, whereas instantiating a type
with new template parameters will incur its own cost, plus the cost of any new
types that it requires in e.g. type aliases or member functions.

Consider a Fibonacci calculation at compile time:
```cpp
#include <cstddef>

template <size_t N>
struct Fibonacci {
  static constexpr size_t value =
      Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};

template <>
struct Fibonacci<1> {
  static constexpr size_t value = 1;
};

template <>
struct Fibonacci<0> {
  static constexpr size_t value = 1;
};
```
If we were to write the same logic at runtime, the algorithm would be hopelessly
inefficient; the recursion would cause each call to Fibonacci to make two
further calls to the same function, resulting in an exponential-time algorithm!
However, the C++ language will only instantiate unique types, so only N types
will be created, making this a linear-cost compile-time operation.
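
As a quick usage sketch (assuming the `Fibonacci` definition above is in
scope), the result is available as a compile-time constant and can be checked
with a `static_assert`:
```cpp
// Only Fibonacci<0> through Fibonacci<10> are instantiated -- eleven types in
// total -- even though a naive runtime recursion would make O(2^N) calls.
static_assert(Fibonacci<10>::value == 89,
              "with the base cases F(0) = F(1) = 1, F(10) is 89");
```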

Compile-time lists and list operations frequently appear in SpECTRE, and should
be thought of differently from runtime list operations.

In compile-time lists, we have no access to true constant-time lookup, speedy
algorithms that rely on sorted structures, or more sophisticated data structures
(balanced trees, hash tables, etc.).
The limitations of compile-time list techniques can cause list operations to be
more costly than we could achieve with runtime data structures.

Consider a basic version of a compile-time list accessor:
```cpp
template <typename List, size_t Index>
struct list_at;

template <typename ListItem0, typename... ListItems>
struct list_at<tmpl::list<ListItem0, ListItems...>, 0> {
  using type = ListItem0;
};

template <typename ListItem0, typename... ListItems, size_t Index>
struct list_at<tmpl::list<ListItem0, ListItems...>, Index> {
  using type = typename list_at<tmpl::list<ListItems...>, Index - 1>::type;
};
```
Now, to access the Nth item in the list, we need to instantiate \f$O(N)\f$
types.
The above implementation is significantly more costly than we would find in
practice in template metaprogramming libraries -- in particular, our chosen
TMPL backend, [Brigand](https://github.com/edouarda/brigand), manages the task
in \f$O(\log(N))\f$ (at least in type instantiation count).

Most of the list operations in SpECTRE cannot take advantage of any particular
ordering or hashing of the list, so they must resort to naive algorithms:
searching a list (`tmpl::list_contains` or `tmpl::index_of`) costs \f$O(N)\f$,
while `tmpl::remove_duplicates` and `tmpl::list_difference` are both
\f$O(N^2)\f$.
As a result, complicated type logic scales poorly with long lists, and
improvements can sometimes be made by reducing a list's size or avoiding the
more costly list operations when a list is known to be long.
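
For concreteness, here is a minimal sketch of the kinds of list operations
referred to above (assuming `Utilities/TMPL.hpp` is the header providing the
`tmpl` alias; the exact spellings and return types of these utilities should be
checked against that header):
```cpp
#include "Utilities/TMPL.hpp"

using Tags = tmpl::list<int, double, char, double>;

// Searching a list instantiates O(N) helper types.
static_assert(tmpl::list_contains<Tags, char>::value, "char is in the list");
static_assert(tmpl::index_of<Tags, char>::value == 2, "char is the third item");

// Removing duplicates instantiates O(N^2) types, since each item must be
// compared against every item that has already been kept.
using UniqueTags = tmpl::remove_duplicates<Tags>;
static_assert(tmpl::size<UniqueTags>::value == 3, "one duplicate is removed");
```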

# Profiling the build

In the current version of SpECTRE, the most expensive compilations are those of
the final translation units associated with the executables (particularly the
most complicated executables, like Generalized Harmonic and Valencia), which
should be unsurprising given the above discussion: it is in these compilation
steps that we instantiate the Parallel components and, in turn, all of the
\ref DataBoxGroup "DataBox" types that will be used during the evolution.

Similarly, a number of tests have now shown that in the current version of
SpECTRE (as of early 2021), by far the most expensive part of the build is
\ref DataBoxGroup "DataBox" operations and instantiations, and the best build
performance gains are available by reducing the number of
\ref DataBoxGroup "DataBox"es that are instantiated, reducing the number of tags
(particularly compute tags) stored in the \ref DataBoxGroup "DataBox", or
optimizing the implementations of the \ref DataBoxGroup "DataBox" and its
utilities.
Generally speaking, then, profiling should start by focusing on the
\ref DataBoxGroup "DataBox", and move to other utilities if it becomes clear
that there are other parts of the code that are contributing significantly to
the compilation time or memory usage.

## Specialized tests and feature exploration

This is simultaneously the most reliable and most labor-intensive strategy for
understanding build costs.
The procedure is to identify a feature you'd like to profile and create a
specialized test for that feature.
Then, you can easily include or exclude uses of functions or classes that you
want to profile, and compare the relative total cost of building the test
executable. You may want to temporarily remove all other source files from
the test executable.
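
For instance, a stripped-down test source can use a preprocessor switch to
toggle the feature under investigation, so that the two configurations can be
built and their costs compared (a hypothetical sketch; `SomeExpensiveFeature`
is a placeholder, not an actual SpECTRE class):
```cpp
// Flip this flag between builds and compare the build time and memory of the
// test target in each configuration.
#define PROFILE_EXPENSIVE_FEATURE 1

#if PROFILE_EXPENSIVE_FEATURE
#include "SomeExpensiveFeature.hpp"  // hypothetical header for the feature
#endif

int main() {
#if PROFILE_EXPENSIVE_FEATURE
  // Instantiating the feature here is what incurs the compile-time cost that
  // we want to measure.
  [[maybe_unused]] const auto feature = SomeExpensiveFeature<double, 3>{};
#endif
}
```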

There are a number of tools for profiling the cost of an individual process,
but for compilation, detailed tools like `perf` or `hpctoolkit` are unlikely
to give useful information about what parts of our code are slow to build.
Instead, it's best to just carefully measure the full build of the target
in question, and rapidly iterate to include or exclude potentially expensive
parts to understand the build costs.

There are a lot of tools that can give you global resource-usage information,
including the kernel's `/proc/$PID/status` file, `top`, or tools from the
`sysstat` package.
The `time` utility is particularly user-friendly, and is available on Ubuntu
(and therefore in the SpECTRE build container) via `apt-get install time`.
It can then be invoked as `/usr/bin/time -v your_command` (note that simply
`time` will resolve to the shell's built-in version in the default environment
on Ubuntu in the container, so the full path `/usr/bin/time` is required).
After completion, it will print a readable report of the time and memory usage.
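
For example, to time the build of a single target (the target name here is just
a placeholder):
```bash
# Report wall-clock time and peak memory for building one target
/usr/bin/time -v make MyTarget
# The report includes lines such as:
#   Elapsed (wall clock) time (h:mm:ss or m:ss): ...
#   Maximum resident set size (kbytes): ...
```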

One important feature to be aware of when profiling the build by this method is
the implicit memoization of many compile-time features.
For instance, if features `A` and `B` both instantiate a class `C` that is
expensive to build, you'll see a difference in the build cost when _either_ `A`
or `B` is included, but the cost won't be additive -- the second feature will
just 'reuse' the instantiation from the first.
To optimize this type of situation, either `C` must be improved to be less
costly to instantiate, or its use must be eliminated from _both_ `A` and `B` --
removing `C` from only one of the classes that use it won't help the build much
at all.
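
Schematically (a contrived sketch, not actual SpECTRE code):
```cpp
template <typename T>
struct C {
  // imagine something expensive to instantiate here
};

struct A {
  C<double> member;  // pays the instantiation cost of C<double>
};

struct B {
  C<double> member;  // free: C<double> was already instantiated for A
};
```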

## Templight++

Detailed profiling of a C++ build is a surprisingly hard task, and there are few
useful tools for getting a good idea of which parts of a compilation are
expensive. One tool that can perform some profiling of the build is
[templight++](https://github.com/mikael-s-persson/templight).
_It is important to note that this tool often produces build profiles that are
misleading or incomplete!_ It is included in this guide under the philosophy
that a flawed tool can be better than no tool at all in some circumstances,
but the templight++ profiles should be taken primarily as a loose guide to
features worth investigating with more careful follow-up, such as the
specialized tests and feature exploration suggested above.

The `templight++` build and usage instructions work nicely with the current
SpECTRE build system, and the cmake trick suggested by the `templight++`
documentation
```bash
export CC="/path/to/llvm/build/bin/templight -Xtemplight -profiler \
-Xtemplight -memory"
export CXX="/path/to/llvm/build/bin/templight++ -Xtemplight -profiler \
-Xtemplight -memory"
```
works well in SpECTRE. Build profiling with `templight++` is incredibly slow,
and seems to produce increasingly misleading data for larger builds, so it is
recommended to avoid using the tool for our most expensive evolution
executables. Experience indicates that you will likely wait for hours and be
disappointed by deeply flawed results.

After building a target, you will find alongside each `.o` file in the
`build/src` tree an additional file that ends with `.trace.pbf`.
These are the `templight++` trace files, and (like many performance tool
outputs) they require post-processing to recover human-readable data.
The companion package
[templight-tools](https://github.com/mikael-s-persson/templight-tools)
can be built to obtain the `templight-convert` utility, which converts
the templight traces to more manageable formats.
It is recommended to install and use
[KCacheGrind](https://kcachegrind.github.io/html/Home.html)
(which does, unfortunately, require some KDE libraries, but doesn't require you
to use the full KDE window system) to visualize the output -- the larger graphs
produced by templight are inefficient to render in the graphviz format.
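
A typical conversion might look like the following (the trace file name is a
placeholder, and the `templight-convert` flags are taken from the
templight-tools documentation; check `templight-convert --help` if they differ
in your version):
```bash
# Convert a templight trace into callgrind format and open it in KCacheGrind
templight-convert --input MyFile.cpp.o.trace.pbf \
  --format callgrind --output MyFile.callgrind
kcachegrind MyFile.callgrind
```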

## Clang profiling

With `clang >= 9`, you can generate a flame graph of what the compiler is
spending its time on, including specific function instantiations, class
instantiations, source file parsing, debug info generation, and function
optimization. To do so, add `-ftime-trace` to the `cmake` option
`-D CMAKE_CXX_FLAGS` in your `cmake` command for your build. Then, you can
simply `make TargetName`. This will produce a `.json` for each `.cpp.o` that
was compiled, in the directory where the object file is stored. Note that these
object and `.json` files are deep within the build directory. You can then load
your `.json` of interest into `chrome://tracing`. Because our code base uses
many templates and some compilation targets instantiate many types, classes, and
functions, the flame graph for a particular compilation target may be a lot to
visually navigate, click through, and analyze at a high level. Because of this,
the tool described below can be very useful and more manageable for some
analyses.
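
A minimal workflow might look like this (the target name is a placeholder, and
the `cmake` invocation should include your usual options):
```bash
# Configure with the extra flag, then build the target of interest
cmake -D CMAKE_CXX_FLAGS="-ftime-trace" <path to spectre> <your usual options>
make TargetName
# Each object file now has a .json trace next to it, deep in the build tree,
# e.g. src/.../CMakeFiles/<library>.dir/File.cpp.json
# Load that .json into chrome://tracing to browse the flame graph.
```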

The companion tool to the flame graph generation is
[ClangBuildAnalyzer](https://github.com/aras-p/ClangBuildAnalyzer). This is an
open-source tool that you can clone and run to get profiling output similar to
what you see when profiling runtime code, in that it identifies compilation hot
spots for you. It will report things like "Here are the templates/functions
that took the longest to instantiate/compile." Note that because many of our
types can have very long names due to our templating, some class and function
signatures will not fit in the default character limit of what is printed to the
terminal. You can adjust the maximum character length printed by creating a
`ClangBuildAnalyzer.ini` file in your working directory as described in the
[ClangBuildAnalyzer](https://github.com/aras-p/ClangBuildAnalyzer) `readme.md`.
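
The capture-and-analyze workflow from the ClangBuildAnalyzer readme looks
roughly like this (the build directory and capture file names are
placeholders, and the build must have been configured with `-ftime-trace` as
described above):
```bash
# Start capturing, build the target of interest, then stop and analyze
ClangBuildAnalyzer --start build
make TargetName
ClangBuildAnalyzer --stop build capture.bin
ClangBuildAnalyzer --analyze capture.bin
```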

## Clang AST syntax generation

There is a nice but poorly documented feature of the `clang++` compiler: it can
produce a rough approximation of the collection of C++ template instantiations
produced by a particular executable.
Adding `-Xclang -ast-print -fsyntax-only` to the `CXX_FLAGS` will cause this
information to be printed to stdout, which should probably be redirected to a
file because the output will be enormous.
Importantly, to the best of our knowledge, this tool has not yet been used to
successfully profile any SpECTRE build, but with sufficient post-processing the
C++-syntax version of the AST might be useful for determining the number and
nature of instantiations produced by a particular piece of code, and might
offer some proxy for build performance.

For instance, if we put the above `Fibonacci` struct in a source file with:
```cpp
#include <iostream>

int main(int argc, char** argv) {
  std::cout << Fibonacci<6>::value << "\n";
}
```
and we compile it with
```bash
clang++-10 -Xclang -ast-print -fsyntax-only -o fib ./fib.cpp >> fib_out
```
we obtain in `fib_out` (after thousands of lines of STL-generated code):
```cpp
template <size_t N> struct Fibonacci {
  static constexpr size_t value =
      Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};
template <> struct Fibonacci<6> {
  static constexpr size_t value =
      Fibonacci<6UL - 1>::value + Fibonacci<6UL - 2>::value;
};
template <> struct Fibonacci<5> {
  static constexpr size_t value =
      Fibonacci<5UL - 1>::value + Fibonacci<5UL - 2>::value;
};
template <> struct Fibonacci<4> {
  static constexpr size_t value =
      Fibonacci<4UL - 1>::value + Fibonacci<4UL - 2>::value;
};
template <> struct Fibonacci<3> {
  static constexpr size_t value =
      Fibonacci<3UL - 1>::value + Fibonacci<3UL - 2>::value;
};
template <> struct Fibonacci<2> {
  static constexpr size_t value =
      Fibonacci<2UL - 1>::value + Fibonacci<2UL - 2>::value;
};
template <> struct Fibonacci<1> { static constexpr size_t value = 1; };
template <> struct Fibonacci<0> { static constexpr size_t value = 1; };
int main(int argc, char **argv) { std::cout << Fibonacci<6>::value << "\n"; }
```
This is actually pretty illuminating about what the compiler decided to do in
this simple case.
Unfortunately, the AST produced by `clang++` in more complicated cases is
extremely large, so realistic cases are likely too big to be usefully
human-readable.
It may be possible, though, to script post-processing tools to sift through the
collections of template instantiations for particular classes to understand
specific cases of template logic.
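
As a trivial example of such post-processing, standard text tools can already
count the specializations appearing in the printed AST (using the `fib_out`
file generated above):
```bash
# List each Fibonacci specialization in the printed AST and how often it occurs
grep -o "struct Fibonacci<[0-9]*>" fib_out | sort | uniq -c
```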