<!-- docs/DevGuide/BuildOptimization.md @ b5ffa4904430ccef0b226f73dcd25c74cb5188f6 (2021-07-28) -->
\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond

# Build Profiling and Optimization {#build_profiling_and_optimization}

\tableofcontents

# Why is our build so expensive?

SpECTRE makes heavy use of compile-time logic, which is responsible for a lot of
the nice type-checking, performance, and flexibility the code has to offer.
For instance, our central data structure, the \ref DataBoxGroup "DataBox", uses
a type list of several "tags" to determine its contents, as well as to
automatically propagate dependencies in compute items.

This use of compile-time logic, however, has the trade-off of making our builds
take longer and use more memory than a similar implementation in runtime logic
would.
There is good reason to believe that some of these costs are paid back at
runtime, because many of our compile-time switches permit the final runtime code
to be more efficient or avoid unnecessary computation.
There is certainly room for optimization, though, either in finding better
compile-time implementations of the algorithms, eliminating expensive template
instantiations, or moving inefficient parts to runtime.
This guide gives a quick outline of some of the methods that can be used to
profile the build, and of possible pitfalls.

# Understanding template expenses

The cost of compile-time template logic is a bit non-intuitive if you are used
to thinking only of runtime performance.
The main reason is that the typical unit of 'cost' in a compile-time operation
is the number of instantiated types and functions.
Re-using a type that has been instantiated elsewhere (in the same translation
unit) typically has a very low compile-time cost, whereas instantiating a type
with new template parameters will incur its own cost, plus that of any new
types it requires in e.g. type aliases or member functions.

Consider a Fibonacci calculation at compile time:

```cpp
template <size_t N>
struct Fibonacci {
  static constexpr size_t value =
      Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};

template <>
struct Fibonacci<1> {
  static constexpr size_t value = 1;
};

template <>
struct Fibonacci<0> {
  static constexpr size_t value = 1;
};
```

If we were to write the same logic at runtime, the algorithm would be hopelessly
inefficient; each call to Fibonacci would make two recursive calls to the same
function, resulting in an exponential-time algorithm!
However, the C++ language will only instantiate unique types, so only N types
will be created, making the compile-time computation linear in N.

Compile-time lists and list operations frequently appear in SpECTRE, and should
be thought of differently from runtime list operations.
In compile-time lists, we have no access to true constant-time lookup, speedy
algorithms that rely on sorted structures, or more sophisticated data structures
(balanced trees, hash tables, etc.).
The limitations of compile-time list techniques can cause list operations to be
more costly than we could achieve with runtime data structures.

Consider a basic version of a compile-time list accessor:

```cpp
template <typename List, size_t Index>
struct list_at;

template <typename FirstItem, typename... RemainingItems>
struct list_at<list<FirstItem, RemainingItems...>, 0> {
  using type = FirstItem;
};

template <size_t Index, typename FirstItem, typename... RemainingItems>
struct list_at<list<FirstItem, RemainingItems...>, Index> {
  using type = typename list_at<list<RemainingItems...>, Index - 1>::type;
};
```

Now, to access the Nth item in the list, we need to instantiate \f$O(N)\f$
types.
The above implementation is significantly more costly than we would find in
practice in template metaprogramming libraries -- in particular, our chosen
TMPL backend, [Brigand](https://github.com/edouarda/brigand), manages the task
in \f$O(\log(N))\f$ (at least in type instantiation count).

Most of the list operations in SpECTRE cannot take advantage of any particular
ordering or hashing of the list, so must resort to naive list operations --
so, searching a list (`tmpl::list_contains` or `tmpl::index_of`) is
\f$O(N)\f$ cost, `tmpl::remove_duplicates` is \f$O(N^2)\f$, and
`tmpl::list_difference` is similarly \f$O(N^2)\f$.
Complicated type logic therefore scales pretty badly with long lists, and
improvements can sometimes be made by reducing a list's size or by avoiding the
more costly list operations when a list is known to be long.

# Profiling the build

In the current version of SpECTRE, the most expensive builds are the final
translation units associated with the executables (particularly the most
complicated executables, like Generalized Harmonic and Valencia), which should
be unsurprising given the above discussion: it is in these compilation steps
that we instantiate the Parallel components and, in turn, all of the \ref
DataBoxGroup "DataBox" types that will be used during the evolution.

Similarly, a number of tests have now shown that in the current version of
SpECTRE (as of early 2021), by far the most expensive part of the build is
\ref DataBoxGroup "DataBox" operations and instantiations, and the best build
performance gains are available by either reducing the number of
\ref DataBoxGroup "DataBox"es that are instantiated, reducing the number of tags
(particularly compute tags) stored in the \ref DataBoxGroup "DataBox", or
optimizing the implementations of the \ref DataBoxGroup "DataBox" and its
utilities.
So, generally speaking, profiling should start by focusing on the
\ref DataBoxGroup "DataBox", and move to other utilities if it becomes clear
that other parts of the code are contributing significantly to the compilation
time or memory usage.

## Specialized tests and feature exploration

This is simultaneously the most reliable and most labor-intensive strategy for
understanding build costs.
The procedure is to identify a feature you'd like to profile, create a
specialized test for that feature, and put that test in
`tests/Unit/RunSingleTest/CMakeLists.txt`.
Then, you can easily include or exclude uses of the functions or classes that
you want to profile, and compare the relative total cost of building
`RunSingleTest`.

There are a number of tools for profiling the cost of an individual process,
but for compilation, detailed tools like `perf` or `hpctoolkit` are unlikely
to give useful information about what parts of our code are slow to build.
Instead, it's best to just carefully measure the full build of the target
in question, and rapidly iterate to include or exclude potentially expensive
parts to understand the build costs.
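
Concretely, one such measurement might look like the following command
fragment. This is a sketch, not a prescribed workflow: it assumes a configured
build directory and the GNU `time` utility installed at `/usr/bin/time` (both
discussed below), and the log file name is illustrative:

```shell
# Build the specialized test target, recording wall time and peak memory.
cd build
/usr/bin/time -v make RunSingleTest 2>&1 | tee build_profile.log

# The two key figures from the resource report:
grep -E 'Elapsed \(wall clock\)|Maximum resident set size' build_profile.log
```

Repeating this after including or excluding a suspect feature gives the
relative cost comparison described above.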

There are a lot of tools that can give you this global resource usage
information, including the /proc/\$PID/status file from kernel information,
`top`, or tools from the sysstat package.
The `time` utility is particularly user-friendly, and is available on Ubuntu
(and therefore in the SpECTRE build container) via `apt-get install time`.
It can then be invoked as `/usr/bin/time -v your_command` (note that simply
`time` will route to a different alias in the default environment on Ubuntu in
the container, so the full path `/usr/bin/time` is required).
After completion, it will print a readable report of the time and memory usage.

One important feature to be aware of when profiling the build by this method is
the implicit memoization of many compile-time constructs.
For instance, if features A and B both instantiate a class C that is
expensive to build, you'll see a difference in the build cost when _either_ A
or B is included, but the cost won't be additive -- the second feature will
just 'reuse' the instantiation from the first.
To optimize this type of situation, either C must be improved to be less
costly to instantiate, or its use must be eliminated from _both_ A and B --
removing C from only one of the classes that use it won't help the build much
at all.

## Templight++

Detailed profiling of a C++ build is a surprisingly hard task, and there are few
useful tools for getting a good idea of which parts of a compilation are
expensive. One tool that can perform some profiling of the build is
[templight++](https://github.com/mikael-s-persson/templight).
_It is important to note that this tool often produces build profiles that are
misleading or incomplete!_ It is included in this guide under the philosophy
that a flawed tool can be better than no tool at all in some circumstances,
but templight++ profiles should be taken primarily as a loose guide for
identifying features to investigate with more careful follow-up, like the
above suggestion of specialized tests and feature exploration.

The templight++ build and usage instructions work nicely with the current
SpECTRE build system, and the cmake trick suggested by the templight++
documentation,

```bash
export CC="/path/to/llvm/build/bin/templight -Xtemplight -profiler \
-Xtemplight -memory"
export CXX="/path/to/llvm/build/bin/templight++ -Xtemplight -profiler \
-Xtemplight -memory"
```

works well in SpECTRE. Build profiling runs with templight++ are incredibly
slow, and seem to produce increasingly misleading data for larger builds, so it
is recommended to avoid using the tool for our most expensive evolution
executables. Experience indicates that you will likely wait for hours and be
disappointed by deeply flawed results.

After building a target, you will find alongside each .o file in the
build/src tree an additional file that ends with .trace.pbf.
These are the templight++ trace files, and (like many performance tool
outputs) they require post-processing to recover human-readable data.
The companion package
[templight-tools](https://github.com/mikael-s-persson/templight-tools)
can be built to obtain the `templight-convert` utility, which converts
the templight traces to more manageable formats.
It is recommended to install and use
[KCacheGrind](https://kcachegrind.github.io/html/Home.html)
(which does, unfortunately, require some KDE libraries, but doesn't require you
to use the full KDE window system) to visualize the output -- the larger graphs
produced by templight are inefficient to render in the graphviz format.

## Clang AST syntax generation

A nice but poorly documented feature of the clang++ compiler is that it can
produce a rough approximation of the collection of C++ template
instantiations produced by a particular executable.
Adding `-Xclang -ast-print -fsyntax-only` to the `CXX_FLAGS` will cause this
information to be printed to stdout, which should probably be redirected to a
file, because the output will be enormous.
Importantly, to the best of our knowledge, this tool has not yet been used to
successfully profile any SpECTRE build, but with sufficient post-processing the
C++-syntax version of the AST might be useful to determine the number and nature
of instantiations produced by a particular piece of code, and might offer some
proxy for build performance.

For instance, if we put the above Fibonacci struct in a source file with:

```cpp
int main(int argc, char** argv) {
  std::cout << Fibonacci<6>::value << "\n";
}
```

and we compile it with

```
clang++-10 -Xclang -ast-print -fsyntax-only -o fib ./fib.cpp >> fib_out
```

we obtain in fib_out (after thousands of lines of STL-generated code):

```cpp
template <size_t N> struct Fibonacci {
  static constexpr size_t value =
      Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};
template <> struct Fibonacci<6> {
  static constexpr size_t value =
      Fibonacci<6UL - 1>::value + Fibonacci<6UL - 2>::value;
};
template <> struct Fibonacci<5> {
  static constexpr size_t value =
      Fibonacci<5UL - 1>::value + Fibonacci<5UL - 2>::value;
};
template <> struct Fibonacci<4> {
  static constexpr size_t value =
      Fibonacci<4UL - 1>::value + Fibonacci<4UL - 2>::value;
};
template <> struct Fibonacci<3> {
  static constexpr size_t value =
      Fibonacci<3UL - 1>::value + Fibonacci<3UL - 2>::value;
};
template <> struct Fibonacci<2> {
  static constexpr size_t value =
      Fibonacci<2UL - 1>::value + Fibonacci<2UL - 2>::value;
};
template <> struct Fibonacci<1> { static constexpr size_t value = 1; };
template <> struct Fibonacci<0> { static constexpr size_t value = 1; };
int main(int argc, char **argv) { std::cout << Fibonacci<6>::value << "\n"; }
```

This is actually pretty illuminating about what the compiler decided to do in
this simple case.
Unfortunately, the AST produced by clang++ in more complicated cases is
extremely large, so realistic cases are likely too long to be usefully
human-readable.
It may be possible, though, to script post-processing tools to sift through the
collections of template instantiations for particular classes to understand
specific cases of template logic.
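
For example, a first pass at such a script might simply count the explicit
specializations of a given class in the printed AST. The sketch below generates
a tiny stand-in for the `fib_out` dump inline so it is self-contained; in
practice you would point the `grep` at the real redirected output:

```shell
# Stand-in for a fragment of the AST dump produced above.
cat > fib_out <<'EOF'
template <> struct Fibonacci<6> { /* ... */ };
template <> struct Fibonacci<5> { /* ... */ };
template <> struct Fibonacci<1> { static constexpr size_t value = 1; };
EOF

# Count how many explicit specializations of Fibonacci were instantiated.
grep -c 'struct Fibonacci<' fib_out
```

Comparing such counts between two variants of a source file gives a crude
proxy for the instantiation cost of the feature that differs between them.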
