\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond

# Build Profiling and Optimization {#build_profiling_and_optimization}

\tableofcontents

# Why is our build so expensive?

SpECTRE makes heavy use of compile-time logic, which is responsible for a lot of
the nice type-checking, performance, and flexibility the code has to offer.
For instance, our central data structure, the \ref DataBoxGroup "DataBox", uses
a type list of several "tags" to determine its contents, as well as to
automatically propagate dependencies in compute items.

This use of compile-time logic, however, has the trade-off of making our builds
take longer and use more memory than a similar implementation in runtime logic
would.
There is good reason to believe that some of these costs are paid back at
runtime, because many of our compile-time switches permit the final runtime code
to be more efficient or avoid unnecessary computation.
There is certainly room for optimization, though, either in finding better
compile-time implementations of the algorithms, eliminating expensive template
instantiations, or moving inefficient parts to runtime.
This guide gives a quick outline of some of the methods that can be used to
profile the build, along with some possible pitfalls.

# Understanding template expenses

The cost of compile-time template logic is a bit non-intuitive if you are used
to thinking only of runtime performance.
The main reason is that the typical unit of 'cost' in a compile-time operation
is the number of instantiated types and functions.
Re-using a type that has been instantiated elsewhere (in the same translation
unit) typically has a very low compile-time cost, whereas instantiating a type
with new template parameters will incur its own cost, plus the cost of any new
types that it requires in e.g. type aliases or member functions.

Consider a Fibonacci calculation at compile time:
```cpp
#include <cstddef>

template <size_t N>
struct Fibonacci {
  static constexpr size_t value =
      Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};

template <>
struct Fibonacci<1> {
  static constexpr size_t value = 1;
};

template <>
struct Fibonacci<0> {
  static constexpr size_t value = 1;
};
```
If we were to write the same logic at runtime, the algorithm would be hopelessly
inefficient; the recursion would cause each call to Fibonacci to make two
further calls to the same function, resulting in an exponential-time algorithm!
However, the C++ language will only instantiate unique types, so only N types
will be created, making this a linear-cost compile-time operation.
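
As a quick usage sketch (assuming the `Fibonacci` definition above is in
scope), the result is available as a compile-time constant and can be checked
with a `static_assert`:
```cpp
// Only Fibonacci<0> through Fibonacci<10> are instantiated -- eleven types in
// total -- even though a naive runtime recursion would make O(2^N) calls.
static_assert(Fibonacci<10>::value == 89,
              "with the base cases F(0) = F(1) = 1, F(10) is 89");
```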

Compile-time lists and list operations frequently appear in SpECTRE, and should
be thought of differently from runtime list operations.

In compile-time lists, we have no access to true constant-time lookup, speedy
algorithms that rely on sorted structures, or more sophisticated data structures
(balanced trees, hash tables, etc.).
The limitations of compile-time list techniques can cause list operations to be
more costly than we could achieve with runtime data structures.

Consider a basic version of a compile-time list accessor:
```cpp
template <typename List, size_t Index>
struct list_at;

template <typename ListItem0, typename... ListItems>
struct list_at<tmpl::list<ListItem0, ListItems...>, 0> {
  using type = ListItem0;
};

template <typename ListItem0, typename... ListItems, size_t Index>
struct list_at<tmpl::list<ListItem0, ListItems...>, Index> {
  using type = typename list_at<tmpl::list<ListItems...>, Index - 1>::type;
};
```
Now, to access the Nth item in the list, we need to instantiate \f$O(N)\f$
types.
The above implementation is significantly more costly than we would find in
practice in template metaprogramming libraries -- in particular, our chosen
TMPL backend, [Brigand](https://github.com/edouarda/brigand), manages the task
in \f$O(\log(N))\f$ (at least in type instantiation count).

Most of the list operations in SpECTRE cannot take advantage of any particular
ordering or hashing of the list, so they must resort to naive algorithms:
searching a list (`tmpl::list_contains` or `tmpl::index_of`) costs \f$O(N)\f$,
while `tmpl::remove_duplicates` and `tmpl::list_difference` are both
\f$O(N^2)\f$.
As a result, complicated type logic scales poorly with long lists, and
improvements can sometimes be made by reducing a list's size or avoiding the
more costly list operations when a list is known to be long.
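
For concreteness, here is a minimal sketch of the kinds of list operations
referred to above (assuming `Utilities/TMPL.hpp` is the header providing the
`tmpl` alias; the exact spellings and return types of these utilities should be
checked against that header):
```cpp
#include "Utilities/TMPL.hpp"

using Tags = tmpl::list<int, double, char, double>;

// Searching a list instantiates O(N) helper types.
static_assert(tmpl::list_contains<Tags, char>::value, "char is in the list");
static_assert(tmpl::index_of<Tags, char>::value == 2, "char is the third item");

// Removing duplicates instantiates O(N^2) types, since each item must be
// compared against every item that has already been kept.
using UniqueTags = tmpl::remove_duplicates<Tags>;
static_assert(tmpl::size<UniqueTags>::value == 3, "one duplicate is removed");
```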

# Profiling the build

In the current version of SpECTRE, the most expensive compilations are those of
the final translation units associated with the executables (particularly the
most complicated executables, like Generalized Harmonic and Valencia), which
should be unsurprising given the above discussion: it is in these compilation
steps that we instantiate the Parallel components and, in turn, all of the
\ref DataBoxGroup "DataBox" types that will be used during the evolution.

Similarly, a number of tests have now shown that in the current version of
SpECTRE (as of early 2021), by far the most expensive part of the build is
\ref DataBoxGroup "DataBox" operations and instantiations, and the best build
performance gains are available by reducing the number of
\ref DataBoxGroup "DataBox"es that are instantiated, reducing the number of tags
(particularly compute tags) stored in the \ref DataBoxGroup "DataBox", or
optimizing the implementations of the \ref DataBoxGroup "DataBox" and its
utilities.
Generally speaking, then, profiling should start by focusing on the
\ref DataBoxGroup "DataBox", and move to other utilities if it becomes clear
that there are other parts of the code that are contributing significantly to
the compilation time or memory usage.

## Specialized tests and feature exploration

This is simultaneously the most reliable and most labor-intensive strategy for
understanding build costs.
The procedure is to identify a feature you'd like to profile and create a
specialized test for that feature.
Then, you can easily include or exclude uses of functions or classes that you
want to profile, and compare the relative total cost of building the test
executable. You may want to temporarily remove all other source files from
the test executable.
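
For instance, a stripped-down test source can use a preprocessor switch to
toggle the feature under investigation, so that the two configurations can be
built and their costs compared (a hypothetical sketch; `SomeExpensiveFeature`
is a placeholder, not an actual SpECTRE class):
```cpp
// Flip this flag between builds and compare the build time and memory of the
// test target in each configuration.
#define PROFILE_EXPENSIVE_FEATURE 1

#if PROFILE_EXPENSIVE_FEATURE
#include "SomeExpensiveFeature.hpp"  // hypothetical header for the feature
#endif

int main() {
#if PROFILE_EXPENSIVE_FEATURE
  // Instantiating the feature here is what incurs the compile-time cost that
  // we want to measure.
  [[maybe_unused]] const auto feature = SomeExpensiveFeature<double, 3>{};
#endif
}
```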

There are a number of tools for profiling the cost of an individual process,
but for compilation, detailed tools like `perf` or `hpctoolkit` are unlikely
to give useful information about what parts of our code are slow to build.
Instead, it's best to just carefully measure the full build of the target
in question, and rapidly iterate to include or exclude potentially expensive
parts to understand the build costs.

There are a lot of tools that can give you global resource-usage information,
including the kernel's `/proc/$PID/status` file, `top`, or tools from the
`sysstat` package.
The `time` utility is particularly user-friendly, and is available on Ubuntu
(and therefore in the SpECTRE build container) via `apt-get install time`.
It can then be invoked as `/usr/bin/time -v your_command` (note that simply
`time` will resolve to the shell's built-in version in the default environment
on Ubuntu in the container, so the full path `/usr/bin/time` is required).
After completion, it will print a readable report of the time and memory usage.
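
For example, to time the build of a single target (the target name here is just
a placeholder):
```bash
# Report wall-clock time and peak memory for building one target
/usr/bin/time -v make MyTarget
# The report includes lines such as:
#   Elapsed (wall clock) time (h:mm:ss or m:ss): ...
#   Maximum resident set size (kbytes): ...
```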

One important feature to be aware of when profiling the build by this method is
the implicit memoization of many compile-time features.
For instance, if features `A` and `B` both instantiate a class `C` that is
expensive to build, you'll see a difference in the build cost when _either_ `A`
or `B` is included, but the cost won't be additive -- the second feature will
just 'reuse' the instantiation from the first.
To optimize this type of situation, either `C` must be improved to be less
costly to instantiate, or its use must be eliminated from _both_ `A` and `B` --
removing `C` from only one of the classes that use it won't help the build much
at all.
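
Schematically (a contrived sketch, not actual SpECTRE code):
```cpp
template <typename T>
struct C {
  // imagine something expensive to instantiate here
};

struct A {
  C<double> member;  // pays the instantiation cost of C<double>
};

struct B {
  C<double> member;  // free: C<double> was already instantiated for A
};
```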

## Templight++

Detailed profiling of a C++ build is a surprisingly hard task, and there are few
useful tools for getting a good idea of which parts of a compilation are
expensive. One tool that can perform some profiling of the build is
[templight++](https://github.com/mikael-s-persson/templight).
_It is important to note that this tool often produces build profiles that are
misleading or incomplete!_ It is included in this guide under the philosophy
that a flawed tool can be better than no tool at all in some circumstances,
but the templight++ profiles should be taken primarily as a loose guide to
features worth investigating with more careful follow-up, such as the
specialized tests and feature exploration suggested above.

The `templight++` build and usage instructions work nicely with the current
SpECTRE build system, and the cmake trick suggested by the `templight++`
documentation
```bash
export CC="/path/to/llvm/build/bin/templight -Xtemplight -profiler \
-Xtemplight -memory"
export CXX="/path/to/llvm/build/bin/templight++ -Xtemplight -profiler \
-Xtemplight -memory"
```
works well in SpECTRE. Build profiling with `templight++` is incredibly slow,
and seems to produce increasingly misleading data for larger builds, so it is
recommended to avoid using the tool for our most expensive evolution
executables. Experience indicates that you will likely wait for hours and be
disappointed by deeply flawed results.

After building a target, you will find alongside each `.o` file in the
`build/src` tree an additional file that ends with `.trace.pbf`.
These are the `templight++` trace files, and (like many performance tool
outputs) they require post-processing to recover human-readable data.
The companion package
[templight-tools](https://github.com/mikael-s-persson/templight-tools)
can be built to obtain the `templight-convert` utility, which converts
the templight traces to more manageable formats.
It is recommended to install and use
[KCacheGrind](https://kcachegrind.github.io/html/Home.html)
(which does, unfortunately, require some KDE libraries, but doesn't require you
to use the full KDE window system) to visualize the output -- the larger graphs
produced by templight are inefficient to render in the graphviz format.
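
A typical conversion might look like the following (the trace file name is a
placeholder, and the `templight-convert` flags are taken from the
templight-tools documentation; check `templight-convert --help` if they differ
in your version):
```bash
# Convert a templight trace into callgrind format and open it in KCacheGrind
templight-convert --input MyFile.cpp.o.trace.pbf \
  --format callgrind --output MyFile.callgrind
kcachegrind MyFile.callgrind
```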

## Clang profiling

With `clang >= 9`, you can generate a flame graph of what the compiler is
spending its time on, including specific function instantiations, class
instantiations, source file parsing, debug info generation, and function
optimization. To do so, add `-ftime-trace` to the `cmake` option
`-D CMAKE_CXX_FLAGS` in your `cmake` command for your build. Then, you can
simply `make TargetName`. This will produce a `.json` for each `.cpp.o` that
was compiled, in the directory where the object file is stored. Note that these
object and `.json` files are deep within the build directory. You can then load
your `.json` of interest into `chrome://tracing`. Because our code base uses
many templates and some compilation targets instantiate many types, classes, and
functions, the flame graph for a particular compilation target may be a lot to
visually navigate, click through, and analyze at a high level. Because of this,
the tool described below can be very useful and more manageable for some
analyses.
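
A minimal workflow might look like this (the target name is a placeholder, and
the `cmake` invocation should include your usual options):
```bash
# Configure with the extra flag, then build the target of interest
cmake -D CMAKE_CXX_FLAGS="-ftime-trace" <path to spectre> <your usual options>
make TargetName
# Each object file now has a .json trace next to it, deep in the build tree,
# e.g. src/.../CMakeFiles/<library>.dir/File.cpp.json
# Load that .json into chrome://tracing to browse the flame graph.
```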

The companion tool to the flame graph generation is
[ClangBuildAnalyzer](https://github.com/aras-p/ClangBuildAnalyzer). This is an
open-source tool that you can clone and run to get profiling output similar to
what you see when profiling runtime code, in that it identifies compilation hot
spots for you. It will report things like "Here are the templates/functions
that took the longest to instantiate/compile." Note that because many of our
types can have very long names due to our templating, some class and function
signatures will not fit in the default character limit of what is printed to the
terminal. You can adjust the maximum character length printed by creating a
`ClangBuildAnalyzer.ini` file in your working directory as described in the
[ClangBuildAnalyzer](https://github.com/aras-p/ClangBuildAnalyzer) `readme.md`.
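
The capture-and-analyze workflow from the ClangBuildAnalyzer readme looks
roughly like this (the build directory and capture file names are
placeholders, and the build must have been configured with `-ftime-trace` as
described above):
```bash
# Start capturing, build the target of interest, then stop and analyze
ClangBuildAnalyzer --start build
make TargetName
ClangBuildAnalyzer --stop build capture.bin
ClangBuildAnalyzer --analyze capture.bin
```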

## Clang AST syntax generation

There is a nice but poorly documented feature of the `clang++` compiler: it can
produce a rough approximation of the collection of C++ template instantiations
produced by a particular executable.
Adding `-Xclang -ast-print -fsyntax-only` to the `CXX_FLAGS` will cause this
information to be printed to stdout, which should probably be redirected to a
file because the output will be enormous.
Importantly, to the best of our knowledge, this tool has not yet been used to
successfully profile any SpECTRE build, but with sufficient post-processing the
C++-syntax version of the AST might be useful for determining the number and
nature of instantiations produced by a particular piece of code, and might
offer some proxy for build performance.

For instance, if we put the above `Fibonacci` struct in a source file with:
```cpp
#include <iostream>

int main(int argc, char** argv) {
  std::cout << Fibonacci<6>::value << "\n";
}
```
and we compile it with
```bash
clang++-10 -Xclang -ast-print -fsyntax-only -o fib ./fib.cpp >> fib_out
```
we obtain in `fib_out` (after thousands of lines of STL-generated code):
```cpp
template <size_t N> struct Fibonacci {
  static constexpr size_t value =
      Fibonacci<N - 1>::value + Fibonacci<N - 2>::value;
};
template <> struct Fibonacci<6> {
  static constexpr size_t value =
      Fibonacci<6UL - 1>::value + Fibonacci<6UL - 2>::value;
};
template <> struct Fibonacci<5> {
  static constexpr size_t value =
      Fibonacci<5UL - 1>::value + Fibonacci<5UL - 2>::value;
};
template <> struct Fibonacci<4> {
  static constexpr size_t value =
      Fibonacci<4UL - 1>::value + Fibonacci<4UL - 2>::value;
};
template <> struct Fibonacci<3> {
  static constexpr size_t value =
      Fibonacci<3UL - 1>::value + Fibonacci<3UL - 2>::value;
};
template <> struct Fibonacci<2> {
  static constexpr size_t value =
      Fibonacci<2UL - 1>::value + Fibonacci<2UL - 2>::value;
};
template <> struct Fibonacci<1> { static constexpr size_t value = 1; };
template <> struct Fibonacci<0> { static constexpr size_t value = 1; };
int main(int argc, char **argv) { std::cout << Fibonacci<6>::value << "\n"; }
```
This is actually pretty illuminating about what the compiler decided to do in
this simple case.
Unfortunately, the AST produced by `clang++` in more complicated cases is
extremely large, so realistic cases are likely too big to be usefully
human-readable.
It may be possible, though, to script post-processing tools to sift through the
collections of template instantiations for particular classes to understand
specific cases of template logic.
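
As a trivial example of such post-processing, standard text tools can already
count the specializations appearing in the printed AST (using the `fib_out`
file generated above):
```bash
# List each Fibonacci specialization in the printed AST and how often it occurs
grep -o "struct Fibonacci<[0-9]*>" fib_out | sort | uniq -c
```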