\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond
# Profiling {#profiling}

\tableofcontents

There are a number of tools available for profiling, each with their own
strengths and weaknesses. This makes it difficult to recommend one "right" way
of analyzing performance using profilers. Instead, one should use a combination
of the tools to discover and eliminate performance bottlenecks. Common
profilers are Charm++ Projections (tracing-based), HPCToolkit (sampling-based,
very versatile), Linux perf (sampling-based, command line only), Intel VTune
(sampling-based, works well on Intel hardware), and AMD uProf (similar to Intel
VTune).

## Profiling with HPCToolkit {#profiling_with_hpctoolkit}

Follow the HPCToolkit installation instructions at
[hpctoolkit.org](http://hpctoolkit.org). The Spack installation seems to work
well. Once installed, compile your executable in Release mode with
`-D ENABLE_PROFILING=ON -D DEBUG_SYMBOLS=ON` since otherwise you won't be able
to get call stacks and source analysis. Using `-D BUILD_SHARED_LIBS=ON` is
recommended since it makes HPCToolkit a lot easier to use. You must also use
the system allocator, `-D MEMORY_ALLOCATOR=SYSTEM`. We will work from the
build directory and perform all runs and performance analysis there.
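For concreteness, a configure command along these lines should produce a
suitable build (a minimal sketch: the SpECTRE source path is a placeholder,
and any compiler or dependency options you normally pass still apply):
```
# Release build with profiling support, debug symbols for source
# analysis, shared libraries, and the system memory allocator.
cmake -D CMAKE_BUILD_TYPE=Release \
      -D ENABLE_PROFILING=ON \
      -D DEBUG_SYMBOLS=ON \
      -D BUILD_SHARED_LIBS=ON \
      -D MEMORY_ALLOCATOR=SYSTEM \
      /path/to/spectre
```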
First run HPCToolkit as:
```
hpcrun -t --event CYCLES@f200 ./bin/EXEC --input-file ./Input.yaml +p1
```
We will profile on one core, but you can profile on multiple cores as well as
multiple nodes if using MPI as the Charm++ backend. This will generate a
`hpctoolkit-EXEC-measurements` directory. Run
```
hpcstruct -jN ./hpctoolkit-EXEC-measurements
```
where `N` is the number of cores to run on. This will generate a mapping to
line numbers, etc. in the measurements directory.

\warning Skipping the `hpcstruct` step will make `hpcprof` below run extremely
slowly.

Once the run is complete, run
```
hpcprof -I /path/to/spectre/src/+ hpctoolkit-EXEC-measurements
```
Note that the `+` is a literal `+` symbol. This will create the directory
```
hpctoolkit-EXEC-database
```
which you can view using
```
hpcviewer ./hpctoolkit-EXEC-database
```

HPCViewer will generally start you in the `Top-down view` (callgraph of
callers). You can select `Bottom-up view` (callgraph of callees) to get a
different perspective. Whether you want to look at the callgraph of callers or
callees depends a bit on the executable, what you're looking to measure, and
how you like to think about things. The callees graph gives a nice overview of
which low-level operations take up a lot of time, but it makes the call stack
look quite different from what you might expect. On the right of the
callgraphs you will see `CYCLES:Sum (I)` and `CYCLES:Sum (E)`. `I` means time
spent _including_ callees, while `E` means time spent in the function itself
(exclusive time). Sorting by the exclusive column gives a good idea of what
the hot functions are. Here is a screenshot from HPCViewer:

\image html HpcViewerCallees.png "HPCViewer callgraph of callees"

You can see that 49.1% of inclusive time is spent in primitive recovery, and
the line after the 49.1% function is a function inside the Kastaun recovery
scheme. The `__nss_database_lookup` entry is really time spent in low-level
libc routines, e.g. `__memcpy_avx_unaligned_erms` or
`__memset_avx2_unaligned_erms`, whose samples get attributed to that symbol.
Looking at the calling code, e.g. `prepare_neighbor_data`, gives a good hint
as to what's going on. In most cases these are memory copies or memory sets
(`std::vector` default initializes its memory, which is bad for performance).
The way to fix these bottlenecks is to avoid memory copies and
`std::vector<double>` as buffers.

HPCToolkit allows you to sample on a variety of different event counters
instead of just cycles. Please see the HPCToolkit manual for details.
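For instance, to sample instruction and cache-miss counters in a single run,
something like the following should work (a sketch: the exact event names
vary by system and HPCToolkit version, so check the output of `hpcrun -L` for
what is available on your machine):
```
# Sample two hardware counters at 200 samples/sec each.
hpcrun -t -e INSTRUCTIONS@f200 -e CACHE-MISSES@f200 \
  ./bin/EXEC --input-file ./Input.yaml +p1
```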
## Profiling with AMD uProf {#profiling_with_amd_uprof}

[AMD uProf](https://developer.amd.com/amd-uprof/) is AMD's sampling-based
profiler, which makes it relatively easy to perform quite detailed performance
analysis. The uProf manual is quite good and extensive, so for the most part
the reader is referred to that. However, we will go over some basics for
profiling executables and understanding the results. Make sure to compile your
executable in Release mode with
`-D ENABLE_PROFILING=ON -D DEBUG_SYMBOLS=ON` since otherwise you won't be able
to get call stacks and source analysis.

When you open uProf you may be asked to change the kernel event paranoid
level. Once you have uProf open, select `PROFILE` at the top. Specify the
application path, options, etc. We will again run on a single core to analyze
performance. It's recommended that you set the Core Affinity in AMD uProf so
that your application isn't migrated between cores during a profiling run.
Then choose `Next` in the lower right corner. Make sure the `CPU Profile Type`
is set to `CPU Profile` at the top. We will first do a `Time-based Sampling`
run (on the left). This means uProf will interrupt the application every `N`
milliseconds and record where the application is. You typically want a few
thousand total samples to get something that's reasonably representative of
your application. Under the `Advanced %Options` make sure `Enable CSS` (on the
right) is enabled (green) and that `Enable FPO` is also enabled. Now click
`Start Profile` in the bottom right. Once the profile is complete you will be
presented with a summary outlining where your code is spending most of its
time. Click `ANALYZE` at the top to get a more detailed view. On the left you
can select between a callgraph of callees (Function HotSpots), a callgraph of
callers (Call Graph), and a few other views. Below is an example of a result
from the same run we used with HPCToolkit above.

\image html AmdUprofCallgraph.png "AMD uProf callgraph of callees"

Again we see that most of our time is spent in primitive recovery but also
that a lot of time is spent copying memory. This was grouped into
`__nss_database_lookup` in HPCToolkit. Unfortunately, getting a call stack out
of the `memcpy` doesn't always work, and so while you know you're spending a
lot of time copying memory, it's not so obvious where those copies are
occurring.

## Profiling With Charm++ Projections {#profiling_with_projections}

To view trace data after a profiling run you must download Charm++'s
Projections software from their [website](http://charm.cs.illinois.edu/). If
you encounter issues it may be necessary to clone the git repository and build
the correct version from scratch. Note that the version of Charm++ used to
compile SpECTRE should match the version of Projections used to analyze the
trace data. You can collect the trace data on a different machine than the one
you will be analyzing the data on. For example, you can collect the data on a
supercomputer and analyze it on your desktop or laptop.

For profiling you will want to use a production build of Charm++, which means
compiling Charm++ with the `--with-production` flag. To enable trace
collection you must build with the `--enable-tracing` flag as well. For
example, on a multicore 64-bit Linux machine the build command would be
``` shell
./build LIBS multicore-linux-x86_64 gcc -j8 --with-production --enable-tracing
```
You must build your executable in Release mode as well, specifying
`-DCMAKE_BUILD_TYPE=Release` to CMake, as well as
```
-DCHARM_TRACE_PROJECTIONS=ON -DCHARM_TRACE_SUMMARY=ON -DENABLE_PROFILING=ON
```
to enable SpECTRE to use Charm++'s tracing features.

### Running SpECTRE With Trace Output

When running SpECTRE you must specify a directory to output trace data into.
This is done by adding the command line argument `+traceroot DIR` where `DIR`
is the directory to dump the trace data into. Note that `DIR` must already
exist; the application will not create it. For example,

```shell
./bin/EXEC --input-file ./Input.yaml +p4 +traceroot ./ExecTraces
```
You might get a warning that Charm++ had to flush the log some number of times
during the run. Flushing the log adds overhead to the execution and so affects
timing measurements. While Charm++ has the ability to manually flush the log
periodically (and therefore exclude the time it takes to flush the log from
the trace), we have not yet implemented support for this. For short executable
runs you can increase the log size by specifying `+logsize M` when running the
executable. The default log size is 1,000,000. Note that if you increase the
log size too much you will run out of memory.

For more information on runtime options to control trace data see the
[Charm++ Projections manual](http://charm.cs.illinois.edu/manuals/html/projections/1.html).

### Visualizing Trace %Data In Projections

By default Charm++ records entry method names by using the `PRETTY_FUNCTION`
macro. This means entry method names include all class (parallel component)
and action template parameter names, including any template parameters of the
template parameters. This very quickly leads to incomprehensibly long names
that are very difficult to read in the Projections interface. We include a
basic Python executable that handles the majority of renames; it also supports
additional basic (textual find-replace) and regular expression replacements
via a JSON file. These additional replacements are useful for making
executable-specific renames. The Python executable is
`tools/CharmSimplifyTraces.py` and an example replacements file is
`tools/CharmTraceReplacements.json`.

See the
[Charm++ Projections manual](http://charm.cs.illinois.edu/manuals/html/projections/2.html)
for details.
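As a concrete end-to-end sketch of the trace-collection steps above (the
trace directory name, core count, and log size are illustrative choices, not
required values):
```shell
# The trace directory must already exist; Charm++ will not create it.
mkdir -p ./ExecTraces
# Run on 4 cores with a 10x larger log to reduce mid-run log flushes.
./bin/EXEC --input-file ./Input.yaml +p4 \
  +traceroot ./ExecTraces +logsize 10000000
```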