\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond
# Profiling {#profiling}

\tableofcontents

There are a number of tools available for profiling, each with its own
strengths and weaknesses. This makes it difficult to recommend one "right" way
of analyzing performance using profilers. Instead, one should use a combination
of the tools to discover and eliminate performance bottlenecks. Common
profilers are Charm++ Projections (tracing-based), HPCToolkit (sampling-based,
very versatile), Linux perf (sampling-based, command line only), Intel VTune
(sampling-based, works well on Intel hardware), and AMD uProf (similar to Intel
VTune).

## Profiling with HPCToolkit {#profiling_with_hpctoolkit}

Follow the HPCToolkit installation instructions at
[hpctoolkit.org](http://hpctoolkit.org). The Spack installation seems to work
well. Once installed, compile your executable in Release mode with
`-D ENABLE_PROFILING=ON -D DEBUG_SYMBOLS=ON` since otherwise you won't be able
to get call stacks and source analysis. Using `-D BUILD_SHARED_LIBS=ON` is
recommended since it makes HPCToolkit a lot easier to use. You must also use
the system allocator, `-D MEMORY_ALLOCATOR=SYSTEM`. We will work from the
build directory and perform all runs and performance analysis there.

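For reference, a CMake configuration along these lines (combined with your
usual compiler and dependency options; the source path is a placeholder)
produces a profiling-friendly Release build:

```
cmake -D CMAKE_BUILD_TYPE=Release \
      -D ENABLE_PROFILING=ON \
      -D DEBUG_SYMBOLS=ON \
      -D BUILD_SHARED_LIBS=ON \
      -D MEMORY_ALLOCATOR=SYSTEM \
      /path/to/spectre
```
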
First run HPCToolkit as:
```
hpcrun -t --event CYCLES@f200 ./bin/EXEC --input-file ./Input.yaml +p1
```
We will profile on one core, but you can profile on multiple cores as well as
multiple nodes if using MPI as the Charm++ backend. This will generate a
`hpctoolkit-EXEC-measurements` directory. Run
```
hpcstruct -jN ./hpctoolkit-EXEC-measurements
```
where `N` is the number of cores to run on. This will generate a mapping to
line numbers, etc. in the measurements directory.

\warning Skipping the `hpcstruct` step will make `hpcprof` below run extremely
slowly.

Once the run is complete, run
```
hpcprof -I /path/to/spectre/src/+ hpctoolkit-EXEC-measurements
```
Note that the `+` is a literal `+` symbol. This will create the directory
```
hpctoolkit-EXEC-database
```
which you can view using
```
hpcviewer ./hpctoolkit-EXEC-database
```

HPCViewer will generally start you in the `Top-down view` (callgraph of
callers). You can select `Bottom-up view` (callgraph of callees) to get a
different perspective. Whether you want to look at the callgraph of callers or
callees depends a bit on the executable, what you're looking to measure, and
how you like to think about things. The callees graph can give you a nice
overview of which low-level operations are taking up a lot of time, but it
certainly makes the call stack not look like you would expect. On the right of
the callgraphs you will see `CYCLES:Sum (I)` and `CYCLES:Sum (E)`. `I` means
time spent _including_ callees, while `E` means time spent in the function
itself (exclusive time). Sorting by exclusive time gives a good idea of what
the hot functions are. Here is a screenshot from HPCViewer:

\image html HpcViewerCallees.png "HPCViewer callgraph of callees"

You can see that 49.1% of inclusive time is spent in primitive recovery, and
the line after the 49.1% function is a function inside the Kastaun recovery
scheme. The `__nss_database_lookup` entry typically stands in for low-level
libc routines, e.g. `__memcpy_avx_unaligned_erms` or
`__memset_avx2_unaligned_erms`. Looking at the calling code,
e.g. `prepare_neighbor_data`, gives a good hint as to what's going on. In most
cases these are memory copies or memory sets (`std::vector` default
initializes its memory, which is bad for performance). The way to fix these
bottlenecks is to avoid memory copies and `std::vector<double>` as buffers.

HPCToolkit allows you to sample on a variety of different event counters
instead of just cycles. Please see the HPCToolkit manual for details.

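As a sketch, you could sample on cache misses instead of cycles. The exact
event names available on your machine can be listed with `hpcrun -L`, so treat
the event name below as an example rather than a guarantee:

```
hpcrun -t --event CACHE-MISSES@f200 ./bin/EXEC --input-file ./Input.yaml +p1
```
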
## Profiling with AMD uProf {#profiling_with_amd_uprof}

[AMD uProf](https://developer.amd.com/amd-uprof/) is AMD's sampling-based
profiler that makes it relatively easy to do quite a bit of detailed
performance analysis. The uProf manual is quite good and extensive, so for the
most part the reader is referred to that. However, we will go over some basics
for profiling executables and understanding the results. Make sure to compile
your executable in Release mode with
`-D ENABLE_PROFILING=ON -D DEBUG_SYMBOLS=ON` since otherwise you won't be able
to get call stacks and source analysis.

When you open uProf you may be asked to change the kernel event paranoid
level. Once you have uProf open, select `PROFILE` at the top. Specify the
application path, options, etc. We will again run on a single core to analyze
performance. It's recommended that you set the Core Affinity in AMD uProf so
that your application isn't migrated between cores during a profiling run.
Then choose `Next` in the lower right corner. Make sure the `CPU Profile Type`
is set to `CPU Profile` at the top. We will first do a `Time-based Sampling`
run (on the left). This means uProf will interrupt the application every `N`
milliseconds and see where the application is. You typically want a few
thousand total samples to get something that's reasonably representative of
your application. Under the `Advanced %Options` make sure `Enable CSS` (on the
right) is enabled (green) and that `Enable FPO` is also enabled. Now click
`Start Profile` in the bottom right. Once the profile is complete you will be
presented with a summary outlining where your code is spending most of its
time. Click `ANALYZE` at the top to get a more detailed view. On the left you
can select between a callgraph of callees (Function HotSpots), a callgraph of
callers (Call Graph), and a few other views. Below is an example of a result
from the same run we used with HPCToolkit above.

\image html AmdUprofCallgraph.png "AMD uProf callgraph of callees"

Again we see that most of our time is spent in primitive recovery but also
that a lot of time is spent copying memory. This was grouped into
`__nss_database_lookup` in HPCToolkit. Unfortunately, getting a call stack out
of the `memcpy` doesn't always work, so while you know you're spending a lot
of time copying memory, it's not so obvious where those copies are occurring.

## Profiling With Charm++ Projections {#profiling_with_projections}

To view trace data after a profiling run you must download Charm++'s
Projections software from their [website](http://charm.cs.illinois.edu/).
If you encounter issues it may be necessary to clone the git repository and
build the correct version from scratch. Note that the version of Charm++ used
to compile SpECTRE should match the version of Projections used to analyze the
trace data. You can collect the trace data on a different machine than the one
you will be analyzing the data on. For example, you can collect the data on a
supercomputer and analyze it on your desktop or laptop.

For profiling you will want to use a production build of Charm++, which
means compiling Charm++ with the `--with-production` flag. To enable trace
collection you must build with the `--enable-tracing` flag as well. For
example, on a multicore 64-bit Linux machine the build command would be
``` shell
./build LIBS multicore-linux-x86_64 gcc -j8 --with-production --enable-tracing
```
You must also build your executable in Release mode, specifying
`-DCMAKE_BUILD_TYPE=Release` to CMake, as well as
```
-DCHARM_TRACE_PROJECTIONS=ON -DCHARM_TRACE_SUMMARY=ON -DENABLE_PROFILING=ON
```
to enable SpECTRE to use Charm++'s tracing features.

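Putting those flags together, a CMake invocation for a tracing-enabled build
might look like the following (in addition to your usual compiler and
dependency options; the source path is a placeholder):

```
cmake -D CMAKE_BUILD_TYPE=Release \
      -D ENABLE_PROFILING=ON \
      -D CHARM_TRACE_PROJECTIONS=ON \
      -D CHARM_TRACE_SUMMARY=ON \
      /path/to/spectre
```
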
### Running SpECTRE With Trace Output

When running SpECTRE you must specify a directory to output trace data into.
This is done by adding the command line argument `+traceroot DIR` where `DIR`
is the directory to dump the trace data into. Note that `DIR` must already
exist; the application will not create it.
For example,

```shell
./bin/EXEC --input-file ./Input.yaml +p4 +traceroot ./ExecTraces
```
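Since `+traceroot` will not create the directory for you, create it before the
run (the directory name here is just the one from the example above):

```shell
mkdir -p ./ExecTraces
```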
You might get a warning that Charm++ had to flush the log some number of times
during the run. Flushing the log adds overhead to the execution and so affects
timing measurements. While Charm++ has the ability to manually flush the log
periodically (and therefore exclude the time it takes to flush the log from
the trace), we have not yet implemented support for this. For short executable
runs you can increase the log size by specifying `+logsize M` when running the
executable. The default log size is 1,000,000 entries. Note that if you
increase the log size too much you will run out of memory/RAM.

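For example, to run with a log ten times the default size (the value is just
an illustration; pick something that fits in memory on your machine):

```shell
./bin/EXEC --input-file ./Input.yaml +p4 +traceroot ./ExecTraces +logsize 10000000
```
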
For more information on runtime options to control trace data see the
[Charm++ Projections manual](http://charm.cs.illinois.edu/manuals/html/projections/1.html).

### Visualizing Trace %Data In Projections

By default Charm++ records entry method names by using the `PRETTY_FUNCTION`
macro. This means entry method names include all class (parallel component)
and action template parameter names, including any template parameters of the
template parameters. This very quickly leads to incomprehensibly long names
that are very difficult to read in the Projections interface. We include a
basic Python executable to handle the majority of renames, and the executable
also supports additional basic (textual find-replace) and regular expression
replacements via a JSON file. These additional replacements are useful for
making executable-specific renames. The Python executable is
`tools/CharmSimplifyTraces.py` and an example replacements file is
`tools/CharmTraceReplacements.json`.

See the [Charm++ Projections manual](http://charm.cs.illinois.edu/manuals/html/projections/2.html)
for details.
