\cond NEVER
Distributed under the MIT License.
See LICENSE.txt for details.
\endcond
# Parallelization in SpECTRE {#tutorial_parallel_concepts}

\tableofcontents

This overview describes the concepts and terminology that SpECTRE uses
to enable parallelism.  It is a general discussion; subsequent
tutorials will provide a more in-depth exploration of the
parallelization infrastructure, including code examples.

Unlike many parallel scientific codes, which use data-based
parallelism, SpECTRE uses task-based parallelism.  The classical
strategy for parallelism (data-based parallelism) is to assign a
portion of the data to processes (or threads) that synchronously
execute compute kernels.  This is implemented in many codes, but it is
difficult to design codes with this strategy that will efficiently
scale for complex multi-scale, multi-physics workloads.  Task-based
parallelism provides a solution: instead of dividing work between
parallel processes based on data ownership, there is a set of tasks
and their inter-dependencies.  Tasks are scheduled and assigned to
processes dynamically, providing opportunities for load balancing and
minimization of idle threads.  By dividing the program into tasks
small enough that there are several tasks per thread, communication
time is hidden by interleaving tasks that are ready to be executed
with tasks that are waiting for data.

In order to implement task-based parallelism, SpECTRE is built on top
of the Charm++ parallel programming framework, which is developed by
the [Parallel Programming
Laboratory](http://charm.cs.illinois.edu/) at the University of
Illinois.  Charm++ is a mature parallel programming framework that
provides intra-node threading and can use a variety of communication
interfaces (including MPI) to communicate between nodes.  Charm++ has
a large user base, which includes users of the cosmological
\f$N\f$-body code
[ChaNGa](https://github.com/N-BodyShop/changa/wiki/ChaNGa) and of the
molecular dynamics code
[NAMD](https://www.ks.uiuc.edu/Research/namd/).

## Charm++ basic concepts

In order to understand how parallelization works in SpECTRE, it is
useful to understand the basic concepts in the design of Charm++.
Much of the following is quoted verbatim from the [Charm++
documentation](https://charm.readthedocs.io), interspersed with
comments on how SpECTRE interacts with Charm++.

> Charm++ is a C++-based parallel programming system, founded on the
> migratable-objects programming model, and supported by a novel and
> powerful adaptive runtime system. It supports both irregular as well
> as regular applications, and can be used to specify task-parallelism
> as well as data parallelism in a single application. It automates
> dynamic load balancing for task-parallel as well as data-parallel
> applications, via separate suites of load-balancing strategies. Via
> its message-driven execution model, it supports automatic latency
> tolerance, modularity and parallel composition. Charm++ also supports
> automatic checkpoint/restart, as well as fault tolerance based on
> distributed checkpoints.

SpECTRE currently wraps only some of the features of Charm++,
primarily those that support task parallelism.  We are just beginning
our exploration of dynamic load balancing, and will soon make use of
automatic checkpoint/restart.  At present we do not use Charm++'s
support for fault tolerance.

> The key feature of the migratable-objects programming model is
> over-decomposition: The programmer decomposes the program into a
> large number of work units and data units, and specifies the
> computation in terms of creation of and interactions between these
> units, without any direct reference to the processor on which any
> unit resides. This empowers the runtime system to assign units to
> processors, and to change the assignment at runtime as necessary.

SpECTRE's parallelization module is designed to make it easy for users
to exploit the migratable-object model by providing a framework to
define the units into which a program can be decomposed.

> A basic unit of parallel computation in Charm++ programs is a
> chare.  At its most basic level, it is just a C++ object. A Charm++
> computation consists of a large number of chares distributed on
> available processors of the system, and interacting with each other
> via asynchronous method invocations. Asynchronously invoking a
> method on a remote object can also be thought of as sending a
> “message” to it. So, these method invocations are sometimes referred
> to as messages. (besides, in the implementation, the method
> invocations are packaged as messages anyway). Chares can be created
> dynamically.

In SpECTRE, we wrap Charm++ chares in struct templates that we call
__parallel components__, which represent a collection of distributed
objects.  We wrap the asynchronous method invocations between the
elements of parallel components in struct templates that we call
__actions__.  Thus, each element of a parallel component can be
thought of as a C++ object that exists on one core of the
supercomputer, and an action as calling a member function of that
object, even if the caller is on another core.

> Conceptually, the system maintains a “work-pool” consisting of seeds
> for new chares, and messages for existing chares. The Charm++
> runtime system (Charm RTS) may pick multiple items,
> non-deterministically, from this pool and execute them, with the
> proviso that two different methods cannot be simultaneously
> executing on the same chare object (say, on different
> processors). Although one can define a reasonable theoretical
> operational semantics of Charm++ in this fashion, a more practical
> description of execution is useful to understand Charm++. A Charm++
> application’s execution is distributed among Processing Elements
> (PEs), which are OS threads or processes depending on the selected
> Charm++ build options. On each PE, there is a scheduler operating
> with its own private pool of messages. Each instantiated chare has
> one PE which is where it currently resides. The pool on each PE
> includes messages meant for chares residing on that PE, and seeds
> for new chares that are tentatively meant to be instantiated on that
> PE. The scheduler picks a message, creates a new chare if the
> message is a seed (i.e. a constructor invocation) for a new chare,
> and invokes the method specified by the message. When the method
> returns control back to the scheduler, it repeats the cycle
> (i.e. there is no pre-emptive scheduling of other invocations).

It is very important to keep in mind that the actions executed on
elements of parallel components are run non-deterministically by the
runtime system.  It is therefore the responsibility of the programmer
to ensure that actions are not called out of order.  This means that
if action B must be executed after action A on a given element of a
parallel component, the programmer must ensure either that action B is
called after the completion of action A (i.e. it is not sufficient
that action B is invoked after action A is invoked), or that action B
checks that action A has completed before running.

> When a chare method executes, it may create method invocations for
> other chares. The Charm Runtime System (RTS) locates the PE where
> the targeted chare resides, and delivers the invocation to the
> scheduler on that PE.

In SpECTRE, this is done by one element of a parallel component
calling an action on an element of another parallel component.

> Methods of a chare that can be remotely invoked are called entry
> methods. Entry methods may take serializable parameters, or a
> pointer to a message object. Since chares can be created on remote
> processors, obviously some constructor of a chare needs to be an
> entry method. Ordinary entry methods are completely non-preemptive-
> Charm++ will not interrupt an executing method to start any other
> work, and all calls made are asynchronous.

In SpECTRE, the struct template that defines a parallel component
takes care of creating the entry methods for the underlying chare,
which are then called by invoking actions.

> Charm++ provides dynamic seed-based load balancing. Thus location
> (processor number) need not be specified while creating a remote
> chare. The Charm RTS will then place the remote chare on a suitable
> processor. Thus one can imagine chare creation as generating only a
> seed for the new chare, which may take root on some specific
> processor at a later time.

We are just beginning to explore the load-balancing features of
Charm++, but plan to make new elements of parallel components (the
wrapped chares) creatable using actions, without specifying the
location on which the new element is created.

> Chares can be grouped into collections. The types of collections of
> chares supported in Charm++ are: chare-arrays, chare-groups, and
> chare-nodegroups, referred to as arrays, groups, and nodegroups
> throughout this manual for brevity. A Chare-array is a collection of
> an arbitrary number of migratable chares, indexed by some index
> type, and mapped to processors according to a user-defined map
> group. A group (nodegroup) is a collection of chares, with exactly
> one member element on each PE (“node”).

Each of SpECTRE's parallel components has a type alias `chare_type`
corresponding to whether it is a chare-array, chare-group, or
chare-nodegroup.  In addition, we support a singleton, which is
essentially a one-element array.

> Charm++ does not allow global variables, except readonly
> variables. A chare can normally only access its own data
> directly. However, each chare is accessible by a globally valid
> name. So, one can think of Charm++ as supporting a global object
> space.

SpECTRE does not use the readonly global variables provided by
Charm++.  Instead, SpECTRE provides a nodegroup called the
`GlobalCache`, which provides global access to both read-only and
writable objects, as well as a way to access every parallel component.

> Every Charm++ program must have at least one mainchare. Each
> mainchare is created by the system on processor 0 when the Charm++
> program starts up. Execution of a Charm++ program begins with the
> Charm RTS constructing all the designated mainchares. For a
> mainchare named X, execution starts at constructor X() or X(CkArgMsg
> *) which are equivalent. Typically, the mainchare constructor starts
> the computation by creating arrays, other chares, and groups. It can
> also be used to initialize shared readonly objects.
SpECTRE provides a pre-defined mainchare called `Main` that is run
when a SpECTRE executable is started.  `Main` creates the other
parallel components and initializes items in the `GlobalCache`, which
can then be used by any parallel component.

> Charm++ program execution is terminated by the CkExit call. Like the
> exit system call, CkExit never returns, and it optionally accepts an
> integer value to specify the exit code that is returned to the
> calling shell. If no exit code is specified, a value of zero
> (indicating successful execution) is returned. The Charm RTS ensures
> that no more messages are processed and no entry methods are called
> after a CkExit. CkExit need not be called on all processors; it is
> enough to call it from just one processor at the end of the
> computation.

SpECTRE wraps `CkExit` with the function `sys::exit`.  As no more
messages are processed by Charm++ after this call, SpECTRE also
defines a special `Exit` phase that is guaranteed to be executed after
all messages and entry methods have been processed.

> As described so far, the execution of individual Chares is
> “reactive”: When method A is invoked the chare executes this code,
> and so on. But very often, chares have specific life-cycles, and the
> sequence of entry methods they execute can be specified in a
> structured manner, while allowing for some localized non-determinism
> (e.g. a pair of methods may execute in any order, but when they both
> finish, the execution continues in a pre-determined manner, say
> executing a 3rd entry method).

Charm++ provides a special notation to simplify the expression of such
control structures, but it requires writing specialized interface
files that are parsed by Charm++.  SpECTRE does not support this;
instead, we split the executable into a set of phases.  In each phase,
each parallel component executes a user-defined list of actions.

> The normal entry methods, being asynchronous, are not allowed to
> return any value, and are declared with a void return type.

SpECTRE's actions do not return any value to the calling component.
Instead, when an action is finished it can call another action to send
data to an element of any parallel component.

> To support asynchronous method invocation and global object space,
> the RTS needs to be able to serialize (“marshall”) the parameters,
> and be able to generate global “names” for chares. For this purpose,
> programmers have to declare the chare classes and the signature of
> their entry methods in a special “.ci” file, called an interface
> file. Other than the interface file, the rest of a Charm++ program
> consists of just normal C++ code. The system generates several
> classes based on the declarations in the interface file, including
> “Proxy” classes for each chare class. Those familiar with various
> component models (such as CORBA) in the distributed computing world
> will recognize “proxy” to be a dummy, standin entity that refers to
> an actual entity. For each chare type, a “proxy” class exists. The
> methods of this “proxy” class correspond to the remote methods of
> the actual class, and act as “forwarders”. That is, when one invokes
> a method on a proxy to a remote object, the proxy marshalls the
> parameters into a message, puts adequate information about the
> target chare on the envelope of the message, and forwards it to the
> remote object. Individual chares, chare array, groups, node-groups,
> as well as the individual elements of these collections have a such
> a proxy. Multiple methods for obtaining such proxies are described
> in the manual. Proxies for each type of entity in Charm++ have some
> differences among the features they support, but the basic syntax
> and semantics remain the same - that of invoking methods on the
> remote object by invoking methods on proxies.

SpECTRE wraps all of this functionality in order to make it easier to
use.  SpECTRE automatically creates the interface files for each
parallel component using template metaprogramming, and provides
proxies for each parallel component that are all held in the
`GlobalCache`, which is available to every parallel component.  In
order for actions to be called as entry methods on remote parallel
components, the arguments to the function call must be serializable.
Charm++ provides the `PUP` framework to serialize objects, where `PUP`
stands for pack-unpack.  Since the PUP framework is used by Charm++
for checkpointing, load balancing, and passing arguments when calling
actions, all user-defined classes with member data must define a `pup`
function.

> In terms of physical resources, we assume the parallel machine
> consists of one or more nodes, where a node is a largest unit over
> which cache coherent shared memory is feasible (and therefore, the
> maximal set of cores per which a single process can run). Each node
> may include one or more processor chips, with shared or private
> caches between them. Each chip may contain multiple cores, and each
> core may support multiple hardware threads (SMT for example).
> Charm++ recognizes two logical entities: a PE (processing element)
> and a logical node, or simply “node”. In a Charm++ program, a PE is
> a unit of mapping and scheduling: each PE has a scheduler with an
> associated pool of messages. Each chare is assumed to reside on one
> PE at a time. A logical node is implemented as an OS process. In
> non-SMP mode there is no distinction between a PE and a logical
> node. Otherwise, a PE takes the form of an OS thread, and a logical
> node may contain one or more PEs. Physical nodes may be partitioned
> into one or more logical nodes. Since PEs within a logical node
> share the same memory address space, the Charm++ runtime system
> optimizes communication between them by using shared
> memory. Depending on the runtime command-line parameters, a PE may
> optionally be associated with a subset of cores or hardware threads.

In other words, how Charm++ defines a node and a PE depends upon how
Charm++ was installed on a system.  The executable
`Executables/ParallelInfo` can be used to determine how many nodes and
PEs exist for a given Charm++ build and the runtime command-line
parameters passed when calling the executable.

For more details see the [Charm++
documentation](https://charm.readthedocs.io) and \ref
dev_guide_parallelization_foundations.
