TRANSCRIPT
Achieving Scalability in the Presence of Asynchrony
Thomas Sterling, Professor of Informatics and Computing
Chief Scientist and Associate Director, Center for Research in Extreme Scale Technologies (CREST)
Pervasive Technology Institute at Indiana University
Fellow, Sandia National Laboratories CSRI
June 27, 2012
Working Definition
“Asynchrony (noun) – unpredictable variability of response time, either bounded or unbounded, for requests of resource access or functional services within and by computing systems. Uncertainty in time.”
Nothing New in Broad Sense
• Mass storage
• Distributed Computing
• Grids
• OS noise
• Cache replacement
• TLB misses
Manifesting Asynchrony
• Thread time to completion uncertainty
• Remote memory access apparent latency
• Long blocking time for barriers due to slowest dependent task
• Local memory access time due to cache hierarchy
• Thread start time due to resource availability
Sources of Asynchrony
• Embedded policies
• Resources outside the scope of user name space
• Communications network
• Data access
• Function service time
• Fault tolerance
What is New for Exascale
• Scale
• Increased range of network latencies and opportunities for packet collisions and routing variations
• Deeper memory hierarchy
• Scheduling conflicts for threads to cores
• Active response to errors
• Variable instruction rate from clock and voltage changes
• Finer-grain threads as a normalization factor of time delays
4 Requirements
1. Exploit physical locality opportunities when they are present, even if runtime determined
2. Avoid physical resource blocking caused by logical task blocking of asynchronous actions
3. Permit varied order of execution of delineated localized tasks
4. Expose intermediate-level parallelism and avoid over-constrained synchronization
Synthesis of Concepts
• Synchronous domains – computational complex (e.g., threads, codelets)
– Exploits physical locality and predictable synchronous operation
– Split-phase transactions
• Blocking avoidance
– Lightweight multi-threading with active context switching
– Dynamic allocation of local execution resources (cores, hardware threads)
– Runtime task queuing
• Variable execution ordering
– Message-driven computation: parcels move work to data when preferable
– Percolation
– Constraint-based synchronization – declarative criteria for work that is event-driven
• Expose parallelism
– Data-directed execution
– Merger of flow control and data structure
– Shared name space
ParalleX Execution Model
HPX Runtime System Design
• The current version of HPX provides the following infrastructure on conventional systems, as defined by the ParalleX execution model:
– Active Global Address Space (AGAS)
– HPX threads and thread management
– Parcel transport and parcel management
– Local Control Objects (LCOs)
– HPX processes (distributed objects)
  • Namespace and policy management, locality control
– Monitoring subsystem
[Diagram: HPX runtime architecture – LCOs, Action Manager, Parcel Port, Process Manager, Local Memory Management, Performance Monitor, Performance Counters, Interconnect]
Active Global Address Space
• Global address space throughout the system.
– Removes dependency on static data distribution.
– Enables dynamic load balancing of application and system data.
• AGAS assigns global names (identifiers, unstructured 128-bit integers) to all entities managed by HPX.
• Unlike PGAS, it provides a mechanism to resolve global identifiers into corresponding local virtual addresses (LVAs).
– An LVA comprises a locality ID, the type of the entity being referenced, and its local memory address.
– Moving an entity to a different locality updates this mapping.
– The current implementation is based on a centralized database storing the mappings, accessible over the local area network.
– Local caching policies have been implemented to prevent bottlenecks and minimize the number of required round-trips.
• The current implementation allows autonomous creation of globally unique IDs in the locality where the entity is initially located, and supports memory pooling of similar objects to minimize overhead.
• A garbage collection scheme for HPX objects has been implemented.
Thread Management
• The thread manager is modular and implements work-queue based task management
• Threads are cooperatively scheduled at user level without requiring a kernel transition
• Specially designed synchronization primitives such as semaphores and mutexes allow synchronization of HPX threads in the same way as conventional threads
• Thread management currently supports several key modes:
– Global thread queue
– Local queue (work stealing)
– Local priority queue (work stealing)
LCOs (Local Control Objects)
• LCOs provide a means of controlling parallelization and synchronization of HPX threads.
• They enable event-driven thread creation and can support in-place data structure protection and on-the-fly scheduling.
• Preferably embedded in the data structures they protect.
• Abstraction of a multitude of different functionalities:
– Event-driven PX-thread creation.
– Protection of data structures from race conditions.
– Automatic on-the-fly scheduling of work.
• An LCO may create (or reactivate) an HPX thread as a result of “being triggered”.
Constraint‐based Synchronization
• Supports dynamic, adaptive task scheduling
• Declarative semantics for continuation of execution
– Defines conditions for work to be performed
– Not imperative code by the user
• Establishes criteria for task instantiation
• Supports DAG flow-control representation
• Examples:
– Dataflow
– Futures
Future LCOs
• Future LCOs are essential to HPX because they are the primary method of controlling asynchrony in HPX.
• Compatible with std::future<>.
• Futures act as proxies for values that are being computed asynchronously by actions.
[Diagram: future execution – on Locality 1, the consumer thread suspends and another thread executes while Locality 2 runs the producer thread; the result is returned to the future object and the consumer thread resumes]
Dataflow LCOs
• Dataflow LCOs are an extension of futures that enable dependency-driven asynchrony.
• Compatible with std::future<>.
• Computation of the action associated with a dataflow does not begin until all the arguments of the dataflow are ready.
– Non-LCO arguments are always ready.
– LCO arguments, such as futures, might not be ready.
Parcel Management
• Active messages (parcels)
• A parcel carries a destination address, a function to execute, parameters, and a continuation
• All inter-locality messaging is based on parcels
• In HPX, parcels are represented as polymorphic objects
• An HPX entity, on creating a parcel object, hands it to the parcel handler
• The parcel handler serializes the parcel; all dependent data is bundled along with the parcel
• At the receiving locality, the parcel is de-serialized and causes an HPX thread to be created based on its content
[Diagram: parcel transport – on Locality 1, the Action Manager calls put() on a parcel object and the Parcel Handler serializes it; the serialized parcel crosses to Locality 2, where it is de-serialized and HPX threads are created]
Adaptive Mesh Refinement
• Why adaptive mesh refinement?
• The AMR dataflow in ParalleX
• The impact of granularity
• 3-D test problem: comparisons between MPI-based publicly available AMR toolkits and ParalleX AMR
• Impact of nedmalloc, concur, tcmalloc, and jemalloc
Constraint based Synchronization for AMR
Application: Adaptive Mesh Refinement (AMR) for Astrophysics simulations
• ParalleX-based AMR removes all global computation barriers, including the timestep barrier (so not all points have to reach the same timestep in order to proceed computing)
MHDe Code Tests
• All C++
• PPM reconstruction with HLLE numerical flux
• AMR
• Equations described in http://arxiv.org/abs/gr-qc/0605102
Finite Temperature Equation of State
• Putting in the right nuclear physics (polytropes don't have the right compactness)
• To do anything with neutrinos, you need a temperature: neutrino cooling, dynamics of hypernovae
• To do anything with radiation, you need a physical temperature
• This is a springboard for new astrophysics and microphysics
Equations of State Come in Tables
• It is not practical to do the EOS calculation in place; it's best to use a look-up table
• The table covers a lot of physics for you
• In high-energy astrophysics:
– Temperatures vary from 0 to > 100 MeV
– Proton fraction changes from 0 to 0.6
– Density varies across 10 orders of magnitude
• Finer-grid tables are better for accuracy
• Tables need to cover a wide variety of conditions: black hole formation, neutron star mergers, nucleosynthesis
Preliminary Radial Pulsation Frequencies
[Plot: preliminary radial pulsation frequencies – values in the range 0.00076 to 0.00079 over a horizontal axis spanning roughly 100 to 700]
Conclusions
• Exascale computing (really, trans-petaflops) will exhibit an increasing degree of asynchronous functionality
• Runtime adaptation is probably required to address these sources of unpredictability in time-domain behavior
• Key concepts, when employed in synergy, may address this:
– Multi-threading
– Constraint-based synchronization (an experimental step)
– Dynamic, overlapped/multiphase message-driven execution
– Global address space
• Early runtime experiments with the ParalleX-based HPX runtime:
– ParalleX provides an experimental model with an HPX implementation
– Demonstrates adaptivity in the presence of asynchrony