TRANSCRIPT
COMPUTER ARCHITECTURE GROUP
Omer Khan, Assistant Professor
Electrical and Computer Engineering, University of Connecticut
Contact Email: [email protected]
Many-core Architecture Characterization of the Path Planning Workload
CogArch Workshop 2015
Path Planning in Cognitive Computing
Sensors   Acquisition & Perception
Controllers   Planner
Actors   Scheduler
Image from KIT Institute for Anthropomatics
Desired Driving Behavior
Actual Driving Behavior
[Figure: planning a route from source S to destination D]
• Collision-free path? • Most efficient?
Path Planning: A Challenge Application
[Figure: Path Planning at the center, bounded by Real-time Performance, Energy Constraints, and Resiliency Constraints]
Path Planning: Efficiency via Parallelization
Shortest Path Dijkstra Algorithm:
  Initialize Nodes(), Edges()
  For (each Node u)        -- Outer Loop visits each node once
    For (each Edge of u)   -- Inner Loop
      1. Calculate the distance from the current node to each neighbor
      2. Check for the next best node (u) among the neighbors
Complexity: O(N·log N + E)
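The loop structure above maps onto a standard heap-based sequential Dijkstra. A minimal sketch (the graph, weights, and names are illustrative, not from the talk):

```python
import heapq

def dijkstra(adj, src):
    """adj: {node: [(neighbor, weight), ...]}; returns shortest distances from src."""
    dist = {u: float("inf") for u in adj}
    dist[src] = 0.0
    heap = [(0.0, src)]                      # priority queue keyed on distance
    while heap:                              # outer loop: settle nodes
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                         # skip stale queue entries
        for v, w in adj[u]:                  # inner loop: relax each edge of u
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

adj = {0: [(1, 1.3), (2, 0.5)], 1: [(3, 1.1)], 2: [(1, 0.8), (3, 1.2)], 3: []}
print(dijkstra(adj, 0))  # shortest distances from node 0
```

Each node is settled once, which is why the talk calls the sequential version work efficient.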
Inner Loop parallelization
Each thread is assigned a set of the current node's edges to relax. BARRIERs are applied to synchronize threads during each iteration.
Pro: Work efficient due to no redundant computations compared to sequential
Con: Hard to parallelize due to lack of edge-level parallelism (bi-directional search improves concurrency)
Outer Loop parallelization
• Convergence based: Divide nodes among threads; each thread then relaxes its nodes iteratively until the distances converge (i.e., no change)
• Range based: Dynamically distribute a "range of nodes" among threads; each thread then relaxes its set of nodes one by one
[Figure: example graph with edge weights between 0.5 and 1.3]
LOCK each node to avoid potential races among threads relaxing shared nodes. Apply a BARRIER to synchronize threads after each iteration.
Convergence based:
Con: Work inefficient due to many redundant relaxations for nodes that stabilize before convergence
Pro: Highly parallelizable due to node-level parallelism
Range based:
Pro: Work efficient and exploits node-level parallelism (bi-directional search improves concurrency)
Con: Needs intelligent scheduling for dynamic work balancing
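The convergence-based strategy can be sketched as runnable code: a minimal Python model with a static node partition, per-node locks, and a barrier per iteration. The graph and thread count are illustrative, and since Python threads serialize on the GIL this shows the synchronization pattern rather than real speedup:

```python
import threading

def convergence_sssp(adj, src, num_threads=4):
    """Convergence-based outer-loop SSSP: each thread repeatedly relaxes its
    statically assigned nodes until no distance changes anywhere."""
    nodes = list(adj)
    dist = {u: float("inf") for u in nodes}
    dist[src] = 0.0
    locks = {u: threading.Lock() for u in nodes}   # LOCK each shared node
    barrier = threading.Barrier(num_threads)       # BARRIER per iteration
    changed = [True]

    def worker(tid):
        my_nodes = nodes[tid::num_threads]         # static node partition
        while changed[0]:
            barrier.wait()
            if tid == 0:
                changed[0] = False                 # reset the convergence flag
            barrier.wait()
            for u in my_nodes:
                du = dist[u]                       # may be stale; later sweeps fix it
                if du == float("inf"):
                    continue
                for v, w in adj[u]:
                    with locks[v]:                 # guard the shared relaxation
                        if du + w < dist[v]:
                            dist[v] = du + w
                            changed[0] = True      # another sweep is needed
            barrier.wait()                         # all threads finish the sweep

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dist

adj = {0: [(1, 1.3), (2, 0.5)], 1: [(3, 1.1)], 2: [(1, 0.8), (3, 1.2)], 3: []}
print(convergence_sssp(adj, 0))
```

The redundant relaxations (every thread re-sweeps all its nodes each iteration) are exactly the work inefficiency the slide flags.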
Characterization Space
• Simulated a tiled 256-core NoC-based multicore
• Algorithms: Single- and Multiple-Objective Shortest Path (SOSP/MOSP)
  • Dijkstra: Visits each node once, hence high work complexity
  • Heuristic Algorithms: useful for pre-processed input graphs
    • A*, D*: Number of nodes visited is dramatically reduced
  • ∆-Stepping: Visits each node once, but work done per node is determined by "delta"
  • Martin's Algorithm: Similar to Dijkstra, but considers an additional objective when evaluating for the next best node
• Parallelization strategies
  • Inner Loop
  • Outer Loop: Convergence and Range based
• Inputs (methodology similar to GTgraph from Georgia Tech)
  • Adjacency list representation with randomly distributed edge weights
  • Graph configurations
    • Number of nodes: 16K – 4M
    • Sparse graph: 4 – 32 edges/node
    • Dense graph: 8K edges/node
Characterization Objectives
• Path planning is challenging because it
  • Operates on unstructured data
  • Has complex dependence patterns between tasks that are known only during program execution
• Characterization reveals four areas where computing must adapt at runtime to exploit execution efficiency:
  1. Dynamic Workload Selection and Balancing
  2. Concurrency Controls
  3. Input Dependence
  4. Accuracy of Computation
Dense Graph (16K nodes, 8K edges/node)
Shortest Path Algorithm | Completion Time (ms) | Num. of Threads | Accuracy (%)
Dijkstra Sequential (baseline) | 14200 | 1 | 100%
Inner Loop Par | 549 | 32 |
Bi-directional Inner Loop Par | 356 | 96 |
Convergence-based Outer Loop Par | 6300 | 160 |
Range-based Outer Loop Par | 7691 | 256 |
Bi-directional Range-based Outer Loop Par | 7424 | 256 |
Comments: Convergence based is work inefficient; range based incurs extra communication; inner loop has good concurrency, giving ~40X speedup
D* Sequential | 74 | 1 | 97%
D* Bi-directional Inner Loop Par | 2.51 | 32 |
Comments: Very high work efficiency
Martin's Sequential (baseline) | 14500 | 1 | 100%
Bi-directional Inner Loop Par | 321 | 128 |
Bi-directional Range-based Outer Loop Par | 2859 | 256 |
Comments: ~45X speedup
Sparse Graph (16K nodes, 16 edges/node)
Shortest Path Algorithm | Completion Time (ms) | Num. of Threads | Accuracy (%)
Dijkstra Sequential (baseline) | 30 | 1 | 100%
Inner Loop Par | 30 | 1 |
Bi-directional Inner Loop Par | 14 | 2 |
Convergence-based Outer Loop Par | 377 | 256 |
Range-based Outer Loop Par | 10.8 | 128 |
Bi-directional Range-based Outer Loop Par | 6.4 | 192 |
Comments: Inner loop has low concurrency; convergence-based outer loop is work inefficient; range-based bi-directional gives ~5X speedup
∆-Stepping Inner Par (∆=50%) | 2.9 | 16 | 80%
Comments: Work efficient
D* Sequential | 4 | 1 | 97%
D* Bi-directional Inner Loop Par | 3.4 | 2 |
Ø Motivates the need to adapt to "choices" along (1) input dependence, (2) dynamic workload balancing, (3) concurrency controls, and (4) the accuracy of computation
“Situational Scheduler” for Adaptation
Algorithmic Choices
Architectural Choices
Many-core Substrate
Performance Monitoring/ Decision Engine
Ø User/programmer selects an algorithm or heuristic based on static information such as input characteristics or solution accuracy requirements
Ø Utilize runtime information to make decisions regarding concurrency control and workload balancing methods
• Develop models that predict the choices to improve efficiency of computation
  • May utilize heuristic, control-theoretic, or even machine learning methods
  • Goal is to design the decision engine to be low overhead, with high accuracy of predicting the right decision
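As a toy illustration of such a decision engine, a heuristic rule that maps static input characteristics and accuracy requirements to a strategy could look like this. The thresholds and strategy names are assumptions, loosely echoing the dense/sparse results above, not the actual engine:

```python
def choose_strategy(num_nodes, avg_degree, accuracy_required=1.0, preprocessed=False):
    """Illustrative heuristic: map graph shape and accuracy needs to a strategy.
    Thresholds are made-up stand-ins for a learned or tuned decision model."""
    if preprocessed and accuracy_required <= 0.97:
        return "D* bi-directional inner-loop"        # heuristic search, ~97% accurate
    if avg_degree >= num_nodes // 4:                 # dense: ample edge-level parallelism
        return "bi-directional inner-loop Dijkstra"
    return "bi-directional range-based outer-loop Dijkstra"  # sparse: node-level parallelism

print(choose_strategy(16_000, 8_000))   # dense graph case
print(choose_strategy(16_000, 16))      # sparse graph case
```

A real engine would replace the fixed thresholds with the runtime-monitored, low-overhead predictor the slide describes.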
Architecture Innovations for Extreme Efficiency: Range-based Outer Loop Parallel Dijkstra
while all nodes are not visited
  • Calculate the next range of nodes to relax based on the degree of the graph
  • Distribute nodes in the current range among worker threads
  while assigned nodes are not visited (update Q array)
    • LOCK the node to be relaxed
    • for all neighbor nodes
      • Update the D array with the new distance, if it is less than the old distance (RELAX function)
    • UNLOCK the node that was relaxed
  BARRIER to synchronize all threads
Master Thread (also participates as a Worker thread); Worker Thread (only one shown here, others do the same work)
Message flow: 1. Create work -> 2. Here is work -> 3. Wait and do the assigned work -> 4. Done -> 5. All Done
Targets: Accelerate Computation | Accelerate Communication | Accelerate Data Access
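The range-based scheme above can be sketched as runnable code. This is a minimal Python model: the range-selection rule (all unvisited nodes within a fixed width of the current minimum) and the re-opening of improved nodes are assumptions made so the sketch converges to exact distances, and the GIL means it shows the master/worker synchronization structure rather than real speedup:

```python
import threading

def range_based_dijkstra(adj, src, num_threads=4, range_width=1.0):
    """Range-based outer-loop parallel Dijkstra sketch with per-node locks,
    a visited (Q) array, and barriers between ranges."""
    nodes = list(adj)
    INF = float("inf")
    dist = {u: INF for u in nodes}
    dist[src] = 0.0
    visited = {u: False for u in nodes}              # the Q array
    locks = {u: threading.Lock() for u in nodes}
    barrier = threading.Barrier(num_threads)
    current = [[]]                                   # the active range of nodes

    def worker(tid):
        while True:
            if tid == 0:                             # master: create work
                frontier = [u for u in nodes
                            if not visited[u] and dist[u] < INF]
                lo = min((dist[u] for u in frontier), default=0.0)
                current[0] = [u for u in frontier
                              if dist[u] < lo + range_width]
            barrier.wait()                           # "here is work"
            batch = current[0]
            if not batch:
                break                                # all done
            for u in batch[tid::num_threads]:        # split the range among workers
                visited[u] = True
                du = dist[u]
                for v, w in adj[u]:
                    with locks[v]:                   # LOCK, RELAX, UNLOCK
                        if du + w < dist[v]:
                            dist[v] = du + w
                            visited[v] = False       # re-open an improved node
            barrier.wait()                           # sync before the next range

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dist

adj = {0: [(1, 1.3), (2, 0.5)], 1: [(3, 1.1)], 2: [(1, 0.8), (3, 1.2)], 3: []}
print(range_based_dijkstra(adj, 0))
```

A wide range admits more parallelism per iteration at the cost of occasional re-relaxations, which is the work/concurrency tradeoff the characterization tables quantify.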
Data Access Efficiency
• Data locality in path planning is challenging due to unstructured, data-dependent accesses
• Locality-aware protocols [ISCA'13, HPCA'14]
  • Exploit the runtime variability in locality/reuse of data at various layers of the on-chip cache hierarchy
  • Intelligent fine-grain data allocation/replication at private and shared caches
    • Locality-aware private-L1 allocation/replication
    • Locality-aware shared-L2 replication
• Locality-aware Private Caching [ISCA'13]
  • Privately cache high-locality lines at the L1 cache
  • Remotely access (at word level) low-locality lines at the L2 cache
  • Allocation based on locality of data, dynamically profiled using cache-line-level in-hardware locality classifiers
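A toy software model of the cache-line-level locality classifier: count reuses per line and cache privately only once reuse crosses a threshold. The counter scheme and threshold are assumptions for illustration, not the ISCA'13 hardware design:

```python
class LocalityClassifier:
    """Toy per-cache-line locality classifier: lines that show enough reuse are
    cached privately at L1; low-locality lines are served as remote word
    accesses at the shared L2. The threshold is an illustrative assumption."""
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.reuse = {}                      # per-line access counter

    def access(self, line_addr):
        n = self.reuse.get(line_addr, 0) + 1
        self.reuse[line_addr] = n
        return "private-L1" if n >= self.threshold else "remote-word"

clf = LocalityClassifier(threshold=2)
trace = [0x40, 0x80, 0x40, 0x40, 0xC0]
print([clf.access(a) for a in trace])
# ['remote-word', 'remote-word', 'private-L1', 'private-L1', 'remote-word']
```

In hardware the classification is per cache line and feeds the allocation decision; here it just returns which path a sketchy protocol would take.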
Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra
• Reducing Sharing Misses
[Figure: L1 cache miss breakdown (%): Cold, Capacity, Sharing, Word, for Baseline vs. Locality-aware Private Caching]
• Sharing misses (expensive) are turned into word misses (cheap) as more cache lines with low locality are identified by the hardware classifiers
Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra
• Energy consumption tradeoffs
• Reduce invalidations, asynchronous write-backs, and cache-line ping-ponging
[Figure: energy consumption breakdown (DRAM, Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache) for Baseline vs. Locality-aware Private Caching]
Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra
• Completion time tradeoffs
• Less time spent waiting for coherence traffic to be serviced
• Critical section time reduction -> synchronization time reduction
[Figure: completion time (ns) breakdown (Synchronization, L2Home-OffChip, L2Home-Sharers, L2Home-Waiting, L1Cache-L2Home, Compute) for Baseline vs. Locality-aware Private Caching]
How about Accelerating Computation?
• Accelerating computation alone, even under an idealistic data access setup, is not sufficient!
Ø Must address data dependencies that lead to fine-grain communication bottlenecks
[Figure: Aladdin [Shao et al., ISCA'14] tool's design space: Area (um^2) vs. Cycles for DIJKSTRA and FFT, showing the latency gap]
The Case for a Many-core Accelerator
[Figure: Conventional System vs. Our Proposal]
Conventional System: Core / Core / Core, with data access and coherence/communication over shared memory
Our Proposal (accelerate computation + data access + communication): Core+ACC / Core+ACC / Core+ACC, with locality-aware data access plus send()/receive() messages, and coherence over shared memory
Why Accelerate Communication? Example
• The ordering core receives packets (potentially out of order) from many flow cores, and it reorders and commits the packets
• Several shared memory versions implemented (we show only the best lockless shared data structure implementation)
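The ordering core's reorder-and-commit job can be modeled sequentially. This is a sketch of the behavior only, not the lockless shared data structure or the messaging implementation from the talk:

```python
class OrderingCore:
    """Buffers packets that arrive out of order and commits them strictly in
    sequence-number order, as soon as the next expected packet is available."""
    def __init__(self):
        self.next_seq = 0
        self.pending = {}                             # seq -> packet awaiting its turn
        self.committed = []

    def receive(self, seq, packet):
        self.pending[seq] = packet
        while self.next_seq in self.pending:          # drain any in-order run
            self.committed.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

core = OrderingCore()
for seq, pkt in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:  # out-of-order arrivals
    core.receive(seq, pkt)
print(core.committed)  # prints ['a', 'b', 'c', 'd']
```

With explicit messaging, each receive() maps to a point-to-point message into the ordering core's queues, avoiding the shared-structure ping-ponging measured on the next slide.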
[Figure: eight Flow Cores feed one Ordering Core. Shared memory version: a lockless shared data structure. Explicit messaging version: send() from flow cores into the ordering core's recv() queues]
Why Accelerate Communication? Example
• Shared memory: 20 cycles/packet at 256 cores is the best result
• Explicit Messaging: 10 cycles/packet at 4 cores
• ~2X latency advantage using point-to-point communication and by avoiding data ping-ponging
[Figure: latency per packet vs. number of cores (2 to 256) for Shared Memory vs. Explicit Messaging]
Methods
• In-house simulator with ADL front-end: simple in-order RISC cores
• Compiler support for send() and receive() instructions
• BARC'15 paper
What about Resilience?
• Redundancy alone can achieve resiliency but hurts efficiency
• Our Approach: Given correctness-guarantee constraints, selectively apply resilience to code that is crucial for program correctness and output
• Opens a new research direction that trades off program accuracy with efficient resiliency [CAL'15]
[Figure: resiliency methods (symptom-based, n-modular, software) compared along performance/energy/power overheads, error vulnerability (coverage), and program accuracy]
Declarative Resilience: D* Heuristic Algorithm for Path Planning
• The heuristic algorithm D* (aka A*) is work efficient and popular in applications that use pre-processed input graphs
[Figure: D* shortest path from S to D: correct shortest path vs. perturbed shortest path]
• Declarative resilience allows the heuristic calculation to be considered non-crucial code, and hence minor perturbations can be tolerated
Declarative Resilience: D* Heuristic Algorithm for Path Planning
Sequential pseudo-code for D*:
  while not at the destination node
    RESILIENCE OFF
    • for all neighbor nodes
      • Look up edge weights for neighboring nodes
      • Use the heuristic to calculate the next node with minimum distance
    RESILIENCE ON
    Go to the next best node
Considerations for program correctness of non-crucial code:
1. Unroll the for loop to remove all control-flow instructions
2. No stores to globally visible memory, i.e., all updates are local
3. Local store address calculation protected using redundancy
4. Next-node calculation checked for "within bounds"
   • Based on current node ID and degree of the graph
   • If bounds are violated, the next node is not updated (i.e., re-execute the current node)
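Consideration 4 can be sketched as a commit-or-re-execute check on the (possibly fault-perturbed) next-node result. The legal-neighbor bound rule and the return convention here are illustrative assumptions, not the exact hardware check:

```python
def commit_next_node(current, candidate, num_nodes, neighbors):
    """Accept the next-node result from the unprotected (RESILIENCE OFF) region
    only if it is in range and a legal neighbor of the current node; otherwise
    keep the current node and signal that it must be re-executed."""
    if 0 <= candidate < num_nodes and candidate in neighbors:
        return candidate, False          # within bounds: commit the move
    return current, True                 # bounds violated: re-execute current node

print(commit_next_node(5, 6, 16, {4, 6}))     # legal neighbor commits: (6, False)
print(commit_next_node(5, 999, 16, {4, 6}))   # corrupted result: (5, True)
```

Only this small check (and the crucial code) needs full protection; the heuristic calculation itself can tolerate the minor perturbations shown in the figure above.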
Declarative Resilience Results: D* Heuristic Algorithm for Path Planning
[Figure: normalized completion time for Baseline, Re-Execution, and Declarative Resilience, broken into Resilience-On-Delay, Re-Execution-Time, Network-Recv-Stall-Time, Synchronization-Stall-Time, Compute-Time, and Memory-Stall-Time]
• Re-executing all instructions incurs 30% performance overhead [COMPUTER'13]
• Declarative Resilience performance overhead is 8%
  • Protects all crucial code, and against the side effects of non-crucial code
Summary
• Exploiting concurrency in path planning is non-trivial because (1) it operates on unstructured data, and (2) complex dependence patterns between tasks are known only during program execution
Ø Develop a "Situational Scheduler" that adapts to (1) input dependence, (2) dynamic workload variations, (3) exploitable concurrency, and (4) accuracy requirements
Ø Many-core architectures must accelerate computation, communication, and data accesses for extreme efficiency
Ø Resiliency of computation must be considered as a first-order metric. A declarative resiliency method potentially reduces the efficiency overheads of resilience