TRANSCRIPT
COMPUTER ARCHITECTURE GROUP
Omer Khan, Assistant Professor
Electrical and Computer Engineering, University of Connecticut
Contact Email: [email protected]
Many-core Architecture Characterization of the Path Planning Workload
CogArch Workshop 2015
Path Planning in Cognitive Computing
Sensors   Acquisition & Perception
Controllers   Planner
Actors   Scheduler
Image from KIT Institute for Anthropomatics
Desired Driving Behavior
Actual Driving Behavior
[Figure: planning a route from source S to destination D]
• Collision-free path? • Most efficient?
Path Planning: A Challenge Application
[Figure: Path Planning at the center, bounded by Real-time Performance, Energy Constraints, and Resiliency Constraints]
Path Planning: Efficiency via Parallelization
Shortest Path Dijkstra Algorithm:
  Initialize Nodes(), Edges()
  For (each Node u)        -- Outer Loop visits each node once
    For (each Edge of u)   -- Inner Loop
      1. Calculate the distance from the current node to each neighbor
      2. Check for the next best node (u) among the neighbors
Complexity: O(N·log N + E)
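The loop structure above maps onto a standard heap-based sequential Dijkstra. A minimal sketch (the graph, weights, and names are illustrative, not from the talk):

```python
import heapq

def dijkstra(adj, src):
    """adj: {node: [(neighbor, weight), ...]}; returns shortest distances from src."""
    dist = {u: float("inf") for u in adj}
    dist[src] = 0.0
    heap = [(0.0, src)]                      # priority queue keyed on distance
    while heap:                              # outer loop: settle nodes
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                         # skip stale queue entries
        for v, w in adj[u]:                  # inner loop: relax each edge of u
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

adj = {0: [(1, 1.3), (2, 0.5)], 1: [(3, 1.1)], 2: [(1, 0.8), (3, 1.2)], 3: []}
print(dijkstra(adj, 0))  # shortest distances from node 0
```

Each node is settled once, which is why the talk calls the sequential version work efficient.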
Inner Loop parallelization
Each thread is assigned a set of the current node's edges to relax. BARRIERs are applied to synchronize threads during each iteration.
Pro: Work efficient due to no redundant computations compared to sequential
Con: Hard to parallelize due to lack of edge-level parallelism (bi-directional search improves concurrency)
Outer Loop parallelization
• Convergence based: Divide nodes among threads; each thread then relaxes its nodes iteratively until the distances converge (i.e., no change)
• Range based: Dynamically distribute a "range of nodes" among threads; each thread then relaxes its set of nodes one by one
[Figure: example graph with edge weights between 0.5 and 1.3]
LOCK each node to avoid potential races among threads relaxing shared nodes. Apply a BARRIER to synchronize threads after each iteration.
Convergence based:
Con: Work inefficient due to many redundant relaxations for nodes that stabilize before convergence
Pro: Highly parallelizable due to node-level parallelism
Range based:
Pro: Work efficient and exploits node-level parallelism (bi-directional search improves concurrency)
Con: Needs intelligent scheduling for dynamic work balancing
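The convergence-based strategy can be sketched as runnable code: a minimal Python model with a static node partition, per-node locks, and a barrier per iteration. The graph and thread count are illustrative, and since Python threads serialize on the GIL this shows the synchronization pattern rather than real speedup:

```python
import threading

def convergence_sssp(adj, src, num_threads=4):
    """Convergence-based outer-loop SSSP: each thread repeatedly relaxes its
    statically assigned nodes until no distance changes anywhere."""
    nodes = list(adj)
    dist = {u: float("inf") for u in nodes}
    dist[src] = 0.0
    locks = {u: threading.Lock() for u in nodes}   # LOCK each shared node
    barrier = threading.Barrier(num_threads)       # BARRIER per iteration
    changed = [True]

    def worker(tid):
        my_nodes = nodes[tid::num_threads]         # static node partition
        while changed[0]:
            barrier.wait()
            if tid == 0:
                changed[0] = False                 # reset the convergence flag
            barrier.wait()
            for u in my_nodes:
                du = dist[u]                       # may be stale; later sweeps fix it
                if du == float("inf"):
                    continue
                for v, w in adj[u]:
                    with locks[v]:                 # guard the shared relaxation
                        if du + w < dist[v]:
                            dist[v] = du + w
                            changed[0] = True      # another sweep is needed
            barrier.wait()                         # all threads finish the sweep

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dist

adj = {0: [(1, 1.3), (2, 0.5)], 1: [(3, 1.1)], 2: [(1, 0.8), (3, 1.2)], 3: []}
print(convergence_sssp(adj, 0))
```

The redundant relaxations (every thread re-sweeps all its nodes each iteration) are exactly the work inefficiency the slide flags.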
Characterization Space
• Simulated a tiled 256-core NoC-based multicore
• Algorithms: Single- and Multiple-Objective Shortest Path (SOSP/MOSP)
  • Dijkstra: Visits each node once, hence high work complexity
  • Heuristic Algorithms: useful for pre-processed input graphs
    • A*, D*: Number of nodes visited is dramatically reduced
  • ∆-Stepping: Visits each node once, but work done per node is determined by "delta"
  • Martin's Algorithm: Similar to Dijkstra, but considers an additional objective when evaluating for the next best node
• Parallelization strategies
  • Inner Loop
  • Outer Loop: Convergence and Range based
• Inputs (methodology similar to GTgraph from Georgia Tech)
  • Adjacency list representation with randomly distributed edge weights
  • Graph configurations
    • Number of nodes: 16K – 4M
    • Sparse graph: 4 – 32 edges/node
    • Dense graph: 8K edges/node
Characterization Objectives
• Path planning is challenging because it
  • Operates on unstructured data
  • Has complex dependence patterns between tasks that are known only during program execution
• Characterization reveals four areas where computing must adapt at runtime to exploit execution efficiency:
  1. Dynamic Workload Selection and Balancing
  2. Concurrency Controls
  3. Input Dependence
  4. Accuracy of Computation
Dense Graph (16K nodes, 8K edges/node)
Shortest Path Algorithm | Completion Time (ms) | Num. of Threads | Accuracy (%)
Dijkstra Sequential (baseline) | 14200 | 1 | 100%
Inner Loop Par | 549 | 32 |
Bi-directional Inner Loop Par | 356 | 96 |
Convergence-based Outer Loop Par | 6300 | 160 |
Range-based Outer Loop Par | 7691 | 256 |
Bi-directional Range-based Outer Loop Par | 7424 | 256 |
Comments: Convergence based is work inefficient; range based incurs extra communication; inner loop has good concurrency, giving ~40X speedup
D* Sequential | 74 | 1 | 97%
D* Bi-directional Inner Loop Par | 2.51 | 32 |
Comments: Very high work efficiency
Martin's Sequential (baseline) | 14500 | 1 | 100%
Bi-directional Inner Loop Par | 321 | 128 |
Bi-directional Range-based Outer Loop Par | 2859 | 256 |
Comments: ~45X speedup
Sparse Graph (16K nodes, 16 edges/node)
Shortest Path Algorithm | Completion Time (ms) | Num. of Threads | Accuracy (%)
Dijkstra Sequential (baseline) | 30 | 1 | 100%
Inner Loop Par | 30 | 1 |
Bi-directional Inner Loop Par | 14 | 2 |
Convergence-based Outer Loop Par | 377 | 256 |
Range-based Outer Loop Par | 10.8 | 128 |
Bi-directional Range-based Outer Loop Par | 6.4 | 192 |
Comments: Inner loop has low concurrency; convergence-based outer loop is work inefficient; range-based bi-directional gives ~5X speedup
∆-Stepping Inner Par (∆=50%) | 2.9 | 16 | 80%
Comments: Work efficient
D* Sequential | 4 | 1 | 97%
D* Bi-directional Inner Loop Par | 3.4 | 2 |
Ø Motivates the need to adapt to "choices" along (1) input dependence, (2) dynamic workload balancing, (3) concurrency controls, and (4) the accuracy of computation
“Situational Scheduler” for Adaptation
Algorithmic Choices
Architectural Choices
Many-core Substrate
Performance Monitoring/ Decision Engine
Ø User/programmer selects an algorithm or heuristic based on static information such as input characteristics or solution accuracy requirements
Ø Utilize runtime information to make decisions regarding concurrency control and workload balancing methods
• Develop models that predict the choices to improve efficiency of computation
  • May utilize heuristic, control-theoretic, or even machine learning methods
  • Goal is to design the decision engine to be low overhead, with high accuracy of predicting the right decision
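As a toy illustration of such a decision engine, a heuristic rule that maps static input characteristics and accuracy requirements to a strategy could look like this. The thresholds and strategy names are assumptions, loosely echoing the dense/sparse results above, not the actual engine:

```python
def choose_strategy(num_nodes, avg_degree, accuracy_required=1.0, preprocessed=False):
    """Illustrative heuristic: map graph shape and accuracy needs to a strategy.
    Thresholds are made-up stand-ins for a learned or tuned decision model."""
    if preprocessed and accuracy_required <= 0.97:
        return "D* bi-directional inner-loop"        # heuristic search, ~97% accurate
    if avg_degree >= num_nodes // 4:                 # dense: ample edge-level parallelism
        return "bi-directional inner-loop Dijkstra"
    return "bi-directional range-based outer-loop Dijkstra"  # sparse: node-level parallelism

print(choose_strategy(16_000, 8_000))   # dense graph case
print(choose_strategy(16_000, 16))      # sparse graph case
```

A real engine would replace the fixed thresholds with the runtime-monitored, low-overhead predictor the slide describes.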
Architecture Innovations for Extreme Efficiency: Range-based Outer Loop Parallel Dijkstra
while all nodes are not visited
  • Calculate the next range of nodes to relax based on the degree of the graph
  • Distribute nodes in the current range among worker threads
  while assigned nodes are not visited (update Q array)
    • LOCK the node to be relaxed
    • for all neighbor nodes
      • Update the D array with the new distance, if it is less than the old distance (RELAX function)
    • UNLOCK the node that was relaxed
  BARRIER to synchronize all threads
Master Thread (also participates as a Worker thread); Worker Thread (only one shown here, others do the same work)
Message flow: 1. Create work -> 2. Here is work -> 3. Wait and do the assigned work -> 4. Done -> 5. All Done
Targets: Accelerate Computation | Accelerate Communication | Accelerate Data Access
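The range-based scheme above can be sketched as runnable code. This is a minimal Python model: the range-selection rule (all unvisited nodes within a fixed width of the current minimum) and the re-opening of improved nodes are assumptions made so the sketch converges to exact distances, and the GIL means it shows the master/worker synchronization structure rather than real speedup:

```python
import threading

def range_based_dijkstra(adj, src, num_threads=4, range_width=1.0):
    """Range-based outer-loop parallel Dijkstra sketch with per-node locks,
    a visited (Q) array, and barriers between ranges."""
    nodes = list(adj)
    INF = float("inf")
    dist = {u: INF for u in nodes}
    dist[src] = 0.0
    visited = {u: False for u in nodes}              # the Q array
    locks = {u: threading.Lock() for u in nodes}
    barrier = threading.Barrier(num_threads)
    current = [[]]                                   # the active range of nodes

    def worker(tid):
        while True:
            if tid == 0:                             # master: create work
                frontier = [u for u in nodes
                            if not visited[u] and dist[u] < INF]
                lo = min((dist[u] for u in frontier), default=0.0)
                current[0] = [u for u in frontier
                              if dist[u] < lo + range_width]
            barrier.wait()                           # "here is work"
            batch = current[0]
            if not batch:
                break                                # all done
            for u in batch[tid::num_threads]:        # split the range among workers
                visited[u] = True
                du = dist[u]
                for v, w in adj[u]:
                    with locks[v]:                   # LOCK, RELAX, UNLOCK
                        if du + w < dist[v]:
                            dist[v] = du + w
                            visited[v] = False       # re-open an improved node
            barrier.wait()                           # sync before the next range

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dist

adj = {0: [(1, 1.3), (2, 0.5)], 1: [(3, 1.1)], 2: [(1, 0.8), (3, 1.2)], 3: []}
print(range_based_dijkstra(adj, 0))
```

A wide range admits more parallelism per iteration at the cost of occasional re-relaxations, which is the work/concurrency tradeoff the characterization tables quantify.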
Data Access Efficiency
• Data locality in path planning is challenging due to unstructured, data-dependent accesses
• Locality-aware protocols [ISCA'13, HPCA'14]
  • Exploit the runtime variability in locality/reuse of data at various layers of the on-chip cache hierarchy
  • Intelligent fine-grain data allocation/replication at private and shared caches
    • Locality-aware private-L1 allocation/replication
    • Locality-aware shared-L2 replication
• Locality-aware Private Caching [ISCA'13]
  • Privately cache high-locality lines at the L1 cache
  • Remotely access (at word level) low-locality lines at the L2 cache
  • Allocation based on locality of data, dynamically profiled using cache-line-level in-hardware locality classifiers
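A toy software model of the cache-line-level locality classifier: count reuses per line and cache privately only once reuse crosses a threshold. The counter scheme and threshold are assumptions for illustration, not the ISCA'13 hardware design:

```python
class LocalityClassifier:
    """Toy per-cache-line locality classifier: lines that show enough reuse are
    cached privately at L1; low-locality lines are served as remote word
    accesses at the shared L2. The threshold is an illustrative assumption."""
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.reuse = {}                      # per-line access counter

    def access(self, line_addr):
        n = self.reuse.get(line_addr, 0) + 1
        self.reuse[line_addr] = n
        return "private-L1" if n >= self.threshold else "remote-word"

clf = LocalityClassifier(threshold=2)
trace = [0x40, 0x80, 0x40, 0x40, 0xC0]
print([clf.access(a) for a in trace])
# ['remote-word', 'remote-word', 'private-L1', 'private-L1', 'remote-word']
```

In hardware the classification is per cache line and feeds the allocation decision; here it just returns which path a sketchy protocol would take.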
Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra
• Reducing Sharing Misses
[Figure: L1 cache miss breakdown (%): Cold, Capacity, Sharing, Word, for Baseline vs. Locality-aware Private Caching]
• Sharing misses (expensive) are turned into word misses (cheap) as more cache lines with low locality are identified by the hardware classifiers
Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra
• Energy consumption tradeoffs
• Reduce invalidations, asynchronous write-backs, and cache-line ping-ponging
[Figure: energy consumption breakdown (DRAM, Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache) for Baseline vs. Locality-aware Private Caching]
Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra
• Completion time tradeoffs
• Less time spent waiting for coherence traffic to be serviced
• Critical section time reduction -> synchronization time reduction
[Figure: completion time (ns) breakdown (Synchronization, L2Home-OffChip, L2Home-Sharers, L2Home-Waiting, L1Cache-L2Home, Compute) for Baseline vs. Locality-aware Private Caching]
How about Accelerating Computation?
• Accelerating computation alone, even under an idealistic data access setup, is not sufficient!
Ø Must address data dependencies that lead to fine-grain communication bottlenecks
[Figure: Aladdin [Shao et al., ISCA'14] tool's design space: Area (um^2) vs. Cycles for DIJKSTRA and FFT, showing the latency gap]
The Case for a Many-core Accelerator
[Figure: Conventional System vs. Our Proposal]
Conventional System: Core / Core / Core, with data access and coherence/communication over shared memory
Our Proposal (accelerate computation + data access + communication): Core+ACC / Core+ACC / Core+ACC, with locality-aware data access plus send()/receive() messages, and coherence over shared memory
Why Accelerate Communication? Example
• The ordering core receives packets (potentially out of order) from many flow cores, and it reorders and commits the packets
• Several shared memory versions implemented (we show only the best lockless shared data structure implementation)
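The ordering core's reorder-and-commit job can be modeled sequentially. This is a sketch of the behavior only, not the lockless shared data structure or the messaging implementation from the talk:

```python
class OrderingCore:
    """Buffers packets that arrive out of order and commits them strictly in
    sequence-number order, as soon as the next expected packet is available."""
    def __init__(self):
        self.next_seq = 0
        self.pending = {}                             # seq -> packet awaiting its turn
        self.committed = []

    def receive(self, seq, packet):
        self.pending[seq] = packet
        while self.next_seq in self.pending:          # drain any in-order run
            self.committed.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

core = OrderingCore()
for seq, pkt in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:  # out-of-order arrivals
    core.receive(seq, pkt)
print(core.committed)  # prints ['a', 'b', 'c', 'd']
```

With explicit messaging, each receive() maps to a point-to-point message into the ordering core's queues, avoiding the shared-structure ping-ponging measured on the next slide.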
[Figure: eight Flow Cores feed one Ordering Core. Shared memory version: a lockless shared data structure. Explicit messaging version: send() from flow cores into the ordering core's recv() queues]
Why Accelerate Communication? Example
• Shared memory: 20 cycles/packet at 256 cores is the best result
• Explicit Messaging: 10 cycles/packet at 4 cores
• ~2X latency advantage using point-to-point communication and by avoiding data ping-ponging
[Figure: latency per packet vs. number of cores (2 to 256) for Shared Memory vs. Explicit Messaging]
Methods
• In-house simulator with ADL front-end: simple in-order RISC cores
• Compiler support for send() and receive() instructions
• BARC'15 paper
What about Resilience?
• Redundancy alone can achieve resiliency but hurts efficiency
• Our Approach: Given correctness-guarantee constraints, selectively apply resilience to code that is crucial for program correctness and output
• Opens a new research direction that trades off program accuracy with efficient resiliency [CAL'15]
[Figure: resiliency methods (symptom-based, n-modular, software) compared along performance/energy/power overheads, error vulnerability (coverage), and program accuracy]
Declarative Resilience: D* Heuristic Algorithm for Path Planning
• The heuristic algorithm D* (aka A*) is work efficient and popular in applications that use pre-processed input graphs
[Figure: D* shortest path from S to D: correct shortest path vs. perturbed shortest path]
• Declarative resilience allows the heuristic calculation to be considered non-crucial code, and hence minor perturbations can be tolerated
Declarative Resilience: D* Heuristic Algorithm for Path Planning
Sequential pseudo-code for D*:
  while not at the destination node
    RESILIENCE OFF
    • for all neighbor nodes
      • Look up edge weights for neighboring nodes
      • Use the heuristic to calculate the next node with minimum distance
    RESILIENCE ON
    Go to the next best node
Considerations for program correctness of non-crucial code:
1. Unroll the for loop to remove all control-flow instructions
2. No stores to globally visible memory, i.e., all updates are local
3. Local store address calculation protected using redundancy
4. Next-node calculation checked for "within bounds"
   • Based on current node ID and degree of the graph
   • If bounds are violated, the next node is not updated (i.e., re-execute the current node)
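Consideration 4 can be sketched as a commit-or-re-execute check on the (possibly fault-perturbed) next-node result. The legal-neighbor bound rule and the return convention here are illustrative assumptions, not the exact hardware check:

```python
def commit_next_node(current, candidate, num_nodes, neighbors):
    """Accept the next-node result from the unprotected (RESILIENCE OFF) region
    only if it is in range and a legal neighbor of the current node; otherwise
    keep the current node and signal that it must be re-executed."""
    if 0 <= candidate < num_nodes and candidate in neighbors:
        return candidate, False          # within bounds: commit the move
    return current, True                 # bounds violated: re-execute current node

print(commit_next_node(5, 6, 16, {4, 6}))     # legal neighbor commits: (6, False)
print(commit_next_node(5, 999, 16, {4, 6}))   # corrupted result: (5, True)
```

Only this small check (and the crucial code) needs full protection; the heuristic calculation itself can tolerate the minor perturbations shown in the figure above.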
Declarative Resilience Results: D* Heuristic Algorithm for Path Planning
[Figure: normalized completion time for Baseline, Re-Execution, and Declarative Resilience, broken into Resilience-On-Delay, Re-Execution-Time, Network-Recv-Stall-Time, Synchronization-Stall-Time, Compute-Time, and Memory-Stall-Time]
• Re-executing all instructions incurs 30% performance overhead [COMPUTER'13]
• Declarative Resilience performance overhead is 8%
  • Protects all crucial code, and against the side effects of non-crucial code
Summary
• Exploiting concurrency in path planning is non-trivial because (1) it operates on unstructured data, and (2) complex dependence patterns between tasks are known only during program execution
Ø Develop a "Situational Scheduler" that adapts to (1) input dependence, (2) dynamic workload variations, (3) exploitable concurrency, and (4) accuracy requirements
Ø Many-core architectures must accelerate computation, communication, and data accesses for extreme efficiency
Ø Resiliency of computation must be considered as a first-order metric. A declarative resiliency method potentially reduces the efficiency overheads of resilience