Page 1:

Adaptive Latency-Aware Parallel Resource Mapping: Task Graph Scheduling Heterogeneous Network Topology

Liwen Shih, Ph.D.

Computer Engineering U of Houston – Clear Lake

shih@uhcl.edu

Page 2:

ADAPTIVE PARALLEL TASK TO NETWORK TOPOLOGY MAPPING

Latency-adaptive:
• Topology
• Traffic
• Bandwidth
• Workload
• System hierarchy

Thread partition:
• Coarse
• Medium
• Fine

Page 3:


Fine-Grained Mapping System [Shih 1988]

• Parallel mapping – compile-time vs. run-time
• Task migration – vertical vs. horizontal
• Domain decomposition – data vs. function
• Execution order – eager data-driven vs. lazy demand-driven

Page 4:

PRIORITIZE TASK DFG NODES

Task priority factors:
1. Level depth
2. Critical paths
3. In/out degree

Data-flow partial order:
{(n7→n5), (n7→n4), (n6→n4), (n6→n3), (n5→n1), (n4→n2), (n3→n2), (n2→n1)}

Total task priority order: {n1 > n2 > n4 > n3 > n5 > n6 > n7}

P2 thread: {n1 > n2 > n4 > n3 > n6}

P3 thread: {n5 > n7}
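The order above can be reproduced from the three priority factors alone. Below is a minimal Python sketch (an illustration, not the paper's code) that computes level depth, critical-path counts via the 2-pass traversal that STEP 2 on Page 7 describes, and in/out degree for this example DFG:

```python
from collections import defaultdict
from functools import lru_cache

# Data-flow edges from the slide, written as (producer, consumer).
edges = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]
nodes = sorted({v for e in edges for v in e})
succ, pred = defaultdict(list), defaultdict(list)
for u, v in edges:
    succ[u].append(v)
    pred[v].append(u)

@lru_cache(maxsize=None)
def up(v):    # factor 1: level depth = longest path from any input to v
    return 1 + max((up(u) for u in pred[v]), default=0)

@lru_cache(maxsize=None)
def down(v):  # longest path from v to any output
    return 1 + max((down(w) for w in succ[v]), default=0)

@lru_cache(maxsize=None)
def cnt_up(v):    # number of longest input-to-v paths (bottom-up pass)
    best = [u for u in pred[v] if up(u) == up(v) - 1]
    return sum(cnt_up(u) for u in best) if best else 1

@lru_cache(maxsize=None)
def cnt_down(v):  # number of longest v-to-output paths (top-down pass)
    best = [w for w in succ[v] if down(w) == down(v) - 1]
    return sum(cnt_down(w) for w in best) if best else 1

# Factor 2: critical paths through v = (prefix count) * (suffix count),
# nonzero only when v lies on a path of maximum length L.
L = max(up(v) + down(v) - 1 for v in nodes)
cp = {v: cnt_up(v) * cnt_down(v) * (up(v) + down(v) - 1 == L) for v in nodes}
deg = {v: len(pred[v]) + len(succ[v]) for v in nodes}  # factor 3

order = sorted(nodes, key=lambda v: (-up(v), -cp[v], -deg[v]))
print(" > ".join(order))  # n1 > n2 > n4 > n3 > n5 > n6 > n7
```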

Page 5:

SHORTEST-PATH NETWORK ROUTING

Shortest-path latencies and routes are updated after each task-to-processor allocation.
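As an illustration of the latency/routing table pair this implies, here is a standard all-pairs shortest-path construction (Floyd-Warshall); the scheduler's incremental per-allocation update is not shown, and the link-latency dictionary is an assumed input format:

```python
# Sketch only: builds the inter-processor latency table (dist) and the
# first-hop routing table (nxt) for a network given per-channel latencies.
INF = float("inf")

def shortest_paths(num_procs, links):
    """links: {(p, q): latency} for each undirected channel in C."""
    dist = [[0 if i == j else INF for j in range(num_procs)]
            for i in range(num_procs)]
    nxt = [[i if i == j else None for j in range(num_procs)]
           for i in range(num_procs)]
    for (p, q), w in links.items():
        dist[p][q] = dist[q][p] = w
        nxt[p][q], nxt[q][p] = q, p
    for k in range(num_procs):          # relax through each intermediate k
        for i in range(num_procs):
            for j in range(num_procs):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]   # route to j starts toward k
    return dist, nxt
```

dist[p][q] is the latency consulted when costing a candidate task-processor pair, and repeatedly following nxt[p][q] yields the route; after an allocation changes effective channel costs, the affected entries can be recomputed.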

Page 6:

• Given a directed, acyclic task DFG G(V, E), with task vertex set V connected by data-flow edge set E, and a processor network topology N(P, C), with processor node set P connected by channel link set C

• Find a processor assignment and schedule S: V(G) → P(N)

S minimizes the total parallel computation time of G.

• A* heuristic mapping reduces scheduling complexity from NP to P

Adaptive A* Parallel Processor Scheduler
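In symbols, the objective can be stated as a makespan minimization; the recurrence below is a standard formulation consistent with the slide, with exec (execution time of a task on its assigned processor) and lat (shortest-path latency between two processors, zero when they coincide) as notation introduced here:

```latex
S^{*} = \arg\min_{S:\,V(G)\to P(N)} \; \max_{v\in V} \mathrm{finish}(v),
\qquad
\mathrm{finish}(v) = \mathrm{exec}\bigl(v, S(v)\bigr)
  + \max_{(u,v)\in E} \Bigl[ \mathrm{finish}(u) + \mathrm{lat}\bigl(S(u), S(v)\bigr) \Bigr]
```

(The max over an empty predecessor set is taken as zero for input tasks.)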

Page 7:

Demand-Driven Task-Topology Mapping

• STEP 1 – assign a level to each task node vertex in G.
• STEP 2 – count critical paths passing through each DFG edge and node with a 2-pass bottom-up and then top-down graph traversal.
• STEP 3 – initially load and prioritize all deepest-level task nodes that produce outputs onto the working task node list.
• STEP 4 – WHILE the working task node list is not empty, schedule the best processor for the top-priority task, and replace the task with its parent task nodes, inserted into the working task node priority list.

Page 8:

STEP 4 – WHILE the working task node list is not empty:
BEGIN
– STEP 4.1 – initialize if first time; otherwise update the inter-processor shortest-path latency/routing table pair affected by the last task-processor allocation.
– STEP 4.2 – assign a nearby capable processor to minimize thread computation time for the highest-priority task node at the top of the remaining prioritized working list.
– STEP 4.3 – remove the newly scheduled task node and replace it with its parent nodes, inserted/appended onto the working list (demand-driven) per priority, based on tie-breaker rules which, along with node level depth, estimate the time cost of the entire computation thread involved.

END{WHILE}

Demand-Driven Processor Scheduling
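Condensed into runnable form, the loop looks roughly like the sketch below; priority, parents, and best_processor are assumed placeholder hooks for STEPs 1-3's priority factors and STEP 4.2's processor choice, and the tie-breaker rules are simplified to a single key:

```python
import heapq

def demand_driven_schedule(outputs, priority, parents, best_processor):
    """outputs: deepest-level output tasks loaded first (STEP 3).
    priority(v) -> sortable key (smaller = more urgent);
    parents(v) -> v's producer tasks; best_processor(v) -> STEP 4.2 choice."""
    worklist = [(priority(v), v) for v in outputs]
    heapq.heapify(worklist)
    schedule = {}
    while worklist:                          # STEP 4
        _, v = heapq.heappop(worklist)       # highest-priority task
        if v in schedule:
            continue                         # already placed via another child
        # STEP 4.1 would update the shortest-path latency/routing tables here.
        schedule[v] = best_processor(v)      # STEP 4.2
        for u in parents(v):                 # STEP 4.3: demand-driven insert
            heapq.heappush(worklist, (priority(u), u))
    return schedule
```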

Page 9:

QUANTIFY SW/HW MAPPING QUALITY

• Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping

• Example 2 – Scaling to Larger Tree-to-Tree Mapping

• Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph

Page 10:

Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping

K-th Largest Selection: Will the tree algorithm [3] match the tree machine [4]?

Page 11:

Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping

Adaptive mapping moves toward sequential processing as the inter/intra communication latency ratio increases.

Page 12:

Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping

Adaptive Mapper allocates fewer processors and channels with fewer hops.

Page 13:

Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping

Adaptive Mapper achieves higher speedups consistently.

(Bonus! A 25.7+ pipeline processing speedup can be extrapolated when the inter/intra communication latency ratio is < 1.)

Page 14:

Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping

Adaptive Mapper consistently delivers better efficiency.

(Bonus! A 428.3+% pipeline processing efficiency can be extrapolated when the inter/intra communication latency ratio is < 1.)

Page 15:

Example 2 – Scaling to Larger Tree-to-Tree Mapping

Adaptive Mapper achieves sub-optimal speedups as tree sizes scale larger, still trailing the fixed tree-to-tree mapping closely.

Page 16:

Example 2 – Scaling to Larger Tree-to-Tree Mapping

Adaptive Mapper is always more cost-efficient, using fewer resources, with sub-optimal speedups comparable to fixed tree-to-tree mapping as tree sizes scale.

Page 17:

Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph

No matching-topology clues for the irregularly shaped Robot Elbow Manipulator task graph [5]:
• 105 task nodes
• 161 data-flow edges
• 29 node levels

Page 18:

Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph

• Candidate topologies
• Compare schedules for each topology
• Farther processors may not be selected
  – Linear Array
  – Tree

Page 19:

Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph

Best network topology performers (# channels):
• Complete (28)
• Mesh (12)
• Chordal ring (16)
• Systolic array (16)
• Cube (12)

Page 20:

Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph

Fewer processors selected for higher-diameter networks:
• Tree
• Linear Array

Page 21:

Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph

Deducing network switch hops:
• Low multi-hop data exchanges: < 10%
• Moderate 0-hop: 30% to 50%
• High near-neighbor direct 1-hop: 50% to 70%
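A small sketch (input shapes assumed, not taken from the paper) of how such hop percentages can be tallied from a finished schedule and the routing tables above:

```python
from collections import Counter

def hop_mix(edges, assign, hops):
    """edges: (producer, consumer) task pairs; assign: task -> processor;
    hops[p][q]: switch-hop count between processors p and q."""
    counts = Counter(hops[assign[u]][assign[v]] for u, v in edges)
    total = len(edges)
    return {h: 100.0 * c / total for h, c in sorted(counts.items())}
```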

Page 22:

Future Speed/Memory/Power Optimization

• Latency-adaptive
  – Topology
  – Traffic
  – Bandwidth
  – Workload
  – System hierarchy
• Thread partition
  – Coarse
  – Mid
  – Fine
• Latency/Routing tables
  – Neighborhood
  – Network hierarchy
  – Worm-hole
  – Dynamic mobile network routing
  – Bandwidth
  – Heterogeneous system
• Algorithm-specific network topology

Page 23:

References

Page 24:


Liwen Shih, Ph.D., Professor in Computer Engineering, University of Houston – Clear Lake

shih@uhcl.edu

Q & A?

Page 25:

xScale13 paper

Page 26:

Page 27:


Thank You!