Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications. Hyunchul Park (1), Yongjun Park (2), Scott Mahlke (2). December 12, 2009. (1) Texas Instruments Inc. (2) University of Michigan, Ann Arbor. University of Michigan, Electrical Engineering and Computer Science.


Page 1: Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications

Hyunchul Park (1), Yongjun Park (2), Scott Mahlke (2)

December 12, 2009

(1) Texas Instruments Inc.

(2) University of Michigan, Ann Arbor

University of Michigan, Electrical Engineering and Computer Science

Page 2: Introduction

• Multimedia applications have high performance, cost, energy demands
  – High-quality video
  – Flash animation
• Clear need for application and domain-specific hardware

[Figure: MPEG-4 decoder on ARM9, ARM11, TI C6x, and Core2Duo. Axes: frames/sec (0 to 40, with a "24 fps min." line) and cell-phone battery life in hours (0 to 4.5). Series: energy, performance.]

Page 3: Convergence of Functionalities

[Figure: "Anatomy of iPhone", labeled with HD TV decoder, Video Recording, Video Editing, 3D Rendering, 4G Wireless, and Advanced Image Processing]

Convergence of functionalities demands a flexible solution

Applications have different characteristics

Page 4: ASIC Alternatives

[Figure: design space with axes "Flexibility" and "Efficiency, Performance", placing General Purpose Processors, DSPs, and ASICs; further labels: "Domain specific", "Efficiency", "Somewhat programmable"]

What's the right way to support multimedia applications?

Page 5: Coarse-Grained Reconfigurable Architecture (CGRA)

• Array of PEs connected in a mesh-like interconnect
• High throughput, low cost/power with distributed hardware
• High flexibility with dynamic reconfiguration
• Morphosys, SiliconHive, ADRES

Page 6: Execution Model of CGRAs

[Figure: a for ( …… ) { } loop and a time axis with Host and CGRA rows]

• Modulo scheduling exploits loop-level parallelism
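
To make the modulo scheduling bullet concrete, here is a minimal Python sketch. It is not taken from the slides. It shows the standard resource-constrained lower bound on the initiation interval (II) and how successive loop iterations overlap once an II is chosen. The function names and the numbers (10 operations, 4 PEs, a 7-cycle schedule) are assumptions for illustration, and the sketch assumes every operation occupies one PE for one cycle.

```python
import math

def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource-constrained lower bound on the initiation interval (II).

    Assumes every operation occupies one PE for exactly one cycle.
    """
    return math.ceil(num_ops / num_pes)

def iteration_start_cycles(ii: int, iterations: int) -> list[int]:
    """Under modulo scheduling, iteration i starts at cycle i * II."""
    return [i * ii for i in range(iterations)]

def total_cycles(ii: int, schedule_length: int, iterations: int) -> int:
    """Cycles to finish all iterations when a new one starts every II cycles."""
    return (iterations - 1) * ii + schedule_length

# Hypothetical loop body: 10 operations on a 4-PE array.
ii = res_mii(num_ops=10, num_pes=4)
starts = iteration_start_cycles(ii, 4)
overlapped = total_cycles(ii, schedule_length=7, iterations=100)
sequential = 7 * 100  # same loop with no overlap between iterations
```

In practice the achievable II is also bounded below by loop-carried dependences (the recurrence-constrained bound), which this sketch leaves out.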

Page 7: Large Scale CGRA

• Need for higher performance
  – Higher resolution/more detail video
  – Multiple concurrent applications support
• Advancing technology makes more resources available

[Figure: array mappings labeled Loop 0 to Loop 3 and Task 0 to Task 4]

Page 8: Streaming Execution Model

• Streaming property
  – Packet of data goes through independent tasks
• Partition tasks into stages
  – Map each stage onto different hardware
• Pipeline parallelism
  – Pipeline the outermost loop
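
As a rough illustration of the pipeline parallelism described here, the following Python sketch simulates packets flowing through stages that each run on their own hardware. It is an assumption-laden toy, not the authors' model: the function name and the per-stage latencies are hypothetical, each stage handles one packet at a time, and there is no buffering cost between stages.

```python
def pipeline_finish_times(stage_latencies, num_packets):
    """Return the cycle at which each packet leaves the last stage.

    Each stage handles one packet at a time, and a packet enters a stage
    only after it has left the previous stage.
    """
    stage_free = [0] * len(stage_latencies)  # cycle at which each stage is next idle
    finish = []
    for _ in range(num_packets):
        t = 0  # every packet is available at cycle 0
        for s, latency in enumerate(stage_latencies):
            start = max(t, stage_free[s])
            t = start + latency
            stage_free[s] = t
        finish.append(t)
    return finish

latencies = [2, 5, 3]  # hypothetical cycles per packet for three stages
finish = pipeline_finish_times(latencies, 4)
steady_interval = finish[-1] - finish[-2]
```

Once the pipeline fills, a packet completes every `max(latencies)` cycles, so the slowest stage sets the throughput.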

Page 9: Insights

• Multimedia applications are rich in both ILP and pipeline parallelism
  – Not mutually exclusive; they cooperatively enhance performance
• Resource requirement varies
  – Statically / dynamically
• Need a flexible execution model
  – Exploiting both types of parallelism
  – Resource allocation based on computation requirement
  – Dynamically adapt to computation variance

Page 10: Polymorphic Pipeline Array

• Multi-core accelerator: each 2x2 array becomes a processor
• Cores can be combined to form a larger logical core
• Exploit both coarse-grain and fine-grain pipeline parallelism
• No dynamic routing logic: all communications statically generated

[Figure: two rows of four cores, with groups of cores marked as logical cores]

Page 11: Execution Model

• Pipeline outermost loop

[Figure: stages ST 0, ST 1, ST 2, ST 3]

Page 12: Execution Model

• Pipeline outermost loop
• Compute intensive stage
  – Assign more resources
  – Modulo scheduling

[Figure: stages ST 0, ST 1, ST 2, ST 3]

Page 13: Execution Model

• Pipeline outermost loop
• Compute intensive stage
  – Assign more resources
  – Modulo scheduling

[Figure: stages ST 0, ST 1, ST 2, ST 3, with ST 3 drawn apart from ST 0 to ST 2]
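
One way to picture "assign more resources" to a compute-intensive stage is the greedy Python sketch below. It is not the authors' allocation policy. It assumes a stage's time per packet is its work divided by its core count (ideal speedup, which real modulo schedules only approximate), and the function name and workload numbers are hypothetical.

```python
def assign_cores(stage_work, total_cores):
    """Greedy core assignment: every stage gets one core, then each spare
    core goes to whichever stage is currently the slowest.

    Assumes a stage's time is work / cores (ideal speedup).
    """
    if total_cores < len(stage_work):
        raise ValueError("need at least one core per stage")
    cores = [1] * len(stage_work)
    for _ in range(total_cores - len(stage_work)):
        slowest = max(range(len(stage_work)),
                      key=lambda s: stage_work[s] / cores[s])
        cores[slowest] += 1
    return cores

work = [4, 16, 6, 4]  # hypothetical cycles per packet for ST 0 to ST 3
cores = assign_cores(work, 8)
bottleneck = max(w / c for w, c in zip(work, cores))
```

With these numbers the heaviest stage ends up with four of the eight cores, and the pipeline bottleneck drops from 16 cycles to 4.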

Page 14: Partitioning of PPA

• Static partitioning
  – Schedules can be optimized
  – Computation variance leads to low utilization
• Dynamic partitioning
  – Adjust core assignment at run-time
  – Adapt to computation variance, but some overhead
• How to support dynamic partitioning
  – Multiple schedules: code bloat
  – Unified schedule targeting multiple sub-arrays (virtualization)

Page 15: Virtualized Modulo Scheduling

• One binary that can run on multiple targets
  – Part of the code migrates to a neighboring core
  – No rescheduling
• Challenges
  – Avoid resource conflict
  – Enforce multiple modulo constraints
  – Inter-core communication

[Figure: operations A and B placed in modulo schedules of length II; labels 0 and 1]

Page 16: Multi-level Modulo Constraints

[Figure: operations 0 to 13 over time in columns F0 to F3 on Core 0, with the corresponding modulo schedule; II = 4]

Page 17: Multi-level Modulo Constraints

[Figure: operations 0 to 13 over time in columns F0 to F3 on Core 0; II = 4]

Page 18: Multi-level Modulo Constraints

[Figure: operations 0 to 13 over time in columns F0 to F3 on Core 0, and a second time grid (0 to 11, F0 to F3) for Core 1; II = 4]

Page 19: Multi-level Modulo Constraints

[Figure: operations 0 to 13 over time in columns F0 to F3 on Core 0, and a second time grid (0 to 11, F0 to F3) for Core 1; labels II = 4 and II = 2]
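
The figures on these slides survive only as labels (II = 4 on one core, II = 2 once a second core is used), so the following Python sketch is one plausible reading rather than the authors' algorithm. It assumes a schedule built for one core at II = 4 must also be conflict-free when its operations are split across two cores that each run at II = 2. The tuple layout, the function names, and the example operations are all hypothetical.

```python
from collections import Counter

def conflicts(ops, ii, use_core):
    """Return resource slots claimed by more than one operation.

    ops: list of (name, time, fu, core) tuples.
    A slot is (core, fu, time mod ii); the core is ignored (treated as 0)
    when everything runs on a single core.
    """
    slots = Counter(
        (core if use_core else 0, fu, time % ii)
        for _name, time, fu, core in ops
    )
    return sorted(slot for slot, n in slots.items() if n > 1)

def satisfies_multilevel(ops, base_ii, num_cores):
    """Check the one-core constraint (II = base_ii) and the expanded
    constraint (II = base_ii // num_cores on each core)."""
    single = conflicts(ops, base_ii, use_core=False)
    expanded = conflicts(ops, base_ii // num_cores, use_core=True)
    return not single and not expanded

# (name, time, function unit, core used after expansion)
good = [("a", 0, "F0", 0), ("b", 1, "F0", 0), ("c", 2, "F0", 1), ("d", 3, "F0", 1)]
bad = [("a", 0, "F0", 0), ("b", 2, "F0", 0), ("c", 1, "F0", 1), ("d", 3, "F0", 1)]

good_ok = satisfies_multilevel(good, base_ii=4, num_cores=2)
bad_ok = satisfies_multilevel(bad, base_ii=4, num_cores=2)
bad_slots = conflicts(bad, 2, use_core=True)
```

The second operation list is legal at II = 4 on one core but collides at II = 2 after expansion, which is the kind of conflict a multi-level constraint is meant to rule out at scheduling time.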

Page 20: Inter-core Communication

[Figure: operations 0 to 13 over time in columns F0 to F3 on Core 0, and a second time grid (0 to 11, F0 to F3) for Core 1; labels: II = 2, "Direct RF connection"]

Page 21: VMS Summary

• Edge-centric Modulo Scheduling [PACT’08] with virtualization support
• Generate a unified schedule
  – Schedule for the smallest array, then expanded
• Multi-level modulo constraints enforced
  – Avoid resource conflict when expanded
  – Apply to computation/routing/registers
• Register transfer operations for inter-core communications
  – Enabled only when expanded

Page 22: Evaluation of PPA

• Exploiting both types of parallelism in AAC
• Dynamic partitioning overhead
  – 13% overhead for single-core schedule, runtime overhead

[Figure: bar chart, vertical axis 0 to 80; bars for CGRA (4 cores) and for static and dyn partitioning on 4, 5, 6, 7, and 8 cores]

Page 23: Where PPA stands

[Figure: MPEG-4 decoder on ARM9, ARM11, TI C6x, PPA, and Core2Duo. Axes: frames/sec (0 to 40, with a "24 fps min." line) and cell-phone battery life in hours (0 to 4.5). Series: energy, performance.]

Page 24: Questions?