
University of Michigan, Electrical Engineering and Computer Science

Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications

Hyunchul Park (Texas Instruments Inc.), Yongjun Park (University of Michigan, Ann Arbor), Scott Mahlke (University of Michigan, Ann Arbor)

December 12, 2009

Introduction

• Multimedia applications have high performance, cost, and energy demands
  – High-quality video
  – Flash animation
• Clear need for application- and domain-specific hardware

[Figure: MPEG-4 decoder performance (frames/sec, 24 fps minimum) vs. cell-phone battery life (hours) for ARM9, ARM11, TI C6x, and Core2Duo, illustrating the energy/performance trade-off]

Convergence of Functionalities

[Figure: anatomy of the iPhone — HD TV decoder, video recording, video editing, 3D rendering, 4G wireless, advanced image processing]

• Convergence of functionalities demands a flexible solution
• Applications have different characteristics

ASIC Alternatives

[Figure: flexibility vs. efficiency/performance trade-off — general-purpose processors (most flexible), DSPs (somewhat programmable), domain-specific accelerators (efficient), ASICs (most efficient, least flexible)]

What's the right way to support multimedia applications?

Coarse-Grained Reconfigurable Architecture (CGRA)

• Array of PEs connected in a mesh-like interconnect
• High throughput, low cost/power with distributed hardware
• High flexibility with dynamic reconfiguration
• Examples: Morphosys, SiliconHive, ADRES

Execution Model of CGRAs

[Figure: over time, the host processor runs sequential code and offloads a "for ( … ) { … }" loop to the CGRA]

• Modulo scheduling exploits loop-level parallelism
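The core resource rule behind modulo scheduling can be sketched as follows. This is a minimal illustration, not the scheduler from the talk: with initiation interval II, a new loop iteration starts every II cycles, so two operations placed on the same function unit (FU) conflict whenever their start times are congruent mod II. The op names, FU indices, and times below are made up for the example.

```python
def modulo_conflicts(schedule, ii):
    """schedule: list of (op, fu, time). Returns pairs of ops that
    would occupy the same FU in the same modulo slot."""
    slots = {}
    clashes = []
    for op, fu, t in schedule:
        key = (fu, t % ii)          # FU occupancy repeats every II cycles
        if key in slots:
            clashes.append((slots[key], op))
        else:
            slots[key] = op
    return clashes

# Four ops on two FUs: valid at II = 4, but "b" (t=0) and "d" (t=2)
# collide on FU 1 at II = 2 because 0 ≡ 2 (mod 2).
sched = [("a", 0, 0), ("b", 1, 0), ("c", 0, 1), ("d", 1, 2)]
print(modulo_conflicts(sched, 2))   # -> [('b', 'd')]
print(modulo_conflicts(sched, 4))   # -> []
```

A lower II means iterations start more often, so the modulo resource constraint is exactly what limits how small II can be made.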

Large Scale CGRA

• Need for higher performance
  – Higher-resolution / more detailed video
  – Support for multiple concurrent applications
• Advancing process technology makes more resources available

[Figure: a large array can run one loop across all resources (Loop 0), run different loops (Loop 0–3) on sub-arrays, or run concurrent tasks (Task 0–4) on sub-arrays]

Streaming Execution Model

• Streaming property
  – A packet of data goes through independent tasks
• Partition tasks into stages
  – Map each stage onto different hardware
• Pipeline parallelism
  – Pipeline the outermost loop
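The stage pipeline above can be sketched in software. This is an illustrative sketch only: the three stage functions and their names are assumptions, not from the talk. Each stage runs in its own thread and hands packets to the next stage through a queue, so stage i works on packet n while stage i+1 works on packet n-1 — the pipelined outermost loop.

```python
import queue
import threading

def make_stage(work, inbox, outbox):
    """Wrap a stage function in a thread that pulls from inbox
    and pushes to outbox; a None packet shuts the pipeline down."""
    def run():
        while True:
            pkt = inbox.get()
            if pkt is None:          # sentinel: propagate shutdown downstream
                outbox.put(None)
                return
            outbox.put(work(pkt))
    return threading.Thread(target=run)

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
stages = [make_stage(lambda p: p + 1, q0, q1),      # e.g. parse
          make_stage(lambda p: p * 2, q1, q2),      # e.g. decode
          make_stage(lambda p: p - 3, q2, q3)]      # e.g. render
for s in stages:
    s.start()

for pkt in range(5):                 # stream five packets through
    q0.put(pkt)
q0.put(None)

results = []
while (out := q3.get()) is not None:
    results.append(out)
print(results)   # -> [-1, 1, 3, 5, 7]
```

Because each queue is FIFO and each stage is a single thread, packet order is preserved even though the stages overlap in time.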

Insights

• Multimedia applications are rich in both ILP and pipeline parallelism
  – The two are not mutually exclusive; they cooperatively enhance performance
• Resource requirements vary
  – Both statically and dynamically
• Need a flexible execution model
  – Exploit both types of parallelism
  – Allocate resources based on computation requirements
  – Dynamically adapt to computation variance

Polymorphic Pipeline Array

• Multicore accelerator: each 2x2 array of PEs becomes a core
• Cores can be combined to form a larger logical core
• Exploits both coarse-grained and fine-grained pipeline parallelism
• No dynamic routing logic: all communication is statically generated

[Figure: eight cores, grouped into logical cores of varying sizes]

Execution Model

• Pipeline the outermost loop

[Figure: pipeline stages ST 0–ST 3, each mapped to its own core]

Execution Model

• Pipeline the outermost loop
• Compute-intensive stage
  – Assign more resources
  – Modulo scheduling

[Figure (animation): stages ST 0–ST 3 mapped to cores; a compute-intensive stage is given multiple cores and modulo scheduled across them]
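One way to picture "assign more resources to a compute-intensive stage" is a greedy allocator. The policy below is a hypothetical sketch — the talk does not specify an allocation algorithm — that gives every stage one core, then hands spare cores to whichever stage has the highest remaining load per core.

```python
def assign_cores(stage_loads, n_cores):
    """stage_loads: relative computation per pipeline stage.
    Returns cores per stage, summing to n_cores (assumes
    n_cores >= number of stages)."""
    alloc = [1] * len(stage_loads)        # every stage needs at least one core
    for _ in range(n_cores - len(stage_loads)):
        # next spare core goes to the stage with the highest load per core
        i = max(range(len(stage_loads)),
                key=lambda s: stage_loads[s] / alloc[s])
        alloc[i] += 1
    return alloc

# Four stages on an 8-core array; stage 1 dominates the computation:
print(assign_cores([10, 50, 20, 10], 8))   # -> [1, 4, 2, 1]
```

The pipeline's throughput is set by its slowest stage, so balancing load per core is the natural objective for this kind of split.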

Partitioning of PPA

• Static partitioning
  – Schedules can be optimized
  – Computation variance leads to low utilization
• Dynamic partitioning
  – Adjust core assignment at run time
  – Adapts to computation variance, at some overhead
• How to support dynamic partitioning?
  – Multiple schedules: code bloat
  – A unified schedule targeting multiple sub-arrays (virtualization)

Virtualized Modulo Scheduling

• One binary that can run on multiple targets
  – Part of the code migrates to a neighboring core
  – No rescheduling
• Challenges
  – Avoid resource conflicts
  – Enforce multiple modulo constraints
  – Inter-core communication

[Figure: operations A and B scheduled together on one core at a larger II, and split across cores 0 and 1 at a smaller II, from the same binary]
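The "multiple modulo constraints" challenge can be sketched as a schedule checker. The II values and the two-way split below are assumptions for illustration: a unified schedule must stay conflict-free both when all operations run on one core (larger II) and when some migrate to a neighboring core (smaller II), since FU occupancy repeats every II cycles in each configuration.

```python
def conflict_free(ops, ii):
    """ops: list of (fu, time). No two ops may share (fu, time mod ii)."""
    slots = {(fu, t % ii) for fu, t in ops}
    return len(slots) == len(ops)

def satisfies_multilevel(ops, owner, ii_small, ii_large):
    """ops: (fu, time) pairs; owner: core each op migrates to
    when the schedule is expanded onto two cores."""
    # Single-core configuration: all ops on one core's FUs, II = ii_large.
    ok_merged = conflict_free(ops, ii_large)
    # Expanded configuration: ops move to their owning core, II = ii_small;
    # FUs are checked per core because each core has its own FUs.
    per_core = {}
    for (fu, t), core in zip(ops, owner):
        per_core.setdefault(core, []).append((fu, t))
    ok_split = all(conflict_free(o, ii_small) for o in per_core.values())
    return ok_merged and ok_split

ops   = [(0, 0), (1, 1), (0, 2), (1, 3)]   # (fu, start time)
owner = [0, 0, 1, 1]                        # core each op migrates to
print(satisfies_multilevel(ops, owner, ii_small=2, ii_large=4))   # -> True
```

Placements that pass only one of the two checks are exactly the ones a virtualized modulo scheduler must reject.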

Multi-level Modulo Constraints

[Figure (animation): a unified modulo schedule of operations 0–13 on FUs F0–F3 of Core 0 at II = 4; when the schedule is expanded onto Core 0 and Core 1, the same operations run at II = 2, with placements chosen so that no FU slot conflicts arise at either II]

Inter-core Communication

[Figure: the expanded II = 2 schedule on Core 0 (FUs F0–F3) and Core 1 (FUs F0–F3); values cross between the cores through a direct RF connection]
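A hypothetical sketch of where inter-core transfers are needed (the op names and the helper below are illustrative assumptions): in the unified schedule, any edge whose producer and consumer land on different cores after expansion needs an explicit transfer over the direct RF connection, while on a single core no transfer is enabled.

```python
def transfers_needed(edges, owner, expanded):
    """edges: (producer, consumer) op pairs; owner: op -> core id.
    Returns the edges that must become register transfers."""
    if not expanded:               # single-core run: transfers not enabled
        return []
    return [(p, c) for p, c in edges if owner[p] != owner[c]]

owner = {"a": 0, "b": 0, "c": 1, "d": 1}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(transfers_needed(edges, owner, expanded=True))    # -> [('b', 'c')]
print(transfers_needed(edges, owner, expanded=False))   # -> []
```

This matches the summary on the next slide: register-transfer operations exist in the one binary but take effect only when the schedule is expanded.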

VMS Summary

• Edge-centric modulo scheduling [PACT'08] with virtualization support
• Generates a unified schedule
  – Schedule for the smallest array, then expand
• Multi-level modulo constraints enforced
  – Avoid resource conflicts when expanded
  – Applied to computation, routing, and registers
• Register-transfer operations for inter-core communication
  – Enabled only when expanded

Evaluation of PPA

• Exploits both types of parallelism in AAC
• Dynamic partitioning overhead
  – 13% overhead over a single-core schedule, plus runtime overhead

[Figure: AAC performance for a monolithic CGRA vs. static and dynamic partitioning on 4–8 cores]

Where PPA stands

[Figure: MPEG-4 decoder performance (frames/sec, 24 fps minimum) vs. cell-phone battery life (hours) for ARM9, ARM11, TI C6x, PPA, and Core2Duo, showing where PPA falls on the energy/performance trade-off]


Questions?
