
University of Michigan, Electrical Engineering and Computer Science

Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications

Hyunchul Park (Texas Instruments Inc.), Yongjun Park (University of Michigan, Ann Arbor), Scott Mahlke (University of Michigan, Ann Arbor)

December 12, 2009

Introduction

• Multimedia applications have high performance, cost, and energy demands
  – High-quality video
  – Flash animation
• Clear need for application- and domain-specific hardware

[Figure: MPEG-4 decoder performance (frames/sec, 24 fps minimum) vs. cell-phone battery life (hours) for ARM9, ARM11, TI C6x, and Core2Duo, illustrating the energy/performance trade-off]

Convergence of Functionalities

[Figure: anatomy of the iPhone — HD TV decoder, video recording, video editing, 3D rendering, 4G wireless, advanced image processing]

• Convergence of functionalities demands a flexible solution
• Applications have different characteristics

ASIC Alternatives

[Figure: flexibility vs. efficiency/performance trade-off — general-purpose processors (most flexible), DSPs (somewhat programmable), domain-specific accelerators (efficient), ASICs (most efficient, least flexible)]

What's the right way to support multimedia applications?

Coarse-Grained Reconfigurable Architecture (CGRA)

• Array of PEs connected in a mesh-like interconnect
• High throughput, low cost/power with distributed hardware
• High flexibility with dynamic reconfiguration
• Examples: Morphosys, SiliconHive, ADRES

Execution Model of CGRAs

[Figure: over time, the host processor runs sequential code and offloads a "for ( … ) { … }" loop to the CGRA]

• Modulo scheduling exploits loop-level parallelism
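The core resource rule behind modulo scheduling can be sketched as follows. This is a minimal illustration, not the scheduler from the talk: with initiation interval II, a new loop iteration starts every II cycles, so two operations placed on the same function unit (FU) conflict whenever their start times are congruent mod II. The op names, FU indices, and times below are made up for the example.

```python
def modulo_conflicts(schedule, ii):
    """schedule: list of (op, fu, time). Returns pairs of ops that
    would occupy the same FU in the same modulo slot."""
    slots = {}
    clashes = []
    for op, fu, t in schedule:
        key = (fu, t % ii)          # FU occupancy repeats every II cycles
        if key in slots:
            clashes.append((slots[key], op))
        else:
            slots[key] = op
    return clashes

# Four ops on two FUs: valid at II = 4, but "b" (t=0) and "d" (t=2)
# collide on FU 1 at II = 2 because 0 ≡ 2 (mod 2).
sched = [("a", 0, 0), ("b", 1, 0), ("c", 0, 1), ("d", 1, 2)]
print(modulo_conflicts(sched, 2))   # -> [('b', 'd')]
print(modulo_conflicts(sched, 4))   # -> []
```

A lower II means iterations start more often, so the modulo resource constraint is exactly what limits how small II can be made.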

Large Scale CGRA

• Need for higher performance
  – Higher-resolution / more detailed video
  – Support for multiple concurrent applications
• Advancing process technology makes more resources available

[Figure: a large array can run one loop across all resources (Loop 0), run different loops (Loop 0–3) on sub-arrays, or run concurrent tasks (Task 0–4) on sub-arrays]

Streaming Execution Model

• Streaming property
  – A packet of data goes through independent tasks
• Partition tasks into stages
  – Map each stage onto different hardware
• Pipeline parallelism
  – Pipeline the outermost loop
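The stage pipeline above can be sketched in software. This is an illustrative sketch only: the three stage functions and their names are assumptions, not from the talk. Each stage runs in its own thread and hands packets to the next stage through a queue, so stage i works on packet n while stage i+1 works on packet n-1 — the pipelined outermost loop.

```python
import queue
import threading

def make_stage(work, inbox, outbox):
    """Wrap a stage function in a thread that pulls from inbox
    and pushes to outbox; a None packet shuts the pipeline down."""
    def run():
        while True:
            pkt = inbox.get()
            if pkt is None:          # sentinel: propagate shutdown downstream
                outbox.put(None)
                return
            outbox.put(work(pkt))
    return threading.Thread(target=run)

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
stages = [make_stage(lambda p: p + 1, q0, q1),      # e.g. parse
          make_stage(lambda p: p * 2, q1, q2),      # e.g. decode
          make_stage(lambda p: p - 3, q2, q3)]      # e.g. render
for s in stages:
    s.start()

for pkt in range(5):                 # stream five packets through
    q0.put(pkt)
q0.put(None)

results = []
while (out := q3.get()) is not None:
    results.append(out)
print(results)   # -> [-1, 1, 3, 5, 7]
```

Because each queue is FIFO and each stage is a single thread, packet order is preserved even though the stages overlap in time.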

Insights

• Multimedia applications are rich in both ILP and pipeline parallelism
  – The two are not mutually exclusive; they cooperatively enhance performance
• Resource requirements vary
  – Both statically and dynamically
• Need a flexible execution model
  – Exploit both types of parallelism
  – Allocate resources based on computation requirements
  – Dynamically adapt to computation variance

Polymorphic Pipeline Array

• Multicore accelerator: each 2x2 array of PEs becomes a core
• Cores can be combined to form a larger logical core
• Exploits both coarse-grained and fine-grained pipeline parallelism
• No dynamic routing logic: all communication is statically generated

[Figure: eight cores, grouped into logical cores of varying sizes]

Execution Model

• Pipeline the outermost loop

[Figure: pipeline stages ST 0–ST 3, each mapped to its own core]

Execution Model

• Pipeline the outermost loop
• Compute-intensive stage
  – Assign more resources
  – Modulo scheduling

[Figure (animation): stages ST 0–ST 3 mapped to cores; a compute-intensive stage is given multiple cores and modulo scheduled across them]
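One way to picture "assign more resources to a compute-intensive stage" is a greedy allocator. The policy below is a hypothetical sketch — the talk does not specify an allocation algorithm — that gives every stage one core, then hands spare cores to whichever stage has the highest remaining load per core.

```python
def assign_cores(stage_loads, n_cores):
    """stage_loads: relative computation per pipeline stage.
    Returns cores per stage, summing to n_cores (assumes
    n_cores >= number of stages)."""
    alloc = [1] * len(stage_loads)        # every stage needs at least one core
    for _ in range(n_cores - len(stage_loads)):
        # next spare core goes to the stage with the highest load per core
        i = max(range(len(stage_loads)),
                key=lambda s: stage_loads[s] / alloc[s])
        alloc[i] += 1
    return alloc

# Four stages on an 8-core array; stage 1 dominates the computation:
print(assign_cores([10, 50, 20, 10], 8))   # -> [1, 4, 2, 1]
```

The pipeline's throughput is set by its slowest stage, so balancing load per core is the natural objective for this kind of split.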

Partitioning of PPA

• Static partitioning
  – Schedules can be optimized
  – Computation variance leads to low utilization
• Dynamic partitioning
  – Adjust core assignment at run time
  – Adapts to computation variance, at some overhead
• How to support dynamic partitioning?
  – Multiple schedules: code bloat
  – A unified schedule targeting multiple sub-arrays (virtualization)

Virtualized Modulo Scheduling

• One binary that can run on multiple targets
  – Part of the code migrates to a neighboring core
  – No rescheduling
• Challenges
  – Avoid resource conflicts
  – Enforce multiple modulo constraints
  – Inter-core communication

[Figure: operations A and B scheduled together on one core at a larger II, and split across cores 0 and 1 at a smaller II, from the same binary]
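The "multiple modulo constraints" challenge can be sketched as a schedule checker. The II values and the two-way split below are assumptions for illustration: a unified schedule must stay conflict-free both when all operations run on one core (larger II) and when some migrate to a neighboring core (smaller II), since FU occupancy repeats every II cycles in each configuration.

```python
def conflict_free(ops, ii):
    """ops: list of (fu, time). No two ops may share (fu, time mod ii)."""
    slots = {(fu, t % ii) for fu, t in ops}
    return len(slots) == len(ops)

def satisfies_multilevel(ops, owner, ii_small, ii_large):
    """ops: (fu, time) pairs; owner: core each op migrates to
    when the schedule is expanded onto two cores."""
    # Single-core configuration: all ops on one core's FUs, II = ii_large.
    ok_merged = conflict_free(ops, ii_large)
    # Expanded configuration: ops move to their owning core, II = ii_small;
    # FUs are checked per core because each core has its own FUs.
    per_core = {}
    for (fu, t), core in zip(ops, owner):
        per_core.setdefault(core, []).append((fu, t))
    ok_split = all(conflict_free(o, ii_small) for o in per_core.values())
    return ok_merged and ok_split

ops   = [(0, 0), (1, 1), (0, 2), (1, 3)]   # (fu, start time)
owner = [0, 0, 1, 1]                        # core each op migrates to
print(satisfies_multilevel(ops, owner, ii_small=2, ii_large=4))   # -> True
```

Placements that pass only one of the two checks are exactly the ones a virtualized modulo scheduler must reject.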

Multi-level Modulo Constraints

[Figure (animation): a unified modulo schedule of operations 0–13 on FUs F0–F3 of Core 0 at II = 4; when the schedule is expanded onto Core 0 and Core 1, the same operations run at II = 2, with placements chosen so that no FU slot conflicts arise at either II]

Inter-core Communication

[Figure: the expanded II = 2 schedule on Core 0 (FUs F0–F3) and Core 1 (FUs F0–F3); values cross between the cores through a direct RF connection]
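A hypothetical sketch of where inter-core transfers are needed (the op names and the helper below are illustrative assumptions): in the unified schedule, any edge whose producer and consumer land on different cores after expansion needs an explicit transfer over the direct RF connection, while on a single core no transfer is enabled.

```python
def transfers_needed(edges, owner, expanded):
    """edges: (producer, consumer) op pairs; owner: op -> core id.
    Returns the edges that must become register transfers."""
    if not expanded:               # single-core run: transfers not enabled
        return []
    return [(p, c) for p, c in edges if owner[p] != owner[c]]

owner = {"a": 0, "b": 0, "c": 1, "d": 1}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(transfers_needed(edges, owner, expanded=True))    # -> [('b', 'c')]
print(transfers_needed(edges, owner, expanded=False))   # -> []
```

This matches the summary on the next slide: register-transfer operations exist in the one binary but take effect only when the schedule is expanded.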

VMS Summary

• Edge-centric modulo scheduling [PACT'08] with virtualization support
• Generates a unified schedule
  – Schedule for the smallest array, then expand
• Multi-level modulo constraints enforced
  – Avoid resource conflicts when expanded
  – Applied to computation, routing, and registers
• Register-transfer operations for inter-core communication
  – Enabled only when expanded

Evaluation of PPA

• Exploits both types of parallelism in AAC
• Dynamic partitioning overhead
  – 13% overhead over a single-core schedule, plus runtime overhead

[Figure: AAC performance for a monolithic CGRA vs. static and dynamic partitioning on 4–8 cores]

Where PPA stands

[Figure: MPEG-4 decoder performance (frames/sec, 24 fps minimum) vs. cell-phone battery life (hours) for ARM9, ARM11, TI C6x, PPA, and Core2Duo, showing where PPA falls on the energy/performance trade-off]


Questions?
