high performance mobile computing using flexible wide simd processors

22
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science High Performance Mobile Computing Using Flexible Wide SIMD Processors Scott Mahlke in collaboration with Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.) Advanced Computer Architecture Laboratory University of Michigan

Upload: hestia

Post on 12-Jan-2016

25 views

Category:

Documents


1 download

DESCRIPTION

High Performance Mobile Computing Using Flexible Wide SIMD Processors. Scott Mahlke in collaboration with Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.) Advanced Computer Architecture Laboratory University of Michigan. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

High Performance Mobile Computing Using Flexible Wide SIMD Processors

Scott Mahlke

in collaboration withMark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali

Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.)

Advanced Computer Architecture LaboratoryUniversity of Michigan

Page 2: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

The Old Mobile PhoneThe Modern Mobile Phone

2

Video Recording

Video Editing

Higher Data Rates

3D Rendering

Advanced Image Processing

• Future phones are becoming more complex

• Richer applications require both more performance and more flexibility

• Modern phones look like Franken-chips

Page 3: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

Power/Performance Requirements for Multiple Systems

3

Different applications have different power/performance characteristics!

We need to design keeping each application in mind!

(Not GPP but Domain Specific Processor)

1

10

100

1000

10000

0.1 1 10 100

Better

Pow

er Efficiency

1 Mops/mW

10 Mops/mW100 Mops/mW

1000 Mops/mW

SODA(65nm)

SODA (90nm)

TI C6X

Imagine

VIRAM Pentium M

IBM Cell

Pe

rfo

rma

nce

(G

op

s)

Power (Watts)

3G Wireless

4G Wireless

Mobile HDVideo

Page 4: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

4G Wireless Basics

• Three kernels make up the majority of the work► FFT – Extract Data from Signals► STBC – Combine Data into More Reliable Stream► LDPC – Error Correction on Data Stream

4

Ant A/D

FFTSymbol Timing / Guard Detect

Channel Estimator

MIMODecoder

(STBC)

Channel Decoder(LDPC)

Recovered Data

Ant D/A IFFTGuard Insertion

Channel Encoder(LDPC)

S/PTransmitted

Data

Modulator

Demodulator NTT DoCoMo 4G test setup

Page 5: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

High Definition Video (H.264) Basics

5

Inverse Transform

Standard Basis Patterns

Residual Data

5

3 2

Basis Coefficients

“Past” FramesCurrentFrame

Future Frames

Motion Compensation

Form Prediction

ReconstructedMacroblock

Current Uncoded Block

Dec

od

ed B

lock

Decoded Block

Prediction Based on Neighbors

Intra-Prediction

Modules

Entropy decoder

Inverse Transform

MotionCompensation

Intra Prediction

Deblocking Filter

Algorithms

Context-adaptive VLD

4x4 integer kernel

Power Profiling*

Half: 6-tap filterQuarter: 6-tap / bilinear filterEighth: 4-tap filter

Directional spatial prediction

In-loop adaptive filter

4%

29%

10%

34%

Misc Kernels 23%

4CIF@30fps

Page 6: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

Mobile Signal Processing Algorithm Characteristics

6

• Algorithms have different SIMD widths► From very large to very small

• Though SIMD width varies all algorithms can exploit it► Large percentage of work can be SIMDized

• Larger SIMD width tend to have less TLP

Problems with traditional SIMD• High register file power• Large data movement/alignment cost• Inconsistent lane utilization• SIMD implies single thread

Page 7: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

So, What’s the Right Solution?

• Alternatives► More processors, less lanes?► Configurable: Hardware can be SIMD or MIMD?► Franken chip?

• SIMD is the answer! It provides high performance and power efficiency► Low control cost► More area-efficient scaling► Single thread context► Simpler memory system design –cache coherence

more manageable

Page 8: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

A Closer Look at SIMD: Power Breakdown

• Register file power disproportionately high in a traditional SIMD architecture

SODA WCDMAMEM3%

REG37%

ALU+MULT11%

CONTROL38%

INTERCONNECT2%

SCALAR PIPE9%

MEM

REG

ALU+MULT

CONTROL

INTERCONNECT

SCALAR PIPE

Page 9: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

Register File Accesses

9

• Many of the register file access do not have to go back to the main register file

Lots of power wasted on unneeded register file access!

Page 10: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science 10

LDPC – Scaling Performance with SIMD Width

Increasing SIMD width reduces Memory andShuffle Operations thus reduces power

Extra hardware power consumptionoutweighs reduced operations

• SIMD loses effectiveness when lanes cannot be put to productive use• SIMD on distributed data (SIMdD)• Efficient data rearrangement critical to success of SIMD

Page 11: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

Data Alignment Issues

• H.264 Intra-prediction has 9 different prediction modes► Each prediction mode requires a specific permutation

11

B C D

A B C D B C D A B C D A B C D

A B C D

A A A A B B B B C C C C D D D D

A B C D

A B C D B C D C

E

E D E

F G

GF D E F

A

A B C D E F G H I J

A B C D G H I JG H I J A B C D

A B C D E F G

A

A B C D E F G H I J

A B C D E F G H I J

A B C D E F G H I J

K L

K L

A B B C C D DD D D DE F F G G

CDE EF FG GHI J JK KLI

CE F G H I J KI J K L D E FD

F F FEG G HH I D E G C D E F0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

16

164

4

16x16 luma MBprediction modes for each 4x4 block

a b c d

e f g h

i j k l

m n o p

I

J

K

L

A B C DX

Traditional SIMD machines take too long or cost too much to do this

Good news – small fixed number patterns per kernel

Intra-Prediction

Page 12: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

4G/H.264 Summary

• Lots of different sized parallelism► From 4 wide to 96 wide to 1024 wide SIMD

• Which means many different SIMD widths need to be supported

• TLP (disjoint SIMD) often available• Very short-lived values• Lots of potential for instruction fusings

(beyond pairwise)• Limited set of shuffle patterns required for

each kernel

Page 13: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

AnySP: Push SIMD

But, Increase the Inherent Flexibility and Efficiency

13

Page 14: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

CROSSBAR

Scalar Pipeline

Bank 0

Bank 1

Bank 2

Bank 15

L1ProgramMemory

Controller

Bank 3

Bank 4

16-bit 8-wide 16 entry

SIMD RF-0

16-bit 8-wide 16 entry

SIMD RF-1

16-bit 8-wide 16 entry

SIMD RF-2

16-bit 8-wide 16 entry

SIMD RF-7

Scalar Memory Buffer

Multi-SIMD Datapath

AGU Group 0 Pipeline

AGU Group 7 Pipeline

AGU Group 1 Pipeline

Multi-Bank Local Memory

DMA

To Inter-PE

Bus

8 Groups of 8-wide

SIMD(64 Total Lanes)

8-wide SIMD FFU-0

8-wide SIMDFFU-1

8-wide SIMDFFU-2

8-wide SIMD FFU-7

SwizzleNetwork

16-bit 8-wide 4-entry

Buffer-0

16-bit 8-wide 4-entry

Buffer-1

16-bit 8-wide 4-entry

Buffer-2

16-bit 8-wide 4-entry

Buffer-7

Multi-Output Adder Tree

AnySP PE

AnySP Architecture – High Level

14

8 Groups of 8-Wide Flexible Function

Units

128x128 16bit Swizzle Network

16 Banked Memory with SRAM-based Crossbar

Multiple Output Adder Tree

Temporary Buffer and Bypass Network

Datapath AGU and Scalar Pipeline

Page 15: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

Multi-Width SIMD Support

15

16-bit 8-wide 16 entry

SIMD RF-0

16-bit 8-wide 16 entry

SIMD RF-1

16-bit 8-wide 16 entry

SIMD RF-2

16-bit 8-wide 16 entry

SIMD RF-7

Multi-SIMD Datapath

AGU Group 0

AGU Group 7

AGU Group 1

8-wide SIMD FFU-0

8-wide SIMDFFU-1

8-wide SIMDFFU-2

8-wide SIMD FFU-7

SwizzleNetwork

16-bit 8-wide 4-entry

Buffer-0

16-bit 8-wide 4-entry

Buffer-1

16-bit 8-wide 4-entry

Buffer-2

16-bit 8-wide 4-entry

Buffer-7

Multi-Output Adder Tree

CROSSBAR

Bank 0

Bank 1

Bank 2

Bank 15

Bank 3

Bank 4

Multi-Bank Local Memory

16-bit 8-wide 16 entry

SIMD RF-3

8-wide SIMDFFU-3

16-bit 8-wide 4-entry

Buffer-3

AGU Group 2

AGU Group 3

Normal 64-Wide SIMD mode – all lanes share one AGU

16-bit 8-wide 16 entry

SIMD RF-0

16-bit 8-wide 16 entry

SIMD RF-1

16-bit 8-wide 16 entry

SIMD RF-2

16-bit 8-wide 16 entry

SIMD RF-7

Multi-SIMD Datapath

AGU Group 0

AGU Group 7

AGU Group 1

8-wide SIMD FFU-0

8-wide SIMDFFU-1

8-wide SIMDFFU-2

8-wide SIMD FFU-7

SwizzleNetwork

16-bit 8-wide 4-entry

Buffer-0

16-bit 8-wide 4-entry

Buffer-1

16-bit 8-wide 4-entry

Buffer-2

16-bit 8-wide 4-entry

Buffer-7

Multi-Output Adder Tree

CROSSBAR

Bank 0

Bank 1

Bank 2

Bank 15

Bank 3

Bank 4

Multi-Bank Local Memory

16-bit 8-wide 16 entry

SIMD RF-3

8-wide SIMDFFU-3

16-bit 8-wide 4-entry

Buffer-3

AGU Group 2

AGU Group 3

16-bit 8-wide 16 entry

SIMD RF-0

16-bit 8-wide 16 entry

SIMD RF-1

16-bit 8-wide 16 entry

SIMD RF-2

16-bit 8-wide 16 entry

SIMD RF-7

Multi-SIMD Datapath

AGU Group 0

AGU Group 7

AGU Group 1

8-wide SIMD FFU-0

8-wide SIMDFFU-1

8-wide SIMDFFU-2

8-wide SIMD FFU-7

SwizzleNetwork

16-bit 8-wide 4-entry

Buffer-0

16-bit 8-wide 4-entry

Buffer-1

16-bit 8-wide 4-entry

Buffer-2

16-bit 8-wide 4-entry

Buffer-7

Multi-Output Adder Tree

CROSSBAR

Bank 0

Bank 1

Bank 2

Bank 15

Bank 3

Bank 4

Multi-Bank Local Memory

16-bit 8-wide 16 entry

SIMD RF-3

8-wide SIMDFFU-3

16-bit 8-wide 4-entry

Buffer-3

AGU Group 2

AGU Group 3

Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets

Page 16: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

Using SIMD Lanes for Deeper Subgraphs

16

Flexible Functional Unit allows us to

1.Exploit Pipeline-parallelism by joining two lanes together2.Handle register bypass and the temporary buffer3.Join multiple pipelines to process deeper subgraphs4.Fuse Instruction Pairs

Page 17: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science 17

SRAM-based Crossbar

Controller

16 Bit Input

16 Bit Input

16 Bit Input

16 Bit Ouput

16 Bit Output

16 Bit Output

Cell(0,0)

Cell(31,0) Cell(31,31)

Cell(0,1) Cell(0,31)

Cell(1,0)

Multiple SRAM cells replace MUX of traditonal crossbar

Each cell stores configuration information

The controller selects the specific configuration based on the instruction parameter

Each cell can store up to 6 different configurations

Power reduced by 50% for 128x128 crossbar

Page 18: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

AnySP vs SIMD-based Architecture

• SIMD width doubled• But that only provides half the performance gain, other half due to flexibility

features

18

0.0

0.5

1.0

1.5

2.0

2.5Baseline 64-Wide Multi-SIMD Swizzle Network Flexible Functional Unit Buffer + Bypass

No

rma

lized

Sp

eed

up

FFT 1024ptRadix-2

FFT 1024ptRadix-4

STBC LDPC H.264Intra

Prediction

H.264Deblocking

Filter

H.264Inverse

Transform

H.264Motion

Compensation

Page 19: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

AnySP Energy-Delay vs SIMD-based Architecture

• Comparison based on 90nm synthesis results

• Flexibility increases utilization of datapath and hence its efficiency

19

0.0

0.2

0.4

0.6

0.8

1.0

No

rma

lize

d E

ne

rgy

-De

lay SIMD- based Architecture AnySP

FFT 1024ptRadix-2

FFT 1024ptRadix-4

STBC LDPC H.264Intra

Prediction

H.264Deblocking

Filter

H.264Inverse

Transform

H.264Motion

Compenstation

Page 20: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

PE

System

Total

Components Units

SIMD Data Mem (32KB)SIMD Register File (16x1024bit)SIMD ALUs, Multipliers, and SSN

SIMD Pipeline+Clock+Routing

Intra-processor InterconnectScalar/AGU Pipeline & Misc.

ARM (Cortex-M3)Global Scratchpad Memory (128KB)

Inter-processor Bus with DMA

90nm (1V @300MHz)

4444

44

111

AreaAreamm2

Area%

9.76 38.78%3.17 12.59%4.50 17.88%1.18 4.69%

0.94 3.73%1.22 4.85%

0.6 2.38%1.8 7.15%1.0 3.97%

25.17 100%

Est.65nm (0.9V @ 300MHz) 13.1445nm (0.8V @ 300MHz) 6.86

4G + H.264 DecoderPower

mWPower

%

102.88 7.24%299.00 21.05%448.51 31.58%233.60 16.45%

93.44 6.58%134.32 9.46%

2.5 <1%10 <1%

1.5 <1%

1347.03 100%

1091.09862.09

SIMD Buffer (128B)SIMD Adder Tree

44

0.82 3.25%0.18 <1%

84.09 5.92%10.43 <1%

AnySP Power Breakdown

• We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

20

Page 21: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

Conclusions

• Scaling traditional SIMD for mobile applications► Wide-SIMD hardware under-utilized► Large fraction of power on non-computation

• AnySP design► Can possibly meet the requirements of 100Mbps 4G

and HD video on the same platform @45nm

• Flexibility/Efficiency improvements► Increase SIMD utilization (FFUs, multiple short vectors)► Reduce register file power (bypass buffer)► More efficient data shuffling (SRAM-based crossbar)

21

Page 22: High Performance Mobile Computing Using Flexible Wide SIMD Processors

University of MichiganElectrical Engineering and Computer Science

University of MichiganElectrical Engineering and Computer Science

Questions

• For more information► http://cccp.eecs.umich.edu