high performance mobile computing using flexible wide simd processors

University of MichiganElectrical Engineering and Computer Science


High Performance Mobile Computing Using Flexible Wide SIMD Processors

Scott Mahlke

in collaboration withMark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali

Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.)

Advanced Computer Architecture LaboratoryUniversity of Michigan



The Old Mobile PhoneThe Modern Mobile Phone

2

Video Recording

Video Editing

Higher Data Rates

3D Rendering

Advanced Image Processing

• Future phones are becoming more complex

• Richer applications require both more performance and more flexibility

• Modern phones look like Franken-chips



Power/Performance Requirements for Multiple Systems

3

Different applications have different power/performance characteristics!

We need to design keeping each application in mind!

(Not GPP but Domain Specific Processor)

1

10

100

1000

10000

0.1 1 10 100

Better

Pow

er Efficiency

1 Mops/mW

10 Mops/mW100 Mops/mW

1000 Mops/mW

SODA(65nm)

SODA (90nm)

TI C6X

Imagine

VIRAM Pentium M

IBM Cell

Pe

rfo

rma

nce

(G

op

s)

Power (Watts)

3G Wireless

4G Wireless

Mobile HDVideo



4G Wireless Basics

• Three kernels make up the majority of the work► FFT – Extract Data from Signals► STBC – Combine Data into More Reliable Stream► LDPC – Error Correction on Data Stream

4

Ant A/D

FFTSymbol Timing / Guard Detect

Channel Estimator

MIMODecoder

(STBC)

Channel Decoder(LDPC)

Recovered Data

Ant D/A IFFTGuard Insertion

Channel Encoder(LDPC)

S/PTransmitted

Data

Modulator

Demodulator NTT DoCoMo 4G test setup



High Definition Video (H.264) Basics

5

Inverse Transform

Standard Basis Patterns

Residual Data

5

3 2

Basis Coefficients

“Past” FramesCurrentFrame

Future Frames

Motion Compensation

Form Prediction

ReconstructedMacroblock

Current Uncoded Block

Dec

od

ed B

lock

Decoded Block

Prediction Based on Neighbors

Intra-Prediction

Modules

Entropy decoder

Inverse Transform

MotionCompensation

Intra Prediction

Deblocking Filter

Algorithms

Context-adaptive VLD

4x4 integer kernel

Power Profiling*

Half: 6-tap filterQuarter: 6-tap / bilinear filterEighth: 4-tap filter

Directional spatial prediction

In-loop adaptive filter

4%

29%

10%

34%

Misc Kernels 23%

4CIF@30fps



Mobile Signal Processing Algorithm Characteristics

6

• Algorithms have different SIMD widths► From very large to very small

• Though SIMD width varies all algorithms can exploit it► Large percentage of work can be SIMDized

• Larger SIMD width tend to have less TLP

Problems with traditional SIMD• High register file power• Large data movement/alignment cost• Inconsistent lane utilization• SIMD implies single thread



So, What’s the Right Solution?

• Alternatives► More processors, less lanes?► Configurable: Hardware can be SIMD or MIMD?► Franken chip?

• SIMD is the answer! It provides high performance and power efficiency► Low control cost► More area-efficient scaling► Single thread context► Simpler memory system design –cache coherence

more manageable



A Closer Look at SIMD: Power Breakdown

• Register file power disproportionately high in a traditional SIMD architecture

SODA WCDMAMEM3%

REG37%

ALU+MULT11%

CONTROL38%

INTERCONNECT2%

SCALAR PIPE9%

MEM

REG

ALU+MULT

CONTROL

INTERCONNECT

SCALAR PIPE



Register File Accesses

9

• Many of the register file access do not have to go back to the main register file

Lots of power wasted on unneeded register file access!


University of MichiganElectrical Engineering and Computer Science 10

LDPC – Scaling Performance with SIMD Width

Increasing SIMD width reduces Memory andShuffle Operations thus reduces power

Extra hardware power consumptionoutweighs reduced operations

• SIMD loses effectiveness when lanes cannot be put to productive use• SIMD on distributed data (SIMdD)• Efficient data rearrangement critical to success of SIMD



Data Alignment Issues

• H.264 Intra-prediction has 9 different prediction modes► Each prediction mode requires a specific permutation

11

B C D

A B C D B C D A B C D A B C D

A B C D

A A A A B B B B C C C C D D D D

A B C D

A B C D B C D C

E

E D E

F G

GF D E F

A

A B C D E F G H I J

A B C D G H I JG H I J A B C D

A B C D E F G

A

A B C D E F G H I J

A B C D E F G H I J

A B C D E F G H I J

K L

K L

A B B C C D DD D D DE F F G G

CDE EF FG GHI J JK KLI

CE F G H I J KI J K L D E FD

F F FEG G HH I D E G C D E F0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

16

164

4

16x16 luma MBprediction modes for each 4x4 block

a b c d

e f g h

i j k l

m n o p

I

J

K

L

A B C DX

Traditional SIMD machines take too long or cost too much to do this

Good news – small fixed number patterns per kernel

Intra-Prediction



4G/H.264 Summary

• Lots of different sized parallelism► From 4 wide to 96 wide to 1024 wide SIMD

• Which means many different SIMD widths need to be supported

• TLP (disjoint SIMD) often available• Very short-lived values• Lots of potential for instruction fusings

(beyond pairwise)• Limited set of shuffle patterns required for

each kernel



AnySP: Push SIMD

But, Increase the Inherent Flexibility and Efficiency

13



CROSSBAR

Scalar Pipeline

Bank 0

Bank 1

Bank 2

Bank 15

L1ProgramMemory

Controller

Bank 3

Bank 4

16-bit 8-wide 16 entry

SIMD RF-0


SIMD RF-1


SIMD RF-2


SIMD RF-7

Scalar Memory Buffer

Multi-SIMD Datapath

AGU Group 0 Pipeline



Multi-Bank Local Memory

DMA

To Inter-PE

Bus

8 Groups of 8-wide

SIMD(64 Total Lanes)

8-wide SIMD FFU-0

8-wide SIMDFFU-1

8-wide SIMDFFU-2

8-wide SIMD FFU-7

SwizzleNetwork

16-bit 8-wide 4-entry

Buffer-0


Buffer-1


Buffer-2


Buffer-7

Multi-Output Adder Tree

AnySP PE

AnySP Architecture – High Level

14

8 Groups of 8-Wide Flexible Function

Units

128x128 16bit Swizzle Network

16 Banked Memory with SRAM-based Crossbar

Multiple Output Adder Tree

Temporary Buffer and Bypass Network

Datapath AGU and Scalar Pipeline



Multi-Width SIMD Support

15


SIMD RF-0


SIMD RF-1


SIMD RF-2


SIMD RF-7

Multi-SIMD Datapath

AGU Group 0

AGU Group 7

AGU Group 1

8-wide SIMD FFU-0

8-wide SIMDFFU-1

8-wide SIMDFFU-2

8-wide SIMD FFU-7

SwizzleNetwork


Buffer-0


Buffer-1


Buffer-2


Buffer-7


CROSSBAR

Bank 0

Bank 1

Bank 2

Bank 15

Bank 3

Bank 4



SIMD RF-3

8-wide SIMDFFU-3


Buffer-3

AGU Group 2

AGU Group 3

Normal 64-Wide SIMD mode – all lanes share one AGU


SIMD RF-0


SIMD RF-1


SIMD RF-2


SIMD RF-7

Multi-SIMD Datapath

AGU Group 0

AGU Group 7

AGU Group 1

8-wide SIMD FFU-0

8-wide SIMDFFU-1

8-wide SIMDFFU-2

8-wide SIMD FFU-7

SwizzleNetwork


Buffer-0


Buffer-1


Buffer-2


Buffer-7


CROSSBAR

Bank 0

Bank 1

Bank 2

Bank 15

Bank 3

Bank 4



SIMD RF-3

8-wide SIMDFFU-3


Buffer-3

AGU Group 2

AGU Group 3


SIMD RF-0


SIMD RF-1


SIMD RF-2


SIMD RF-7

Multi-SIMD Datapath

AGU Group 0

AGU Group 7

AGU Group 1

8-wide SIMD FFU-0

8-wide SIMDFFU-1

8-wide SIMDFFU-2

8-wide SIMD FFU-7

SwizzleNetwork


Buffer-0


Buffer-1


Buffer-2


Buffer-7


CROSSBAR

Bank 0

Bank 1

Bank 2

Bank 15

Bank 3

Bank 4



SIMD RF-3

8-wide SIMDFFU-3


Buffer-3

AGU Group 2

AGU Group 3

Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets



Using SIMD Lanes for Deeper Subgraphs

16

Flexible Functional Unit allows us to

1.Exploit Pipeline-parallelism by joining two lanes together2.Handle register bypass and the temporary buffer3.Join multiple pipelines to process deeper subgraphs4.Fuse Instruction Pairs


University of MichiganElectrical Engineering and Computer Science 17

SRAM-based Crossbar

Controller

16 Bit Input

16 Bit Input

16 Bit Input

16 Bit Ouput

16 Bit Output

16 Bit Output

Cell(0,0)

Cell(31,0) Cell(31,31)

Cell(0,1) Cell(0,31)

Cell(1,0)

Multiple SRAM cells replace MUX of traditonal crossbar

Each cell stores configuration information

The controller selects the specific configuration based on the instruction parameter

Each cell can store up to 6 different configurations

Power reduced by 50% for 128x128 crossbar



AnySP vs SIMD-based Architecture

• SIMD width doubled• But that only provides half the performance gain, other half due to flexibility

features

18

0.0

0.5

1.0

1.5

2.0

2.5Baseline 64-Wide Multi-SIMD Swizzle Network Flexible Functional Unit Buffer + Bypass

No

rma

lized

Sp

eed

up

FFT 1024ptRadix-2

FFT 1024ptRadix-4

STBC LDPC H.264Intra

Prediction

H.264Deblocking

Filter

H.264Inverse

Transform

H.264Motion

Compensation



AnySP Energy-Delay vs SIMD-based Architecture

• Comparison based on 90nm synthesis results

• Flexibility increases utilization of datapath and hence its efficiency

19

0.0

0.2

0.4

0.6

0.8

1.0

No

rma

lize

d E

ne

rgy

-De

lay SIMD- based Architecture AnySP

FFT 1024ptRadix-2

FFT 1024ptRadix-4

STBC LDPC H.264Intra

Prediction

H.264Deblocking

Filter

H.264Inverse

Transform

H.264Motion

Compenstation



PE

System

Total

Components Units

SIMD Data Mem (32KB)SIMD Register File (16x1024bit)SIMD ALUs, Multipliers, and SSN

SIMD Pipeline+Clock+Routing

Intra-processor InterconnectScalar/AGU Pipeline & Misc.

ARM (Cortex-M3)Global Scratchpad Memory (128KB)

Inter-processor Bus with DMA

90nm (1V @300MHz)

4444

44

111

AreaAreamm2

Area%

9.76 38.78%3.17 12.59%4.50 17.88%1.18 4.69%

0.94 3.73%1.22 4.85%

0.6 2.38%1.8 7.15%1.0 3.97%

25.17 100%

Est.65nm (0.9V @ 300MHz) 13.1445nm (0.8V @ 300MHz) 6.86

4G + H.264 DecoderPower

mWPower

%

102.88 7.24%299.00 21.05%448.51 31.58%233.60 16.45%

93.44 6.58%134.32 9.46%

2.5 <1%10 <1%

1.5 <1%

1347.03 100%

1091.09862.09

SIMD Buffer (128B)SIMD Adder Tree

44

0.82 3.25%0.18 <1%

84.09 5.92%10.43 <1%

AnySP Power Breakdown

• We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

20



Conclusions

• Scaling traditional SIMD for mobile applications► Wide-SIMD hardware under-utilized► Large fraction of power on non-computation

• AnySP design► Can possibly meet the requirements of 100Mbps 4G

and HD video on the same platform @45nm

• Flexibility/Efficiency improvements► Increase SIMD utilization (FFUs, multiple short vectors)► Reduce register file power (bypass buffer)► More efficient data shuffling (SRAM-based crossbar)

21



Questions

• For more information► http://cccp.eecs.umich.edu

high performance mobile computing using flexible wide simd processors

Documents