feeding the multicore beast: it’s all about the data!

38
IBM Research © 2008 Feeding the Multicore Beast: It’s All About the Data! Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept.

Upload: kineks

Post on 26-Jan-2016

42 views

Category:

Documents


0 download

DESCRIPTION

Feeding the Multicore Beast: It’s All About the Data!. Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept. Outline. History: Data challenge Motivation for multicore Implications for programmers How Cell addresses these implications Examples 2D/3D FFT - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 2008

Feeding theMulticore Beast:It’s All About the Data!

Michael PerroneIBM Master InventorMgr, Cell Solutions Dept.

Page 2: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 20082 [email protected]

Outline

History: Data challenge

Motivation for multicore

Implications for programmers

How Cell addresses these implications

Examples• 2D/3D FFT

– Medical Imaging, Petroleum, general HPC…

• Green’s Functions– Seismic Imaging (Petroleum)

• String Matching– Network Processing: DPI & Intrusion Detections

• Neural Networks– Finance

Page 3: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 20083 [email protected]

Chapter 1:

The Beast is Hungry!

Page 4: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 20084 [email protected]

The Hungry Beast

Processor(“beast”)

Data(“food”)

Data Pipe

Pipe too small = starved beast

Pipe big enough = well-fed beast

Pipe too big = wasted resources

Page 5: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 20085 [email protected]

The Hungry Beast

Processor(“beast”)

Data(“food”)

Data Pipe

Pipe too small = starved beast

Pipe big enough = well-fed beast

Pipe too big = wasted resources

If flops grow faster than pipe capacity…

… the beast gets hungrier!

Page 6: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 20086 [email protected]

Move the food closer

Example: Intel Tulsa– Xeon MP 7100 series

– 65nm, 349mm2, 2 Cores

– 3.4 GHz @ 150W

– ~54.4 SP GFlops

– http://www.intel.com/products/processor/xeon/index.htm

Large cache on chip

– ~50% of area

– Keeps data close for efficient access

If the data is local,the beast is happy!

– True for many algorithms

Page 7: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 20087 [email protected]

What happens if the beast is still hungry?

Data

Cache

If the data set doesn’t fit in cache

– Cache misses

– Memory latency exposed

– Performance degraded

Several important application classes don’t fit

– Graph searching algorithms

– Network security

– Natural language processing

– Bioinformatics

– Many HPC workloads

Page 8: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 20088 [email protected]

Make the food bowl larger

Data

Cache Cache size steadily increasing

Implications

– Chip real estate reserved for cache

– Less space on chip for computes

– More power required for fewer FLOPS

Page 9: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 20089 [email protected]

Make the food bowl larger

Data

Cache Cache size steadily increasing

Implications

– Chip real estate reserved for cache

– Less space on chip for computes

– More power required for fewer FLOPS

But…

– Important application working sets are growing faster

– Multicore even more demanding on cache than uni-core

Page 10: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200810 [email protected]

Chapter 2:

The Beast Has Babies

Page 11: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200811 [email protected]

Power Density – The fundamental problem

1

10

100

1000

1.5 1 0.7 0.5 0.35 0.25 0.18 0.13 0.1 0.07

i386i486

Pentium®

Pentium Pro ®

Pentium II ®Pentium III®

W/cm2

Hot Plate

Nuclear Reactor

Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32

Page 12: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200812 [email protected]

What’s causing the problem?

10S Tox=11AGate Stack

Gate dielectric approaching a fundamental limit (a few atomic layers)

0.010.110.001

0.01

0.1

1

10

100

1000

Gate Length (microns)

Active Power

Passive Power

1994 2004Po

wer

Den

sity

(W

/cm

2 )

65 nM

Gate Length (microns)

1 0.010.1

1000

100

10

1

0.1

0.01

0.001

Power, signal jitter, etc...

Page 13: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200813 [email protected]

1.0E+02

1.0E+03

1.0E+04

1990 1995 2000 2005 2010

Clo

ck S

pee

d (

MH

z)

Clock Speed

103

102

104

Diminishing Returns on FrequencyIn a power-constrained environment, chip clock speed yields diminishing

returns. The industry has moved to lower frequency multicore architectures.

Frequency-DrivenDesignPoints

Page 14: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200814 [email protected]

Power vs Performance Trade Offs

Relative Performance

0

1

2

3

4

5

Rel

ativ

e P

ower

1

1.45

1.3.85 1.7

We need to adapt our algorithms to get performance out of multicore

Page 15: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200815 [email protected]

Implications of Multicore

There are more mouths to feed

– Data movement will take center stage

Complexity of cores will stop increasing

… and has started to decrease in some cases

Complexity increases will center around communication

Assumption

– Achieving a significant % or peak performance is important

Page 16: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200816 [email protected]

Chapter 3:

The Proper Care and Feeding of Hungry Beasts

Page 17: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200817 [email protected]

Cell/B.E. Processor: 200GFLOPS (SP) @ ~70W

Page 18: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200818 [email protected]

Feeding the Cell Processor

8 SPEs each with

– LS

– MFC

– SXU

PPE

– OS functions

– Disk IO

– Network IO

16B/cycle (2x)16B/cycle

BIC

FlexIOTM

MIC

Dual XDRTM

16B/cycle

EIB (up to 96B/cycle)

16B/cycle

64-bit Power Architecture with VMX

PPE

SPE

LS

SXUSPU

MFC

PXUL1

PPU

16B/cycle

L232B/cycle

LS

SXUSPU

MFC

LS

SXUSPU

MFC

LS

SXUSPU

MFC

LS

SXUSPU

MFC

LS

SXUSPU

MFC

LS

SXUSPU

MFC

LS

SXUSPU

MFC

Page 19: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200819 [email protected]

Cell Approach: Feed the beast more efficiently Explicitly “orchestrate” the data flow between main

memory and each SPE’s local store

– Use SPE’s DMA engine to gather & scatter data between memory main memory and local store

– Enables detailed programmer control of data flow

• Get/Put data when & where you want it• Hides latency: Simultaneous reads, writes & computes

– Avoids restrictive HW cache management

• Unlikely to determine optimal data flow• Potentially very inefficient

– Allows more efficient use of the existing bandwidth

Page 20: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200820 [email protected]

Cell Approach: Feed the beast more efficiently Explicitly “orchestrate” the data flow between main memory

and each SPE’s local store

– Use SPE’s DMA engine to gather & scatter data between memory main memory and local store

– Enables detailed programmer control of data flow

• Get/Put data when & where you want it

• Hides latency: Simultaneous reads, writes & computes

– Avoids restrictive HW cache management

• Unlikely to determine optimal data flow

• Potentially very inefficient

– Allows more efficient use of the existing bandwidth

BOTTOM LINE:

It’s all about the data!

Page 21: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200821 [email protected]

Cell Comparison: ~4x the FLOPS @ ~½ the power Both 65nm technology

(to scale)

Page 22: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200822 [email protected]

Memory Managing Processor vs. Traditional General Purpose Processor

IBM

AMD

Intel

Cell

BE

Page 23: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200823 [email protected]

Examples of Feeding Cell

2D and 3D FFTs

Seismic Imaging

String Matching

Neural Networks (function approximation)

Page 24: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200824 [email protected]

Feeding FFTs to Cell

Buffer

Input Image

Transposed Image

Tile

Transposed Tile

Transposed Buffer

SIMDized data

DMAs double buffered

Pass 1: For each buffer• DMA Get buffer

• Do four 1D FFTs in SIMD

• Transpose tiles

• DMA Put buffer

Pass 2: For each buffer• DMA Get buffer

• Do four 1D FFTs in SIMD

• Transpose tiles

• DMA Put buffer

Page 25: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200825 [email protected]

3D FFTs

Long stride trashes cache

Cell DMA allows prefetch

Single Element Data envelope

Stride 1

StrideN2

N

Page 26: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200826 [email protected]

Feeding Seismic Imaging to Cell

(X,Y)

New G at each (x,y)

Radial symmetry of G reduces BW requirements

Data

Green’s Function

ij

jiyxGjyixD ),,,(),(

Page 27: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200827 [email protected]

Feeding Seismic Imaging to Cell Data

SPE 0 SPE 1 SPE 2 SPE 3 SPE 4 SPE 5 SPE 6 SPE 7

Page 28: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200828 [email protected]

Feeding Seismic Imaging to Cell Data

SPE 0 SPE 1 SPE 2 SPE 3 SPE 4 SPE 5 SPE 6 SPE 7

Page 29: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200829 [email protected]

Feeding Seismic Imaging to Cell

For each X

– Load next column of data

– Load next column of indices

– For each Y• Load Green’s functions• SIMDize Green’s functions • Compute convolution at

(X,Y)

– Cycle buffers

H

2R+1

1

Data bufferGreen’s Index buffer

(X,Y)

R

2

Page 30: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200830 [email protected]

Feeding String Matching to Cell

Find (lots of) substrings in (long) string

Build graph of words & represent as DFA

Problem: Graph doesn’t fit in LS

Sample Word List:

“the”“that”

“math”

Page 31: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200831 [email protected]

Feeding String Matching to Cell

Page 32: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200832 [email protected]

Hiding Main Memory Latency

Page 33: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200833 [email protected]

Software Multithreading

Page 34: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200834 [email protected]

Feeding Neural Networks to Cell

Neural net function F(X)

– RBF, MLP, KNN, etc.

If too big for LS, BW Bound

N Basis functions: dot product + nonlinearity

D Input dimensions

DxN Matrix of parameters

OutputF

X

Page 35: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200835 [email protected]

Convert BW Bound to Compute Bound

Split function over multiple SPEs

Avoids unnecessary memory traffic

Reduce compute time per SPE

Minimal merge overhead

Merge

Page 36: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200836 [email protected]

Moral of the Story:It’s All About the Data!

The data problem is growing: multicore

Intelligent software prefetching

– Use DMA engines

– Don’t rely on HW prefetching

Efficient data management

– Multibuffering: Hide the latency!

– BW utilization: Make every byte count!

– SIMDization: Make every vector count!

– Problem/data partitioning: Make every core work!

– Software multithreading: Keep every core busy!

Page 37: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200837 [email protected]

Backup

Page 38: Feeding the Multicore Beast: It’s All About the Data!

IBM Research

© 200838 [email protected]

Abstract

Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In a reaction to these restrictions, the industry has chosen the multicore processors path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications to programmers and how the Cell/B.E. processors design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.