keynote (dr chien-ping lu) - how many cores will we need? - by dr chien-ping lu, sr director –...

HOW MANY CORES WILL WE NEED? IN SEARCH OF PARALLEL KILLER APPS

CHIEN-PING LU, PHD, MEDIATEK INC

| HOW MANY CORES WILL WE NEED? | DECEMBER 11, 2013 | CONFIDENTIAL 2

A GROUP OF HIPPOS IS CALLED …

A Crash


A GROUP OF CROWS IS CALLED …

A Murder


A GROUP OF GIRAFFES IS CALLED …

A Tower

From Wikipedia


SO, IT IS NOT SURPRISING THAT WE USE

“A Parade” of elephants, “An Army” of ants, “A Herd” of sheep


FROM FREQUENCY TO MULTICORE SCALING

[Figure: performance versus time, with single-core scaling giving way to multi-core scaling at the power wall (2005); other axis labels: power, frequency]


IT SEEMS INEVITABLE THAT WE WILL NEED A MASSIVE NUMBER OF CORES

[Figure: performance versus time; regions labeled Moderate and Massive]


[Figure: performance versus time; labels: 2x, 4x, 3x, 8x, 4x, 16x, 4x]

DARK SILICON (OR DARK CORES)?


HOW TO LIGHT UP THE CORES?

[Figure: power versus degree of parallelism, bounded by a power ceiling and a parallelism wall; regions for big cores, little cores and SIMT “cores”]

Redefine the cores to be heterogeneous

Search for parallel killer apps

H.264 encoding, ray tracing


[Diagram: three clusters of ALUs, each cluster sharing one front end]

ARMY OF ANTS: SIMT CORES FOR SIMT (SINGLE-INSTRUCTION-MULTIPLE-THREAD) EXECUTION

A branch is emulated through divergence

SIMT is the execution model of HSA and is implemented in modern GPUs, combining MIMD flexibility with SIMD efficiency

A cluster of SIMT cores shares one front end in a SIMD manner

Parallel.For (…)
    If (…) then
        …
    Else
        …

A SIMT core runs 1 iteration of the parallel loop
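As a rough illustration of how a branch can be emulated through divergence, the sketch below simulates a cluster of lanes that share one front end. It is a minimal model, not a description of any real GPU: the function name `simt_execute`, the per-lane mask, and the sample data are all assumptions made for this example. Each branch path is issued once for the whole cluster, and lanes on the other path sit idle.

```python
def simt_execute(inputs, cond, then_op, else_op):
    """Run one if/else over all lanes the way a SIMT cluster might.

    Hypothetical model: the shared front end issues each branch path once,
    and a per-lane mask decides which lanes commit a result on that pass.
    """
    mask = [cond(x) for x in inputs]      # every lane evaluates the condition
    results = [None] * len(inputs)
    passes = 0                            # how many paths the front end issued

    # Pass 1: the "then" path, issued only if at least one lane takes it.
    if any(mask):
        passes += 1
        for lane, x in enumerate(inputs):
            if mask[lane]:                # lanes with a False mask sit idle
                results[lane] = then_op(x)

    # Pass 2: the "else" path, with the mask inverted.
    if not all(mask):
        passes += 1
        for lane, x in enumerate(inputs):
            if not mask[lane]:
                results[lane] = else_op(x)

    return results, passes


data = [1, 2, 3, 4, 5, 6, 7, 8]

# Lanes disagree on the condition: both paths are issued (divergence).
divergent, divergent_passes = simt_execute(
    data, lambda x: x % 2 == 0, lambda x: x * 10, lambda x: -x)

# All lanes agree: only one path is issued.
uniform, uniform_passes = simt_execute(
    data, lambda x: x > 0, lambda x: x * 10, lambda x: -x)

print(divergent, divergent_passes)
print(uniform, uniform_passes)
```

In this model the cost of a branch is the number of paths issued, which is why a cluster whose lanes agree on the condition runs at close to SIMD efficiency, while a divergent cluster pays for both paths.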

[Diagram: Specialized Processing Engines (SPE) alongside the ALUs; wider SIMT]


MASSIVELY PARALLEL WORKLOADS

• The problem size N can keep growing

• The visible serial workload s can be kept constant

• The parallel workload is sped up by P, the number of cores

• The reduction overhead is proportional to log P (by a factor of r)

• The workload is "embarrassingly" parallel when there is no reduction overhead (r = 0)

[Diagram: one core takes s + N; P cores take s + N/P + r log P; the difference is the time saved by P cores]


[Figure, three panels: speedup versus degree of parallelism (P), log-log axes with P = 1, 2, 4, ..., 8192, for s = 50%, r = 50%; curves for N = 16, N = 64 and N = 256. The second and third panels add the curve P = N, and the third extends the speedup axis to 10,000.]

REVISITING AMDAHL'S LAW

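When the problem size grows with the number of cores (N = P):

$$\text{Speedup} = \frac{s + P}{s + r \log P + 1}$$

For a general problem size N:

$$\text{Speedup} = \frac{s + N}{s + r \log P + N/P}$$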
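The speedup model on this slide can be explored numerically. The sketch below evaluates Speedup = (s + N) / (s + r log P + N/P) for the problem sizes shown in the charts. Two things are assumptions rather than facts from the slides: the logarithm is taken as base 2, and "s = 50%, r = 50%" is read as s = 0.5 and r = 0.5 in the same units as N.

```python
import math


def speedup(N, P, s=0.5, r=0.5):
    """Speedup of P cores over 1 core for problem size N.

    One core takes s + N; P cores take s + r*log2(P) + N/P.
    Base-2 logarithm and the default s, r values are assumptions.
    """
    return (s + N) / (s + r * math.log2(P) + N / P)


cores = (1, 4, 16, 64, 256, 1024)

# Fixed problem sizes: speedup flattens once P approaches N.
for N in (16, 64, 256):
    print(f"N={N:>3}:", [round(speedup(N, P), 1) for P in cores])

# Problem size grows with the number of cores (P = N).
print("P=N  :", [round(speedup(P, P), 1) for P in cores])
```

With a fixed N the N/P term shrinks while r log P keeps growing, so adding cores eventually stops paying off. Letting N grow with P keeps the cores busy, which is the argument for looking for ever-bigger problems.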


GRAPHICS KEEP MOVING

Pac-Man, 1980: highest-grossing video game of all time, recognized by 94% of American consumers

Mobile 3D graphics:

GLBenchmark 2.1 Egypt

GLBenchmark 2.5 Egypt

GFXBench 2.7 T-Rex

GFXBench 3.0 Manhattan


MEDIATEK FACE BEAUTIFICATION: WHEN IT COMES TO BEAUTY, THERE SEEMS TO BE NO LIMIT

Before; skin tone adjustment; wrinkle removal; thinner face, bigger eyes


HPC from 1993 to 2012

‒GFLOPS ~ 130,000x

‒Cores ~ 11,000x

‒GHz ~ 10x
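As a back-of-the-envelope check on these ratios (a sketch that assumes peak GFLOPS scales roughly as cores × clock frequency × work per core per cycle, and uses only the approximate figures quoted above):

$$11{,}000 \times 10 = 110{,}000, \qquad \frac{130{,}000}{110{,}000} \approx 1.2$$

Under that assumption, core count and frequency together account for nearly all of the growth, with core count contributing by far the larger share.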

HIGH-PERFORMANCE COMPUTING (HPC) KEEPS SCALING OUT

Higher grid resolution

More time steps

More atoms

[Figure: Top of Top500, 1993-2012: GFLOPS, cores and GHz relative to 1993, on a log scale from 1 to 1,000,000; x-axis 1990 to 2015]


Higher frequency

THE MISSING LINKS

Moore’s law

Bigger problems

Bigger data

Better user experience

More cores

IN SEARCH OF PARALLEL KILLER APPS

More complex software

What bigger problems to solve with bigger data?

How does solving bigger problems lead to a better user experience?

Mining bigger data with Machine Learning


MACHINE LEARNING: TREND PREDICTION WITH POWERFUL MODELS

Powerful models (with many knobs) tend to over-fit the noise if the data set is not sufficiently large

The explosive growth of data has made powerful models feasible

In the Google Brain experiment, a model with 1 billion knobs, trained on 10 million images from YouTube, figured out the concepts of cats and human faces by itself

[Figure: sample data with linear, 2nd-order polynomial and 6th-order polynomial fits; x-axis 0 to 6, y-axis -50 to 350]

The 6th-order polynomial undulates excessively with only 4 samples

Source: Le et al., Building High-level Features Using Large Scale Unsupervised Learning
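The over-fitting effect described on this slide can be reproduced with a small sketch. It does not reproduce the slide's chart: the four samples are made-up values scattered around y = 2x + 1, and a cubic that passes exactly through all four samples stands in for the 6th-order polynomial, since four samples cannot pin down seven coefficients. The helper names `fit_line` and `fit_interpolating_poly` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line (2 knobs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_interpolating_poly(xs, ys):
    """Lagrange polynomial through every sample (as many knobs as samples)."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return poly


# Hypothetical noisy samples of a trend close to y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1.0, 2.9, 5.2, 6.8]

line = fit_line(xs, ys)
poly = fit_interpolating_poly(xs, ys)

# The polynomial matches the samples exactly, noise included ...
print([round(poly(x) - y, 6) for x, y in zip(xs, ys)])

# ... but predicts the trend at x = 6 far worse than the simple line.
trend_at_6 = 2 * 6 + 1
print("line error:", abs(line(6) - trend_at_6))
print("poly error:", abs(poly(6) - trend_at_6))
```

The model with more knobs has zero error on the samples it was given, yet it has fitted the noise rather than the trend. More samples are what make the more powerful model usable.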


HOW TO DISTINGUISH CATS FROM DOGS?

ASIRRA: Animal Species Image Recognition for Restricting Access (from Microsoft Research)


CAN ASIRRA BE CRACKED?


WHY IS IT HARD?

Source: training set of Kaggle.com Dogs vs. Cats competition


IS THERE A MODEL THAT CAN FIGURE OUT THAT THESE ARE THE SAME DOG?

Prancer, a 5-year-old toy poodle, before and after grooming


MINE THE SOLUTIONS FROM THE DATA

Dog-Cat classifier

Theory of the differences between dogs and cats?

Learn from many (12,500) photos labeled as dogs or cats

Machine Learning


SMART AND SMARTER CLIENTS IN THE ERA OF BIG DATA

[Diagram labels. Smart client: Client, Sensing, Connectivity, Cloud, Input data, Big Data, Big Training Set, Machine Learning, Powerful Model, Answer. Smarter client: Smarter Client, Better Sensing, Better Connectivity, Local Machine Learning, Bigger Data, Bigger Training Set, Bigger Machine Learning, Bigger Model (in the cloud or the clients), Better Answer.]


PARALLEL COMPUTING IN THE CLOUD AND AT THE CLIENTS

Samples: $(x_n, y_n)$

Knobs: $a_i$

Model: $y = f(x; a_i)$

Machine Learning: tweak the knobs $a_i$ to minimize the error between $y_n$ and $f(x_n; a_i)$

Examples:

dog/cat photos → dog or cat

Sensor readings → jogging, walking or driving

Cloud: parallel computing with more samples

Client: parallel computing with more knobs
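To make "tweak the knobs to minimize the error" concrete, here is a minimal sketch of training by gradient descent, with the samples split into shards as a stand-in for parallel workers. Everything specific in it is an assumption for illustration: the two-knob linear model, the squared-error measure, the learning rate, and the names `model`, `partial_gradient` and `train`. The shards are processed one after another here; on real hardware each shard's partial gradient could be computed on a separate core and then combined in a reduction step.

```python
def model(x, a):
    """Hypothetical model f(x; a) with two knobs: a[0] + a[1] * x."""
    return a[0] + a[1] * x


def partial_gradient(shard, a):
    """Gradient of the summed squared error over one shard of samples."""
    g0 = g1 = 0.0
    for x, y in shard:
        err = model(x, a) - y
        g0 += 2 * err
        g1 += 2 * err * x
    return g0, g1


def train(samples, workers=4, steps=2000, lr=0.02):
    """Tweak the knobs to minimize the mean squared error on the samples."""
    a = [0.0, 0.0]
    # Each worker gets its own shard; more samples means more parallel work.
    shards = [samples[w::workers] for w in range(workers)]
    n = len(samples)
    for _ in range(steps):
        # Independent per-shard work (could run on separate cores).
        partials = [partial_gradient(shard, a) for shard in shards]
        # Reduction: combine the partial gradients into one update.
        g0 = sum(p[0] for p in partials) / n
        g1 = sum(p[1] for p in partials) / n
        a[0] -= lr * g0
        a[1] -= lr * g1
    return a


# Hypothetical samples generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in range(8)]
knobs = train(samples)
print([round(k, 3) for k in knobs])
```

Adding samples makes each shard bigger or adds shards, which is the parallelism with more samples on the cloud side. A model with more knobs makes each evaluation of the model itself wider, which is the parallelism available when running the model on the client.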


Machine learning happens in the cloud and at the clients

Models run in the cloud or at the clients

Need the same ease of programming and write-once-run-everywhere for heterogeneous cores

WHY HSA?

MediaTek is one of the co-founders of the HSA Foundation

MediaTek is the first to introduce in mobile SoCs:

True Octa-Core

Heterogeneous Multiprocessing (HMP)


• Carbon footprint of US datacenters is at the same level as the airline industry

• A 1,000 m² datacenter draws 1.5 MW, enough to power 1,000 US homes

In order to scale out, we need to scale in with heterogeneous cores in the cloud and in our palms

1,000 typical US homes

SCALE OUT AND SCALE IN WITH HETEROGENEOUS CORES

• Both the cloud and mobile clients are limited by power

• Mobile devices need to keep cool in our palms

• Data centers need to keep our environment clean

BACKUP


THE NEW VIRTUOUS CYCLE

Moore’s law and beyond

Bigger data

Better user experience

More heterogeneous cores

Mining bigger data with Machine Learning

PERHAPS, LEADING TO COMPUTING LIKE OUR BRAIN


MASSIVELY PARALLEL WORKLOADS

• The problem size N can keep growing

• The serial workload s can be kept constant

• The parallel workload is sped up by P, the number of cores

• The reduction overhead is proportional to log P (by a factor of r)

• The workload is "embarrassingly" parallel when there is no reduction overhead (r = 0)

[Diagram: one core takes s + N; P cores take s + N/P + r log P; the difference is the time saved by P cores]


[Diagram: six CPU cores, each with its own front end and ALU]

THE ELEPHANTS: CPU CORES FOR MULTIPLE-INSTRUCTION-MULTIPLE-DATA (MIMD) EXECUTION

A CPU core runs 1 iteration of the parallel loop

The same color means the same piece of code


Retrofitted for moderately parallel workloads, and not very efficient for massively parallel workloads

Parallel.For (i)
    If (…)
        …
    Else
        …


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
