Evolution of Thread-Level Parallelism in Desktop Applications

Geoffrey Blake*, Ronald G. Dreslinski*, Trevor Mudge*, Krisztián Flautner†

University of Michigan – Ann Arbor* ARM†

ISCA 2010, June 22, 2010


ACAL – University of Michigan 2

Introduction

2000
  Single-core machines common
  Clock speed steadily increasing – Intel predicts reaching 10 GHz by 2010
2005
  Consumer CMPs announced (from Intel/AMD)
  Aggressive clock speed increases halt
2010 and Future
  Multi-core machines common
  Core counts steadily increasing

[Images: Pentium 4, Core Duo, Core i7, Nehalem EX]

“If you build it, they will come”

-- “Field of Dreams” 1989


Motivation

Server and scientific workloads are already parallel
What about desktop/laptop workloads?
Flautner et al. at ASPLOS-IX studied interactive desktop workloads
  Almost all applications behaved as single-threaded
  1 extra core helped system responsiveness
Multiprocessor desktop/laptop systems are now common
Has desktop/laptop software followed?

[Figure: performance vs. # of cores, with curves labeled "Ideal Scaling", "Server/Scientific Scaling", and "Desktop Scaling?"]


Motivation

Over the past 5 years of ISCA, server and scientific workload research has dominated
The market shows the opposite
Is this the correct domain to invest disproportionate effort in?
Is there a trickle-down effect?

*source IDC and Gartner 2009

*source ISCA ‘05-’09

[Figure: bar chart of percent distribution (0 to 100) of Research (Papers) and Market ($) for Scientific/Server vs. Desktop/Laptop]


Metrics

Replicate contemporary experiments of previous work
Measure “Thread Level Parallelism” (TLP) instead of utilization
  ci = fraction of time i CPUs are doing work
  c0 = idle (0 CPUs are doing work)
  c1 = 1 CPU doing work, c2 = 2 CPUs doing work
Example: c0 = 0.5, c1 = 0.25, c2 = 0.25
  Util = 0.25 * 1 + 0.25 * 2 = 0.75
  TLP = (0.25 * 1 + 0.25 * 2) / (1 – 0.5) = 1.5
TLP is a measure of how efficiently the system is using parallel resources when work needs to be done

*[Flautner et al. ASPLOS’00]
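The worked example on this slide can be reproduced with a few lines of Python. This is only a sketch of the arithmetic shown above: the function names `busy_cpus` and `tlp` are illustrative, and the generalization to n CPUs (sum of i * ci over i >= 1, divided by 1 - c0) is assumed from the two-CPU example.

```python
from typing import Sequence


def busy_cpus(c: Sequence[float]) -> float:
    """Average number of busy CPUs: sum of i * c[i] (the slide's 'Util' figure)."""
    return sum(i * ci for i, ci in enumerate(c))


def tlp(c: Sequence[float]) -> float:
    """Average number of busy CPUs, counted only over non-idle time."""
    non_idle = 1.0 - c[0]
    if non_idle <= 0.0:
        raise ValueError("system was idle for the whole interval; TLP is undefined")
    return busy_cpus(c) / non_idle


# c[i] = fraction of time exactly i CPUs were doing work (values from the slide)
c = [0.5, 0.25, 0.25]
print(busy_cpus(c))  # 0.75
print(tlp(c))        # 1.5
```

Dividing by the non-idle fraction is what separates TLP from utilization: an application that is mostly waiting for input is not penalized for the idle time.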


Test Systems

2009 Mac Pro (fast, high-end)
  2x 2.26GHz Intel Xeon E5520 (8 cores, 16 hardware threads)
  NVIDIA GTX 285/GT120 (240/32 CUDA cores)
  Mac OS 10.6 + Windows 7 x64
ASUS ASRock (slow, cheap)
  1x 1.6GHz Intel Atom 330 (2 cores, 4 hardware threads)
  NVIDIA ION (16 CUDA cores)
  Windows 7 x64

Machines were chosen to measure the effect of system speed on TLP.
A fast system may allow the OS to schedule tasks on the same core all the time.


Measurement Infrastructure

Developed system-wide monitoring programs for both client OSes
Mac OS X 10.6
  DTrace to track thread context switches
  I/O Kit probing of GPU driver for GPU utilization
Windows 7
  Event Tracing for Windows (ETW) to track thread context switches
  NVPerfKit SDK for GPU utilization
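To illustrate how a context-switch trace could be turned into the ci fractions used by the TLP metric, here is a minimal sketch. The event format is hypothetical, not the schema of the authors' DTrace or ETW tools: each event is assumed to be `(timestamp_seconds, cpu_id, is_busy)`, where `is_busy` is False when a CPU switches to its idle thread. The function name `concurrency_histogram` is also illustrative.

```python
from typing import Iterable, List, Tuple

# Hypothetical event format (an assumption, not the authors' trace schema):
# (timestamp_seconds, cpu_id, is_busy)
Event = Tuple[float, int, bool]


def concurrency_histogram(events: Iterable[Event], n_cpus: int, end_time: float) -> List[float]:
    """Return c[0..n_cpus], where c[i] is the fraction of time exactly i CPUs were busy.

    Assumes every CPU is idle before its first event and that no event
    is later than end_time.
    """
    ordered = sorted(events)
    if not ordered:
        raise ValueError("no events in trace")
    start = ordered[0][0]
    if end_time <= start:
        raise ValueError("end_time must be after the first event")

    busy = [False] * n_cpus
    time_at_level = [0.0] * (n_cpus + 1)
    prev = start
    for ts, cpu, is_busy in ordered:
        # Charge the elapsed interval to the concurrency level that held during it.
        time_at_level[sum(busy)] += ts - prev
        busy[cpu] = is_busy
        prev = ts
    time_at_level[sum(busy)] += end_time - prev

    total = end_time - start
    return [t / total for t in time_at_level]


# Two CPUs: CPU0 busy from 0 s to 3 s, CPU1 busy from 1 s to 2 s, trace ends at 4 s.
trace = [(0.0, 0, True), (1.0, 1, True), (2.0, 1, False), (3.0, 0, False)]
print(concurrency_histogram(trace, n_cpus=2, end_time=4.0))  # [0.25, 0.5, 0.25]
```

A real tracer would also have to handle events lost under load and threads already running when tracing starts; the sketch ignores both.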


Benchmarking

Tested software in six categories
  Games
  Image Authoring
  Office Productivity
  Multimedia Playback
  Video Authoring/CUDA-enabled video authoring
  Web Browsing
Used detailed task sets and input parameters
Tests performed by user – results fully reproducible, low variance
Test length was 5 minutes or more
Details, input sets and tracing tools can be found here:

http://itlpbench.eecs.umich.edu


Are threads used?

Benchmark        Threads Created   Avg. Threads Alive
Handbrake 0.9    22511             24
Call of Duty 4   77                44
Photoshop CS4    82                75
Adobe Reader 9   239               24
Quicktime-HD     53                52
Firefox 3.5      522               38

Many threads created
Many threads alive and visible to the OS during runtime


10 Year Comparison

[Figure: bar chart of TLP (0 to 9) for applications from 2000 and 2010, grouped into Games, Image Authoring, Office, Playback, Video Authoring, and Web]

Requires high performance, progress made in 10 years

Requires high performance, little progress in 10 years

Elsewhere, very little progress in 10 years


Overall TLP Results – Xeon Windows 7

8 Core SMT - System Wide TLP
[Heat map: percent of time (0% to 100% color scale) with 0 (idle) through 16 hardware contexts active, per application]

Category          Application        TLP   AVG
Game              Bioshock           1.6   1.7
                  Call of Duty 4     2.1
                  Crysis             1.4
Image Authoring   Maya3D 2010        2.4   2.2
                  Photoshop CS4      2.0
Office            Adobe Reader 9     1.3   1.2
                  Excel 2007         1.2
                  PowerPoint 2007    1.2
                  Streets 2010       1.4
                  Word 2007          1.2
Playback          iTunes 9           1.3   1.6
                  Quicktime          1.3
                  QuicktimeHD        2.1
CUDA              Badaboom           1.3   2.2
                  PowerDirector v9   3.2
Video Authoring   Handbrake          8.4   6.6
                  PowerDirector v9   4.8
Web Browsing      Firefox 3.5*       1.5   1.6
                  Safari 4.0*        1.6


Overall TLP Results – Atom Windows 7

System TLP
[Heat map: percent of time (0% to 100% color scale) with 0 (idle) through 4 hardware contexts active, per application]

Category          Application        TLP   AVG
Game              Bioshock           2.2   2.0
                  Call of Duty 4     2.1
                  Crysis             1.8
Image Authoring   Maya3D 2010        2.1   1.9
                  Photoshop CS4      1.6
Office            Adobe Reader 9     1.5   1.4
                  Excel 2007         1.3
                  PowerPoint 2007    1.4
                  Streets 2010       1.6
                  Word 2007          1.3
Playback          iTunes 9           1.5   2.1
                  Quicktime          2.1
                  Quicktime HD       2.6
CUDA              Badaboom           1.3   2.0
                  PowerDirector v8   2.8
Video Authoring   Handbrake          3.8   3.6
                  PowerDirector v8   3.3
Web Browsing      Firefox 3.5*       1.6   1.7
                  Safari 4.0*        1.8


Call of Duty 4: TLP vs Time – Xeon Windows 7

[Figure: active/idle state of each hardware context (CPU0 through CPU15) against Time (3s)]

75 threads spawned during execution, average of 44 live threads at any time

[Hauser et al., SIGOPS ‘93]


TLP vs Time – Atom Windows 7

[Figure: two panels, Call of Duty 4 and Firefox 3.5, each showing activity of CPU0 through CPU3 against Time (3s)]

522 threads spawned, average of 38 live threads at any time


Photoshop CS4 – TLP vs Time (Xeon)


GPU Measurements - Throughput

[Figure: transcode rate (fps, 0 to 100) vs. number of cores (1, 2, 4, 8) for Badaboom GTX285, Badaboom GT120, Badaboom ION, Handbrake OS X SMT, and Handbrake Atom]

[Lee et al., ISCA’10]


Conclusions

Little change in TLP over ten years
Single-thread speed has little impact on TLP
Single-thread performance is still important
Specific applications do take advantage of resources
Large amounts of silicon are underutilized
Debatable whether aggressively increasing core count is the correct direction
  Can desktop applications be parallelized effectively?
  Would architecture specialization be more beneficial?


Future Directions and Work

Categorizing thread use in desktop applications
Better performance metrics
Perform critical path analysis
Detailed characterization of instruction stream


Questions

?


Backup Slides


Motivation

The mobile market is even larger:
  ~1.2 billion units shipped in 2009
  174.1 million units were smartphones
Source: IDC


Overall Results – Fast System


Overall Results – Slow System


Web Browser Results Detail


Firefox 3.5: TLP vs Time (Xeon)

[Figure: active/idle state of each hardware context against Time (3s)]


TLP vs Time – PowerDirector


TLP vs Time – Handbrake


TLP – OS Comparison

Small differences between Windows and OS X
Applications written originally for a particular platform perform better than the port

[Figure: bar chart of TLP (0 to 10) under Windows and OS X for Call of Duty 4, Maya3D 2010, Photoshop CS4, Adobe Reader 9, Excel, PowerPoint, Word, iTunes 9, Quicktime, Quicktime - HD, Handbrake 0.9, Firefox 3.5, and Safari 4.0, grouped as Game, Image, Office, Video, Web]


Discussion - Atom

Reduced performance cores do not appreciably increase TLP
Lack of TLP appears more due to software design
Single-thread performance is still important for desktop/laptop
Many slow cores over few fast cores may be a bad fit for the desktop/laptop space


Overall TLP Results - Xeon

8 Core SMT - System Wide TLP. Columns 0 (idle) through 16 give the percent of time with that many hardware contexts active.

Application             0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   TLP     σ  GPU Util
Game
  Bioshock             1%  57%  31%   7%   1%   2%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.6  0.05   75%
  Call of Duty 4       0%  12%  35%  20%  14%   8%   7%   2%   1%   0%   0%   0%   0%   0%   0%   0%   0%   2.1  0.21   86%
  Crysis               1%  72%  23%   4%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.4  0.07   84%
Image Authoring
  Maya3D 2010         55%  34%   6%   1%   0%   0%   0%   0%   1%   0%   0%   0%   0%   0%   0%   0%   2%   2.4  0.53   18%
  Photoshop CS4       43%  43%   7%   1%   0%   0%   0%   0%   3%   0%   0%   0%   0%   0%   0%   0%   1%   2.0  0.56   17%
Office
  Adobe Reader 9      65%  25%   8%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.05   23%
  Excel 2007          72%  23%   4%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.02   10%
  PowerPoint 2007     69%  25%   5%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.03   16%
  Streets 2010        68%  23%   7%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.4  0.01   14%
  Word 2007           74%  22%   4%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.04   16%
Playback
  iTunes 9            71%  23%   5%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.16   22%
  Quicktime           50%  38%  10%   2%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.01   43%
  QuicktimeHD         66%  22%   5%   1%   1%   0%   0%   0%   3%   0%   0%   0%   0%   0%   0%   0%   0%   2.1  0.06   40%
CUDA
  Badaboom            54%  35%   9%   2%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.03   95%
  PowerDirector v9    42%  20%  12%   6%   5%   4%   3%   2%   3%   1%   0%   0%   0%   0%   0%   0%   0%   3.2  0.52   28%
Video Authoring
  Handbrake            1%   0%   0%   0%   1%   3%   9%  17%  22%  20%  14%   8%   4%   1%   0%   0%   0%   8.4  0.02    8%
  PowerDirector v9    27%  20%  11%   6%   4%   3%   2%   6%   8%   5%   5%   3%   0%   0%   0%   0%   0%   4.8  0.15   18%
Web Browsing
  Firefox 3.5*        66%  24%   6%   1%  ###  ###  ###  ###   1%  ###  ###  ###  ###  ###  ###  ###  ###   1.5  0.05   24%
  Safari 4.0*         50%  34%  11%   3%   1%  ###  ###  ###   1%  ###  ###  ###  ###  ###  ###  ###  ###   1.6  0.06   24%


Overall TLP Results - Atom

System TLP. Columns 0 (idle) through 4 give the percent of time with that many hardware contexts active.

Application             0    1    2    3    4   TLP     σ
Game
  Bioshock             2%  25%  35%  27%  11%   2.2  0.04
  Call of Duty 4       1%  37%  27%  26%  10%   2.1  0.05
  Crysis               0%  39%  44%  14%   2%   1.8  0.02
Image Authoring
  Maya3D 2010         20%  41%  13%   5%  21%   2.1  0.05
  Photoshop CS4        8%  59%  17%   6%  10%   1.6  0.11
Office
  Adobe Reader 9      40%  36%  19%   5%   1%   1.5  0.03
  Excel 2007          45%  40%  12%   2%   0%   1.3  0.01
  PowerPoint 2007     38%  42%  16%   4%   1%   1.4  0.01
  Streets 2010        39%  34%  20%   5%   2%   1.6  0.02
  Word 2007           35%  49%  13%   3%   0%   1.3  0.01
Playback
  iTunes 9            24%  45%  24%   6%   1%   1.5  0.09
  Quicktime            4%  28%  39%  23%   6%   2.1  0.10
  Quicktime HD        11%  19%  22%  22%  26%   2.6  0.01
CUDA
  Badaboom            68%  23%   9%   1%   0%   1.3  0.04
  PowerDirector v8     8%  17%  20%  23%  32%   2.8  0.07
Video Authoring
  Handbrake            0%   0%   2%  10%  88%   3.8  0.04
  PowerDirector v8     3%   8%  11%  19%  58%   3.3  0.04
Web Browsing
  Firefox 3.5*        25%  42%  19%   9%   5%   1.6  0.04
  Safari 4.0*         23%  35%  21%  12%   8%   1.8  0.08
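As a sanity check on how the TLP column relates to the per-level percentages, the metric defined earlier in the deck can be recomputed from a few rows of the Atom table. This is a sketch, not the authors' tooling: `tlp_from_percent` is an illustrative name, and because the table's percentages are rounded to whole numbers, the recomputed values only approximate the reported ones.

```python
from typing import Dict, List, Tuple


def tlp_from_percent(row: List[float]) -> float:
    """row[i] is the percent of time exactly i hardware contexts were busy."""
    idle = row[0] / 100.0
    if idle >= 1.0:
        raise ValueError("row is fully idle; TLP is undefined")
    busy = sum(i * p / 100.0 for i, p in enumerate(row))
    return busy / (1.0 - idle)


# Rows copied from the Atom table: (percentages for 0..4 contexts, reported TLP)
ATOM_ROWS: Dict[str, Tuple[List[float], float]] = {
    "Bioshock": ([2, 25, 35, 27, 11], 2.2),
    "Crysis": ([0, 39, 44, 14, 2], 1.8),
    "Quicktime HD": ([11, 19, 22, 22, 26], 2.6),
}

for name, (row, reported) in ATOM_ROWS.items():
    print(f"{name}: recomputed {tlp_from_percent(row):.2f}, reported {reported}")
```

For these three rows the recomputed TLP lands within 0.1 of the reported value, which is consistent with rounding in the published percentages.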


Discussion

Many threads, but few used concurrently
Lack of concurrency appears due to software design issues
Single-thread performance is still important
Underutilized GPU may offer additional opportunities
Unlikely that programmers will quickly take advantage of multi-cores
Focus on desktop/laptop applications should be greater
  Understood programs like video transcoding are already parallel
  Others, like web browsers, use only 1 – 2 cores